Changes between Version 4 and Version 5 of partnerLIP


Timestamp: Jan 5, 2012, 1:18:28 PM
Author: pfeautri

=== Paul's Own Thoughts on Process Networks on Chip ===
==== Landscape ====

Due to technological problems (quantum effects, power dissipation),
the clock frequency of present-day digital circuits seems to have
reached an upper bound around 3 GHz. However, gate density is still
increasing, albeit not as fast as it did before the millennium.
Increases in processing power, which are required by new applications,
can only be obtained through increased parallelism. Reaching the
exascale level (10^18 floating-point operations per second) will need
billions of processors. Embedded and mobile applications, while less
greedy, will run on thousands of processors, as exemplified by the
Intel MIC (1000 cores) or the Kalray chip (from 256 to 1024 processors
in the near future).

Increasing the number of processors does not mechanically result in
increased processing power: there are many bottlenecks or walls to be
breached before such architectures can be used efficiently, among them
the power wall, the memory and pin wall, the communication bottleneck,
and the most important one, the programmer wall.

==== Memory ====

Each processor consumes a few data words per cycle. These words can
come from registers, from several levels of local memory (caches or
scratchpads), from global memory, or from permanent storage. Since the
efficiency of a memory hierarchy cannot be made arbitrarily large, the
number of memory banks will be a fraction of the number of processors.
Since each processor chip can be connected only to a small number of
memory chips (the pin wall), the system will necessarily have
non-uniform memory access latency. A purely dynamic replacement policy,
as found for instance in present-day caches and virtual memory, may not
scale up to the necessary sizes. Besides, such a policy forbids
accurate execution time prediction and is unsuitable for real-time
applications. In the same way, it seems that dynamic coherence
protocols will not scale: the required data movements must be
orchestrated by the application software, possibly with the help of the
compiler.
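
Software-orchestrated data movement often takes the form of double
buffering: while the processor computes on one tile held in local
memory, the next tile is fetched from global memory. Here is a minimal
Python sketch of the idea (the tile size and the plain slice copies are
stand-ins; a real implementation would issue asynchronous DMA transfers
into a scratchpad):

{{{#!python
# Double buffering: overlap the fetch of tile i+1 with the computation
# on tile i. On a real chip the copy would be an asynchronous DMA into
# a scratchpad; here it is modeled by a plain slice copy.

TILE = 4  # hypothetical tile size, chosen to fit the local memory

def compute(tile):
    # stand-in for the real computation on one tile
    return sum(x * x for x in tile)

def process(global_memory):
    n_tiles = len(global_memory) // TILE
    results = []
    scratch = global_memory[0:TILE]          # prefetch the first tile
    for i in range(n_tiles):
        if i + 1 < n_tiles:                  # issue the next fetch early,
            nxt = global_memory[(i + 1) * TILE:(i + 2) * TILE]
        results.append(compute(scratch))     # ...so it can overlap this
        if i + 1 < n_tiles:
            scratch = nxt                    # swap buffers
    return results

print(process(list(range(16))))              # [14, 126, 366, 734]
}}}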

==== Control and Parallelism ====

Conditional jumps and parallelism have never fit well together. This is
true, for instance:
- for superscalar and pipelined processors, which have to stall
  instruction issue while waiting for the outcome of a test, or resort
  to heuristics like speculation, branch prediction and delay slots;
- for SIMD systems, in which all processors execute the same code, some
  of them being prevented from modifying memory by an activity bit set
  to the outcome of a test (see the mask-vector sketch after this list);
- for vector processors and GPUs, which use variants of this scheme
  (mask vectors, thread serialization);
- in hardware, where there are two solutions:
  * start the test and both branches at the same time, and feed both
    results to a multiplexer controlled by the result of the test
    (efficient if the test and the two branches are well balanced, but
    costly in silicon area);
  * use the result of the test to select the next state of a finite
    state machine (same problem as for ordinary processors: the test is
    in the critical path; however, it is easier to anticipate if
    dependences permit).
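
A minimal sketch of the activity-bit / mask-vector mechanism, written
here with plain Python lists rather than real SIMD hardware: every lane
evaluates both branches of the conditional, and the mask decides which
result each lane is allowed to commit.

{{{#!python
# Predicated execution on a SIMD machine: all lanes execute both
# branches; the activity bit (mask) selects the value that is committed.

def simd_if(xs):
    mask = [x > 0 for x in xs]           # per-lane outcome of the test
    then_branch = [x * 2 for x in xs]    # executed by ALL lanes
    else_branch = [-x for x in xs]       # executed by ALL lanes too
    # each lane commits the result selected by its activity bit
    return [t if m else e
            for m, t, e in zip(mask, then_branch, else_branch)]

print(simd_if([3, -1, 0, 5]))            # [6, 1, 0, 10]
}}}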

It is likely that introducing control into dataflow systems will be at
least as difficult as for ordinary processors. Observe that there is no
problem introducing tests in dataflow *languages*; but the net result
is that one loses almost all interesting properties, like static
scheduling, deadlock detection, channel bounds, and even determinism if
one relaxes the "only one writer and one reader per channel" constraint.

A suggestion: start from the Communicating Regular Processes (CRP)
model and introduce control progressively. Control in purely
computational parts is easy: one just has to be careful to discard
dependences between opposite branches of a test. Since, in CRP, reading
is not destructive, there is no problem with conditional reads, except
that they might enlarge the lifespan of some channel tokens. Dealing
with conditional writes is more difficult. One must ensure that the
single-assignment property is not violated. Besides, conditional writes
may create undefined values (bubbles) in channels. How should bubbles
be dealt with at the receiving end?
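
A minimal sketch of the difficulty, using a hypothetical
single-assignment channel in Python: a skipped conditional write leaves
a bubble, a read of a bubble fails, and the compiler would have to
prove that this never happens.

{{{#!python
# A single-assignment channel: each cell may be written at most once
# (the CRP constraint) and reads are non-destructive. A conditional
# write that is skipped leaves a "bubble" (an undefined cell).

class Channel:
    def __init__(self, size):
        self.cells = [None] * size
        self.written = [False] * size

    def write(self, i, value):
        if self.written[i]:
            raise RuntimeError("single assignment violated at cell %d" % i)
        self.cells[i] = value
        self.written[i] = True

    def read(self, i):
        if not self.written[i]:
            raise RuntimeError("read of a bubble at cell %d" % i)
        return self.cells[i]             # non-destructive read

ch = Channel(4)
for i in range(4):
    if i % 2 == 0:                       # conditional write: the odd
        ch.write(i, i * 10)              # cells become bubbles
print(ch.read(0))                        # fine: prints 0
try:
    ch.read(1)
except RuntimeError as e:
    print(e)                             # read of a bubble at cell 1
}}}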

The net result is that conditionals in CRP processes have to be
severely restricted, and these restrictions must be verifiable by the
compiler.

==== Synchronous Languages and the Polyhedral Model ====

There has been in the past an attempt to combine SIGNAL and MMALPHA,
but I have no information on the subject. An obvious possibility is to
allow signals of arrays (signals whose values are arrays instead of
scalars). Since all elements of the array will have the same clock,
there will be no modification to the clock calculus. The point-wise
operators will have to be extended to allow polyhedral operations
(loops or vector notation). Scheduling will be done in two steps: first
by the usual clock calculus, and second, at each event, by a polyhedral
scheduling which may even uncover additional parallelism.
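
A minimal sketch of this first proposal, in Python rather than in
SIGNAL (the stream representation is an assumption): a signal of arrays
is a stream of events whose values are whole arrays sharing one clock,
and the point-wise operators are lifted element-wise.

{{{#!python
# A signal of arrays: one stream of events, each carrying a whole
# array; all elements share the event's clock, so the clock calculus
# is unchanged and only the point-wise operators need to be lifted.

def lift(op):
    """Lift a scalar operator to signals of arrays."""
    def on_signals(sig_a, sig_b):
        # both signals have the same clock: events arrive in lockstep
        for a, b in zip(sig_a, sig_b):
            # this element-wise loop is where polyhedral scheduling
            # could uncover additional (data) parallelism
            yield [op(x, y) for x, y in zip(a, b)]
    return on_signals

add = lift(lambda x, y: x + y)
xs = [[1, 2, 3], [4, 5, 6]]              # two events, each an array
ys = [[10, 20, 30], [40, 50, 60]]
print(list(add(xs, ys)))                 # [[11, 22, 33], [44, 55, 66]]
}}}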

A more ambitious proposal is to introduce arrays of signals. Here, each
elementary signal may have its own clock. Analyzing such a system may
necessitate a parametric clock calculus.

==== Network Scheduling ====

Present-day communication networks, from networks on chip to the
Internet, operate in "best effort" mode. Each message bears a routing
prefix, and each router dispatches it as best it can, according to
local information, and sometimes to scanty global information (e.g.
neighboring link congestion). In some cases, one may create a route and
assign to it some portion of the available bandwidth. Another solution
would be to use passive routers, able only to connect their input ports
to their output ports in arbitrary patterns, under the control either
of the associated processor or of a pre-stored configuration table,
prepared beforehand by a compiler. See the "polymorphic torus" paper
for a similar proposal; note however that some of the assumptions in
that paper may not be valid for today's architectures; an example is
the hypothesis that each router acts as a short circuit, and hence that
the propagation delay across the network is negligible. This scheme is
also reminiscent of SIMD architectures like the Connection Machine 2 or
Propal, but in their case the connection pattern was fixed: a hypercube
for the CM2 and a shift register for Propal.
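
A minimal sketch of such a passive router driven by a
compiler-generated configuration table (all names are hypothetical): at
each time step the router merely applies a pre-stored input-to-output
permutation and takes no routing decision at run time.

{{{#!python
# A passive 4-port router: at every step it applies one of the
# input->output permutations stored in a table that a compiler has
# prepared from the application's communication pattern.

# config_table[t][i] = output port wired to input port i at step t
config_table = [
    [1, 0, 3, 2],    # step 0: swap ports 0<->1 and 2<->3
    [2, 3, 0, 1],    # step 1: swap ports 0<->2 and 1<->3
]

def route(step, inputs):
    """Apply the pre-stored permutation; no arbitration, no buffering."""
    pattern = config_table[step % len(config_table)]
    outputs = [None] * len(inputs)
    for i, msg in enumerate(inputs):
        outputs[pattern[i]] = msg
    return outputs

print(route(0, ["a", "b", "c", "d"]))    # ['b', 'a', 'd', 'c']
print(route(1, ["a", "b", "c", "d"]))    # ['c', 'd', 'a', 'b']
}}}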

As these examples show, this proposal implies synchrony, at least
whenever a communication is intended; the programming style must be
SPMD and somewhat like BSP. This might be barely possible for systems
on chip like the Intel MIC or the Kalray chip, but is excluded for
exascale architectures; such architectures will probably use a
hierarchy of interconnects, and the proposed scheme may be applicable
to the lowest level.

Another point is that the compiler must have complete knowledge of the
application communication pattern and of the network topology. On the
other hand, the placement of the computations on the network is one of
the unknowns of the problem. In abstract terms, one has to find both a
placement and several communication patterns, the main objective being
the minimization of link contention. As such, this will need a mixture
of temporal and spatial reasoning, much like the construction of a
railway timetable, a very interesting subject.
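
To make the objective concrete, here is a minimal sketch (the ring
topology, placements and message set are all hypothetical) that scores
a placement by its worst link contention, i.e. the maximum number of
messages that must cross any single link under shortest-path routing.

{{{#!python
# Score a placement on a ring of N nodes: route each message along the
# shorter arc and count how many messages cross each directed link;
# the worst-loaded link is the contention to be minimized.

N = 8  # hypothetical ring size

def links_used(src, dst):
    """Directed links crossed when going the short way around."""
    fwd = (dst - src) % N
    step = 1 if fwd <= N - fwd else -1
    links, node = [], src
    while node != dst:
        links.append((node, (node + step) % N))
        node = (node + step) % N
    return links

def contention(messages, placement):
    """Maximum number of messages sharing one link."""
    load = {}
    for a, b in messages:
        for link in links_used(placement[a], placement[b]):
            load[link] = load.get(link, 0) + 1
    return max(load.values())

msgs = [(0, 1), (1, 2), (2, 3), (3, 0)]  # a small cycle of tasks
good = {0: 0, 1: 1, 2: 2, 3: 3}          # neighbors stay adjacent
bad  = {0: 0, 1: 4, 2: 1, 3: 5}          # tasks scattered on the ring
print(contention(msgs, good), contention(msgs, bad))   # 1 2
}}}

Searching for the placement and the temporal ordering of the messages
that minimize this quantity is precisely the mixture of spatial and
temporal reasoning alluded to above.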