=== Paul's Own Thoughts on Process Networks on Chip ===

==== Landscape ====
Due to technological problems (quantum effects, power dissipation), the clock frequency of present-day digital circuits seems to have reached an upper bound around 3 GHz. However, gate density is still increasing, albeit not as fast as it did before the millennium. The increases in processing power required by new applications can therefore only come from increased parallelism. Reaching the exascale level (10^18 floating-point operations per second) will need billions of processors: at a few gigaflops per core, 10^18 flop/s already requires on the order of 10^9 cores. Embedded and mobile applications, while less greedy, will run on thousands of processors, as exemplified by the Intel MIC (1000 cores) or the Kalray chip (256 processors today, up to 1024 in the near future).

Increasing the number of processors does not mechanically result in increased processing power: there are many bottlenecks, or walls, to be breached before such architectures can be used efficiently, among them the power wall, the memory and pin wall, the communication bottleneck, and the most important one, the programmer wall.
==== Memory ====
Each processor consumes a few data words per cycle. These words can come from registers, from several levels of local memory (caches or scratchpads), from global memory, or from permanent storage. Since the efficiency of a memory hierarchy cannot be made arbitrarily large, the number of memory banks will only be a fraction of the number of processors. Since each processor chip can be connected to only a small number of memory chips (the pin wall), the system will necessarily have non-uniform memory access (NUMA) latency. A purely dynamic replacement policy, as found for instance in present-day caches and virtual memory, may not scale up to the necessary sizes. Besides, such a policy forbids accurate execution-time prediction and is unsuitable for real-time applications. In the same way, it seems that dynamic coherence protocols will not scale: the required data movements must be orchestrated by the application software, possibly with the help of the compiler.
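
To make the last point concrete, here is a minimal sketch of software-orchestrated data movement: double buffering between a large "global" array and a small scratchpad, with memcpy standing in for an asynchronous DMA engine. The names (dma_fetch, the tile size) are illustrative, not taken from any particular chip.

<pre>
#include <string.h>
#include <stdio.h>

#define TILE 4          /* words per scratchpad buffer (toy size)       */
#define N    16         /* total words in "global" memory               */

/* Stand-in for a DMA engine: on a real chip this would start an
   asynchronous transfer; here memcpy completes immediately.            */
static void dma_fetch(double *scratch, const double *global, int n) {
    memcpy(scratch, global, n * sizeof *scratch);
}

int main(void) {
    double global[N], sum = 0.0;
    double buf[2][TILE];                  /* two scratchpad buffers      */
    for (int i = 0; i < N; i++) global[i] = i;

    dma_fetch(buf[0], &global[0], TILE);  /* prefetch the first tile     */
    for (int t = 0; t < N / TILE; t++) {
        int cur = t & 1;
        /* Start fetching tile t+1 while computing on tile t: this
           overlap is exactly what a cache cannot be told to perform.    */
        if (t + 1 < N / TILE)
            dma_fetch(buf[1 - cur], &global[(t + 1) * TILE], TILE);
        for (int i = 0; i < TILE; i++)
            sum += buf[cur][i];
    }
    printf("sum = %g\n", sum);            /* 0+1+...+15 = 120            */
    return 0;
}
</pre>

The point of the two buffers is that the transfer of the next tile overlaps the computation on the current one, under the application's explicit control.
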
==== Control and Parallelism ====
Conditional jumps and parallelism have never fitted well together. This is true, for instance:
* for superscalar and pipelined processors, which have to stall instruction issue while waiting for the outcome of a test, or resort to heuristics like speculation, branch prediction and delay slots;
* for SIMD systems, in which all processors execute the same code, some of them being prevented from modifying memory by an activity bit set to the outcome of a test (see the sketch after this list);
* for vector processors and GPUs, which use variants of this scheme (mask vectors, thread serialization);
* in hardware, where there are two solutions:
** start the test and both branches at the same time, and feed both results to a multiplexer controlled by the result of the test (efficient if all three branches are well balanced, but costly in silicon area);
** use the result of the test to select the next state of a finite state machine (same problem as for ordinary processors: the test is on the critical path; however, it is easier to anticipate if dependences permit).
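
As an illustration of the activity-bit / mask-vector scheme, here is a toy sketch in plain C: every lane computes both branches, and a mask selects which result is kept. This only simulates the idea; it is not the code any particular SIMD machine executes.

<pre>
#include <stdio.h>

#define LANES 8   /* width of the (simulated) SIMD machine */

int main(void) {
    int x[LANES] = {3, -1, 4, -1, 5, -9, 2, -6};
    int y[LANES];

    /* Simulated SIMD execution of: if (x >= 0) y = 2*x; else y = -x;
       Every lane executes both branches; the activity mask decides
       which result is allowed to reach memory.                         */
    for (int l = 0; l < LANES; l++) {
        int mask   = -(x[l] >= 0);        /* all ones or all zeros      */
        int then_v = 2 * x[l];            /* both branches computed     */
        int else_v = -x[l];
        y[l] = (then_v & mask) | (else_v & ~mask);
    }

    for (int l = 0; l < LANES; l++) printf("%d ", y[l]);
    printf("\n");                         /* 6 1 8 1 10 9 4 6           */
    return 0;
}
</pre>
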
It is likely that introducing control in dataflow systems will be at least as difficult as for ordinary processors. Observe that there is no problem introducing tests in dataflow *languages*; but the net result is that one loses almost all interesting properties: static scheduling, deadlock detection, channel bounds, and even determinism if one relaxes the "only one writer and one reader per channel" constraint.
A suggestion: start from the Communicating Regular Processes (CRP) model and introduce control progressively. Control in purely computational parts is easy: one just has to be careful to discard dependences between opposite branches of a test. Since in CRP reading is not destructive, there is no problem with conditional reads, except that they might enlarge the lifespan of some channel tokens. Dealing with conditional writes is more difficult. One must ensure that the single assignment property is not violated. Besides, conditional writes may create undefined values (bubbles) in channels. How is the receiving end to deal with bubbles?
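
Here is a toy illustration of the bubble problem, with a channel modeled as a single-assignment array. The representation is mine, not part of the CRP formalism; in particular, the "full" bit that makes bubbles detectable is exactly the kind of bookkeeping a real implementation would have to pay for.

<pre>
#include <stdio.h>
#include <stdbool.h>

#define N 8

/* Toy single-assignment channel: each cell may be written at most
   once; a "full" bit records whether it ever was.                      */
typedef struct { double val[N]; bool full[N]; } channel;

static void chan_write(channel *c, int i, double v) {
    if (c->full[i]) {                     /* single assignment violated */
        fprintf(stderr, "double write at %d\n", i);
        return;
    }
    c->val[i] = v;
    c->full[i] = true;
}

int main(void) {
    channel c = {{0}, {false}};

    /* Producer with a conditional write: odd cells are never filled.   */
    for (int i = 0; i < N; i++)
        if (i % 2 == 0)
            chan_write(&c, i, (double)i);

    /* Consumer: reads are non-destructive, but the odd cells are
       bubbles, undefined values the receiver must recognize or avoid.  */
    for (int i = 0; i < N; i++)
        if (c.full[i]) printf("c[%d] = %g\n", i, c.val[i]);
        else           printf("c[%d] = bubble\n", i);
    return 0;
}
</pre>
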
The net result is that conditionals in CRP processes will have to be severely restricted, and these restrictions must be verifiable by the compiler.
==== Synchronous Languages and the Polyhedral Model ====
There has been in the past an attempt to combine SIGNAL and MMALPHA, but I have no further information on the subject. An obvious possibility is to allow signals of arrays (signals whose values are arrays instead of scalars). Since all elements of the array have the same clock, no modification to the clock calculus is needed. The point-wise operators will have to be extended to allow polyhedral operations (loops or vector notation). Scheduling will then be done in two steps: first by the usual clock calculus, and second, at each event, a polyhedral scheduling which may even uncover additional parallelism.
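
A sketch of what this could mean operationally, for a hypothetical SIGNAL-like equation y := a + b where a, b and y are signals of arrays: the outer loop enumerates events (the part governed by the clock calculus), and the inner, point-wise loop is the polyhedral fragment a scheduler could parallelize.

<pre>
#include <stdio.h>

#define N 4   /* array size carried by each event */

/* One event of a signal of arrays: all N elements share the event's
   clock, so a point-wise operator such as "+" simply becomes a loop
   (for a compiler, a polyhedral fragment to schedule).                 */
static void add_event(const double a[N], const double b[N], double y[N]) {
    for (int i = 0; i < N; i++)           /* elementwise, same clock    */
        y[i] = a[i] + b[i];
}

int main(void) {
    /* Three events of two input signals; y := a + b at each event.     */
    double a[3][N] = {{1,2,3,4}, {5,6,7,8}, {9,10,11,12}};
    double b[3][N] = {{1,1,1,1}, {2,2,2,2}, {3,3,3,3}};
    double y[N];

    for (int t = 0; t < 3; t++) {         /* events: the clock calculus */
        add_event(a[t], b[t], y);         /* polyhedral part per event  */
        for (int i = 0; i < N; i++) printf("%g ", y[i]);
        printf("\n");
    }
    return 0;
}
</pre>
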
A more ambitious proposal is to introduce arrays of signals. Here each elementary signal may have its own clock, and analyzing such a system may require a parametric clock calculus.
==== Network Scheduling ====
Present-day communication networks, from networks on chip to the Internet, operate in "best effort" mode. Each message bears a routing prefix, and each router dispatches it as best it can, according to local information, and sometimes to scanty global information (e.g. the congestion of neighboring links). In some cases, one may create a route and assign to it some portion of the available bandwidth. Another solution would be to use passive routers, able only to connect their input ports to their output ports in arbitrary patterns, under the control either of the associated processor or of a pre-stored configuration table, prepared beforehand by a compiler. See the "polymorphic torus" paper for a similar proposal; note however that some of the assumptions of that paper may not be valid for today's architectures; an example is the hypothesis that each router acts as a short circuit, and hence that the propagation delay across the network is negligible. This scheme is also reminiscent of SIMD architectures like the Connection Machine 2 or Propal, but in their case the connection pattern was fixed: a hypercube for the CM2 and a shift register for Propal.
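
A minimal sketch of such a passive router: at each step, a pre-stored table (the one the compiler would have prepared) says which input port each output port is wired to, and the router does nothing else, with no buffering or arbitration. The port count, step count and tables are of course made up.

<pre>
#include <stdio.h>

#define PORTS 4
#define STEPS 2

int main(void) {
    /* Pre-stored configuration table: config[t][out] = which input
       port feeds output port out at step t.  Each row is a
       permutation of the ports.                                        */
    int config[STEPS][PORTS] = {
        {1, 0, 3, 2},    /* step 0: swap 0<->1 and 2<->3                */
        {3, 2, 1, 0},    /* step 1: reverse the ports                   */
    };

    int msg[PORTS] = {100, 101, 102, 103};    /* one message per port   */

    for (int t = 0; t < STEPS; t++) {
        int next[PORTS];
        /* The router is passive: it only connects inputs to outputs
           as the table dictates.                                       */
        for (int out = 0; out < PORTS; out++)
            next[out] = msg[config[t][out]];
        for (int p = 0; p < PORTS; p++) msg[p] = next[p];

        printf("after step %d:", t);
        for (int p = 0; p < PORTS; p++) printf(" %d", msg[p]);
        printf("\n");
    }
    return 0;
}
</pre>
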
As these examples show, this proposal implies synchrony, at least whenever a communication is intended; the programming style must be SPMD, somewhat in the manner of BSP. This might be barely possible for systems on chip like the Intel MIC or the Kalray chip, but is excluded for exascale architectures; such architectures will probably use a hierarchy of interconnects, and the proposed scheme may be applicable at the lowest level.

Another point is that the compiler must have complete knowledge of the application's communication pattern and of the network topology. On the other hand, the placement of the computations on the network is one of the unknowns of the problem. In abstract terms, one has to find both a placement and several communication patterns, the main objective being the minimization of link contention. This will need a mixture of temporal and spatial reasoning, much like the construction of a railway timetable, a very interesting subject.
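
To fix ideas, here is a toy evaluation of one half of that problem: given a placement (computations already mapped to mesh coordinates) and a set of communications, count the load of each link under deterministic XY routing and report the maximum, a crude proxy for contention. Only the horizontal leg of each route is counted, for brevity; the mesh size, the routing policy and the data are all assumptions of the sketch.

<pre>
#include <stdio.h>

#define W 4          /* 4x4 mesh of processors                          */

/* Count how many communications cross each horizontal mesh link under
   XY routing, and return the maximum load.  The placement is folded
   into the coordinates: moving a computation means changing its (x,y).
   The vertical leg of each route is ignored in this toy.               */
static int max_link_load(int comm[][4], int n) {
    int hload[W][W - 1] = {{0}};   /* hload[y][x]: link (x,y)-(x+1,y)   */
    for (int k = 0; k < n; k++) {
        int x0 = comm[k][0], y0 = comm[k][1], x1 = comm[k][2];
        int lo = x0 < x1 ? x0 : x1, hi = x0 < x1 ? x1 : x0;
        for (int x = lo; x < hi; x++)  /* XY: x-leg on the source row   */
            hload[y0][x]++;
    }
    int max = 0;
    for (int y = 0; y < W; y++)
        for (int x = 0; x < W - 1; x++)
            if (hload[y][x] > max) max = hload[y][x];
    return max;
}

int main(void) {
    /* Three communications: {src_x, src_y, dst_x, dst_y}.              */
    int comm[3][4] = {{0, 0, 3, 0}, {1, 0, 2, 0}, {0, 1, 3, 1}};
    printf("max link load: %d\n", max_link_load(comm, 3));  /* prints 2 */
    return 0;
}
</pre>

An optimizer would search over placements (and, for the passive-router scheme, over time-multiplexed configuration tables) to drive this maximum down.
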