=== Paul's Own Thoughts on Process Networks on Chip ===
==== Landscape ====

Due to technological problems (quantum effects, power dissipation),
the clock frequency of present-day digital circuits
seems to have reached an upper bound around
3 GHz. However, gate density is still increasing, albeit not
as fast as it did before the millennium. The increases in processing
power required by new applications can therefore only come from
increased parallelism. Reaching the exascale level (10^18 floating-point
operations per second) will need billions of processors.
Embedded and mobile applications, while less greedy, will run on
thousands of processors, as exemplified by the Intel MIC (1000 cores)
or the Kalray chip (from 256 to 1024 processors in the near future).

Increasing the number of processors does not automatically result in
increased processing power: there are many bottlenecks or walls to be breached
before such architectures can be used efficiently, among them
the power wall, the memory and pin wall, the communication bottleneck,
and the most important one, the programmer wall.

==== Memory ====

Each processor consumes a few data words per cycle. These words can come from
registers, from several levels of local memory (caches or scratchpads),
from global memory, or from permanent storage. Since the efficiency of a
memory hierarchy cannot be made arbitrarily large, the number of memory
banks will be a fraction of the number of processors. Since each processor chip
can be connected only to a small number of memory chips (the pin wall),
the system will necessarily have Non-Uniform Memory Access latency.
A purely dynamic replacement policy, as found for instance in present-day
caches and virtual memory, may not scale up to the necessary sizes. Besides,
such a policy precludes accurate execution-time prediction and is unsuitable
for real-time applications. In the same way, it seems that dynamic
coherence protocols will not scale: the required data movements must
be orchestrated by the application software, possibly with the help of
the compiler.
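To make the last point concrete, here is a minimal sketch of software-orchestrated data movement, the alternative to a hardware coherence protocol suggested above: the program itself copies a tile of global memory into a local scratchpad, computes there, and copies the result back. The tile size, variable names and the list-based "memories" are purely illustrative.

```python
# Software-managed scratchpad: all transfers between global memory and
# local storage are explicit, so the compiler/runtime can schedule them
# and execution time stays predictable. Names are illustrative.

GLOBAL = list(range(16))        # stand-in for global memory
SCRATCHPAD_SIZE = 4             # assumed local-memory capacity

def process_tile(start):
    # Explicit copy-in of one tile (what a DMA transfer would do).
    scratch = GLOBAL[start:start + SCRATCHPAD_SIZE]
    # Compute entirely on local data.
    scratch = [x * 2 for x in scratch]
    # Explicit copy-out: no coherence protocol is needed, because the
    # software knows exactly when the global copy becomes stale.
    GLOBAL[start:start + SCRATCHPAD_SIZE] = scratch

for tile in range(0, len(GLOBAL), SCRATCHPAD_SIZE):
    process_tile(tile)

print(GLOBAL[:4])  # → [0, 2, 4, 6]
```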

==== Control and Parallelism ====

Conditional jumps and parallelism have never fit well together. This is true, for instance:
- for superscalar and pipelined processors, which have to stall instruction
issue while waiting for the outcome of a test, or resort to heuristics
like speculation, branch prediction and delay slots;
- for SIMD systems, in which all processors execute the same code, some of them
being prevented from modifying memory by an activity bit, which is set to the
outcome of a test;
- for vector processors and GPUs, which use variants of this scheme (mask vectors,
thread serialization);
- in hardware, where there are two solutions:
  * start the test and both branches at the same time, and feed both results
to a multiplexer controlled by the result of the test (efficient if
the test and the two branches are well balanced, but costly in silicon area);
  * use the result of the test to select the next state of a finite state
machine (same problem as for ordinary processors: the test is in the
critical path; however, it is easier to anticipate if dependences permit).
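The activity-bit scheme used by SIMD systems can be sketched in a few lines: every lane computes both branches, and a mask derived from the test decides which result reaches memory. The lane model and function names below are illustrative, not taken from any real ISA.

```python
# SIMD-style predication: all lanes execute both branches of a
# conditional; an "activity bit" (mask) per lane selects which result
# is allowed to be written back.

def simd_conditional(xs):
    # The test, evaluated on every lane.
    mask = [x >= 0 for x in xs]
    # Both branches are computed on ALL lanes, active or not...
    then_result = [x * 2 for x in xs]   # then-branch: double
    else_result = [-x for x in xs]      # else-branch: negate
    # ...and the activity bit picks the value that survives.
    return [t if m else e
            for m, t, e in zip(mask, then_result, else_result)]

print(simd_conditional([3, -1, 0, -5]))  # → [6, 1, 0, 5]
```

Note the cost structure this implies: the time for the conditional is the sum of both branches, not the maximum, which is exactly why divergent control flow hurts SIMD efficiency.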

It is likely that introducing control in dataflow systems will be at least
as difficult as for ordinary processors. Observe that there is no
problem introducing tests in dataflow *languages*; but the net result is that
one will lose almost all interesting properties, like static scheduling,
deadlock detection, channel bounds, and even determinism if one relaxes
the "only one writer and one reader per channel" constraint.

A suggestion: start from the Communicating Regular Processes (CRP) model and introduce
control progressively. Control in purely computational parts is easy: one just
has to be careful to discard dependences between opposite branches of a test.
Since in CRP reading is not destructive, there is no problem with conditional reads,
except that they might enlarge the lifespan of some channel tokens. Dealing
with conditional writes is more difficult. One must ensure that the single-assignment
property is not violated. Besides, conditional writes may create
undefined values (bubbles) in channels. How should bubbles be dealt with at the
receiving end?
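The bubble problem can be made concrete with a toy channel model. The sketch below assumes only the two CRP properties mentioned above (single assignment, non-destructive reads); the class and sentinel names are invented for illustration.

```python
# Toy CRP-style channel: a write-once array of cells. Conditional writes
# may leave "bubbles" -- cells that are never written -- which the
# receiving process must somehow detect and handle.

BUBBLE = object()  # sentinel marking a never-written cell

class Channel:
    def __init__(self, size):
        self.cells = [BUBBLE] * size

    def write(self, i, value):
        # Enforce the single-assignment property: one write per cell.
        if self.cells[i] is not BUBBLE:
            raise RuntimeError(f"cell {i} written twice")
        self.cells[i] = value

    def read(self, i):
        # Non-destructive read: the token stays in place, so several
        # (possibly conditional) reads of the same cell are harmless.
        return self.cells[i]

ch = Channel(4)
for i in range(4):
    if i % 2 == 0:          # conditional write: odd cells become bubbles
        ch.write(i, i * i)

print([ch.read(i) is BUBBLE for i in range(4)])  # → [False, True, False, True]
```

A compiler that restricts conditionals would have to prove, statically, either that no bubble can reach a read, or that every read is guarded by the same condition as the matching write.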

The net result is that conditionals in CRP processes have to be severely
restricted, and these restrictions must be verifiable by the compiler.

==== Synchronous Languages and the Polyhedral Model ====

There has been in the past an attempt to combine SIGNAL and MMALPHA, but I have
no information on the subject. An obvious possibility is to allow signals of
arrays (signals whose values are arrays instead of scalars). Since all elements
of the array will have the same clock, no modification to the
clock calculus is needed. The point-wise operators will have to be extended to allow
polyhedral operations (loops or vector notation). Scheduling will be done in two steps:
first by the usual clock calculus, and then, at each event, by a polyhedral
scheduling pass which may even uncover additional parallelism.
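The key observation, that all elements of an array-valued signal share one clock, can be illustrated by lifting a scalar point-wise operator to array payloads. The representation below (a signal as a list of per-event values) and the function names are illustrative assumptions, not SIGNAL or MMALPHA syntax.

```python
# Sketch of a "signal of arrays": the clock calculus sees one signal per
# array, so lifting a point-wise operator only has to map it over the
# array payload carried at each event.

def lift(op):
    """Extend a scalar point-wise operator to array-valued signals."""
    def lifted(sig_a, sig_b):
        # Same clock means the same number of events on both signals.
        assert len(sig_a) == len(sig_b)
        return [[op(x, y) for x, y in zip(va, vb)]
                for va, vb in zip(sig_a, sig_b)]
    return lifted

add = lift(lambda x, y: x + y)
a = [[1, 2, 3], [4, 5, 6]]      # two events, each carrying a 3-array
b = [[10, 10, 10], [1, 1, 1]]
print(add(a, b))  # → [[11, 12, 13], [5, 6, 7]]
```

The inner comprehension is exactly the place where a polyhedral pass could reschedule or parallelize, independently of the per-event clock calculus.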

A more ambitious proposal is to introduce arrays of signals. Here, each elementary
signal may have its own clock. Analyzing such a system may require a
parametric clock calculus.

==== Network Scheduling ====

Present-day communication networks, from Networks on Chip to the Internet, operate
in "best effort" mode. Each message bears a routing prefix, and each router dispatches it
as best it can, according to local information, and sometimes to scanty global
information (e.g. neighboring link congestion). In some cases, one may create a
route and assign to it some portion of the available bandwidth. Another solution would
be to use passive routers, able only to connect their input ports to their output ports
in arbitrary patterns, under control either of the associated processor or of a
pre-stored configuration table, to be prepared beforehand by a compiler. See the
"polymorphic torus" paper for a similar proposal; note however that some of the assumptions
in this paper may not be valid for today's architectures; an example is the hypothesis
that each router acts as a short circuit, and hence that the propagation delay across
the network is negligible. This scheme is also reminiscent of SIMD architectures
like the Connection Machine 2 or Propal, but in their case the connection pattern
was fixed: a hypercube for the CM2 and a shift register for Propal.
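A passive router of this kind is easy to model: it makes no decisions at run time, it only applies, at each step, a permutation of input ports to output ports taken from a pre-stored table. The class, the table layout, and the two-step example below are all illustrative assumptions.

```python
# Toy passive router: at step t, input port i is wired to output port
# table[t][i]. The table would be prepared beforehand by a compiler;
# the router itself carries no routing logic.

class PassiveRouter:
    def __init__(self, table):
        self.table = table      # one permutation of ports per step

    def route(self, t, inputs):
        perm = self.table[t]
        outputs = [None] * len(inputs)
        for i, msg in enumerate(inputs):
            outputs[perm[i]] = msg
        return outputs

# Two precompiled configurations for a 3-port router.
r = PassiveRouter([[0, 1, 2],    # step 0: identity
                   [1, 2, 0]])   # step 1: rotate the ports
print(r.route(1, ["a", "b", "c"]))  # → ['c', 'a', 'b']
```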

As these examples show, this proposal implies synchrony, at least whenever a communication
is intended; the programming style must be SPMD and somewhat like BSP. This might
be barely possible for systems on chip like the Intel MIC or the Kalray chip, but is
excluded for exascale architectures; such architectures will probably use a hierarchy
of interconnects, and the proposed scheme may be applicable at the lowest level.

Another point is that the compiler must have complete knowledge of the application's
communication pattern and of the network topology. On the other hand, the placement
of the computations on the network is one of the unknowns of the problem. In abstract
terms, one has to find both a placement and several communication patterns, the main
objective being the minimization of link contention. Solving this will need a mixture
of temporal and spatial reasoning, much like the construction of a railway timetable,
a very interesting subject.
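A drastically simplified instance of this joint problem can still show its structure: place tasks on a linear array of nodes, count how many communications cross each link, and search for the placement minimizing the worst link load. The linear topology, brute-force search, and all names are illustrative; a real instance would involve richer topologies and time-multiplexed patterns.

```python
# Toy placement problem: minimize the maximum link contention when
# embedding a task communication graph into a linear array of nodes.

from itertools import permutations

def max_link_load(placement, comms, n_nodes):
    load = [0] * (n_nodes - 1)      # load[k] = traffic on link k..k+1
    for src, dst in comms:
        a, b = sorted((placement[src], placement[dst]))
        for k in range(a, b):       # the message crosses links a..b-1
            load[k] += 1
    return max(load)

def best_placement(tasks, comms, n_nodes):
    best = None
    for perm in permutations(range(n_nodes), len(tasks)):
        placement = dict(zip(tasks, perm))
        cost = max_link_load(placement, comms, n_nodes)
        if best is None or cost < best[1]:
            best = (placement, cost)
    return best

tasks = ["p", "q", "r", "s"]
comms = [("p", "q"), ("q", "r"), ("r", "s"), ("p", "s")]  # a ring
placement, cost = best_placement(tasks, comms, 4)
print(cost)  # → 2: embedding a ring in a line forces some link to carry 2
```

Even this toy version is a combinatorial search over placements with a min-max objective; the real problem adds time (several communication patterns sharing the links in sequence), which is what makes it resemble timetable construction.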