=== Paul's Own Thoughts on Process Networks on Chip ===

==== Landscape ====
Due to technological problems (quantum effects, power dissipation), the clock frequency of present-day digital circuits seems to have reached an upper bound around 3 GHz. However, gate density is still increasing, albeit not as fast as it did before the millennium. The increases in processing power required by new applications can therefore only come from increased parallelism. Reaching the exascale level (10^18 floating-point operations per second) will need billions of processors: at a few gigaflops per core, 10^18 flop/s already requires on the order of 10^9 cores. Embedded and mobile applications, while less greedy, will run on thousands of processors, as exemplified by the Intel MIC (1000 cores) or the Kalray chip (256 processors today, up to 1024 in the near future).

Increasing the number of processors does not mechanically result in increased processing power: there are many bottlenecks, or walls, to be breached before such architectures can be used efficiently, among them the power wall, the memory and pin wall, the communication bottleneck, and the most important one, the programmer wall.
==== Memory ====
Each processor consumes a few data words per cycle. These words can come from registers, from several levels of local memory (caches or scratchpads), from global memory, or from permanent storage. Since the efficiency of a memory hierarchy cannot be made arbitrarily large, the number of memory banks will only be a fraction of the number of processors. Since each processor chip can be connected to only a small number of memory chips (the pin wall), the system will necessarily have non-uniform memory access (NUMA) latency. A purely dynamic replacement policy, as found for instance in present-day caches and virtual memory, may not scale up to the necessary sizes. Besides, such a policy forbids accurate execution-time prediction and is unsuitable for real-time applications. In the same way, it seems that dynamic coherence protocols will not scale: the required data movements must be orchestrated by the application software, possibly with the help of the compiler.
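
To make the last point concrete, here is a minimal sketch of software-orchestrated data movement: double buffering between a large "global" array and a small scratchpad, with memcpy standing in for an asynchronous DMA engine. The names (dma_fetch, the tile size) are illustrative, not taken from any particular chip.

<pre>
#include <string.h>
#include <stdio.h>

#define TILE 4          /* words per scratchpad buffer (toy size)       */
#define N    16         /* total words in "global" memory               */

/* Stand-in for a DMA engine: on a real chip this would start an
   asynchronous transfer; here memcpy completes immediately.            */
static void dma_fetch(double *scratch, const double *global, int n) {
    memcpy(scratch, global, n * sizeof *scratch);
}

int main(void) {
    double global[N], sum = 0.0;
    double buf[2][TILE];                  /* two scratchpad buffers      */
    for (int i = 0; i < N; i++) global[i] = i;

    dma_fetch(buf[0], &global[0], TILE);  /* prefetch the first tile     */
    for (int t = 0; t < N / TILE; t++) {
        int cur = t & 1;
        /* Start fetching tile t+1 while computing on tile t: this
           overlap is exactly what a cache cannot be told to perform.    */
        if (t + 1 < N / TILE)
            dma_fetch(buf[1 - cur], &global[(t + 1) * TILE], TILE);
        for (int i = 0; i < TILE; i++)
            sum += buf[cur][i];
    }
    printf("sum = %g\n", sum);            /* 0+1+...+15 = 120            */
    return 0;
}
</pre>

The point of the two buffers is that the transfer of the next tile overlaps the computation on the current one, under the application's explicit control.
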
==== Control and Parallelism ====
Conditional jumps and parallelism have never fitted well together. This is true, for instance:
* for superscalar and pipelined processors, which have to stall instruction issue while waiting for the outcome of a test, or resort to heuristics like speculation, branch prediction and delay slots;
* for SIMD systems, in which all processors execute the same code, some of them being prevented from modifying memory by an activity bit set to the outcome of a test (see the sketch after this list);
* for vector processors and GPUs, which use variants of this scheme (mask vectors, thread serialization);
* in hardware, where there are two solutions:
** start the test and both branches at the same time, and feed both results to a multiplexer controlled by the result of the test (efficient if all three branches are well balanced, but costly in silicon area);
** use the result of the test to select the next state of a finite state machine (same problem as for ordinary processors: the test is on the critical path; however, it is easier to anticipate if dependences permit).
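
As an illustration of the activity-bit / mask-vector scheme, here is a toy sketch in plain C: every lane computes both branches, and a mask selects which result is kept. This only simulates the idea; it is not the code any particular SIMD machine executes.

<pre>
#include <stdio.h>

#define LANES 8   /* width of the (simulated) SIMD machine */

int main(void) {
    int x[LANES] = {3, -1, 4, -1, 5, -9, 2, -6};
    int y[LANES];

    /* Simulated SIMD execution of: if (x >= 0) y = 2*x; else y = -x;
       Every lane executes both branches; the activity mask decides
       which result is allowed to reach memory.                         */
    for (int l = 0; l < LANES; l++) {
        int mask   = -(x[l] >= 0);        /* all ones or all zeros      */
        int then_v = 2 * x[l];            /* both branches computed     */
        int else_v = -x[l];
        y[l] = (then_v & mask) | (else_v & ~mask);
    }

    for (int l = 0; l < LANES; l++) printf("%d ", y[l]);
    printf("\n");                         /* 6 1 8 1 10 9 4 6           */
    return 0;
}
</pre>
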
It is likely that introducing control in dataflow systems will be at least as difficult as for ordinary processors. Observe that there is no problem introducing tests in dataflow *languages*; but the net result is that one loses almost all interesting properties: static scheduling, deadlock detection, channel bounds, and even determinism if one relaxes the "only one writer and one reader per channel" constraint.
A suggestion: start from the Communicating Regular Processes (CRP) model and introduce control progressively. Control in purely computational parts is easy: one just has to be careful to discard dependences between opposite branches of a test. Since in CRP reading is not destructive, there is no problem with conditional reads, except that they might enlarge the lifespan of some channel tokens. Dealing with conditional writes is more difficult. One must ensure that the single assignment property is not violated. Besides, conditional writes may create undefined values (bubbles) in channels. How is the receiving end to deal with bubbles?
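
Here is a toy illustration of the bubble problem, with a channel modeled as a single-assignment array. The representation is mine, not part of the CRP formalism; in particular, the "full" bit that makes bubbles detectable is exactly the kind of bookkeeping a real implementation would have to pay for.

<pre>
#include <stdio.h>
#include <stdbool.h>

#define N 8

/* Toy single-assignment channel: each cell may be written at most
   once; a "full" bit records whether it ever was.                      */
typedef struct { double val[N]; bool full[N]; } channel;

static void chan_write(channel *c, int i, double v) {
    if (c->full[i]) {                     /* single assignment violated */
        fprintf(stderr, "double write at %d\n", i);
        return;
    }
    c->val[i] = v;
    c->full[i] = true;
}

int main(void) {
    channel c = {{0}, {false}};

    /* Producer with a conditional write: odd cells are never filled.   */
    for (int i = 0; i < N; i++)
        if (i % 2 == 0)
            chan_write(&c, i, (double)i);

    /* Consumer: reads are non-destructive, but the odd cells are
       bubbles, undefined values the receiver must recognize or avoid.  */
    for (int i = 0; i < N; i++)
        if (c.full[i]) printf("c[%d] = %g\n", i, c.val[i]);
        else           printf("c[%d] = bubble\n", i);
    return 0;
}
</pre>
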
The net result is that conditionals in CRP processes will have to be severely restricted, and these restrictions must be verifiable by the compiler.
==== Synchronous Languages and the Polyhedral Model ====
There has been in the past an attempt to combine SIGNAL and MMALPHA, but I have no further information on the subject. An obvious possibility is to allow signals of arrays (signals whose values are arrays instead of scalars). Since all elements of the array have the same clock, no modification to the clock calculus is needed. The point-wise operators will have to be extended to allow polyhedral operations (loops or vector notation). Scheduling will then be done in two steps: first by the usual clock calculus, and second, at each event, a polyhedral scheduling which may even uncover additional parallelism.
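
A sketch of what this could mean operationally, for a hypothetical SIGNAL-like equation y := a + b where a, b and y are signals of arrays: the outer loop enumerates events (the part governed by the clock calculus), and the inner, point-wise loop is the polyhedral fragment a scheduler could parallelize.

<pre>
#include <stdio.h>

#define N 4   /* array size carried by each event */

/* One event of a signal of arrays: all N elements share the event's
   clock, so a point-wise operator such as "+" simply becomes a loop
   (for a compiler, a polyhedral fragment to schedule).                 */
static void add_event(const double a[N], const double b[N], double y[N]) {
    for (int i = 0; i < N; i++)           /* elementwise, same clock    */
        y[i] = a[i] + b[i];
}

int main(void) {
    /* Three events of two input signals; y := a + b at each event.     */
    double a[3][N] = {{1,2,3,4}, {5,6,7,8}, {9,10,11,12}};
    double b[3][N] = {{1,1,1,1}, {2,2,2,2}, {3,3,3,3}};
    double y[N];

    for (int t = 0; t < 3; t++) {         /* events: the clock calculus */
        add_event(a[t], b[t], y);         /* polyhedral part per event  */
        for (int i = 0; i < N; i++) printf("%g ", y[i]);
        printf("\n");
    }
    return 0;
}
</pre>
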
A more ambitious proposal is to introduce arrays of signals. Here each elementary signal may have its own clock, and analyzing such a system may require a parametric clock calculus.
==== Network Scheduling ====
Present-day communication networks, from networks on chip to the Internet, operate in "best effort" mode. Each message bears a routing prefix, and each router dispatches it as best it can, according to local information, and sometimes to scanty global information (e.g. the congestion of neighboring links). In some cases, one may create a route and assign to it some portion of the available bandwidth. Another solution would be to use passive routers, able only to connect their input ports to their output ports in arbitrary patterns, under the control either of the associated processor or of a pre-stored configuration table, prepared beforehand by a compiler. See the "polymorphic torus" paper for a similar proposal; note however that some of the assumptions of that paper may not be valid for today's architectures; an example is the hypothesis that each router acts as a short circuit, and hence that the propagation delay across the network is negligible. This scheme is also reminiscent of SIMD architectures like the Connection Machine 2 or Propal, but in their case the connection pattern was fixed: a hypercube for the CM2 and a shift register for Propal.
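
A minimal sketch of such a passive router: at each step, a pre-stored table (the one the compiler would have prepared) says which input port each output port is wired to, and the router does nothing else, with no buffering or arbitration. The port count, step count and tables are of course made up.

<pre>
#include <stdio.h>

#define PORTS 4
#define STEPS 2

int main(void) {
    /* Pre-stored configuration table: config[t][out] = which input
       port feeds output port out at step t.  Each row is a
       permutation of the ports.                                        */
    int config[STEPS][PORTS] = {
        {1, 0, 3, 2},    /* step 0: swap 0<->1 and 2<->3                */
        {3, 2, 1, 0},    /* step 1: reverse the ports                   */
    };

    int msg[PORTS] = {100, 101, 102, 103};    /* one message per port   */

    for (int t = 0; t < STEPS; t++) {
        int next[PORTS];
        /* The router is passive: it only connects inputs to outputs
           as the table dictates.                                       */
        for (int out = 0; out < PORTS; out++)
            next[out] = msg[config[t][out]];
        for (int p = 0; p < PORTS; p++) msg[p] = next[p];

        printf("after step %d:", t);
        for (int p = 0; p < PORTS; p++) printf(" %d", msg[p]);
        printf("\n");
    }
    return 0;
}
</pre>
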
As these examples show, this proposal implies synchrony, at least whenever a communication is intended; the programming style must be SPMD, somewhat in the manner of BSP. This might be barely possible for systems on chip like the Intel MIC or the Kalray chip, but is excluded for exascale architectures; such architectures will probably use a hierarchy of interconnects, and the proposed scheme may be applicable at the lowest level.

Another point is that the compiler must have complete knowledge of the application's communication pattern and of the network topology. On the other hand, the placement of the computations on the network is one of the unknowns of the problem. In abstract terms, one has to find both a placement and several communication patterns, the main objective being the minimization of link contention. This will need a mixture of temporal and spatial reasoning, much like the construction of a railway timetable, a very interesting subject.
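
To fix ideas, here is a toy evaluation of one half of that problem: given a placement (computations already mapped to mesh coordinates) and a set of communications, count the load of each link under deterministic XY routing and report the maximum, a crude proxy for contention. Only the horizontal leg of each route is counted, for brevity; the mesh size, the routing policy and the data are all assumptions of the sketch.

<pre>
#include <stdio.h>

#define W 4          /* 4x4 mesh of processors                          */

/* Count how many communications cross each horizontal mesh link under
   XY routing, and return the maximum load.  The placement is folded
   into the coordinates: moving a computation means changing its (x,y).
   The vertical leg of each route is ignored in this toy.               */
static int max_link_load(int comm[][4], int n) {
    int hload[W][W - 1] = {{0}};   /* hload[y][x]: link (x,y)-(x+1,y)   */
    for (int k = 0; k < n; k++) {
        int x0 = comm[k][0], y0 = comm[k][1], x1 = comm[k][2];
        int lo = x0 < x1 ? x0 : x1, hi = x0 < x1 ? x1 : x0;
        for (int x = lo; x < hi; x++)  /* XY: x-leg on the source row   */
            hload[y0][x]++;
    }
    int max = 0;
    for (int y = 0; y < W; y++)
        for (int x = 0; x < W - 1; x++)
            if (hload[y][x] > max) max = hload[y][x];
    return max;
}

int main(void) {
    /* Three communications: {src_x, src_y, dst_x, dst_y}.              */
    int comm[3][4] = {{0, 0, 3, 0}, {1, 0, 2, 0}, {0, 1, 3, 1}};
    printf("max link load: %d\n", max_link_load(comm, 3));  /* prints 2 */
    return 0;
}
</pre>

An optimizer would search over placements (and, for the passive-router scheme, over time-multiplexed configuration tables) to drive this maximum down.
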