|  | 17 | === Paul's Own Thoughts on Process Networks on Chip === | 
                          |  | 18 | ==== Landscape ==== | 
                          |  | 19 |  | 
                          |  | 20 | Due to technological problems (quantum effects, power dissipation) | 
                          |  | 21 | the clock frequency of present day digital circuits | 
                          |  | 22 | seems to have reached an upper bound around | 
                          |  | 23 | 3 GHz. However, the gate density is still increasing, albeit not | 
                          |  | 24 | as fast as it did before the millenium. Increases in processing | 
                          |  | 25 | power, which are required by new applications, can only be found by | 
                          |  | 26 | increased parallelism. Reaching the exascale level (10^18 floating point | 
                          |  | 27 | operations per seconds) will need billions of processors. | 
                          |  | 28 | Embedded and mobile applications, while less greedy, will run on | 
                          |  | 29 | thousands of processors, as exemplified by the Intel MIC (1000 cores) | 
                          |  | 30 | or the Kalray chip (from 256 to 1024 processors in the near future). | 
                          |  | 31 |  | 
                          |  | 32 | Increasing the number of processors does not mechanically result in | 
                          |  | 33 | increased processing power: there are many bottlenecks or walls to be breached | 
                          |  | 34 | before being able to efficiently use such architectures, among them | 
                          |  | 35 | the power wall, the memory and pin wall, the communication bottleneck, | 
                          |  | 36 | and the most important one, the programmer wall. | 
                          |  | 37 |  | 
                          |  | 38 | ==== Memory ==== | 
                          |  | 39 |  | 
                          |  | 40 | Each processor consumes a few data words per cycle. These words can come from | 
                          |  | 41 | registers, from several levels of local memory (caches or scratchpads) | 
                          |  | 42 | from global memory or from permanent storage. Since the efficiency of a | 
                          |  | 43 | memory hierarchy cannot be made arbitrarily large, the number of memory | 
                          |  | 44 | banks will be a fraction of the number of processors. Since each processor chip | 
                          |  | 45 | can be connected only to a small number of memory chips (the pin wall), | 
                          |  | 46 | the system will necessarily have a Non Uniform Memory Access latency. | 
                          |  | 47 | A pure dynamic replacement policy, as found for instance in present day | 
                          |  | 48 | caches and virtual memory may not scale up to the necessary sizes. Beside, | 
                          |  | 49 | such a policy forbids an accurate execution time prediction, and is unsuitable | 
                          |  | 50 | for real-time applications. In the same way, it seems that dynamic | 
                          |  | 51 | coherence protocols will not scale: the required data movements must | 
                          |  | 52 | be orchestrated by the application software, possibly with the help of | 
                          |  | 53 | the compiler. | 
                          |  | 54 |  | 
                          |  | 55 | ==== Control and Parallelism ==== | 
                          |  | 56 |  | 
                          |  | 57 | Conditional jumps and parallelism have never fitted well. This is true for instance | 
                          |  | 58 | - for superscalar and pipelined processors, which have to stall instruction | 
                          |  | 59 | issue while waiting for the outcome of a test, or to resort to heuristics | 
                          |  | 60 | like speculation, branch prediction and delay slots, | 
                          |  | 61 | - for SIMD systems, in which all processors execute the same code, some of them | 
                          |  | 62 | being prevented from modifying memory by an activity bit, which is set to the | 
                          |  | 63 | outcome of a test, | 
                          |  | 64 | - vector processors and GPU use variants of this scheme (mask vector, thread | 
                          |  | 65 | serialization) | 
                          |  | 66 | - in hardware, there are two solutions: | 
                          |  | 67 | * start the test and both branches at the same time, and feed both results | 
                          |  | 68 | to a multiplexer controlled by the result of the test (efficient if | 
                          |  | 69 | all three branches are well balanced, but costly in silicon area) | 
                          |  | 70 | * use the result of the test to select the next state of a finite state | 
                          |  | 71 | machine (same problem as for ordinary processors: the test is in the | 
                          |  | 72 | critical path; however, it is easier to anticipate if dependences permit). | 
                          |  | 73 |  | 
                          |  | 74 | It is likely that introducing control for dataflow systems will be at least | 
                          |  | 75 | as difficult as for ordinary processors. Observe that there is no | 
                          |  | 76 | problem introducing tests in dataflow *languages*; but the net result is that | 
                          |  | 77 | one will lose almost all interesting properties, like static scheduling, | 
                          |  | 78 | deadlock detection, channel bounds, and even determinism if one relax | 
                          |  | 79 | the "only one writer and one reader per channel" constraint. | 
                          |  | 80 |  | 
                          |  | 81 | A suggestion: start from the Communicating Regular Processes model and introduce | 
                          |  | 82 | control progressively. Control in purely computational parts is easy: one just | 
                          |  | 83 | has to be careful to discard dependences between opposite branches of a test. | 
                          |  | 84 | Since in CRP, reading is not destructive, there is no problem with conditional reads, | 
                          |  | 85 | except that they might enlarge the lifespan of some channel tokens. Dealing | 
                          |  | 86 | with conditional writes is more difficult. One must insure that the single | 
                          |  | 87 | assignment property is not violated. Beside, conditional writes may create | 
                          |  | 88 | undefined values (bubbles) in channels. How to deal with bubbles at the | 
                          |  | 89 | receiving end? | 
                          |  | 90 |  | 
                          |  | 91 | The net result is that conditionals in CRP processes have to be severely | 
                          |  | 92 | restricted, and these restrictions must be verifiable by the compiler. | 
                          |  | 93 |  | 
                          |  | 94 | ==== Synchronous Languages and the Polyhedral Model ==== | 
                          |  | 95 |  | 
                          |  | 96 | There has been in the past an attempt to combine SIGNAL and MMALPHA , but I have | 
                          |  | 97 | no information on the subject. An obvious possibility is to allows signal of | 
                          |  | 98 | arrays (signal whose values are arrays instead of scalars). Since all elements | 
                          |  | 99 | of the array will have the same clock, there will be no modification to the | 
                          |  | 100 | clock calculus. The point-wise operators will have to be extended to allow | 
                          |  | 101 | polyhedral operations (loops or vector notation). Scheduling will be done in two steps: | 
                          |  | 102 | firstly, by the usual clock calculus, and second, at each event, a polyhedral | 
                          |  | 103 | scheduling which may even uncover additional parallelism. | 
                          |  | 104 |  | 
                          |  | 105 | A more ambitious proposal is to introduce array of signals. Here, each elementary | 
                          |  | 106 | signal may have its own clock. Analyzing such a system may necessitate a | 
                          |  | 107 | parametric clock calculus. | 
                          |  | 108 |  | 
                          |  | 109 | ==== Network Scheduling ==== | 
                          |  | 110 |  | 
                          |  | 111 | Present day communication networks, from Networks on Chips to the Internet, operate | 
                          |  | 112 | in "best effort" mode. Each message bears a routing prefix, and each router dispatches it | 
                          |  | 113 | as best it can, according to local information, and sometime to scanty global | 
                          |  | 114 | information (e.g. neighboring link congestion). In some cases, one may create a | 
                          |  | 115 | route and assign to it some portion of the available bandwidth. Another solution would | 
                          |  | 116 | be to use passive routers, able only to connect their input ports to their output ports | 
                          |  | 117 | in arbitrary pattern, under control either of the associated processor or of a | 
                          |  | 118 | pre-stored configuration table, to be prepared beforehand by a compiler. See the | 
                          |  | 119 | "polymorphic torus" paper for a similar proposal; note however that some of the assumptions | 
                          |  | 120 | in this paper may not be valid for today architecture; an example is the hypothesis | 
                          |  | 121 | that each router acts as a short circuit, and hence that the propagation delay across | 
                          |  | 122 | the network is negligible. This scheme is also reminiscent of SIMD architectures | 
                          |  | 123 | like the Connection Machine 2 or Propal, but in their case the connection pattern | 
                          |  | 124 | was fixed: a hypercube for the CM2 and a shift register for Propal. | 
                          |  | 125 |  | 
                          |  | 126 | As these examples show, this proposal implies synchrony, at least whenever a communication | 
                          |  | 127 | is intended; the programming style must be SPMD and somewhat like BSP. This might | 
                          |  | 128 | be barely possible for systems on chip like the Intel MIC or the Kalray chip, but is | 
                          |  | 129 | excluded for exascale architectures; such architectures will probably use a hierarchy | 
                          |  | 130 | of interconnects, and the proposed scheme may be applicable to the lowest level. | 
                          |  | 131 |  | 
                          |  | 132 | Another point is that the compiler must have complete knowledge of the application | 
                          |  | 133 | communication pattern and of the network topology. On the other hand, the placement | 
                          |  | 134 | of the computations on the network is one of the unknowns of the problem. In abstract | 
                          |  | 135 | term, one has to find both a placement and several communication patterns, the main | 
                          |  | 136 | objective being the minimization of link contention. As such, it will need a mixture | 
                          |  | 137 | of temporal and spatial reasoning, much like the construction of a railway timetable, | 
                          |  | 138 | a very interesting subject. | 
                          |  | 139 |  | 
                          |  | 140 |  | 
                          |  | 141 |  | 
                          |  | 142 |  | 
                          |  | 143 |  | 
                          |  | 144 |  |