Changes between Version 2 and Version 3 of Specification
Timestamp: Jun 27, 2009, 2:17:53 PM

The main technical issue is scalability, as this architecture is intended to integrate up to 4096 cores (even if the first prototype will contain only 16 cores). The second technical issue is power consumption, and all the technical choices described below are driven by these two goals.

== 1. Processor core ==

In order to obtain the best MIPS/MicroWatt ratio, the TSAR processor core is a simple 32-bit, single instruction issue RISC processor, with no superscalar features, no out-of-order execution, no branch prediction, and no speculative execution. In order to avoid the enormous effort of developing a brand new compiler, TSAR will use an existing processor core. The choice is not critical: it could be a MIPS32, a PPC405, a SPARC V8, or an ARM7 core, as all these processor cores have similar performance.

[…]

The first TSAR architecture demonstrator will use a MIPS32 processor core.

== 2. Memory layout ==

The physical address space size is a parameter. The maximal value is 1 Tbyte (40-bit physical address). For scalability reasons, the TSAR physical memory is logically shared, but physically distributed: the architecture is clusterized and has a 2D mesh topology. Each cluster contains up to 4 processors, a local interconnect and one physical memory bank. The architecture is NUMA (Non Uniform Memory Access): all processors can access all memory banks, but the access time and the power consumption depend on the distance between the processor and the memory bank.

[…]

== 3. Virtual memory support ==

The TSAR architecture implements paged virtual memory. It defines a generic MMU (Memory Management Unit), physically implemented in the L1 cache controller. This generic MMU is independent of the processor core, and can be used with any 32-bit, single instruction issue RISC processor. To remain independent of the processor core, TLB misses are handled by a hardwired FSM, and do not use any specific instructions.

[…]

In order to help the operating system implement efficient page replacement policies, each entry in the page table contains three bits that are updated by the hardware MMU: a dirty bit to indicate modifications, and two separate access bits for “local access” (processor and memory cache located in the same cluster) and “remote access” (processor and memory cache located in different clusters).
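As an illustration of how system software might consume these hardware-maintained flags, here is a minimal C sketch. The bit positions and the names PTE_DIRTY, PTE_LOCAL_ACC and PTE_REMOTE_ACC are assumptions made for this example only; the actual TSAR page table entry format is not defined here.

{{{
#!c
/* Hypothetical layout of the three hardware-updated flags in a TSAR
 * page table entry. Bit positions are illustrative, not normative. */
#include <stdint.h>
#include <stdbool.h>

#define PTE_DIRTY       (1u << 0)  /* set by the MMU on a write to the page     */
#define PTE_LOCAL_ACC   (1u << 1)  /* set on access from a processor located in
                                      the same cluster as the memory cache      */
#define PTE_REMOTE_ACC  (1u << 2)  /* set on access from a remote cluster       */

/* A page replacement policy could, for instance, prefer evicting pages that
 * are clean and have only been touched remotely: dropping them costs nothing
 * (no write-back) and they are badly placed with respect to their users. */
bool good_eviction_candidate(uint32_t pte)
{
    return !(pte & PTE_DIRTY) && !(pte & PTE_LOCAL_ACC);
}
}}}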
== 4. DHCCP cache coherence protocol ==

The shared memory TSAR architecture implements the DHCCP protocol (Distributed Hybrid Cache Coherence Protocol). As it is not possible to monitor all simultaneous transactions in a distributed network on chip, the DHCCP protocol is based on the global directory paradigm.

[…]

Finally, the DHCCP protocol is called “hybrid” because it uses a multicast/update policy when the number of copies is lower than a given threshold, and automatically switches to a broadcast/invalidate policy when the number of copies exceeds this threshold.

== 5. Interconnection networks ==

The TSAR architecture requires a hierarchical, two-level interconnect: each cluster must contain a local interconnect, and the communications between clusters rely on a global interconnect.

As described in [CacheCoherence the cache coherence section], the DHCCP protocol defines three classes of transactions that must use three separate interconnection networks: the D_network, used for the direct read/write transactions; the C_network, used for coherence transactions; and the X_network, used to access the external memory in case of a miss in the memory cache.

The DSPIN network on chip (developed by the LIP6 laboratory) implements the D_network and the C_network. It has the required 2D mesh topology, and provides the shared memory TSAR architecture with truly scalable bandwidth. It supports the VCI/OCP standard, and implements a logically “flat” address space. It is well suited to power consumption management, as it relies on the GALS (Globally Asynchronous, Locally Synchronous) approach: both the voltage and the clock frequency can be independently adjusted in each cluster. It provides two fully separated virtual channels for the direct traffic and the coherence traffic, and it provides the broadcast service required by the DHCCP protocol.

[…]

== 6. Atomic operations ==

Any multi-processor architecture must provide hardware support for atomic operations. These “read-then-write” atomic operations are used by the software for synchronization.

[…]

Each processor instruction set defines a different set of atomic instructions. The TSAR architecture implements the LL/SC mechanism, which is natively defined by the MIPS32 and PPC405 processors, and is directly supported by the VCI/OCP standard. Other atomic instructions, such as the SWAP or LDSTUB instructions defined by the SPARC processor, can be emulated using the LL/SC instructions.

With this mechanism, the TSAR architecture allows software developers to use cachable spin-locks.
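As an illustration of such a cachable spin-lock, here is a minimal C sketch. It is not part of the specification: the function names, the lock encoding (0 = free, 1 = held) and the use of GCC __atomic built-ins are assumptions; the built-ins are used only because a MIPS32 compiler lowers them to an LL/SC retry loop.

{{{
#!c
/* Illustrative cachable spin-lock built on LL/SC (names and lock encoding
 * are assumptions, not part of the TSAR specification). On a MIPS32 target,
 * GCC lowers the __atomic built-ins below to LL/SC sequences. */
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t spinlock_t;            /* 0 = free, 1 = held */

static inline void spin_lock(spinlock_t *lock)
{
    for (;;) {
        /* Spin on an ordinary cached read first: the copy stays in the
         * local L1 cache until the holder writes the lock word, so waiting
         * generates no interconnect traffic. */
        while (__atomic_load_n(lock, __ATOMIC_RELAXED) != 0)
            ;
        /* Atomic read-then-write attempt, compiled to LL ... SC on MIPS32. */
        uint32_t expected = 0;
        if (__atomic_compare_exchange_n(lock, &expected, 1, false,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
            return;
    }
}

static inline void spin_unlock(spinlock_t *lock)
{
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
}
}}}

Spinning on an ordinary cached read before attempting the LL/SC (a test-and-test-and-set pattern) keeps waiting processors inside their local L1 caches, which matters in a NUMA architecture where remote traffic costs both time and power.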