21 | | In the TSAR architecture, the memory controller is distributed, as it is implemented by the distributed memory caches |
22 | | (one per cluster). Therefore, the global directory itself is distributed. The memory cache is inclusive:
23 | | a cache line L that is present in at least one L1 cache must be present in the corresponding memory cache
24 | | (in the home cluster). With this property, the Global Directory can be implemented as an extension of the memory cache directory. |
| 20 | In the TSAR architecture, the memory controller is distributed, as it is implemented by the distributed L2 caches (one per cluster). Therefore, the global directory itself is distributed. The L2 cache is inclusive for all L1 caches: |
| 21 | a cache line L that is present in at least one L1 cache must be present in the owner L2 cache. With this property, the Global Directory can be implemented as an extension of the memory cache directory.
26 | | In case of MISS, the memory cache controller must evict a victim line to bring in the missing line. In order to maintain the inclusive property, |
27 | | all copies of the evicted cache line in L1 caches must be invalidated. To do so, the memory cache controller must send
28 | | invalidate requests to all L1 caches containing a copy. |
| 23 | In case of MISS, the L2 cache controller must evict a victim line to bring in the missing line. In order to maintain the inclusive property, all copies of the evicted cache line in L1 caches must be invalidated. To do so, the L2 cache controller must send invalidate requests to all L1 caches containing a copy.
30 | | The TSAR architecture aims to guarantee cache coherence in hardware, for both the data and instruction L1 caches.
31 | | The modifications of shared data are very frequent events, but the number of copies is generally not very high. |
32 | | The modifications of shared code are very rare events (self-modifying code, or dynamic libraries), but the number
33 | | of replicated copies can be very large (the exception handler or the libc are generally replicated in all L1 caches).
34 | | Reflecting the different behaviour of data & instruction caches, the "hybrid" cache coherence protocol DHCCP defines two different strategies, |
35 | | depending on the number of copies : |
36 | | * '''MULTICAST_UPDATE''' : When the number of copies is smaller than the DHCCP threshold, the memory cache controller registers the locations of all the copies, and sends a ''multicast_update'' transaction to each concerned L1 cache in case of modification.
37 | | * '''BROADCAST_INVAL''' : When the number of copies is larger than the DHCCP threshold, the memory cache controller registers only the number of copies (without localization) and sends a ''broadcast_invalidate'' transaction to all L1 caches in case of modification.
| 25 | The TSAR architecture aims to guarantee cache coherence in hardware, for both the data and instruction L1 caches. Modifications of shared data are very frequent events, but the number of copies is generally not very high.
| 26 | Modifications of shared code are very rare events (self-modifying code, or dynamic libraries), but the number of replicated copies can be very large (the exception handler or the libc are generally replicated in all L1 caches).
| 27 | Reflecting the different behaviour of data and instruction caches, the "hybrid" cache coherence protocol DHCCP defines two different strategies, depending on the number of copies :
| 28 | * '''MULTICAST_UPDATE''' : When the number of copies is smaller than the DHCCP threshold, the L2 cache controller registers the locations of all the copies, and can send a dedicated ''update(L)'' request to each relevant L1 cache in case of modification of L. |
| 29 | * '''BROADCAST_INVAL''' : When the number of copies is larger than the DHCCP threshold, the memory cache controller registers only the number of copies (without localization) and broadcasts an ''inval'' request to all L1 caches in case of modification of L.
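
The choice between the two strategies can be sketched as follows. This is a minimal illustration and not the TSAR hardware: the function name `select_strategy` and the threshold value in the usage lines are hypothetical, and a copy count exactly equal to the threshold is assumed to fall under MULTICAST_UPDATE, since the text above only covers "smaller than" and "larger than".

```python
def select_strategy(copy_count: int, threshold: int) -> str:
    """Return the DHCCP strategy name for a line with `copy_count` L1 copies.

    Assumption: a count equal to the threshold is treated as "not exceeding"
    it, so it still uses MULTICAST_UPDATE.
    """
    if copy_count < 0 or threshold < 0:
        raise ValueError("copy_count and threshold must be non-negative")
    if copy_count <= threshold:
        # Few copies: the directory knows each location and can update them.
        return "MULTICAST_UPDATE"
    # Many copies: only the count is kept, so every L1 cache must be notified.
    return "BROADCAST_INVAL"


# Hypothetical threshold of 4 copies, chosen only for illustration.
print(select_strategy(2, 4))
print(select_strategy(9, 4))
```

Shared data with two or three readers would typically stay on the update path, while widely replicated code such as the libc would fall on the broadcast path.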
41 | | Three types of transactions have been identified :
42 | | * Direct transactions : READ / WRITE / LL / SC / CAS |
43 | | * Coherence transactions : MULTI_UPDATE / MULTI_INVAL / BROADCAST_INVAL / CLEANUP |
44 | | * External transactions : PUT / GET |
| 33 | Nine types of transactions have been identified, which can be split into two classes:
| 34 | * 5 Direct transactions : READ / WRITE / LL / SC / CAS |
| 35 | * 4 Coherence transactions : MULTI_UPDATE / MULTI_INVAL / BROADCAST_INVAL / CLEANUP |
46 | | For deadlock prevention, these three types of transaction must be transported on three (virtually or physically) separated networks.
47 | | |
48 | | As a general rule, all these transactions respect the VCI advanced packet format, and there is one response packet for each command packet : |
49 | | For a burst transaction, a READ command packet contains one single flit, and the corresponding READ response packet contains N flits. |
50 | | Symmetrically, a WRITE command packet contains N flits, and the corresponding WRITE response packet contains one single flit. |
51 | | |
52 | | There is one exception : For a BROADCAST_INVAL transaction, the initiator sends one single flit VCI packet, |
53 | | but receives several single flit VCI response packets. |
| 37 | For deadlock prevention, the transactions must be transported on three (virtually or physically) separated networks.
57 | | These transactions are initiated by a processor (actually the L1 cache controller), or by another initiator |
58 | | (an I/O peripheral or hardware coprocessor with a DMA capability). This initiator can be located in any cluster. For those transactions, |
59 | | the target is a memory cache controller, acting as a physical memory bank, or another VCI target peripheral. This target can be located in any cluster. |
| 41 | These transactions are always initiated by the L1 cache controller, which can be located in any cluster. The target is an L2 cache controller, acting as a physical memory bank, which can be located in any cluster.
61 | | The L1 cache controller can issue several simultaneous VCI transactions, that must be distinguished by the VCI TRDID and PKTID values. |
| 43 | All direct transactions require two packets: one ''command'' packet (from L1 to L2), and one ''response'' packet (from L2 to L1). |
| 44 | |
| 45 | To avoid deadlocks, the direct transactions require two separate physical networks
| 46 | for commands and responses.
| 47 | |
| 48 | For all direct transactions, the packets (commands and responses) respect the VCI format.
| 49 | As the L1 cache controller can issue several simultaneous direct transactions, they are distinguished by the VCI TRDID and PKTID values.
75 | | For each cache line stored in the memory cache, the memory cache implements a Registration Table that contains the copies replicated in the L1 caches. Each entry in this Registration Table contains the SRCID of the L1 cache that contains a copy, as well as the type of the copy (instruction/data). When the same cache line is replicated in both the instruction cache and the data cache of a processor, this defines two separated entries in the Registration Table. When the number of copies for a given cache line L exceeds the DHCCP threshold, the corresponding Registration Table is flushed, and the memory cache registers only the number of copies.
| 63 | For each cache line stored in the L2 cache, the L2 cache implements a linked list of copies replicated in the L1 caches. Each entry in this list contains the SRCID of the L1 cache that contains a copy, as well as the type of the copy (instruction/data). If the same cache line is replicated in both the instruction cache and the data cache of a given core, this defines two separated entries in the list. When the number of copies for a given cache line L exceeds the DHCCP threshold, the corresponding list of copies is flushed, and the L2 cache registers only the number of copies.
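
The per-line bookkeeping described above can be sketched as a small data structure. This is an illustrative model under stated assumptions, not the actual L2 cache implementation: the class name `DirectoryEntry`, its fields, and the use of a Python list in place of the hardware linked list are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DirectoryEntry:
    """Hypothetical directory state for one cache line L in the L2 cache."""
    threshold: int                                   # DHCCP threshold (assumed per entry here)
    copies: List[Tuple[int, str]] = field(default_factory=list)  # (SRCID, copy type)
    count: int = 0                                   # number of registered copies
    counter_mode: bool = False                       # True once the list has been flushed

    def register(self, srcid: int, kind: str) -> None:
        """Register a new L1 copy; `kind` is 'instruction' or 'data'."""
        if kind not in ("instruction", "data"):
            raise ValueError("kind must be 'instruction' or 'data'")
        if not self.counter_mode and (srcid, kind) in self.copies:
            return  # this copy is already registered
        self.count += 1
        if self.counter_mode:
            return  # locations are no longer tracked, only the count
        self.copies.append((srcid, kind))
        if self.count > self.threshold:
            # Threshold exceeded: flush the list and keep only the count.
            self.copies.clear()
            self.counter_mode = True


entry = DirectoryEntry(threshold=2)
entry.register(3, "data")
entry.register(3, "instruction")   # same core, separate entry
entry.register(5, "data")          # third copy exceeds the threshold of 2
print(entry.counter_mode, entry.count, entry.copies)
```

Note that once the entry is in counter mode, the sketch can no longer detect a duplicate registration, because the locations have been discarded.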
77 | | The coherence transactions use a logically separated ''coherence network'', implementing a separated address space. |
78 | | All these transactions are write transactions. |
| 65 | A coherence transaction can be initiated by the L1 cache or by the L2 cache. |
| 66 | Depending on the transaction type, a coherence transaction can require two or three packets. |
| 67 | |
| 68 | * A '''CLEANUP''' transaction is initiated by the L1 cache when it must evict a line L for replacement, to signal to the owner L2 cache that it no longer contains a copy of L. This transaction requires two packet types:
| 69 | 1. The L1 cache sends a ''cleanup(L)'' packet to the owner L2 cache.
| 70 | 1. The L2 cache returns a ''clack(L)'' packet to signal that its list of copies for L has been updated.
| 71 | For the L1 cache, the '''CLEANUP''' transaction is completed when the L1 cache receives the ''clack'' packet.
| 72 | |
| 73 | * A '''MULTI_UPDATE''' transaction is a multi-cast transaction initiated by the L2 cache when it receives a WRITE request to a replicated cache line, and the number of copies does not exceed the DHCCP threshold. This transaction requires two packet types:
| 74 | 1. The L2 cache sends as many ''update(L,DATA)'' packets as the number of registered copies (excluding the writer).
| 75 | 1. Each L1 cache returns an ''update_ack(L)'' packet to the L2 cache to signal that the local copy has been updated.
| 76 | For the L2 cache, the '''MULTI_UPDATE''' transaction is completed when the L2 cache has received all expected ''update_ack'' packets.
| 77 | |
| 78 | * A '''MULTI_INVAL''' transaction is a multi-cast transaction, initiated by the L2 cache, when it must evict a given line L, and the number of copies does not exceed the DHCCP threshold. To keep the inclusion property, all copies in L1 caches must be invalidated. This transaction requires three types of packets:
| 79 | 1. The L2 cache sends one ''inval(L)'' packet to each registered L1 cache.
| 80 | 1. Each L1 cache sends a ''cleanup(L)'' packet to the L2 cache to signal that the local copy has been invalidated.
| 81 | 1. The L2 cache returns to each L1 cache a ''clack(L)'' packet to signal that its list of copies for L has been updated.
| 82 | For the L2 cache, the '''MULTI_INVAL''' transaction is completed when the last ''cleanup'' |
| 83 | packet has been received. |
80 | | * A '''MULTI_UPDATE''' transaction is a multi-cast transaction sent by the memory cache controller when it receives a WRITE request to a replicated cache line and the number of copies does not exceed the DHCCP threshold. It sends as many VCI transactions as the number of registered copies (excluding the writer). The VCI command packet contains (N+2) flits. The VCI ADDRESS field is constant and contains the address of the memory mapped UPDATE register in the L1 cache. The VCI CMD field contains the VCI_WRITE code. As the memory cache controller can handle several simultaneous update/invalidate transactions, the VCI PKTID field contains the transaction index. The VCI PLEN field contains the value 4*N, where N is the actual number of modified words in the cache line. The line index (34 bits) is transported in the VCI WDATA and VCI BE fields (the two LSB bits), of the first flit. The first modified word index (3 bits) is transported in the WDATA field of the second flit, and the N modified words in the WDATA and BE fields of the N following flits. For each modified word, the VCI BE field can have a different value (including the 0x0 value). The VCI response packet contains one single flit. The memory cache controller counts the number of VCI responses to detect the completion of the MULTI_UPDATE transaction.
| 85 | * A '''BROADCAST_INVAL''' transaction is a broadcast transaction initiated by an L2 cache when a line L has been modified by a WRITE, or when the line L must be evicted for replacement, and the number of copies exceeds the DHCCP threshold. This transaction requires three types of packets:
| 86 | 1. The L2 cache sends a ''bc_inval(L)'' broadcast packet to all L1 cache controllers.
| 87 | 1. Each L1 cache that contains a copy of L sends a ''cleanup(L)'' packet to the L2 cache to signal that the local copy has been invalidated.
| 88 | 1. The L2 cache returns to each L1 cache that made a cleanup, a ''clack(L)'' packet to signal that its list of copies for L has been updated.
| 89 | The L2 cache simply decrements its counter of copies for each received ''cleanup'', and the '''BROADCAST_INVAL''' transaction is completed when the last ''cleanup'' packet has been received.
101 | | All the external transactions use a separated ''external network'', implementing a separated address space. The memory cache and |
102 | | the external RAM controller ports used to access the external network respect a simplified version of the VCI advanced format : |
103 | | the VCI fields PLEN, PKTID, CONST, CONTIG and BE are not used. The VCI ADDRESS field contains 30 bits (a 64-byte cache line index).
104 | | The VCI WDATA & RDATA fields contain 64 bits, in order to improve the bandwidth. The VCI SRCID field contains the memory cache index (cluster index). |
105 | | As the memory cache controller can process several external transactions simultaneously, the VCI TRDID field contains the transaction index. |
| 101 | All these L2/L3 transactions use a separated ''external network'', implementing a separated address space. The memory cache and the external RAM controller ports used to access the external network respect a simplified version of the VCI advanced format : |
| 102 | the VCI fields PLEN, PKTID, CONST, CONTIG and BE are not used. The VCI ADDRESS field contains 30 bits (a 64-byte cache line index). The VCI WDATA and RDATA fields contain 64 bits, in order to improve the bandwidth. The VCI SRCID field contains the memory cache index (cluster index).
| 103 | As the L2 cache controller can process several external transactions simultaneously, the VCI TRDID field contains the transaction index. |