Changes between Version 20 and Version 21 of CacheCoherence


Timestamp:
Nov 17, 2019, 8:31:12 PM (5 years ago)
Author:
alain
Comment:

--

  • CacheCoherence

    v20 v21  
    1010This Global Directory stores the status of each cache line replicated in at least one L1 cache of the TSAR architecture.
    1111
    12 The main goal being the protocol scalability, the L1 caches implement a WRITE-THROUGH policy. The coherence protocol
    13 is much simpler than the MESI protocol used in most architectures implementing a WRITE_BACK policy.
     12Because the main goal is protocol scalability, the L1 caches implement a WRITE-THROUGH policy. The coherence protocol is much simpler than the MESI or MSI protocols used in most architectures implementing a WRITE_BACK policy.
    1413With a WRITE-THROUGH policy, the main memory always contains the most recent value of a cache line,
    15 and there is NO exclusive ownership state for a L1 cache.
     14and there is NO exclusive ownership state for a cache line.
    1615
    1716The basic mechanism is the following: when the memory controller receives a WRITE request for a given cache line,
    1817it must send an UPDATE or INVAL request to all L1 caches containing a copy (except the writer).
    19 The write request is acknowledged only when all UPDATE or INVAL transactions are completed.
     18The write request is acknowledged to the writer only when all UPDATE or INVAL transactions are completed.
    2019
    21 In the TSAR architecture, the memory controller is distributed, as it is implemented by the distributed memory caches
    22 (one per cluster). Therefore, the global directory itself is distributed.  The memory cache being inclusive:
    23 a cache line L that is present in at least one L1 cache must be present in the corresponding memory cache cache
    24 (in the home cluster). With this property, the Global Directory can be implemented as an extension of the memory cache directory.
     20In the TSAR architecture, the memory controller is distributed, as it is implemented by the distributed L2 caches (one per cluster). Therefore, the global directory itself is distributed. The L2 cache is inclusive for all L1 caches:
     21a cache line L that is present in at least one L1 cache must be present in the owner L2 cache. With this property, the Global Directory can be implemented as an extension of the L2 cache directory.
    2522
    26 In case of MISS, the memory cache controller must evict a victim line to bring in the missing line. In order to maintain the inclusive property,
    27 all copies of the evicted cache line in L1 caches must be invalidated. To do it, the memory cache controller must send
    28 invalidate requests to all L1 caches containing a copy.
     23In case of a MISS, the L2 cache controller must evict a victim line to bring in the missing line. To maintain the inclusion property, all copies of the evicted cache line in L1 caches must be invalidated. To do so, the L2 cache controller must send invalidate requests to all L1 caches containing a copy.
    2924
    30 The TSAR architecture wants to guaranty the cache coherence by hardware, for both the data and instruction L1 caches.
    31 The modifications of shared data are very frequent events, but the number of copies is generally not very high.
    32 The modifications of shared code are very rare events (self modifying code, or dynamic libraries), but the number
    33 of replicated copies can be very large ( the exception handler, or the libc are generally replicated in all L1 caches ).
    34 Reflecting the different behaviour of data & instruction caches, the "hybrid" cache coherence protocol DHCCP defines two different strategies,
    35 depending on the number of copies :
    36  * '''MULTICAST_UPDATE''' :  When the number of copies is smaller than the DHCCP threshold, the memory cache controller registers the locations of all the copies, and send a ''multicast_update'' transaction to each concerned L1 cache in case of modification.
    37  * '''BROADCAST_INVAL''' :  When the number of copies is larger than the DHCCP threshold, the memory cache controller registers only the number of copies (without localization) and send a ''broadcast_invalidate'' transaction  to all L1 caches in case of modication. 
     25The TSAR architecture aims to guarantee cache coherence in hardware, for both the data and instruction L1 caches. Modifications of shared data are very frequent events, but the number of copies is generally not very high.
     26Modifications of shared code are very rare events (self-modifying code, or dynamic libraries), but the number of replicated copies can be very large (the exception handler or the libc are generally replicated in all L1 caches).
     27Reflecting the different behaviour of data and instruction caches, the "hybrid" cache coherence protocol DHCCP defines two different strategies, depending on the number of copies:
     28 * '''MULTICAST_UPDATE''' :  When the number of copies is smaller than the DHCCP threshold, the L2 cache controller registers the locations of all the copies, and can send a dedicated ''update(L)'' request to each relevant L1 cache in case of modification of L.
     29 * '''BROADCAST_INVAL''' :  When the number of copies is larger than the DHCCP threshold, the L2 cache controller registers only the number of copies (without their locations) and broadcasts an ''inval'' request to all L1 caches in case of modification of L.
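
The choice between the two strategies can be sketched as follows. This is only an illustration under stated assumptions: the class `DirectoryEntry`, its methods, and the value of `DHCCP_THRESHOLD` are hypothetical names chosen for the example, and the text does not say how a copy count exactly equal to the threshold is handled (the sketch treats it as still being in multicast mode).

```python
from dataclasses import dataclass, field

DHCCP_THRESHOLD = 4  # hypothetical value; the text does not give one


@dataclass
class DirectoryEntry:
    """Per-line directory state kept by the L2 cache (illustrative only)."""
    copies: set = field(default_factory=set)  # (srcid, is_instruction) pairs
    count: int = 0                            # number of registered L1 copies
    counter_mode: bool = False                # True once locations are dropped

    def register_copy(self, srcid, is_instruction=False):
        if self.counter_mode:
            # Locations are no longer tracked: only count the copy.
            self.count += 1
            return
        key = (srcid, is_instruction)
        if key in self.copies:
            return  # already registered
        self.copies.add(key)
        self.count += 1
        if self.count > DHCCP_THRESHOLD:
            # Too many copies: keep the count, forget the locations.
            self.copies.clear()
            self.counter_mode = True

    def on_write(self, writer_srcid):
        """Return the coherence action for a WRITE to this line."""
        if self.counter_mode:
            return ("BROADCAST_INVAL", None)
        # Update every registered copy except the ones held by the writer.
        targets = sorted(c for c in self.copies if c[0] != writer_srcid)
        return ("MULTICAST_UPDATE", targets)
```

With three registered data copies, a write from one of them yields a multicast update to the two others; once a fifth copy is registered (with the assumed threshold of 4), the entry keeps only the counter and any write triggers a broadcast invalidate.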
    3830
    39 == 2.  Types of transaction ==
     31== 2.  Transactions between L1 and L2 caches ==
    4032
    41 Three types of transactions, have been identified :
    42  * Direct transactions : READ / WRITE / LL / SC / CAS
    43  * Coherence transactions : MULTI_UPDATE / MULTI_INVAL / BROADCAST_INVAL / CLEANUP
    44  * External transactions : PUT / GET
     33Nine types of transactions have been identified, split into two classes:
     34 * 5 Direct transactions : READ / WRITE / LL / SC / CAS
     35 * 4 Coherence transactions : MULTI_UPDATE / MULTI_INVAL / BROADCAST_INVAL / CLEANUP
    4536       
    46 For dead-lock prevention, these three types of transaction must be transported on three (virtually or physically) separated networks.
    47 
    48 As a general rule, all these transactions respect the VCI advanced packet format, and there is one response packet for each command packet :
    49 For a burst transaction, a READ command packet contains one single flit, and the corresponding READ response packet contains N flits.
    50 Symmetrically, a WRITE command packet contains N flits, and the corresponding WRITE response packet contains one single flit.
    51 
    52 There is one exception : For a BROADCAST_INVAL transaction, the initiator sends one single flit VCI packet,
    53 but receives several single flit VCI response packets.
     37For deadlock prevention, these transactions must be transported on three (virtually or physically) separated networks.
    5438 
    5539=== 2.1  Direct transactions ===
    5640 
    57 These transactions are initiated by a processor (actually the L1 cache controller), or by another initiator
    58 (an I/O peripheral or hardware coprocessor with a DMA capability). This initiator can be located in any cluster. For those transactions,
    59 the target is a memory cache controller, acting as a physical memory bank, or another VCI target peripheral. This target can be located in any cluster.
     41These transactions are always initiated by the L1 cache controller, which can be located in any cluster. The target is an L2 cache controller, acting as a physical memory bank, which can also be located in any cluster.
    6042
    61 The L1 cache controller can issue several simultaneous VCI transactions, that must be distinguished by the VCI TRDID and PKTID values.
     43All direct transactions require two packets: one ''command'' packet (from L1 to L2), and one ''response'' packet (from L2 to L1).
     44
     45To avoid deadlocks, the direct transactions require two separated physical networks
     46for commands and responses.
     47
     48For all direct transactions, the packets (commands and responses) respect the VCI format.
     49The L1 cache controller can issue several simultaneous direct transactions, which are distinguished by the VCI TRDID and PKTID values.
    6250
    6351 * A '''READ''' transaction can have four sub-types: it can be instruction or data, and it can be cacheable or uncacheable. In case of a burst transaction, the burst must be contained within a 16-word cache line. This constraint applies to both the L1 cache controllers and the I/O controllers with a DMA capability. For all READ transactions, the VCI command packet contains one single VCI flit, and the VCI response packet contains at most 16 flits.
     
    7361=== 2.2 Coherence transactions ===
    7462
    75 For each cache line stored in the memory cache, the memory cache implements a Registration Table that contain the copies replicated in the L1 caches. Each entry in this Registration Table contains the SRCID of the L1 cache that contains a copy, as well as the type of the copy (instruction/data). When the same cache line is replicated in both the instruction cache and the data cache of a processor, this defines two separated entries in the Registration Table. When the number copies for a given cache line L exceeds the DHCCP threshold, the corresponding Registration Table is flushed, and the memory cache registers only the number of copies.
     63For each cache line stored in the L2 cache, the L2 cache implements a linked list of copies replicated in the L1 caches. Each entry in this list contains the SRCID of the L1 cache that contains a copy, as well as the type of the copy (instruction/data). If the same cache line is replicated in both the instruction cache and the data cache of a given core, this defines two separate entries in the list. When the number of copies for a given cache line L exceeds the DHCCP threshold, the corresponding list of copies is flushed, and the L2 cache registers only the number of copies.
    7664
    77 The coherence transactions use a logically separated ''coherence network'', implementing a separated address space.
    78 All these transactions are write transactions.
     65A coherence transaction can be initiated by the L1 cache or by the L2 cache.
     66Depending on the transaction type, a coherence transaction can require two or three packets.
     67
     68 * A '''CLEANUP''' transaction is initiated by the L1 cache when it must evict a line L for replacement, to signal to the owner L2 cache that it no longer contains a copy of L. This transaction requires two packet types:
     69   1. The L1 cache sends a ''cleanup(L)'' packet to the owner L2 cache.
     70   1. The L2 cache returns a ''clack(L)'' packet to signal that its list of copies for L has been updated.
     71For the L1 cache, the '''CLEANUP''' transaction is completed when the L1 cache receives the ''clack'' packet. 
     72
     73 * A '''MULTI_UPDATE''' transaction is a multi-cast transaction initiated by the L2 cache when it receives a WRITE request to a replicated cache line, and the number of copies does not exceed the DHCCP threshold. This transaction requires two packet types:
     74   1. The L2 cache sends as many ''update(L,DATA)'' packets as the number of registered copies (except the writer).
     75   1. Each L1 cache returns an ''update_ack(L)'' packet to the L2 cache to signal that the local copy has been updated.
     76For the L2 cache, the '''MULTI_UPDATE''' transaction is completed when the L2 cache has received all expected ''update_ack'' packets.
     77
     78 * A '''MULTI_INVAL''' transaction is a multi-cast transaction, initiated by the L2 cache, when it must evict a given line L, and the number of copies does not exceed the DHCCP threshold. To keep the inclusion property, all copies in L1 caches must be invalidated. This transaction requires three types of packets:
     79   1. The L2 cache sends one ''inval(L)'' packet to each registered L1 cache.
     80   1. Each L1 cache sends a ''cleanup(L)'' packet to the L2 cache to signal that the local copy has been invalidated.
     81   1. The L2 cache returns to each L1 cache a ''clack(L)'' packet to signal that its list of copies for L has been updated.
     82For the L2 cache, the '''MULTI_INVAL''' transaction is completed when the last ''cleanup''
     83packet has been received.
    7984 
    80  * A '''MULTI_UPDATE''' transaction is a multi-cast transaction sent by the memory cache controller when it receives a WRITE request to a replicated cache line and the number of copies does not exceeds the DHCCP threshold. It sends as many VCI transactions as the number of registered copies (but the writer). The VCI command packet contains (N+2) flits. The VCI ADDRESS field is constant and contains the address of the memory mapped UPDATE register in the L1 cache. The VCI CMD field contains the VCI_WRITE code. As the memory cache controller can handle several simultaneous update/invalidate transactions, the VCI PKTID field contains the transaction index. The VCI PLEN field contains the value  4*N, where N is the actual number of modified words in the cache line. The line index (34 bits) is transported in the VCI WDATA and VCI BE fields (the two LSB bits), of the first flit. The first modified word index (3 bits) is transported in the WDATA field of the second flit, and the N modified words in the WDATA and BE fields of the N following flits. For each modified word, the VCI BE field can have a different value (including the 0x0 value). The VCI response packet contains one single flit. The memory cache controller counts the number of VCI responses to detect the completion of the MULTICAST_UPDATE transaction.
     85 * A '''BROADCAST_INVAL''' transaction is a broadcast transaction initiated by an L2 cache when a line L has been modified by a WRITE, or when the line L must be evicted for replacement, and the number of copies exceeds the DHCCP threshold. This transaction requires three types of packets:
     86   1. The L2 cache sends a ''bc_inval(L)'' broadcast packet to all L1 cache controllers.
     87   1. Each L1 cache that contains a copy of L sends a ''cleanup(L)'' packet to the L2 cache to signal that the local copy has been invalidated.
     88   1. The L2 cache returns a ''clack(L)'' packet to each L1 cache that made a cleanup, to signal that its counter of copies for L has been updated.
     89The L2 cache simply decrements the counter of copies for each received ''cleanup'', and the '''BROADCAST_INVAL''' transaction is completed when the last ''cleanup'' packet has been received.
    8190
    82  * A '''MULTI_INVAL''' transaction is a multi-cast transaction, that is composed of several VCI transactions. When a memory cache makes a cache line replacement (following a MISS in the memory cache), and the victim line has a number of copies smaller than the DHCCP threshold, it sends as many VCI transactions as the number of registered copies. Both the VCI command packet and the VCI response packet contain only one flit. The VCI ADDRESS field contains the address of the memory mapped INVAL register in the L1 cache. The VCI CMD field contains the VCI_WRITE code. As the memory cache controller can handle several update/invalidate transactions simultaneously, the VCI PKTID field contains the transaction index.The VCI WDATA and VCI BE (the two LSB bits) fields contain the 34 bits line index. The memory cache controller counts the number of VCI responses to detect the completion of the  MULTI_INVAL transaction.
     91As the '''MULTI_INVAL''' and '''BROADCAST_INVAL''' transactions require three packets, the coherence transactions require three separated physical networks.
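
The completion rule shared by '''MULTI_INVAL''' and '''BROADCAST_INVAL''' (count the ''cleanup'' packets, answer each with a ''clack'') can be sketched as follows. This is a minimal illustration, not the hardware implementation: the class `InvalTracker` and its fields are hypothetical names, and sending a ''clack(L)'' packet is modelled by appending the destination SRCID to a list.

```python
class InvalTracker:
    """Tracks one pending MULTI_INVAL or BROADCAST_INVAL for a line L."""

    def __init__(self, line, copy_count):
        self.line = line
        self.pending = copy_count  # cleanup(L) packets still expected
        self.clacks_sent = []      # SRCIDs that were answered with clack(L)

    def on_cleanup(self, srcid):
        """Handle cleanup(L) from an L1 cache; return True when complete."""
        if self.pending == 0:
            raise RuntimeError("unexpected cleanup for line %#x" % self.line)
        self.pending -= 1
        self.clacks_sent.append(srcid)  # stands in for sending clack(L)
        return self.pending == 0
```

For a line with three copies, the first two calls to `on_cleanup` return `False` and the third returns `True`, at which point the L2 cache can consider the invalidation complete.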
    8393
    84  * A '''BROADCAST_INVAL''' transaction is a broadcast transaction. This transaction is initiated when a memory cache controller replaces a line, or receives a WRITE request to a replicated cache line, and this cache line has a number of copies larger than the DHCCP threshold. The VCI command packet contains one single flit. This packet is replicated and dynamically broadcasted by the network itself. The VCI CMD field contains the VCI_WRITE code. The VCI ADDRESS field contains the global broadcast address 0x0000000003 (only the two LSB bits are set). The VCI WDATA and the VCI BE (the two LSB bits) field contain the line index. This VCI command is broadcasted to all L1 caches in the system, but only L1 caches that have a copy send a VCI response packet. All VCI response packets are independently returned to the memory cache initiator, that counts the number of VCI responses to detect the completion of the BROADCAST_INVAL transaction. If a L1 cache contains two copies of a cache line (i.e. the line is replicated in both the DATA cache, and the INSTRUCTION cache), it must send two VCI responses.
    8594
    86 The following table defines the coherence command encoding (4 LSB bits in the VCI ADDRESS field)
    87 || COMMAND TYPE           ||    ||
    88 || Invalidate Data        ||0000||
    89 || Invalidate Instruction ||0100||
    90 || Update Data            ||1000||
    91 || Update Instruction     ||1100||
    9295
    93  * A '''CLEANUP''' transaction is initiated by a L1 cache controller to a memory cache controller, to signal that a cache line copy has been removed from an instruction or data cache. Both the VCI command packet and the VCI response packet contain one single flit. For a CLEANUP transaction, the VCI ADDRESS field must contain the removed cache line address. The VCI TRDID fiels contains the value 0 for a data cache cleanup, and contains the value 1 for an instruction cache cleanup.
     96== 3.   Transactions between L2 and L3 caches ==
    9497
    95 === 2.3 External transactions ===
    96 
    97 These transactions are initiated by the memory caches, to fetch or save a complete cache line in case of MISS in the memory cache.
    98 The general policy between the memory caches and the external memory is WRITE_BACK : The external memory is only updated
     98These transactions are initiated by the L2 caches, to fetch a complete cache line from the L3 cache or save one to it. The general policy between the memory caches and the external memory is WRITE_BACK : The external memory is only updated
    9999in case of line replacement. The target is always the external RAM controller.
    100100
    101 All the external transactions use a separated ''external network'', implementing a separated address space. The memory cache and
    102 the external RAM controller ports used to access the external network respect a simplified version of the VCI advanced format :
    103 the VCI fields PLEN, PKTID, CONST, CONTIG and BE are not used. The VCI ADDRESS field contains 30 bits (a 64 bytes cache line index). 
    104 The VCI WDATA & RDATA fields contain 64 bits, in order to improve the bandwidth. The VCI SRCID field contains the memory cache index (cluster index).
    105 As the memory cache controller can process several external transactions simultaneously, the VCI TRDID field contains the transaction index.
     101All these L2/L3 transactions use a separated ''external network'', implementing a separated address space. The memory cache and the external RAM controller ports used to access the external network respect a simplified version of the VCI advanced format :
     102the VCI fields PLEN, PKTID, CONST, CONTIG and BE are not used. The VCI ADDRESS field contains 30 bits (a 64 bytes cache line index). The VCI WDATA & RDATA fields contain 64 bits, in order to improve the bandwidth. The VCI SRCID field contains the memory cache index (cluster index).
     103As the L2 cache controller can process several external transactions simultaneously, the VCI TRDID field contains the transaction index.
    106104
    107  * For a '''GET''' transaction, the VCI command packet contains one single flit. The VCI CMD field contains the READ value. The VCI response packet contains 8 flits (corresponding to the 64 bytes of a cache line).
     105 * The L2 cache makes a '''GET''' transaction to the L3, to handle a L2 miss. The VCI command packet contains one single flit. The VCI CMD field contains the READ value. The VCI response packet contains 8 flits (corresponding to the 64 bytes of a cache line).
    108106
    109  * For a '''PUT''' transaction, the VCI command packet contains 8 flits. The VCI CMD field contains the WRITE value. The VCI response packet contains 1 flit.
     107 * The L2 cache makes a '''PUT''' transaction to the L3, to handle the replacement of a dirty line. The VCI command packet contains 8 flits. The VCI CMD field contains the WRITE value. The VCI response packet contains one single flit.
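
The flit counts above follow directly from the line size and the data width. A small sketch of the arithmetic, where the function name `flit_counts` and the constant names are assumptions made for the example:

```python
LINE_BYTES = 64  # cache line size stated in the text
DATA_BYTES = 8   # 64-bit VCI WDATA / RDATA fields


def flit_counts(transaction):
    """Return (command_flits, response_flits) for a GET or PUT."""
    data_flits = LINE_BYTES // DATA_BYTES  # flits needed to carry one line
    if transaction == "GET":
        return (1, data_flits)   # short command, full line in the response
    if transaction == "PUT":
        return (data_flits, 1)   # full line in the command, short response
    raise ValueError("unknown transaction: %r" % transaction)
```

Under these values a GET gives (1, 8) and a PUT gives (8, 1), matching the packet sizes described above.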
    110108
    111109