
Version 63 (modified by alain, 5 years ago)


Data replication & distribution policy

alain.greiner@…

The replication / distribution policy of data on the physical memory banks has two goals: enforcing locality (as much as possible), and avoiding contention (the main goal).

The data to be placed are the virtual segments defined - at compilation time - in the virtual space of the various user processes currently running, or in the virtual space of the operating system itself.

1. General principles

To actually control the placement of all these segments on the physical memory banks, the kernel uses the paged virtual memory MMU to map a virtual segment to a given physical memory bank in a given cluster.

A vseg is a contiguous memory zone in the process virtual space, defined by a (base, size) pair of values. All addresses in this interval can be accessed without segmentation violation: if the corresponding page is not mapped, the page fault will be handled by the kernel, and a physical page will be dynamically allocated (and initialized if required). A vseg always occupies an integer number of pages, as a given page cannot be shared by two different vsegs.
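
The definition above can be sketched in C. This is an illustrative toy, not the actual almos-mkh descriptor from kernel/mm/vseg.h: the field names, the PAGE_SIZE value, and the helper functions are assumptions for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SIZE 4096  /* assumed page size, for illustration only */

/* simplified vseg descriptor: illustrative, not the almos-mkh definition */
typedef struct vseg_s
{
    uint64_t base;   /* virtual base address (page aligned)        */
    uint64_t size;   /* zone size in bytes (multiple of PAGE_SIZE) */
} vseg_t;

/* round an arbitrary (base, size) request to full pages, since a given
 * page cannot be shared by two different vsegs */
static void vseg_align( uint64_t req_base, uint64_t req_size, vseg_t * v )
{
    uint64_t end = (req_base + req_size + PAGE_SIZE - 1) & ~(uint64_t)(PAGE_SIZE - 1);
    v->base = req_base & ~(uint64_t)(PAGE_SIZE - 1);
    v->size = end - v->base;
}

/* true when <addr> can be accessed without segmentation violation */
static bool vseg_contains( const vseg_t * v, uint64_t addr )
{
    return (addr >= v->base) && (addr < v->base + v->size);
}
```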

In all UNIX systems (including almos-mkh), a vseg has some specific attributes defining access rights (readable, writable, executable, cacheable, etc). But in almos-mkh, the vseg type also defines the replication and distribution policy:

  • A vseg is public when it can be accessed by any thread T of the involved process, whatever the cluster running T. It is private when it can only be accessed by the threads running in the cluster containing the physical memory bank where this vseg is defined and mapped.
  • For a public vseg, ALMOS-MKH implements a global mapping : In all clusters, a given virtual address is mapped to the same physical address. For a private vseg, ALMOS-MKH implements a local mapping : the same virtual address can be mapped to different physical addresses, in different clusters.
  • A public vseg can be localized (all vseg pages are mapped in the same cluster), or distributed (different pages are mapped on different clusters). A private vseg is always localized.
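
The attributes above could be encoded as flags. This is a hedged sketch: the flag names and values are assumptions for illustration, not the actual almos-mkh encodings from kernel/mm/vseg.h.

```c
#include <stdint.h>
#include <stdbool.h>

/* illustrative attribute flags; names and values are assumptions */
#define VSEG_READ     0x01   /* readable                           */
#define VSEG_WRITE    0x02   /* writable                           */
#define VSEG_EXEC     0x04   /* executable                         */
#define VSEG_CACHE    0x08   /* cacheable                          */
#define VSEG_PUBLIC   0x10   /* accessible from any cluster        */
#define VSEG_DISTRIB  0x20   /* pages spread over several clusters */

/* encodes the rule above: a private vseg is always localized,
 * i.e. distribution implies a public vseg */
static bool vseg_flags_valid( uint32_t flags )
{
    return !(flags & VSEG_DISTRIB) || (flags & VSEG_PUBLIC);
}
```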

The vseg structure and API are defined in the almos-mkh/kernel/mm/vseg.h and almos-mkh/kernel/mm/vseg.c files.

In all UNIX systems, the process descriptor contains the table used by the MMU to make the virtual to physical address translation. An important feature of almos-mkh is the following: to avoid contention in parallel applications creating a large number of threads in one single process P, almos-mkh replicates the process descriptor in all clusters containing at least one thread of this process. These clusters are called active clusters.

In almos-mkh, the structure used by the MMU for address translation is called VMM (Virtual Memory Manager). For a process P in cluster K, the VMM(P,K) structure contains two main sub-structures:

  • The VSL(P,K) is the list of virtual segments registered for process P in cluster K,
  • The GPT(P,K) is the generic page table, defining the actual physical mapping for each page of each vseg.

For a given process P, the different VMM(P,K) in different clusters can have different contents for several reasons :

  1. A private vseg can be registered in only one VSL(P,K) in cluster K, and be totally undefined in the other VSL(P,K').
  2. A public vseg can be replicated in several VSL(P,K), but the registration of a vseg in a given VSL(P,K) is done on demand: the vseg is only registered in VSL(P,K) when a thread of process P running in cluster K tries to access this vseg.
  3. Similarly, the mapping of a given virtual page VPN of a given vseg (i.e. the allocation of a physical page PPN to a virtual page VPN, and the registration of this PPN in the GPT(P,K)) is done on demand: the page table entry will be updated in the GPT(P,K) only when a thread of process P in cluster K tries to access this VPN.
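
The on-demand mapping of rule 3 can be simulated with a toy page table. The structures below are illustrative assumptions, not the actual generic page table of kernel/mm/gpt.h.

```c
#include <stdint.h>
#include <stdbool.h>

#define GPT_ENTRIES 16                 /* toy page table size */

/* toy GPT entry: a real GPT entry also carries access rights */
typedef struct { bool mapped; uint32_t ppn; } gpt_entry_t;

static gpt_entry_t gpt[GPT_ENTRIES];   /* models one GPT(P,K)            */
static uint32_t    next_ppn = 100;     /* models the physical allocator  */

/* on-demand mapping: the entry is only filled when a thread actually
 * accesses the page; the first access triggers a (simulated) page fault */
static uint32_t gpt_access( uint32_t vpn )
{
    if( ! gpt[vpn].mapped )            /* page fault */
    {
        gpt[vpn].ppn    = next_ppn++;  /* allocate a physical page */
        gpt[vpn].mapped = true;
    }
    return gpt[vpn].ppn;               /* translation hit */
}
```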

We have the following properties for the private vsegs:

  • the VSL(P,K) always contains all private vsegs defined in cluster K,
  • The GPT(P,K) contains all mapped entries corresponding to a private vseg in cluster K.

We have the following properties for the public vsegs:

  • the VSL(P,K) contains only the public vsegs that have been actually accessed by a thread of P running in cluster K.
  • Only the reference cluster KREF contains the complete VSL(P,KREF) of all public vsegs for the P process.
  • The GPT(P,K) contains only the entries that have been accessed by a thread running in cluster K.
  • Only the reference cluster KREF contains the complete GPT(P,KREF) of all mapped entries of public vsegs for the P process.

For the public vsegs, the VMM(P,K) structures - other than the reference one - can be considered as local caches. This creates a coherence problem, that is solved by the following rules :

  1. For the private vsegs, and the corresponding entries in the page table, the VSL(P,K) and the GPT(P,K) are only shared by the threads of P running in cluster K, and these structures can be privately handled by the local kernel instance in cluster K.
  2. When a given public vseg in the VSL, or a given entry in the GPT must be removed or modified, this modification must be done first in the reference cluster, and broadcast to all other clusters for update of local VSL or GPT copies.
  3. When a miss is detected in a non-reference cluster, the reference VMM(P,KREF) must be accessed first to check for a possible false segmentation fault or false page fault.
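
Rule 3 can be sketched with two toy GPT copies: the complete reference one, and a local copy acting as a cache. All names and structures here are illustrative assumptions, not the actual almos-mkh code.

```c
#include <stdint.h>
#include <stdbool.h>

#define GPT_ENTRIES 8

typedef struct { bool mapped; uint32_t ppn; } gpt_entry_t;

static gpt_entry_t ref_gpt[GPT_ENTRIES];   /* GPT(P,KREF): complete   */
static gpt_entry_t loc_gpt[GPT_ENTRIES];   /* GPT(P,K): a local cache */
static uint32_t    next_ppn = 200;

/* on a local miss for a public vseg, consult the reference GPT first:
 * if the page is already mapped there, the miss was a "false page fault"
 * and the local GPT is simply updated from the reference entry */
static uint32_t handle_miss( uint32_t vpn, bool * false_fault )
{
    if( ! ref_gpt[vpn].mapped )            /* true page fault */
    {
        ref_gpt[vpn].ppn    = next_ppn++;  /* map in reference first */
        ref_gpt[vpn].mapped = true;
        *false_fault = false;
    }
    else                                   /* false page fault */
    {
        *false_fault = true;
    }
    loc_gpt[vpn] = ref_gpt[vpn];           /* update the local cache */
    return loc_gpt[vpn].ppn;
}
```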

For more details on the VMM implementation, the API is defined in the almos-mkh/kernel/mm/vmm.h and almos-mkh/kernel/mm/vmm.c files.

2. User segments

The user segments are mapped in user land (generally the lower part of the virtual space), which can be accessed by a core running in user mode. This section describes the six types of user segments and the associated replication / distribution policy defined and implemented by almos-mkh:

2.1 CODE

This private segment contains the application code. It is replicated in all clusters: almos-mkh creates one CODE vseg per active cluster. For a process P, the CODE vseg is registered in the VSL(P,KREF) when the process is created in the reference cluster KREF. In the other clusters K, the CODE vseg is registered in VSL(P,K) when a page fault is signaled by a thread of P running in cluster K. In each active cluster K, the CODE vseg is mapped in cluster K.

2.2 DATA

This public segment contains the user application global data. Almos-mkh creates one single DATA vseg, that is registered in the reference VSL(P,KREF) when the process P is created in the reference cluster KREF. In the other clusters K, the DATA vseg is registered in VSL(P,K) when a page fault is signaled by a thread of P running in cluster K. To avoid contention, this vseg is physically distributed on all clusters, with page granularity: two contiguous pages are generally stored in two different clusters, as the physical mapping is defined by the LSB bits of the VPN.
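
The page-granularity distribution can be shown in one line: with a power-of-two number of clusters, the LSB bits of the VPN select the target cluster, so contiguous pages land in different clusters. The cluster count below is an assumption for illustration.

```c
#include <stdint.h>

#define NB_CLUSTERS 4   /* assumed power-of-two number of clusters */

/* distribution of the DATA vseg with page granularity: the target
 * cluster is simply defined by the LSB bits of the VPN */
static inline uint32_t vpn_to_cluster( uint32_t vpn )
{
    return vpn & (NB_CLUSTERS - 1);
}
```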

2.3 STACK

This private segment contains the execution stack of a thread. Almos-mkh creates one STACK vseg for each thread of P running in cluster K. This vseg is registered in the VSL(P,K) when the thread descriptor is dynamically created in cluster K. To enforce locality, this vseg is of course physically mapped in cluster K.

2.4 ANON

This public segment is dynamically created by almos-mkh to serve an anonymous mmap system call executed by a client thread running in a cluster K. The vseg is registered in VSL(P,KREF), but the vseg is mapped in the client cluster K.

2.5 FILE

This public segment is dynamically created by almos-mkh to serve a file based mmap system call executed by a client thread running in a cluster K. The vseg is registered in VSL(P,KREF), but the vseg is mapped in the cluster containing the file cache.

2.6 REMOTE

This public segment is dynamically created by almos-mkh to serve a remote mmap system call, where a client thread running in a cluster X requests to create a new vseg mapped in another cluster Y. The vseg is registered in VSL(P,KREF), but the vseg is mapped in the cluster Y specified by the user.

2.7 summary

This table summarizes the replication, distribution & mapping rules for user segments:

Type   | Access             | Rights     | Replication     | Mapping in physical space       | Allocation policy in virtual space
STACK  | private localized  | Read Write | one per thread  | same cluster as thread using it | dynamic (one stack allocator per cluster)
CODE   | private localized  | Read Only  | one per cluster | same cluster as thread using it | static (defined in .elf file)
DATA   | public distributed | Read Write | non replicated  | distributed on all clusters     | static (defined in .elf file)
ANON   | public localized   | Read Write | non replicated  | same cluster as calling thread  | dynamic (one heap allocator per process)
FILE   | public localized   | Read Write | non replicated  | same cluster as the file cache  | dynamic (one heap allocator per process)
REMOTE | public localized   | Read Write | non replicated  | cluster defined by user         | dynamic (one heap allocator per process)

3. Kernel segments

The kernel segments are implemented in kernel land (generally in the upper part of the virtual space), that is protected, and can only be accessed by a core running in kernel mode.

A user thread makes system calls to access protected resources. Therefore, the VMM(P,K) of a process descriptor P in a cluster K must contain not only the user segments defined above, but also the kernel segments, to allow a user thread to access - after a syscall - the kernel code and the kernel data structures: both the user segments virtual addresses and the kernel segments virtual addresses must be translated to physical addresses.

Almos-mkh defines three types of kernel segments, described below.

3.1 KCODE

The KCODE segment contains the kernel code defined in the kernel.elf file. To avoid contention and improve locality, almos-mkh replicates this code in all clusters. This code has already been copied in all clusters by the bootloader.

WARNING : there is only one segment defined in the kernel.elf file, but there are as many KCODE vsegs as the number of clusters. All these vsegs have the same virtual base address and the same size, but the physical addresses (defined in the GPTs) depend on the cluster, because we want to access the local copy. This is not a problem because a KCODE vseg is a private vseg, that is accessed only by local threads.

In each cluster K, and for each process P in cluster K (including the kernel process_zero), almos-mkh registers the KCODE vseg in the VSL(P,K), and maps it in the GPT(P,K). This vseg uses only big pages, and there is no on-demand paging for this type of vseg.

3.2 KDATA

The KDATA segment contains the kernel global data, statically allocated at compilation time, and defined in the kernel.elf file. To avoid contention and improve locality, almos-mkh defines a KDATA vseg in each cluster. The corresponding data have already been copied by the boot-loader in all clusters.

WARNING : there is only one segment defined in the kernel.elf file, but there are as many KDATA vsegs as the number of clusters. All these vsegs have the same virtual base address and the same size, but the physical addresses (defined in the GPTs) depend on the cluster, because we generally want to access the local copy. This seems very similar to the KCODE replication, but there are two big differences between the KCODE and the KDATA segments:

  1. The values contained in the N KDATA vsegs are initially identical, as they are all defined by the same kernel.elf file. But they are not read-only, and will evolve differently in different clusters.
  2. The N KDATA vsegs are public, and can be accessed by any instance of the kernel running in any cluster. Even if most accesses are local, a thread running in cluster K must be able to access a global variable stored in another cluster X, or to send a request to another kernel instance in cluster X, or to scan a globally distributed structure, such as the DQDT or the VFS.

To allow any thread running in any cluster to access the N KDATA vsegs, almos-mkh can register these N vsegs in all VSL(P,K), and map them in all GPT(P,K).

3.3 KHEAP

The KHEAP segment contains, in each cluster K, the kernel structures dynamically allocated by the kernel in cluster K to satisfy the user requests (such as the process descriptors, the thread descriptors, the vseg descriptors, the file descriptors, etc.). To avoid contention and improve locality, almos-mkh defines one KHEAP segment in each cluster, implementing a physically distributed kernel heap. In each cluster, this KHEAP segment actually contains all the physical memory that is not already allocated to store the KCODE and KDATA segments.

WARNING : most of these structures are locally allocated in cluster K by a thread running in cluster K, and are mostly accessed by the threads running in cluster K. But these structures are global variables : they can be created in any cluster, by any thread running in any other cluster, and can be accessed by any thread executing kernel code in any other cluster.

To allow any thread running in any cluster to access the N KHEAP vsegs, almos-mkh can register these N vsegs in all VSL(P,K), and map them in all GPT(P,K).

3.4 KSTACK

Any thread entering the kernel to execute a system call needs a kernel stack, that must be in the kernel land. This requires as many kernel stacks as the total number of threads (user threads + dedicated kernel threads) in the system. For each thread, almos-mkh implements the kernel stack in the 16 Kbytes thread descriptor, that is dynamically allocated in the KHEAP segment, when the thread is created. Therefore, there is no specific KSTACK segment type for the kernel stacks.

3.5 Local & Remote accesses

Almos-mkh defines two different policies to access the data stored in the N KDATA and KHEAP segments :

  • The local accesses to the local kernel structures can use normal pointers that will be translated by the MMU to local physical addresses.
  • The remote accesses to remote kernel structures must use the hal_remote_load( cxy , ptr ) and hal_remote_store( cxy , ptr ) functions, where ptr is a normal pointer to the KDATA or KHEAP vseg, and cxy is the remote cluster identifier. Notice that a non-local kernel variable is therefore identified by an extended pointer XPTR( cxy , ptr ). With these remote access primitives, any kernel instance in any cluster can access any variable in any other cluster.

In other words, almos-mkh clearly distinguishes the local accesses, which can use standard pointers, from the remote accesses, which must use extended pointers. This can be seen as an annoying constraint, but it clearly helps to improve locality and avoid contention.
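
One possible packing of an extended pointer XPTR( cxy , ptr ) is sketched below: the cluster identifier in the upper 32 bits, and a 32-bit local pointer in the lower half. This layout is an assumption matching a 32-bit local address space; the actual encoding in hal_remote.h may differ.

```c
#include <stdint.h>

/* hedged sketch of an extended pointer: cluster identifier in the upper
 * word, local pointer in the lower word (layout is an assumption) */
typedef uint64_t xptr_t;

#define XPTR( cxy , ptr )  ( (((xptr_t)(cxy)) << 32) | (uint32_t)(uintptr_t)(ptr) )
#define GET_CXY( xp )      ( (uint32_t)((xp) >> 32) )    /* cluster identifier */
#define GET_PTR( xp )      ( (uint32_t)(xp) )            /* local pointer bits */
```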

The remote access primitives API is defined in the almos-mkh/hal/generic/hal_remote.h file.

4. Remote accesses implementation

The detailed implementation of the remote access functions described in section 3.5 depends on the target architecture.

4.1 Intel 64

On hardware architectures using 64-bit cores, the virtual space is generally much larger than the physical space. The actual size of the virtual space is 256 Tbytes (virtual addresses are limited to 48 bits) on Intel-based multi-core servers. It is therefore possible to map all segments described above in the virtual space:

  • the user segments defined in section 2 are accessed with normal pointers. They can be mapped in the user land (lower half of the 256 Tbytes).
  • the three local kernel segments (KCODE, KDATA, KHEAP) defined in section 3 are accessed with normal pointers. They can be mapped in the kernel land (upper half of the 256 Tbytes).
  • The N distributed KDATA and the N KHEAP segments are accessed using extended pointers XPTR(cxy,ptr). These 2N segments can also be mapped in the kernel land, and the translation from an extended pointer to a normal pointer can be done by the remote_load() and remote_store() functions.

In all cases, the Intel64 MMU is in charge of translating the virtual address (defined by the normal pointer) to the relevant physical address.

4.2 TSAR-MIPS32

The TSAR architecture uses 32-bit cores, to reduce the power consumption. This creates a big problem to access the remote KDATA and KHEAP segments: with 1 Gbytes of physical memory per cluster and 256 clusters, the total physical space covered by the N KHEAP segments is 256 Gbytes. This is much larger than the 4 Gbytes virtual space addressable by a 32-bit virtual address. The consequence is very simple: we cannot use the MIPS32 MMU to make the virtual to physical address translation when a thread wants to access a remote KDATA or KHEAP segment.

But the TSAR architecture provides two useful features to simplify the translation from an extended pointer XPTR(cxy,ptr) to a 40-bit physical address:

  1. The TSAR 40-bit physical address has a specific format: it is the concatenation of an 8-bit CXY field and a 32-bit LPADDR field, where the CXY defines the cluster identifier, and the LPADDR is the local physical address inside the cluster.

  2. The MIPS32 core used by the TSAR architecture defines, besides the standard MMU, another - non-standard - hardware mechanism for address translation: a 40-bit physical address is simply built by appending to each 32-bit virtual address an 8-bit extension contained in a software controllable register, called DATA_PADDR_EXT.
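
The 40-bit address format can be checked with a one-line helper, concatenating the 8-bit CXY field and the 32-bit LPADDR field as described above:

```c
#include <stdint.h>

/* TSAR 40-bit physical address: an 8-bit CXY cluster identifier
 * concatenated with a 32-bit LPADDR local physical address */
static inline uint64_t tsar_paddr( uint32_t cxy , uint32_t lpaddr )
{
    return ((uint64_t)(cxy & 0xFF) << 32) | lpaddr;
}
```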

In the TSAR architecture, and for any process P in any cluster K, almos-mkh registers only one extra KCODE vseg in the VMM(P,K), because almos-mkh uses the INST-MMU for instruction address translation, but does NOT use the DATA-MMU for data address translation: when a core enters the kernel, the DATA-MMU is deactivated, and it is only reactivated when the core returns to user code.

In the TSAR implementation, the default value contained in the DATA_PADDR_EXT register is the local cluster identifier (local_cxy), to access the local physical memory. For remote accesses, the remote_load() and remote_store() functions set the extension register DATA_PADDR_EXT to the target (cxy) before the remote access, and restore it to (local_cxy) after the remote access.
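
The save / set / restore protocol can be modeled in plain C. The memory banks and register below are a toy software model (the real mechanism is a hardware register written by assembly code in the TSAR HAL), so all names here are illustrative assumptions.

```c
#include <stdint.h>

#define NB_CLUSTERS 4
#define MEM_WORDS   16

/* toy model: one physical memory bank per cluster, plus a software
 * controllable register extending every 32-bit data address */
static uint32_t mem[NB_CLUSTERS][MEM_WORDS];
static uint32_t data_paddr_ext = 0;        /* models DATA_PADDR_EXT,     */
                                           /* holding local_cxy (here 0) */

/* sketch of a TSAR remote load: set the extension register to the
 * target cluster, perform the access, then restore local_cxy */
static uint32_t toy_remote_load( uint32_t cxy , uint32_t word )
{
    uint32_t save = data_paddr_ext;        /* save current extension     */
    data_paddr_ext = cxy;                  /* addresses now extend to cxy */
    uint32_t val = mem[data_paddr_ext][word];
    data_paddr_ext = save;                 /* restore local_cxy          */
    return val;
}
```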

The price to pay for this physical addressing is to precisely control the KCODE and the KDATA segments when compiling the kernel.

The implementation of the hal_remote_load() and hal_remote_store() functions for the TSAR architecture is available in the almos-mkh/hal/tsar_mips32/core/hal_remote.c file.

5. Virtual space organisation

This section describes the almos-mkh assumptions regarding the virtual space organisation. It clearly depends on the size of the virtual space.

5.1 Intel 64

TODO

5.2 TSAR-MIPS32

The virtual address space of a user process P is split in five fixed-size zones, defined by configuration parameters in https://www-soc.lip6.fr/trac/almos-mkh/browser/trunk/kernel/kernel_config.h. Each zone contains one or several vsegs, as described below.

5.2.1 The kernel zone

It contains the kcode vseg (type KCODE), that must be mapped in all user processes. It is located in the lower part of the virtual space, and starts at address 0. Its size cannot be less than a big page size (2 Mbytes for the TSAR architecture), because it will be mapped as one (or several) big pages in the GPT.

5.2.2 The utils zone

It contains the two args and envs vsegs, whose sizes are defined by specific configuration parameters. The args vseg (DATA type) contains the process main() arguments. The envs vseg (DATA type) contains the process environment variables. It is located on top of the kernel zone, and starts at address defined by the CONFIG_VMM_ELF_BASE parameter.

5.2.3 The elf zone

It contains the text (CODE type) and data (DATA type) vsegs, defining the user process binary code and global data. The actual vsegs base addresses and sizes are defined in the .elf file and reported in the boot_info_t structure by the boot loader.

5.2.4 The heap zone

It contains all vsegs dynamically allocated / released by the mmap / munmap system calls (i.e. FILE / ANON / REMOTE types). It is located on top of the elf zone, and starts at the address defined by the CONFIG_VMM_HEAP_BASE parameter. The VMM defines a specific MMAP allocator for this zone, implementing the buddy algorithm. The mmap( FILE ) syscall directly maps a file in user space. The user level malloc library uses the mmap( ANON ) syscall to allocate virtual memory from the heap and map it in the same cluster as the calling thread. Besides the standard malloc() function, this library implements a non-standard remote_malloc() function, that uses the mmap( REMOTE ) syscall to dynamically allocate virtual memory from the heap, and map it to a remote physical cluster.
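
As a sketch of the buddy policy mentioned above (the data structures are a toy illustration, not the actual almos-mkh MMAP allocator): free blocks of 2^order pages are kept in per-order free lists, a large block is split on allocation, and buddies are merged back on release.

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_ORDER  4                   /* toy zone = 2^MAX_ORDER pages */
#define ZONE_PAGES (1u << MAX_ORDER)

/* one free-list per order; each entry stores the first page index */
static int free_list[MAX_ORDER + 1][ZONE_PAGES];
static int free_count[MAX_ORDER + 1];

static void buddy_init( void )
{
    for( int o = 0 ; o <= MAX_ORDER ; o++ ) free_count[o] = 0;
    free_list[MAX_ORDER][free_count[MAX_ORDER]++] = 0;   /* one big block */
}

/* allocate 2^order contiguous pages; returns first page index, or -1 */
static int buddy_alloc( int order )
{
    int o = order;
    while( (o <= MAX_ORDER) && (free_count[o] == 0) ) o++;
    if( o > MAX_ORDER ) return -1;
    int blk = free_list[o][--free_count[o]];
    while( o > order )                  /* split down to the wanted size */
    {
        o--;
        free_list[o][free_count[o]++] = blk + (1 << o);  /* upper buddy */
    }
    return blk;
}

static void buddy_free( int blk , int order )
{
    for( int o = order ; o < MAX_ORDER ; o++ )  /* merge while possible */
    {
        int  buddy  = blk ^ (1 << o);
        bool merged = false;
        for( int i = 0 ; i < free_count[o] ; i++ )
        {
            if( free_list[o][i] == buddy )      /* buddy is free: merge */
            {
                free_list[o][i] = free_list[o][--free_count[o]];
                blk &= ~(1 << o);               /* merged block base */
                merged = true;
                break;
            }
        }
        if( !merged ) { free_list[o][free_count[o]++] = blk; return; }
    }
    free_list[MAX_ORDER][free_count[MAX_ORDER]++] = blk;
}
```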

5.2.5 The stack zone

It is located on top of the mmap zone and starts at the address defined by the CONFIG_VMM_STACK_BASE parameter. It contains an array of fixed-size slots, and each slot contains one stack vseg. The size of a slot is defined by the CONFIG_VMM_STACK_SIZE parameter. In each slot, the first page is not mapped, in order to detect stack overflows. As threads are dynamically created and destroyed, the VMM implements a specific STACK allocator for this zone, using a bitmap vector. As the stack vsegs are private (the same virtual address can have different mappings, depending on the cluster), the number of slots in the stack zone actually defines the max number of threads for a given process in a given cluster.
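
A bitmap-based STACK allocator like the one described above can be sketched as follows. The base, slot size, and slot count are illustrative assumptions, not the actual CONFIG_VMM_STACK_BASE / CONFIG_VMM_STACK_SIZE values.

```c
#include <stdint.h>

#define STACK_BASE  0xC0000000u   /* assumed stack zone base address        */
#define STACK_SIZE  0x00100000u   /* assumed slot size (1 Mbyte)            */
#define STACK_SLOTS 32            /* max threads of a process in a cluster  */

static uint32_t stack_bitmap;     /* bit i set <=> slot i allocated */

/* allocate the first free slot; returns slot index (or -1) and the
 * slot base address; the first page of the slot stays unmapped to
 * detect stack overflows */
static int stack_alloc( uint32_t * base )
{
    for( int i = 0 ; i < STACK_SLOTS ; i++ )
    {
        if( !(stack_bitmap & (1u << i)) )
        {
            stack_bitmap |= (1u << i);
            *base = STACK_BASE + (uint32_t)i * STACK_SIZE;
            return i;
        }
    }
    return -1;                    /* no free slot: max threads reached */
}

static void stack_free( int slot )
{
    stack_bitmap &= ~(1u << slot);
}
```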