Process and thread creation/destruction
The process is the internal representation of a user application. A process can be running as a single thread (called main thread), or can be multi-threaded. ALMOS-MKH supports the POSIX thread API. For a multi-threaded application, the number of threads can be very large, and the threads of a given process can be distributed on all cores available in the shared memory architecture, for maximal parallelism. Therefore, a single process can spread over all clusters. To avoid contention, the process descriptor of a process P, and the associated structures, such as the list of registered vsegs (VSL), the generic page table (GPT), or the file descriptors table (FDT), are (partially) replicated in all clusters containing at least one thread of P.
1) Process
The PID (Process Identifier) is coded on 32 bits. It is unique in the system, and has a fixed format: The 16 MSB (CXY) contain the owner cluster identifier. The 16 LSB bits (LPID) contain the local process index in owner cluster. The owner cluster is therefore defined by the 16 MSB bits of PID.
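The PID format described above can be illustrated with a minimal C sketch. The helper names (pid_make, pid_owner_cxy, pid_lpid) are hypothetical and are not taken from the ALMOS-MKH sources; only the 16/16 bit split comes from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Build a 32-bit PID: owner cluster identifier (CXY) in the 16 MSB,
 * local process index (LPID) in the 16 LSB. Hypothetical helper. */
static inline uint32_t pid_make(uint32_t cxy, uint32_t lpid)
{
    assert(cxy <= 0xFFFFu && lpid <= 0xFFFFu);   /* both fields must fit in 16 bits */
    return (cxy << 16) | lpid;
}

/* The owner cluster is defined by the 16 MSB of the PID. */
static inline uint16_t pid_owner_cxy(uint32_t pid)
{
    return (uint16_t)(pid >> 16);
}

/* The local process index is defined by the 16 LSB of the PID. */
static inline uint16_t pid_lpid(uint32_t pid)
{
    return (uint16_t)(pid & 0xFFFFu);
}
```

Because the owner cluster is encoded in the PID itself, any kernel instance can find it without a lookup, which is consistent with the statement that the owner cluster never changes.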
As there are several copies of the process descriptor, ALMOS-MKH defines a reference process descriptor, located in the reference cluster. The other copies are used as local caches, and ALMOS-MKH must guarantee the coherence between the reference and the copies.
As ALMOS-MKH supports process migration, the reference cluster can be different from the owner cluster. The owner cluster cannot change (because the PID is fixed), but the reference cluster can change in case of process migration.
In each cluster K, the local cluster manager ( cluster_t type in ALMOS-MKH ) contains a process manager ( pmgr_t type in ALMOS-MKH ) that maintains three structures for all processes owned by K :
- The PREF_TBL[lpid] is an array indexed by the local process index. Each entry contains an extended pointer on the reference process descriptor.
- The COPIES_ROOT[lpid] array is also indexed by the local process index. Each entry contains the root of the global list of copies for each process owned by cluster K.
- The LOCAL_ROOT is the local list of all process descriptors in cluster K. A process descriptor copy of P is present in K, as soon as P has a thread in cluster K.
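The three structures of the process manager could be laid out as in the following sketch. It is only a plausible shape: the field names, the table size PMGR_MAX_LPID, the list link types and the two accessor functions are assumptions made for illustration, not the actual pmgr_t definition.

```c
#include <assert.h>
#include <stdint.h>

#define PMGR_MAX_LPID 256   /* hypothetical: max number of processes owned by a cluster */

typedef uint64_t xptr_t;    /* assumed: extended pointer (cluster identifier + local address) */

/* Link of a local list: all elements are in the same cluster (local pointers). */
typedef struct list_entry_s  { struct list_entry_s *next, *pred; } list_entry_t;

/* Link of a global list: elements can be in any cluster (extended pointers). */
typedef struct xlist_entry_s { xptr_t next, pred; } xlist_entry_t;

typedef struct pmgr_s
{
    xptr_t        pref_tbl[PMGR_MAX_LPID];     /* PREF_TBL[lpid]    : reference descriptor    */
    xlist_entry_t copies_root[PMGR_MAX_LPID];  /* COPIES_ROOT[lpid] : root of list of copies  */
    list_entry_t  local_root;                  /* LOCAL_ROOT        : descriptors in cluster  */
} pmgr_t;

/* Record the reference process descriptor of an owned process (hypothetical helper). */
void pmgr_set_reference(pmgr_t *pmgr, uint32_t lpid, xptr_t ref_xp)
{
    assert(pmgr != 0 && lpid < PMGR_MAX_LPID);
    pmgr->pref_tbl[lpid] = ref_xp;
}

/* Retrieve the reference process descriptor from the local process index. */
xptr_t pmgr_get_reference(const pmgr_t *pmgr, uint32_t lpid)
{
    assert(pmgr != 0 && lpid < PMGR_MAX_LPID);
    return pmgr->pref_tbl[lpid];
}
```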
Here is a partial list of the information stored in a process descriptor ( process_t in ALMOS-MKH ):
- PID : process identifier.
- PPID : parent process identifier.
- PREF : extended pointer on the reference process descriptor.
- VSL : root of the local list of virtual segments defining the memory image.
- GPT : generic page table defining the physical memory mapping.
- FDT : open file descriptors table.
- TH_TBL : local table of threads owned by this process in this cluster.
- LOCAL_LIST : member of local list of all process descriptors in same cluster.
- COPIES_LIST : member of global list of all descriptors of same process.
- CHILDREN_LIST : member of global list of all children of same parent process.
- CHILDREN_ROOT : root of global list of child processes.
All elements of a local list are in the same cluster, and ALMOS-MKH uses local pointers. Elements of a global list can be distributed on all clusters, and ALMOS-MKH uses extended pointers.
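The difference between the two kinds of pointers can be sketched as follows. The 64-bit layout used here (cluster identifier in the upper 32 bits, local address in the lower 32 bits) is an assumption for illustration, and the helper names are hypothetical; the text above only states that an extended pointer can designate an object in any cluster.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of an extended pointer:
 * bits 63..32 = cluster identifier, bits 31..0 = local address in that cluster. */
typedef uint64_t xptr_t;

static inline xptr_t xptr_make(uint32_t cxy, uint32_t local_ptr)
{
    assert(cxy <= 0xFFFFu);   /* cluster identifiers are coded on 16 bits */
    return ((uint64_t)cxy << 32) | (uint64_t)local_ptr;
}

/* Cluster holding the designated object. */
static inline uint32_t xptr_cxy(xptr_t xp)
{
    return (uint32_t)(xp >> 32);
}

/* Local address, only meaningful inside cluster xptr_cxy(xp). */
static inline uint32_t xptr_local(xptr_t xp)
{
    return (uint32_t)(xp & 0xFFFFFFFFu);
}
```

A local list link can store plain pointers because both ends are in the same cluster, while a global list link has to carry the cluster identifier along with the address.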
2) Thread
ALMOS-MKH defines four types of threads :
- one USR thread is created by a pthread_create() system call.
- one DEV thread is created by the kernel to execute all I/O operations for a given channel device.
- one RPC thread is activated by the kernel to execute pending RPC requests in the local RPC fifo.
- the IDL thread is executed when there is no other thread to execute on a core.
From the point of view of scheduling, a thread can be in three states : RUNNING, RUNNABLE or BLOCKED.
This implementation of ALMOS-MK does not support thread migration: a thread created by a pthread_create() system call is pinned on a given core in a given cluster. The only exception is the main thread of a process, that is automatically created by the kernel when a new process is created, and follows its owner process in case of process migration.
In a given process, a thread is identified by a fixed format TRDID identifier, coded on 32 bits : The 16 MSB bits (CXY) define the cluster where the thread has been pinned. The 16 LSB bits (LTID) define the thread local index in the local TH_TBL[K,P] of a process descriptor P in a cluster K. This LTID index is allocated by the local process descriptor when the thread is created.
Therefore, the TH_TBL(K,P) thread table for a given process in a given cluster contains only the threads of P placed in cluster K. The set of all threads of a given process is defined by the union of all TH_TBL(K,P) for all active clusters K. To scan the set of all threads of a process P, ALMOS-MK traverses the COPIES_LIST of all process descriptors associated to the P process.
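The TRDID allocation and the scan over all copies can be sketched in C. This is a simplified single-address-space model: the COPIES_LIST is replaced by a plain next_copy pointer, and TH_TBL_SIZE and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TH_TBL_SIZE 8   /* hypothetical size of the local thread table */

typedef struct thread_s { uint32_t trdid; } thread_t;

typedef struct process_copy_s
{
    uint16_t               cxy;                  /* cluster K holding this copy           */
    thread_t              *th_tbl[TH_TBL_SIZE];  /* TH_TBL(K,P), NULL means free slot     */
    struct process_copy_s *next_copy;            /* stand-in for the global COPIES_LIST   */
} process_copy_t;

/* TRDID: cluster identifier (CXY) in the 16 MSB, local index (LTID) in the 16 LSB. */
static inline uint32_t trdid_make(uint16_t cxy, uint16_t ltid)
{
    return ((uint32_t)cxy << 16) | (uint32_t)ltid;
}

/* Allocate the first free LTID in the local table and set the thread TRDID.
 * Returns 0 on success, -1 if the local table is full. */
int process_register_thread(process_copy_t *copy, thread_t *thread)
{
    assert(copy != NULL && thread != NULL);
    for (uint16_t ltid = 0; ltid < TH_TBL_SIZE; ltid++)
    {
        if (copy->th_tbl[ltid] == NULL)
        {
            copy->th_tbl[ltid] = thread;
            thread->trdid = trdid_make(copy->cxy, ltid);
            return 0;
        }
    }
    return -1;
}

/* Count all threads of a process: union of TH_TBL(K,P) over all copies. */
size_t process_count_threads(const process_copy_t *first_copy)
{
    size_t count = 0;
    for (const process_copy_t *c = first_copy; c != NULL; c = c->next_copy)
        for (size_t i = 0; i < TH_TBL_SIZE; i++)
            if (c->th_tbl[i] != NULL) count++;
    return count;
}
```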
Here is a partial list of the information stored in a thread descriptor (thread_t in ALMOS-MK):
- TRDID : thread identifier
- TYPE : KERNEL / USER / IDLE / RPC
- FLAGS : thread attributes
- STATE : CREATE / READY / USER / KERNEL / WAIT / ZOMBI / DEAD
- PROCESS : pointer on the local process descriptor
- LOCKS_COUNT : current number of locks taken by this thread
- PWS : save area for the core registers.
- SCHED : pointer on the scheduler in charge of this thread.
- CORE : pointer on the owner processor core.
- IO : allocated devices (in case of privately allocated devices).
- SIGNALS : bit vector recording the signals received by the thread.
- XLIST : member of the global list of threads waiting on the same resource.
- CHILDREN_ROOT : root of the global list of children threads.
- CHILDREN_LIST : member of the global list of all children of same parent.
- etc.
3) Process creation
Process creation in a remote cluster implements the POSIX fork() / exec() mechanism. When a parent process P executes the fork() system call, a new child process C is created. The new process C inherits from the parent process P the open files (FDT), and the memory image (VSL and GPT). These structures must be replicated in the new process descriptor. After a fork(), the C process can execute an exec() system call, which allocates a new memory image to the C process, but the new process can also continue to execute with the inherited memory image. For load balancing, ALMOS-MKH uses the DQDT to create the child process C in a cluster different from the cluster of the parent process P, but the user application can also use the non-standard fork_place() system call to specify the target cluster.
3.1) fork()
NEW SPECIFICATION
A thread of parent process P, running in a cluster X, executes the fork() system call to create a child process C on a remote cluster Y, that will become both the owner and the reference cluster for the C process. A new process descriptor, and a new thread descriptor must be created and initialized in target cluster Y for the child process. The calling thread can run in any cluster. If the reference cluster Z for process P is different from the calling thread cluster X, the calling thread must use an RPC to ask the reference cluster Z to do the work, because only the reference cluster Z contains a complete description of the parent process VSL and GPT.
Regarding the process descriptor, a new PID must be allocated in cluster Y. The child process C inherits the vsegs registered in the parent process VSL, but the ALMOS-MKH replication policy depends on the vseg type:
- for the DATA, MMAP, REMOTE vsegs (containing shared, non replicated data), all vsegs registered in the parent reference VSL(Z,P) are registered in the child reference VSL(Y,C), and all valid GPT entries in the reference parent GPT(Z,P) are copied in the child reference GPT(Y,C). For all pages, the WRITABLE flag is reset and the COW flag is set, in both (parent and child) GPTs. This requires updating all replicated parent GPT copies in all clusters.
- for the STACK vsegs (that are private), only one vseg is registered in the child reference VSL(Y,C). This vseg contains the user stack of the user thread requesting the fork, running in cluster X. All valid GPT entries in the parent GPT(X,P) are copied in the child GPT(Y,C). For all pages, the WRITABLE flag is reset and the COW flag is set, in both (parent and child) GPTs. This requires updating all replicated parent GPT copies in all clusters.
- for the CODE vsegs (that must be replicated in all clusters containing a thread), all vsegs registered in the reference parent VSL(Z,P) are registered in the child reference VSL(Y,C), but the reference child GPT(Y,C) is not updated by the fork: It will be dynamically updated on demand in case of page fault.
- for the FILE vsegs (containing shared memory mapped files), all vsegs registered in the reference parent VSL(Z,P) are registered in the child reference VSL(Y,C), and all valid entries registered in the reference parent GPT(Z,P) are copied in the reference child GPT(Y,C). The COW flag is not set for these shared data.
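The per-type replication policy listed above can be summarized as a small decision function. This is a sketch only: the enum values mirror the vseg types named in the text, but the fork_policy_t structure and the fork_policy() function are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    VSEG_DATA, VSEG_MMAP, VSEG_REMOTE, VSEG_STACK, VSEG_CODE, VSEG_FILE,
    VSEG_TYPE_NR
} vseg_type_t;

/* What fork() does for the vsegs of a given type (hypothetical summary). */
typedef struct
{
    bool copy_gpt;     /* valid parent GPT entries are copied to the child GPT       */
    bool set_cow;      /* WRITABLE reset and COW set in both parent and child GPTs   */
    bool caller_only;  /* only the vseg of the thread calling fork() is registered   */
} fork_policy_t;

fork_policy_t fork_policy(vseg_type_t type)
{
    fork_policy_t p = { false, false, false };
    assert((unsigned)type < VSEG_TYPE_NR);
    switch (type)
    {
        case VSEG_DATA:
        case VSEG_MMAP:
        case VSEG_REMOTE:   /* copied, then protected by copy-on-write */
            p.copy_gpt = true;  p.set_cow = true;
            break;
        case VSEG_STACK:    /* private: only the caller's user stack, with COW */
            p.copy_gpt = true;  p.set_cow = true;  p.caller_only = true;
            break;
        case VSEG_CODE:     /* child GPT filled on demand by page faults */
            break;
        case VSEG_FILE:     /* shared mapped files: copied, but no COW */
            p.copy_gpt = true;
            break;
        default:
            break;
    }
    return p;
}
```

In all four cases the vsegs themselves are registered in the child reference VSL(Y,C); the policy only differs in how the GPT entries are handled.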
Regarding the thread descriptor, a new TRDID must be allocated in cluster Y, and the calling parent thread context (current values stored in the CPU and FPU registers) must be saved in the child thread CPU and FPU contexts, to be restored when the child thread is selected for execution. Two slots corresponding to two CPU registers must be specifically initialized:
- the thread pointer register contains the current thread descriptor address. This thread pointer register cannot have the same value for parent and child.
- the stack pointer register defines the current kernel stack. ALMOS-MKH uses a specific kernel stack when a user thread enters the kernel, and this kernel stack is implemented in the thread descriptor. As parent and child cannot use the same kernel stack, the parent kernel stack content must be copied to the child kernel stack, and the stack pointer register cannot have the same value for parent and child.
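The initialization of these two slots can be sketched as follows. The sketch assumes that the child stack pointer is obtained by keeping the same offset inside the child kernel stack as the parent has inside its own; the text only requires the two values to differ. The thread_t layout, KSTACK_SIZE and the function name are hypothetical, and the copy of the other saved registers is omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define KSTACK_SIZE 1024   /* hypothetical kernel stack size, in bytes */

typedef struct thread_s
{
    uintptr_t thread_ptr_reg;        /* saved thread pointer register           */
    uintptr_t stack_ptr_reg;         /* saved kernel stack pointer register     */
    uint8_t   kstack[KSTACK_SIZE];   /* kernel stack, inside thread descriptor  */
} thread_t;

/* Initialize the two child context slots that cannot be inherited as is. */
void fork_init_child_context(const thread_t *parent, thread_t *child)
{
    uintptr_t parent_base = (uintptr_t)parent->kstack;

    /* the parent stack pointer must point inside the parent kernel stack */
    assert(parent->stack_ptr_reg >= parent_base);
    uintptr_t offset = parent->stack_ptr_reg - parent_base;
    assert(offset <= KSTACK_SIZE);

    /* copy the parent kernel stack content to the child kernel stack */
    memcpy(child->kstack, parent->kstack, KSTACK_SIZE);

    /* thread pointer: address of the child thread descriptor */
    child->thread_ptr_reg = (uintptr_t)child;

    /* stack pointer: same offset, but relative to the child kernel stack */
    child->stack_ptr_reg = (uintptr_t)child->kstack + offset;
}
```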
OLD SPECIFICATION (DEPRECATED)
A thread of parent process P, running in a cluster K, executes the fork() system call to create a new process C on a remote cluster Z, that will become the owner for the C process. ALMOS-MK creates the first C process descriptor in the same cluster as the parent process P, and postpones the costly remote copy of VSL and GPT from P to C, because this copy is useless in case of exec(). When the fork() system call returns, the C process owner cluster is Z, but the reference process descriptor is in cluster K. The child process and the associated main thread will be migrated to cluster Z later, when the child process makes an "exec" or any other system call.
- the cluster K allocates memory in K for the reference process descriptor of C, and gets a pointer on this process descriptor.
- the cluster K asks kernel Z to allocate a PID for the C process, and to register the process descriptor extended pointer in its PREF_TBL(Z). It uses the RPC_PROCESS_PID_ALLOC that takes the process descriptor pointer as argument and returns the PID.
- after RPC completion, the kernel K initializes the C process descriptor from information found in the P parent process descriptor.
- the kernel K creates locally the main thread of process C, and registers this thread in the TH_TBL(K,C).
- the kernel K registers this new thread in the scheduler of the core executing the fork() system call, and returns.
At the end of the fork(), the owner cluster for the new C process is cluster Z, and the reference cluster is cluster K. This C process contains one single thread running on K.
3.2) exec()
NEW SPECIFICATION
After a fork() system call, any thread of the P process can execute an exec() system call. This system call forces the P process to execute a new application, while keeping the same PID. The P process keeps all open files, and the environment variables. The P process reference descriptor must be re-initialised from values found in the .elf file defining the new application. All existing threads of process P must be killed (in all clusters), and a new main thread must be created in the reference cluster. The calling thread can run in any cluster. If the reference cluster Z for process P is different from the calling thread cluster X, the calling thread must use an RPC to ask the reference cluster Z to do the work.
OLD SPECIFICATION (DEPRECATED)
- The kernel K sends an RPC_PROCESS_MIGRATE to cluster Z. The argument is the extended pointer on the C process descriptor in cluster K.
- To execute this RPC, the kernel Z allocates a new reference process descriptor in cluster Z, and initializes it from information found in the process descriptor in cluster K, using a remote_memcpy().
- The kernel Z allocates and initializes from disk the structures contained in the process VMM: GPT(Z,C), VSL(Z,C).
- The kernel Z creates the main thread associated to process C in cluster Z, initializes it, and registers it in the TH_TBL(Z,C).
- The kernel Z registers this thread in the scheduler of the core selected by the Z kernel and acknowledges the RPC.
- When receiving the RPC acknowledgement, the kernel K destroys the C process descriptor and the associated thread in cluster K, which is no longer involved in process C execution.
At the end of the exec() system call, the cluster Z is both the owner and the reference cluster for process C, that contains one single thread in cluster Z.
4) Thread creation
Any thread T of any process P, running in any cluster K, can create a new thread NT in any cluster M. This creation is driven by the pthread_create() system call. The target M cluster is called the host cluster. If the M cluster does not contain a process descriptor copy for process P (because the NT thread is the first thread of process P in cluster M), a new process descriptor must be created in cluster M.
- The target cluster M can be specified by the user application, using the CXY field of the pthread_attr_t argument. If the CXY is not defined by the user, the target cluster M is selected by the kernel K, using the DQDT.
- The target core in cluster M can be specified by the user application, using the CORE_LID field of the pthread_attr_t argument. If the CORE_LID is not defined by the user, the target core is selected by the target kernel M.
4.1) phase 1
The kernel K selects a target cluster M, and sends an RPC_THREAD_USER_CREATE request to cluster M. The argument is a complete structure pthread_attr_t (defined in the thread.h file in ALMOS-MK), containing the PID, the function to execute and its arguments, and optionally, the target cluster and target core. This RPC should return the thread TRDID.
4.2) phase 2
To execute this RPC, the kernel M will make a local copy of the pthread_attr_t structure, and execute the following steps:
- The kernel M checks if it contains a copy of the P process descriptor.
- If not, the kernel M creates a process descriptor copy from the reference P process descriptor, using a remote_memcpy(), and using the cluster_get_reference_process_from_pid() to get the extended pointer on reference cluster. It allocates memory for the associated structures PG_TBL(M,P), VSEG_LIST(M,P), FD_TBL(M,P). It initializes (partially) these structures by using remote_memcpy() from the reference cluster. The PG_TBL structure will be filled by the page faults.
- The kernel M registers this new process descriptor in the COPIES_LIST and LOCAL_LIST.
- When the local process descriptor is set, the kernel M selects the core that will execute the thread, allocates a TRDID to this thread, and creates the thread descriptor for NT.
- The kernel M registers the thread descriptor in the local process descriptor TH_TBL(M,P), and in the selected core scheduler.
- The kernel M returns the TRDID to the client cluster K, and acknowledges the RPC.
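The steps executed by the kernel M can be sketched with a small in-memory model of the host cluster. Everything here is illustrative: cluster_model_t, CORE_UNDEFINED, the table sizes and the round-robin core choice are assumptions, and the remote copies from the reference cluster are reduced to initializing an empty local copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_COPIES     4        /* hypothetical: process copies per cluster    */
#define TH_TBL_SIZE    4        /* hypothetical: threads per process copy      */
#define CORE_UNDEFINED 0xFFFF   /* hypothetical: CORE_LID not set by the user  */

typedef struct { bool used; uint32_t pid; bool th_used[TH_TBL_SIZE]; } process_copy_t;

typedef struct
{
    uint16_t       cxy;                  /* identifier of host cluster M       */
    uint16_t       nr_cores;             /* number of cores in cluster M       */
    uint16_t       next_core;            /* round-robin stand-in for selection */
    process_copy_t copies[MAX_COPIES];   /* local process descriptor copies    */
} cluster_model_t;

/* Returns 0 and sets *trdid and *core on success, -1 on failure. */
int thread_user_create(cluster_model_t *m, uint32_t pid, uint16_t wanted_core,
                       uint32_t *trdid, uint16_t *core)
{
    assert(m != NULL && trdid != NULL && core != NULL && m->nr_cores > 0);
    if (wanted_core != CORE_UNDEFINED && wanted_core >= m->nr_cores) return -1;

    /* 1. check if cluster M already contains a copy of the process descriptor */
    process_copy_t *copy = NULL, *free_slot = NULL;
    for (size_t i = 0; i < MAX_COPIES; i++)
    {
        if (m->copies[i].used && m->copies[i].pid == pid) copy = &m->copies[i];
        else if (!m->copies[i].used && free_slot == NULL) free_slot = &m->copies[i];
    }

    /* 2. if not, create it (a real kernel copies it from the reference cluster) */
    if (copy == NULL)
    {
        if (free_slot == NULL) return -1;
        copy = free_slot;
        *copy = (process_copy_t){ .used = true, .pid = pid };
    }

    /* 3. select the core: user choice if defined, kernel choice otherwise */
    uint16_t lid = wanted_core;
    if (lid == CORE_UNDEFINED)
    {
        lid = m->next_core;
        m->next_core = (uint16_t)((m->next_core + 1) % m->nr_cores);
    }

    /* 4. allocate an LTID in TH_TBL(M,P) and build the TRDID */
    for (uint16_t ltid = 0; ltid < TH_TBL_SIZE; ltid++)
    {
        if (!copy->th_used[ltid])
        {
            copy->th_used[ltid] = true;
            *trdid = ((uint32_t)m->cxy << 16) | ltid;
            *core  = lid;
            return 0;   /* 5. the TRDID is returned to the client cluster K */
        }
    }
    return -1;
}
```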
5) Thread destruction
The destruction of a thread T running in cluster K can be caused by the thread itself, with the pthread_exit() system call. It can also be caused by a kill signal, sent by another thread, requesting the thread to stop execution. In both cases, the host kernel K is in charge of the destruction. The scenario is more complex if the finishing thread T is running in ATTACH mode, because the parent thread TP must be informed of the completion of thread T, in case of pthread_join() executed by TP.
5.1) phase 1
If T is running in ATTACH mode, the host cluster K forces the T state to ZOMBI, to prevent the thread from being scheduled. If the thread completion is caused by an exit, the thread T stops execution immediately. If it is caused by a kill signal, the signal is registered in the thread descriptor, and the rescheduling will only occur at the next scheduling point. If T is not running in ATTACH mode, the scenario is similar, but T is directly forced to the DEAD state.
5.2) phase 2
This second phase is only required, in the parent thread cluster M, if T is running in ATTACH mode. The parent thread descriptor maintains a global list of all children threads running in ATTACH mode. When the parent thread executes the pthread_join() system call for a child identified by its TRDID, the M kernel scans this list to locate the selected child thread, and directly polls the child thread status using a remote_read access. When the M kernel detects the completion of T (ZOMBI state), it sends an RPC_THREAD_USER_JOIN to the K cluster containing the T thread to ask the K kernel to change the state of T from ZOMBI to DEAD.
5.3) phase 3
In each cluster, a dedicated kernel thread is in charge of housekeeping: This thread releases the memory allocated to all DEAD threads.
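The three phases can be sketched as state transitions on a simplified thread model. The THREAD_ACTIVE state stands for any state in which the thread is still alive, and all type and function names are hypothetical; remote accesses and RPCs are replaced by direct function calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { THREAD_ACTIVE, THREAD_ZOMBI, THREAD_DEAD } thread_state_t;

typedef struct
{
    thread_state_t state;
    bool           attached;   /* true if the thread runs in ATTACH mode */
} thread_model_t;

/* Phase 1: the host kernel K terminates thread T (exit or kill signal). */
void thread_finish(thread_model_t *t)
{
    assert(t != NULL && t->state == THREAD_ACTIVE);
    t->state = t->attached ? THREAD_ZOMBI : THREAD_DEAD;
}

/* Phase 2: one polling step of pthread_join() by the parent kernel M.
 * Returns true when the child has completed; it is then moved to DEAD. */
bool thread_join_poll(thread_model_t *child)
{
    assert(child != NULL && child->attached);
    if (child->state != THREAD_ZOMBI) return false;   /* not finished: poll again */
    child->state = THREAD_DEAD;                       /* RPC_THREAD_USER_JOIN     */
    return true;
}

/* Phase 3: housekeeping, releases all DEAD threads of a local table.
 * Returns the number of released threads. */
size_t housekeeping_release(thread_model_t *tbl[], size_t n)
{
    size_t released = 0;
    for (size_t i = 0; i < n; i++)
    {
        if (tbl[i] != NULL && tbl[i]->state == THREAD_DEAD)
        {
            tbl[i] = NULL;   /* a real kernel would release the descriptor memory */
            released++;
        }
    }
    return released;
}
```

An attached thread therefore survives housekeeping as a ZOMBI until its parent has joined it, while a detached thread can be reclaimed right after phase 1.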
6) Process destruction
The process destruction can be caused by an exit() system call, or by a signal sent by another process. In both cases, the owner cluster is in charge of the destruction.
6.1) phase 1
If the exit() system call is executed by a thread running in a cluster K different from the owner cluster Z, the kernel K sends an RPC_PROCESS_REQ_EXIT to cluster Z. The argument is the PID.
6.2) phase 2
To execute this RPC, the owner kernel Z sends a multi-cast RPC_PROCESS_EXIT to all clusters X that contain a copy of the process descriptor, using its COPIES_LIST. The argument of this RPC is the PID.
6.3) phase 3
In each cluster X, the kernel receiving an RPC_PROCESS_EXIT registers the kill signal in all thread descriptors associated to the PID process, and polls the local TH_TBL(X,P). When it detects that the TH_TBL(X,P) is empty, it releases the memory allocated to the process descriptor, and acknowledges the RPC to cluster Z.
6.4) phase 4
When the kernel Z has received all expected responses to the multi-cast RPC, it releases all memory allocated to process PID in cluster Z, and this completes the process destruction.
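The four phases can be sketched with a sequential model in which RPCs are plain function calls and killed threads terminate at once. The structures, MAX_CLUSTERS and the function names are hypothetical; only the routing rule (owner cluster taken from the 16 MSB of the PID) and the order of the phases come from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_CLUSTERS 4   /* hypothetical number of clusters */

typedef struct { bool has_copy; unsigned nr_threads; } copy_model_t;

typedef struct
{
    uint32_t     pid;
    copy_model_t copies[MAX_CLUSTERS];   /* stand-in for the COPIES_LIST */
} process_model_t;

/* Phase 1: an RPC to the owner is needed if the caller is not in the owner cluster. */
bool exit_needs_rpc(uint32_t pid, uint16_t local_cxy)
{
    return (uint16_t)(pid >> 16) != local_cxy;
}

/* Phase 3, in one cluster X: kill the local threads, release the copy, acknowledge. */
static bool process_exit_local(copy_model_t *copy)
{
    assert(copy != 0 && copy->has_copy);
    copy->nr_threads = 0;       /* kill signal registered, TH_TBL(X,P) becomes empty */
    copy->has_copy   = false;   /* process descriptor memory released                */
    return true;                /* acknowledge to the owner cluster Z                */
}

/* Phases 2 and 4, in the owner cluster Z: multi-cast, then wait for all responses.
 * Returns the number of clusters that acknowledged. */
unsigned process_exit_owner(process_model_t *p)
{
    unsigned expected = 0, acks = 0;
    for (unsigned x = 0; x < MAX_CLUSTERS; x++)
    {
        if (!p->copies[x].has_copy) continue;
        expected++;
        if (process_exit_local(&p->copies[x])) acks++;
    }
    assert(acks == expected);   /* destruction completes only when all have answered */
    return acks;
}
```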