= Process and thread creation/destruction =

[[PageOutline]]

The process is the internal representation of a user application. A process can run as a single thread (called the main thread), or can be multi-threaded. ALMOS-MKH supports the POSIX thread API. For a multi-threaded application, the number of threads can be very large, and the threads of a given process can be distributed on all cores available in the shared memory architecture, for maximal parallelism. Therefore, a single process can spread over all clusters. To avoid contention, the process descriptor of a process P, and the associated structures, such as the list of registered vsegs ('''VSL'''), the generic page table ('''GPT'''), or the file descriptors table ('''FDT'''), are (partially) replicated in all clusters containing at least one thread of P.

== __1) Process__ ==

The PID (Process Identifier) is coded on 32 bits. It is unique in the system, and has a fixed format: the 16 MSB (CXY) contain the owner cluster identifier, and the 16 LSB (LPID) contain the local process index in the owner cluster. The '''owner cluster''' is therefore defined by the 16 MSB of the PID.

As several copies of the process descriptor exist, ALMOS-MKH defines a reference process descriptor, located in the '''reference cluster'''. The other copies are used as local caches, and ALMOS-MKH must guarantee the coherence between the reference and the copies.

As ALMOS-MKH supports process migration, the '''reference cluster''' can be different from the '''owner cluster'''. The '''owner cluster''' cannot change (because the PID is fixed), but the '''reference cluster''' can change in case of process migration.

In each cluster K, the local cluster manager (cluster_t type in ALMOS-MKH) contains a process manager (pmgr_t type in ALMOS-MKH) that maintains three structures for all processes owned by K:
 * The '''PREF_TBL[lpid]''' is an array indexed by the local process index. Each entry contains an extended pointer on the reference process descriptor.
 * The '''COPIES_ROOT[lpid]''' array is also indexed by the local process index. Each entry contains the root of the global list of copies for each process owned by cluster K.
 * The '''LOCAL_ROOT''' is the root of the local list of all process descriptors in cluster K. A process descriptor copy of P is present in K as soon as P has a thread in cluster K.

A process can be in four states:
 * '''RUNNING''' : the process is normally executing.
 * '''STOPPED''' : the process received a SIGSTOP signal. It can return to the RUNNING state when it receives a SIGCONT signal.
 * '''EXITED''' : the process terminated by an exit() syscall. It will be destroyed by the parent process executing a wait() syscall.
 * '''KILLED''' : the process received a SIGKILL signal. It will be destroyed by the parent process executing a wait() syscall.

Here is a partial list of the information stored in a process descriptor (process_t in ALMOS-MKH):
 - '''PID''' : process identifier.
 - '''PPID''' : parent process identifier.
 - '''PREF''' : extended pointer on the reference process descriptor.
 - '''STATE''' : current process state.
 - '''VSL''' : root of the local list of virtual segments defining the memory image.
 - '''GPT''' : generic page table defining the physical memory mapping.
 - '''FDT''' : open file descriptors table.
 - '''TH_TBL''' : local table of threads owned by this process in this cluster.
 - '''LOCAL_LIST''' : member of the local list of all process descriptors in the same cluster.
 - '''COPIES_LIST''' : member of the global list of all descriptors of the same process.
 - '''CHILDREN_LIST''' : member of the global list of all children of the same parent process.
 - '''CHILDREN_ROOT''' : root of the global list of children processes.

All elements of a ''local'' list are in the same cluster, and ALMOS-MKH uses local pointers. Elements of a ''global'' list can be distributed on all clusters, and ALMOS-MKH uses extended pointers.
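
The fixed PID format described in this section can be illustrated with a minimal sketch. Only the 16/16 bit split (CXY in the MSB, LPID in the LSB) comes from the text; the type name and the helper names below are hypothetical, and are not taken from the ALMOS-MKH sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type and helper names: only the bit layout comes from the text. */
typedef uint32_t proc_id_t;

/* Build a PID from the owner cluster identifier (CXY) and the local index (LPID). */
static inline proc_id_t pid_make(uint16_t cxy, uint16_t lpid)
{
    return ((proc_id_t)cxy << 16) | (proc_id_t)lpid;
}

/* The owner cluster is defined by the 16 MSB of the PID. */
static inline uint16_t pid_owner_cxy(proc_id_t pid)
{
    return (uint16_t)(pid >> 16);
}

/* The local process index is defined by the 16 LSB of the PID. */
static inline uint16_t pid_lpid(proc_id_t pid)
{
    return (uint16_t)(pid & 0xFFFFu);
}
```

With such helpers, any cluster can locate the owner cluster of a process from the PID alone, and the owner cluster can then use the LPID to index its PREF_TBL and COPIES_ROOT arrays.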
== __2) Thread__ ==

ALMOS-MKH defines four types of threads:
 * a '''USR''' thread is created by a pthread_create() system call.
 * a '''DEV''' thread is created by the kernel to execute all I/O operations for a given channel device.
 * an '''RPC''' thread is activated by the kernel to execute pending RPC requests in the local RPC fifo.
 * the '''IDL''' thread is executed when there is no other thread to execute on a core.

From the point of view of scheduling, a thread can be in three states: RUNNING, RUNNABLE or BLOCKED.

This implementation of ALMOS-MKH does not support thread migration: a thread created by a pthread_create() system call is pinned on a given core in a given cluster. The only exception is the main thread of a process, which is automatically created by the kernel when a new process is created, and follows its owner process in case of process migration.

In a given process, a thread is identified by a fixed format TRDID identifier, coded on 32 bits: the 16 MSB (CXY) define the cluster where the thread has been pinned, and the 16 LSB (LTID) define the thread local index in the local TH_TBL(K,P) of a process descriptor P in a cluster K. This LTID index is allocated by the local process descriptor when the thread is created.

Therefore, the TH_TBL(K,P) thread table for a given process in a given cluster contains only the threads of P placed in cluster K. The set of all threads of a given process is defined by the union of all TH_TBL(K,P) for all active clusters K. To scan the set of all threads of a process P, ALMOS-MKH traverses the COPIES_LIST of all process descriptors associated with process P.

Here is a partial list of the information stored in a thread descriptor (thread_t in ALMOS-MKH):
 * '''TRDID''' : thread identifier.
 * '''TYPE''' : KERNEL / USER / IDLE / RPC.
 * '''FLAGS''' : bit_vector of thread attributes.
 * '''BLOCKED''' : bit_vector of blocking causes.
 * '''SIGNALS''' : bit_vector recording the signals received by the thread.
 * '''PROCESS''' : pointer on the local process descriptor.
 * '''SCHED''' : pointer on the scheduler in charge of this thread.
 * '''CORE''' : pointer on the owner processor core.
 * '''LOCKS_COUNT''' : current number of locks taken by this thread.
 * '''CPU_CONTEXT''' : saves the CPU registers when descheduled.
 * '''FPU_CONTEXT''' : saves the FPU registers when descheduled.
 * '''XLIST''' : member of the global list of threads waiting on the same resource.
 * '''CHILDREN_ROOT''' : root of the global list of children threads.
 * '''CHILDREN_LIST''' : member of the global list of all children of the same parent.
 * etc.

== __3) Process creation__ ==

Process creation in a remote cluster implements the POSIX fork() / exec() mechanism. When a parent process P executes the fork() system call, a new child process C is created. The new process C inherits from the parent process P the open files (FDT), and the memory image (VSL and GPT). These structures must be replicated in the new process descriptor. After a fork(), the process C can execute an exec() system call, which allocates a new memory image to the process C, but the new process can also continue to execute with the inherited memory image. For load balancing, ALMOS-MKH uses the DQDT to create the child process C in a cluster different from the cluster of the parent process P, but the user application can also use the non-standard fork_place() system call to specify the target cluster.

=== 3.1) fork() ===

The fork() system call is the only method to create a new process. A thread of parent process P, running in a cluster X, executes the fork() system call to create a child process C in a remote cluster Y, which becomes both the owner and the reference cluster for the process C. A new process descriptor and a new thread descriptor are created and initialized in the target cluster Y for the child process. The calling thread can run in any cluster.
If the target cluster Y is different from the calling thread cluster X, the calling thread uses an RPC to ask the target cluster Y to do the work, because only the target cluster Y can allocate memory for the new process and thread descriptors.

Regarding the process descriptor, a new PID is allocated in cluster Y. The child process C inherits the vsegs registered in the parent process reference VSL, but the ALMOS-MKH replication policy depends on the vseg type (in what follows, Z denotes the reference cluster of the parent process P):
 * for the '''DATA, MMAP, REMOTE''' vsegs (containing shared, non-replicated data), all vsegs registered in the parent reference VSL(Z,P) are registered in the child reference VSL(Y,C), and all valid GPT entries in the reference parent GPT(Z,P) are copied into the child reference GPT(Y,C). For all pages, the WRITABLE flag is reset and the COW flag is set, in both (parent and child) GPTs. This requires updating all corresponding entries in the parent GPT copies (in clusters other than the reference).
 * for the '''STACK''' vsegs (that are private), only one vseg is registered in the child reference VSL(Y,C). This vseg contains the user stack of the user thread requesting the fork, running in cluster X. All valid GPT entries in the parent GPT(X,P) are copied into the child GPT(Y,C). For all pages, the WRITABLE flag is reset and the COW flag is set, in both (parent and child) GPTs.
 * for the '''CODE''' vsegs (that must be replicated in all clusters containing a thread), all vsegs registered in the reference parent VSL(Z,P) are registered in the child reference VSL(Y,C), but the reference child GPT(Y,C) is not updated by the fork: it will be dynamically updated on demand in case of page fault.
 * for the '''FILE''' vsegs (containing shared memory mapped files), all vsegs registered in the reference parent VSL(Z,P) are registered in the child reference VSL(Y,C), and all valid entries registered in the reference parent GPT(Z,P) are copied into the reference child GPT(Y,C). The COW flag is not set for these shared data.
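
The replication policy above can be summarized as a small decision table. The following sketch is only an illustration of the four cases listed in the text: the enum, the structure, and the function name are hypothetical, and the real kernel code may organize this decision differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Vseg types named in the text; this enum itself is hypothetical. */
typedef enum {
    VSEG_CODE, VSEG_DATA, VSEG_STACK, VSEG_MMAP, VSEG_REMOTE, VSEG_FILE
} vseg_kind_t;

/* What fork() does for one vseg type (hypothetical structure). */
typedef struct {
    bool all_vsegs;  /* true: register every parent vseg; false: only the caller's stack */
    bool copy_gpt;   /* copy the valid parent GPT entries into the child GPT            */
    bool set_cow;    /* reset WRITABLE and set COW in both parent and child GPTs        */
} fork_policy_t;

static inline fork_policy_t fork_policy_for(vseg_kind_t kind)
{
    switch (kind) {
    case VSEG_DATA:
    case VSEG_MMAP:
    case VSEG_REMOTE:
        return (fork_policy_t){ true, true, true };    /* shared data, copy on write  */
    case VSEG_STACK:
        return (fork_policy_t){ false, true, true };   /* only the calling thread stack */
    case VSEG_CODE:
        return (fork_policy_t){ true, false, false };  /* GPT filled later by page faults */
    case VSEG_FILE:
        return (fork_policy_t){ true, true, false };   /* shared mapping, no COW      */
    }
    assert(0 && "unknown vseg kind");
    return (fork_policy_t){ false, false, false };
}
```

Expressing the policy as data makes the asymmetry visible: CODE is the only type whose child GPT is left empty, and FILE is the only type whose GPT entries are copied without setting the COW flag.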

Regarding the thread descriptor, a new TRDID is allocated in cluster Y, and the calling parent thread context (current values stored in the CPU and FPU registers) is saved in the child thread CPU and FPU contexts, to be restored when the child thread is selected for execution. Three CPU context slots are not simple copies of the parent value:
 * the '''thread pointer''' register contains the current thread descriptor address. This '''thread pointer''' register cannot have the same value for parent and child.
 * the '''stack pointer''' register contains the current pointer on the kernel stack. ALMOS-MKH uses a specific kernel stack when a user thread enters the kernel, and this kernel stack is implemented in the thread descriptor. As parent and child cannot use the same kernel stack, the parent kernel stack content is copied to the child kernel stack, and the '''stack pointer''' register cannot have the same value for parent and child.
 * the '''page table pointer''' register contains the physical base address of the current generic page table. As the child GPT is a copy of the parent GPT in the child cluster, this '''page table pointer''' register cannot have the same value for parent and child.

At the end of the fork(), cluster Y is both the owner cluster and the reference cluster for the new process C, which contains one single thread running in cluster Y. All pages of DATA, REMOTE, and MMAP vsegs are marked ''Copy On Write'' in the child process C GPT (cluster Y), and in all copies of the parent process P GPT (all clusters containing a copy of P).

=== 3.2) exec() ===

After a fork() system call, any thread of a process P can execute an exec() system call. This system call forces the process P to execute a new application, while keeping the same PID, the same parent process, the same open file descriptors, and the same environment variables. The existing process P descriptors (both the reference and the copies) and all associated threads are destroyed.
A new process descriptor and a new main thread descriptor are created in the reference cluster, and initialized from values found in the existing process descriptor, and from values contained in the .elf file defining the new application. The calling thread can run in any cluster. If the reference cluster Z for process P is different from the calling thread cluster X, the calling thread must use an RPC to ask the reference cluster Z to do the work.

At the end of the exec() system call, the cluster Z is both the owner and the reference cluster for process P, which contains one single thread in cluster Z.

== __4) Thread creation__ ==

Any thread T of any process P, running in any cluster K, can create a new thread NT in any cluster M. This creation is initiated by the ''pthread_create'' system call. The target cluster M is called the host cluster.
 * The target cluster M can be specified by the user application, using the CXY field of the pthread_attr_t argument. If the CXY is not defined by the user, the target cluster M is selected by the kernel K, using the DQDT.
 * The target core in cluster M can be specified by the user application, using the CORE_LID field of the pthread_attr_t argument. If the CORE_LID is not defined by the user, the target core is selected by the target kernel M.

If the target cluster M is different from the client cluster K, the cluster K sends an RPC_THREAD_USER_CREATE request to cluster M. The argument is a complete pthread_attr_t structure (defined in the ''thread.h'' file in ALMOS-MKH), containing the PID, the function to execute and its arguments, and optionally, the target cluster and target core. This RPC should return the thread TRDID.
 * If the target cluster M does not contain a copy of the process P descriptor, the kernel M creates a process descriptor copy from the reference process P descriptor, using a remote_memcpy(), and using the cluster_get_reference_process_from_pid() function to get the extended pointer on the reference process descriptor. It allocates memory for the associated structures GPT(M,P), VSL(M,P), FDT(M,P). These structures are used as read-only caches, and are dynamically filled by the page faults. This new process descriptor is registered in the COPIES_LIST and in the LOCAL_LIST.
 * When the local process descriptor is set, the kernel M selects the core that will execute the new thread, allocates a TRDID to this thread, creates the thread descriptor, and registers it in the local process descriptor and in the selected core scheduler.

== __5) Thread destruction__ ==

The destruction of a thread T running in cluster K can be caused by another thread, executing the thread_kill() function to request the target thread to stop execution. It can also be caused by the thread T itself, executing the thread_exit() function to terminate itself.

=== 5.1) thread kill ===

The '''thread_kill()''' kernel function is executed by a killer thread that must be different from the target thread, but must be running in the same cluster as the target thread. This function can be called by the ''pthread_cancel'' system call, to destroy one single thread, or by the ''kill'' system call (through the process_kill() function), to destroy all threads of a given process. The killer thread asks the target thread scheduler to do the work by writing in the target thread descriptor, and the target scheduler signals completion to the killer by writing in the killer thread descriptor.
 * To request the kill, the killer thread sets the BLOCKED_GLOBAL bit in the target thread "blocked" field, sets the SIG_KILL bit in the target thread "signals" field, and registers in the target thread "kill_xp" field the extended pointer on the killer thread.
 * If the target thread is running on another core than the killer thread, the killer thread sends an IPI to the core running the target thread, to ask the target scheduler to handle the request. If the target is running on the same core as the killer, the killer thread directly calls the sched_handle_signals() function.
 * In both cases, the sched_handle_signals() function, detecting the SIG_KILL signal, detaches the target thread from the scheduler, detaches the target thread from the local process descriptor, detaches the target thread from the parent thread if it is attached, releases the memory allocated to the target thread descriptor, and atomically decrements the response counter in the killer thread to signal completion.

=== 5.2) thread exit when DETACHED ===

The '''sys_thread_exit()''' kernel function is called by a user thread T executing the ''pthread_exit'' system call to terminate itself. The scenario is rather simple when the thread T is not running in ATTACHED mode.
 * The sys_thread_exit() function sets the SIG_SUICIDE bit in the thread "signals" bit_vector, sets the BLOCKED_GLOBAL bit in the thread "blocked" bit_vector, and deschedules.
 * The scheduler, detecting the SIG_SUICIDE bit, detaches the thread from the scheduler, detaches the thread from the local process descriptor, and releases the memory allocated to the thread descriptor.

=== 5.3) thread exit when ATTACHED ===

The '''sys_thread_exit()''' function is more complex if the finishing thread T is running in ATTACHED mode, because another, possibly remote, thread PT, executing the ''pthread_join'' system call, must be informed of the exit of thread T. As the '''sys_thread_exit()''' and the '''sys_thread_join()''' functions can be executed in any order, this requires a "rendez-vous": the first thread to arrive blocks and deschedules, and must be reactivated by the other thread. This synchronisation uses three specific fields in the thread descriptor: the "join_lock" field is a remote_spin_lock; the "join_value" field contains the exit value returned by the finishing thread T; the "join_xp" field contains an extended pointer on the PT thread that wants to join.
It also uses one specific JOIN_DONE flag in the thread descriptor "flags" field. The scenario is not symmetrical, because the PT thread can access the T thread descriptor at any time, but the T thread cannot access the PT thread descriptor before the pthread_join execution:
 * Both the T thread (executing the sys_thread_exit() function), and the PT thread (executing the sys_thread_join() function) try to take the "join_lock" implemented in the T thread descriptor (the "join_lock" in the PT thread is not used).
 * The T thread registers its exit value in the T thread "join_value" field, and tests the JOIN_DONE flag in the T thread "flags" field:
   * If the JOIN_DONE flag is set, the PT thread arrived first and is blocked: the T thread resets the BLOCKED_EXIT bit in the PT thread (using the extended pointer stored in the "join_xp" field), resets the JOIN_DONE flag, releases the "join_lock" in the T thread, and exits as described in the DETACHED case.
   * If the JOIN_DONE flag is not set, the T thread arrived first: the T thread sets the BLOCKED_JOIN bit in the T thread "blocked" field, releases the "join_lock", and deschedules.
 * The PT thread tests the BLOCKED_JOIN bit in the T thread:
   * If the BLOCKED_JOIN bit is set, the T thread arrived first and is blocked: the PT thread resets the BLOCKED_JOIN bit in the T thread, gets the exit value from the T thread "join_value" field, releases the "join_lock" in the T thread, and continues.
   * If the BLOCKED_JOIN bit is not set, the PT thread arrived first: the PT thread registers its extended pointer in the T thread "join_xp" field, sets the JOIN_DONE flag in the T thread, sets the BLOCKED_EXIT bit in the PT thread "blocked" field, releases the "join_lock" in the T thread, and deschedules.

== __6) Process destruction__ ==

The destruction of process P can be caused by a sys_exit() system call executed by any thread of process P, or by another process executing the sys_kill() system call.
It can also be caused by a Ctrl-C signal typed on the process terminal. In all cases, the work must be done by a thread running in the owner cluster, because all process copies must be involved, and the list of copies is rooted in the owner cluster. It must also be done by an RPC thread, because a thread cannot delete itself.

=== 6.1) parent / child synchronization ===

The process descriptor copies (i.e. other than the reference process descriptor) are simply deleted by the scheduler when the last thread of a given process in a given cluster is deleted. The copy is then removed from the list of copies, and nothing more is required.

The reference process destruction is more complex, because the child process destruction must be reported to the parent process when the parent process executes the blocking sys_wait() system call. Therefore, the child process destruction cannot be done before the parent calls the sys_wait() function. As the '''sys_wait()''' function and the '''sys_kill() / sys_exit()''' functions are executed by different threads running in different clusters, this requires a parent/child synchronization. After a sys_kill() or sys_exit(), all child process threads and all process copies are immediately destroyed, but the reference child process must be kept in the ''zombie'' state if the sys_wait() syscall has not been executed.

The synchronization uses the '''term_state''' field in the reference child process descriptor:
 * the PROCESS_FLAG_KILL flag indicates that a KILL request has been received by the child;
 * the PROCESS_FLAG_EXIT flag indicates that an EXIT request has been made by the child;
 * the PROCESS_FLAG_BLOCK flag indicates that a SIGSTOP signal has been received by the child;
 * the PROCESS_FLAG_WAIT flag indicates that a WAIT request from the parent has been received by the child and reported to the parent;
 * moreover, the sys_exit() argument is registered in the ''term_state'' field of the process descriptor.
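
As an illustration, a single ''term_state'' word can hold both the flags and the sys_exit() argument. The sketch below is an assumption, not the ALMOS-MKH definition: the flag names come from the text, but the bit positions, the choice of the low byte for the exit status, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Flag names come from the text; the bit positions below are hypothetical. */
#define PROCESS_FLAG_KILL   0x100u
#define PROCESS_FLAG_EXIT   0x200u
#define PROCESS_FLAG_BLOCK  0x400u
#define PROCESS_FLAG_WAIT   0x800u

/* Assumption: the low byte of term_state holds the sys_exit() argument. */
#define TERM_STATUS_MASK    0x0FFu

/* Record an exit request together with its status (hypothetical helper). */
static inline uint32_t term_state_on_exit(uint32_t ts, int status)
{
    return (ts & ~TERM_STATUS_MASK) | PROCESS_FLAG_EXIT
         | ((uint32_t)status & TERM_STATUS_MASK);
}

/* Record a kill request (hypothetical helper). */
static inline uint32_t term_state_on_kill(uint32_t ts)
{
    return ts | PROCESS_FLAG_KILL;
}

/* True when the child has terminated, by exit or by kill. */
static inline int term_state_terminated(uint32_t ts)
{
    return (ts & (PROCESS_FLAG_KILL | PROCESS_FLAG_EXIT)) != 0;
}

/* Extract the exit status stored by term_state_on_exit(). */
static inline int term_state_exit_status(uint32_t ts)
{
    return (int)(ts & TERM_STATUS_MASK);
}
```

In a real kernel these updates would have to be atomic, since the field is written by threads running in different clusters.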

The actual deletion of the reference process is always caused by the sys_wait() function, generally using an RPC:
 * If the sys_wait() arrives first, the corresponding flag is atomically set in the child reference process, and the parent main thread executing the sys_wait() is blocked. This parent thread is unblocked when a ''kill'' or ''exit'' is received by the child, and it then deletes the child reference process.
 * If the sys_kill() or sys_exit() arrives first, the corresponding flag is atomically set in the child reference process. When this flag is detected by the parent main thread executing the sys_wait(), that thread deletes the reference process descriptor.

=== 6.2) detailed destruction scenario ===

 1. Both the sys_kill() and sys_exit() system calls use the rpc_process_make_kill_client() function to ask an RPC thread to execute the process_make_kill() function in the owner cluster (because a thread cannot kill itself).
 1. The process_make_kill() function in the owner cluster sends a multicast and parallel RPC to all clusters containing a copy of the process. In each cluster, the process_block_threads() function sets the BLOCKED_GLOBAL bit for all threads of the process. It returns only when all threads are blocked and descheduled.
 1. When the process_make_kill() function in the owner cluster has received all expected responses to the first multicast RPC, it sends another multicast and parallel RPC to all clusters containing a copy of the process. In each cluster, the process_delete_threads() function releases the memory allocated to the local threads. The local process descriptor is also destroyed, if it is not the reference process descriptor.
 1. When the process_make_kill() function in the owner cluster has received all expected responses to the second multicast RPC, it updates the owner cluster manager to remove the process from the set of owned processes.
 1. Finally, the process_make_kill() function synchronizes with the parent process, and the parent sys_wait() function deletes the reference process descriptor.
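
The ordering constraints of this scenario can be sketched with a small sequential model. This is not the kernel code: the real implementation uses parallel multicast RPCs between clusters, whereas the model below uses a plain array of process copies, and all type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_COPIES 8

/* One process descriptor copy in one cluster (hypothetical model). */
typedef struct {
    int  threads;       /* live threads of the process in this cluster */
    bool blocked;       /* all local threads blocked and descheduled   */
    bool is_reference;  /* this copy is the reference descriptor       */
    bool present;       /* descriptor still allocated                  */
} copy_model_t;

typedef struct {
    copy_model_t copies[MAX_COPIES];
    size_t       nb_copies;
    bool         owned;   /* still registered in the owner cluster manager   */
    bool         zombie;  /* reference descriptor waiting for the sys_wait() */
} process_model_t;

/* Models steps 2 to 5 of the scenario, as executed in the owner cluster. */
static inline void model_make_kill(process_model_t *p)
{
    assert(p->nb_copies <= MAX_COPIES);

    /* First multicast: block all threads in every cluster. */
    for (size_t i = 0; i < p->nb_copies; i++)
        p->copies[i].blocked = true;

    /* Second multicast, only after all responses to the first one:
       release the threads, and destroy the non-reference descriptors. */
    for (size_t i = 0; i < p->nb_copies; i++) {
        assert(p->copies[i].blocked);
        p->copies[i].threads = 0;
        if (!p->copies[i].is_reference)
            p->copies[i].present = false;
    }

    p->owned  = false;  /* removed from the set of owned processes */
    p->zombie = true;   /* reference descriptor kept for the parent */
}

/* Models the parent sys_wait(): it deletes the reference descriptor. */
static inline void model_parent_wait(process_model_t *p)
{
    if (!p->zombie)
        return;  /* a real sys_wait() would block here */
    for (size_t i = 0; i < p->nb_copies; i++)
        if (p->copies[i].is_reference)
            p->copies[i].present = false;
    p->zombie = false;
}
```

The two separate loops reflect the two multicast RPCs: no thread descriptor is released before every thread of the process is blocked in every cluster, and the reference descriptor survives until the parent has collected the termination state.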