= GIET_VM / User Applications =

[[PageOutline]]

The following applications use the GIET_VM [wiki:library_stdio system calls] and [wiki:user_libraries user libraries]. The multi-threaded applications use the POSIX threads API.

== __shell__ ==

This single-thread interactive application can be used to handle the FAT32 file system, or to dynamically activate or de-activate other applications. When this application is mapped on the target architecture, it is automatically launched at the end of the boot phase. The list of available commands can be obtained with the ''help'' command.

It requires one private TTY terminal.

The source code can be found [source:soft/giet_vm/applications/shell/shell.c here], and the mapping directives are defined [source:soft/giet_vm/applications/shell/shell.py here].

== __display__ ==

This single-thread application illustrates the use of various peripherals, such as the IOC (external block device) or the CMA (chained buffer DMA) peripheral, to display a stream of images. The application reads a stream of images from the ''/misc/images_128.raw'' file stored on the FAT32 disk, and displays it on the FBF (graphical display) peripheral. The ''images_128.raw'' file contains 20 images of 128 lines * 128 pixels, with 1 byte per pixel.

It requires one private TTY terminal.

The source code can be found [source:soft/giet_vm/applications/display/display.c here], and the mapping directives are defined [source:soft/giet_vm/applications/display/display.py here].

== __coproc__ ==

This single-thread application illustrates the use of hardware accelerators by a user application. The hardware coprocessor must be connected to the system by a ''vci_mwmr_dma'' component. In this application, the coprocessor computes the Greatest Common Divisor between two vectors of randomly generated 32-bit integers. The vector size is a parameter.

It requires one private TTY terminal.

The source code can be found [source:soft/giet_vm/applications/coproc/coproc.c here], and the mapping directives are defined [source:soft/giet_vm/applications/coproc/coproc.py here].

== __sort__ ==

This first multi-threaded application is a very simple parallel sort. The input is an array of randomly generated integers. The size of this array is a parameter that must be a multiple of the number of threads. It can run on a multi-processor, multi-cluster architecture, with one thread per processor core.

It requires one TTY terminal, shared by all threads.

The source code can be found [source:soft/giet_vm/applications/sort/sort.c here], and the mapping directives are defined [source:soft/giet_vm/applications/sort/sort.py here].

== __transpose__ ==

This multi-threaded application is typical of the parallelism that can be exploited in low-level image processing. It asks the user to enter the name of a file containing an image stored on the FAT32 disk, checks that the selected image fits the frame buffer size, transposes the image (X <-> Y), displays the result on the graphical display, and saves the transposed image to the FAT32 disk.

It can run on a multi-processor, multi-cluster architecture, with one thread per processor core. The total number of threads depends on the hardware architecture, and is computed as (x_size * y_size * nprocs).

The main() function is executed by the thread running on P[0,0,0]. It makes several initializations, launches all other threads (using the pthread_create() function), and calls the execute() function. When main() returns from execute(), it uses the pthread_join() function to detect application completion.
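As an illustration, here is a minimal sketch of this create / execute / join structure with the POSIX threads API; NTHREADS and the execute() body are hypothetical placeholders, not the actual transpose code:

{{{
#!c
/* Minimal sketch of the create / execute / join structure described
   above, using the POSIX threads API. NTHREADS and the execute() body
   are placeholders, not the actual transpose code. */

#include <pthread.h>

#define NTHREADS 16     /* assumption: stands for (x_size * y_size * nprocs) */

void *execute( void *arg )
{
    /* each thread transposes its share of the image lines : body omitted */
    return NULL;
}

int main( void )
{
    pthread_t     trdid[NTHREADS];
    unsigned long n;

    /* initializations (buffers, barrier, ...) omitted */

    /* launch all other threads */
    for ( n = 1 ; n < NTHREADS ; n++ )
        pthread_create( &trdid[n], NULL, &execute, (void *)n );

    /* the main thread takes part in the parallel work */
    execute( (void *)0 );

    /* detect application completion */
    for ( n = 1 ; n < NTHREADS ; n++ )
        pthread_join( trdid[n], NULL );

    return 0;
}
}}}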
All other threads directly execute the execute() function. Each execute() call handles exactly (image_size / nthreads) lines. The input and output buffers containing the source and transposed images are allocated from the user heap, distributed in all clusters: there are (image_size / nclusters) lines per cluster. Therefore, the data reads are mostly local, but the data writes are mostly remote.

The number of clusters must be a power of 2 no larger than 256. The number of processors per cluster must be a power of 2 no larger than 4.

It requires one TTY terminal, shared by all threads.

The source code can be found [source:soft/giet_vm/applications/transpose/transpose.c here], and the mapping is defined [source:soft/giet_vm/applications/transpose/transpose.py here].

== __convol__ ==

This multi-threaded application is a medical image processing application. It implements a 2D convolution product, used to remove some noise artifacts. The image, provided by the Philips company, is 1024 * 1024 pixels, with 2 bytes per pixel. It is stored on the FAT32 disk in ''/misc/philips_image_1024.raw''. The convolution kernel is 201 * 35 pixels, but it can be factored into two independent line and column convolution products, requiring two intermediate image transpositions. The five buffers containing the intermediate images are distributed in all clusters.

It can run on a multi-processor, multi-cluster architecture, with one thread per processor. The main() function can be executed on any processor P[x,y,p]. It makes the initialisations, launches the (N-1) other threads to run the execute() function on the (N-1) other processors, calls the execute() function itself, and finally calls the instrument() function to display the instrumentation results when the parallel execution is completed.

The number of clusters containing processors must be a power of 2 no larger than 256. The number of processors per cluster must be a power of 2 no larger than 8.

It requires one TTY terminal, shared by all threads.

The source code can be found [source:soft/giet_vm/applications/convol/convol.c here], and the mapping is defined [source:soft/giet_vm/applications/convol/convol.py here].

== __gameoflife__ ==

This multi-threaded application is an emulation of the Game of Life automaton. The world size is defined by the frame buffer width and height. It can run on a multi-processor, multi-cluster architecture:
 * If the number of processors is larger than the number of lines, the number of threads is equal to the number of lines, and each thread processes one single line.
 * If the number of processors is not larger than the number of lines, the number of threads is equal to the number of processors, and each thread processes height/nthreads (or height/nthreads + 1) lines (see the sketch at the end of this section).

The thread running on processor P[0,0,0] executes the main() function, which initialises the barrier, the TTY terminal and the CMA controller, and launches the other threads, before calling the execute() function. The other threads just run the execute() function.

The total number of clusters cannot be larger than 16 * 16. The number of processors per cluster cannot be larger than 4.

It uses one TTY terminal, shared by all threads.

The source code can be found [source:soft/giet_vm/applications/gameoflife/gameoflife.c here], and the mapping is defined [source:soft/giet_vm/applications/gameoflife/gameoflife.py here].
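As an illustration, here is a minimal sketch of the thread/line assignment described above; HEIGHT, NPROCS_TOTAL and the exact distribution of the remainder lines are assumptions, not the actual gameoflife code:

{{{
#!c
/* Minimal sketch of the gameoflife thread/line assignment. HEIGHT and
   NPROCS_TOTAL are hypothetical names standing for the frame buffer
   height and the total number of processors. */

#define HEIGHT        256   /* assumption: frame buffer height      */
#define NPROCS_TOTAL  64    /* assumption: total processor count    */

/* never more threads than lines */
unsigned int nthreads = ( NPROCS_TOTAL > HEIGHT ) ? HEIGHT : NPROCS_TOTAL;

/* lines handled by thread tid: height/nthreads, with one extra line
   for the first (height % nthreads) threads (assumed distribution) */
unsigned int lines_for( unsigned int tid )
{
    unsigned int base  = HEIGHT / nthreads;
    unsigned int extra = ( tid < (HEIGHT % nthreads) ) ? 1 : 0;
    return base + extra;
}
}}}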
== __classif__ ==

This multi-threaded application takes a stream of Gigabit Ethernet packets, and makes packet analysis and classification, based on the source MAC address. It uses the multi-channel NIC peripheral and the chained buffer DMA controller to receive and send packets on the Gigabit Ethernet port.

It can run on architectures containing up to 256 clusters, and up to 8 processors per cluster: one task per processor. It requires one TTY terminal, shared by all threads.

This application is described as a TCG (Task and Communication Graph) containing (N+2) tasks per cluster: one '''load''' task, one '''store''' task, and N '''analyse''' tasks. Packets are stored in containers: each container has a fixed size of 4 Kbytes and can contain from 2 to 60 packets. These containers are distributed in clusters:
 * one RX container per cluster (part of the kernel rx_chbuf), in the kernel heap.
 * one TX container per cluster (part of the kernel tx_chbuf), in the kernel heap.
 * N working containers per cluster (one per "analyse" task), in the user heap.

In each cluster, the "load", "analyse" and "store" tasks communicate through three local MWMR FIFOs:
 * fifo_l2a : transfers a full container from the "load" task to an "analyse" task.
 * fifo_a2s : transfers a full container from an "analyse" task to the "store" task.
 * fifo_s2l : transfers an empty container from the "store" task to the "load" task.

For each FIFO, one item is a 32-bit word defining the index of an available working container. The pointers on the working containers and the pointers on the MWMR FIFOs are defined by global arrays stored in cluster[0][0]. The MWMR FIFO descriptors array is defined as a global variable in cluster[0][0].

Initialisation is done in three steps by the "load" and "store" tasks:
 1. The "load" task in cluster[0][0] initialises the heaps in all clusters. The other tasks wait on the global_sync synchronisation variable.
 2. The "load" task in cluster[0][0] initialises the barrier between all "load" tasks, allocates the NIC & CMA RX channels, and starts the NIC_CMA RX transfer. The other "load" tasks wait on the load_sync synchronisation variable. The "store" task in cluster[0][0] initialises the barrier between all "store" tasks, allocates the NIC & CMA TX channels, and starts the NIC_CMA TX transfer. The other "store" tasks wait on the store_sync synchronisation variable.
 3. When this global initialisation is completed, the "load" task in each cluster allocates the working containers and the MWMR FIFO descriptors from the local user heap. In each cluster, the "analyse" and "store" tasks wait for the local initialisation completion on the local_sync[x][y] variables.

When initialisation is completed, all tasks loop on containers (the "analyse" loop is sketched after this list):
 1. The "load" task gets an empty working container from fifo_s2l, transfers one container from the kernel rx_chbuf to this user container, and transfers ownership of this container to one "analyse" task by writing into fifo_l2a.
 2. The "analyse" task gets one working container from fifo_l2a, analyses each packet header, computes the packet type (depending on the SRC MAC address), increments the corresponding classification counter, and transposes the SRC and DST MAC addresses for TX transmission.
 3. The "store" task gets a full working container from fifo_a2s, transfers the container content to the kernel tx_chbuf, and transfers ownership of this empty container back to the "load" task by writing into fifo_s2l.
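As an illustration, here is a minimal sketch of the "analyse" task loop, assuming the blocking mwmr_read() / mwmr_write() functions of the GIET_VM MWMR user library; the container[] array and the analyse_container() helper are hypothetical names, not the actual classif code:

{{{
#!c
/* Minimal sketch of the "analyse" task loop. fifo_l2a / fifo_a2s are
   the local MWMR FIFOs described above; container[] and
   analyse_container() are hypothetical names. */

#include "mwmr_channel.h"

extern mwmr_channel_t *fifo_l2a;      /* full containers from "load"    */
extern mwmr_channel_t *fifo_a2s;      /* full containers to "store"     */
extern unsigned int   *container[];   /* working containers (user heap) */

/* hypothetical helper: classifies packets and swaps SRC/DST MAC */
void analyse_container( unsigned int *cont );

void analyse_task( void )
{
    unsigned int index;   /* index of an available working container */

    while ( 1 )
    {
        /* get ownership of a full working container */
        mwmr_read( fifo_l2a, &index, 1 );

        /* analyse packet headers and update classification counters */
        analyse_container( container[index] );

        /* pass the full container to the "store" task */
        mwmr_write( fifo_a2s, &index, 1 );
    }
}
}}}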
The instrumentation results are displayed by the "store" task in cluster[0][0] when all "store" tasks have handled the number of containers specified by the CONTAINERS_MAX parameter.

The source code can be found [source:soft/giet_vm/applications/classif/classif.c here], and the mapping is defined [source:soft/giet_vm/applications/classif/classif.py here].

== __mjpeg__ ==

This multi-threaded application decompresses an MJPEG bit-stream contained in a file, and displays the stream of images on the frame buffer. It illustrates "multi pipe-line" parallelism: each image is decompressed by a five-stage pipe-line implemented as five POSIX threads. Several images can be decompressed in parallel, as each cluster implements a complete pipe-line.

It uses the message passing programming model, on top of the POSIX threads API and the MWMR communication middleware. The application is described as a TCG (Task and Communication Graph), and all communications between threads use MWMR channels. It uses the chained buffer DMA component to display the stream of decompressed images.

It contains 6 types of threads (plus the "main" thread, which makes the initialisation), and 7 types of MWMR communication channels:
 * The TG thread dispatches the bit-stream to the pipe-lines. It is only mapped in cluster[0,0].
 * The 5 threads implementing the pipe-line (DEMUX, VLD, IQZZ, IDCT, LIBU) are replicated in all clusters.
 * The 7 MWMR channels are replicated in all clusters.

The image throughput is actually bounded by the TG thread, which cannot be parallelized. As the MWMR communication channels support communication between software threads and hardware accelerators, the IDCT software thread can optionally be replaced by a hardware DCT coprocessor, if this component is available in the target architecture.

The hardware constraints are the following:
 * The number of clusters cannot be larger than 16 * 16.
 * The number of processors per cluster cannot be larger than 4.
 * The frame buffer size must fit the decompressed image size.
 * It uses one TTY terminal, shared by all tasks.

All parameters (number of images, depths of communication channels, debug variables) are defined in the [source:soft/giet_vm/applications/mjpeg/mjpeg.h mjpeg.h] file.

The source code can be found [source:soft/giet_vm/applications/mjpeg/mjpeg.c here], and the mapping is defined [source:soft/giet_vm/applications/mjpeg/mjpeg.py here].

== __raycast__ ==

This multi-threaded application implements a video game requiring 3D image synthesis. The gamer can dynamically explore a maze, and the gamer's vision (3D images) depends interactively on the gamer's moves.

It can run on a multi-processor, multi-cluster architecture, with one thread per processor, and can use any value for the frame buffer size (width * height) associated to the graphical display, as it does not use any pre-existing images. It uses the ''chained buffer DMA'' peripheral to speed up the display, but the heaviest part of the computation is the image synthesis.

After each gamer move, a new image is displayed. For a given image, the columns of pixels can be built in parallel by several threads running the same render() function for a given column. The number of threads is independent of the number of columns (image width), because the load is dynamically balanced between threads by a job allocator, until all columns of a given image have been handled (see the sketch below).

It requires one TTY terminal, shared by all threads.
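As an illustration, here is a minimal sketch of such a dynamic job allocator, using an atomic fetch-and-add on a shared column counter; the names (next_column, WIDTH, render_column()) are assumptions, not the actual raycast code:

{{{
#!c
/* Minimal sketch of a dynamic job allocator for pixel columns.
   next_column, WIDTH and render_column() are hypothetical names. */

#define WIDTH  640                      /* assumption: frame buffer width */

volatile unsigned int next_column = 0;  /* shared job allocator counter */

void render_column( unsigned int col ); /* hypothetical: builds one column */

/* each thread runs this loop: columns are allocated one by one,
   so the load is balanced whatever the number of threads */
void render( void )
{
    unsigned int col;

    while ( (col = __sync_fetch_and_add( &next_column, 1 )) < WIDTH )
        render_column( col );
}
}}}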
The source code can be found [source:soft/giet_vm/applications/raycast/raycast.c here], and the mapping is defined [source:soft/giet_vm/applications/raycast/raycast.py here].

== __router__ ==

This multi-threaded application emulates a network processing application, such as an Ethernet router. All communications between threads use the MWMR (multi-writer/multi-reader) middleware. The application is described as a TCG (Task and Communication Graph):
 * The number N of threads is (x_size * y_size * nprocs): nprocs threads per cluster.
 * There is one producer() thread, one consumer() thread, and (N-2) compute() threads.
 * The number M of MWMR channels is (2 * x_size * y_size): one input and one output channel per cluster.

It can run on a multi-processor, multi-cluster architecture, with one thread per processor. In this implementation, only integer tokens are transferred between threads, but each token can be interpreted as a job descriptor:
 * The main() thread, running on P[0,0,0], makes the initializations, launches the N other threads, and exits.
 * The producer() thread, running on P[0,0,0], continuously tries to write tokens into the M distributed input channels, using a non-blocking write function.
 * The consumer() thread, running on P[0,0,1], continuously tries to read tokens from the M distributed output channels, using a non-blocking read function.
 * The (N-2) compute() threads, running on all other processors, continuously read tokens from the local input channel, and write the same tokens to the local output channel, after a random delay emulating a variable processing time. They use blocking access functions (see the sketch at the end of this section).

It requires one TTY terminal, shared by all threads.

The source code can be found [source:soft/giet_vm/applications/router/router.c here], and the mapping is defined [source:soft/giet_vm/applications/router/router.py here].
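As an illustration, here is a minimal sketch of the compute() loop, assuming the blocking mwmr_read() / mwmr_write() functions of the GIET_VM MWMR user library; the channel pointers and the rand_delay() helper are hypothetical names, not the actual router code:

{{{
#!c
/* Minimal sketch of the router compute() loop. input_channel,
   output_channel and rand_delay() are hypothetical names. */

#include "mwmr_channel.h"

extern mwmr_channel_t *input_channel;    /* local cluster input channel  */
extern mwmr_channel_t *output_channel;   /* local cluster output channel */

void rand_delay( void );                 /* hypothetical: random busy wait */

void compute( void )
{
    unsigned int token;

    while ( 1 )
    {
        /* blocking read of one token from the local input channel */
        mwmr_read( input_channel, &token, 1 );

        /* emulate a variable processing time */
        rand_delay();

        /* blocking write of the same token to the local output channel */
        mwmr_write( output_channel, &token, 1 );
    }
}
}}}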