Version 22 (modified by alain, 9 years ago)

GIET_VM / User Applications

The following applications use the GIET_VM system calls and user libraries. The multi-threaded applications use the POSIX threads API.

shell

This single-threaded interactive application can be used to handle the FAT32 file system, or to dynamically activate or deactivate other applications. When this application is mapped on the target architecture, it is automatically launched at the end of the boot phase. The list of available commands can be obtained with the help command.

It requires one private TTY terminal.

The source code can be found here, and the mapping directives are defined here.

display

This single-threaded application illustrates the use of various peripherals, such as the IOC (external block device) and the CMA (chained buffer DMA) peripherals, to display a stream of images. The application reads a stream of images from the /misc/images_128.raw file, stored on the FAT32 disk, and displays it on the FBF (graphical display) peripheral. The images_128.raw file contains 20 images of 128 lines * 128 pixels, with 1 byte per pixel.
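As a rough illustration of the file layout described above, the following C sketch computes where each image starts in the file. The macro and function names are hypothetical and are not taken from the application source.

```c
#include <stddef.h>

/* Geometry stated in the text: 20 images, 128 lines of 128 pixels, 1 byte per pixel. */
#define NB_IMAGES        20
#define NB_LINES         128
#define NB_PIXELS        128
#define BYTES_PER_PIXEL  1

/* Size in bytes of one image (hypothetical helper). */
size_t image_bytes(void)
{
    return (size_t)NB_LINES * NB_PIXELS * BYTES_PER_PIXEL;
}

/* Byte offset of image number `index`, counted from the start of the file. */
size_t image_offset(size_t index)
{
    return index * image_bytes();
}

/* Total size in bytes of the whole image stream. */
size_t stream_bytes(void)
{
    return NB_IMAGES * image_bytes();
}
```

With these values, one image occupies 16384 bytes, so the whole stream fits in 320 Kbytes.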

It requires one private TTY terminal.

The source code can be found here, and the mapping directives are defined here.

coproc

This single-threaded application illustrates the use of hardware accelerators by a user application. The hardware coprocessor must be connected to the system by a vci_mwmr_dma component. In this application, the coprocessor computes the Greatest Common Divisor (GCD) of two vectors of randomly generated 32-bit integers. The vector size is a parameter.
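A software reference can be useful to check the values returned by such a coprocessor. The sketch below assumes that the GCD is computed element by element on the two vectors, which the text does not state explicitly. The function names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Greatest common divisor of two unsigned 32-bit integers (Euclid's algorithm).
 * By convention gcd32(a, 0) returns a. */
uint32_t gcd32(uint32_t a, uint32_t b)
{
    while (b != 0) {
        uint32_t r = a % b;
        a = b;
        b = r;
    }
    return a;
}

/* Element-wise GCD of two vectors: res[i] = gcd32(opa[i], opb[i]).
 * The element-wise pairing is an assumption of this sketch. */
void gcd_vectors(const uint32_t *opa, const uint32_t *opb,
                 uint32_t *res, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        res[i] = gcd32(opa[i], opb[i]);
    }
}
```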

It requires one private TTY terminal.

The source code can be found here, and the mapping directives are defined here.

sort

This first multi-threaded application is a very simple parallel sort. The input is an array of randomly generated integers. The size of this array is a parameter, and must be a multiple of the number of threads.

It can run on a multi-processor, multi-cluster architecture, with one thread per processor core.

It requires one TTY terminal, shared by all threads.

The source code can be found here, and the mapping directives are defined here.

transpose

This multi-threaded application is typical of parallelism that can be exploited in low-level image processing.

It asks the user to enter the name of a file containing an image stored on the FAT32 disk, checks that the selected image fits the frame buffer size, transposes the image (X <-> Y), displays the result on the graphical display, and saves the transposed image to the FAT32 disk. A compilation flag allows the user to skip this interactive section and use default input and output files.

It can run on a multi-processor, multi-cluster architecture, with one thread per processor core. The total number of threads depends on the hardware architecture, and is computed as (x_size * y_size * nprocs). The main() function is executed by the thread running on P[0,0,0]. It performs several initializations, launches all other threads (using the pthread_create() function), and calls the execute() function. When execute() returns, main() uses the pthread_join() function to detect application completion. All other threads run the execute() function. Each instance of execute() handles exactly (image_size / nthreads) lines.
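The launch and join pattern described above can be sketched with the standard POSIX threads API. This is a minimal sketch, not the application code: the platform dimensions, the image size, the lines_done array and the run_all() function are hypothetical, and the placement of threads on processors (done by the mapping in GIET_VM) is not shown.

```c
#include <pthread.h>
#include <stddef.h>

#define X_SIZE      2    /* hypothetical platform: 2 * 2 clusters */
#define Y_SIZE      2
#define NPROCS      2    /* processors per cluster */
#define NTHREADS    (X_SIZE * Y_SIZE * NPROCS)
#define IMAGE_SIZE  128  /* hypothetical image of 128 lines */

/* One counter per thread, recording how many lines it handled. */
unsigned int lines_done[NTHREADS];

/* Work function run by every thread, including the main one. */
void *execute(void *arg)
{
    unsigned int *my_counter = arg;
    unsigned int  nlines = IMAGE_SIZE / NTHREADS;   /* lines per thread */

    for (unsigned int l = 0; l < nlines; l++) {
        (*my_counter)++;    /* placeholder for the real per-line work */
    }
    return NULL;
}

/* Body of main(): launch the other threads, run execute(), then join.
 * Returns the total number of lines handled, or 0 on error. */
unsigned int run_all(void)
{
    pthread_t    trdid[NTHREADS];
    unsigned int total = 0;

    for (int n = 0; n < NTHREADS; n++) lines_done[n] = 0;

    /* thread 0 is the main thread itself, so only NTHREADS-1 are created */
    for (int n = 1; n < NTHREADS; n++) {
        if (pthread_create(&trdid[n], NULL, execute, &lines_done[n]) != 0) return 0;
    }
    execute(&lines_done[0]);

    /* completion is detected by joining every created thread */
    for (int n = 1; n < NTHREADS; n++) {
        if (pthread_join(trdid[n], NULL) != 0) return 0;
    }
    for (int n = 0; n < NTHREADS; n++) total += lines_done[n];
    return total;
}
```

Each thread writes only its own counter, so no lock is needed in this sketch.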

The buf_in[x,y] and buf_out[x,y] buffers containing the direct and transposed images are distributed in clusters: in each cluster[x,y], the thread running on processor P[x,y,0] uses the giet_fat_mmap() function to map the buf_in[x,y] and buf_out[x,y] buffers, each containing a set of lines. Then, all threads in cluster[x,y] read pixels from the local buf_in[x,y] buffer, and write these pixels to the remote buf_out[x,y] buffers. Finally, each thread displays a part of the transposed image on the frame buffer. There are (image_size / number of clusters) lines per cluster. Therefore, the reads are local, but the writes are mostly remote.

  • The image size must fit the frame buffer width and height, which must be powers of 2.
  • The number of clusters must be a power of 2 no larger than 256.
  • The number of processors per cluster must be a power of 2 no larger than 4.
  • The number of clusters cannot be larger than (image_size * image_size) / 4096, because the size of buf_in[x,y] and buf_out[x,y] must be a multiple of 4096.
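The constraints listed above can be checked before launching the application. The following C sketch applies them as written, for a square image of image_size * image_size pixels. The function names are hypothetical.

```c
#include <stdbool.h>

/* True when n is a non-zero power of 2. */
bool is_pow2(unsigned int n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Check the constraints listed above (hypothetical helper). */
bool transpose_config_ok(unsigned int image_size,
                         unsigned int nclusters,
                         unsigned int nprocs)
{
    if (!is_pow2(image_size)) return false;
    if (!is_pow2(nclusters) || nclusters > 256) return false;
    if (!is_pow2(nprocs) || nprocs > 4) return false;
    /* each per-cluster buffer must be a multiple of 4096 */
    if (nclusters > (image_size * image_size) / 4096) return false;
    return true;
}
```

For example, a 128 * 128 image allows at most 4 clusters, and a 256 * 256 image at most 16.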

It requires one TTY terminal, shared by all threads.

The source code can be found here, and the mapping is defined here.

The transpose.c file contains a variant that uses the giet_fat_read() and giet_fat_write() system calls to access the files.

convol

This multi-threaded application is a medical image processing application.

It implements a 2D convolution product, used to remove some noise artifacts. The image, provided by the Philips company, is 1024 * 1024 pixels, with 2 bytes per pixel. It is stored on the FAT32 disk in /misc/philips_image_1024.raw. The convolution kernel is [201]*[35] pixels, but it can be factored into two independent line and column convolution products, requiring two intermediate image transpositions. The five buffers containing the intermediate images are distributed in all clusters.
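The factoring of a 2D kernel into a line pass and a column pass can be illustrated on a small integer image. This is a minimal sketch under stated assumptions: the image dimensions, the kernel coefficients, and the zero value used outside the image are all hypothetical, and the column pass is applied directly rather than through the image transpositions used by the application. The convol_direct() function is only there to compare both approaches.

```c
#define W 8   /* hypothetical image width  */
#define H 6   /* hypothetical image height */

/* Line pass: weighted sum of the (2*hw + 1) horizontal neighbours.
 * Pixels outside the image are treated as zero. */
void convol_lines(const int *in, int *out, const int *kh, int hw)
{
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            int sum = 0;
            for (int i = -hw; i <= hw; i++) {
                int xx = x + i;
                if (xx >= 0 && xx < W) sum += kh[i + hw] * in[y * W + xx];
            }
            out[y * W + x] = sum;
        }
    }
}

/* Column pass: weighted sum of the (2*vw + 1) vertical neighbours. */
void convol_columns(const int *in, int *out, const int *kv, int vw)
{
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            int sum = 0;
            for (int j = -vw; j <= vw; j++) {
                int yy = y + j;
                if (yy >= 0 && yy < H) sum += kv[j + vw] * in[yy * W + x];
            }
            out[y * W + x] = sum;
        }
    }
}

/* Direct 2D pass with the product kernel kv[j] * kh[i], for comparison. */
void convol_direct(const int *in, int *out,
                   const int *kh, int hw, const int *kv, int vw)
{
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            int sum = 0;
            for (int j = -vw; j <= vw; j++) {
                for (int i = -hw; i <= hw; i++) {
                    int yy = y + j;
                    int xx = x + i;
                    if (yy >= 0 && yy < H && xx >= 0 && xx < W)
                        sum += kv[j + vw] * kh[i + hw] * in[yy * W + xx];
                }
            }
            out[y * W + x] = sum;
        }
    }
}
```

Running convol_lines() followed by convol_columns() gives the same result as convol_direct(), but each pixel costs a number of multiplications equal to the sum of the two kernel dimensions instead of their product.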

It can run on a multi-processor, multi-cluster architecture, with one thread per processor. The main() function can be executed on any processor P[x,y,p]. It performs the initialisations, launches the (N-1) other threads to run the execute() function on the (N-1) other processors, calls the execute() function itself, and finally calls the instrument() function to display instrumentation results when the parallel execution is completed.

The number of clusters containing processors must be a power of 2 no larger than 256. The number of processors per cluster must be a power of 2 no larger than 8.

It requires one TTY terminal, shared by all threads.

The source code can be found here, and the mapping is defined here.

gameoflife

This multi-threaded application is an emulation of the Game of Life automaton. The world size is defined by the Frame Buffer width and height.

It can run on a multi-processor, multi-cluster architecture.

  • If the number of processors is larger than the number of lines, the number of threads is equal to the number of lines, and each thread processes one single line.
  • If the number of processors is not larger than the number of lines, the number of threads is equal to the number of processors, and each thread processes height/nthreads (or height/nthreads + 1) lines.
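The two cases above can be expressed as a small C sketch. The function names are hypothetical, and the choice of which threads receive the extra line when height is not a multiple of nthreads is an assumption of this sketch: the text only says that some threads handle one more line.

```c
/* Number of threads: one per line when there are more processors than
 * lines, one per processor otherwise. */
unsigned int nb_threads(unsigned int nprocs_total, unsigned int height)
{
    return (nprocs_total > height) ? height : nprocs_total;
}

/* First line and number of lines handled by thread `tid`.
 * Assumption: the (height % nthreads) first threads get one extra line. */
void thread_lines(unsigned int tid, unsigned int nthreads, unsigned int height,
                  unsigned int *first, unsigned int *count)
{
    unsigned int base  = height / nthreads;
    unsigned int extra = height % nthreads;

    *count = base + (tid < extra ? 1 : 0);
    *first = tid * base + (tid < extra ? tid : extra);
}
```

With this distribution, the line ranges of successive threads are contiguous and cover every line exactly once.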

The thread running on processor P[0,0,0] executes the main() function, which initialises the barrier, the TTY terminal, and the CMA controller, and launches the other threads before calling the execute() function. The other threads just run the execute() function. The total number of clusters cannot be larger than 16 * 16. The number of processors per cluster cannot be larger than 4.

It uses one TTY terminal shared by all threads.

The source code can be found here, and the mapping is defined here.

classif

This multi-threaded application takes a stream of Gigabit Ethernet packets, and performs packet analysis and classification, based on the source MAC address. It uses the multi-channel NIC peripheral, and the chained buffer DMA controller, to receive and send packets on the Gigabit Ethernet port. It can run on architectures containing up to 256 clusters, and up to 8 processors per cluster, with one task per processor.

It requires one TTY terminal shared by all threads.

This application is described as a TCG (Task and Communication Graph) containing (N+2) tasks per cluster: one "load" task, one "store" task, and N "analyse" tasks. Packets are handled in containers: each container has a fixed size of 4 Kbytes and can contain from 2 to 60 packets. These containers are distributed in clusters:

  • one RX container per cluster (part of the kernel rx_chbuf), in the kernel heap.
  • one TX container per cluster (part of the kernel tx_chbuf), in the kernel heap.
  • N working containers per cluster (one per "analyse" task), in the user heap.

In each cluster, the "load", "analyse" and "store" tasks communicate through three local MWMR FIFOs:

  • fifo_l2a : transfer a full container from "load" to "analyse" task.
  • fifo_a2s : transfer a full container from "analyse" to "store" task.
  • fifo_s2l : transfer an empty container from "store" to "load" task.

For each fifo, one item is a 32-bit word defining the index of an available working container.
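The circulation of container indexes through the three FIFOs can be sketched with a minimal ring buffer. This is not the MWMR implementation: the fifo_t type, the function names and the depth are hypothetical, and the sketch is single-threaded, whereas a real MWMR channel must be protected by a lock because several tasks access it concurrently.

```c
#include <stdint.h>
#include <stdbool.h>

#define FIFO_DEPTH 4   /* hypothetical depth, in 32-bit items */

/* Minimal single-threaded FIFO of 32-bit items (container indexes). */
typedef struct {
    uint32_t     data[FIFO_DEPTH];
    unsigned int rd;     /* index of the next item to read  */
    unsigned int wr;     /* index of the next slot to write */
    unsigned int count;  /* number of stored items          */
} fifo_t;

/* Store one item. Returns false when the FIFO is full. */
bool fifo_put(fifo_t *f, uint32_t item)
{
    if (f->count == FIFO_DEPTH) return false;
    f->data[f->wr] = item;
    f->wr = (f->wr + 1) % FIFO_DEPTH;
    f->count++;
    return true;
}

/* Retrieve one item. Returns false when the FIFO is empty. */
bool fifo_get(fifo_t *f, uint32_t *item)
{
    if (f->count == 0) return false;
    *item = f->data[f->rd];
    f->rd = (f->rd + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}

/* Move one container index through the load -> analyse -> store -> load
 * cycle. Returns true when the index is back in fifo_s2l. */
bool cycle_once(fifo_t *s2l, fifo_t *l2a, fifo_t *a2s)
{
    uint32_t index;
    if (!fifo_get(s2l, &index)) return false;  /* "load" takes an empty container    */
    if (!fifo_put(l2a, index))  return false;  /* and passes it on once filled       */
    if (!fifo_get(l2a, &index)) return false;  /* "analyse" takes the full container */
    if (!fifo_put(a2s, index))  return false;  /* and passes it on once analysed     */
    if (!fifo_get(a2s, &index)) return false;  /* "store" takes it for transmission  */
    return fifo_put(s2l, index);               /* and returns it empty to "load"     */
}
```

Only the index travels through the FIFOs: the container itself stays where it was allocated, and the index transfers its ownership from one task to the next.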

The pointers to the working containers and the pointers to the MWMR fifos are stored in global arrays located in cluster[0][0]. The MWMR fifo descriptors array is defined as a global variable in cluster[0][0].

Initialisation is done in three steps by the "load" & "store" tasks:

  1. Task "load" in cluster[0][0] initialises the heaps in all clusters. Other tasks wait on the global_sync synchronisation variable.
  2. Task "load" in cluster[0][0] initialises the barrier between all "load" tasks, allocates the NIC & CMA RX channels, and starts the NIC_CMA RX transfer. Other "load" tasks wait on the load_sync synchronisation variable. Task "store" in cluster[0][0] initialises the barrier between all "store" tasks, allocates the NIC & CMA TX channels, and starts the NIC_CMA TX transfer. Other "store" tasks wait on the store_sync synchronisation variable.
  3. When this global initialisation is completed, the "load" task in each cluster allocates the working containers and the MWMR fifo descriptors from the local user heap. In each cluster, the "analyse" and "store" tasks wait for the completion of this local initialisation on the local_sync[x][y] variable.

When initialisation is completed, all tasks loop on containers:

  1. The "load" task gets an empty working container from the fifo_s2l, transfers one container from the kernel rx_chbuf to this user container, and transfers ownership of this container to one "analyse" task by writing into the fifo_l2a.
  2. The "analyse" task gets one working container from the fifo_l2a, analyses each packet header, computes the packet type (depending on the SRC MAC address), increments the corresponding classification counter, and swaps the SRC and DST MAC addresses for TX transmission.
  3. The "store" task gets a full working container from the fifo_a2s, transfers the content of this user container to the kernel tx_chbuf, and transfers ownership of this empty container to the "load" task by writing into the fifo_s2l.
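The work done by the "analyse" task on each packet can be sketched as follows. In a standard Ethernet frame, the destination MAC address occupies bytes 0 to 5 and the source MAC address bytes 6 to 11. The classification rule, the number of counters and the function names are hypothetical: the text only says that the packet type depends on the source MAC address.

```c
#include <stdint.h>
#include <string.h>

#define MAC_LEN   6
#define NB_TYPES  16   /* hypothetical number of classification counters */

/* Classification counters, one per packet type. */
unsigned int counters[NB_TYPES];

/* Hypothetical classification rule: the type is given by the low-order
 * bits of the last byte of the source MAC address. */
unsigned int packet_type(const uint8_t *src_mac)
{
    return src_mac[MAC_LEN - 1] % NB_TYPES;
}

/* Analyse one Ethernet frame in place: count it, then swap the
 * destination and source MAC addresses for retransmission. */
void analyse_packet(uint8_t *frame)
{
    uint8_t  tmp[MAC_LEN];
    uint8_t *dst = frame;             /* bytes 0 to 5  */
    uint8_t *src = frame + MAC_LEN;   /* bytes 6 to 11 */

    counters[packet_type(src)]++;

    memcpy(tmp, dst, MAC_LEN);
    memcpy(dst, src, MAC_LEN);
    memcpy(src, tmp, MAC_LEN);
}
```

The rest of the frame is left untouched, so the packet can be sent back as is once the addresses are swapped.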

Instrumentation results are displayed by the "store" task in cluster[0][0] when all "store" tasks have processed the number of containers specified by the CONTAINERS_MAX parameter.

The source code can be found here, and the mapping is defined here.

mjpeg

This multi-threaded application decompresses an MJPEG bit-stream contained in a file, and displays the stream of images on the Frame Buffer. It illustrates "multi-pipeline" parallelism: each image is decompressed by a five-stage pipeline implemented as five POSIX threads. Several images can be decompressed in parallel, as each cluster implements a complete pipeline. It uses the message passing programming model, on top of the POSIX threads API, and the MWMR user-level communication middleware. The application is described as a TCG (Task and Communication Graph), and all communications between threads use MWMR channels. It uses the chained buffer DMA component to display the stream of decompressed images. It contains 5 types of threads (plus the MAIN thread), and 7 types of MWMR communication channels:

  • the MAIN thread performs the initialization, dispatches the bit-stream to the pipelines, and handles the instrumentation. It is mapped in cluster[0,0].
  • the 5 threads implementing the pipeline (DEMUX, VLD, IQZZ, IDCT, LIBU) are replicated in all clusters.
  • the 7 MWMR channels are replicated in all clusters.

The speedup is actually bounded by the dispatch work done by the MAIN thread, which cannot be parallelized. As the MWMR communication channels support communication between software threads and hardware accelerators, the IDCT software thread can optionally be replaced by a hardware DCT coprocessor, if this component is available in the target architecture.

The hardware constraints are the following:

  • The number of clusters cannot be larger than 16*16.
  • The number of processors per cluster cannot be larger than 4.
  • The frame buffer size must fit the decompressed images size.
  • It uses one TTY terminal shared by all tasks.

All parameters (number of images, depths of communication channels, debug variables) are defined in the mjpeg.h file.

The source code can be found here, and the mapping is defined here.

raycast

This multi-threaded application implements a video game requiring 3D image synthesis. The gamer can dynamically explore a maze, and the gamer's view (3D images) changes interactively with the gamer's moves.

It can run on a multi-processor, multi-cluster architecture, with one thread per processor, and can use any values for the frame buffer (width * height) associated with the graphical display, as it does not use any pre-existing images. It uses the chained buffer DMA peripheral to speed up the display, but the heaviest part of the computation is the image synthesis.

After each gamer move, a new image is displayed. For a given image, the columns of pixels can be built in parallel by several threads, each running the same render() function on a different column. The number of threads is independent of the number of columns (image width), because the load is dynamically balanced between threads by a job allocator, until all columns of a given image have been handled.
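A job allocator of this kind can be sketched with a shared counter. The version below uses a C11 atomic counter, which is an assumption: the actual allocator may rely on a lock or on another mechanism. The allocator_t type and the function names are hypothetical, and marking a column in column_done stands in for the real call to render().

```c
#include <stdatomic.h>

#define NO_MORE_COLUMNS  (-1)

/* Shared job allocator for the current image. */
typedef struct {
    atomic_int next;    /* next column not yet allocated */
    int        width;   /* image width, in columns       */
} allocator_t;

/* Prepare the allocator for a new image of `width` columns. */
void allocator_reset(allocator_t *a, int width)
{
    a->width = width;
    atomic_store(&a->next, 0);
}

/* Atomically reserve one column. Returns its index, or NO_MORE_COLUMNS
 * when every column of the image has already been handed out. */
int allocator_get(allocator_t *a)
{
    int col = atomic_fetch_add(&a->next, 1);
    return (col < a->width) ? col : NO_MORE_COLUMNS;
}

/* Loop run by each thread: take columns until none are left.
 * Returns the number of columns handled by the calling thread. */
int render_loop(allocator_t *a, unsigned char *column_done)
{
    int done = 0;
    int col;

    while ((col = allocator_get(a)) != NO_MORE_COLUMNS) {
        column_done[col] = 1;   /* placeholder for rendering this column */
        done++;
    }
    return done;
}
```

Because each thread asks for a new column only when it has finished the previous one, a thread that gets cheap columns simply handles more of them, whatever the ratio between the image width and the number of threads.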

It requires one TTY terminal, shared by all threads.

The source code can be found here, and the mapping is defined here.

router

This multi-threaded application emulates a network processing application such as an Ethernet router. All communications between threads use the MWMR (multi-writer/multi-reader) middleware. The application is described as a TCG (Task and Communication Graph):

  • The number N of threads is (x_size * y_size * nprocs): nprocs threads per cluster.
  • There is one producer() thread, one consumer() thread, and N-2 compute() threads.
  • The number M of MWMR channels is (2 * x_size * y_size) : one input and one output channel per cluster.
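The sizing rules above translate directly into C. The function names in this small sketch are hypothetical.

```c
/* Total number N of threads: nprocs threads in each of the clusters. */
unsigned int router_threads(unsigned int x_size, unsigned int y_size,
                            unsigned int nprocs)
{
    return x_size * y_size * nprocs;
}

/* Number of compute() threads: all threads except producer() and consumer(). */
unsigned int router_compute_threads(unsigned int nthreads)
{
    return nthreads - 2;
}

/* Number M of MWMR channels: one input and one output channel per cluster. */
unsigned int router_channels(unsigned int x_size, unsigned int y_size)
{
    return 2 * x_size * y_size;
}
```

For a platform with 2 * 2 clusters of 4 processors, this gives 16 threads, 14 of which are compute() threads, and 8 channels.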

It can run on a multi-processor, multi-cluster architecture, with one thread per processor. In this implementation, only integer tokens are transferred between threads, but each token can be interpreted as a job descriptor:

  • The main() thread, running on P[0,0,0], performs the initializations, launches the N other threads, and exits.
  • The producer() thread, running on P[0,0,0], continuously tries to write tokens into the M distributed input channels, using a non-blocking write function.
  • The consumer() thread, running on P[0,0,1], continuously tries to read tokens from the M distributed output channels, using a non-blocking read function.
  • The N-2 compute() threads, running on all other processors, continuously read a token from the local input channel, and write the same token to the local output channel, after a random delay emulating a variable processing time. They use blocking access functions.

It requires one TTY terminal shared by all threads.

The source code can be found here, and the mapping is defined here.