
GIET_VM / User Applications

The following applications use the GIET_VM system calls and user libraries. The multi-threaded applications use the POSIX threads API, and have been specifically designed to analyse the scalability of the TSAR manycore architecture.

shell

This single-threaded interactive application can be used to access the FAT32 file system, or to dynamically activate or deactivate other applications. When it is mapped on the target architecture, it is automatically launched at the end of the boot phase. The list of available commands can be obtained with the help command.

It requires one private TTY terminal.

The source code can be found here, and the mapping directives are defined here.

Display

This single-threaded application illustrates the use of the CMA (Chained buffer DMA) peripheral to display a stream of images. The application reads a stream of images from the /misc/images_128.raw file, stored on the FAT32 disk. It displays the stream of images on the FBF (graphical display) peripheral. The images_128.raw file contains 20 images of 128 lines * 128 pixels, with 1 byte per pixel.
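A minimal sketch of the application structure follows. The giet_fat_* and giet_fbf_cma_* wrappers are assumptions standing for the GIET_VM FAT32 and CMA system calls; the actual names and signatures may differ.

{{{
#include <stdint.h>

/* hypothetical wrappers standing for the GIET_VM FAT32 and CMA
   system calls: the real names and signatures may differ       */
extern int  giet_fat_open(const char *pathname, unsigned int flags);
extern int  giet_fat_read(int fd, void *buffer, unsigned int count);
extern void giet_fat_close(int fd);
extern void giet_fbf_cma_alloc(void);
extern void giet_fbf_cma_init_buf(void *buf0, void *buf1);
extern void giet_fbf_cma_start(unsigned int length);
extern void giet_fbf_cma_display(unsigned int index);
extern void giet_fbf_cma_stop(void);

#define NLINES   128    /* lines per image          */
#define NPIXELS  128    /* pixels per line          */
#define NIMAGES  20     /* images in images_128.raw */

/* two user buffers, displayed alternately by the CMA channel */
static uint8_t buf[2][NLINES * NPIXELS];

int main(void)
{
    int fd = giet_fat_open("/misc/images_128.raw", 0);   /* read-only */

    giet_fbf_cma_alloc();                    /* get a CMA channel   */
    giet_fbf_cma_init_buf(buf[0], buf[1]);   /* register buffers    */
    giet_fbf_cma_start(NLINES * NPIXELS);    /* start the transfers */

    for (unsigned int i = 0; i < NIMAGES; i++)
    {
        giet_fat_read(fd, buf[i & 1], NLINES * NPIXELS); /* next image  */
        giet_fbf_cma_display(i & 1);                     /* push to FBF */
    }

    giet_fbf_cma_stop();
    giet_fat_close(fd);
    return 0;
}
}}}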

It requires one private TTY terminal.

The source code can be found here, and the mapping directives are defined here.

Coproc

This single-threaded application illustrates the use of multi-channel hardware accelerators by a user application. The hardware coprocessor must be connected to the system by a vci_mwmr_dma component. In this application, the coprocessor performs a Greatest Common Divisor (GCD) computation between two vectors of randomly generated 32-bit integers. The vector size is a parameter.
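For reference, the per-element computation performed by the coprocessor is the classic Euclid iteration. A software model of it (function names are illustrative, not the application's actual identifiers) might look like:

{{{
#include <stdint.h>

/* Euclid's algorithm: reference model of the per-element
   computation performed by the hardware coprocessor      */
static uint32_t gcd(uint32_t a, uint32_t b)
{
    while (b != 0)
    {
        uint32_t r = a % b;
        a = b;
        b = r;
    }
    return a;
}

/* software check: opa[] and opb[] are the two input vectors,
   res[] the expected coprocessor output; size is the vector
   size parameter                                             */
static void gcd_vector(const uint32_t *opa, const uint32_t *opb,
                       uint32_t *res, unsigned int size)
{
    for (unsigned int i = 0; i < size; i++)
        res[i] = gcd(opa[i], opb[i]);
}
}}}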

It requires one private TTY terminal.

The source code can be found here, and the mapping directives are defined here.

Transpose

This multi-threaded application is typical of the parallelism that can be exploited in low-level image processing.

It asks the user to enter the name of a file containing an image stored on the FAT32 disk, checks that the selected image fits the frame buffer size, transposes the image (X <-> Y), displays the result on the graphical display, and saves the transposed image to the FAT32 disk.

It can run on a multi-processor, multi-cluster architecture, with one thread per processor core. The total number of threads depends on the hardware architecture, and is computed as ( x_size * y_size * nprocs ). The main() function is executed by the thread running on P[0,0,0]. It makes several initialisations, launches all other threads (using the pthread_create() function), and calls the execute() function. When main() returns from execute(), it uses the pthread_join() function to detect application completion. All other threads execute the execute() function, and each execute() call handles exactly (image_size / nthreads) lines.
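A condensed sketch of this structure, assuming the execute() prototype and illustrative values for x_size, y_size and nprocs:

{{{
#include <stdint.h>
#include <pthread.h>

#define X_SIZE   4                            /* clusters in X (illustrative)     */
#define Y_SIZE   4                            /* clusters in Y (illustrative)     */
#define NPROCS   4                            /* cores per cluster (illustrative) */
#define NTHREADS (X_SIZE * Y_SIZE * NPROCS)   /* one thread per core              */

extern void *execute(void *arg);   /* handles (image_size / NTHREADS) lines */

int main(void)
{
    pthread_t trdid[NTHREADS];

    /* ... initialisations: open the image file, allocate the
       distributed input and output buffers ...                */

    /* main() runs on P[0,0,0]: launch the other threads, then
       take part in the computation itself                     */
    for (uintptr_t n = 1; n < NTHREADS; n++)
        pthread_create(&trdid[n], NULL, &execute, (void *)n);

    execute((void *)0);

    /* wait for all other threads: application completion */
    for (unsigned int n = 1; n < NTHREADS; n++)
        pthread_join(trdid[n], NULL);

    return 0;
}
}}}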

The input and output buffers containing the source and transposed images are allocated from the user heap, which is distributed in all clusters, with (image_size / nclusters) lines per cluster. Therefore, the data reads are mostly local, but the data writes are mostly remote.

The number of clusters must be a power of 2 no larger than 256. The number of processors per cluster must be a power of 2 no larger than 4.

It requires one TTY terminal, shared by all threads.

The source code can be found here, and the mapping is defined here.

Convol

This multi-threaded application performs medical image processing.

It implements a 2D convolution product, used to remove some noise artifacts. The image, provided by the Philips company, is 1024 * 1024 pixels with 2 bytes per pixel. It is stored on the FAT32 disk in /misc/philips_image_1024.raw. The convolution kernel is [201]*[35] pixels, but it can be factored into two independent line and column convolution products, requiring two intermediate image transpositions. The five buffers containing the intermediate images are distributed in all clusters.
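The benefit of this factorisation is arithmetic: a direct 2D product costs 201 * 35 multiplications per pixel, while the two 1D passes cost about 201 + 35. A scalar sketch of the 1D kernel applied in each pass (illustrative names; the transpositions between the passes are omitted):

{{{
#include <stdint.h>

/* 1D convolution of one line: out[i] = sum_k coef[k] * in[i + k - ksize/2],
   with a simple clamp at the borders. The same routine serves the line
   pass (ksize = 201) and, after transposition, the column pass (ksize = 35). */
static void convol_line(const int32_t *in, int32_t *out,
                        const int32_t *coef, int npixels, int ksize)
{
    for (int i = 0; i < npixels; i++)
    {
        int64_t acc = 0;
        for (int k = 0; k < ksize; k++)
        {
            int j = i + k - ksize / 2;
            if (j < 0)        j = 0;            /* clamp left border  */
            if (j >= npixels) j = npixels - 1;  /* clamp right border */
            acc += (int64_t)coef[k] * in[j];
        }
        out[i] = (int32_t)acc;
    }
}
}}}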

It can run on a multi-processor, multi-cluster architecture, with one thread per processor. The main() function can be executed on any processor P[x,y,p]. It makes the initialisations, launches the (N-1) other threads to run the execute() function on the (N-1) other processors, calls the execute() function itself, and finally calls the instrument() function to display the instrumentation results when the parallel execution is completed.

The number of clusters containing processors must be a power of 2 no larger than 256. The number of processors per cluster must be a power of 2 no larger than 8.

It requires one TTY terminal, shared by all threads.

The source code can be found here, and the mapping is defined here.

Classif

This multi-threaded application takes a stream of Gigabit Ethernet packets, and performs packet analysis and classification, based on the source MAC address. It uses the multi-channel NIC peripheral and the chained buffer DMA controller to receive and send packets on the Gigabit Ethernet port. It can run on architectures containing up to 256 clusters, and up to 8 processors per cluster: one task per processor. It requires as many private TTYs as there are processors in cluster[0,0].

This application is described as a TCG (Task and Communication Graph) containing (N+2) tasks per cluster: one "load" task, one "store" task, and N "analyse" tasks. These tasks communicate through containers of packets: each container can hold from 2 to 60 packets and has a fixed size of 4 Kbytes. The containers are distributed in the clusters:

  • one RX container per cluster (part of the kernel rx_chbuf), in the kernel heap.
  • one TX container per cluster (part of the kernel tx_chbuf), in the kernel heap.
  • N working containers per cluster (one per "analyse" task), in the user heap.

In each cluster, the "load", "analyse" and "store" tasks communicate through three local MWMR FIFOs:

  • fifo_l2a : transfers a full container from the "load" task to an "analyse" task.
  • fifo_a2s : transfers a full container from an "analyse" task to the "store" task.
  • fifo_s2l : transfers an empty container from the "store" task to the "load" task.

For each FIFO, one item is a 32-bit word defining the index of an available working container.

The pointers to the working containers and the pointers to the MWMR FIFOs are defined by global arrays stored in cluster[0][0]. The MWMR FIFO descriptors array is itself defined as a global variable in cluster[0][0].

Initialisation is done in three steps by the "load" and "store" tasks (a sketch of the synchronisation wait follows the list):

  1. Task "load" in cluster[0][0] initialises the heaps in all clusters. Other tasks are waiting on the global_sync synchronisation variable.
  2. Task "load" in cluster[0][0] initialises the barrier between all "load" tasks, allocates NIC & CMA RX channel, and starts the NIC_CMA RX transfer. Other "load" tasks are waiting on the load_sync synchronisation variable. Task "store" in cluster[0][0] initialises the barrier between all "store" tasks, allocates NIC & CMA TX channels, and starts the NIC_CMA TX transfer. Other "store" tasks are waiting on the store_sync synchronisation variable.
  3. When this global initialisation is completed, the "load" task in all clusters allocates the working containers and the MWMR fifos descriptors from the user local heap. In each cluster, the "analyse" and "store" tasks are waiting the local initialisation completion on the local_sync[x][y] variables.
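The synchronisation variables mentioned above are plain shared words: the initialising task writes the flag, and the waiting tasks spin until it is set. A minimal sketch, assuming simple busy waiting (variable names from the text, everything else illustrative):

{{{
#include <stdint.h>

/* shared synchronisation flags, stored in cluster[0][0] */
volatile uint32_t global_sync = 0;   /* set when all heaps are initialised    */
volatile uint32_t load_sync   = 0;   /* set when the NIC_CMA RX transfer runs */
volatile uint32_t store_sync  = 0;   /* set when the NIC_CMA TX transfer runs */

/* waiting side: spin until the initialising task sets the flag */
static void wait_sync(volatile uint32_t *flag)
{
    while (*flag == 0)
        ;   /* busy wait on the shared word */
}

/* initialising side, e.g. the "load" task in cluster[0][0]
   at the end of step 1:   global_sync = 1;                 */
}}}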

When initialisation is completed, all tasks loop on containers (the "load" task loop is sketched after the list):

  1. The "load" task get an empty working container from the fifo_s2l, transfer one container from the kernel rx_chbuf to this user container, and transfer ownership of this container to one "analysis" task by writing into the fifo_l2a.
  2. The "analyse" task get one working container from the fifo_l2a, analyse each packet header, compute the packet type (depending on the SRC MAC address), increment the correspondint classification counter, and transpose the SRC and the DST MAC addresses fot TX tranmission.
  3. The "store" task transfer get a full working container from the fifo_a2s, transfer this user container content to the the kernel tx_chbuf, and transfer ownership of this empty container to the "load" task by writing into the fifo_s2l.

The instrumentation results are displayed by the "store" task in cluster[0][0] when all "store" tasks have processed the number of containers specified by the CONTAINERS_MAX parameter.

The source code can be found here, and the mapping is defined here.

Raycast

This multi-threaded application implements a video game requiring 3D image synthesis. The gamer can dynamically explore a maze, and the gamer's vision (3D images) depends on the gamer's moves.

It can run on a multi-processor, multi-cluster architecture, with one thread per processor, and can use any value for the frame buffer size (width * height) associated to the graphical display, as it does not use any pre-existing images. It uses the chained buffer DMA peripheral to speed up the display, but the heaviest part of the computation is the image synthesis.

After each gamer move, a new image is displayed. For a given image, the columns of pixels can be built in parallel by several threads, each running the same render() function for a given column. The number of threads is independent of the number of columns (the image width), because the load is dynamically balanced between the threads by a job allocator, until all columns of a given image have been handled.
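A sketch of this job allocator, assuming a shared column counter incremented atomically (here with a GCC builtin) and the render() prototype:

{{{
#include <stdint.h>

static volatile uint32_t next_column;   /* shared job allocator, reset to 0
                                           before each new image            */

extern void render(int column);         /* computes one column of pixels */

/* executed by every rendering thread for the current image */
void *frame_worker(void *arg)
{
    int image_width = *(int *)arg;

    while (1)
    {
        /* atomically grab the next unhandled column index */
        uint32_t col = __sync_fetch_and_add(&next_column, 1);
        if (col >= (uint32_t)image_width)
            break;                       /* all columns handled: image done */
        render((int)col);
    }
    return NULL;
}
}}}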

It requires one TTY terminal, shared by all threads.

The source code can be found here, and the mapping is defined here.

Router

The source code can be found here, and the mapping is defined here.