
The project structure

Online monitoring, or instrumentation, consists of adding software and/or hardware probes and sensors to the running architecture in order to detect and collect events corresponding to one or several physical phenomena occurring in the MP2SoC (temperature, power consumption or processor workload reaching a threshold, the contents of a hardware register or of a variable differing from the expected value, etc.). Monitoring aims to be a non-intrusive stage that basically reads analog, digital, and software sensors and stores the results in local memories.

Online diagnosis is the stage responsible for thorough analysis once events corresponding to an alteration or malfunction have been detected in the previous stage. This stage interprets and formats the raw results, logs them into efficient data structures such as databases, and manages their history. Diagnosis also performs intrusive tests, like functional or structural tests on IPs, computes an annotated representation of the running architecture, and finally builds a database of audited architecture views. These views, or maps, represent an instant picture of the architecture showing the exact physical locations of the occurrences of the analyzed phenomenon. In other words, an event map represents the audited architecture with respect to the monitored event. Because it is intrusive, the diagnosis stage generally suspends or simply stops the running application. For instance, in the case of the structural test of a component, the running application must be stopped and totally replaced by the test application.

Online constrained application remapping exploits the database of event maps, and possibly its history, to determine how and under what conditions the application graph can be remapped to the architecture. The instant map is used to constrain the placement of the monitored application graph. Different placement strategies for the application graph are possible, from a centralized scheme that statically assigns threads to processors once and for all, to a distributed and dynamic placement algorithm that allows task migration/replication and local optimization.

The ADAM project addresses a major part of the issues related to MP2SoC self-adaptability and aims at determining the common hardware and software mechanisms needed for the three stages. For the sake of readability, each of these stages has been assigned a work-package in the proposal. CEA-LETI is responsible for the “online monitoring” work-package, LIP6 for the “online diagnosis” work-package, and LIRMM for the “online constrained application remapping” work-package. As a proof of concept, and to validate the whole work, 3 distinct applications will be mapped onto the three hardware architectures maintained by each partner: a 3GPP-LTE telecom application, an H264 decoder, and an MP3 decoder.

ADAM Workpackages

The project is composed of 3 Work-Packages (WP). The first one, WP1, is dedicated to online non-intrusive monitoring; the second one, WP2, addresses the problem of online diagnosis and event database management; and the third one, WP3, deals with constraint-driven application remapping.

WP1 contains 3 tasks: the first one, 1a), is dedicated to performance measurement; the second one, 1b), will give figures for power consumption, temperature and voltage; and the third one, 1c), addresses fault detection techniques. The information provided by these 3 tasks is then gathered in the Distributed Raw Event Tables (DRET), which are tables of “first-level” measurements distributed among the different processing tiles of the architecture. The objective of this WP is to obtain a complete set of information that allows a correct diagnosis of the architecture to be performed.

WP2 takes as input the DRET from WP1, and its objective is to obtain a consolidated database of multi-parameter Architecture Instant Maps (AIM), called AIM-DB. This database will have a formalism that enables the different application remapping scenarios developed in WP3. To meet the WP2 objective, 4 tasks are defined: 2a) allows access to the database and history management when needed, 2b) periodically analyses the DRET to give inputs to the AIM, 2c) triggers alerts when important events from the DRET are detected, and 2d) performs some intrusive diagnosis/test tasks after alerts have been triggered.

AIM-DB is the direct input of WP3. The objective of this WP is to dynamically adapt the application using this knowledge of the architecture's present state with respect to monitored events. The outputs are the different methods that will be developed to perform this self-adaptation, and the associated software code and hardware developments. Task 3a) will study a centralized remapping scenario where the information found in AIM-DB is exploited globally to perform application remapping and binary relinking/reloading, while task 3b) will examine the distributed remapping scenario, which takes advantage of AIM-DB to issue local or global remapping orders. Finally, task 3c) defines a common set of example applications that will be used for the validation of the different concepts.

WP1 : On-line Non-Intrusive Monitoring

wp1.png

In this first WP, the general objective is to provide monitoring capabilities to platform-based SoCs. In this project, monitoring means measuring different characteristics of the cores while the application is running. These measurements are done continuously and in a non-intrusive way (no modification of the initial application). One of the primary qualities required of the added monitoring elements is that they cause minimal perturbation compared to the initial structure. The studied platform is composed of tiles interconnected by a Network on Chip (NoC). The tiles can embed simple processors (like a RISC R3000 or a LEON SPARC V8 processor), complex clusters (multiple cores with their associated memories), or even powerful hardware accelerators (probably reconfigurable cores), with a possible mix between these categories. In fact, we only assume that the platform is composed of SW and HW elements sharing the same set of communication primitives. Another important point is that the platform is assumed to be composed of some hundreds of these different tiles.

Software and hardware monitoring for performance, power/voltage/temperature and fault detection work as first-level instrumentation tasks, and are studied in WP tasks 1.a, 1.b and 1.c. They deliver raw monitored digital information on a periodic basis or permanently (online behaviour). This information is then modelled in the form of classified events in task 1.d. The event model specifies the event format, which is fixed for the architecture. Formatted events are stored in the local memories of the architecture tiles, in fixed-size cyclic buffers designed to be easily accessible. Altogether, the disseminated buffers represent the “Distributed Raw Event Tables” (DRET), available at all times within the architecture. The monitoring capabilities are summarized in figure 4.
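As an illustration of the fixed-size cyclic buffers described above, here is a minimal Python sketch. The event fields (`timestamp`, `tile`, `kind`, `value`) and the class names are assumptions made for the example; the actual event format is the subject of task 1.d and is not specified on this page.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical fixed event format: all field names are assumptions."""
    timestamp: int   # cycle count or tick at which the sample was taken
    tile: tuple      # (x, y) position of the tile in the NoC
    kind: str        # e.g. "temperature", "workload", "fault"
    value: float     # raw measurement


class RawEventTable:
    """Fixed-size cyclic buffer standing in for one tile's local table."""

    def __init__(self, capacity: int):
        # A deque with maxlen silently drops the oldest entry when full.
        self._buf = deque(maxlen=capacity)

    def log(self, event: Event) -> None:
        self._buf.append(event)

    def snapshot(self) -> list:
        """Return the stored events, oldest first."""
        return list(self._buf)


# In this sketch the DRET is simply the set of per-tile tables.
dret = {(0, 0): RawEventTable(capacity=4)}
for t in range(6):
    dret[(0, 0)].log(Event(t, (0, 0), "temperature", 40.0 + t))
```

After six writes into a table of capacity four, only the four most recent events remain, so a reader always sees a bounded, recent window of measurements.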

Task 1.a : Performance measurement, Task manager : LETI, Partners : LETI, LIRMM. In this task, the performance of the tile is monitored. The difficulty is to meet the minimum perturbation requirement. We propose to develop two mechanisms. The first one is SW oriented and consists in measuring, periodically or on-line, the workload of the processors as well as their communication workloads. The Network Interface of the NoC will provide a generic way to monitor input/output throughput on-line. The second mechanism is HW oriented and consists in probing some chosen critical paths. The advantage of this kind of monitoring is that it is non-intrusive, but the difficulty is gaining access to the data paths or the control part. Both HW and SW solutions will be studied and compared in this task.
T0 → T0+18
Status : not achieved yet


Task 1.b : PVT management, Task manager : LETI, Partners : LETI, LIP6. The objective of this task is the monitoring of physical information, such as temperature, voltage and power consumption. This can be obtained by direct measurement, with on-site temperature sensors for example, or by indirect measurement, using SW load evaluation and equivalence tables. Because of parameter dispersion across the chip in nanotechnologies, HW on-site sensors will probably be necessary. Nevertheless, indirect measurement will add another dimension and help the diagnosis phase. These two techniques will be studied and evaluated in this task. Some of the chosen techniques will also be implemented.
T0 → T0+24
Status : not achieved yet
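
The indirect measurement mentioned in task 1.b can be pictured as a table lookup from a software-measured load to an estimated power figure. The sketch below assumes linear interpolation between table entries; the table contents and the function name are hypothetical placeholders, not project data.

```python
from bisect import bisect_right

# Hypothetical equivalence table: (processor load in [0, 1], power in mW).
# The figures are placeholders, not measurements from the project.
LOAD_TO_POWER_MW = [(0.0, 10.0), (0.5, 60.0), (1.0, 150.0)]


def estimate_power_mw(load: float, table=LOAD_TO_POWER_MW) -> float:
    """Linearly interpolate a power estimate from a measured load."""
    # Clamp the load to the range covered by the table.
    load = min(max(load, table[0][0]), table[-1][0])
    loads = [entry[0] for entry in table]
    i = bisect_right(loads, load)
    if i == len(table):          # load equals the last table entry
        return table[-1][1]
    (l0, p0), (l1, p1) = table[i - 1], table[i]
    return p0 + (p1 - p0) * (load - l0) / (l1 - l0)
```

Such an estimate costs only a few arithmetic operations per sample, which is why it can complement, rather than replace, on-site sensors.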


Task 1.c : HW fault detection, Task manager : LETI, Partners : All. Nanotechnologies make it more and more difficult to ensure the correct behavior of the tiles, and of the interconnects between tiles, during the chip lifetime. Fault detection is therefore becoming a mandatory feature of future architectures. The objective of this task is to evaluate some of the possible HW and SW techniques, like on-line or periodic testing, BIST, software CRC or software security survey tasks. As the field of research is very vast and the project does not aim at full protection against faults, only a few techniques will be implemented. The objective is to add this dimension to the event table, because of its importance.
T0 → T0+24
Status : not achieved yet
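
Among the techniques listed in task 1.c, a software CRC is the simplest to illustrate: a reference signature is computed once over a memory area that should not change, and recomputed periodically to detect corruption. This is a sketch only; it uses the CRC-32 from Python's standard `zlib` module for brevity, and the helper names and the choice of CRC-32 are assumptions, not the project's design.

```python
import zlib


def crc_of(block: bytes) -> int:
    """CRC-32 of a memory block, computed with zlib's implementation."""
    return zlib.crc32(block) & 0xFFFFFFFF


def check_block(block: bytes, reference_crc: int) -> bool:
    """Return True when the block still matches its reference signature."""
    return crc_of(block) == reference_crc


code_segment = bytes(range(64))      # stand-in for a read-only memory area
reference = crc_of(code_segment)     # computed once, e.g. at load time

corrupted = bytearray(code_segment)
corrupted[10] ^= 0x04                # simulate a single bit flip
```

A failed check would then be logged as a fault event in the tile's raw event table.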


Task 1.d : Event shaping and logging, Task manager : LETI, Partners : All. The final objective of all the tasks in this WP is to obtain a raw event table for all the measured features of the architecture. The objective of this task is to determine the format of this table and how it can be accessed (how to write the table, and how and when to read it). Both the format and the access method should be kept as simple as possible so that they can be exploited by different diagnosis systems, and not only those provided in this project. In that way, these first-level (or preliminary) results can be used and disseminated independently.
T0 → T0+24
Status : not achieved yet

WP2 : On-line Diagnosis

wp2.png

This work-package is dedicated to online diagnosis. It aims at building and managing a database of relevant cartographies of the MP2SoC from the Distributed Raw Event Tables elaborated in WP1. In ADAM, a cartography for a given event type is called an Architecture Instant Map (AIM). An AIM is a collection of relevant events, either extracted from the DRET or newly aggregated and computed, with their exact physical locations and characteristics at a given time.

In other words, WP2 uses the local information found in the Distributed Raw Event Tables (DRET) as an entry point to compute efficient data structures that can be used appropriately during the next phase, constrained remapping. Maintaining at all times a database of AIMs, called the AIM Consolidated Database, together with their corresponding histories can be of great interest for multi-criteria mapping and optimization.
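
To make the relation between the DRET and an AIM concrete, the following Python sketch builds a map for one event type by keeping, for each tile, the most recent matching raw event. The event layout `(timestamp, kind, value)` and the function name `build_aim` are assumptions made for illustration.

```python
def build_aim(dret: dict, kind: str) -> dict:
    """Build a map {tile: latest value} for one event type.

    `dret` maps a tile position to its list of raw events, each event
    being a (timestamp, kind, value) tuple. This layout is an assumption.
    """
    aim = {}
    for tile, events in dret.items():
        matching = [e for e in events if e[1] == kind]
        if matching:
            # Keep the most recent sample recorded for this tile.
            aim[tile] = max(matching, key=lambda e: e[0])[2]
    return aim


dret = {
    (0, 0): [(1, "temperature", 61.0), (2, "temperature", 64.5)],
    (0, 1): [(1, "workload", 0.9), (2, "temperature", 48.0)],
    (1, 0): [(2, "workload", 0.2)],
}
temperature_aim = build_aim(dret, "temperature")
```

Tiles with no event of the requested type are simply absent from the resulting map.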

This work-package is composed of four tasks, 2a to 2d. All these tasks collaborate to define the interoperable AIM database and its access mechanisms (task 2d). More precisely, Task 2a is responsible for performing statistical analysis on raw or historized events. Task 2b is responsible for triggering actions when a set of events matches a certain condition, and task 2c defines one of these actions, online functional/structural testing. When dealing with fault detection, this work-package performs intrusive monitoring in the form of functional/structural test to determine with optimal accuracy the deficient hardware resources.


Task 2.a : Statistical analysis, Task manager : LIRMM, Partners : LIRMM. Among the remapping strategies that will be presented later in WP3, some, or a sequence of those applied over time, may have hardly predictable effects on application performance. In order to keep track of the mid- or long-term consequences of those remapping decisions, a statistical analysis of the DRET and AIM databases will be performed. This information may later help refine the decision-making policy when, for instance, a previous task migration order led to a worse global solution.
T0+12 → T0+24
Status : not achieved yet


Task 2.b : Hard/Soft alert triggering, Task manager : LIP6, Partners : ALL. This task is dedicated to the appropriate exploitation of the formatted events available in the tables of the DRET, filled by all the mechanisms studied in WP1. Exploitation covers Boolean computing on events, ordering, classifying, filtering, accumulating, mean-value calculation, and finally triggering of an appropriate action. This task intends to identify how to write these second-level instrumentation threads (as opposed to the first-level instrumentation threads studied in WP1). They are designed to be the reactive part of the online monitoring, since they decide whether or not to trigger the next step. The next step can either be further analysis through test (2c) or direct remapping (WP3).
T0+6 → T0+24
Status : not achieved yet
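
A second-level instrumentation thread of the kind described in task 2.b can be sketched as a rule evaluated over filtered, time-ordered events, followed by an action when the rule fires. The rule shown (mean of the last samples above a threshold), the event layout `(timestamp, kind, value)` and all names are assumptions chosen for the example.

```python
from statistics import mean


def mean_above(threshold: float, window: int):
    """Rule: the mean of the last `window` values exceeds `threshold`."""
    def rule(values):
        recent = values[-window:]
        return len(recent) == window and mean(recent) > threshold
    return rule


def evaluate(events, kind, rule, action):
    """Filter events by kind, order them by time, apply the rule, and
    call `action` when it fires. Returns True if the action was called."""
    ordered = sorted(events, key=lambda e: e[0])
    values = [value for _, k, value in ordered if k == kind]
    if rule(values):
        action(kind, values)
        return True
    return False


alerts = []
events = [(3, "temperature", 82.0), (1, "temperature", 70.0),
          (2, "workload", 0.4), (2, "temperature", 79.0)]
fired = evaluate(events, "temperature", mean_above(75.0, window=3),
                 lambda kind, values: alerts.append(kind))
```

In this sketch the action only records the alert; in the scheme described above it would instead start a test (2c) or request a remapping (WP3).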


Task 2.c : Functional/structural test, Task manager : LIP6, Partners : LIP6. The main objective of this task is to identify the failing component(s) of the MP2SoC architecture, in reaction to the alert triggering. Faulty components can be processors, RAMs within a Processing Element (PE), or routers interconnecting the main PEs. Thus, functional/structural software/hardware architectures and methods must be defined to allow quick targeting of the faulty part. Different strategies, depending on the level of accuracy, will be experimented with: stop at first fail, identify all failing components within the PE, and so on. It should be noted that in this task the structural test is not intended to localize a faulty gate or a faulty transistor; the objective is rather to find which component should be deactivated and not considered during the remapping step. This obviously implies the use of on-chip functional resources for test purposes.
T0 → T0+24
Status : not achieved yet


Task 2.d : Event database access/history primitives, Task manager : LIP6, Partners : ALL. This task aims at defining the Application Programming Interface (API) of the AIM database, in interaction with the API defined in task 1.d. An AIM is a collection of logged events corresponding to a single monitored phenomenon (exact locations and characteristics). In particular, this task is to define all the means needed to manipulate and enhance a database of AIMs, classified by observed phenomenon (temperature audit of the SoC, accurate identification of the faulty parts, communication contention points, etc.), as well as their history for backtracking purposes.
T0+6 → T0+24
Status : not achieved yet
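
The API to be defined in task 2.d is not given on this page. Purely as a sketch of what "access and history primitives" could look like, the Python class below keeps a bounded history of maps per phenomenon and supports looking up the latest map or the map valid at a past time. Every name and signature here is hypothetical.

```python
class AimDatabase:
    """Hypothetical AIM-DB: one bounded history of maps per phenomenon."""

    def __init__(self, history_depth: int = 8):
        self._depth = history_depth
        self._maps = {}   # phenomenon -> list of (timestamp, aim), oldest first

    def store(self, phenomenon: str, timestamp: int, aim: dict) -> None:
        history = self._maps.setdefault(phenomenon, [])
        history.append((timestamp, dict(aim)))   # copy to freeze the snapshot
        del history[:-self._depth]               # drop entries beyond the depth

    def latest(self, phenomenon: str) -> dict:
        return self._maps[phenomenon][-1][1]

    def at(self, phenomenon: str, timestamp: int) -> dict:
        """Backtrack: most recent map stored at or before `timestamp`."""
        for ts, aim in reversed(self._maps[phenomenon]):
            if ts <= timestamp:
                return aim
        raise KeyError(f"no {phenomenon} map at or before {timestamp}")


db = AimDatabase(history_depth=2)
db.store("temperature", 10, {(0, 0): 50.0})
db.store("temperature", 20, {(0, 0): 60.0})
db.store("temperature", 30, {(0, 0): 70.0})
```

With a history depth of two, the map stored at time 10 has been discarded, so backtracking is only possible as far as time 20.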

WP3 : On-line application remapping

wp3.png

Based on the online monitoring information that has been gathered by the appropriate monitoring resources (WP1), then diagnosed and classified in the Architecture Instant Map Database AIM-DB (WP2), a collection of strategies for improving the application mapping is considered. This self-adaptability scheme may, depending on the system policy, aim at:

  • Decreasing the system power consumption
  • Ensuring real-time performance or more generally improving application performance
  • Guaranteeing functionality in the presence of faulty hardware resources

These strategies operate on a task basis, as defined by the application task graph. Only the following on-line operations are considered:

  • Migration. Tasks are moved from HW resource to HW resource for:
    • Lowering communication cost / power consumption in order to improve performance if an alternative processing resource has more appropriate support for a particular task. Additionally, if some of the processing resources have time-sliced execution capabilities (a CPU running a multitasking operating system), migrating tasks results in higher performance since time is shared among fewer tasks.
    • Avoiding mapping to a faulty hardware resource, when online diagnostic support identifies an imminent problem (increased current leakage, rising temperature, etc.)
  • Replication. Tasks that are identified as critical (forming a bottleneck) can be replicated in order to improve performance, provided that the application description enables it. Replication only occurs when a task becomes critical momentarily, meaning that later on, when the performance demand drops below a given threshold, the replicated tasks are killed, freeing the corresponding processing resources.
  • Router reconfiguration. Either for communication performance (avoiding contention) or to deal with a hardware defect on some units or physical links, the routing tables may be changed at run-time.
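
The replication rule in the list above (replicate when a task becomes a bottleneck, kill replicas when demand drops below a threshold) can be sketched as a small decision function. The two threshold values, the replica limit and the function name are placeholders for illustration, not figures from the project.

```python
def adjust_replicas(replicas: int, demand: float,
                    spawn_above: float = 0.9, kill_below: float = 0.4,
                    max_replicas: int = 4) -> int:
    """Return the new replica count for one task.

    `demand` is the load per replica in [0, 1]; both thresholds are
    placeholders. Using two distinct thresholds avoids oscillating
    around a single value.
    """
    if demand > spawn_above and replicas < max_replicas:
        return replicas + 1      # task is a bottleneck: replicate it
    if demand < kill_below and replicas > 1:
        return replicas - 1      # demand dropped: free one resource
    return replicas
```

The original task is never killed: the count only drops back to one.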



The first two classes of operations may take place in different ways:

  • a fully centralized scenario where once the decision to remap the application has been taken, a global remapping is issued.
  • a fully decentralized scenario where processing resources are all equally endowed with decision capabilities.

Depending on the chosen approach (i.e. centralized or distributed), there may be tight coupling between the application remapping (current WP) and the online diagnosis (WP2). Although a centralized mapping strategy can operate using the system-level information issued by the online diagnosis and drive the corresponding remapping operations, a fully distributed strategy runs differently. The fundamental difference between the 2 principles lies in the following:

  • In a centralized system, a single unit holds the global system-level monitoring information and may therefore take decisions and issue remapping orders.
  • In a fully distributed system, each unit takes decisions independently from the others (according to potentially local-only information), leading to potentially conflicting solutions.

The underlying motivation behind the evaluation of both strategies lies in the problem of scalability. A centralized scenario may intrinsically take better decisions since it operates on a global system view. Nevertheless, the support it requires for periodically retrieving the AIM information implies an overhead that grows with the number of processors in the system. The intended explorations will help to better define the best trade-off according to, among other criteria, the number of processors and the desired adaptability.


Task 3.a : Multi-constrained centralized remapping scenario, Task manager : LIP6, Partners : LIP6, LIRMM, LETI. In this approach, the information contained in the AIM-DB of WP2 is considered globally. Once the diagnosis has been performed and the actions to be undertaken (skipping faulty parts in the remapping process, migrating threads to lower the temperature of a region of the chip, etc.) have been clearly identified, a dedicated master processor on the MP2SoC executes the remapping algorithm, taking into account one AIM (single-constrained: faulty parts, for instance) or several AIMs (multi-constrained: faulty parts AND temperature AND processor workload, for instance). The remapping is actually composed of two steps: the remapping step, which takes as inputs the application graph and the considered AIMs and assigns hardware resources to threads, and the application relinking/reloading step, which generates a new binary executable, appropriate global/local routing tables, and reloading orders from the collection of object files (thread object files and operating system object files) representing the running application. This way, the MP2SoC is able to self-adapt to a new application firmware.
T0+12 → T0+36
Status : not achieved yet
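
The remapping step of task 3.a takes the application graph and one or several AIMs and assigns hardware resources to threads. The Python sketch below shows one possible, deliberately simple reading of that step: a greedy placement that excludes tiles flagged by a fault AIM or too hot according to a temperature AIM, and places each task close to the tasks it communicates with. The greedy heuristic, the Manhattan-distance cost and all names are assumptions; the page does not describe the actual algorithm.

```python
def remap(tasks, edges, tiles, faulty, temperature, max_temp):
    """Greedy placement of one task per tile under two AIM constraints.

    tasks: task names in placement order; edges: (task, task) pairs;
    tiles: list of (x, y); faulty: set of tiles from the fault AIM;
    temperature: {tile: degrees} from the temperature AIM.
    """
    usable = [t for t in tiles
              if t not in faulty and temperature.get(t, 0.0) <= max_temp]
    placement = {}
    for task in tasks:
        free = [t for t in usable if t not in placement.values()]
        if not free:
            raise RuntimeError(f"no usable tile left for {task}")
        # Tiles of already-placed tasks that communicate with this one.
        peers = [placement[b if a == task else a]
                 for a, b in edges
                 if task in (a, b) and (b if a == task else a) in placement]

        def cost(tile):
            # Total Manhattan distance to peers, a rough proxy for NoC hops.
            return sum(abs(tile[0] - p[0]) + abs(tile[1] - p[1])
                       for p in peers)

        placement[task] = min(free, key=cost)
    return placement


tiles = [(x, y) for x in range(2) for y in range(3)]
placement = remap(tasks=["src", "dec", "sink"],
                  edges=[("src", "dec"), ("dec", "sink")],
                  tiles=tiles,
                  faulty={(0, 1)},
                  temperature={(1, 1): 95.0},
                  max_temp=85.0)
```

The relinking/reloading step, which turns such a placement into a new binary and routing tables, is outside the scope of this sketch.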


Task 3.b : Multi-constrained distributed remapping scenario, Task manager : LIRMM, Partners : LIRMM, LETI. This scenario is based on a distributed remapping strategy where each processing resource in the MP2SoC performs all three phases of the self-adaptation scheme in a continuous manner. This scenario therefore triggers remapping orders asynchronously (when a local Boolean condition becomes true, for instance) with respect to the other units or clusters, which also undergo the same process. Each of these operates according to one of the following principles: 1) A local-only AIM database drives remapping decisions. Since remapping orders are issued locally, a lot of effort has to be put into researching adequate remapping policies, specifically because both migration and replication operations are considered for this phase. 2) A global, periodically broadcast instant-map system view guides the remapping process. In this specific case we assume that this global information is potentially outdated. Here again, both migration and replication are considered. 3) An intermediate solution between centralized and fully distributed will also be explored: based on a globally issued placement, some regions will undergo locally decided remapping in order to perform local placement refinements. For this specific solution, only task migrations are considered.
T0+12 → T0+36
Status : not achieved yet
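
For the local-only principle of task 3.b, each processing resource evaluates a local Boolean condition and issues a remapping order on its own. The Python sketch below assumes that the condition is a load threshold and that the order is a migration towards the least loaded direct neighbour; the thresholds and names are hypothetical.

```python
def local_decision(my_load, neighbour_loads, high=0.8, margin=0.2):
    """Return the neighbour to migrate a task to, or None.

    Only local information is used: this tile's load and the loads its
    direct neighbours last reported (which may be stale).
    """
    if my_load <= high or not neighbour_loads:
        return None                      # local condition is false
    target = min(neighbour_loads, key=neighbour_loads.get)
    # Migrate only if the neighbour is clearly less loaded.
    if my_load - neighbour_loads[target] >= margin:
        return target
    return None
```

Because every tile runs this independently, two overloaded tiles may select the same neighbour at the same time, which is one concrete form of the conflicting decisions that a fully distributed scheme has to handle.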


Task 3.c : Mutualized System-level Case studies, Task manager : LIRMM, Partners : ALL. In order to fairly benchmark the proposed remapping solutions, both centralized and distributed remapping scenarios will be validated on four already existing multi-threaded applications. Applications range from a simple MJPEG application to a complete fourth-generation mobile telecommunication system, 3GPP-LTE (Long Term Evolution). H264 and MP3 decoding applications will also be studied. This will help in clearly quantifying the benefits of the proposed approaches, as these applications are representative of the real applications that MP2SoCs will soon have to run. More precisely, the partners will set up a result table with 2 rows (1/centralized, 2/distributed) and four columns (MJPEG, MP3, H264, 3GPP-LTE) and identify scientifically, for each table cell, the pros and cons of the remapping scenario.
T0+18 → T0+36
Status : not achieved yet
Last modified on Jun 16, 2008, 6:46:08 PM