Version 14 (modified by fpecheux, 17 years ago)

--

The project structure

Online monitoring, or instrumentation, consists of adding software and/or hardware probes and sensors to the running architecture in order to detect and collect events corresponding to one or more physical phenomena occurring in the MP2SoC (temperature, power consumption or processor workload reaching a threshold, the contents of a hardware register or of a variable differing from the expected value, etc.). Monitoring is intended to be a non-intrusive stage that simply reads analog, digital, and software sensors and stores the results in local memories.

Online diagnosis is the stage responsible for thorough analysis once events corresponding to an alteration or malfunction have been detected in the previous stage. This stage interprets and formats the raw results, logs them into efficient data structures such as databases, and manages their history. Diagnosis also performs intrusive tests, such as functional or structural tests on IPs, computes an annotated representation of the running architecture, and finally builds a database of audited architecture views. These views, or maps, represent an instant picture of the architecture showing the exact physical locations where the analyzed phenomenon occurs. In other words, an event map represents the audited architecture with respect to the monitored event. Because it is intrusive, the diagnosis stage generally suspends or simply stops the running application. For instance, for the structural test of a component, the running application must be stopped and entirely replaced by the test application.

Online constrained application remapping exploits the database of event maps, and possibly its history, to determine how and under what conditions the application graph can be remapped onto the architecture. The instant map is used to constrain the placement of the monitored application graph. Different placement strategies for the application graph are possible, ranging from a centralized scheme that statically assigns threads to processors once and for all, to a distributed and dynamic placement algorithm that allows task migration/replication and local optimization.
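As an illustration only, the centralized end of that range can be sketched as a greedy placement in which the tiles flagged by the instant map are excluded before threads are assigned. The function name, the abstract load and capacity units, and the tile names are assumptions made for this sketch; they do not come from the ADAM proposal, which leaves the actual placement strategies to WP3.

```python
def place_threads(thread_loads, tile_capacity, unusable_tiles):
    """Greedy centralized placement (illustrative sketch, not the ADAM method).

    thread_loads:   {thread name: load}        -- hypothetical units
    tile_capacity:  {tile name: capacity}      -- hypothetical units
    unusable_tiles: tiles the instant map marks as unavailable
    """
    # The instant map constrains placement: flagged tiles are never candidates.
    remaining = {t: c for t, c in tile_capacity.items() if t not in unusable_tiles}
    placement = {}
    # Heaviest thread first; ties are broken by thread name for determinism.
    for thread, load in sorted(thread_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        if not remaining:
            raise ValueError("no usable tile left")
        # Pick the usable tile with the most remaining capacity.
        tile = max(sorted(remaining), key=lambda t: remaining[t])
        if remaining[tile] < load:
            raise ValueError(f"thread {thread!r} does not fit on any usable tile")
        placement[thread] = tile
        remaining[tile] -= load
    return placement


placement = place_threads(
    thread_loads={"decode": 60, "filter": 30, "output": 20},
    tile_capacity={"t0": 100, "t1": 100, "t2": 100},
    unusable_tiles={"t1"},  # e.g. flagged as faulty or overheating in the map
)
```

A distributed scheme would replace the single global loop with local decisions taken per tile; that variant is not sketched here.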

The ADAM project addresses a major part of the issues related to MP2SoC self-adaptability and aims at determining the common hardware and software mechanisms needed for the three stages. For the sake of readability, each of these stages has been assigned a work package in the proposal. CEA-LETI is responsible for the “online monitoring” work-package, LIP6 for the “online diagnosis” work-package, and LIRMM for the “online constrained application remapping” work-package. As a proof of concept, and to validate the whole work, three distinct applications will be mapped onto the three hardware architectures maintained by the partners: a 3GPP-LTE telecom application, an H264 decoder, and an MP3 decoder.

ADAM Workpackages

The following figure shows a synthetic view of the ADAM project. The project is composed of three Work-Packages (WP). The first one, WP1, is dedicated to online non-intrusive monitoring; the second one, WP2, addresses the problem of online diagnosis and event database management; and the third one, WP3, deals with constraint-driven application remapping.

WP1 contains 3 tasks: the first one, 1a), is dedicated to performance measurement; the second one, 1b), will give figures for power consumption, temperature and voltage; and the third one, 1c), addresses fault detection techniques. The information provided by these 3 tasks is then gathered in the Distributed Raw Event Tables (DRET), which are tables of “first-level” measurements distributed among the different processing tiles of the architecture. The objective of this WP is to obtain a complete set of information from which a correct diagnosis of the architecture can be made.

WP2 takes as input the DRET from WP1, and its objective is to obtain a consolidated database of multi-parameter Architecture Instant Maps (AIM), called AIM-DB. This database will follow a formalism that enables the different application remapping scenarios developed in WP3. To meet this objective, 4 tasks are defined: 2a) provides access to the database and history management when needed, 2b) periodically analyses the DRET to provide inputs to the AIM, 2c) triggers alerts when important events from the DRET are detected, and 2d) performs some intrusive diagnosis/test tasks after alerts have been triggered.

AIM-DB is the direct input of WP3. The objective of this WP is to dynamically adapt the application using this knowledge of the architecture's present state with respect to the monitored events. The outputs are the different methods that will be developed to achieve this self-adaptability, together with the associated software code and hardware developments. Task 3a) will study a centralized remapping scenario in which the information found in AIM-DB is exploited globally to perform application remapping and binary relinking/reloading, while task 3b) will examine the distributed remapping scenario, which takes advantage of AIM-DB to issue local or global remapping orders. Finally, task 3c) defines a common set of example applications that will be used to validate the different concepts.

WP1 : On-line Non-Intrusive Monitoring

wp1.png

In this first WP, the general objective is to provide monitoring capabilities to platform-based SoCs. In this project, monitoring means measuring different characteristics of the cores while the application is running. These measurements are made continuously and in a non-intrusive way (no modification of the initial application). A primary requirement for the added monitoring elements is that they perturb the initial structure as little as possible. The studied platform is composed of tiles interconnected by a Network on Chip (NoC). The tiles can embed simple processors (such as a RISC R3000 or a LEON SPARC V8 processor), complex clusters (multiple cores with their associated memories), or even powerful hardware accelerators (probably reconfigurable cores), with a possible mix of these categories. We only assume that the platform can be composed of SW and HW elements sharing the same set of communication primitives. Another important point is that the platform is assumed to contain several hundred of these different tiles.

Software and hardware monitoring for performance, power/voltage/temperature and fault detection act as first-level instrumentation tasks, and are studied in WP tasks 1.a, 1.b and 1.c. They deliver raw monitored digital information either periodically or permanently (online behaviour). This information is then modelled in the form of classified events in task 1.d. The event model specifies the event format, which is fixed for the architecture. Formatted events are stored in the local memories of the architecture tiles, in fixed-size cyclic buffers designed to be easily accessible. Taken together, the disseminated buffers form the “Distributed Raw Event Tables” (DRET), available at all times within the architecture. The monitoring capabilities are summarized in figure 4.
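The fixed-size cyclic buffer idea can be sketched as follows, as a software model only. The event fields, the class names and the buffer capacity are assumptions made for illustration; the real event format is the subject of task 1.d and is not specified here.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class RawEvent:
    """Hypothetical fixed-format event; the actual fields are defined in task 1.d."""
    timestamp: int
    kind: str    # e.g. "perf", "pvt", "fault" (assumed categories)
    value: int


class TileEventBuffer:
    """Fixed-size cyclic buffer modelling one tile's share of the DRET."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entry when full,
        # which is the overwrite behaviour of a cyclic buffer.
        self._events = deque(maxlen=capacity)

    def log(self, event):
        self._events.append(event)

    def snapshot(self):
        """Return the stored events, oldest first, without consuming them."""
        return list(self._events)


# The DRET is then the collection of per-tile buffers, keyed here by tile position.
dret = {(x, y): TileEventBuffer(capacity=4) for x in range(2) for y in range(2)}
dret[(0, 1)].log(RawEvent(timestamp=0, kind="perf", value=42))
```

Overwriting the oldest entries keeps memory use constant on each tile, at the cost of losing events that the diagnosis stage does not read in time.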

Task 1.a : Performance measurement, Task manager : LETI Partners : LETI, LIRMM. In this task, the performance of the tile is monitored. The difficulty is to meet the minimum-perturbation requirement. We propose to develop two mechanisms. The first one is SW oriented and consists in measuring, periodically or on-line, the workload of the processors as well as their communication workloads. The Network Interface of the NoC will provide a generic way to monitor incoming and outgoing throughput on-line. The second mechanism is HW oriented and consists in probing some chosen critical paths. The advantage of this kind of monitoring is that it is non-intrusive, but the difficulty is gaining access to the data paths or the control part. Both HW and SW solutions will be studied and compared in this task.
T0 → T0+18
Status : not achieved yet


Task 1.b : PVT management, Task manager : LETI Partners : LETI, LIP6. The objective of this task is to monitor physical information such as temperature, voltage and power consumption. This can be obtained by direct measurement, with on-site temperature sensors for example, or by indirect measurement, using SW load evaluation and equivalence tables. Because of parameter dispersion across the chip in nanotechnologies, on-site HW sensors will probably be necessary. Nevertheless, indirect measurement will add another dimension and help the diagnosis phase. Both techniques will be studied and evaluated in this task. Some of the chosen techniques will also be implemented.
T0 → T0+24
Status : not achieved yet
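The indirect measurement mentioned in task 1.b (SW load evaluation combined with equivalence tables) can be sketched as a table lookup with linear interpolation. The table contents below are placeholder numbers invented for the example, not measurements, and the function name is hypothetical.

```python
from bisect import bisect_right

# Hypothetical equivalence table: (processor load in %, estimated power in mW).
# These values are illustrative placeholders only.
LOAD_TO_POWER_MW = [(0, 50.0), (50, 120.0), (100, 200.0)]


def estimate_power_mw(load_percent, table=LOAD_TO_POWER_MW):
    """Estimate power from measured load by interpolating in the table."""
    if not table[0][0] <= load_percent <= table[-1][0]:
        raise ValueError("load outside the range covered by the table")
    loads = [load for load, _ in table]
    i = bisect_right(loads, load_percent)
    if i == len(table):
        # Exactly on the last table entry.
        return table[-1][1]
    (l0, p0), (l1, p1) = table[i - 1], table[i]
    return p0 + (p1 - p0) * (load_percent - l0) / (l1 - l0)
```

Such an estimate ignores chip-level parameter dispersion, which is why the text expects on-site HW sensors to remain necessary alongside it.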


Task 1.c : HW fault detection, Task manager : LETI Partners : All. Nanotechnologies make it increasingly difficult to ensure the correct behavior of the tiles, and of the interconnects between tiles, over the chip lifetime. Fault detection is therefore becoming a mandatory feature of future architectures. The objective of this task is to evaluate some of the possible HW and SW techniques, such as on-line or periodic testing, BIST, software CRC or software security survey tasks. As the field of research is very vast, and as the project does not aim at full protection against faults, only a few techniques will be implemented. The objective is to add this dimension to the event table, because of its importance.
T0 → T0+24
Status : not achieved yet
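Of the techniques listed in task 1.c, software CRC is the simplest to illustrate: a reference signature of a memory block is stored, then recomputed periodically and compared. The sketch below uses the CRC-32 provided by Python's standard `zlib` module; the function names and the choice of CRC-32 are assumptions, since the proposal does not say which CRC would be used.

```python
import zlib


def crc_signature(block: bytes) -> int:
    """CRC-32 of a memory block (polynomial choice is an assumption)."""
    return zlib.crc32(block) & 0xFFFFFFFF


def block_is_intact(block: bytes, reference_crc: int) -> bool:
    """Periodic check: recompute the signature and compare with the reference."""
    return crc_signature(block) == reference_crc


# Reference signature taken once, while the block is known to be good.
code_segment = bytes(range(64))
reference = crc_signature(code_segment)

# Later, a single flipped bit is detected by the periodic check.
corrupted = bytearray(code_segment)
corrupted[10] ^= 0x01
fault_detected = not block_is_intact(bytes(corrupted), reference)
```

A detected mismatch would then be logged as a fault event in the raw event table rather than handled on the spot.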


Task 1.d : Event shaping and logging, Task manager : LETI Partners : All. The final objective of all the tasks in this WP is to obtain a raw event table covering all the measured features of the architecture. The objective of this task is to determine the format of this table and how it can be accessed (how to write it, and how and when to read it). Both the format and the access method should be kept as simple as possible so that they can be exploited by different diagnosis systems, and not only those provided in this project. In that way, these first-level (or preliminary) results can be used and disseminated independently.
T0 → T0+24
Status : not achieved yet

WP2 : On-line Diagnosis

wp2.png

This work-package is dedicated to online diagnosis. It aims at building and managing a database of relevant cartographies of the MP2SoC from the Distributed Raw Event Tables elaborated in WP1. In ADAM, a cartography for a given event type is called an Architecture Instant Map (AIM). An AIM is a collection of relevant events extracted from the DRET, or of newly aggregated and computed events, with their exact physical locations and characteristics at a given time.
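A minimal sketch of how an AIM for one event type might be derived from the DRET is given below. It assumes, purely for illustration, that each tile is identified by its (x, y) position, that raw events are `(timestamp, kind, value)` tuples, and that the map keeps the most recent value per tile within a time window; none of these choices is specified by the proposal.

```python
def build_aim(dret, kind, now, window):
    """Build an instant map for one event type (illustrative sketch).

    dret:   {(x, y): [(timestamp, kind, value), ...]}  -- assumed layout
    kind:   event type the map is built for, e.g. "temp"
    now:    time at which the map is taken
    window: how far back an event may lie and still count as current
    """
    aim = {}
    for tile, events in dret.items():
        recent = [
            (ts, value)
            for ts, k, value in events
            if k == kind and now - window <= ts <= now
        ]
        if recent:
            # Keep the latest matching value, attached to its physical location.
            latest_ts, latest_value = max(recent, key=lambda e: e[0])
            aim[tile] = latest_value
    return aim


dret = {
    (0, 0): [(10, "temp", 70), (20, "temp", 85), (21, "load", 40)],
    (0, 1): [(5, "temp", 60)],
    (1, 0): [(19, "load", 90)],
}
temperature_map = build_aim(dret, kind="temp", now=20, window=10)
```

Tiles with no recent event of the requested type are simply absent from the map; an aggregated AIM could instead store an average or a maximum per tile.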

In other words, WP2 uses the local information found in the Distributed Raw Event Tables (DRET) as an entry point to compute efficient data structures that can be used appropriately during the next phase, constrained remapping. Maintaining at all times a database of AIMs, called the AIM Consolidated Database, together with their histories can be of great interest for multi-criteria mapping and optimization.

This work-package is composed of four tasks, 2a to 2d. All these tasks collaborate to define the interoperable AIM database and its access mechanisms (task 2d). More precisely, task 2a is responsible for performing statistical analysis on raw or archived events. Task 2b is responsible for triggering actions when a set of events matches a certain condition, and task 2c defines one of these actions, online functional/structural testing. When dealing with fault detection, this work-package performs intrusive monitoring in the form of functional/structural tests to determine the deficient hardware resources as accurately as possible.
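The triggering mechanism described above (an action fires when a set of events matches a condition) can be sketched as a small rule table. The `Rule` structure, the event tuple layout and the example threshold are assumptions for illustration; the proposal does not define how conditions or actions are expressed.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Event = Tuple[int, str, int]  # (timestamp, kind, value): assumed layout


@dataclass
class Rule:
    """Hypothetical trigger rule: a condition over events and an action."""
    name: str
    condition: Callable[[List[Event]], bool]
    action: Callable[[], None]


def evaluate(rules, events):
    """Run every rule against the events; return the names of those that fired."""
    fired = []
    for rule in rules:
        if rule.condition(events):
            rule.action()
            fired.append(rule.name)
    return fired


requested_actions = []


def two_hot_readings(events):
    # Fires when at least two temperature events reach the (assumed) threshold.
    return sum(1 for _, kind, value in events if kind == "temp" and value >= 90) >= 2


rules = [
    Rule(
        name="overheat",
        condition=two_hot_readings,
        # The action here only records a request for an online structural test.
        action=lambda: requested_actions.append("structural_test"),
    )
]
fired = evaluate(rules, [(1, "temp", 91), (2, "temp", 95), (3, "load", 99)])
```

Keeping conditions and actions as separate callables lets new alert types be added without changing the evaluation loop.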


Task 2.a : Statistical analysis, Task manager : LIRMM Partners : LIRMM. Some of the remapping strategies presented later in WP3, or a sequence of them applied over time, may have effects on application performance that are hard to predict. In order to keep track of the mid- or long-term consequences of those remapping decisions, a statistical analysis of the DRET and AIM databases will be performed. This information may later help refine the decision-making policy when, for instance, a previous task migration order led to a worse global solution.
T0+12 → T0+24
Status : not achieved yet
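One simple form such an analysis could take is a comparison of a performance metric before and after each remapping order, flagging the orders that made things worse. The history layout, the "higher is better" metric and the function name are assumptions for this sketch; the proposal does not state which statistics will be computed.

```python
from statistics import mean


def regressions(history):
    """Return the remapping orders followed by a drop in mean performance.

    history: list of (order_id, samples_before, samples_after), where the
    samples are values of a performance metric for which higher is better
    (assumed convention).
    """
    worse = []
    for order_id, before, after in history:
        delta = mean(after) - mean(before)
        if delta < 0:
            worse.append((order_id, delta))
    return worse


history = [
    ("migrate-1", [30, 32, 31], [28, 27, 29]),  # performance dropped
    ("migrate-2", [20, 22], [25, 27]),          # performance improved
]
flagged = regressions(history)
```

The flagged orders are the cases the text refers to, where a previous task migration led to a worse global solution and the decision policy may need refining.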