The project structure
Online monitoring, or instrumentation, consists of adding software and/or hardware probes/sensors to the running architecture in order to detect and collect events corresponding to one or several physical phenomena occurring in the MP2SoC (temperature, power consumption or processor workload reaching a threshold, the contents of a hardware register or of a variable differing from the expected value, etc.). Monitoring aims to be a non-intrusive stage that essentially reads analog, digital, and software sensors and stores the results in local memories.
Online diagnosis is the stage responsible for thorough analysis once events corresponding to an alteration or malfunction have been detected in the previous stage. This stage interprets and formats the raw results, logs them into efficient data structures such as databases, and manages their history. Diagnosis also performs intrusive tests, such as functional or structural tests on IPs, computes an annotated representation of the running architecture, and finally builds a database of audited architecture views. These views, or maps, represent an instant picture of the architecture showing the exact physical locations where the analyzed phenomenon occurs. In other words, an event map represents the audited architecture with respect to the monitored event. Because it is intrusive, the diagnosis stage generally suspends or simply stops the running application. For instance, in the case of the structural test of a component, the running application must be stopped and totally replaced by the test application.
Online constrained application remapping exploits the database of event maps, and possibly its history, to determine how and under what conditions the application graph can be remapped to the architecture. The instant map is used to constrain the placement of the monitored application graph. Different placement strategies for the application graph are possible, from a centralized scheme, which statically assigns threads to processors once and for all, to a distributed and dynamic placement algorithm that allows task migration/replication and local optimization.
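As a rough illustration of the centralized end of this spectrum, the sketch below places threads on tiles greedily while treating an instant map as a constraint. The map layout (a dictionary with per-tile `faulty` and `temperature` fields), the temperature limit, and the name `place_threads` are assumptions made for this example only; they are not taken from the ADAM proposal.

```python
from typing import Dict


def place_threads(threads: Dict[str, float],
                  instant_map: Dict[int, dict],
                  max_temperature: float = 85.0) -> Dict[str, int]:
    """Greedy, centralized placement of threads onto tiles.

    threads:     thread name -> estimated load (arbitrary units, hypothetical)
    instant_map: tile id -> {"faulty": bool, "temperature": float}
                 (hypothetical format for one instant map)
    """
    # Constraint step: discard tiles the map reports as faulty or too hot.
    usable = [tile for tile, state in instant_map.items()
              if not state["faulty"] and state["temperature"] <= max_temperature]
    if not usable:
        raise RuntimeError("no usable tile in the instant map")

    load = {tile: 0.0 for tile in usable}
    placement = {}
    # Heaviest threads first, each onto the currently least-loaded usable tile
    # (ties are broken by the lowest tile id).
    for name, cost in sorted(threads.items(), key=lambda kv: kv[1], reverse=True):
        target = min(usable, key=lambda tile: (load[tile], tile))
        placement[name] = target
        load[target] += cost
    return placement


example_map = {
    0: {"faulty": False, "temperature": 60.0},
    1: {"faulty": True, "temperature": 55.0},   # excluded: faulty
    2: {"faulty": False, "temperature": 95.0},  # excluded: too hot
    3: {"faulty": False, "temperature": 70.0},
}
example_placement = place_threads(
    {"decode": 3.0, "filter": 2.0, "output": 1.0}, example_map)
```

A distributed scheme would replace the single global loop with local decisions taken per tile or per neighbourhood, which this sketch does not attempt to model.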
The ADAM project addresses a major part of the issues related to MP2SoC self-adaptability and aims at determining the common hardware and software mechanisms needed for the three stages. For the sake of readability, each of these stages has been assigned its own work package in the proposal. CEA-LETI is responsible for the “online monitoring” work-package, LIP6 is responsible for the “online diagnosis” work-package and LIRMM for the “online constrained application remapping” work-package. As a proof of concept, and to validate the whole work, three distinct applications will be mapped onto the three hardware architectures maintained by each partner: a 3GPP-LTE telecom application, an H.264 decoder, and an MP3 decoder.
The following figure shows a synthetic view of the ADAM project. The project is composed of 3 Work-Packages (WP). The first one, WP1, is dedicated to online non-intrusive monitoring; the second one, WP2, addresses the problem of online diagnosis and event database management; and the third one, WP3, deals with constraint-driven application remapping.
WP1 contains 3 tasks: the first one, 1a), is dedicated to performance measurement; the second one, 1b), will give figures for power consumption, temperature and voltage; and the third one, 1c), addresses fault detection techniques. The information provided by these 3 tasks is then gathered in the Distributed Raw Event Tables (DRET), which are tables of “first-level” measurements distributed among the different processing tiles of the architecture. The objective of this WP is to obtain a complete set of information that allows a correct diagnosis of the architecture.
WP2 takes as input the DRET from WP1, and its objective is to obtain a Consolidated Database of multi-parameter Architecture Instant Maps (AIM), called AIM-DB. This database will have a formalism that enables the different application remapping scenarios developed in WP3. To meet the WP2 objective, 4 tasks are defined: 2a) provides access to the database and manages its history when needed, 2b) periodically analyses the DRET to give inputs to the AIM, 2c) triggers alerts when important events are detected in the DRET, and 2d) performs some intrusive diagnosis/test tasks after alerts have been triggered.
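The periodic analysis and alerting steps can be pictured with a small sketch. It assumes a hypothetical DRET layout (tile id mapped to a list of `(kind, value)` raw events), hypothetical alert thresholds, and a consolidation rule that keeps the largest value seen per tile and event kind. None of these choices come from the proposal; they only illustrate the data flow from DRET to an instant map plus alerts.

```python
from collections import defaultdict

# Hypothetical alert thresholds per event kind; real values would be
# platform-specific.
ALERT_THRESHOLDS = {"temperature": 90.0, "fault": 1.0}


def consolidate(dret):
    """Merge per-tile raw event tables into one instant map plus alerts.

    dret: tile id -> list of (kind, value) raw events (hypothetical format).
    Returns (aim, alerts), where aim maps tile id -> {kind: largest value seen}
    and alerts is a list of (tile, kind, value) tuples.
    """
    aim = defaultdict(dict)
    alerts = []
    for tile, events in dret.items():
        for kind, value in events:
            # Keep the largest value observed for this tile and event kind.
            previous = aim[tile].get(kind)
            if previous is None or value > previous:
                aim[tile][kind] = value
            # Raise an alert when a known threshold is reached.
            threshold = ALERT_THRESHOLDS.get(kind)
            if threshold is not None and value >= threshold:
                alerts.append((tile, kind, value))
    return dict(aim), alerts


example_dret = {
    0: [("temperature", 62.0), ("temperature", 65.5)],
    1: [("temperature", 93.0), ("workload", 0.8)],
}
example_aim, example_alerts = consolidate(example_dret)
```

In this picture, the returned alerts are what would trigger the intrusive diagnosis of task 2d), and successive instant maps are what the history management of task 2a) would store.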
AIM-DB is the direct input of WP3. The objective of this WP is to dynamically adapt the application using this knowledge of the architecture's present state with respect to monitored events. The outputs are the different methods that will be developed to achieve this self-adaptability, and the associated software code and hardware developments. Task 3a) will study a centralized remapping scenario where the information found in AIM-DB is exploited globally to perform application remapping and binary relinking/reloading, while task 3b) will examine the distributed remapping scenario, which takes advantage of AIM-DB to issue local or global remapping orders. Finally, task 3c) defines a common set of example applications that will be used for the validation of the different concepts.
WP1 : On-line Non-Intrusive Monitoring
In this first WP, the general objective is to provide monitoring capabilities to platform-based SoCs. In this project, monitoring means the measurement of different characteristics of the cores while the application is running. These measurements are done continuously in a non-intrusive way (no modification of the initial application). A primary requirement for the added monitoring elements is that they cause minimal perturbation of the initial structure. The studied platform is composed of tiles interconnected by a Network on Chip (NoC). The tiles can embed simple processors (such as a RISC R3000 or a LEON SPARCV8 processor), complex clusters (multiple cores with their associated memories), or even powerful hardware accelerators (probably reconfigurable cores), with a possible mix of these categories. In fact, we only assume that the platform is composed of SW and HW elements sharing the same set of communication primitives. Another important point is that the platform is assumed to be composed of some hundreds of these different tiles.
Software and hardware monitoring for performance, power/voltage/temperature and fault detection act as first-level instrumentation tasks, and are studied in WP tasks 1.a, 1.b and 1.c. They deliver raw monitored digital information on a periodic basis or permanently (online behaviour). This information is then modelled in the form of classified events in task 1.d. The event model specifies the event format, which is fixed for the architecture. Formatted events are stored in the local memories of the architecture tiles, in fixed-size cyclic buffers designed to be easily accessible. Altogether, the disseminated buffers represent the “Distributed Raw Event Tables” (DRET), available at all times within the architecture. The monitoring capabilities are summarized in figure 4.
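A fixed-size cyclic buffer of formatted events can be sketched as follows. The event fields (`timestamp`, `kind`, `value`) and the class name `RawEventTable` are assumptions chosen for illustration; the actual event format is the one to be defined by the project's event model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical fixed event format: when, what, and how much."""
    timestamp: int
    kind: str
    value: float


class RawEventTable:
    """Fixed-size cyclic buffer: once full, new events overwrite the oldest."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots: List[Optional[Event]] = [None] * capacity
        self._next = 0    # index of the slot that the next push will write
        self._count = 0   # number of valid events currently stored

    def push(self, event: Event) -> None:
        self._slots[self._next] = event
        self._next = (self._next + 1) % len(self._slots)
        self._count = min(self._count + 1, len(self._slots))

    def snapshot(self) -> List[Event]:
        """Return the stored events from oldest to newest."""
        capacity = len(self._slots)
        start = (self._next - self._count) % capacity
        return [self._slots[(start + i) % capacity] for i in range(self._count)]
```

The memory footprint per tile stays constant whatever the event rate, at the cost of losing the oldest events if the diagnosis stage does not read the table often enough.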
Task 1.a : Performance measurement
Task manager : LETI
Partners : LETI, LIRMM
In this task, the performance of the tile is monitored. The difficulty is to meet the minimum perturbation requirement. We propose to develop two mechanisms. The first one is SW oriented and consists of measuring, periodically or on-line, the workload of the processors as well as that of their communications. The Network Interface of the NoC will provide a generic way to monitor in/out throughput on-line. The second mechanism is HW oriented and consists of probing some chosen critical paths. The advantage of this kind of monitoring is that it is non-intrusive, but the difficulty is to gain access to the data paths or the control part. Both HW and SW solutions will be studied and compared in this task.
T0 → T0+18
Status : not achieved yet
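One possible form of the SW-oriented throughput measurement is to read a free-running byte counter at regular intervals and derive the throughput from successive differences. The sketch below assumes such a counter exists in the Network Interface and wraps around at a fixed width; both the counter and the `ThroughputMonitor` class are hypothetical.

```python
class ThroughputMonitor:
    """Derives throughput from periodic reads of a free-running byte counter."""

    def __init__(self, counter_bits: int = 32):
        self._modulus = 1 << counter_bits  # the counter wraps at 2**counter_bits
        self._last = None                  # previous counter reading, if any

    def sample(self, counter_value: int, period_cycles: int):
        """Return bytes per cycle since the previous sample.

        Returns None for the first sample, since no difference exists yet.
        """
        if period_cycles <= 0:
            raise ValueError("period_cycles must be positive")
        if self._last is None:
            self._last = counter_value
            return None
        # The modulo makes the difference correct across one counter wrap-around.
        delta = (counter_value - self._last) % self._modulus
        self._last = counter_value
        return delta / period_cycles
```

The sampling period must be short enough that the counter cannot wrap more than once between two reads; otherwise the computed difference is ambiguous.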
Task 1.b : PVT management
Task manager : LETI
Partners : LETI, LIP6
The objective of this task is the monitoring of physical information, such as temperature, voltage and power consumption. This information can be obtained by direct measurement, with on-site temperature sensors for example, or by indirect measurement, using SW load evaluation and equivalence tables. Because of parameter dispersion across the chip in nanotechnologies, HW on-site sensors will probably be necessary. Nevertheless, indirect measurement will add another dimension and help the diagnosis phase. These two techniques will be studied and evaluated in this task. Some of the chosen techniques will also be implemented.
T0 → T0+24
Status : not achieved yet
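Indirect measurement through an equivalence table can be illustrated with a lookup that interpolates linearly between calibrated points. The table values below are invented for the example and carry no physical meaning; a real table would come from characterization of the target platform.

```python
from bisect import bisect_right

# Hypothetical equivalence table: (workload fraction, estimated power in mW).
# Entries must be sorted by workload and start at 0.0.
LOAD_TO_POWER_MW = [(0.0, 20.0), (0.5, 70.0), (1.0, 150.0)]


def estimate_power_mw(load: float, table=LOAD_TO_POWER_MW) -> float:
    """Estimate power from a measured workload by linear interpolation."""
    if not 0.0 <= load <= 1.0:
        raise ValueError("load must be within [0.0, 1.0]")
    loads = [point[0] for point in table]
    i = bisect_right(loads, load)   # index of the first entry with load > input
    if i == len(table):
        return table[-1][1]         # input equals the last calibrated load
    (x0, y0), (x1, y1) = table[i - 1], table[i]
    return y0 + (y1 - y0) * (load - x0) / (x1 - x0)
```

Such an estimate costs only a table read and a few arithmetic operations, which is why it can complement on-site sensors rather than replace them.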
Task 1.c : HW fault detection
Task manager : LETI
Partners : All
Nanotechnologies make it increasingly difficult to ensure correct behavior of the tiles and of the interconnects between tiles over the chip lifetime. Fault detection is therefore becoming a mandatory feature of future architectures. The objective of this task is to evaluate some of the possible HW and SW techniques, such as on-line or periodic testing, BIST, software CRC or software security survey tasks. As the field of research is very vast, and as full protection against faults is not the aim of the project, only a few techniques will be implemented. The objective is to add this dimension to the event table because of its importance.
T0 → T0+24
Status : not achieved yet
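Among the techniques listed, a software CRC is the simplest to sketch: a reference checksum is stored for a memory block that should not change, and a periodic task recomputes it to detect corruption. The example below uses the standard `zlib.crc32` function; the `GuardedBlock` class and the idea of guarding a constant table are assumptions for illustration, not a technique selected by the project.

```python
import zlib


def crc_of(block: bytes) -> int:
    """CRC-32 of a block, masked to an unsigned 32-bit value."""
    return zlib.crc32(block) & 0xFFFFFFFF


class GuardedBlock:
    """A memory block paired with a reference CRC taken at creation time."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.reference_crc = crc_of(bytes(self.data))

    def is_intact(self) -> bool:
        """Recompute the CRC and compare it with the stored reference."""
        return crc_of(bytes(self.data)) == self.reference_crc
```

A failed check would be reported as a fault event in the local event table; the check itself only detects the corruption and does not locate or repair it.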