1 | \section{Project context} |
---|
2 | \hspace{2cm}\begin{scriptsize}\begin{verbatim} |
---|
3 | % 1. CONTEXT AND POSITIONING OF THE PROJECT |
---|
4 | % (1 page maximum) General presentation of the problem that the project proposes to address |
---|
5 | % and of the working framework (fundamental research, industrial research or |
---|
6 | % experimental development). |
---|
7 | \end{verbatim} |
---|
8 | \end{scriptsize} |
---|
9 | An embedded system is an application integrated into one or several chips |
---|
10 | in order to accelerate it or to embed it into a small device such as a personal |
---|
11 | digital assistant (PDA). |
---|
12 | This topic has been investigated since the 1980s using Application-Specific Integrated Circuits (ASIC), |
---|
13 | Digital Signal Processors (DSP) and parallel computing on multiprocessor machines or networks. |
---|
14 | More recently, since the end of the 1990s, other technologies have appeared, such as Very Long Instruction Word (VLIW), |
---|
15 | Application-Specific Instruction-Set Processors (ASIP), System on Chip (SoC) and |
---|
16 | Multi-Processor SoC (MPSoC). |
---|
17 | \\ |
---|
18 | Over the last decades, embedded system design was reserved to major industrial companies targeting high-volume markets |
---|
19 | due to the design and fabrication costs. |
---|
20 | Nowadays, Field Programmable Gate Arrays (FPGA), such as Virtex5 from Xilinx and Stratix4 from Altera, |
---|
21 | can implement a SoC with multiple processors and several coprocessors for less than 10K euros per device. |
---|
22 | In addition, High Level Synthesis (HLS) is becoming more mature and makes it possible to automate design |
---|
23 | and to drastically decrease its cost in terms of manpower. Thus, both FPGA and HLS tend to spread over |
---|
24 | HPC for small companies targeting low-volume markets. |
---|
25 | \par |
---|
26 | To get an efficient embedded system, the designer has to take into account the application characteristics when |
---|
27 | choosing one of the above technologies. |
---|
28 | This choice is not easy, and in most cases the designer has to try different technologies to retain the |
---|
29 | most suitable one. |
---|
30 | \\ |
---|
31 | The first objective of COACH is to provide an open-source framework to design embedded systems |
---|
32 | on FPGA devices. |
---|
33 | The COACH framework allows the designer to explore various software/hardware partitions of the |
---|
34 | target application, to run timing and functional simulations and to automatically generate both |
---|
35 | the software and the synthesizable description of the hardware. |
---|
36 | The main topics of the project are: |
---|
37 | \begin{itemize} |
---|
38 | \item |
---|
39 | Design space exploration: It consists in analysing the application running on the FPGA, defining the target |
---|
40 | technology (SoC, MPSoC, ASIP, ...) and partitioning tasks between hardware and software depending on |
---|
41 | the technology choice. This exploration is mainly driven by throughput, latency and power consumption |
---|
42 | criteria. |
---|
43 | \item |
---|
44 | Micro-architectural exploration: When hardware components are required, the HLS tools of the framework |
---|
45 | generate them automatically. At this stage the framework provides various HLS tools allowing the |
---|
46 | micro-architectural design space exploration. The exploration criteria are also throughput, latency |
---|
47 | and power consumption. |
---|
48 | % FIXME |
---|
49 | %CA At this stage, preliminary source-level transformations will be |
---|
50 | %CA required to improve the efficiency of the target component. |
---|
51 | %CA COACH will also provide such facilities, such as automatic parallelization |
---|
52 | %CA and memory optimisation. |
---|
53 | \item |
---|
54 | Performance measurement: For each point of the design space, metrics are available |
---|
55 | such as throughput, latency, power consumption, area, memory allocation and data locality. |
---|
56 | They are evaluated using virtual prototyping, estimation or analysis methodologies. |
---|
57 | \item |
---|
58 | Targeted hardware technology: The COACH description of a system is independent of the FPGA family. |
---|
59 | Every point of the design exploration space can be implemented on any FPGA having the required resources. |
---|
60 | Basically, COACH handles both Altera and Xilinx FPGA families. |
---|
61 | \end{itemize} |
---|
62 | As an extension of embedded system design, COACH also deals with High Performance Computing (HPC). |
---|
63 | In HPC, the targeted application is an existing one running on a PC. COACH helps the designer |
---|
64 | to accelerate it by migrating critical parts into a SoC implemented on an FPGA plugged into the PC bus. |
---|
65 | \par |
---|
66 | COACH is the result of the will of several laboratories to combine their know-how and skills in the |
---|
67 | following domains: operating systems and hardware communication (TIMA, SITI), SoC and MPSoC (LIP6 and TIMA), |
---|
68 | ASIP (IRISA) and HLS (LIP6, Lab-STIC and LIP). The project objective is to integrate these various |
---|
69 | domains into a single free framework (licence ...), hiding these domains and their |
---|
70 | different tools from the user as much as possible. |
---|
71 | |
---|
72 | |
---|
73 | \subsection{Economic context and interest} |
---|
74 | \hspace{2cm}\begin{scriptsize}\begin{verbatim} |
---|
75 | % 1.1. ECONOMIC AND SOCIETAL CONTEXT AND ISSUES |
---|
76 | % (2 pages maximum) |
---|
77 | % Describe the economic, social and regulatory context of |
---|
78 | % the project, presenting an analysis of the social, economic, environmental and |
---|
79 | % industrial issues. If possible, give quantified arguments, for example the relevance and |
---|
80 | % scope of the project with respect to economic demand (market analysis, trend |
---|
81 | % analysis), competition analysis, cost-reduction indicators, market |
---|
82 | % prospects (fields of application, etc.). Indicators of environmental gains, life |
---|
83 | % cycle. |
---|
84 | \end{verbatim} |
---|
85 | \end{scriptsize} |
---|
86 | Microelectronics makes it possible to integrate complex functions into products, to increase their |
---|
87 | commercial attractiveness and to improve their competitiveness. The multimedia and communication |
---|
88 | sectors have taken advantage of microelectronics thanks to the development of |
---|
89 | design methodologies and tools for real-time embedded systems. Many other sectors could |
---|
90 | benefit from microelectronics if these methodologies and tools were adapted to their features. |
---|
91 | The Non-Recurring Engineering (NRE) costs involved in designing and manufacturing an ASIC are |
---|
92 | very high: an IC factory costs several billion euros, and fabricating |
---|
93 | a specific circuit costs several million. Consequently, it is generally unfeasible to design and fabricate ASICs in |
---|
94 | low volumes, and ICs are designed to cover a broad spectrum of applications at the cost of |
---|
95 | performance degradation. |
---|
96 | \\ |
---|
97 | Today, FPGAs are becoming important players in the computational domain that was originally dominated |
---|
98 | by microprocessors and ASICs. Just like microprocessors, FPGA-based systems can be reprogrammed |
---|
99 | on a per-application basis. At the same time, FPGAs offer significant performance benefits over |
---|
100 | microprocessor implementations for a number of applications. Although these benefits are still |
---|
101 | generally an order of magnitude less than those of equivalent ASIC implementations, the low cost |
---|
102 | (500 euros to 10K euros), fast time to market and flexibility of FPGAs make them an attractive |
---|
103 | choice for low-to-medium volume applications. |
---|
104 | Since their introduction in the mid-eighties, FPGAs have evolved from a simple, |
---|
105 | low-capacity gate array technology to devices (Altera STRATIX III, Xilinx Virtex V) that |
---|
106 | provide a mix of coarse-grained data path units, memory blocks, microprocessor cores, |
---|
107 | on-chip A/D conversion, and gate counts in the millions. This high logic capacity makes it possible to implement |
---|
108 | complex systems such as multi-processor platforms with application-dedicated coprocessors. |
---|
109 | Using FPGAs limits the NRE costs to the design cost. This boosts the development of methodologies |
---|
110 | and tools to automate design and reduce its cost. |
---|
111 | \par |
---|
112 | Nowadays, there are neither commercial nor free tools covering the whole design process. |
---|
113 | For instance, with SOPC Builder from Altera, users can select and parameterize IP components |
---|
114 | from an extensive drop-down list of communication, digital signal processor (DSP), microprocessor |
---|
115 | and bus interface cores, as well as incorporate their own IP. Designers can then generate |
---|
116 | a synthesized netlist, simulation test bench and custom software library that reflect the hardware |
---|
117 | configuration. |
---|
118 | Nevertheless, SOPC Builder does not provide any facilities to synthesize coprocessors or to |
---|
119 | simulate the platform at a high design level (SystemC). |
---|
120 | In addition, SOPC Builder is proprietary and only works together with Altera's Quartus compilation |
---|
121 | tool to implement designs on Altera devices (Stratix, Arria, Cyclone). |
---|
122 | PICO [CITATION] and CATAPULT [CITATION] make it possible to synthesize coprocessors from a C++ description. |
---|
123 | Nevertheless, they can only deal with data-dominated applications and they do not handle the |
---|
124 | platform level. |
---|
125 | The Xilinx System Generator for DSP [http://www.xilinx.com/tools/sysgen.htm] is a plug-in to |
---|
126 | Simulink that enables designers to develop high-performance DSP systems for Xilinx FPGAs. |
---|
127 | Designers can design and simulate a system using MATLAB and Simulink. The tool will then |
---|
128 | automatically generate synthesizable Hardware Description Language (HDL) code mapped to Xilinx |
---|
129 | pre-optimized algorithms. |
---|
130 | However, this tool targets only DSP-based algorithms. |
---|
131 | \\ |
---|
132 | Consequently, a designer developing an embedded system needs to master, for example, |
---|
133 | SoCLib for design exploration, |
---|
134 | SOPC Builder at the platform level, |
---|
135 | PICO for synthesizing the data-dominated coprocessors |
---|
136 | and Quartus for design implementation. |
---|
137 | This requires a significant tool-interfacing effort and makes the design process very complex |
---|
138 | and achievable only by designers skilled in various domains. |
---|
139 | The COACH project integrates all these tools in the same framework, hiding them from the user. |
---|
140 | The objective is to allow \textbf{pure software} developers to realize embedded systems. |
---|
141 | \par |
---|
142 | % ZIED: MARKET FIGURES, ASIC, EMBEDDED systems, HPC. Number and size of companies doing ES and HPC |
---|
143 | The combination of a framework dedicated to software developers and an FPGA target allows small |
---|
144 | and even very small companies to offer embedded systems and acceleration solutions for standard |
---|
145 | software applications at acceptable prices. |
---|
146 | This avoids the huge hardware investment required by ASIC-based solutions. |
---|
147 | \\ |
---|
148 | The combination of a framework dedicated to software developers and an FPGA target can open new markets |
---|
149 | to small and even very small companies. |
---|
150 | Such markets include HPC (High Performance Computing) and embedded applications. |
---|
151 | HPC consists in offering acceleration solutions for standard software applications at acceptable |
---|
152 | prices, for example DNA sequence recognition or DBMS acceleration. |
---|
153 | Embedded applications consist in implementing an application on a low-power standalone device, |
---|
154 | for example distributed intelligent sensors. |
---|
155 | \\ |
---|
156 | This new market may boom as micro-computing did in the 1980s. That success was due |
---|
157 | to the low cost of the first micro-computers (compared to mainframes) and the advent of high-level |
---|
158 | programming languages that allowed a large number of programmers to launch start-ups in software |
---|
159 | engineering. |
---|
160 | |
---|
161 | \subsection{Project position} |
---|
162 | \hspace{2cm}\begin{scriptsize}\begin{verbatim} |
---|
163 | % 1.2. POSITIONING OF THE PROJECT |
---|
164 | % (2 pages maximum) |
---|
165 | % Specify: |
---|
166 | % - positioning of the project with respect to the context described above: |
---|
167 | % with respect to competing, complementary or earlier projects and research, |
---|
168 | % patents and standards. |
---|
169 | % - positioning of the project with respect to the thematic areas of the call for proposals. |
---|
170 | % - positioning of the project at the European and international levels. |
---|
171 | \end{verbatim} |
---|
172 | \end{scriptsize} |
---|
173 | The aim of this project is to propose an open-source framework for architecture synthesis |
---|
174 | targeting mainly field programmable gate array circuits (FPGA). |
---|
175 | \\% LIP6/TIMA |
---|
176 | To evaluate the different architectures, the project uses the prototyping platform |
---|
177 | of the SoCLIB ANR project (2006-2009). |
---|
178 | \\% IRISA |
---|
179 | The project will also borrow from the ROMA ANR project (2007-2009) and the ongoing |
---|
180 | joint INRIA-STMicro Nano2012 project. In particular we will adapt |
---|
181 | existing pattern extraction algorithms and datapath merging techniques to the synthesis of customized |
---|
182 | ASIP processors. |
---|
183 | \par |
---|
184 | %%% 1 -- COULD EACH OF YOU PLEASE ADD (IF POSSIBLE) ONE LINE |
---|
185 | %%% 1 -- REFERRING TO AN ANR OR EUROPEAN PROJECT |
---|
186 | %%% 1 -- European or ANR projects reused or continued |
---|
187 | %%% 1 LIP6/TIMA/LAB-STIC OK |
---|
188 | Regarding the expertise in High Level Synthesis (HLS), the project leverages know-how acquired over 15 years |
---|
189 | with the GAUT project developed at the Lab-STIC laboratory and the UGH project developed at the LIP6 |
---|
190 | and TIMA laboratories. \\ |
---|
191 | Regarding architecture synthesis skills, the project is based on know-how acquired over 10 years |
---|
192 | with the COSY European project (1998-2000) and the DISYDENT project developed at LIP6. \\ |
---|
193 | %%% 1 IRISA OK |
---|
194 | Regarding Application Specific Instruction Processor (ASIP) design, the CAIRN group at INRIA Bretagne |
---|
195 | Atlantique benefits from several years of expertise in the domain of retargetable compilers (Armor/Calife |
---|
196 | since 1996, and the Gecos compilers since 2002). |
---|
197 | % LIP FIXME: A BIT LONG AND OFF-TOPIC |
---|
198 | %CA% The source-level transformations required by the HLS tools will be |
---|
199 | %CA% designed in the {\em polyhedral model}, a general framework |
---|
200 | %CA% initiated by Paul Feautrier 20 years ago. The programs handled in |
---|
201 | %CA% the polyhedral model are such that loop iterators describe a |
---|
202 | %CA% polyhedron (hence the name). This includes most of the kernels used |
---|
203 | %CA% in embedded applications. This property allows to design precise |
---|
204 | %CA% analysis by means of integer programming techniques. |
---|
205 | %CA% %communaute active & internationale |
---|
206 | %CA% %transfert techno (Reservoir) |
---|
207 | %CA% The polyhedral community is very active, and the technological |
---|
208 | %CA% transfer has now started. Reservoir Labs inc., a company based in |
---|
209 | %CA% New-York, is currently integrating the last polyhedral developments |
---|
210 | %CA% in its commercial compiler. |
---|
211 | %CA% %transfert techno (gcc) |
---|
212 | %CA% Also, polyhedra are progressively migrating into the {\sc GNU Gcc} |
---|
213 | %CA% compiler, via {\sc Graphite}, a module initially developed by |
---|
214 | %CA% Sebastian Pop. |
---|
215 | %CA% %outils existants |
---|
216 | %CA% Several tools have been developed in the polyhedral community, |
---|
217 | %CA% such as {\sc Piplib} (parameter integer programming library), and |
---|
218 | %CA% {\sc Polylib}, a library providing set operations on polyhedra. Both |
---|
219 | %CA% tools are almost mandatory in polyhedral tools, and have reached |
---|
220 | %CA% a sufficient level of maturity to be considered as standard. |
---|
221 | %syntol & bee ??? |
---|
222 | % FIN |
---|
223 | % and on more than 15 years of experience on parallel hardware generation |
---|
224 | % in the polyedral model in the CAIRN group (MMAlpha software |
---|
225 | % developped in the group since 1996). |
---|
226 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
---|
227 | %%% 2 -- TO BE COMPLETED (SHORT) |
---|
228 | %%% 2 -- For polyhedral transformation and memory optimization ... LIP |
---|
229 | %%% 2 -- For ASIP IRISA |
---|
230 | %%% 2 -- For ... CITI |
---|
231 | %%% 2 -- For ... TIMA |
---|
232 | \par |
---|
233 | The SoCLIB ANR platform was developed by 11 laboratories and 6 companies. It makes it possible to |
---|
234 | describe hardware architectures with shared memory space and to deploy software |
---|
235 | applications on them to evaluate their performance. |
---|
236 | The heart of this platform is a library containing simulation models (in SystemC) |
---|
237 | of hardware IP cores such as processors, buses, networks, memories and I/O controllers. |
---|
238 | The platform also provides embedded operating systems and software/hardware |
---|
239 | communication components useful to implement applications quickly. |
---|
240 | However, the synthesisable descriptions of IPs have to be provided by users. \\ |
---|
241 | This project enhances SoCLib by providing synthesisable VHDL of standard IPs. |
---|
242 | In addition, HLS tools such as UGH and GAUT make it possible to automatically obtain a synthesisable |
---|
243 | description of an IP (coprocessor) from a sequential algorithm. |
---|
244 | %\par |
---|
245 | %%% 2 IRISA ? |
---|
246 | %%% 2 ASIP tool such as ... IRISA |
---|
247 | %%% 2 ... |
---|
248 | %%% 2 Coach uses pattern extractions from ROMA |
---|
249 | %\par |
---|
250 | %%% 2 LIP ? |
---|
251 | \par |
---|
252 | The different points proposed in this project cover priorities defined by the commission |
---|
253 | experts in the field of Information Society Technologies (IST) for Embedded |
---|
254 | systems: <<Concepts, methods and tools for designing systems dealing with systems complexity |
---|
255 | and allowing to apply efficiently applications and various products on embedded platforms, |
---|
256 | considering resources constraints (delays, power, memory, etc.), security and quality |
---|
257 | services>>. |
---|
258 | \\ |
---|
259 | Our team aims at covering all the steps of the design flow of architecture synthesis. |
---|
260 | Our project overcomes the complexity of using various synthesis tools and description |
---|
261 | languages required today to design architectures. |
---|
262 | |
---|
263 | \section{Scientific and Technical Description} |
---|
264 | \subsection{State of the art} |
---|
265 | \hspace{2cm}\begin{scriptsize}\begin{verbatim} |
---|
266 | % 2. SCIENTIFIC AND TECHNICAL DESCRIPTION |
---|
267 | % 2.1. STATE OF THE ART |
---|
268 | % (3 pages maximum) |
---|
269 | % Describe the scientific context and issues of the project |
---|
270 | % by presenting a national and international state of the art summarizing the state of |
---|
271 | % knowledge on the subject. Highlight any preliminary results. |
---|
272 | % Include the necessary bibliographic references in appendix 7.1. |
---|
273 | \end{verbatim} |
---|
274 | \end{scriptsize} |
---|
275 | Our project covers several critical domains in system design in order |
---|
276 | to achieve high performance computing. Starting from a high level description we aim |
---|
277 | at generating automatically both hardware and software components of the system. |
---|
278 | |
---|
279 | \subsubsection{High Performance Computing} |
---|
280 | Accelerating high-performance computing (HPC) applications with field-programmable |
---|
281 | gate arrays (FPGAs) can potentially improve performance. |
---|
282 | However, using FPGAs presents significant challenges [1]. |
---|
283 | First, the operating frequency of an FPGA is low compared to a high-end microprocessor. |
---|
284 | Second, according to Amdahl's law, HPC/FPGA application performance is unusually sensitive |
---|
285 | to the implementation quality [2]. |
---|
286 | Finally, high-performance computing programmers are a highly sophisticated but scarce |
---|
287 | resource. Such programmers are expected to readily use new technology but lack the time |
---|
288 | to learn a completely new skill such as logic design [3]. |
---|
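\par
To make the second point concrete, Amdahl's law can be stated as follows. The notation is introduced here for illustration only and does not come from the cited references: $p$ is the fraction of the original execution time that is migrated to the FPGA, $s$ is the speed-up obtained on that fraction, and $S$ is the resulting overall speed-up.
\begin{equation}
S = \frac{1}{(1 - p) + \frac{p}{s}}
\end{equation}
As a purely illustrative example, with $p = 0.9$ and $s = 10$ the overall speed-up is $S = 1/0.19 \approx 5.3$, and even an unbounded $s$ cannot push $S$ beyond $1/(1-p) = 10$. Under these assumed figures, a weak implementation of the accelerated part, or a smaller migrated fraction, quickly erodes the benefit.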
289 | \\ |
---|
290 | HPC/FPGA hardware is only now emerging and is in its early commercial stages, |
---|
291 | but the corresponding design techniques have not yet caught up. |
---|
292 | Thus, much effort is required to develop design tools that translate high level |
---|
293 | language programs to FPGA configurations. |
---|
294 | |
---|
295 | \hspace{2cm}\begin{scriptsize}\begin{verbatim} |
---|
296 | [1] M.B. Gokhale et al., Promises and Pitfalls of Reconfigurable |
---|
297 | Supercomputing, Proc. 2006 Conf. Eng. of Reconfigurable |
---|
298 | Systems and Algorithms, CSREA Press, 2006, pp. 11-20; |
---|
299 | http://nis-www.lanl.gov/~maya/papers/ersa06_gokhale_paper. |
---|
300 | pdf. |
---|
301 | [2] D. Buell, Programming Reconfigurable Computers: Language |
---|
302 | Lessons Learned, keynote address, Reconfigurable Systems |
---|
303 | Summer Institute 2006, 12 July 2006; http://gladiator. |
---|
304 | ncsa.uiuc.edu/PDFs/rssi06/presentations/00_Duncan_Buell.pdf |
---|
305 | [3] T. Van Court et al., Achieving High Performance |
---|
306 | with FPGA-Based Computing, Computer, vol. 40, no. 3, |
---|
307 | pp. 50-57, Mar. 2007, doi:10.1109/MC.2007.79 |
---|
308 | \end{verbatim} |
---|
309 | \end{scriptsize} |
---|
310 | |
---|
311 | \subsubsection{System Synthesis} |
---|
312 | Today, several solutions for system design are proposed and commercialized. The most common are |
---|
313 | those provided by Altera and Xilinx to promote their FPGA devices. |
---|
314 | \\ |
---|
315 | The Xilinx System Generator for DSP [http://www.xilinx.com/tools/sysgen.htm] is a plug-in to |
---|
316 | Simulink that enables designers to develop high-performance DSP systems for Xilinx FPGAs. |
---|
317 | Designers can design and simulate a system using MATLAB and Simulink. The tool will then |
---|
318 | automatically generate synthesizable Hardware Description Language (HDL) code mapped to Xilinx |
---|
319 | pre-optimized algorithms. |
---|
320 | However, this tool targets only DSP-based algorithms and Xilinx FPGAs, and cannot handle a complete |
---|
321 | SoC. Thus, it is not really a system synthesis tool. |
---|
322 | \\ |
---|
323 | In contrast, SOPC Builder [CITATION] makes it possible to describe a system, to synthesize it, |
---|
324 | to program it into a target FPGA and to upload a software application. |
---|
325 | % FIXME(C2H from Altera, runs fast but uses enormous resources) |
---|
326 | Nevertheless, SOPC Builder does not provide any facilities to synthesize coprocessors. |
---|
327 | Users have to provide the synthesizable description with a suitable bus interface. |
---|
328 | \\ |
---|
329 | In addition, Xilinx System Generator and SOPC Builder are closed worlds, since each one imposes |
---|
330 | its own IPs, which are not interchangeable. |
---|
331 | We can conclude that the existing commercial or free tools do not cover the whole system |
---|
332 | synthesis process in a fully automatic way. Moreover, they are bound to a particular device family |
---|
333 | and to a particular IP library. |
---|
334 | |
---|
335 | \subsubsection{High Level Synthesis} |
---|
336 | High Level Synthesis translates a sequential algorithmic description and a set of constraints |
---|
337 | (area, power, frequency, ...) to a micro-architecture at Register Transfer Level (RTL). |
---|
338 | Several academic and commercial tools are available today. |
---|
339 | The most common tools are SPARK [HLS1], GAUT [HLS2], UGH [HLS3] in the academic world |
---|
340 | and catapultC [HLS4], PICO [HLS5] and Cynthesizer [HLS6] in the commercial world. |
---|
341 | Despite their maturity, their use is limited by the following issues: |
---|
342 | \begin{itemize} |
---|
343 | \item They do not respect accurately the frequency constraint when they target an FPGA device. |
---|
344 | Their error is about 10 percent. This is problematic when the generated component is integrated |
---|
345 | in a SoC since it will slow down the whole system. |
---|
346 | \item These tools take into account only one or a few constraints simultaneously while realistic |
---|
347 | designs are multi-constrained. |
---|
348 | Moreover, low power consumption constraint is mandatory for embedded systems. |
---|
349 | However, it is not yet well handled by common synthesis tools. |
---|
350 | \item The parallelism is extracted from the initial algorithm. To get more parallelism or to reduce |
---|
351 | the amount of required memory, the user must rewrite it, although techniques such as polyhedral |
---|
352 | transformations exist to increase the intrinsic parallelism. |
---|
353 | \item Although they have the same input language (C/C++), they are sensitive to the style in |
---|
354 | which the algorithm is written. Consequently, engineering work is required to switch from |
---|
355 | one tool to another. |
---|
356 | \item The HLS tools are not integrated into an architecture and system exploration tool. |
---|
357 | Thus, a designer who needs to accelerate a software part of the system, must adapt it manually |
---|
358 | to the HLS input dialect and perform engineering work to exploit the synthesis result |
---|
359 | at the system level. |
---|
360 | \end{itemize} |
---|
361 | Given these limitations, it is necessary to create a new generation of tools reducing the gap |
---|
362 | between the specification of a heterogeneous system and its hardware implementation. |
---|
363 | |
---|
364 | \hspace{2cm}\begin{scriptsize}\begin{verbatim} |
---|
365 | [HLS1] SPARK University of California, San Diego |
---|
366 | [HLS2] GAUT UBS/Lab-STIC |
---|
367 | [HLS3] UGH |
---|
368 | [HLS4] catapultC Mentor |
---|
369 | [HLS5] PICO synfora |
---|
370 | [HLS6] Cynthesizer Forte Design Systems |
---|
371 | \end{verbatim} |
---|
372 | \end{scriptsize} |
---|
373 | |
---|
374 | \subsubsection{Application Specific Instruction Processors} |
---|
375 | ASIPs (Application-Specific Instruction-Set Processors) are programmable |
---|
376 | processors in which both the instruction set and the micro-architecture have |
---|
377 | been tailored to a given application domain (e.g. video processing), or |
---|
378 | in some extreme cases to a specific application (e.g. an H264-specific ASIP). |
---|
379 | This processor specialization usually offers a good compromise between |
---|
380 | performance (compared to a pure software implementation on a COTS |
---|
381 | embedded processor) and flexibility (compared to an application-specific |
---|
382 | hardware co-processor). |
---|
383 | \\ |
---|
384 | As a consequence, this type of architecture is a very attractive choice |
---|
385 | as a System on Chip building block. In spite of their obvious |
---|
386 | advantages, using/designing ASIPs remains a difficult task, since it |
---|
387 | involves designing both an efficient micro-architecture and implementing |
---|
388 | an efficient compiler for this |
---|
389 | specific micro-architecture. |
---|
390 | \\ |
---|
391 | Recently, the use of instruction set extensions has received a lot of |
---|
392 | interest from the embedded systems design community [NIOS2,FSL,ST70], |
---|
393 | since it makes it possible to rely on a template micro-architecture in which only a |
---|
394 | small fraction of the architecture has to be specialized. Even if such |
---|
395 | an approach offers less flexibility and forbids very tight coupling |
---|
396 | between the extensions and the template micro-architecture, it makes the |
---|
397 | design of the micro-architecture more tractable and amenable to a fully |
---|
398 | automated flow. |
---|
399 | \\ |
---|
400 | However, to our knowledge, there is still no available open-source |
---|
401 | design flow addressing those two design challenges together, either |
---|
402 | because the target architecture is proprietary, or because the compiler |
---|
403 | technology is closed/commercial. |
---|
404 | \\ |
---|
405 | In the context of the COACH project, we propose to add to the |
---|
406 | infrastructure a design flow targeted to automatic instruction set |
---|
407 | extension for the MIPS-based CPU, which will come as a complement or an |
---|
408 | alternative to the other proposed approaches (hardware accelerator, |
---|
409 | multi-processors). |
---|
410 | |
---|
411 | \subsubsection{Automatic Parallelization} |
---|
412 | \begin{Large}\begin{verbatim} |
---|
413 | -- TO BE COMPLETED BY LIP |
---|
414 | \end{verbatim} |
---|
415 | \end{Large} |
---|
416 | %CA% Parallel machines are often difficult and painful to program |
---|
417 | %CA% directly, and one would like the compiler to %do the job, that is to |
---|
418 | %CA% turn automatically a sequential program into a parallel form. This |
---|
419 | %CA% transformation is referred as {\em automatic parallelization}, and has |
---|
420 | %CA% been widely addressed since the 70s. Automatic parallelization |
---|
421 | %CA% relies on data dependences, which cannot be computed in general.%, as |
---|
422 | %CA% %one cannot predict at compile time the variable values on a given |
---|
423 | %CA% %execution point. |
---|
424 | %CA% This negative result led researchers to (i) find a |
---|
425 | %CA% program model in which no approximation is needed (ie polyhedral |
---|
426 | %CA% model), (ii) make conservative approximations (iii) remark that |
---|
427 | %CA% variable values are known at runtime, and make the decisions during |
---|
428 | %CA% program execution. The latter approach is obviously not suitable |
---|
429 | %CA% there, as we target hardware generation. We will give there a short |
---|
430 | %CA% history of the approaches that fall in the first category. |
---|
431 | %CA% |
---|
432 | %CA%% In the real world, we deal with a limited amount of processors, |
---|
433 | %CA%% and the communication between processors takes time, and is |
---|
434 | %CA%% critical for performance. %Whenever we have synchronisation-free |
---|
435 | %CA%% parallelism, like for embarrassingly parallel kernels, this is not an |
---|
436 | %CA%% issue. But in case of pipelined parallelism, we need to reduce |
---|
437 | %CA%% communications as much as possible. |
---|
438 | %CA%% So we also need to find parallelism toghether with a proper mapping |
---|
439 | %CA%% of operations and data on physical processors. |
---|
440 | %CA% |
---|
441 | %CA% As programs spend most of there time in loops, the community has |
---|
442 | %CA% focused on loop transformations that reveal parallelism. |
---|
443 | %CA%%unimodulaire |
---|
444 | %CA% The first approaches worked on perfect loop nests, where the tree |
---|
445 | %CA% formed by the nested loops is linear. In this program model, the |
---|
446 | %CA% loops can be seen as a basis that drive the way the iteration |
---|
447 | %CA% domain will be described. Hence, a first idea was to change this |
---|
448 | %CA% basis such that one vector (one loop) at least is parallel. To ease |
---|
449 | %CA% the code generation, the area of defined by the news vectors must |
---|
450 | %CA% be a unit volume. %Otherwise, one would produce an homothetic |
---|
451 | %CA%% expansion of the iteration domain, which will force to put modulos |
---|
452 | %CA%% in the target code. |
---|
453 | %CA% For this reason, these transformations are called {\em unimodular |
---|
454 | %CA% transformations}. |
---|
455 | %CA%%tiling |
---|
456 | %CA% |
---|
457 | %CA% The next approaches include {\em loop tiling}, a simple |
---|
458 | %CA% partitioning of the iteration domain, whose initial purpose is to |
---|
459 | %CA% execute every partition on a different processor. %In the same way, |
---|
460 | %CA% The execution order is modified with a proper unimodular |
---|
461 | %CA% transformation, then the tiles are obtained by cutting the |
---|
462 | %CA% iteration domain with the hyperplanes directed by every vector of |
---|
463 | %CA% the new (unimodular) basis, at regular intervals. When the tiling |
---|
464 | %CA% hyperplanes are properly chosen, we can both improve data-locality |
---|
465 | %CA% on every processor, and reduce the communication between two |
---|
466 | %CA% different tiles (which will be mapped on processors). This last
---|
467 | %CA% property implies that one tends to find a degree of parallelism as
---|
468 | %CA% great as possible.
---|
469 | %CA% |
---|
470 | %CA%%affine scheduling |
---|
471 | %CA% The previous approaches were restricted to kernels with perfect |
---|
472 | %CA% loop nests (linear loop tree), and unimodular transformations. The |
---|
473 | %CA% last generation of approaches broke with these limitations. We now |
---|
474 | %CA% choose a different basis for every assignment, without the |
---|
475 | %CA% unimodularity restriction. A dual way to present the things is the |
---|
476 | %CA% notion of {\em affine schedule}, introduced by Feautrier [part1], |
---|
477 | %CA% that simply assigns an abstract execution date to every assignment |
---|
478 | %CA% execution. As an assignment execution is exactly characterised by |
---|
479 | %CA% the current value of the loops counters (iteration vector), the |
---|
480 | %CA% affine schedule will be defined as an affine form of the iteration |
---|
481 | %CA% vector (hence the 'affine'). The affine property makes it possible to use
---|
482 | %CA% integer programming techniques to compute the schedule. With this
---|
483 | %CA% approach, additional techniques are required to allocate the
---|
484 | %CA% parallel operations and the data to processors in an efficient way
---|
485 | %CA% [griebl, feautrier]. |
---|
486 | %CA% |
---|
487 | %CA%%modularity?? |
---|
488 | %CA%%% As loop nests are no longer perfect, we deal with (transformed) |
---|
489 | %CA%%% iteration domains of different dimensions, which can possibly (and |
---|
490 | %CA%%% certainly) overlap. At this point, a new code generation technique |
---|
491 | %CA%%% was needed. The first attempt is due to Chamsky et al. [??], and |
---|
492 | %CA%%% was improved by Quillere et al. [QRW]. The code is now implemented |
---|
493 | %CA%%% in an efficient tool [cloog], that gave a new life to polyhedral |
---|
494 | %CA%%% techniques. |
---|
495 | %CA% |
---|
496 | %CA%%pluto's tiling |
---|
497 | %CA% The tiling techniques were extended to non-perfect loop nests with
---|
498 | %CA% {\em affine partitioning}. Affine partitioning is to affine |
---|
499 | %CA% scheduling what (original) tiling was to unimodular |
---|
500 | %CA% transformations. An affine partitioning assigns to every assignment |
---|
501 | %CA% its coordinates in the basis defined by the normals to the tiling |
---|
502 | %CA% hyperplanes. Recently, a way to compute efficient hyperplanes was
---|
503 | %CA% found [uday], with a good data locality, and communications |
---|
504 | %CA% confined in a small neighborhood around every processor. |
---|
505 | %CA% |
---|
506 | %CA%\subsubsection{Source-level Memory Optimisation} |
---|
507 | %CA% The HLS process allows memory to be customised, which impacts the final
---|
508 | %CA% circuit size and power consumption. Though most HLS tools already
---|
509 | %CA% try to optimise memory usage, it is better to provide an independent |
---|
510 | %CA% source-level pass, that could be reused for different tools and in |
---|
511 | %CA% other contexts. |
---|
512 | %CA% |
---|
513 | %CA% There exist many approaches to evaluate and reduce the memory
---|
514 | %CA% requirement of a program. The first approaches are concerned with |
---|
515 | %CA% {\em memory size estimation}, which can be defined as the maximum |
---|
516 | %CA% number of memory cells used at the same time [clauss,zhao]. These |
---|
517 | %CA% approaches provide an estimation as a symbolic expression of program |
---|
518 | %CA% parameters, which can be used further to guide loop optimisations. |
---|
519 | %CA% However, no explicit way to reduce the memory size is given. {\em |
---|
520 | %CA% Intra-array reuse} approaches break with this limitation, and
---|
521 | %CA% collapse the array cells which are not alive at the same time. The |
---|
522 | %CA% collapse is done by means of a data layout transformation, specified |
---|
523 | %CA% with a linear (modular) mapping. The first approaches were |
---|
524 | %CA% developed at IMEC [balasa,catthoor], and basically try to linearize |
---|
525 | %CA% the arrays and fold them using a modulo operator. Then, Lefebvre et |
---|
526 | %CA% al. propose a solution to fold independently the array dimensions |
---|
527 | %CA% [lefebvre]. Finally, Darte et al. provide a general formalisation of |
---|
528 | %CA% the problem, together with a solution that subsumes the previous |
---|
529 | %CA% approaches [darte]. A first implementation was made with the tool |
---|
530 | %CA% {\sc Bee}, but there are still many limitations. |
---|
531 | %CA% |
---|
532 | %CA% \begin{itemize} |
---|
533 | %CA% \item The tool is restricted to regular programs, whereas more |
---|
534 | %CA% general programs could be handled with a conservative array liveness |
---|
535 | %CA% analysis. |
---|
536 | %CA% |
---|
537 | %CA% \item Programs depending on parameters (inputs) are not handled, |
---|
538 | %CA% which prevents handling, for example, the body of tiled loops.
---|
539 | %CA% |
---|
540 | %CA% \item The new array layout can break spatial locality, and then impact
---|
541 | %CA% performance and power consumption. One would like to get a mapping
---|
542 | %CA% that improves or, at least, preserves the spatial locality of the
---|
543 | %CA% program. |
---|
544 | %CA% |
---|
545 | %CA% \item Finally, the final memory compaction strongly depends on the |
---|
546 | %CA% program schedule, and is naturally hindered by the |
---|
547 | %CA% parallelism. Consequently, there is a trade-off to find with |
---|
548 | %CA% automatic parallelization. An ideal solution would be to reduce |
---|
549 | %CA% memory usage, while preserving parallelism. |
---|
550 | %CA% \end{itemize} |
---|
551 | |
---|
552 | \subsubsection{Interfaces} |
---|
553 | \begin{Large}\begin{verbatim} |
---|
554 | -- TO BE COMPLETED BY INSA: state of the art
---|
555 | \end{verbatim} |
---|
556 | \end{Large} |
---|
557 | % |
---|
558 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
---|
559 | \subsection{Objectives and innovation aspects} |
---|
560 | \hspace{2cm}\begin{scriptsize}\begin{verbatim} |
---|
561 | % 2.2. OBJECTIVES AND AMBITIOUS/INNOVATIVE NATURE OF THE PROJECT
---|
562 | % (2 pages maximum)
---|
563 | % Describe the scientific/technical objectives of the project.
---|
564 | % Present the expected scientific advance. State the originality and the
---|
565 | % ambitious nature of the project.
---|
566 | % Detail the scientific and technical barriers to be lifted by the project.
---|
567 | % Describe, if applicable, the final product(s) developed by the end of the project
---|
568 | % that demonstrate its innovative nature.
---|
569 | % Present the expected results, proposing where possible success and evaluation
---|
570 | % criteria suited to the type of project, so that the results can be assessed at
---|
571 | % the end of the project.
---|
572 | % Where applicable (programmes requiring multidisciplinarity), demonstrate how
---|
573 | % the scientific disciplines fit together.
---|
574 | \end{verbatim} |
---|
575 | \end{scriptsize} |
---|
576 | |
---|
577 | % scientific/technical objectives of the project.
---|
578 | The objective of the COACH project is to develop a complete design framework for both
---|
579 | HPC (accelerating existing software applications)
---|
580 | and embedded applications (implementing an application on a low-power standalone device).
---|
581 | The design steps are presented in figure~\ref{coach-flow}.
---|
582 | \begin{figure}[hbtp]\leavevmode\center |
---|
583 | \includegraphics[width=.8\linewidth]{anr-2010} |
---|
584 | \caption{\label{coach-flow} COACH flow.} |
---|
585 | \end{figure} |
---|
586 | \begin{description} |
---|
587 | \item[HPC setup] The user splits the application into two parts: the host application,
---|
588 | which remains on the PC, and the SoC application, which migrates to the SoC.
---|
589 | The framework provides a simulation model for evaluating this partitioning.
---|
590 | \item[SoC design] In this phase,
---|
591 | the user obtains simulators of the SoC at different abstraction levels by giving the COACH framework
---|
592 | a SoC description.
---|
593 | This description consists of a process network corresponding to the SoC application,
---|
594 | an OS, an instance of a generic hardware platform,
---|
595 | and a mapping of the processes onto the platform components. The supported mappings are
---|
596 | software (the process runs on a SoC processor),
---|
597 | XXXpeci (the process runs on a SoC processor enhanced with dedicated instructions),
---|
598 | and hardware (the process runs in a coprocessor generated by HLS and plugged onto the SoC bus).
---|
599 | \item[Application compilation] Once the SoC description is validated, COACH automatically generates
---|
600 | an FPGA bitstream containing the hardware platform with the SoC application software, and
---|
601 | an executable containing the host application. The user can launch the application by
---|
602 | loading the bitstream onto the FPGA and running the executable on the PC.
---|
603 | \end{description} |
---|
604 | |
---|
605 | % the expected scientific advance. State the originality and the
---|
606 | % ambitious nature of the project.
---|
607 | The main scientific contribution of the project is to unify various synthesis techniques
---|
608 | (same input and output formats), allowing the user to swap from one to another
---|
609 | without engineering effort, and even to chain them, for example by running a polyhedral transformation
---|
610 | before synthesis.
---|
611 | Another advantage of this framework is that it provides different abstraction levels from
---|
612 | a single description.
---|
613 | Finally, this description is independent of the device family, and its hardware implementation
---|
614 | is generated automatically.
---|
615 | |
---|
616 | % Detail the scientific and technical barriers to be lifted by the project.
---|
617 | System design is a complex task, and this project aims to simplify it
---|
618 | as much as possible. To this end, we have to overcome the following scientific
---|
619 | and technological barriers.
---|
620 | \begin{itemize} |
---|
621 | \item The main problem in HPC is the communication between the PC and the SoC.
---|
622 | This problem has two aspects. The first is efficiency. The second is to
---|
623 | eliminate the engineering effort needed to implement it at different abstraction levels.
---|
624 | \item The COACH design flow is top-down. In such a flow,
---|
625 | the required performance figures of a coprocessor (operating frequency, maximum number of cycles for
---|
626 | a given computation, power consumption, etc.) are imposed by the other system
---|
627 | components. The challenge is to let the user control the synthesis
---|
628 | process accurately. For instance, the operating frequency must not be a result of RTL synthesis
---|
629 | but a strict synthesis constraint.
---|
630 | \item HLS tools are sensitive to the style in which the algorithm is written.
---|
631 | In addition, they are not integrated into an architecture and system
---|
632 | exploration tool.
---|
633 | Consequently, engineering work is required to swap from one tool to another,
---|
634 | to integrate the resulting simulation model into an architectural exploration tool,
---|
635 | and to synthesize the generated RTL description.
---|
636 | %CA Additionnal preprocessing, source-level transformations, are thus |
---|
637 | %CA required to improve the process. |
---|
638 | %CA Particularly, this includes parallelism exposure and efficient memory mapping. |
---|
639 | \item Most HLS tools translate a sequential algorithm into a coprocessor
---|
640 | containing a single data-path and finite state machine (FSM). In this way,
---|
641 | only fine-grained parallelism (instruction-level parallelism, ILP) is exploited.
---|
642 | The challenge is to identify the coarse-grained parallelism and to generate,
---|
643 | from a sequential algorithm, a coprocessor containing multiple communicating
---|
644 | tasks (data-paths and FSMs).
---|
645 | \end{itemize} |
---|
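
As a sketch of the top-down constraint described in the second item above, assume
(the symbols $f$, $T$ and $N_{\max}$ are illustrative and are not part of the COACH
specification) that the other system components impose an operating frequency $f$
on the coprocessor and a deadline $T$ for a given computation. The cycle budget that
the synthesis has to respect is then
\[
  N_{\max} = \lfloor f \, T \rfloor ,
\]
so that any schedule using $N \le N_{\max}$ cycles at frequency $f$ meets the deadline.
For example, $f = 100\,\mathrm{MHz}$ and $T = 1\,\mathrm{ms}$ give a budget of
$N_{\max} = 10^{5}$ cycles.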
646 | |
---|
647 | %Present the expected results, proposing where possible success and evaluation
---|
648 | %criteria suited to the type of project, so that the results can be assessed at
---|
649 | %the end of the project.
---|
650 | The main result is the framework. Concretely, it is composed of:
---|
651 | two HPC communication schemes with their implementations,
---|
652 | five HLS tools (control-dominated HLS, data-dominated HLS, coarse-grained HLS,
---|
653 | memory optimisation HLS and ASIP),
---|
654 | three SystemC-based virtual prototyping environments extended with synthesizable
---|
655 | RTL IP cores (generic, ALTERA/NIOS/AVALON, XILINX/MICROBLAZE/OPB),
---|
656 | one design space exploration tool,
---|
657 | and one operating system (OS).
---|
658 | \\ |
---|
659 | The functionality of the framework will be demonstrated with XXX-EXAMPLE1, XXX-EXAMPLE2
---|
660 | and XXX-EXAMPLE3 on four architectures (generic/XILINX, generic/ALTERA,
---|
661 | proprietary/XILINX, proprietary/ALTERA).
---|
662 | |
---|
663 | %% \section{} |
---|
664 | %% %3. SCIENTIFIC AND TECHNICAL PROGRAMME, PROJECT ORGANISATION
---|
665 | %% \subsection{}
---|
666 | %% %3.1. SCIENTIFIC PROGRAMME AND PROJECT STRUCTURE
---|
667 | %% %(2 pages maximum)
---|
668 | %% %Present the scientific programme and justify the breakdown of the work
---|
669 | %% %programme into tasks, consistently with the objectives pursued.
---|
670 | %% %Use a diagram to show the links between the different tasks
---|
671 | %% %(technical organisation chart)
---|
672 | %% %The tasks represent the main phases of the project. They are limited in number.
---|
673 | %% %Do not forget the activities and actions related to dissemination and
---|
674 | %% %exploitation of results.
---|
675 | %%
---|
676 | %% %PUT A FIGURE HERE DESCRIBING THE TASKS AND THEIR INTERACTIONS (WITH THE FLOW
---|
677 | %% %AS A WATERMARK?)
---|
678 | %% \subsection{}
---|
679 | %% %3.2. PROJECT MANAGEMENT
---|
680 | %% %(2 pages maximum)
---|
681 | %% %Specify the organisational aspects of the project and the coordination arrangements
---|
682 | %% %(if possible, a dedicated coordination task: cf. task 0 of submission
---|
683 | %% %document A).
---|
684 | %% \subsection{}
---|
685 | %% %3.3. DESCRIPTION OF THE WORK BY TASK
---|
686 | %% %(ideally 1 or 2 pages per task)
---|
687 | %% %For each task, describe:
---|
688 | %% %- the objectives of the task and any success indicators,
---|
689 | %% %- the task leader and the partners involved (may be
---|
690 | %% %shown graphically),
---|
691 | %% %- the detailed work programme for the task,
---|
692 | %% %- the deliverables of the task,
---|
693 | %% %- the contributions of the partners (``who does what''),
---|
694 | %% %- the description of the methods and technical choices, and of how
---|
695 | %% %the solutions will be provided,
---|
696 | %% %- the risks of the task and the fallback solutions envisaged.
---|
697 | |
---|
698 | |
---|
699 | |
---|
700 | |
---|
701 | |
---|
702 | |
---|