\section{Project context}
\hspace{2cm}\begin{scriptsize}\begin{verbatim}
% 1.    PROJECT CONTEXT AND POSITIONING
% (1 page maximum) General presentation of the problem that the project
% proposes to address and of the working framework (fundamental research,
% industrial research or experimental development).
\end{verbatim}
\end{scriptsize}
An embedded system is an application integrated into one or several chips
in order to accelerate it or to embed it into a small device such as a personal
digital assistant (PDA).
This topic has been investigated since the 1980s using Application Specific Integrated Circuits (ASIC),
Digital Signal Processors (DSP) and parallel computing on multiprocessor machines or networks.
More recently, since the end of the 1990s, other technologies have appeared, such as Very Long Instruction Word (VLIW) processors,
Application Specific Instruction-set Processors (ASIP), Systems on Chip (SoC) and
Multi-Processor SoCs (MPSoC).
\\
Over the last decades, embedded systems were reserved to major industrial companies targeting high-volume markets,
because of the design and fabrication costs.
Nowadays Field Programmable Gate Arrays (FPGA), such as Virtex5 from Xilinx and Stratix4 from Altera,
can implement a SoC with multiple processors and several coprocessors for less than 10K euros apiece.
In addition, High Level Synthesis (HLS) is becoming more mature; it automates design
and drastically decreases its cost in terms of manpower. Thus, both FPGA and HLS tend to bring embedded systems and
High Performance Computing (HPC) within reach of small companies targeting low-volume markets.
\par
To get an efficient embedded system, the designer has to take the application characteristics into account when
choosing one of the technologies listed above.
This choice is not easy, and in most cases the designer has to try different technologies to retain the
most suitable one.
\\
The first objective of COACH is to provide an open-source framework to design embedded systems
on FPGA devices.
The COACH framework allows the designer to explore various software/hardware partitions of the
target application, to run timing and functional simulations, and to automatically generate both
the software and the synthesizable description of the hardware.
The main topics of the project are:
\begin{itemize}
\item
Design space exploration: It consists in analysing the application running on the FPGA, defining the target
technology (SoC, MPSoC, ASIP, ...) and partitioning the tasks between hardware and software depending on
the technology choice. This exploration is driven mainly by throughput, latency and power consumption
criteria.
\item
Micro-architectural exploration: When hardware components are required, the HLS tools of the framework
generate them automatically. At this stage the framework provides various HLS tools allowing the
exploration of the micro-architectural design space. The exploration criteria are again throughput, latency
and power consumption.
% FIXME
%CA At this stage, preliminary source-level transformations will be
%CA required to improve the efficiency of the target component.
%CA COACH will also provide such facilities, such as automatic parallelization
%CA and memory optimisation.
\item
Performance measurement: For each point of the design space, metrics are available for criteria
such as throughput, latency, power consumption, area, memory allocation and data locality.
They are evaluated using virtual prototyping, estimation or analysis methodologies.
\item
Targeted hardware technology: The COACH description of a system is independent of the FPGA family.
Every point of the design space can be implemented on any FPGA having the required resources.
Basically, COACH handles both Altera and Xilinx FPGA families.
\end{itemize}
As an extension of embedded system design, COACH also deals with High Performance Computing (HPC).
In HPC, the targeted application is an existing one running on a PC. COACH helps the designer
to accelerate it by migrating its critical parts into a SoC implemented on an FPGA plugged into the PC bus.
\par
COACH is the result of the will of several laboratories to unify their know-how and skills in the
following domains: operating systems and hardware communication (TIMA, SITI), SoC and MPSoC (LIP6 and TIMA),
ASIP (IRISA) and HLS (LIP6, Lab-STIC and LIP). The project objective is to integrate these various
domains into a single free framework (licence ...) that hides these domains and their
different tools from the user as much as possible.


\subsection{Economic context and interest}
\hspace{2cm}\begin{scriptsize}\begin{verbatim}
% 1.1.  ECONOMIC AND SOCIETAL CONTEXT AND STAKES
% (2 pages maximum)
% Describe the economic, social and regulatory context in which the project
% takes place, presenting an analysis of the social, economic, environmental and
% industrial stakes. If possible, give quantified arguments, for example relevance and
% scope of the project with respect to economic demand (market analysis, trend
% analysis), analysis of the competition, cost-reduction indicators, market
% prospects (fields of application, ...). Indicators of environmental gains, life
% cycle.
\end{verbatim}
\end{scriptsize}
Microelectronics makes it possible to integrate complex functions into products, to increase their
commercial attractiveness and to improve their competitiveness. The multimedia and communication
sectors have taken advantage of microelectronics thanks to the development of
design methodologies and tools for real-time embedded systems. Many other sectors could
benefit from microelectronics if these methodologies and tools were adapted to their features.
The Non-Recurring Engineering (NRE) costs involved in designing and manufacturing an ASIC are
very high. An IC factory costs several billion euros, and fabricating a specific circuit costs
several million: for example, a conservative estimate for a 65nm ASIC project is 10 million USD.
Consequently, it is generally unfeasible to design and fabricate ASICs in
low volumes, and ICs are designed to cover a broad spectrum of applications at the cost of
degraded performance.
\\
Today, FPGAs are becoming important actors in the computational domain that was originally dominated
by microprocessors and ASICs. Just like microprocessors, FPGA-based systems can be reprogrammed
on a per-application basis. At the same time, FPGAs offer significant performance benefits over
microprocessor implementations for a number of applications. Although these benefits are still
generally an order of magnitude smaller than those of equivalent ASIC implementations, the low cost
(500 euros to 10K euros), fast time to market and flexibility of FPGAs make them an attractive
choice for low-to-medium volume applications.
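This trade-off can be sketched with a simple cost model. It is only an illustration, and the symbols below are
assumptions introduced here rather than figures taken from a market study. Let $\mathit{NRE}_t$ denote the
non-recurring cost and $u_t$ the unit cost of a technology $t$. Producing $N$ units then costs
\[
C_t(N) = \mathit{NRE}_t + N\,u_t ,
\]
and, assuming $\mathit{NRE}_{\mathrm{ASIC}} > \mathit{NRE}_{\mathrm{FPGA}}$ and $u_{\mathrm{FPGA}} > u_{\mathrm{ASIC}}$,
the two technologies break even at
\[
N^{*} = \frac{\mathit{NRE}_{\mathrm{ASIC}} - \mathit{NRE}_{\mathrm{FPGA}}}{u_{\mathrm{FPGA}} - u_{\mathrm{ASIC}}} .
\]
Under these assumptions the FPGA-based solution is the cheaper one for every volume $N < N^{*}$.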
Since their introduction in the mid-eighties, FPGAs have evolved from a simple,
low-capacity gate array technology to devices (Altera STRATIX III, Xilinx Virtex V) that
provide a mix of coarse-grained data path units, memory blocks, microprocessor cores,
on-chip A/D conversion, and gate counts in the millions. This high logic capacity makes it possible to implement
complex systems such as multi-processor platforms with application-dedicated coprocessors.
Table~\ref{fpga_market} shows the estimated worldwide FPGA market over the next years, covering
various application domains. The ``high end'' lines concern only FPGAs whose logic capacity is high enough
to implement complex systems.
This high-end market is expanding significantly and is estimated at 914 M\$ in 2012.
Using FPGAs limits the NRE costs to the design cost. This boosts the development of methodologies
and tools to automate design and reduce its cost.
\begin{table}\leavevmode\center
\begin{tabular}{|l|r|r|r|}\hline
Segment              & 2010  & 2011  & 2012 \\\hline\hline
Communications       & 1,867 & 1,946 & 2,096 \\
High end             & 467   & 511   & 550 \\\hline
Consumer             & 550   & 592   & 672 \\
High end             & 53    & 62    & 75 \\\hline
Automotive           & 243   & 286   & 358 \\
High end             & -     & -     & - \\\hline
Industrial           & 1,102 & 1,228 & 1,406 \\
High end             & 177   & 188   & 207 \\\hline
Military/Aero        & 566   & 636   & 717 \\
High end             & 56    & 65    & 82 \\\hline\hline
Total FPGA/PLD       & 4,659 & 5,015 & 5,583 \\
Total High-End FPGA  & 753   & 826   & 914 \\\hline
\end{tabular}
\caption{\label{fpga_market} Gartner estimation of worldwide FPGA/PLD consumption (millions of \$)}
\end{table}
\par
Today, several companies (atipa, blue-arc, Bull, Chelsio, Convey, CRAY, DataDirect, DELL, hp,
Wild Systems, IBM, Intel, Microsoft, Myricom, NEC, nvidia, etc.) are building systems in which the demand
for very high performance (HPC) takes precedence over other requirements. They tend to use the highest
performing devices, such as multi-core CPUs, GPUs, large FPGAs and custom ICs, and the most innovative
architectures and algorithms. These companies are present in different ``traditional'' applications and market
segments such as computing clusters (ad hoc), servers and storage, networking and telecom, ASIC
emulation and prototyping, military/aerospace, etc. The HPC market size is currently estimated by FPGA providers
at 214 M\$.
This market is dominated by solutions based on multi-core CPUs and GPUs, and the expansion
of FPGA-based solutions is limited by the lack of flow automation. Nowadays, there are neither commercial
nor free tools covering the whole design process.
For instance, with SOPC Builder from Altera, users can select and parameterize IP components
from an extensive drop-down list of communication, digital signal processor (DSP), microprocessor
and bus interface cores, as well as incorporate their own IP. Designers can then generate
a synthesized netlist, a simulation test bench and a custom software library that reflect the hardware
configuration.
Nevertheless, SOPC Builder does not provide any facilities to synthesize coprocessors or to
simulate the platform at a high design level (SystemC).
In addition, SOPC Builder is proprietary and only works together with Altera's Quartus compilation
tool to implement designs on Altera devices (Stratix, Arria, Cyclone).
PICO [CITATION] and CATAPULT [CITATION] can synthesize coprocessors from a C++ description.
Nevertheless, they can only deal with data-dominated applications and they do not handle the
platform level.
The Xilinx System Generator for DSP [http://www.xilinx.com/tools/sysgen.htm] is a plug-in to
Simulink that enables designers to develop high-performance DSP systems for Xilinx FPGAs.
Designers can design and simulate a system using MATLAB and Simulink. The tool then
automatically generates synthesizable Hardware Description Language (HDL) code mapped to Xilinx
pre-optimized algorithms.
However, this tool targets only DSP-based algorithms.
\\
Consequently, a designer developing an embedded system needs to master, for example,
SoCLib for design exploration,
SOPC Builder at the platform level,
PICO for synthesizing the data-dominated coprocessors
and Quartus for design implementation.
This requires a significant tool-interfacing effort and makes the design process very complex
and achievable only by designers skilled in various domains.
The COACH project integrates all these tools into a single framework and hides them from the user.
The objective is to allow \textbf{pure software} developers to build embedded systems.
\par
The combination of a framework dedicated to software developers with an FPGA target makes it possible to gain
market share over HPC solutions based on multi-core CPUs and GPUs.
Moreover, one can expect that small and even very small companies will be able to offer embedded
systems and acceleration solutions for standard software applications at acceptable prices, thanks
to the elimination of the huge hardware investment required by ASIC-based solutions.
\\
This new market may boom as micro-computing did in the eighties. That success was due
to the low cost of the first micro-computers (compared to mainframes) and to the advent of high-level
programming languages, which allowed a large number of programmers to launch start-ups in software
engineering.

\subsection{Project position}
\hspace{2cm}\begin{scriptsize}\begin{verbatim}
% 1.2.  PROJECT POSITIONING
% (2 pages maximum)
% Specify:
% -     positioning of the project with respect to the context developed above:
%   with respect to competing, complementary or earlier projects and research,
%   patents and standards.
% - positioning of the project with respect to the thematic axes of the call for proposals.
% - positioning of the project at the European and international levels.
\end{verbatim}
\end{scriptsize}
The aim of this project is to propose an open-source framework for architecture synthesis
targeting mainly Field Programmable Gate Array (FPGA) circuits.
\\% LIP6/TIMA
To evaluate the different architectures, the project uses the prototyping platform
of the SoCLIB ANR project (2006-2009).
\\% IRISA
The project will also borrow from the ROMA ANR project (2007-2009) and the ongoing
joint INRIA-STMicro Nano2012 project. In particular we will adapt existing pattern
extraction algorithms and datapath merging techniques to the synthesis of customized
ASIPs.
\par
%%% 1 -- COULD EACH OF YOU PLEASE ADD (IF POSSIBLE) ONE LINE
%%% 1 -- REFERRING TO AN ANR OR EUROPEAN PROJECT
%%% 1 -- European or ANR projects reused or continued
%%% 1 LIP6/TIMA/LAB-STIC OK
Regarding the expertise in High Level Synthesis (HLS), the project leverages know-how acquired over 15 years
with the GAUT project developed at the Lab-STIC laboratory and the UGH project developed at the LIP6
and TIMA laboratories. \\
Regarding architecture synthesis skills, the project is based on know-how acquired over 10 years
with the COSY European project (1998-2000) and the DISYDENT project developed at LIP6.  \\
%%% 1 IRISA OK
Regarding Application Specific Instruction-set Processor (ASIP) design, the CAIRN group at INRIA Bretagne
Atlantique benefits from several years of expertise in the domain of retargetable compilers (Armor/Calife
since 1996, and the Gecos compilers since 2002).


% LIP FIXME: A BIT LONG AND OFF-TOPIC
%CA% The source-level transformations required by the HLS tools will be
%CA% designed in the {\em polyhedral model}, a general framework
%CA% initiated by Paul Feautrier 20 years ago.  The programs handled in
%CA% the polyhedral model are such that loop iterators describe a
%CA% polyhedron (hence the name). This includes most of the kernels used
%CA% in embedded applications. This property makes it possible to design precise
%CA% analyses by means of integer programming techniques.
%CA% %active and international community
%CA% %technology transfer (Reservoir)
%CA% The polyhedral community is very active, and the technological
%CA% transfer has now started. Reservoir Labs inc., a company based in
%CA% New-York, is currently integrating the latest polyhedral developments
%CA% in its commercial compiler.
%CA% %technology transfer (gcc)
%CA% Also, polyhedra are progressively migrating into the {\sc GNU Gcc}
%CA% compiler, via {\sc Graphite}, a module initially developed by
%CA% Sebastian Pop.
%CA% %existing tools
%CA% Several tools have been developed in the polyhedral community,
%CA% such as {\sc Piplib} (parameter integer programming library), and
%CA% {\sc Polylib}, a library providing set operations on polyhedra. Both
%CA% tools are almost mandatory in polyhedral tools, and have reached
%CA% a sufficient level of maturity to be considered as standard.
%syntol & bee ???
% END
% and on more than 15 years of experience on parallel hardware generation
% in the polyhedral model in the CAIRN group (MMAlpha software
% developed in the group since 1996).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% 2 -- TO BE COMPLETED (SHORT)
%%% 2 -- For polyhedral transformation and memory optimization ... LIP
%%% 2 -- For ASIP IRISA
%%% 2 -- For ... CITI
%%% 2 -- For ... TIMA
\par
The SoCLIB ANR platform was developed by 11 laboratories and 6 companies. It makes it possible to
describe hardware architectures with a shared memory space and to deploy software
applications on them to evaluate their performance.
The heart of this platform is a library containing simulation models (in SystemC)
of hardware IP cores such as processors, buses, networks, memories and I/O controllers.
The platform also provides embedded operating systems and software/hardware
communication components useful to implement applications quickly.
However, the synthesizable descriptions of the IPs have to be provided by users. \\
This project enhances SoCLib by providing synthesizable VHDL for standard IPs.
In addition, HLS tools such as UGH and GAUT automatically produce a synthesizable
description of an IP (coprocessor) from a sequential algorithm.
%\par
%%% 2 IRISA ?
%%% 2 ASIP tool such as ... IRISA
%%% 2 ...
%%% 2 Coach uses pattern extractions from ROMA
%\par
%%% 2 LIP ?
\par
The different points proposed in this project cover the priorities defined by the commission
experts in the field of Information Society Technologies (IST) for embedded
systems: ``Concepts, methods and tools for designing systems dealing with systems complexity
and allowing to apply efficiently applications and various products on embedded platforms,
considering resources constraints (delays, power, memory, etc.), security and quality
services''.
\\
Our team aims at covering all the steps of the design flow of architecture synthesis.
Our project overcomes the complexity of using the various synthesis tools and description
languages required today to design architectures.

\section{Scientific and Technical Description}
\subsection{State of the art}
\hspace{2cm}\begin{scriptsize}\begin{verbatim}
% 2.    SCIENTIFIC AND TECHNICAL DESCRIPTION
% 2.1.  STATE OF THE ART
% (3 pages maximum)
% Describe the scientific context and stakes of the project by presenting
% a national and international state of the art that sets out the state of
% knowledge on the subject. Show any preliminary results.
% Include the necessary bibliographic references in annex 7.1.
\end{verbatim}
\end{scriptsize}
Our project covers several critical domains in system design in order
to achieve high performance computing. Starting from a high-level description, we aim
at automatically generating both the hardware and the software components of the system.

\subsubsection{High Performance Computing}
Accelerating high-performance computing (HPC) applications with field-programmable
gate arrays (FPGAs) can potentially improve performance.
However, using FPGAs presents significant challenges [1].
First, the operating frequency of an FPGA is low compared to that of a high-end microprocessor.
Second, according to Amdahl's law, HPC/FPGA application performance is unusually sensitive
to the implementation quality [2].
Finally, high-performance computing programmers are a highly sophisticated but scarce
resource. Such programmers are expected to readily use new technology but lack the time
to learn a completely new skill such as logic design [3].
\\
HPC/FPGA hardware is only now emerging and is in its early commercial stages,
but design techniques and tools have not yet caught up.
Thus, much effort is required to develop design tools that translate high-level
language programs into FPGA configurations.
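\par
The sensitivity predicted by Amdahl's law can be made concrete as follows. The symbols are introduced here
for illustration only: $p$ is the fraction of the execution time that is migrated to the FPGA and $s$ is the
speedup obtained on that fraction. The overall speedup is then
\[
S(p,s) = \frac{1}{(1-p) + \frac{p}{s}}, \qquad \lim_{s\to\infty} S(p,s) = \frac{1}{1-p} .
\]
For instance, with $p = 0.9$, a coprocessor with $s = 10$ yields $S \approx 5.3$, whereas $s = 100$ yields only
$S \approx 9.2$, and no coprocessor can exceed $S = 10$. The overall gain therefore depends as much on the share
of the application that is actually accelerated as on the raw speed of the generated hardware.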

\hspace{2cm}\begin{scriptsize}\begin{verbatim}
[1] M.B. Gokhale et al., Promises and Pitfalls of Reconfigurable
Supercomputing, Proc. 2006 Conf. Eng. of Reconfigurable
Systems and Algorithms, CSREA Press, 2006, pp. 11-20;
http://nis-www.lanl.gov/~maya/papers/ersa06_gokhale_paper.pdf
[2] D. Buell, Programming Reconfigurable Computers: Language
Lessons Learned, keynote address, Reconfigurable Systems
Summer Institute 2006, 12 July 2006;
http://gladiator.ncsa.uiuc.edu/PDFs/rssi06/presentations/00_Duncan_Buell.pdf
[3] T. Van Court et al., Achieving High Performance
with FPGA-Based Computing, Computer, vol. 40, no. 3,
pp. 50-57, Mar. 2007, doi:10.1109/MC.2007.79
\end{verbatim}
\end{scriptsize}

\subsubsection{System Synthesis}
Today, several solutions for system design are proposed and commercialized. The most common are
those provided by Altera and Xilinx to promote their FPGA devices.
\\
The Xilinx System Generator for DSP [http://www.xilinx.com/tools/sysgen.htm] is a plug-in to
Simulink that enables designers to develop high-performance DSP systems for Xilinx FPGAs.
Designers can design and simulate a system using MATLAB and Simulink. The tool then
automatically generates synthesizable Hardware Description Language (HDL) code mapped to Xilinx
pre-optimized algorithms.
However, this tool targets only DSP-based algorithms and Xilinx FPGAs, and cannot handle a complete
SoC. Thus, it is not really a system synthesis tool.
\\
In contrast, SOPC Builder [CITATION] makes it possible to describe a system, to synthesize it,
to program it into a target FPGA and to upload a software application.
% FIXME(C2H from Altera, runs fast but huge resource usage)
Nevertheless, SOPC Builder does not provide any facilities to synthesize coprocessors.
Users have to provide the synthesizable description with a suitable bus interface.
\\
In addition, Xilinx System Generator and SOPC Builder are closed worlds, since each one imposes
its own IPs, which are not interchangeable.
We can conclude that the existing commercial or free tools do not cover the whole system
synthesis process in a fully automatic way. Moreover, they are bound to a particular device family
and to a particular IP library.

\subsubsection{High Level Synthesis}
High Level Synthesis translates a sequential algorithmic description and a set of constraints
(area, power, frequency, ...) into a micro-architecture at the Register Transfer Level (RTL).
Several academic and commercial tools are available today.
The most common tools are SPARK [HLS1], GAUT [HLS2] and UGH [HLS3] in the academic world,
and catapultC [HLS4], PICO [HLS5] and Cynthesizer [HLS6] in the commercial world.
Despite their maturity, their usage is restricted for the following reasons:
\begin{itemize}
\item They do not accurately respect the frequency constraint when they target an FPGA device.
Their error is about 10 percent. This is a problem when the generated component is integrated
into a SoC, since it will slow down the whole system.
\item These tools take into account only one or a few constraints simultaneously, while realistic
designs are multi-constrained.
Moreover, a low power consumption constraint is mandatory for embedded systems.
However, it is not yet well handled by common synthesis tools.
\item The parallelism is extracted from the initial algorithm. To get more parallelism or to reduce
the amount of required memory, the user must rewrite it, although there are techniques such as polyhedral
transformations that increase the intrinsic parallelism.
\item Although they have the same input language (C/C++), they are sensitive to the style in
which the algorithm is written. Consequently, engineering work is required to switch from
one tool to another.
\item The HLS tools are not integrated into an architecture and system exploration tool.
Thus, a designer who needs to accelerate a software part of the system must adapt it manually
to the HLS input dialect and perform engineering work to exploit the synthesis result
at the system level.
\end{itemize}
Given these limitations, it is necessary to create a new generation of tools reducing the gap
between the specification of a heterogeneous system and its hardware implementation.

\hspace{2cm}\begin{scriptsize}\begin{verbatim}
[HLS1] SPARK, University of California, San Diego
[HLS2] GAUT, UBS/Lab-STIC
[HLS3] UGH
[HLS4] catapultC, Mentor
[HLS5] PICO, Synfora
[HLS6] Cynthesizer, Forte Design Systems
\end{verbatim}
\end{scriptsize}

\subsubsection{Application Specific Instruction Processors}

ASIPs (Application-Specific Instruction-Set Processors) are programmable processors in
which both the instruction set and the micro-architecture have been tailored to a given
application domain (e.g. video processing), or to a specific application.
This specialization usually offers a good compromise between performance (w.r.t. a pure software
implementation on an embedded CPU) and flexibility (w.r.t. an application-specific
hardware co-processor).
In spite of their obvious advantages, using and designing ASIPs remains a difficult
task, since it involves designing both a micro-architecture and a compiler for this
architecture. Besides, to our knowledge, there is still no available open-source
design flow\footnote{There are commercial tools such a } for ASIP design, even if such a tool would
be valuable in the context of a system-level design exploration tool.

In this context, ASIP design based on Instruction Set Extensions (ISEs) has
received a lot of interest [NIOSII,TENSILICA]%~\cite{NIOS2,ST70},
as it makes micro-architecture synthesis
more tractable\footnote{ISEs rely on a template micro-architecture in which
only a small fraction of the architecture has to be specialized}, and helps ASIP
designers to focus on compilers, for which there are still many open problems
[CODES04,FPGA08].
This approach however has a strong weakness, since it also significantly reduces
opportunities for achieving good speedups (most speedups remain between 1.5x and
2.5x): ISE performance is generally limited by I/O constraints, as
ISEs generally rely on the main CPU register file to access data.

% (
%automatically extracting ISE candidates from application code \cite{CODES04},
%performing efficient instruction selection and/or storage resource (register)
%allocation \cite{FPGA08}).


To cope with this issue, recent approaches~[DAC09,CODES08]%\cite{DAC09,DAC08}
advocate the use of
micro-architectural ISE models in which the coupling between the processor micro-architecture
and the ISE component is tightened so as to allow the ISE to overcome the register
I/O limitations. However, these approaches tackle the problem from a compiler/simulation
point of view and do not address the problem of generating synthesizable representations for
these models.

We therefore strongly believe that there is a need for an open framework which
would allow researchers and system designers to:
\begin{itemize}
\item Explore the various levels of interaction between the original CPU micro-architecture
and its extension (for example through a Domain Specific Language targeted at micro-architecture
specification and synthesis).
\item Retarget the compiler instruction-selection passes (or prototype new passes) so as
to be able to take advantage of these ISEs.
\item Provide a complete system-level integration for using ASIPs as SoC building blocks
(integration with application-specific blocks, MPSoC, etc.)
\end{itemize}

\hspace{2cm}
\begin{scriptsize}\begin{verbatim}

[CODES08] Theo Kluter, Philip Brisk, Paolo Ienne, and Edoardo Charbon, Speculative DMA for
Architecturally Visible Storage in Instruction Set Extensions.

[DAC09] Theo Kluter, Philip Brisk, Paolo Ienne, and Edoardo Charbon, Way Stealing: Cache-assisted
Automatic Instruction Set Extensions.

[CODES04] Pan Yu and Tulika Mitra, Scalable Custom Instructions Identification for
Instruction Set Extensible Processors.

[FPGA08] Quang Dinh, Deming Chen, and Martin D. F. Wong, Efficient ASIP Design for Configurable
Processors with Fine-Grained Resource Sharing.

[NIOSII] Nios II Custom Instruction User Guide.

\end{verbatim}

\end{scriptsize}
%, either
%because the target architecture is proprietary, or because the compiler
%technology is closed/commercial.




% We propose to explore how to tighten the coupling of the extensions and
% the underlying template micro-architecture.
% *  Thightne Even if such
% an approach offers less flexibility and forbids very tight coupling
% between the extensions and the template micro-architecture, it makes the
% design of the micro-architecture more tractable and amenable to a fully
% automated flow.
% \\
% \\
% In the context of the COACH project, we propose to add to the
% infrastructure a design flow targeted to automatic instruction set
% extension for the MIPS-based CPU, which will come as a complement or an
% alternative to the other proposed approaches (hardware accelerator,
% multi processors).
%

\subsubsection{Automatic Parallelization}
\begin{Large}\begin{verbatim}
-- TO BE COMPLETED BY LIP
\end{verbatim}
\end{Large}
501%CA%   Parallel machines are often difficult and painful to program
502%CA%   directly, and one would like the compiler to %do the job, that is to
503%CA%   turn automatically a sequential program into a parallel form. This
504%CA%   transformation is referred as {\em automatic parallelization}, and has
505%CA%   been widely addressed since the 70s. Automatic parallelization
506%CA%   relies on data dependences, which cannot be computed in general.%, as
507%CA%   %one cannot predict at compile time the variable values on a given
508%CA%   %execution point.
509%CA%   This negative result led researchers to (i) find a
510%CA%   program model in which no approximation is needed (ie polyhedral
511%CA%   model), (ii) make conservative approximations (iii) remark that
512%CA%   variable values are known at runtime, and make the decisions during
513%CA%   program execution. The latter approach is obviously not suitable
514%CA%   there, as we target hardware generation. We will give there a short
515%CA%   history of the approaches that fall in the first category.
516%CA%
517%CA%%   In the real world, we deal with a limited amount of processors,
518%CA%%   and the communication between processors takes time, and is
519%CA%%   critical for performance. %Whenever we have synchronisation-free
520%CA%%   parallelism, like for embarrassingly parallel kernels, this is not an
521%CA%%   issue. But in case of pipelined parallelism, we need to reduce
522%CA%%   communications as much as possible.
523%CA%%   So we also need to find parallelism toghether with a proper mapping
524%CA%%   of operations and data on physical processors.
525%CA%
526%CA%   As programs spend most of there time in loops, the community has
527%CA%   focused on loop transformations that reveal parallelism.
528%CA%%unimodulaire
529%CA%   The first approaches worked on perfect loop nests, where the tree
530%CA%   formed by the nested loops is linear. In this program model, the
531%CA%   loops can be seen as a basis that drives the way the iteration
532%CA%   domain is described. Hence, a first idea was to change this
533%CA%   basis such that at least one vector (one loop) is parallel. To ease
534%CA%   code generation, the parallelepiped spanned by the new vectors must
535%CA%   have unit volume. %Otherwise, one would produce an homothetic
536%CA%%   expansion of the iteration domain, which will force to put modulos
537%CA%%   in the target code.
538%CA%   For this reason, these transformations are called {\em unimodular
539%CA%   transformations}.
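%CA%
%CA%   As an illustration (a sketch on a hypothetical loop nest, not taken
%CA%   from a cited work), consider the perfect nest that computes
%CA%   $a(i,j) = a(i-1,j) + a(i,j-1)$ for $1 \le i,j \le N$. Its dependence
%CA%   vectors are $(1,0)$ and $(0,1)$, so neither loop is parallel.
%CA%   Skewing the nest with
%CA%   \[
%CA%     T = \left(\begin{array}{cc} 1 & 1 \\ 0 & 1 \end{array}\right),
%CA%     \qquad
%CA%     \left(\begin{array}{c} t \\ p \end{array}\right)
%CA%     = T \left(\begin{array}{c} i \\ j \end{array}\right)
%CA%     = \left(\begin{array}{c} i+j \\ j \end{array}\right),
%CA%     \qquad \det T = 1,
%CA%   \]
%CA%   maps the dependence vector $(1,0)$ to $(1,0)$ and $(0,1)$ to
%CA%   $(1,1)$. Both images are carried by the outer loop $t$, so the
%CA%   inner loop $p$ is parallel. Because $\det T = \pm 1$, the inverse
%CA%   $T^{-1}$ has integer entries, and every integer point $(t,p)$ of the
%CA%   transformed domain corresponds to exactly one iteration $(i,j)$.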
540%CA%%tiling
541%CA%   
542%CA%   The next approaches include {\em loop tiling}, a simple
543%CA%   partitioning of the iteration domain, whose initial purpose is to
544%CA%   execute every partition on a different processor. %In the same way,
545%CA%   The execution order is modified with a proper unimodular
546%CA%   transformation, then the tiles are obtained by cutting the
547%CA%   iteration domain with the hyperplanes directed by every vector of
548%CA%   the new (unimodular) basis, at regular intervals. When the tiling
549%CA%   hyperplanes are properly chosen, we can both improve data locality
550%CA%   on every processor, and reduce the communication between two
551%CA%   different tiles (which will be mapped onto processors). This last
552%CA%   property implies that one tends to find a degree of parallelism as
553%CA%   high as possible.
554%CA%
555%CA%%affine scheduling
556%CA%   The previous approaches were restricted to kernels with perfect
557%CA%   loop nests (linear loop tree), and unimodular transformations. The
558%CA%   last generation of approaches broke with these limitations. We now
559%CA%   choose a different basis for every assignment, without the
560%CA%   unimodularity restriction. A dual way to present this is the
561%CA%   notion of {\em affine schedule}, introduced by Feautrier [part1],
562%CA%   that simply assigns an abstract execution date to every assignment
563%CA%   execution. As an assignment execution is exactly characterised by
564%CA%   the current value of the loop counters (iteration vector), the
565%CA%   affine schedule is defined as an affine form of the iteration
566%CA%   vector (hence the 'affine'). The affine property makes it possible to use
567%CA%   integer programming techniques to compute the schedule. With this
568%CA%   approach, additional techniques are required to allocate the
569%CA%   parallel operations and the data to processors in an efficient way
570%CA%   [griebl, feautrier].
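%CA%
%CA%   As an illustration (a sketch on a hypothetical kernel, assuming the
%CA%   usual convention that a dependent operation is scheduled at least one
%CA%   date after its source), consider a non-perfect nest in which, for
%CA%   every $i \in \{1,\dots,N\}$, the assignment $S_1$: $s(i) = 0$ is
%CA%   followed by an inner loop on $j \in \{1,\dots,N\}$ that executes
%CA%   $S_2$: $s(i) = s(i) + a(i,j)\,x(j)$. A valid affine schedule is
%CA%   \[
%CA%     \theta_{S_1}(i) = 0, \qquad \theta_{S_2}(i,j) = j .
%CA%   \]
%CA%   The dependences from $S_1(i)$ to $S_2(i,j)$ and from $S_2(i,j)$ to
%CA%   $S_2(i,j+1)$ are satisfied, since
%CA%   \[
%CA%     \theta_{S_2}(i,j) - \theta_{S_1}(i) = j \ge 1, \qquad
%CA%     \theta_{S_2}(i,j+1) - \theta_{S_2}(i,j) = 1 .
%CA%   \]
%CA%   The $N$ executions of $S_2$ that share the same $j$ receive the same
%CA%   date and can run in parallel, so the schedule uses $N+1$ dates where
%CA%   the sequential execution performs $N + N^2$ assignments one after
%CA%   the other.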
571%CA%
572%CA%%modularity??
573%CA%%%    As loop nests are no longer perfect, we deal with (transformed)
574%CA%%%    iteration domains of different dimensions, which can possibly (and
575%CA%%%    certainly) overlap. At this point, a new code generation technique
576%CA%%%    was needed. The first attempt is due to Chamsky et al. [??], and
577%CA%%%    was improved by Quillere et al. [QRW]. The code is now implemented
578%CA%%%    in an efficient tool [cloog], that gave a new life to polyhedral
579%CA%%%    techniques.
580%CA%
581%CA%%pluto's tiling
582%CA%   The tiling techniques were extended to non-perfect loop nests with
583%CA%   {\em affine partitioning}. Affine partitioning is to affine
584%CA%   scheduling what (original) tiling was to unimodular
585%CA%   transformations. An affine partitioning assigns to every assignment
586%CA%   its coordinates in the basis defined by the normals to the tiling
587%CA%   hyperplanes. Recently, a way to compute efficient hyperplanes was
588%CA%   found [uday], with good data locality, and communications
589%CA%   confined in a small neighborhood around every processor.
590%CA%
591%CA%\subsubsection{Source-level Memory Optimisation}
592%CA%  The HLS process makes it possible to customise memory, which impacts the final
593%CA%  circuit size and power consumption. Though most HLS tools already
594%CA%  try to optimise memory usage, it is better to provide an independent
595%CA%  source-level pass, that could be reused for different tools and in
596%CA%  other contexts.
597%CA%
598%CA%  There exist many approaches to evaluate and reduce the memory
599%CA%  requirement of a program. The first approaches are concerned with
600%CA%  {\em memory size estimation}, which can be defined as the maximum
601%CA%  number of memory cells used at the same time [clauss,zhao]. These
602%CA%  approaches provide an estimation as a symbolic expression of program
603%CA%  parameters, which can be used further to guide loop optimisations.
604%CA%  However, no explicit way to reduce the memory size is given.  {\em
605%CA%  Intra-array reuse} approaches break with this limitation, and
606%CA%  collapse the array cells which are not live at the same time. The
607%CA%  collapse is done by means of a data layout transformation, specified
608%CA%  with a linear (modular) mapping.  The first approaches were
609%CA%  developed at IMEC [balasa,catthoor], and basically linearize
610%CA%  the arrays and fold them using a modulo operator. Then, Lefebvre et
611%CA%  al. proposed a solution that folds the array dimensions independently
612%CA%  [lefebvre]. Finally, Darte et al. provided a general formalisation of
613%CA%  the problem, together with a solution that subsumes the previous
614%CA%  approaches [darte]. A first implementation was made with the tool
615%CA%  {\sc Bee}, but there are still many limitations.
616%CA%
617%CA%  \begin{itemize}
618%CA%  \item The tool is restricted to regular programs, whereas more
619%CA%  general programs could be handled with a conservative array liveness
620%CA%  analysis.
621%CA%
622%CA%  \item Programs depending on parameters (inputs) are not handled,
623%CA%  which prevents handling, for example, the body of tiled loops.
624%CA%
625%CA%  \item The new array layout can break spatial locality, and thus impact
626%CA%  performance and power consumption. One would like to get a mapping
627%CA%  that improves or, at least, preserves the spatial locality of the
628%CA%  program.
629%CA%
630%CA%  \item Finally, the final memory compaction strongly depends on the
631%CA%  program schedule, and is naturally hindered by the
632%CA%  parallelism. Consequently, there is a trade-off to find with
633%CA%  automatic parallelization. An ideal solution would be to reduce
634%CA%  memory usage, while preserving parallelism. 
635%CA%  \end{itemize}
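%CA%
%CA%  As an illustration of intra-array reuse (a sketch on a hypothetical
%CA%  kernel, assuming a sequential schedule and that only the last values
%CA%  are used after the loop), consider $a(i) = a(i-1) + a(i-2)$ for
%CA%  $2 \le i \le N$. While $a(i)$ is computed, only $a(i-1)$ and $a(i-2)$
%CA%  are still live, and $a(i-3)$ is never read again. The modular mapping
%CA%  \[
%CA%    \sigma(i) = i \bmod 3
%CA%  \]
%CA%  therefore stores $a(i)$ in the cell previously held by the dead value
%CA%  $a(i-3)$, and the $N+1$ cells of the original array collapse to $3$.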
636
637\subsubsection{Interfaces}
638\begin{Large}\begin{verbatim}
639-- TO BE COMPLETED BY INSA: state of the art
640\end{verbatim}
641\end{Large}
642%
643%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
644\subsection{Objectives and innovation aspects}
645\hspace{2cm}\begin{scriptsize}\begin{verbatim}
646% 2.2.  OBJECTIVES AND AMBITIOUS/INNOVATIVE NATURE OF THE PROJECT
647% (2 pages maximum)
648% Describe the scientific/technical objectives of the project.
649% Present the expected scientific advance. State the originality and the
650% ambitious nature of the project.
651% Detail the scientific and technical barriers to be lifted by the project.
652% Where relevant, describe the final product(s) developed by the end of the project
653% that show the innovative nature of the project.
654% Present the expected results, proposing if possible success and evaluation
655% criteria suited to the type of project, so that the results can be assessed at
656% the end of the project.
657% Where applicable (programmes requiring multidisciplinarity), show how the
658% scientific disciplines fit together.
659\end{verbatim}
660\end{scriptsize}
661
662% the scientific/technical objectives of the project.
663The objective of the COACH project is to develop a complete framework for
664HPC (accelerating existing software applications)
665and embedded applications (implementing an application on a low-power standalone device).
666The design steps are presented in figure~\ref{coach-flow}.
667\begin{figure}[hbtp]\leavevmode\center
668  \includegraphics[width=.8\linewidth]{flow}
669  \caption{\label{coach-flow} COACH flow.}
670\end{figure}
671\begin{description}
672\item[HPC setup] Here the user splits the application into two parts: the host application,
673which remains on the PC, and the SoC application, which migrates to the SoC.
674The framework provides a simulation model that makes it possible to evaluate the partitioning.
675\item[SoC design] In this phase,
676the user obtains simulators of the SoC at different abstraction levels by giving the COACH framework
677a SoC description.
678This description consists of a process network corresponding to the SoC application,
679an OS, an instance of a generic hardware platform
680and a mapping of processes onto the platform components. The supported mappings are
681software (the process runs on a SoC processor),
682XXXpeci (the process runs on a SoC processor enhanced with dedicated instructions),
683and hardware (the process runs in a coprocessor generated by HLS and plugged into the SoC bus).
684\item[Application compilation] Once the SoC description is validated, COACH automatically generates
685an FPGA bitstream containing the hardware platform with the SoC application software and
686an executable containing the host application. The user can launch the application by
687loading the bitstream onto the FPGA and running the executable on the PC.
688\end{description}
689 
690% the expected scientific advance. State the originality and the
691% ambitious nature of the project.
692The main scientific contribution of the project is to unify various synthesis techniques
693(same input and output formats), so that the user can switch from one to another
694without engineering effort and even chain them, for example running a polyhedral transformation
695before synthesis.
696Another advantage of this framework is that it provides different abstraction levels from
697a single description.
698Finally, this description is device-family independent and its hardware implementation
699is automatically generated.
700
701% Detail the scientific and technical barriers to be lifted by the project.
702System design is a very complicated task, and in this project we try to simplify it
703as much as possible. To this end, we have to deal with the following scientific
704and technological barriers.
705\begin{itemize}
706\item The main problem in HPC is the communication between the PC and the SoC.
707This problem has two aspects. The first is efficiency. The second is
708eliminating the engineering effort needed to implement it at different abstraction levels.
709\item The COACH design flow follows a top-down approach. In such a case,
710the required performance figures of a coprocessor (operating frequency, maximum number of cycles for
711a given computation, power consumption, etc.) are imposed by the other system
712components. The challenge is to let the user control the synthesis
713process accurately. For instance, the operating frequency must not be a result of RTL synthesis
714but a strict synthesis constraint.
715\item HLS tools are sensitive to the style in which the algorithm is written.
716In addition, they are not integrated into an architecture and system
717exploration tool.
718Consequently, engineering work is required to switch from one tool to another,
719to integrate the resulting simulation model into an architectural exploration tool
720and to synthesize the generated RTL description.
721%CA Additional preprocessing, source-level transformations, are thus
722%CA required to improve the process.
723%CA Particularly, this includes parallelism exposure and efficient memory mapping.
724\item Most HLS tools translate a sequential algorithm into a coprocessor
725containing a single data-path and finite state machine (FSM). In this way,
726only fine-grained parallelism (instruction-level parallelism, ILP) is exploited.
727The challenge is to identify the coarse-grained parallelism and to generate,
728from a sequential algorithm, a coprocessor containing multiple communicating
729tasks (data-paths and FSMs).
730\end{itemize}
731
732%Present the expected results, proposing if possible success and evaluation
733%criteria suited to the type of project, so that the results can be assessed at
734%the end of the project.
735The main result is the framework. Concretely, it is composed of:
7362 HPC communication schemes with their implementation,
7375 HLS tools (control-dominated HLS, data-dominated HLS, coarse-grained HLS,
738memory optimisation HLS and ASIP),
7393 SystemC-based virtual prototyping environments extended with synthesizable
740RTL IP cores (generic, ALTERA/NIOS/AVALON, XILINX/MICROBLAZE/OPB),
741one design space exploration tool,
742one operating system (OS).
743\\
744The framework functionality will be demonstrated with XXX-EXAMPLE1, XXX-EXAMPLE2
745and XXX-EXAMPLE3 on 4 architectures (generic/XILINX, generic/ALTERA,
746proprietary/XILINX, proprietary/ALTERA).
747
748%% \section{}
749%% %3.  SCIENTIFIC AND TECHNICAL PROGRAMME, PROJECT ORGANISATION
750%% \subsection{}
751%% %3.1.        SCIENTIFIC PROGRAMME AND PROJECT STRUCTURE
752%% %(2 pages maximum)
753%% %Present the scientific programme and justify the breakdown of the work
754%% %programme into tasks, consistently with the objectives pursued.
755%% %Use a diagram to show the links between the different tasks
756%% %(work breakdown chart)
757%% %The tasks represent the main phases of the project. Their number is limited.
758%% %Do not forget the activities and actions related to dissemination and
759%% %exploitation of results.
760%%
761%% %PUT A FIGURE HERE DESCRIBING THE TASKS AND THEIR INTERACTIONS (WITH THE FLOW
762%% %AS A WATERMARK ? )
763%% \subsection{}
764%% %3.2.        PROJECT MANAGEMENT
765%% %(2 pages maximum)
766%% %Specify the organisational aspects of the project and the coordination arrangements
767%% %(if possible, a dedicated coordination task: see task 0 of submission
768%% %document A).
769%% \subsection{}
770%% %3.3.        DESCRIPTION OF THE WORK BY TASK
771%% %(ideally 1 or 2 pages per task)
772%% %For each task, describe:
773%% %-   the objectives of the task and possible success indicators,
774%% %-   the task leader and the partners involved (this may be
775%% %shown graphically),
776%% %-   the detailed work programme of the task,
777%% %-   the deliverables of the task,
778%% %-   the contributions of the partners ("who does what"),
779%% %-   the description of the methods and technical choices, and of the way
780%% %the solutions will be provided,
781%% %-   the risks of the task and the fallback solutions considered.
782
783
784
785
786
787