source: anr/section-etat-de-art.tex @ 293

Last change on this file since 293 was 289, checked in by coach, 14 years ago

Changed to adapt the document to the ANR 2011 call.

% vim:set spell:
% vim:spell spelllang=en:
\anrdoc{\begin{itemize}
\item Present a national and international state of the art, setting out the state of
      knowledge on the subject.
\item Highlight any contributions of the partners of the project proposal
      to this state of the art.
\item Highlight any preliminary results.
\item Include the necessary bibliographic references in annex 7.1.
\end{itemize}}

Our project covers several critical domains of system design in order
to achieve high-performance computing. Starting from a high-level description, we aim
to automatically generate both the hardware and software components of the system.

\subsubsection{High Performance Computing}
% A market eaten up by GPGPU architectures such as Nvidia's Fermi; CUDA programming language
The High-Performance Computing (HPC) world is composed of three main families of architectures:
many-core, GPGPU (General-Purpose computing on Graphics Processing Units) and FPGA.
The first two families dominate the market by benefiting from
the strength and influence of mass-market leaders (Intel, Nvidia).
%such as Intel for many-core CPU and Nvidia for GPGPU.
In this market, FPGA architectures are emerging and very promising.
By adapting the architecture to the software, % (the opposite is done in the other families)
FPGA architectures enable better performance
(typically speedups between 10x and 100x)
while being smaller and using less energy (and producing less heat).
However, using FPGAs presents significant challenges~\cite{hpc06a}.
First, the operating frequency of an FPGA is low compared to a high-end microprocessor.
Second, according to Amdahl's law, HPC/FPGA application performance is unusually sensitive
to the implementation quality~\cite{hpc06b}.
% Thus, the performance strongly relies on the detected parallelism.
% (to summarize the last 2 points)
Finally, efficient design methodologies are required in order to
hide FPGA complexity and the underlying implementation subtleties from HPC users,
so that they do not have to change their habits and can achieve a design productivity
equivalent to that of the other families~\cite{hpc07a}.

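To make the second challenge concrete, here is a short worked instance of Amdahl's law.
The symbols $p$ (fraction of the execution time offloaded to the FPGA) and $s$
(speedup obtained on that fraction) are notation introduced here for illustration only,
and the numerical values are assumed, not measured:
\[
  S(p, s) = \frac{1}{(1 - p) + \frac{p}{s}} .
\]
With $p = 0.9$ and $s = 100$, $S = 1 / (0.1 + 0.009) \approx 9.2$; if the implementation
only reaches $s = 10$ on the same fraction, $S = 1 / (0.1 + 0.09) \approx 5.3$.
Under these assumptions the overall gain can never exceed $1 / (1 - p) = 10$, so both the
fraction that is offloaded and the quality of its implementation weigh heavily on the result.
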
% FPGA state of the art
HPC/FPGA hardware is only now emerging and is in its early commercial stages,
but the corresponding design techniques have not yet caught up.
Industrial (Mitrionics~\cite{hpc08}, Gidel~\cite{hpc09}, Convey Computer~\cite{hpc10}) and academic (CHREC)
research on HPC/FPGA is mainly conducted in the USA.
None of the approaches developed in this research fully meets the
challenges described above. For example, Convey Computer proposes application-specific instruction set extensions of x86 cores in an FPGA accelerator,
but the generation of extensions is not automated and requires hardware design skills.
Mitrionics has an elegant solution based on a compute engine specifically
developed for high-performance execution in FPGAs. Unfortunately, the design flow
is based on a new programming language (Mitrion-C), which implies significant designer effort and poor portability.
% tool relying on operator libraries (XtremeData),
% Should we mention the OpenFPGA consortium, whose goal is: "to accelerate the incorporation of reconfigurable computing technology in high-performance and enterprise applications"?

Thus, much effort is required to develop design tools that translate high-level
language programs to FPGA configurations.
Moreover, as already remarked in~\cite{hpc11}, Dynamic Partial Reconfiguration~\cite{hpc12}
(DPR, which enables changing a part of the FPGA while the rest is still working)
appears very promising for improving HPC performance as well as reducing the required area.

\subsubsection{System Synthesis}
Today, several solutions for system design are proposed and commercialized.
The existing commercial or free tools do not
cover the whole system synthesis process fully automatically. Moreover,
they are bound to a particular device family and to a particular IP library.
The most commonly used are provided by \altera and \xilinx to promote their
FPGA devices. These representative tools used to synthesize SoCs on FPGAs
are introduced below.
\\
The \xilinx System Generator for DSP~\cite{system-generateur-for-dsp} is a
plug-in to Simulink that enables designers to develop high-performance DSP
systems for \xilinx FPGAs.
Designers can design and simulate a system using MATLAB and Simulink. The
tool then automatically generates synthesizable Hardware Description
Language (HDL) code mapped to \xilinx pre-optimized algorithms.
However, this tool targets only DSP-based algorithms and \xilinx FPGAs, and
cannot handle a complete SoC. Thus, it is not really a system synthesis tool.
\\
In contrast, SOPC Builder~\cite{spoc-builder} from \altera and
Platform Studio (XPS) from \xilinx allow designers to describe a system, synthesize it,
program it into a target FPGA and upload a software application.
Both SOPC Builder and XPS allow designers to select and parameterize components from
an extensive drop-down list of IP cores (I/O core, DSP, processor, bus core, ...)
as well as to incorporate their own IPs. Nevertheless, none of the tools introduced above
provides facilities to synthesize coprocessors or to simulate the platform
at a high level (SystemC).
System designers must provide synthesizable descriptions of their own IP cores with
a suitable bus interface. Design Space Exploration is thus limited,
and SystemC simulation is possible neither at the transactional level nor at the
cycle-accurate level.
\\
In addition, \xilinx System Generator, XPS and SOPC Builder are closed worlds,
since each one imposes its own IPs, which are not interchangeable.
Designers can then only generate a synthesized netlist, a VHDL/Verilog simulation test
bench and a custom software library that reflect the hardware configuration.

Consequently, a designer developing an embedded system needs to master four different
design environments:
\begin{enumerate}
  \item a virtual prototyping environment (in SystemC) for system-level exploration,
  \item an architecture compiler to define the hardware architecture (Verilog/VHDL),
  \item one or several third-party HLS tools for coprocessor synthesis (C to RTL),
  \item and finally back-end synthesis tools for bitstream generation (RTL to bitstream).
\end{enumerate}
Furthermore, mixing these tools requires a significant interfacing effort, which makes
the design process very complex and achievable only by designers skilled in many domains.

\subsubsection{High Level Synthesis}
High Level Synthesis translates a sequential algorithmic description and a
set of constraints (area, power, frequency, ...) to a micro-architecture at
Register Transfer Level (RTL).
Several academic and commercial tools are available today. The most common
tools are SPARK~\cite{spark04}, GAUT~\cite{gaut08}, UGH~\cite{ugh08} in the
academic world and CATAPULTC~\cite{catapult-c}, PICO~\cite{pico} and
CYNTHETIZER~\cite{cynthetizer} in the commercial world.  Despite their
maturity, their use is limited by the following issues~\cite{IEEEDT,CATRENE,HLSBOOK}:
\begin{itemize}
\item HLS tools are not integrated into an architecture and system exploration tool.
Thus, a designer who needs to accelerate a software part of the system must adapt it manually
to the HLS input dialect and perform engineering work to exploit the synthesis result
at the system level,
\item Current HLS tools cannot target both control-oriented and data-oriented applications,
\item HLS tools mainly take into account a single constraint, while realistic design
is multi-constrained.
The low power consumption constraint, which is mandatory for embedded systems, is not yet
well handled, or not handled at all, by the HLS tools already available,
\item The parallelism is extracted from the initial specification.
To get more parallelism or to reduce the amount of memory required in the SoC, the user
must rewrite the algorithmic specification, although there are techniques, such as polyhedral
transformations, that increase the intrinsic parallelism,
\item While they support limited loop transformations like loop unrolling and loop
pipelining, current HLS tools do not provide support for design space exploration, either
through automatic loop transformations or through memory mapping,
\item Despite having the same input language (C/C++), they are sensitive to the style in
which the algorithm is written. Consequently, engineering work is required to switch from
one tool to another,
\item They do not accurately meet the frequency constraint when they target an FPGA device.
Their error is about 10 percent. This is problematic when the generated component is integrated
in a SoC, since it will slow down the whole system.
\end{itemize}
Given these limitations, a new generation of tools is needed to reduce the gap
between the specification of a heterogeneous system and its hardware implementation~\cite{HLSBOOK,IEEEDT}.

\subsubsection{Application-Specific Instruction-Set Processors}

ASIPs (Application-Specific Instruction-Set Processors) are programmable
processors in which both the instruction set and the micro-architecture have
been tailored to a given application domain or to a
specific application.  This specialization usually offers a good compromise
between performance (w.r.t.\ a pure software implementation on an embedded
CPU) and flexibility (w.r.t.\ an application-specific hardware co-processor).
In spite of their obvious advantages, using and designing ASIPs remains a
difficult task, since it involves designing both a micro-architecture and a
compiler for this architecture. Besides, to our knowledge, there is still
no open-source design flow available for ASIP design, even though such a tool
would be valuable in the
context of a system-level design exploration tool.
\par
In this context, ASIP design based on Instruction Set Extensions (ISEs) has
received a lot of interest~\cite{NIOS2}, as it makes micro-architecture synthesis
more tractable\footnote{ISEs rely on a template micro-architecture in which
only a small fraction of the architecture has to be specialized} and helps ASIP
designers to focus on compilers, for which there are still many open
problems~\cite{ARC08}.
However, this approach has a severe weakness: it also significantly reduces
opportunities for achieving good speedups (most speedups remain between 1.5x and
2.5x), because ISE performance is generally limited by I/O constraints, as
ISEs generally rely on the main CPU register file to access data.

% (
%automatically extracting ISE candidates from application code \cite{CODES04},
%performing efficient instruction selection and/or storage resource (register)
%allocation \cite{FPGA08}).
To cope with this issue, recent approaches~\cite{DAC09,CODES08,TVLSI06} advocate the use of
micro-architectural ISE models in which the coupling between the processor micro-architecture
and the ISE component is tightened so as to allow the ISE to overcome the register
I/O limitations. However, these approaches generally tackle the problem from a compiler/simulation
point of view and do not address the problem of generating synthesizable representations for
these models.

We therefore strongly believe that there is a need for an open framework which
would allow researchers and system designers to:
\begin{itemize}
\item Explore the various levels of interaction between the original CPU micro-architecture
and its extension (for example through a Domain Specific Language targeted at micro-architecture
specification and synthesis).
\item Retarget the compiler instruction-selection pass
(or prototype new passes) so as to be able to take advantage of these ISEs.
\item Provide a complete system-level integration for using ASIPs as SoC building blocks
(integration with application-specific blocks, MPSoC, etc.)
\end{itemize}

\subsubsection{Automatic Parallelization}

The problem of compiling sequential programs for parallel computers
has been studied since the advent of the first parallel architectures
in the 1970s. The basic approach consists in applying program transformations
which exhibit or increase the potential parallelism, while guaranteeing
the preservation of the program semantics. Most of these transformations
just reorder the operations of the program; some of them modify its
data structures. Dependences (exact or conservative) are checked to guarantee
the legality of the transformation.

This has led to the invention of many loop transformations (loop fusion,
loop splitting, loop skewing, loop interchange, loop unrolling, ...)
which interact in a complicated way. More recently, it has been noticed
that all of these are just changes of basis in the iteration domain of
the program. This has led to the introduction of the polyhedral model
\cite{FP:96,DRV:2000}, in which the combination of two transformations is
simply a matrix product.

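As a minimal illustration of this matrix view (a sketch whose notation is assumed here,
not taken from the cited works), consider a two-deep loop nest with iteration vector
$(i, j)^T$. Skewing the inner loop by the outer one and interchanging the two loops can be
written as
\[
  S = \left(\begin{array}{cc} 1 & 0 \\ 1 & 1 \end{array}\right), \qquad
  P = \left(\begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array}\right),
\]
and applying the skewing first and the interchange second amounts to the single transformation
\[
  T = P S = \left(\begin{array}{cc} 1 & 1 \\ 1 & 0 \end{array}\right), \qquad
  T \left(\begin{array}{c} i \\ j \end{array}\right)
    = \left(\begin{array}{c} i + j \\ i \end{array}\right).
\]
Legality can be checked on dependence distance vectors, which must remain lexicographically
positive after the transformation. For a hypothetical dependence of distance $d = (1, -1)^T$,
interchange alone gives $P d = (-1, 1)^T$, which is not lexicographically positive and is
therefore illegal, whereas $T d = (0, 1)^T$ is lexicographically positive, so the combined
transformation preserves this dependence.
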
Since hardware is inherently parallel, finding parallelism in sequential
programs is an important prerequisite for HLS. The large FPGA chips of
today can accommodate much more parallelism than is available in basic blocks.
The polyhedral model is the ideal tool for finding more parallelism in
loops.

As a side effect, it has been observed that the polyhedral model is a useful
tool for many other optimizations, like memory reduction and locality
improvement. Another point is
that the polyhedral model \emph{stricto sensu} applies only to
very regular programs. Its extension to more general programs is
an active research subject.

%\subsubsection{High Performance Computing}
%Accelerating high-performance computing (HPC) applications with field-programmable
%gate arrays (FPGAs) can potentially improve performance.
%However, using FPGAs presents significant challenges~\cite{hpc06a}.
%First, the operating frequency of an FPGA is low compared to a high-end microprocessor.
%Second, based on Amdahl law,  HPC/FPGA application performance is unusually sensitive
%to the implementation quality~\cite{hpc06b}.
%Finally, High-performance computing programmers are a highly sophisticated but scarce
%resource. Such programmers are expected to readily use new technology but lack the time
%to learn a completely new skill such as logic design~\cite{hpc07a} .
%\\
%HPC/FPGA hardware is only now emerging and in early commercial stages,
%but these techniques have not yet caught up.
%Thus, much effort is required to develop design tools that translate high level
%language programs to FPGA configurations.
