source: anr/section-3.1.tex @ 207

Last change on this file since 207 was 198, checked in by coach, 15 years ago

minor modifs in 3.1.4

% vim:set spell:
% vim:spell spelllang=en:

Our project covers several critical domains of system design in order
to achieve high-performance computing. Starting from a high-level description, we aim
at automatically generating both the hardware and software components of the system.

\subsubsection{High Performance Computing}
% A market swallowed up by GPGPU architectures such as Nvidia's Fermi; CUDA programming language
The High-Performance Computing (HPC) world is composed of three main families of architectures:
many-core, GPGPU (General-Purpose computing on Graphics Processing Units) and FPGA.
The first two families dominate the market, benefiting
from the strength and influence of mass-market leaders (Intel, Nvidia).
%such as Intel for many-core CPU and Nvidia for GPGPU.
In this market, FPGA architectures are emerging and very promising.
By adapting the architecture to the software, % (the opposite is done in the other families)
FPGA architectures enable better performance
(typically accelerations between 10x and 100x)
while using less area and less energy (and producing less heat).
However, using FPGAs presents significant challenges~\cite{hpc06a}.
First, the operating frequency of an FPGA is low compared to that of a high-end microprocessor.
Second, following Amdahl's law, HPC/FPGA application performance is unusually sensitive
to the implementation quality~\cite{hpc06b}.
% Thus, the performance strongly relies on the detected parallelism.
% (to summarize the last two points)
Finally, efficient design methodologies are required in order to
hide FPGA complexity and the underlying implementation subtleties from HPC users,
so that they do not have to change their habits and can achieve a design productivity equivalent
to that of the other families~\cite{hpc07a}.
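To make the sensitivity mentioned in the second point concrete, recall the usual statement of Amdahl's law
(the symbols $p$ and $s$ are notation introduced here for illustration only):
if a fraction $p$ of the execution time is accelerated by a factor $s$,
the overall speedup is
\[
  S(p,s) = \frac{1}{(1-p) + p/s} .
\]
Assuming, for instance, $s = 100$, an implementation that offloads $p = 0.99$ of the run time
reaches $S \approx 50$, whereas one that offloads only $p = 0.9$ reaches $S \approx 9$:
a seemingly small loss in coverage divides the overall gain by more than five.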
% FPGA state of the art

HPC/FPGA hardware is only now emerging and in early commercial stages,
but design techniques have not yet caught up.
Industrial (Mitrionics~\cite{hpc08}, Gidel~\cite{hpc09}, Convey Computer~\cite{hpc10}) and academic (CHREC)
research on HPC/FPGA is mainly conducted in the USA.
None of the approaches developed in this research entirely meets the
challenges described above. For example, Convey Computer proposes application-specific instruction set extensions of x86 cores in an FPGA accelerator,
but the generation of these extensions is not automated and requires hardware design skills.
Mitrionics has an elegant solution based on a compute engine specifically
developed for high-performance execution in FPGAs. Unfortunately, the design flow
is based on a new programming language (Mitrion-C), which implies significant designer effort and poor portability.
% tool relying on operator libraries (XtremeData),
% Should we mention the OpenFPGA consortium, whose goal is: "to accelerate the incorporation of reconfigurable computing technology in high-performance and enterprise applications"?

Thus, much effort is required to develop design tools that translate high-level
language programs into FPGA configurations.
Moreover, as already noted in~\cite{hpc11}, Dynamic Partial Reconfiguration~\cite{hpc12}
(DPR, which allows part of the FPGA to be reconfigured while the rest keeps running)
appears very promising for improving HPC performance as well as for reducing the required area.

\subsubsection{System Synthesis}
Today, several solutions for system design are proposed and commercialized.
The existing commercial or free tools do not
cover the whole system synthesis process in a fully automatic way. Moreover,
they are bound to a particular device family and to a particular IP library.
The most commonly used are provided by \altera and \xilinx to promote their
FPGA devices. These two representative tools used to synthesize SoCs on FPGAs
are introduced below.
\\
The \xilinx System Generator for DSP~\cite{system-generateur-for-dsp} is a
plug-in to Simulink that enables designers to develop high-performance DSP
systems for \xilinx FPGAs.
Designers can design and simulate a system using MATLAB and Simulink. The
tool then automatically generates synthesizable Hardware Description
Language (HDL) code mapped to \xilinx pre-optimized algorithms.
However, this tool targets only DSP-based algorithms and \xilinx FPGAs, and
cannot handle a complete SoC. Thus, it is not really a system synthesis tool.
\\
In contrast, SOPC Builder~\cite{spoc-builder} allows the designer to describe a
system, to synthesize it, to program it into a target FPGA and to upload a
software application.
% FIXME (C2H from \altera: runs fast but consumes a huge amount of resources)
Nevertheless, SOPC Builder does not provide any facilities to synthesize
coprocessors. The system designer must provide the synthesizable description
with a compatible bus interface. Design space exploration is thus limited,
and SystemC simulation is possible neither at the transactional level nor at the cycle-accurate
level.
\\
In addition, \xilinx System Generator and SOPC Builder are closed worlds,
since each one imposes its own IPs, which are not interchangeable.

\subsubsection{High Level Synthesis}
High Level Synthesis translates a sequential algorithmic description and a
set of constraints (area, power, frequency, ...) into a micro-architecture at
Register Transfer Level (RTL).
Several academic and commercial tools are available today. The most common
tools are SPARK~\cite{spark04}, GAUT~\cite{gaut08} and UGH~\cite{ugh08} in the
academic world, and CATAPULTC~\cite{catapult-c}, PICO~\cite{pico} and
CYNTHESIZER~\cite{cynthetizer} in the commercial world. Despite their
maturity, their usage is restricted by the following limitations~\cite{IEEEDT,CATRENE,HLSBOOK}:
\begin{itemize}
\item HLS tools are not integrated into an architecture and system exploration tool.
Thus, a designer who needs to accelerate a software part of the system must manually adapt it
to the HLS input dialect and perform engineering work to exploit the synthesis result
at the system level.
\item Current HLS tools cannot target both control-oriented and data-oriented applications.
\item HLS tools take into account only one or a few constraints simultaneously, while realistic
designs are multi-constrained.
Moreover, a low power consumption constraint is mandatory for embedded systems.
However, it is not yet well handled, or not handled at all, by the available synthesis tools.
\item Parallelism is extracted from the initial algorithmic specification.
To get more parallelism or to reduce the amount of memory required in the SoC, the user
must rewrite the algorithmic specification, although techniques such as polyhedral
transformations exist to increase the intrinsic parallelism.
\item While they support limited loop transformations such as loop unrolling and loop
pipelining, current HLS tools do not support design space exploration, either
through automatic loop transformations or through memory mapping.
\item Although they share the same input language (C/C++), they are sensitive to the style in
which the algorithm is written. Consequently, engineering work is required to switch from
one tool to another.
\item They do not accurately meet the frequency constraint when they target an FPGA device.
Their error is about 10 percent. This is problematic when the generated component is integrated
in a SoC, since it will slow down the whole system.
\end{itemize}
Given these limitations, it is necessary to create a new generation of tools that reduces the gap
between the specification of a heterogeneous system and its hardware implementation~\cite{HLSBOOK,IEEEDT}.

\subsubsection{Application Specific Instruction Processors}

ASIPs (Application-Specific Instruction-set Processors) are programmable
processors in which both the instruction set and the micro-architecture have
been tailored to a given application domain (e.g. video processing), or to a
specific application. This specialization usually offers a good compromise
between performance (w.r.t. a pure software implementation on an embedded
CPU) and flexibility (w.r.t. an application-specific hardware co-processor).
In spite of their obvious advantages, using and designing ASIPs remains a
difficult task, since it involves designing both a micro-architecture and a
compiler for this architecture. Besides, to our knowledge, there is still
no open-source design flow available for ASIP design, even though such a tool
would be valuable in the
context of a system-level design exploration tool.
\par
In this context, ASIP design based on Instruction Set Extensions (ISEs) has
received a lot of interest~\cite{NIOS2}, as it makes micro-architecture synthesis
more tractable\footnote{ISEs rely on a template micro-architecture in which
only a small fraction of the architecture has to be specialized.} and helps ASIP
designers to focus on compilers, for which there are still many open
problems~\cite{ARC08}.
This approach, however, has a severe weakness: it also significantly reduces
the opportunities for achieving good speedups (most speedups remain between 1.5x and
2.5x), since ISE performance is generally limited by I/O constraints, as
ISEs generally rely on the main CPU register file to access data.

% (
% automatically extracting ISE candidates from application code \cite{CODES04},
% performing efficient instruction selection and/or storage resource (register)
% allocation \cite{FPGA08}).
To cope with this issue, recent approaches~\cite{DAC09,CODES08,TVLSI06} advocate the use of
micro-architectural ISE models in which the coupling between the processor micro-architecture
and the ISE component is tightened so as to allow the ISE to overcome the register
I/O limitations. However, these approaches generally tackle the problem from a compiler/simulation
point of view and do not address the problem of generating synthesizable representations of
these models.

We therefore strongly believe that there is a need for an open framework which
would allow researchers and system designers to:
\begin{itemize}
\item Explore the various levels of interaction between the original CPU micro-architecture
and its extension (for example through a Domain Specific Language targeted at micro-architecture
specification and synthesis).
\item Retarget the compiler instruction-selection pass
(or prototype new passes) so as to be able to take advantage of these ISEs.
\item Provide a complete system-level integration for using ASIPs as SoC building blocks
(integration with application-specific blocks, MPSoC, etc.).
\end{itemize}

\subsubsection{Automatic Parallelization}

The problem of compiling sequential programs for parallel computers
has been studied since the advent of the first parallel architectures
in the 1970s. The basic approach consists in applying program transformations
which expose or increase the potential parallelism, while guaranteeing
the preservation of the program semantics. Most of these transformations
just reorder the operations of the program; some of them modify its
data structures. Dependences (exact or conservative) are checked to guarantee
the legality of the transformation.

This has led to the invention of many loop transformations (loop fusion,
loop splitting, loop skewing, loop interchange, loop unrolling, ...)
which interact in a complicated way. More recently, it has been noticed
that all of these are just changes of basis in the iteration domain of
the program. This has led to the introduction of the polyhedral model~\cite{FP:96,DRV:2000},
in which the combination of two transformations is
simply a matrix product.
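As an illustration, assuming a depth-two loop nest with iteration vector $(i,j)^T$
(the matrices $T$ and $S$ are an example chosen here, not taken from the cited works),
loop interchange and loop skewing can be written as
\[
  T = \left(\begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array}\right),
  \qquad
  S = \left(\begin{array}{cc} 1 & 0 \\ 1 & 1 \end{array}\right),
\]
and interchanging first, then skewing, is the single transformation
\[
  S\,T = \left(\begin{array}{cc} 0 & 1 \\ 1 & 1 \end{array}\right),
  \qquad
  S\,T \left(\begin{array}{c} i \\ j \end{array}\right)
  = \left(\begin{array}{c} j \\ i + j \end{array}\right).
\]
Whether such a combined transformation is legal still has to be checked against the dependences.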

Since hardware is inherently parallel, finding parallelism in sequential
programs is an important prerequisite for HLS. The large FPGA chips of
today can accommodate much more parallelism than is available in basic blocks.
The polyhedral model is the ideal tool for finding more parallelism in
loops.

As a side effect, it has been observed that the polyhedral model is a useful
tool for many other optimizations, such as memory reduction and locality
improvement. One caveat is
that the polyhedral model \emph{stricto sensu} applies only to
very regular programs. Its extension to more general programs is
an active research subject.

%\subsubsection{High Performance Computing}
%Accelerating high-performance computing (HPC) applications with field-programmable
%gate arrays (FPGAs) can potentially improve performance.
%However, using FPGAs presents significant challenges~\cite{hpc06a}.
%First, the operating frequency of an FPGA is low compared to a high-end microprocessor.
%Second, based on Amdahl law,  HPC/FPGA application performance is unusually sensitive
%to the implementation quality~\cite{hpc06b}.
%Finally, High-performance computing programmers are a highly sophisticated but scarce
%resource. Such programmers are expected to readily use new technology but lack the time
%to learn a completely new skill such as logic design~\cite{hpc07a} .
%\\
%HPC/FPGA hardware is only now emerging and in early commercial stages,
%but these techniques have not yet caught up.
%Thus, much effort is required to develop design tools that translate high level
%language programs to FPGA configurations.
