% vim:set spell:
% vim:spell spelllang=en:

Our project covers several critical domains of system design in order
to achieve high-performance computing. Starting from a high-level description, we aim
at automatically generating both the hardware and the software components of the system.

\subsubsection{High Performance Computing}
% A market dominated by GPGPU architectures such as Nvidia's FERMI and the CUDA programming language
The High-Performance Computing (HPC) world is composed of three main families of architectures:
many-core, GPGPU (General-Purpose computation on Graphics Processing Units) and FPGA.
The first two families dominate the market by taking advantage
of the strength and influence of mass-market leaders (Intel, Nvidia).
%such as Intel for many-core CPU and Nvidia for GPGPU.
In this market, FPGA architectures are emerging and very promising.
By adapting the architecture to the software, % (the opposite is done in the other families)
FPGA architectures enable better performance
(typically between 10x and 100x acceleration)
while using a smaller area and less energy (and producing less heat).
However, using FPGAs presents significant challenges~\cite{hpc06a}.
First, the operating frequency of an FPGA is low compared to a high-end microprocessor.
Second, based on Amdahl's law, HPC/FPGA application performance is unusually sensitive
to the implementation quality~\cite{hpc06b}.
% Thus, the performance strongly relies on the detected parallelism.
% (to summarize the two last points)
Finally, efficient design methodologies are required in order to
hide the FPGA complexity and the underlying implementation subtleties from HPC users,
so that they do not have to change their habits and can achieve a design
productivity equivalent to that of the other families~\cite{hpc07a}.
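
To make this sensitivity explicit, recall the standard statement of Amdahl's law
(a textbook identity, not specific to the cited works): if a fraction $f$ of the
execution time is accelerated by a factor $s$ on the FPGA, the overall speedup is
\[
S = \frac{1}{(1 - f) + f/s}.
\]
Even with $s = 50$ applied to $f = 90\%$ of the run time, $S \approx 8.5$: the
residual software part and the hardware/software interface quickly dominate,
which is why the implementation quality matters so much.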

% FPGA state of the art
HPC/FPGA hardware is only now emerging and in early commercial stages,
but design techniques have not yet caught up.
Industrial (Mitrionics~\cite{hpc08}, Gidel~\cite{hpc09}, Convey Computer~\cite{hpc10}) and academic (CHREC)
research on HPC/FPGA is mainly conducted in the USA.
None of the approaches developed in this research fully meets the
challenges described above. For example, Convey Computer proposes application-specific instruction-set extensions of x86 cores in an FPGA accelerator,
but the extension generation is not automated and requires hardware design skills.
Mitrionics has an elegant solution based on a compute engine specifically
developed for high-performance execution in FPGAs. Unfortunately, the design flow
is based on a new programming language (mitrionC), implying designer effort and poor portability.
% tool relying on operator libraries (XtremeData),
% Do we mention the OpenFPGA consortium, whose goal is "to accelerate the incorporation of reconfigurable computing technology in high-performance and enterprise applications"?

Thus, much effort is required to develop design tools that translate high-level
language programs into FPGA configurations.
Moreover, as already noted in~\cite{hpc11}, Dynamic Partial Reconfiguration~\cite{hpc12}
(DPR, which enables changing a part of the FPGA while the rest keeps running)
appears very interesting for improving HPC performance as well as reducing the required area.
[12]50
51\subsubsection{System Synthesis}
52Today, several solutions for system design are proposed and commercialized.
53The most common are those provided by Altera and Xilinx to promote their
54FPGA devices.
55\\
56The Xilinx System Generator for DSP~\cite{system-generateur-for-dsp} is a
57plug-in to Simulink that enables designers to develop high-performance DSP
58systems for Xilinx FPGAs.
59Designers can design and simulate a system using MATLAB and Simulink. The
60tool will then automatically generate synthesizable Hardware Description
61Language (HDL) code mapped to Xilinx pre-optimized algorithms.
62However, this tool targets only DSP based algorithms, Xilinx FPGAs and
63cannot handle complete SoC. Thus, it is not really a system synthesis tool.
64\\
65In the opposite, SOPC Builder~\cite{spoc-builder} allows to describe a
66system, to synthesis it, to programm it into a target FPGA and to upload a
67software application.
68% FIXME(C2H from Altera, marche vite mais ressource monstrueuse)
69Nevertheless, SOPC Builder does not provide any facilities to synthesize
70coprocessors. System Designer must provide the synthesizable description
71with the feasible bus interface.
72\\
73In addition, Xilinx System Generator and SOPC Builder are closed world
74since each one imposes their own IPs which are not interchangeable.
[66]75The existing commercial or free tools does not
76cover the whole system synthesis process in a full automatic way. Moreover,
[12]77they are bound to a particular device family and to IPs library.
78
\subsubsection{High Level Synthesis}
High Level Synthesis (HLS) translates a sequential algorithmic description and a
set of constraints (area, power, frequency, ...) into a micro-architecture at
the Register Transfer Level (RTL).
Several academic and commercial tools are available today. The most common
tools are SPARK~\cite{spark04}, GAUT~\cite{gaut08} and UGH~\cite{ugh08} in the
academic world, and CATAPULTC~\cite{catapult-c}, PICO~\cite{pico} and
CYNTHETIZER~\cite{cynthetizer} in the commercial world.  Despite their
maturity, their usage is restrained by the following limitations:
\begin{itemize}
\item They do not accurately respect the frequency constraint when they target an FPGA device.
Their error is about 10 percent. This is annoying when the generated component is integrated
in a SoC, since it will slow down the whole system.
\item These tools take into account only one or a few constraints simultaneously, while realistic
designs are multi-constrained.
Moreover, a low power consumption constraint is mandatory for embedded systems.
However, it is not yet well handled by common synthesis tools.
\item The parallelism is extracted from the initial algorithm. To get more parallelism or to reduce
the amount of required memory, the user must rewrite the algorithm, although techniques such as
polyhedral transformations exist to increase the intrinsic parallelism (see the sketch after this list).
\item Although they share the same input language (C/C++), they are sensitive to the style in
which the algorithm is written. Consequently, engineering work is required to move from
one tool to another.
\item The HLS tools are not integrated into an architecture and system exploration tool.
Thus, a designer who needs to accelerate a software part of the system must adapt it manually
to the HLS input dialect and perform engineering work to exploit the synthesis result
at the system level.
\end{itemize}
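
As an illustration of such transformations, the following C fragment (a
hypothetical example, not taken from any of the cited tools) shows how a
simple loop interchange can expose parallelism that an HLS tool working on
the original text would miss:

\begin{verbatim}
#define N 64
#define M 64

void kernel(int a[N][M], const int b[N][M])
{
    /* Original nest: the inner j loop carries a dependence
     * (a[i][j] reads a[i][j-1]), so its iterations are sequential. */
    for (int i = 0; i < N; i++)
        for (int j = 1; j < M; j++)
            a[i][j] = a[i][j-1] + b[i][j];
}

void kernel_interchanged(int a[N][M], const int b[N][M])
{
    /* After loop interchange (legal here: the only dependence is
     * carried along j), all i iterations of the inner loop are
     * independent and can be mapped to parallel hardware operators. */
    for (int j = 1; j < M; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = a[i][j-1] + b[i][j];
}
\end{verbatim}
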
Regarding these limitations, it is necessary to create a new generation of tools reducing the gap
between the specification of a heterogeneous system and its hardware implementation.

\subsubsection{Application Specific Instruction Processors}

ASIPs (Application-Specific Instruction-set Processors) are programmable
processors in which both the instruction set and the micro-architecture have
been tailored to a given application domain (e.g. video processing) or to a
specific application.  This specialization usually offers a good compromise
between performance (w.r.t. a pure software implementation on an embedded
CPU) and flexibility (w.r.t. an application-specific hardware co-processor).
In spite of their obvious advantages, using and designing ASIPs remains a
difficult task, since it involves designing both a micro-architecture and a
compiler for this architecture. Besides, to our knowledge, there is still
no available open-source design flow\footnote{There are commercial tools
such as \dots} for ASIP design, even though such a tool would be valuable in the
context of a system-level design exploration tool.
\par
In this context, ASIP design based on Instruction Set Extensions (ISEs) has
received a lot of interest~\cite{NIOS2,ST70}, as it makes micro-architecture synthesis
more tractable\footnote{ISEs rely on a template micro-architecture in which
only a small fraction of the architecture has to be specialized} and helps ASIP
designers focus on compilers, for which there are still many open
problems~\cite{CODES04,FPGA08}.
This approach however has a strong weakness: it significantly reduces the
opportunities for achieving good speedups (most speedups remain between 1.5x and
2.5x), since ISE performance is generally tied down by I/O constraints, as
ISEs generally rely on the main CPU register file to access data.
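
To make this I/O bottleneck concrete, consider the following hypothetical hot
loop (an illustrative sketch, not drawn from the cited works; the names and
the 3-operand pattern are ours):

\begin{verbatim}
#include <stdint.h>

/* Candidate ISE pattern: a saturating multiply-accumulate that an
 * ISE identification pass could collapse into one custom instruction. */
static inline int32_t mac_sat(int32_t acc, int16_t x, int16_t h)
{
    int64_t t = (int64_t)acc + (int64_t)x * (int64_t)h;
    if (t > INT32_MAX) return INT32_MAX;
    if (t < INT32_MIN) return INT32_MIN;
    return (int32_t)t;
}

int32_t fir(const int16_t *x, const int16_t *h, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        /* This pattern already needs 3 register reads and 1 write.
         * On a register file with 2 read ports and 1 write port it
         * cannot be selected as a single ISE; wider patterns (e.g.
         * reading several x[] words per cycle) are even further out
         * of reach, which caps the achievable speedup. */
        acc = mac_sat(acc, x[i], h[i]);
    return acc;
}
\end{verbatim}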

% (automatically extracting ISE candidates from application code \cite{CODES04},
% performing efficient instruction selection and/or storage resource (register)
% allocation \cite{FPGA08}).
To cope with this issue, recent approaches~\cite{DAC09,DAC08} advocate the use of
micro-architectural ISE models in which the coupling between the processor micro-architecture
and the ISE component is tightened up, so as to allow the ISE to overcome the register
I/O limitations. However, these approaches tackle the problem from a compiler/simulation
point of view and do not address the problem of generating synthesizable representations of
these models.

We therefore strongly believe that there is a need for an open framework which
would allow researchers and system designers to:
\begin{itemize}
\item Explore the various levels of interaction between the original CPU micro-architecture
and its extension (for example through a Domain Specific Language targeted at micro-architecture
specification and synthesis).
\item Retarget the compiler instruction-selection passes (or prototype new passes) so as
to be able to take advantage of these ISEs.
\item Provide a complete system-level integration for using ASIPs as SoC building blocks
(integration with application-specific blocks, MPSoC, etc.).
\end{itemize}

\subsubsection{Automatic Parallelization}
% FIXME:LIP FIXME:PF FIXME:CA
% Paul, I am not sure this is really a state of the art
% Christophe, what you sent me is in obsolete/body.tex
%\mustbecompleted{
%Hardware is inherently parallel. On the other hand, high level languages,
%like C or Fortran, are abstractions of the processors of the 1970s, and
%hence are sequential. One of the aims of an HLS tool is therefore to
%extract hidden parallelism from the source program, and to infer enough
%hardware operators for its efficient exploitation.
%\\
%Present day HLS tools search for parallelism in linear pieces of code
%acting only on scalars -- the so-called basic blocks. On the other hand,
%it is well known that most programs, especially in the fields of signal
%processing and image processing, spend most of their time executing loops
%acting on arrays. Efficient use of the large amount of hardware available
%in the next generation of FPGA chips necessitates parallelism far beyond
%what can be extracted from basic blocks only.

%The Compsys team of LIP has built an automatic parallelizer, Syntol, which
%handles restricted C programs -- the well known polyhedral model --,
%computes dependences and builds a symbolic schedule. The schedule is
%a specification for a parallel program. The parallelism itself can be
%expressed in several ways: as a system of threads, or as data-parallel
%operations, or as a pipeline. In the context of the COACH project, one
%of the tasks will be to decide which form of parallelism is best suited
%to hardware, and how to convey the results of Syntol to the actual
%synthesis tools. One of the advantages of this approach is that the
%resulting degree of parallelism can be easily controlled, e.g. by
%adjusting the number of threads, as a means of exploring the
%area / performance tradeoff of the resulting design.

%Another point is that potentially parallel programs necessarily involve
%arrays: two operations which write to the same location must be executed
%in sequence. In synthesis, arrays translate to memory. However, in FPGAs,
%the amount of on-chip memory is limited, and access to an external memory
%has a high time penalty. Hence the importance of reducing the size of
%temporary arrays to the minimum necessary to support the requested degree
%of parallelism. Compsys has developed a stand-alone tool, Bee, based
%on research by A. Darte, F. Baray and C. Alias, which can be extended
%into a memory optimizer for COACH.
%}

The problem of compiling sequential programs for parallel computers
has been studied since the advent of the first parallel architectures
in the 1970s. The basic approach consists in applying program transformations
which exhibit or increase the potential parallelism, while guaranteeing
the preservation of the program semantics. Most of these transformations
just reorder the operations of the program; some of them modify its
data structures. Dependences (exact or conservative) are checked to guarantee
the legality of the transformation.

This has led to the invention of many loop transformations (loop fusion,
loop splitting, loop skewing, loop interchange, loop unrolling, ...)
which interact in complicated ways. More recently, it has been noticed
that all of these are just changes of basis in the iteration domain of
the program. This observation has led to the polyhedral model, in
which the combination of two transformations is simply a matrix product.
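
For instance (a standard two-dimensional illustration, independent of any
particular tool), loop interchange and loop skewing on an iteration vector
$(i, j)$ are the unimodular maps
\[
T_{\mathrm{int}} =
\left(\begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array}\right),
\qquad
T_{\mathrm{skew}} =
\left(\begin{array}{cc} 1 & 0 \\ 1 & 1 \end{array}\right),
\qquad
T_{\mathrm{skew}}\,T_{\mathrm{int}} =
\left(\begin{array}{cc} 0 & 1 \\ 1 & 1 \end{array}\right),
\]
and applying skewing after interchange sends iteration $(i, j)$ to
$(j,\, i + j)$: the composition is indeed a single matrix product. Legality is
checked on dependence distance vectors: a transformation is legal when it maps
every distance vector to a lexicographically positive vector; for example,
$T_{\mathrm{int}}$ maps $d = (0, 1)$ to $(1, 0)$ and is therefore legal for
that dependence.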

As a side effect, it has been observed that the polytope model is a useful
tool for many other optimizations, like memory reduction and locality
improvement. Another point is
that the polyhedral model \emph{stricto sensu} applies only to
very regular programs. Its extension to more general programs is
an active research subject.
