Our project covers several critical domains of system design in order
to achieve high-performance computing. Starting from a high-level description, we aim
at automatically generating both the hardware and the software components of the system.

\subsubsection{High Performance Computing}
Accelerating high-performance computing (HPC) applications with field-programmable
gate arrays (FPGAs) can potentially improve performance.
However, using FPGAs presents significant challenges~\cite{hpc06a}.
First, the operating frequency of an FPGA is low compared to that of a high-end microprocessor.
Second, by Amdahl's law, HPC/FPGA application performance is unusually sensitive
to the implementation quality~\cite{hpc06b}.
Finally, high-performance computing programmers are a highly sophisticated but scarce
resource. Such programmers are expected to readily use new technology but lack the time
to learn a completely new skill such as logic design~\cite{hpc07a}.
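As an illustration of this sensitivity (the numbers below are assumed for the sake of
the example and are not taken from the cited works), recall Amdahl's law: if a fraction $p$
of the execution time is accelerated by a factor $s$, the overall speedup is
\[
S(p,s) = \frac{1}{(1-p) + \frac{p}{s}} .
\]
With $p = 0.9$, a kernel accelerated by $s = 10$ yields $S = 1/0.19 \approx 5.3$, whereas
doubling the kernel speedup to $s = 20$ only yields $S = 1/0.145 \approx 6.9$, and no
implementation can exceed $1/(1-p) = 10$. Under these assumptions, both the fraction of
the application moved to the FPGA and the quality of the hardware kernel weigh heavily
on the final result.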
\\
HPC/FPGA hardware is only now emerging and is in its early commercial stages,
but the associated design techniques have not yet caught up.
Thus, much effort is required to develop design tools that translate high-level
language programs to FPGA configurations.

\subsubsection{System Synthesis}
Several system design solutions are commercially available today.
The most common are those provided by Altera and Xilinx to promote their
FPGA devices.
\\
The Xilinx System Generator for DSP~\cite{system-generateur-for-dsp} is a
plug-in to Simulink that enables designers to develop high-performance DSP
systems for Xilinx FPGAs.
Designers can design and simulate a system using MATLAB and Simulink. The
tool then automatically generates synthesizable Hardware Description
Language (HDL) code mapped to Xilinx pre-optimized algorithms.
However, this tool targets only DSP-based algorithms and Xilinx FPGAs, and
cannot handle a complete SoC. Thus, it is not really a system synthesis tool.
\\
In contrast, SOPC Builder~\cite{spoc-builder} allows the designer to describe a
system, synthesize it, program it into a target FPGA and upload a
software application.
% FIXME(C2H from Altera: runs fast but uses a huge amount of resources)
Nevertheless, SOPC Builder does not provide any facility to synthesize
coprocessors. The system designer must provide their synthesizable description
together with a suitable bus interface.
\\
In addition, Xilinx System Generator and SOPC Builder are closed worlds,
since each one imposes its own IPs, which are not interchangeable.
We can conclude that the existing commercial or free tools do not
cover the whole system synthesis process in a fully automatic way. Moreover,
they are bound to a particular device family and to a particular IP library.

\subsubsection{High Level Synthesis}
High Level Synthesis translates a sequential algorithmic description and a
set of constraints (area, power, frequency, ...) into a micro-architecture at
Register Transfer Level (RTL).
Several academic and commercial tools are available today. The most common
tools are SPARK~\cite{spark04}, GAUT~\cite{gaut08} and UGH~\cite{ugh08} in the
academic world, and Catapult C~\cite{catapult-c}, PICO~\cite{pico} and
Cynthesizer~\cite{cynthetizer} in the commercial world. Despite their
maturity, their use is limited for the following reasons:
\begin{itemize}
\item They do not accurately meet the frequency constraint when they target an FPGA device.
Their error is about 10 percent. This is problematic when the generated component is integrated
in a SoC, since it will slow down the whole system.
\item These tools take into account only one or a few constraints simultaneously, whereas realistic
designs are multi-constrained.
Moreover, a low power consumption constraint is mandatory for embedded systems,
yet it is not well handled by common synthesis tools.
\item The parallelism is extracted from the initial algorithm. To get more parallelism or to reduce
the amount of required memory, the user must rewrite it, although techniques such as polyhedral
transformations exist to increase the intrinsic parallelism.
\item Although they share the same input language (C/C++), they are sensitive to the style in
which the algorithm is written. Consequently, engineering work is required to switch from
one tool to another.
\item The HLS tools are not integrated into an architecture and system exploration tool.
Thus, a designer who needs to accelerate a software part of the system must adapt it manually
to the HLS input dialect and perform engineering work to exploit the synthesis result
at the system level.
\end{itemize}
Given these limitations, a new generation of tools is needed to reduce the gap
between the specification of a heterogeneous system and its hardware implementation.

\subsubsection{Application Specific Instruction Processors}
ASIPs (Application-Specific Instruction-set Processors) are programmable
processors in which both the instruction set and the micro-architecture have
been tailored to a given application domain (e.g., video processing), or to a
specific application. This specialization usually offers a good compromise
between performance (w.r.t.\ a pure software implementation on an embedded
CPU) and flexibility (w.r.t.\ an application-specific hardware coprocessor).
In spite of their obvious advantages, using and designing ASIPs remains a
difficult task, since it involves designing both a micro-architecture and a
compiler for this architecture. Besides, to our knowledge, there is still
no available open-source design flow\footnote{There are commercial tools
such a } for ASIP design, even though such a tool would be valuable in the
context of a system-level design exploration tool.
\par
In this context, ASIP design based on Instruction Set Extensions (ISEs) has
received a lot of interest~\cite{NIOS2,ST70}, as it makes micro-architecture synthesis
more tractable\footnote{ISEs rely on a template micro-architecture in which
only a small fraction of the architecture has to be specialized.} and helps ASIP
designers focus on compilers, for which there are still many open
problems~\cite{CODES04,FPGA08}.
This approach has a strong weakness, however: it significantly reduces the
opportunities for achieving good speedups (most speedups remain between 1.5x and
2.5x), since ISE performance is generally limited by I/O constraints, as
ISEs generally rely on the main CPU register file to access data.

% (
%automatically extracting ISE candidates from application code \cite{CODES04},
%performing efficient instruction selection and/or storage resource (register)
%allocation \cite{FPGA08}).
To cope with this issue, recent approaches~\cite{DAC09,DAC08} advocate the use of
micro-architectural ISE models in which the coupling between the processor micro-architecture
and the ISE component is tightened so as to allow the ISE to overcome the register
I/O limitations. However, these approaches tackle the problem from a compiler/simulation
point of view and do not address the problem of generating synthesizable representations for
these models.

We therefore strongly believe that there is a need for an open framework which
would allow researchers and system designers to:
\begin{itemize}
\item Explore the various levels of interaction between the original CPU micro-architecture
and its extension (for example through a Domain Specific Language targeted at micro-architecture
specification and synthesis).
\item Retarget the compiler instruction-selection passes (or prototype new passes) so as
to be able to take advantage of these ISEs.
\item Provide a complete system-level integration for using ASIPs as SoC building blocks
(integration with application-specific blocks, MPSoC, etc.).
\end{itemize}

\subsubsection{Automatic Parallelization}
% FIXME:LIP FIXME:PF FIXME:CA
% Paul, I am not sure this is really a state of the art
% Christophe, what you had sent me is in obsolete/body.tex
%\mustbecompleted{
%Hardware is inherently parallel. On the other hand, high level languages,
%like C or Fortran, are abstractions of the processors of the 1970s, and
%hence are sequential. One of the aims of an HLS tool is therefore to
%extract hidden parallelism from the source program, and to infer enough
%hardware operators for its efficient exploitation.
%\\
%Present day HLS tools search for parallelism in linear pieces of code
%acting only on scalars -- the so-called basic blocks. On the other hand,
%it is well known that most programs, especially in the fields of signal
%processing and image processing, spend most of their time executing loops
%acting on arrays. Efficient use of the large amount of hardware available
%in the next generation of FPGA chips necessitates parallelism far beyond
%what can be extracted from basic blocks only.
%\\
%The Compsys team of LIP has built an automatic parallelizer, Syntol, which
%handles restricted C programs -- the well known polyhedral model --,
%computes dependences and builds a symbolic schedule. The schedule is
%a specification for a parallel program. The parallelism itself can be
%expressed in several ways: as a system of threads, or as data-parallel
%operations, or as a pipeline. In the context of the COACH project, one
%of the tasks will be to decide which form of parallelism is best suited
%to hardware, and how to convey the results of Syntol to the actual
%synthesis tools. One of the advantages of this approach is that the
%resulting degree of parallelism can be easily controlled, e.g. by
%adjusting the number of threads, as a means of exploring the
%area / performance tradeoff of the resulting design.
%\\
%Another point is that potentially parallel programs necessarily involve
%arrays: two operations which write to the same location must be executed
%in sequence. In synthesis, arrays translate to memory. However, in FPGAs,
%the amount of on-chip memory is limited, and access to an external memory
%has a high time penalty. Hence the importance of reducing the size of
%temporary arrays to the minimum necessary to support the requested degree
%of parallelism. Compsys has developed a stand-alone tool, Bee, based
%on research by A. Darte, F. Baray and C. Alias, which can be extended
%into a memory optimizer for COACH.
%}

The problem of compiling sequential programs for parallel computers
has been studied since the advent of the first parallel architectures
in the 1970s. The basic approach consists in applying program transformations
which expose or increase the potential parallelism, while guaranteeing
the preservation of the program semantics. Most of these transformations
just reorder the operations of the program; some of them modify its
data structures. Dependences (exact or conservative) are checked to guarantee
the legality of the transformation.

This has led to the invention of many loop transformations (loop fusion,
loop splitting, loop skewing, loop interchange, loop unrolling, ...)
which interact in a complicated way. More recently, it has been noticed
that all of these are just changes of basis in the iteration domain of
the program. This has led to the invention of the polyhedral model, in
which the combination of two transformations is simply a matrix product.
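As a small illustration (the two-dimensional loop nest and the matrices below are
chosen for the example and are not taken from a specific tool), let the iterations of a
loop nest be indexed by the vector $(i,j)^T$. Loop interchange and loop skewing can then
be written as
\[
T_{\mathrm{int}} = \left(\begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array}\right),
\qquad
T_{\mathrm{skew}} = \left(\begin{array}{cc} 1 & 0 \\ 1 & 1 \end{array}\right),
\]
which map $(i,j)^T$ to $(j,i)^T$ and to $(i,i+j)^T$ respectively. Applying the interchange
first and the skewing second corresponds to the product
\[
T_{\mathrm{skew}}\,T_{\mathrm{int}} =
\left(\begin{array}{cc} 0 & 1 \\ 1 & 1 \end{array}\right),
\]
which maps $(i,j)^T$ to $(j,i+j)^T$. Checking the legality of the combined transformation
then amounts to checking this single matrix against the dependences of the program.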
\par
As a side effect, it has been observed that the polyhedral model is a useful
tool for many other optimizations, like memory reduction and locality
improvement. Another point is
that the polyhedral model \emph{stricto sensu} applies only to
very regular programs. Its extension to more general programs is
an active research subject.

\subsubsection{Interfaces}
\newcommand{\ip}{\sc ip}
\newcommand{\dma}{\sc dma}
\newcommand{\soc}{\sc SoC}
\newcommand{\mwmr}{\sc mwmr}
The hardware/software interface has been a difficult problem since the advent
of complex systems on chip. After the first co-design
environments~\cite{Coware,Polis,Ptolemy}, the Hardware Abstraction Layer
was defined so that software applications can be developed without knowledge of low-level
hardware implementation details. In~\cite{jerraya}, Yoo and Jerraya
propose an {\sc api} with extension ability instead of a unique hardware
abstraction layer. System-level communication frameworks have also been
introduced~\cite{JerrayaPetrot,mwmr}.
\par
A good abstraction of a hardware/software interface has been proposed
in~\cite{Jantsch}: it is composed of a software driver, a {\dma} and a
bus interface circuit. Automatic wrapping between bus protocols has
generated a lot of papers~\cite{Avnit,smith,Narayan,Alberto}. These works
do not use a {\dma}. In COACH, the hardware/software interface is designed at a
higher level and uses burst communication in the bus interface circuit to
improve communication performance.
\par
There are two important projects related to efficient interfaces for
data-flow {\ip}s: the work of Park and Diniz~\cite{Park01} and the
Lip6 work on {\mwmr}~\cite{mwmr}. Park and Diniz~\cite{Park01} proposed
a generic interface that can be parameterized to connect different
data-flow {\ip}s. This approach does not require the communications to be
statically known and proposes a runtime resolution of conflicting
accesses to the bus. To our knowledge this approach has not been pursued
further since 2003.
\par
{\mwmr}~\cite{mwmr} stands for both a computation model (multi-writer,
multi-reader {\sc fifo}) inherited from the Kahn Process Networks and a bus
interface circuit protocol. Like the work of Park and Diniz, {\mwmr}
does not assume a static communication flow. This makes the
software driver simple to write, but introduces additional complexity due
to the mutual exclusion locks necessary to protect the shared memory.
\par
We propose, in COACH, to use recent work on hardware/software
interfaces~\cite{FR-vlsi} that uses a {\em clever} {\dma} responsible for
managing data streams. An assumption is that the behavior of the {\ip}s can
be statically described. A similar choice has been made in the Faust
{\soc}~\cite{FAUST}, which includes the {\em smart memory engine} component.
Jantsch and O'Nils already noticed in~\cite{Jantsch} the huge complexity
of writing this hardware/software interface. In COACH, the interface will be
generated automatically; this is one goal of the CITI
contribution to COACH.