source: anr/section-3.1.tex @ 251

Last change on this file since 251 was 247, checked in by coach, 14 years ago
1	% vim:set spell:
2	% vim:spell spelllang=en:
3
4	Our project covers several critical domains in system design in order
5	to achieve high performance computing. Starting from a high-level description, we aim
6	to automatically generate both the hardware and the software components of the system.
7
8	\subsubsection{High Performance Computing}
9	% A market swallowed up by GPGPU architectures such as Nvidia's Fermi; CUDA programming language
10	The High-Performance Computing (HPC) world is composed of three main families of architectures:
11	many-core, GPGPU (General-Purpose computing on Graphics Processing Units) and FPGA.
12	The first two families dominate the market, benefiting from
13	the strength and influence of mass-market leaders (Intel, Nvidia).
14	%such as Intel for many-core CPU and Nvidia for GPGPU.
15	In this market, FPGA architectures are emerging and look very promising.
16	By adapting the architecture to the software, % (the opposite is done in the other families)
17	FPGA architectures enable better performance
18	(typically speedups between 10x and 100x)
19	while occupying less area and consuming less energy (and producing less heat).
20	However, using FPGAs presents significant challenges~\cite{hpc06a}.
21	First, the operating frequency of an FPGA is low compared to a high-end microprocessor.
22	Second, according to Amdahl's law, HPC/FPGA application performance is unusually sensitive
23	to the implementation quality~\cite{hpc06b}.
24	% Thus, the performance strongly relies on the detected parallelism.
25	% (to summarize the last 2 points)
26	Finally, efficient design methodologies are required in order to
27	hide FPGA complexity and the underlying implementation subtleties from HPC users,
28	so that they do not have to change their habits and can achieve a design productivity
29	equivalent to that of the other families~\cite{hpc07a}.
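\par
As a rough illustration of the second point (the numerical values below are assumptions
chosen for the example, not measurements), Amdahl's law bounds the overall speedup $S$
obtained when a fraction $p$ of the execution time is accelerated by a factor $s$:
\[
S(p,s) = \frac{1}{(1-p) + p/s}, \qquad \lim_{s \to \infty} S(p,s) = \frac{1}{1-p}.
\]
With a kernel accelerated by $s = 50$, offloading $p = 0.90$ of the execution time yields
$S \approx 8.5$, whereas offloading $p = 0.99$ yields $S \approx 33.6$. In this sketch, the
fraction of the application that the implementation manages to accelerate matters far more
than the raw speedup of the accelerated kernel.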
30
31	% FPGA state of the art
32	HPC/FPGA hardware is only now emerging and is in its early commercial stages,
33	but the associated design techniques have not yet caught up.
34	Industrial (Mitrionics~\cite{hpc08}, Gidel~\cite{hpc09}, Convey Computer~\cite{hpc10}) and academic (CHREC)
35	research on HPC/FPGA is mainly conducted in the USA.
36	None of the approaches developed in this research fully meets the
37	challenges described above. For example, Convey Computer proposes application-specific instruction-set extensions of x86 cores in an FPGA accelerator,
38	but the generation of these extensions is not automated and requires hardware design skills.
39	Mitrionics has an elegant solution based on a compute engine specifically
40	developed for high-performance execution in FPGAs. Unfortunately, the design flow
41	is based on a new programming language (Mitrion-C), implying significant designer effort and poor portability.
42	% tool relying on operator libraries (XtremeData),
43	% Should we mention the OpenFPGA consortium, whose goal is: "to accelerate the incorporation of reconfigurable computing technology in high-performance and enterprise applications"?
44
45	Thus, much effort is required to develop design tools that translate high-level
46	language programs into FPGA configurations.
47	Moreover, as already noted in~\cite{hpc11}, Dynamic Partial Reconfiguration~\cite{hpc12}
48	(DPR, which allows part of the FPGA to be reconfigured while the rest keeps running)
49	appears very promising for improving HPC performance as well as reducing the required area.
50
51	\subsubsection{System Synthesis}
52	Today, several solutions for system design are proposed and commercialized.
53	However, the existing commercial or free tools do not
54	cover the whole system synthesis process in a fully automatic way. Moreover,
55	they are bound to a particular device family and to a particular IP library.
56	The most commonly used ones are provided by \altera and \xilinx to promote their
57	FPGA devices. These two representative tools, used to synthesize SoCs on FPGAs,
58	are introduced below.
59	\\
60	The \xilinx System Generator for DSP~\cite{system-generateur-for-dsp} is a
61	plug-in to Simulink that enables designers to develop high-performance DSP
62	systems for \xilinx FPGAs.
63	Designers can design and simulate a system using MATLAB and Simulink. The
64	tool will then automatically generate synthesizable Hardware Description
65	Language (HDL) code mapped to \xilinx pre-optimized algorithms.
66	However, this tool targets only DSP-based algorithms and \xilinx FPGAs, and
67	cannot handle a complete SoC. Thus, it is not really a system synthesis tool.
68	\\
69	In contrast, SOPC Builder~\cite{spoc-builder} allows designers to describe a
70	system, synthesize it, program it into a target FPGA and upload a
71	software application.
72	% FIXME(C2H from \altera, runs fast but uses a monstrous amount of resources)
73	Nevertheless, SOPC Builder does not provide any facilities to synthesize
74	coprocessors. The system designer must provide the synthesizable description
75	with a suitable bus interface. Design space exploration is thus limited,
76	and SystemC simulation is possible neither at the transactional level nor at the cycle-accurate
77	level.
78	\\
79	In addition, \xilinx System Generator and SOPC Builder are closed worlds,
80	since each one imposes its own IPs, which are not interchangeable.
81	%By using SOPC Builder~\cite{spoc-builder} from \altera, designers can select and
82	%parameterize components from an extensive drop-down list of IP cores (I/O core, DSP,
83	%processor, bus core, ...) as well as incorporate their own IP.
84	%Designers can then generate a synthesized netlist, simulation test bench and custom
85	%software library that reflect the hardware configuration.
86	%Nevertheless, SOPC Builder does not provide any facilities to synthesize coprocessors and to
87	%simulate the platform at a high design level (systemC).
88	%In addition, SOPC Builder is proprietary and only works together with \altera's Quartus compilation
89	%tool to implement designs on \altera devices (Stratix, Arria, Cyclone).
90	%PICO~\cite{pico} and CATAPULT-C~\cite{catapult-c} allow to synthesize
91	%coprocessors from a C++ description.
92	%Nevertheless, they can only deal with data dominated applications and they do not handle
93	%the platform level.
94	%Similarly, the System Generator for DSP~\cite{system-generateur-for-dsp} is a plug-in to
95	%Simulink that enables designers to develop high-performance DSP systems for \xilinx FPGAs.
96	%Designers can design and simulate a system using MATLAB and Simulink. The tool will then
97	%automatically generate synthesizable Hardware Description Language (HDL) code mapped to
98	%\xilinx pre-optimized macro-cells.
99	%However, this tool targets only DSP based algorithms.
100	%\\
101	%Consequently, a designer developing an embedded system needs to master four different
102	%design environments:
103	%\begin{enumerate}
104	% \item a virtual prototyping environment such as SoCLib for system level exploration,
105	% \item an architecture compiler (such as SOPC Builder from \altera, or System generator
106	% from \xilinx) to define the hardware architecture,
107	% \item one or several HLS tools (such as PICO~\cite{pico} or CATAPULT-C~\cite{catapult-c}) for
108	% coprocessor synthesis,
109	% \item and finally backend synthesis tools (such as Quartus or Synopsys) for the bit-stream generation.
110	%\end{enumerate}
111	%Furthermore, mixing these tools requires an important interfacing effort and this makes
112	%the design process very complex and achievable only by designers skilled in many domains.
113
114	\subsubsection{High Level Synthesis}
115	High Level Synthesis translates a sequential algorithmic description and a
116	set of constraints (area, power, frequency, ...) into a micro-architecture at
117	the Register Transfer Level (RTL).
118	Several academic and commercial tools are available today. The most common
119	tools are SPARK~\cite{spark04}, GAUT~\cite{gaut08}, UGH~\cite{ugh08} in the
120	academic world and CATAPULTC~\cite{catapult-c}, PICO~\cite{pico} and
121	CYNTHETIZER~\cite{cynthetizer} in the commercial world. Despite their
122	maturity, their use is restricted by the following limitations~\cite{IEEEDT,CATRENE,HLSBOOK}:
123	\begin{itemize}
124	\item The HLS tools are not integrated into an architecture and system exploration tool.
125	Thus, a designer who needs to accelerate a software part of the system must manually adapt it
126	to the HLS input dialect and perform engineering work to exploit the synthesis result
127	at the system level,
128	\item Current HLS tools cannot target both control-oriented and data-oriented applications,
129	\item HLS tools take into account only one or a few constraints simultaneously, while realistic
130	designs are multi-constrained.
131	Moreover, a low power consumption constraint is mandatory for embedded systems,
132	yet it is handled poorly or not at all by the available synthesis tools,
133	\item The parallelism is extracted from the initial algorithmic specification.
134	To get more parallelism or to reduce the amount of memory required in the SoC, the user
135	must rewrite the algorithmic specification, although techniques such as polyhedral
136	transformations exist to increase the intrinsic parallelism,
137	\item While they support limited loop transformations such as loop unrolling and loop
138	pipelining, current HLS tools do not support design space exploration, either
139	through automatic loop transformations or through memory mapping,
140	\item Despite having the same input language (C/C++), they are sensitive to the style in
141	which the algorithm is written. Consequently, engineering work is required to switch from
142	one tool to another,
143	\item They do not accurately meet the frequency constraint when they target an FPGA device.
144	Their error is about 10 percent. This is problematic when the generated component is integrated
145	in a SoC, since it will slow down the whole system.
146	\end{itemize}
147	Given these limitations, a new generation of tools is needed to reduce the gap
148	between the specification of a heterogeneous system and its hardware implementation~\cite{HLSBOOK,IEEEDT}.
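\par
As a small worked example of the last limitation in the list above (assuming, purely for
illustration, a single clock domain and a target frequency of 100~MHz), a 10 percent
frequency error gives
\[
f_{\mathrm{achieved}} \approx 0.9 \times 100~\mathrm{MHz} = 90~\mathrm{MHz}, \qquad
T_{\mathrm{clk}} = \frac{1}{f_{\mathrm{achieved}}} \approx 11.1~\mathrm{ns}
\mbox{ instead of } 10~\mathrm{ns},
\]
so that, under this assumption, every component sharing the clock of the generated
coprocessor runs about 10 percent slower than specified.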
149
150	\subsubsection{Application Specific Instruction Processors}
151
152	ASIPs (Application-Specific Instruction-Set Processors) are programmable
153	processors in which both the instruction set and the micro-architecture have
154	been tailored to a given application domain (e.g. video processing), or to a
155	specific application. This specialization usually offers a good compromise
156	between performance (w.r.t. a pure software implementation on an embedded
157	CPU) and flexibility (w.r.t. an application-specific hardware co-processor).
158	In spite of their obvious advantages, using and designing ASIPs remains a
159	difficult task, since it involves designing both a micro-architecture and a
160	compiler for this architecture. Besides, to our knowledge, there is still
161	no open-source design flow available for ASIP design, even though such a tool
162	would be valuable in the
163	context of a system-level design exploration tool.
164	\par
165	In this context, ASIP design based on Instruction Set Extensions (ISEs) has
166	received a lot of interest~\cite{NIOS2}, as it makes micro-architecture synthesis
167	more tractable\footnote{ISEs rely on a template micro-architecture in which
168	only a small fraction of the architecture has to be specialized.} and helps ASIP
169	designers to focus on compilers, for which there are still many open
170	problems~\cite{ARC08}.
171	This approach, however, has a severe weakness: it significantly reduces the
172	opportunities for achieving good speedups (most speedups remain between 1.5x and
173	2.5x), because ISE performance is generally limited by I/O constraints, as
174	ISEs usually rely on the main CPU register file to access data.
175
176	% (
177	%automatically extracting ISE candidates from application code \cite{CODES04},
178	%performing efficient instruction selection and/or storage resource (register)
179	%allocation \cite{FPGA08}).
180	To cope with this issue, recent approaches~\cite{DAC09,CODES08,TVLSI06} advocate the use of
181	micro-architectural ISE models in which the coupling between the processor micro-architecture
182	and the ISE component is tightened up so as to allow the ISE to overcome the register
183	I/O limitations. However, these approaches generally tackle the problem from a compiler/simulation
184	point of view and do not address the problem of generating synthesizable representations for
185	these models.
186
187	We therefore strongly believe that there is a need for an open framework that
188	would allow researchers and system designers to:
189	\begin{itemize}
190	\item Explore the various levels of interaction between the original CPU micro-architecture
191	and its extension (for example through a Domain Specific Language targeted at micro-architecture
192	specification and synthesis).
193	\item Retarget the compiler instruction-selection pass
194	(or prototype new passes) so as to be able to take advantage of these ISEs.
195	\item Provide a complete system-level integration for using ASIPs as SoC building blocks
196	(integration with application-specific blocks, MPSoC, etc.)
197	\end{itemize}
198
199	\subsubsection{Automatic Parallelization}
200
201	The problem of compiling sequential programs for parallel computers
202	has been studied since the advent of the first parallel architectures
203	in the 1970s. The basic approach consists in applying program transformations
204	which exhibit or increase the potential parallelism, while guaranteeing
205	the preservation of the program semantics. Most of these transformations
206	just reorder the operations of the program; some of them modify its
207	data structures. Dependences (exact or conservative) are checked to guarantee
208	the legality of the transformation.
209
210	This has led to the invention of many loop transformations (loop fusion,
211	loop splitting, loop skewing, loop interchange, loop unrolling, ...)
212	which interact in a complicated way. More recently, it has been noticed
213	that all of these are just changes of basis in the iteration domain of
214	the program. This has led to the introduction of the polyhedral model
215	\cite{FP:96,DRV:2000}, in which the combination of two transformations is
216	simply a matrix product.
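\par
The following sketch illustrates this matrix view on a hypothetical two-dimensional loop
nest with iteration vector $(i,j)^T$; the loop nest and its dependence distance vectors
$d_1 = (1,0)^T$ and $d_2 = (0,1)^T$ are assumptions made for the example, and only the
simple case of unimodular transformations is considered. Loop skewing $S$ followed by
loop interchange $P$ combines into a single transformation $T$:
\[
S = \left(\begin{array}{cc} 1 & 0 \\ 1 & 1 \end{array}\right), \qquad
P = \left(\begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array}\right), \qquad
T = P\,S = \left(\begin{array}{cc} 1 & 1 \\ 1 & 0 \end{array}\right),
\]
so that iteration $(i,j)^T$ is mapped to $(i+j,\, i)^T$. The transformed dependences are
\[
T\,d_1 = \left(\begin{array}{c} 1 \\ 1 \end{array}\right), \qquad
T\,d_2 = \left(\begin{array}{c} 1 \\ 0 \end{array}\right).
\]
Both vectors are lexicographically positive, so the transformation preserves the
dependences. Since their first component is strictly positive, every dependence is
carried by the new outer loop, and the iterations of the new inner loop can run in parallel.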
217
218	Since hardware is inherently parallel, finding parallelism in sequential
219	programs is an important prerequisite for HLS. The large FPGA chips of
220	today can accommodate much more parallelism than is available in basic blocks.
221	The polyhedral model is the ideal tool for finding more parallelism in
222	loops.
223
224	As a side effect, it has been observed that the polyhedral model is a useful
225	tool for many other optimizations, such as memory reduction and locality
226	improvement. Note, however,
227	that the polyhedral model \emph{stricto sensu} applies only to
228	very regular programs. Its extension to more general programs is
229	an active research subject.
230
231	%\subsubsection{High Performance Computing}
232	%Accelerating high-performance computing (HPC) applications with field-programmable
233	%gate arrays (FPGAs) can potentially improve performance.
234	%However, using FPGAs presents significant challenges~\cite{hpc06a}.
235	%First, the operating frequency of an FPGA is low compared to a high-end microprocessor.
236	%Second, based on Amdahl law, HPC/FPGA application performance is unusually sensitive
237	%to the implementation quality~\cite{hpc06b}.
238	%Finally, High-performance computing programmers are a highly sophisticated but scarce
239	%resource. Such programmers are expected to readily use new technology but lack the time
240	%to learn a completely new skill such as logic design~\cite{hpc07a} .
241	%\\
242	%HPC/FPGA hardware is only now emerging and in early commercial stages,
243	%but these techniques have not yet caught up.
244	%Thus, much effort is required to develop design tools that translate high level
245	%language programs to FPGA configurations.
246
