% vim:set spell:
% vim:spell spelllang=en:

Our project covers several critical domains of system design in order
to achieve high-performance computing. Starting from a high-level description, we aim
to automatically generate both the hardware and software components of the system.

\subsubsection{High Performance Computing}
% A market swallowed up by GPGPU architectures such as Nvidia's Fermi; CUDA programming language
The High-Performance Computing (HPC) world is composed of three main families of architectures:
many-core, GPGPU (General-Purpose computation on Graphics Processing Units) and FPGA.
The first two families dominate the market, benefiting
from the strength and influence of mass-market leaders (Intel, Nvidia).
%such as Intel for many-core CPU and Nvidia for GPGPU.
In this market, FPGA architectures are emerging and very promising.
By adapting the architecture to the software, % (the opposite is done in the other families)
FPGA architectures enable better performance
(typically speedups between 10x and 100x)
while occupying less space and consuming less energy (and dissipating less heat).
However, using FPGAs presents significant challenges~\cite{hpc06a}.
First, the operating frequency of an FPGA is low compared to that of a high-end microprocessor.
Second, according to Amdahl's law, HPC/FPGA application performance is unusually sensitive
to the implementation quality~\cite{hpc06b}.
% Thus, the performance strongly relies on the detected parallelism.
% (to summarize the last two points)
Finally, efficient design methodologies are required in order to
hide FPGA complexity and the underlying implementation subtleties from HPC users,
so that they do not have to change their habits and can achieve design productivity
equivalent to that of the other families~\cite{hpc07a}.
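% Illustrative sketch only: the notation and the numerical values below are assumptions, not taken from the cited references.
As a rough illustration of the second point (a sketch in our own notation, with hypothetical values), let $p$ be the fraction of the execution time that is offloaded to the FPGA
and $s$ the speedup obtained on that fraction. Amdahl's law then gives the overall speedup as
\[
  S(p,s) = \frac{1}{(1-p) + p/s} .
\]
For instance, with $p = 0.95$ and $s = 100$ we obtain $S \approx 16.8$, whereas an implementation
reaching only $s = 10$ on the same fraction gives $S \approx 6.9$. In both cases $S$ stays below
$1/(1-p) = 20$, so the fraction of the application that the implementation manages to accelerate
matters as much as the raw acceleration.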
% State of the art: FPGA

HPC/FPGA hardware is only now emerging and is at an early commercial stage,
but design techniques have not yet caught up.
Industrial (Mitrionics~\cite{hpc08}, Gidel~\cite{hpc09}, Convey Computer~\cite{hpc10}) and academic (CHREC)
research on HPC/FPGA is mainly conducted in the USA.
None of the approaches developed in this research fully addresses the
challenges described above. For example, Convey Computer proposes application-specific instruction set extensions of x86 cores in an FPGA accelerator,
but the generation of extensions is not automated and requires hardware design skills.
Mitrionics has an elegant solution based on a compute engine specifically
developed for high-performance execution in FPGAs. Unfortunately, the design flow
is based on a new programming language (Mitrion-C), which implies extra designer effort and poor portability.
% tool relying on operator libraries (XtremeData),
% Should we mention the OpenFPGA consortium, whose goal is "to accelerate the incorporation of reconfigurable computing technology in high-performance and enterprise applications"?

Thus, much effort is required to develop design tools that translate high-level
language programs into FPGA configurations.
Moreover, as already noted in~\cite{hpc11}, Dynamic Partial Reconfiguration~\cite{hpc12}
(DPR, which allows part of the FPGA to be changed while the rest keeps working)
appears very promising for improving HPC performance as well as reducing the required area.

\subsubsection{System Synthesis}
Today, several solutions for system design are proposed and commercialized.
The existing commercial or free tools do not
cover the whole system synthesis process in a fully automatic way. Moreover,
they are bound to a particular device family and to a particular IP library.
The most commonly used are provided by Altera and Xilinx to promote their
FPGA devices. These two representative tools, used to synthesize SoCs on FPGAs,
are introduced below.
\\
The Xilinx System Generator for DSP~\cite{system-generateur-for-dsp} is a
plug-in to Simulink that enables designers to develop high-performance DSP
systems for Xilinx FPGAs.
Designers can design and simulate a system using MATLAB and Simulink. The
tool then automatically generates synthesizable Hardware Description
Language (HDL) code mapped to Xilinx pre-optimized algorithms.
However, this tool targets only DSP-based algorithms and Xilinx FPGAs, and
cannot handle a complete SoC. Thus, it is not really a system synthesis tool.
\\
In contrast, SOPC Builder~\cite{spoc-builder} allows the designer to describe a
system, synthesize it, program it into a target FPGA and upload a
software application.
% FIXME(C2H from Altera: runs fast but uses a monstrous amount of resources)
Nevertheless, SOPC Builder does not provide any facilities to synthesize
coprocessors. The system designer must provide the synthesizable description
with a suitable bus interface. Design space exploration is thus limited,
and SystemC simulation is possible neither at the transactional level nor at the
cycle-accurate level.
\\
In addition, Xilinx System Generator and SOPC Builder are closed worlds,
since each imposes its own IPs, which are not interchangeable.

\subsubsection{High Level Synthesis}
High Level Synthesis (HLS) translates a sequential algorithmic description and a
set of constraints (area, power, frequency, ...) to a micro-architecture at
Register Transfer Level (RTL).
Several academic and commercial tools are available today. The most common
tools are SPARK~\cite{spark04}, GAUT~\cite{gaut08} and UGH~\cite{ugh08} in the
academic world, and CATAPULTC~\cite{catapult-c}, PICO~\cite{pico} and
CYNTHESIZER~\cite{cynthetizer} in the commercial world. Despite their
maturity, their usage is restricted by the following limitations:
\begin{itemize}
\item HLS tools are not integrated into an architecture and system exploration tool.
Thus, a designer who needs to accelerate a software part of the system must adapt it manually
to the HLS input dialect and perform engineering work to exploit the synthesis result
at the system level.
\item HLS tools take into account only one or a few constraints simultaneously, while realistic
designs are multi-constrained.
Moreover, a low power consumption constraint is mandatory for embedded systems.
However, it is not yet well handled, or not handled at all, by the available synthesis tools.
\item Parallelism is extracted from the initial algorithmic specification. To get more parallelism or to reduce
the amount of memory required in the SoC, the user must rewrite the algorithmic specification, even though
techniques such as polyhedral transformations exist to increase the intrinsic parallelism.
\item While they support a limited set of loop transformations such as loop unrolling and loop pipelining, current HLS tools
do not support design space exploration, either through automatic loop transformations or through
memory mapping.
\item Although they share the same input language (C/C++), they are sensitive to the style in
which the algorithm is written. Consequently, engineering work is required to switch from
one tool to another.
\item They do not accurately respect the frequency constraint when they target an FPGA device.
Their error is about 10 percent. This is problematic when the generated component is integrated
into a SoC, since it will slow down the whole system.
\end{itemize}
Given these limitations, a new generation of tools is needed to reduce the gap
between the specification of a heterogeneous system and its hardware implementation.
\mustbecompleted {FIXME :: Add book reference + D\&T}
\subsubsection{Application Specific Instruction Processors}

ASIPs (Application-Specific Instruction-Set Processors) are programmable
processors in which both the instruction set and the micro-architecture have
been tailored to a given application domain (e.g. video processing), or to a
specific application. This specialization usually offers a good compromise
between performance (w.r.t. a pure software implementation on an embedded
CPU) and flexibility (w.r.t. an application-specific hardware co-processor).
In spite of their obvious advantages, using and designing ASIPs remains a
difficult task, since it involves designing both a micro-architecture and a
compiler for this architecture. Besides, to our knowledge, there is still
no open-source design flow available for ASIP design, even though such a tool
would be valuable in the
context of a system-level design exploration tool.
\par
In this context, ASIP design based on Instruction Set Extensions (ISEs) has
received a lot of interest~\cite{NIOS2,ST70}, as it makes micro-architecture synthesis
more tractable\footnote{ISEs rely on a template micro-architecture in which
only a small fraction of the architecture has to be specialized.} and helps ASIP
designers to focus on compilers, for which there are still many open
problems~\cite{ARC08}.
This approach, however, has a strong weakness: it also significantly reduces
the opportunities for achieving good speedups (most speedups remain between 1.5x and
2.5x), since ISE performance is generally limited by I/O constraints, as
ISEs generally rely on the main CPU register file to access data.

% (
%automatically extracting ISE candidates from application code \cite{CODES04},
%performing efficient instruction selection and/or storage resource (register)
%allocation \cite{FPGA08}).
To cope with this issue, recent approaches~\cite{DAC09,CODES08,TVLSI06} advocate the use of
micro-architectural ISE models in which the coupling between the processor micro-architecture
and the ISE component is tightened so as to allow the ISE to overcome the register
I/O limitations. However, these approaches generally tackle the problem from a compiler/simulation
point of view and do not address the problem of generating synthesizable representations for
these models.

We therefore strongly believe that there is a need for an open framework which
would allow researchers and system designers to:
\begin{itemize}
\item Explore the various levels of interaction between the original CPU micro-architecture
and its extension (for example through a Domain Specific Language targeted at micro-architecture
specification and synthesis).
\item Retarget the compiler instruction-selection passes (or prototype new passes) so as
to be able to take advantage of these ISEs.
\item Provide a complete system-level integration for using ASIPs as SoC building blocks
(integration with application-specific blocks, MPSoC, etc.).
\end{itemize}

\subsubsection{Automatic Parallelization}
% FIXME:LIP FIXME:PF FIXME:CA
% Paul, I am not sure this is really a state of the art
% Christophe, what you sent me is in obsolete/body.tex
%\mustbecompleted{
%Hardware is inherently parallel. On the other hand, high level languages,
%like C or Fortran, are abstractions of the processors of the 1970s, and
%hence are sequential. One of the aims of an HLS tool is therefore to
%extract hidden parallelism from the source program, and to infer enough
%hardware operators for its efficient exploitation.
%\\
%Present day HLS tools search for parallelism in linear pieces of code
%acting only on scalars -- the so-called basic blocks. On the other hand,
%it is well known that most programs, especially in the fields of signal
%processing and image processing, spend most of their time executing loops
%acting on arrays. Efficient use of the large amount of hardware available
%in the next generation of FPGA chips necessitates parallelism far beyond
%what can be extracted from basic blocks only.

%The Compsys team of LIP has built an automatic parallelizer, Syntol, which
%handles restricted C programs -- the well known polyhedral model --,
%computes dependences and builds a symbolic schedule. The schedule is
%a specification for a parallel program. The parallelism itself can be
%expressed in several ways: as a system of threads, or as data-parallel
%operations, or as a pipeline. In the context of the COACH project, one
%of the tasks will be to decide which form of parallelism is best suited
%to hardware, and how to convey the results of Syntol to the actual
%synthesis tools. One of the advantages of this approach is that the
%resulting degree of parallelism can be easily controlled, e.g. by
%adjusting the number of threads, as a means of exploring the
%area / performance tradeoff of the resulting design.

%Another point is that potentially parallel programs necessarily involve
%arrays: two operations which write to the same location must be executed
%in sequence. In synthesis, arrays translate to memory. However, in FPGAs,
%the amount of on-chip memory is limited, and access to an external memory
%has a high time penalty. Hence the importance of reducing the size of
%temporary arrays to the minimum necessary to support the requested degree
%of parallelism. Compsys has developed a stand-alone tool, Bee, based
%on research by A. Darte, F. Baray and C. Alias, which can be extended
%into a memory optimizer for COACH.
%}
The problem of compiling sequential programs for parallel computers
has been studied since the advent of the first parallel architectures
in the 1970s. The basic approach consists in applying program transformations
which expose or increase the potential parallelism, while guaranteeing
the preservation of the program semantics. Most of these transformations
just reorder the operations of the program; some of them modify its
data structures. Dependences (exact or conservative) are checked to guarantee
the legality of the transformation.

This has led to the invention of many loop transformations (loop fusion,
loop splitting, loop skewing, loop interchange, loop unrolling, ...)
which interact in a complicated way. More recently, it has been noticed
that all of these are just changes of basis in the iteration domain of
the program. This has led to the invention of the polyhedral model, in
which the combination of two transformations is simply a matrix product.

As a side effect, it has been observed that the polyhedral model is a useful
tool for many other optimizations, such as memory reduction and locality
improvement. Another point is
that the polyhedral model \emph{stricto sensu} applies only to
very regular programs. Its extension to more general programs is
an active research subject.
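% Illustrative sketch only: the matrices, the notation and the dependence vector below are assumptions chosen for the example.

As a small illustration of the change-of-basis view mentioned above (a sketch in our own notation,
restricted to a two-deep loop nest with uniform dependences), let $(i,j)^T$ be the iteration vector.
Loop interchange and loop skewing can then be written as
\[
  T_{\mathrm{int}} = \left(\begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array}\right),
  \qquad
  T_{\mathrm{skew}} = \left(\begin{array}{cc} 1 & 0 \\ 1 & 1 \end{array}\right),
\]
and skewing followed by interchange is the product
\[
  T = T_{\mathrm{int}}\, T_{\mathrm{skew}}
    = \left(\begin{array}{cc} 1 & 1 \\ 1 & 0 \end{array}\right),
  \qquad
  T \left(\begin{array}{c} i \\ j \end{array}\right)
    = \left(\begin{array}{c} i+j \\ i \end{array}\right).
\]
Legality can be checked on dependence distance vectors: the transformation is legal if every
transformed distance vector remains lexicographically positive. With the hypothetical distance
vector $d = (1,-1)^T$, interchange alone gives $T_{\mathrm{int}}\, d = (-1,1)^T$ and is therefore
illegal, whereas the combined transformation gives $T d = (0,1)^T$ and is legal.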
%\subsubsection{High Performance Computing}
%Accelerating high-performance computing (HPC) applications with field-programmable
%gate arrays (FPGAs) can potentially improve performance.
%However, using FPGAs presents significant challenges~\cite{hpc06a}.
%First, the operating frequency of an FPGA is low compared to a high-end microprocessor.
%Second, according to Amdahl's law, HPC/FPGA application performance is unusually sensitive
%to the implementation quality~\cite{hpc06b}.
%Finally, high-performance computing programmers are a highly sophisticated but scarce
%resource. Such programmers are expected to readily use new technology but lack the time
%to learn a completely new skill such as logic design~\cite{hpc07a}.
%\\
%HPC/FPGA hardware is only now emerging and in early commercial stages,
%but these techniques have not yet caught up.
%Thus, much effort is required to develop design tools that translate high level
%language programs to FPGA configurations.