% Source: anr/section-3.1.tex @ 20 (last changed in revision 12, checked in by coach)
Our project covers several critical domains in system design in order
to achieve high-performance computing. Starting from a high-level description, we aim
at automatically generating both the hardware and software components of the system.

\subsubsection{High Performance Computing}
Accelerating high-performance computing (HPC) applications with field-programmable
gate arrays (FPGAs) can potentially improve performance.
However, using FPGAs presents significant challenges~\cite{hpc06a}.
First, the operating frequency of an FPGA is low compared to that of a high-end microprocessor.
Second, by Amdahl's law, HPC/FPGA application performance is unusually sensitive
to implementation quality~\cite{hpc06b}.
Finally, HPC programmers are a highly sophisticated but scarce
resource. Such programmers are expected to readily use new technology but lack the time
to learn a completely new skill such as logic design~\cite{hpc07a}.
\\
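To make the second point concrete, recall Amdahl's law, stated here as a reminder
with our own notation: if a fraction $p$ of the execution time is moved to the FPGA
and accelerated by a factor $s$, the overall speedup is
\[
  S(p,s) = \frac{1}{(1-p) + p/s} .
\]
As a purely illustrative example (these figures are assumptions, not taken from the
cited studies), with $p = 0.9$ an accelerator reaching $s = 20$ gives $S \approx 6.9$,
whereas a poorer implementation reaching only $s = 5$ gives $S \approx 3.6$, and no
implementation can exceed $1/(1-p) = 10$.
\\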
HPC/FPGA hardware is only now emerging and is in its early commercial stages,
but the associated design techniques and tools have not yet caught up.
Thus, much effort is required to develop design tools that translate high-level
language programs into FPGA configurations.

\subsubsection{System Synthesis}
Today, several solutions for system design have been proposed and commercialized.
The most common are those provided by Altera and Xilinx to promote their
FPGA devices.
\\
The Xilinx System Generator for DSP~\cite{system-generateur-for-dsp} is a
plug-in to Simulink that enables designers to develop high-performance DSP
systems for Xilinx FPGAs.
Designers can design and simulate a system using MATLAB and Simulink. The
tool then automatically generates synthesizable Hardware Description
Language (HDL) code mapped to Xilinx pre-optimized algorithms.
However, this tool targets only DSP-based algorithms and Xilinx FPGAs, and
cannot handle a complete SoC. Thus, it is not really a system synthesis tool.
\\
In contrast, SOPC Builder~\cite{spoc-builder} allows a designer to describe a
system, synthesize it, program it into a target FPGA and upload a
software application.
% FIXME (C2H from Altera: runs fast but uses a monstrous amount of resources)
Nevertheless, SOPC Builder does not provide any facility to synthesize
coprocessors. The system designer must provide a synthesizable description
with an appropriate bus interface.
\\
In addition, Xilinx System Generator and SOPC Builder are closed worlds,
since each one imposes its own IPs, which are not interchangeable.
We can conclude that the existing commercial or free tools do not
cover the whole system synthesis process in a fully automatic way. Moreover,
they are bound to a particular device family and to a particular IP library.

\subsubsection{High Level Synthesis}
High Level Synthesis (HLS) translates a sequential algorithmic description and a
set of constraints (area, power, frequency, \ldots) into a micro-architecture at
Register Transfer Level (RTL).
Several academic and commercial tools are available today. The most common
tools are SPARK~\cite{spark04}, GAUT~\cite{gaut08} and UGH~\cite{ugh08} in the
academic world, and CATAPULTC~\cite{catapult-c}, PICO~\cite{pico} and
CYNTHESIZER~\cite{cynthetizer} in the commercial world. Despite their
maturity, their use is restricted by the following limitations:
\begin{itemize}
\item They do not accurately meet the frequency constraint when they target an FPGA device.
Their error is about 10 percent. This is problematic when the generated component is integrated
in a SoC, since it will slow down the whole system.
\item These tools take into account only one or a few constraints simultaneously, while realistic
designs are multi-constrained.
Moreover, a low power consumption constraint is mandatory for embedded systems,
yet it is not well handled by common synthesis tools.
\item The parallelism is extracted from the initial algorithm. To get more parallelism or to reduce
the amount of required memory, the user must rewrite it, even though techniques such as polyhedral
transformations exist to increase the intrinsic parallelism.
\item Although they share the same input language (C/C++), they are sensitive to the style in
which the algorithm is written. Consequently, engineering work is required to switch from
one tool to another.
\item The HLS tools are not integrated into an architecture and system exploration tool.
Thus, a designer who needs to accelerate a software part of the system must adapt it manually
to the HLS input dialect and perform engineering work to exploit the synthesis result
at the system level.
\end{itemize}
Given these limitations, it is necessary to create a new generation of tools that reduces the gap
between the specification of a heterogeneous system and its hardware implementation.

\subsubsection{Application Specific Instruction Processors}
ASIPs (Application-Specific Instruction-Set Processors) are programmable
processors in which both the instruction set and the micro-architecture have
been tailored to a given application domain (e.g., video processing) or to a
specific application. This specialization usually offers a good compromise
between performance (w.r.t.\ a pure software implementation on an embedded
CPU) and flexibility (w.r.t.\ an application-specific hardware coprocessor).
In spite of their obvious advantages, using and designing ASIPs remains a
difficult task, since it involves designing both a micro-architecture and a
compiler for this architecture. Besides, to our knowledge, there is still
no available open-source design flow\footnote{There are commercial tools
such a } for ASIP design, even though such a tool would be valuable in the
context of a system-level design exploration tool.
\par
In this context, ASIP design based on Instruction Set Extensions (ISEs) has
received a lot of interest~\cite{NIOS2,ST70}, as it makes micro-architecture synthesis
more tractable\footnote{ISEs rely on a template micro-architecture in which
only a small fraction of the architecture has to be specialized.} and helps ASIP
designers focus on compilers, for which there are still many open
problems~\cite{CODES04,FPGA08}.
This approach has a strong weakness, however: it significantly reduces the
opportunities for achieving good speedups (most speedups remain between $1.5\times$ and
$2.5\times$), because ISE performance is generally limited by I/O constraints, as
ISEs generally rely on the main CPU register file to access data.

% (
% automatically extracting ISE candidates from application code \cite{CODES04},
% performing efficient instruction selection and/or storage resource (register)
% allocation \cite{FPGA08}).
To cope with this issue, recent approaches~\cite{DAC09,DAC08} advocate the use of
micro-architectural ISE models in which the coupling between the processor micro-architecture
and the ISE component is tightened so as to allow the ISE to overcome the register
I/O limitations. However, these approaches tackle the problem from a compiler/simulation
point of view and do not address the problem of generating synthesizable representations for
these models.

We therefore strongly believe that there is a need for an open framework which
would allow researchers and system designers to:
\begin{itemize}
\item Explore the various levels of interaction between the original CPU micro-architecture
and its extension (for example through a Domain Specific Language targeted at micro-architecture
specification and synthesis).
\item Retarget the compiler instruction-selection passes (or prototype new passes) so as
to be able to take advantage of these ISEs.
\item Provide complete system-level integration for using ASIPs as SoC building blocks
(integration with application-specific blocks, MPSoC, etc.).
\end{itemize}

\subsubsection{Automatic Parallelization}
% FIXME:LIP FIXME:PF FIXME:CA
% Paul, I am not sure this is really a state of the art
% Christophe, what you sent me is in obsolete/body.tex
\mustbecompleted{
Hardware is inherently parallel. On the other hand, high-level languages,
like C or Fortran, are abstractions of the processors of the 1970s, and
hence are sequential. One of the aims of an HLS tool is therefore to
extract hidden parallelism from the source program, and to infer enough
hardware operators for its efficient exploitation.
\\
Present-day HLS tools search for parallelism in linear pieces of code
acting only on scalars, the so-called basic blocks. On the other hand,
it is well known that most programs, especially in the fields of signal
processing and image processing, spend most of their time executing loops
acting on arrays. Efficient use of the large amount of hardware available
in the next generation of FPGA chips requires parallelism far beyond
what can be extracted from basic blocks only.
\\
The Compsys team of LIP has built an automatic parallelizer, Syntol, which
handles restricted C programs (the well-known polyhedral model),
computes dependences and builds a symbolic schedule. The schedule is
a specification for a parallel program. The parallelism itself can be
expressed in several ways: as a system of threads, as data-parallel
operations, or as a pipeline. In the context of the COACH project, one
of the tasks will be to decide which form of parallelism is best suited
to hardware, and how to convey the results of Syntol to the actual
synthesis tools. One of the advantages of this approach is that the
resulting degree of parallelism can be easily controlled, e.g., by
adjusting the number of threads, as a means of exploring the
area/performance tradeoff of the resulting design.
\\
Another point is that potentially parallel programs necessarily involve
arrays: two operations which write to the same location must be executed
in sequence. In synthesis, arrays translate to memory. However, in FPGAs,
the amount of on-chip memory is limited, and access to an external memory
has a high time penalty. Hence the importance of reducing the size of
temporary arrays to the minimum necessary to support the requested degree
of parallelism. Compsys has developed a stand-alone tool, Bee, based
on research by A. Darte, F. Baray and C. Alias, which can be extended
into a memory optimizer for COACH.
}

\subsubsection{Interfaces}
\newcommand{\ip}{\sc ip}
\newcommand{\dma}{\sc dma}
\newcommand{\soc}{\sc SoC}
\newcommand{\mwmr}{\sc mwmr}
Designing the hardware/software interface has been a difficult task since the advent
of complex systems on chip. After the first co-design
environments~\cite{Coware,Polis,Ptolemy}, the Hardware Abstraction Layer
was defined so that software applications can be developed without knowledge of
low-level hardware implementation details. In~\cite{jerraya}, Yoo and Jerraya
propose an {\sc api} with extension ability instead of a unique hardware
abstraction layer. System-level communication frameworks have also been
introduced~\cite{JerrayaPetrot,mwmr}.
\par
A good abstraction of a hardware/software interface has been proposed
in~\cite{Jantsch}: it is composed of a software driver, a {\dma} and a
bus interface circuit. Automatic wrapping between bus protocols has
been the subject of many papers~\cite{Avnit,smith,Narayan,Alberto}. These works
do not use a {\dma}. In COACH, the hardware/software interface is handled at a
higher level and uses burst communication in the bus interface circuit to
improve communication performance.
\par
There are two important projects related to the efficient interfacing of
data-flow {\ip}s: the work of Park and Diniz~\cite{Park01} and the
Lip6 work on {\mwmr}~\cite{mwmr}. Park and Diniz~\cite{Park01} proposed
a generic interface that can be parameterized to connect different
data-flow {\ip}s. This approach does not require the communications to be
statically known and proposes a runtime resolution of conflicting
accesses to the bus. To our knowledge, this approach has not been developed
further since 2003.
\par
{\mwmr}~\cite{mwmr} denotes both a computation model (multi-writer,
multi-reader {\sc fifo}) inherited from Kahn Process Networks and a bus
interface circuit protocol. Like the work of Park and Diniz, {\mwmr}
does not assume a static communication flow. This makes the
software driver simple to write, but introduces additional complexity due
to the mutual exclusion locks necessary to protect the shared memory.
\par
We propose, in COACH, to use recent work on the hardware/software
interface~\cite{FR-vlsi} that uses a {\em clever} {\dma} responsible for
managing data streams. The underlying assumption is that the behavior of the {\ip}s can
be statically described. A similar choice has been made in the Faust
{\soc}~\cite{FAUST}, which includes the {\em smart memory engine} component.
Jantsch and O'Nils already noted in~\cite{Jantsch} the huge complexity
of writing this hardware/software interface. In COACH, the interface will be
generated automatically; this is one goal of the CITI
contribution to COACH.
