% body.tex @ 375: last change on this file since 375 was 12, checked in by coach, 15 years ago

\section{Project context}
\hspace{2cm}\begin{scriptsize}\begin{verbatim}
% 1. PROJECT CONTEXT AND POSITIONING
% (1 page maximum) General presentation of the problem that the project proposes
% to address and of the working framework (fundamental research, industrial
% research or experimental development).
\end{verbatim}
\end{scriptsize}
An embedded system is an application integrated into one or several chips
in order to accelerate it or to embed it into a small device such as a personal
digital assistant (PDA).
This topic has been investigated since the 1980s using Application-Specific Integrated Circuits (ASICs),
Digital Signal Processors (DSPs) and parallel computing on multiprocessor machines or networks.
More recently, since the end of the 1990s, other technologies have appeared, such as Very Long Instruction Word (VLIW) processors,
Application-Specific Instruction-set Processors (ASIPs), Systems on Chip (SoCs) and
Multi-Processor SoCs (MPSoCs).
	17	\\
During the last decades, embedded systems were reserved to major industrial companies targeting high-volume markets,
because of the design and fabrication costs.
Nowadays, Field Programmable Gate Arrays (FPGAs), such as the Virtex-5 from Xilinx and the Stratix IV from Altera,
can implement a SoC with multiple processors and several coprocessors for less than 10K euros
per item. In addition, High Level Synthesis (HLS) is becoming more mature; it makes it possible to automate
the design and to drastically decrease its cost in terms of manpower. Thus, both FPGAs and HLS
tend to spread into HPC and to make it accessible to small companies targeting low-volume markets.
	25	\par
To get an efficient embedded system, the designer has to take into account the application characteristics when
choosing one of the former technologies.
This choice is not easy, and in most cases the designer has to try different technologies to retain the
best-suited one.
	30	\\
The first objective of COACH is to provide an open-source framework for designing embedded systems
on FPGA devices.
The COACH framework allows the designer to explore various software/hardware partitions of the
target application, to run timing and functional simulations, and to automatically generate both
the software and the synthesizable description of the hardware.
The main topics of the project are:
	37	\begin{itemize}
\item
Design space exploration: it consists in analysing the application running on the FPGA, defining the target
technology (SoC, MPSoC, ASIP, ...) and partitioning the tasks between hardware and software depending on
the technology choice. This exploration is driven mainly by throughput, latency and power consumption
criteria.
\item
Micro-architectural exploration: when hardware components are required, the HLS tools of the framework
generate them automatically. At this stage, the framework provides various HLS tools that allow
exploration of the micro-architectural design space. The exploration criteria are again throughput, latency
and power consumption.
	48	% FIXME
	49	%CA At this stage, preliminary source-level transformations will be
	50	%CA required to improve the efficiency of the target component.
	51	%CA COACH will also provide such facilities, such as automatic parallelization
	52	%CA and memory optimisation.
\item
Performance measurement: for each point of the design space, metrics are available for criteria
such as throughput, latency, power consumption, area, memory allocation and data locality.
They are evaluated using virtual prototyping, estimation or analysis methodologies.
\item
Targeted hardware technology: the COACH description of a system is independent of the FPGA family.
Every point of the design space can be implemented on any FPGA that has the required resources.
Basically, COACH handles both Altera and Xilinx FPGA families.
	61	\end{itemize}
As an extension of embedded system design, COACH also deals with High Performance Computing (HPC).
In HPC, the targeted application is an existing one running on a PC. COACH helps the designer
to accelerate it by migrating its critical parts into a SoC implemented on an FPGA plugged into the PC bus.
	65	\par
COACH is the result of the will of several laboratories to unify their know-how and skills in the
following domains: operating systems and hardware communication (TIMA, SITI), SoC and MPSoC (LIP6 and TIMA),
ASIP (IRISA) and HLS (LIP6, Lab-STIC and LIP). The project objective is to integrate these various
domains into a single free framework (licence ...) that hides these domains and their
different tools from the user as much as possible.
	71
	72
\subsection{Economic context and interest}
\hspace{2cm}\begin{scriptsize}\begin{verbatim}
% 1.1. ECONOMIC AND SOCIETAL CONTEXT AND ISSUES
% (2 pages maximum)
% Describe the economic, social and regulatory context of the project,
% with an analysis of the social, economic, environmental and
% industrial issues. If possible, give quantified arguments, for example the
% relevance and scope of the project with respect to economic demand (market
% analysis, trend analysis), competition analysis, cost reduction indicators,
% market prospects (fields of application, ...). Indicators of environmental
% gains, life cycle.
\end{verbatim}
\end{scriptsize}
Microelectronics makes it possible to integrate complicated functions into products, to increase their
commercial attractiveness and to improve their competitiveness. The multimedia and communication
sectors have taken advantage of microelectronics thanks to the development of
design methodologies and tools for real-time embedded systems. Many other sectors could
benefit from microelectronics if these methodologies and tools were adapted to their features.
The Non-Recurring Engineering (NRE) costs involved in designing and manufacturing an ASIC are
very high. An IC factory costs several billion euros, and fabricating
a specific circuit costs several million; for example, a conservative estimate for a 65\,nm ASIC project is 10 million USD.
Consequently, it is generally unfeasible to design and fabricate ASICs in
low volumes, and ICs are designed to cover a broad spectrum of applications at the cost of
degraded performance.
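
The volume argument above can be made concrete with a rough break-even estimate. The following is only a sketch under stated assumptions: the symbols $C_{\mathrm{NRE}}$, $c_{\mathrm{ASIC}}$, $c_{\mathrm{FPGA}}$ and $N$ are introduced here for illustration, the design cost is assumed to be comparable for both targets, all amounts are assumed to be expressed in the same currency, and the unit prices used in the numerical example are hypothetical.
\[
C_{\mathrm{ASIC}}(N) = C_{\mathrm{NRE}} + N\,c_{\mathrm{ASIC}},
\qquad
C_{\mathrm{FPGA}}(N) = N\,c_{\mathrm{FPGA}},
\qquad
N^{*} = \frac{C_{\mathrm{NRE}}}{c_{\mathrm{FPGA}} - c_{\mathrm{ASIC}}}.
\]
For instance, with $C_{\mathrm{NRE}} = 10^{7}$ (the 65\,nm estimate quoted above), a hypothetical FPGA unit price $c_{\mathrm{FPGA}} = 1000$ and a hypothetical ASIC unit price $c_{\mathrm{ASIC}} = 10$, the break-even volume is $N^{*} = 10^{7}/990 \approx 10^{4}$ units: below roughly ten thousand units, the FPGA solution is the cheaper one under these assumptions.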
	97	\\
Today, FPGAs are becoming important actors in the computational domain that was originally dominated
by microprocessors and ASICs. Just like microprocessors, FPGA-based systems can be reprogrammed
on a per-application basis. At the same time, FPGAs offer significant performance benefits over
microprocessor implementations for a number of applications. Although these benefits are still
generally an order of magnitude smaller than those of equivalent ASIC implementations, the low costs
(500 euros to 10K euros), fast time to market and flexibility of FPGAs make them an attractive
choice for low-to-medium volume applications.
Since their introduction in the mid-eighties, FPGAs have evolved from a simple,
low-capacity gate array technology to devices (Altera Stratix III, Xilinx Virtex-5) that
provide a mix of coarse-grained data path units, memory blocks, microprocessor cores,
on-chip A/D conversion, and gate counts in the millions. This high logic capacity makes it possible to implement
complex systems such as multi-processor platforms with application-dedicated coprocessors.
Table~\ref{fpga_market} shows the estimated worldwide FPGA market for the coming years, covering
various application domains. The ``high end'' lines concern only FPGAs with a logic capacity high enough
to implement complex systems.
This market is expanding significantly and is estimated at 914\,M\$ in 2012.
Using FPGAs limits the NRE costs to the design cost. This boosts the development of methodologies
and tools to automate design and reduce its cost.
\begin{table}\centering
\begin{tabular}{|l|r|r|r|}\hline
Segment             & 2010  & 2011  & 2012  \\\hline\hline
Communications      & 1,867 & 1,946 & 2,096 \\
High end            & 467   & 511   & 550   \\\hline
Consumer            & 550   & 592   & 672   \\
High end            & 53    & 62    & 75    \\\hline
Automotive          & 243   & 286   & 358   \\
High end            & -     & -     & -     \\\hline
Industrial          & 1,102 & 1,228 & 1,406 \\
High end            & 177   & 188   & 207   \\\hline
Military/Aero       & 566   & 636   & 717   \\
High end            & 56    & 65    & 82    \\\hline\hline
Total FPGA/PLD      & 4,659 & 5,015 & 5,583 \\
Total High-End FPGA & 753   & 826   & 914   \\\hline
\end{tabular}
\caption{\label{fpga_market} Gartner estimate of worldwide FPGA/PLD consumption (millions of \$)}
\end{table}
	134	\par
Today, several companies (Atipa, BlueArc, Bull, Chelsio, Convey, Cray, DataDirect, Dell, HP,
Wild Systems, IBM, Intel, Microsoft, Myricom, NEC, NVIDIA, etc.) are building systems in which the demand
for very high performance (HPC) takes precedence over other requirements. They tend to use the
highest-performing devices, such as multi-core CPUs, GPUs, large FPGAs and custom ICs, together with the most innovative
architectures and algorithms. These companies are present in various ``traditional'' applications and market
segments, such as computing clusters (ad hoc), servers and storage, networking and telecom, ASIC
emulation and prototyping, military/aerospace, etc. The HPC market size is estimated today by FPGA providers
at 214\,M\$.
This market is dominated by multi-core CPU and GPU based solutions, and the expansion
of FPGA-based solutions is limited by the lack of flow automation. Nowadays, there are neither commercial
nor free tools covering the whole design process.
For instance, with SOPC Builder from Altera, users can select and parameterize IP components
from an extensive drop-down list of communication, digital signal processor (DSP), microprocessor
and bus interface cores, as well as incorporate their own IP. Designers can then generate
a synthesized netlist, a simulation test bench and a custom software library that reflect the hardware
configuration.
Nevertheless, SOPC Builder does not provide any facilities to synthesize coprocessors \emph{(Steven: I
disagree; the C2H compiler bundled with SOPC Builder does a pretty good job at this)} or to
simulate the platform at a high design level (SystemC).
In addition, SOPC Builder is proprietary and only works together with Altera's Quartus compilation
tool to implement designs on Altera devices (Stratix, Arria, Cyclone).
PICO [CITATION] and CATAPULT [CITATION] make it possible to synthesize coprocessors from a C++ description.
Nevertheless, they can only deal with data-dominated applications and they do not handle the
platform level.
The Xilinx System Generator for DSP [http://www.xilinx.com/tools/sysgen.htm] is a plug-in to
Simulink that enables designers to develop high-performance DSP systems for Xilinx FPGAs.
Designers can design and simulate a system using MATLAB and Simulink. The tool then
automatically generates synthesizable Hardware Description Language (HDL) code mapped to Xilinx
pre-optimized algorithms.
However, this tool targets only DSP-based algorithms.
	165	\\
Consequently, designers developing an embedded system need to master, for example,
SoCLib for design exploration,
SOPC Builder at the platform level,
PICO for synthesizing the data-dominated coprocessors
and Quartus for design implementation.
This requires a significant tool-interfacing effort and makes the design process very complex
and achievable only by designers skilled in many domains.
The COACH project integrates all these tools into a single framework and hides them from the user.
The objective is to allow \textbf{pure software} developers to build embedded systems.
	175	\par
The combination of a framework dedicated to software developers and an FPGA target makes it possible to gain
market share over multi-core CPU and GPU based HPC solutions.
Moreover, one can expect that small and even very small companies will be able to offer embedded
systems and acceleration solutions for standard software applications at acceptable prices, thanks
to the elimination of the huge hardware investment required by ASIC-based solutions.
\\
This new market may explode as micro-computing did in the eighties. That success was due
to the low cost of the first micro-computers (compared to mainframes) and to the advent of high-level
programming languages, which allowed a large number of programmers to launch start-ups in software
engineering.
	186
	187	\subsection{Project position}
\hspace{2cm}\begin{scriptsize}\begin{verbatim}
% 1.2. PROJECT POSITIONING
% (2 pages maximum)
% Specify:
% - positioning of the project with respect to the context described above:
%   with regard to competing, complementary or earlier projects and research,
%   patents and standards.
% - positioning of the project with respect to the thematic axes of the call for projects.
% - positioning of the project at the European and international levels.
\end{verbatim}
\end{scriptsize}
	199	The aim of this project is to propose an open-source framework for architecture synthesis
	200	targeting mainly field programmable gate array circuits (FPGA).
	201	\\% LIP6/TIMA
	202	To evaluate the different architectures, the project uses the prototyping platform
	203	of the SoCLIB ANR project (2006-2009).
	204	\\% IRISA
	205	The project will also borrow from the ROMA ANR project (2007-2009) and the ongoing
	206	joint INRIA-STMicro Nano2012 project. In particular we will adapt existing pattern
	207	extraction algorithms and datapath merging techniques to the synthesis of customized
	208	ASIP processors.
	209	\\
\textcolor{gris75}{Steven: I suggest adding a link with the BioWic project: on the HPC
application side, we also hope to benefit from the experience in hardware acceleration of
bioinformatic algorithms/workflows gathered by the CAIRN group in the context of the ANR
BioWic project (2009-2011), so as to be able to validate the framework on
real-life HPC applications.}
	215
	216	\par
%%% 1 -- COULD EACH OF YOU PLEASE ADD (IF POSSIBLE) ONE LINE
%%% 1 -- REFERRING TO AN ANR OR EUROPEAN PROJECT
%%% 1 -- European or ANR projects reused or continued
%%% 1 LIP6/TIMA/LAB-STIC OK
Regarding the expertise in High Level Synthesis (HLS), the project leverages the know-how acquired over 15 years
with the GAUT project developed in the Lab-STIC laboratory and the UGH project developed in the LIP6
and TIMA laboratories. \\
Regarding architecture synthesis skills, the project is based on know-how acquired over 10 years
with the COSY European project (1998-2000) and the DISYDENT project developed at LIP6. \\
%%% 1 IRISA OK
Regarding Application-Specific Instruction-set Processor (ASIP) design, the CAIRN group at INRIA Bretagne
Atlantique benefits from several years of expertise in the domain of retargetable compilers (Armor/Calife
since 1996, and the Gecos compilers since 2002).
	230
	231
	232	% LIP FIXME:UN:PEU:LONG ET HORS:SUJET
	233	%CA% The source-level transformations required by the HLS tools will be
	234	%CA% designed in the {\em polyhedral model}, a general framework
	235	%CA% initiated by Paul Feautrier 20 years ago. The programs handled in
	236	%CA% the polyhedral model are such that loop iterators describe a
	237	%CA% polyhedron (hence the name). This includes most of the kernels used
	238	%CA% in embedded applications. This property allows to design precise
	239	%CA% analysis by means of integer programming techniques.
	240	%CA% %communaute active & internationale
	241	%CA% %transfert techno (Reservoir)
	242	%CA% The polyhedral community is very active, and the technological
	243	%CA% transfer has now started. Reservoir Labs inc., a company based in
	244	%CA% New-York, is currently integrating the last polyhedral developments
	245	%CA% in its commercial compiler.
	246	%CA% %transfert techno (gcc)
	247	%CA% Also, polyhedra are progressively migrating into the {\sc GNU Gcc}
	248	%CA% compiler, via {\sc Graphite}, a module initially developed by
	249	%CA% Sebastian Pop.
	250	%CA% %outils existants
	251	%CA% Several tools have been developed in the polyhedral community,
	252	%CA% such as {\sc Piplib} (parameter integer programming library), and
	253	%CA% {\sc Polylib}, a library providing set operations on polyhedra. Both
	254	%CA% tools are almost mandatory in polyhedral tools, and have reached
	255	%CA% a sufficient level of maturity to be considered as standard.
	256	%syntol & bee ???
	257	% FIN
	258	% and on more than 15 years of experience on parallel hardware generation
	259	% in the polyedral model in the CAIRN group (MMAlpha software
	260	% developped in the group since 1996).
	261	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
	262	%%% 2 -- A COMPLETER (COURT)
	263	%%% 2 -- For polyedric transformation and memory optimization ... LIP
	264	%%% 2 -- For ASIP IRISA
	265	%%% 2 -- For ... CITI
	266	%%% 2 -- For ... TIMA
	267	\par
The SoCLIB ANR platform was developed by 11 laboratories and 6 companies. It makes it possible to
describe hardware architectures with a shared memory space and to deploy software
applications on them in order to evaluate their performance.
The heart of this platform is a library containing simulation models (in SystemC)
of hardware IP cores such as processors, buses, networks, memories and I/O controllers.
The platform also provides embedded operating systems and software/hardware
communication components that help implement applications quickly.
However, the synthesizable descriptions of the IPs have to be provided by users. \\
This project enhances SoCLib by providing synthesizable VHDL for standard IPs.
In addition, HLS tools such as UGH and GAUT make it possible to automatically obtain a synthesizable
description of an IP (coprocessor) from a sequential algorithm.
	279	%\par
	280	%%% 2 IRISA ?
	281	%%% 2 ASIP tool such as ... IRISA
	282	%%% 2 ...
	283	%%% 2 Coach uses pattern extractions from ROMA
	284	%\par
	285	%%% 2 LIP ?
	286	\par
The different points proposed in this project cover the priorities defined by the commission
experts in the field of Information Society Technologies (IST) for embedded
systems: ``Concepts, methods and tools for designing systems dealing with systems complexity
and allowing to apply efficiently applications and various products on embedded platforms,
considering resources constraints (delays, power, memory, etc.), security and quality
services''.
	293	\\
	294	Our team aims at covering all the steps of the design flow of architecture synthesis.
	295	Our project overcomes the complexity of using various synthesis tools and description
	296	languages required today to design architectures.
	297
	298	\section{Scientific and Technical Description}
	299	\subsection{State of the art}
\hspace{2cm}\begin{scriptsize}\begin{verbatim}
% 2. SCIENTIFIC AND TECHNICAL DESCRIPTION
% 2.1. STATE OF THE ART
% (3 pages maximum)
% Describe the scientific context and issues of the project
% by presenting a national and international state of the art that sets out the
% current knowledge on the subject. Highlight any preliminary results.
% Include the necessary bibliographic references in appendix 7.1.
\end{verbatim}
\end{scriptsize}
Our project covers several critical domains in system design in order
to achieve high performance computing. Starting from a high-level description, we aim
at automatically generating both the hardware and software components of the system.
	313
	314	\subsubsection{High Performance Computing}
Accelerating high-performance computing (HPC) applications with field-programmable
gate arrays (FPGAs) can potentially improve performance.
However, using FPGAs presents significant challenges [1].
First, the operating frequency of an FPGA is low compared to that of a high-end microprocessor.
Second, according to Amdahl's law, HPC/FPGA application performance is unusually sensitive
to the implementation quality [2].
Finally, high-performance computing programmers are a highly sophisticated but scarce
resource. Such programmers are expected to readily use new technology but lack the time
to learn a completely new skill such as logic design [3].
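
The sensitivity mentioned in the second point can be illustrated with Amdahl's law. This is a minimal sketch: the symbols $p$ (fraction of the original execution time spent in the part migrated to the FPGA) and $s$ (speedup obtained on that part) are notations introduced here, and the numerical values are purely illustrative.
\[
S(p,s) = \frac{1}{(1-p) + \dfrac{p}{s}}
\]
With $p = 0.9$, a kernel speedup of $s = 10$ gives $S \approx 5.3$, whereas $s = 5$ gives only $S \approx 3.6$, and no value of $s$ can push $S$ beyond $1/(1-p) = 10$. The overall gain therefore depends strongly on both the quality of the hardware implementation and the fraction of the application that is actually accelerated.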
	324	\\
HPC/FPGA hardware is only now emerging and in early commercial stages,
but the corresponding design techniques have not yet caught up.
Thus, much effort is required to develop design tools that translate high-level
language programs into FPGA configurations.
	329
\hspace{2cm}\begin{scriptsize}\begin{verbatim}
[1] M.B. Gokhale et al., Promises and Pitfalls of Reconfigurable
    Supercomputing, Proc. 2006 Conf. Eng. of Reconfigurable
    Systems and Algorithms, CSREA Press, 2006, pp. 11-20;
    http://nis-www.lanl.gov/~maya/papers/ersa06_gokhale_paper.pdf
[2] D. Buell, Programming Reconfigurable Computers: Language
    Lessons Learned, keynote address, Reconfigurable Systems
    Summer Institute 2006, 12 July 2006;
    http://gladiator.ncsa.uiuc.edu/PDFs/rssi06/presentations/00_Duncan_Buell.pdf
[3] T. Van Court et al., Achieving High Performance
    with FPGA-Based Computing, Computer, vol. 40, no. 3,
    pp. 50-57, Mar. 2007, doi:10.1109/MC.2007.79
\end{verbatim}
\end{scriptsize}
	345
	346	\subsubsection{System Synthesis}
	347	Today, several solutions for system design are proposed and commercialized. The most common are
	348	those provided by Altera and Xilinx to promote their FPGA devices.
	349	\\
	350	The Xilinx System Generator for DSP [http://www.xilinx.com/tools/sysgen.htm] is a plug-in to
	351	Simulink that enables designers to develop high-performance DSP systems for Xilinx FPGAs.
	352	Designers can design and simulate a system using MATLAB and Simulink. The tool will then
	353	automatically generate synthesizable Hardware Description Language (HDL) code mapped to Xilinx
	354	pre-optimized algorithms.
However, this tool targets only DSP-based algorithms and Xilinx FPGAs, and cannot handle a complete
SoC. Thus, it is not really a system synthesis tool.
\\
In contrast, SOPC Builder [CITATION] makes it possible to describe a system, to synthesize it,
to program it into a target FPGA and to upload a software application.
% FIXME(C2H from Altera: runs fast but uses a monstrous amount of resources)
Nevertheless, SOPC Builder does not provide any facilities to synthesize coprocessors.
Users have to provide the synthesizable description with the appropriate bus interface.
\\
In addition, Xilinx System Generator and SOPC Builder are closed worlds, since each one imposes
its own IPs, which are not interchangeable.
We can conclude that the existing commercial or free tools do not cover the whole system
synthesis process in a fully automatic way. Moreover, they are bound to a particular device family
and to a particular IP library.
	369
	370	\subsubsection{High Level Synthesis}
High Level Synthesis translates a sequential algorithmic description and a set of constraints
(area, power, frequency, ...) into a micro-architecture at the Register Transfer Level (RTL).
Several academic and commercial tools are available today.
The most common tools are SPARK [HLS1], GAUT [HLS2] and UGH [HLS3] in the academic world,
and Catapult C [HLS4], PICO [HLS5] and Cynthesizer [HLS6] in the commercial world.
Despite their maturity, their usage is restricted for the following reasons:
\begin{itemize}
\item They do not accurately respect the frequency constraint when they target an FPGA device.
Their error is about 10 percent. This is annoying when the generated component is integrated
into a SoC, since it will slow down the whole system.
\item These tools take into account only one or a few constraints simultaneously, whereas realistic
designs are multi-constrained.
Moreover, a low power consumption constraint is mandatory for embedded systems.
However, it is not yet well handled by common synthesis tools.
\item The parallelism is extracted from the initial algorithm. To get more parallelism or to reduce
the amount of required memory, the user must rewrite it, although techniques such as polyhedral
transformations exist to increase the intrinsic parallelism.
\item Although they have the same input language (C/C++), they are sensitive to the style in
which the algorithm is written. Consequently, engineering work is required to switch from
one tool to another.
\item The HLS tools are not integrated into an architecture and system exploration tool.
Thus, a designer who needs to accelerate a software part of the system must adapt it manually
to the HLS input dialect and perform engineering work to exploit the synthesis result
at the system level.
\end{itemize}
Given these limitations, it is necessary to create a new generation of tools that reduces the gap
between the specification of a heterogeneous system and its hardware implementation.
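
The first limitation above can be quantified with a simple estimate. This is a sketch under the assumption that the whole SoC runs in a single clock domain, so that its clock is set by the slowest component; $N_{\mathrm{cycles}}$, $f_{\mathrm{target}}$, $T_{\mathrm{target}} = N_{\mathrm{cycles}}/f_{\mathrm{target}}$ and the relative frequency error $\varepsilon$ are notations introduced here.
\[
T = \frac{N_{\mathrm{cycles}}}{(1-\varepsilon)\,f_{\mathrm{target}}},
\qquad
\frac{T}{T_{\mathrm{target}}} = \frac{1}{1-\varepsilon}.
\]
With $\varepsilon = 0.1$ (the error of about 10 percent mentioned above), $T/T_{\mathrm{target}} = 1/0.9 \approx 1.11$: every task of the system, not only the synthesized coprocessor, takes about 11 percent longer.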
	398
\hspace{2cm}\begin{scriptsize}\begin{verbatim}
[HLS1] SPARK, University of California, San Diego
[HLS2] GAUT, UBS/Lab-STIC
[HLS3] UGH
[HLS4] Catapult C, Mentor
[HLS5] PICO, Synfora
[HLS6] Cynthesizer, Forte Design Systems
\end{verbatim}
\end{scriptsize}
	408
	409	\subsubsection{Application Specific Instruction Processors}
	410
ASIPs (Application-Specific Instruction-Set Processors) are programmable processors in
which both the instruction set and the micro-architecture have been tailored to a given
application domain (e.g. video processing), or to a specific application.
This specialization usually offers a good compromise between performance (w.r.t. a pure software
implementation on an embedded CPU) and flexibility (w.r.t. an application-specific
hardware co-processor).
In spite of their obvious advantages, using/designing ASIPs remains a difficult
task, since it involves designing both a micro-architecture and a compiler for this
architecture. Besides, to our knowledge, there is still no available open-source
design flow\footnote{There are commercial tools such a } for ASIP design, even if such a tool would
be valuable in the context of a system-level design exploration tool.
	422
In this context, ASIP design based on Instruction Set Extensions (ISEs) has
received a lot of interest [NIOSII,TENSILICA],%~\cite{NIOS2,ST70},
as it makes micro-architecture synthesis
more tractable\footnote{ISEs rely on a template micro-architecture in which
only a small fraction of the architecture has to be specialized} and helps ASIP
designers to focus on compilers, for which there are still many open problems
[CODES04,FPGA08].
However, this approach has a strong weakness: it also significantly reduces
the opportunities for achieving good speedups (most speedups remain between 1.5x and
2.5x), since ISE performance is generally limited by I/O constraints, as
ISEs generally rely on the main CPU register file to access data.
	434
	435	% (
	436	%automaticcaly extraction ISE candidates for application code \cite{CODES04},
	437	%performing efficient instruction selection and/or storage resource (register)
	438	%allocation \cite{FPGA08}).
	439
	440
To cope with this issue, recent approaches~[DAC09,DAC08]%\cite{DAC09,DAC08}
advocate the use of
micro-architectural ISE models in which the coupling between the processor micro-architecture
and the ISE component is tightened so as to allow the ISE to overcome the register
I/O limitations. However, these approaches tackle the problem from a compiler/simulation
point of view and do not address the problem of generating synthesizable representations for
these models.
	448
We therefore strongly believe that there is a need for an open framework which
would allow researchers and system designers to:
\begin{itemize}
\item Explore the various levels of interaction between the original CPU micro-architecture
and its extension (for example through a Domain Specific Language targeted at micro-architecture
specification and synthesis).
\item Retarget the compiler instruction-selection passes (or prototype new passes) so as
to be able to take advantage of these ISEs.
\item Provide a complete system-level integration for using ASIPs as SoC building blocks
(integration with application-specific blocks, MPSoC, etc.)
\end{itemize}
	460
	461	\hspace{2cm}
	462	\begin{scriptsize}\begin{verbatim}
	463
	464	[CODES08] Theo Kluter, Philip Brisk, Paolo Ienne, and Edoardo Charbon, Speculative DMA for
	465	Architecturally Visible Storage in Instruction Set Extensions
	466
	467	[DAC09] Theo Kluter, Philip Brisk, Paolo Ienne, Edoardo Charbon, Way Stealing: Cache-assisted
	468	Automatic Instruction Set Extensions.
	469
	470	[CODES04] Pan Yu, Tulika Mitra, Scalable Custom Instructions Identification for
	471	Instruction Set Extensible Processors.
	472
	473	[FPGA08] Quang Dinh, Deming Chen, Martin D. F. Wong, Efficient ASIP Design for Configurable
	474	Processors with Fine-Grained Resource Sharing.
	475
	476	[NIOSII] Nios II Custom Instruction User Guide
	477
	478	\end{verbatim}
	479
	480	\end{scriptsize}
	481	%, either
	482	%because the target architecture is proprietary, or because the compiler
	483	%technology is closed/commercial.
	484
	485
	486
	487
	488	% We propose to explore how to tighten the coupling of the extensions and
	489	% the underlyoing template micro-architecture.
	490	% * Thightne Even if such
	491	% an approach offers less flexiblity and forbids very tight coupling
	492	% between the extensions and the template micro-architecture, it makes the
	493	% design of the micro-architecture more tractable and amenable to a fully
	494	% automated flow.
	495	% \\
	496	% \\
	497	% In the context of the COACH project, we propose to add to the
	498	% infra-structure a design flow targeted to automatic instruction set
	499	% extension for the MIPS-based CPU, which will come as a complement or an
	500	% alternative to the other proposed approaches (hardware accelerator,
	501	% multi processors).
	502	%
	503
	504	\subsubsection{Automatic Parallelization}
\begin{Large}\begin{verbatim}
-- TO BE COMPLETED BY LIP
\end{verbatim}
\end{Large}
	509	%CA% Parallel machines are often difficult and painful to program
	510	%CA% directly, and one would like the compiler to %do the job, that is to
	511	%CA% turn automatically a sequential program into a parallel form. This
	512	%CA% transformation is referred as {\em automatic parallelization}, and has
	513	%CA% been widely addressed since the 70s. Automatic parallelization
	514	%CA% relies on data dependences, which cannot be computed in general.%, as
	515	%CA% %one cannot predict at compile time the variable values on a given
	516	%CA% %execution point.
	517	%CA% This negative result led researchers to (i) find a
	518	%CA% program model in which no approximation is needed (i.e., the polyhedral
	519	%CA% model), (ii) make conservative approximations, or (iii) observe that
	520	%CA% variable values are known at runtime, and make the decisions during
	521	%CA% program execution. The latter approach is obviously not suitable
	522	%CA% here, as we target hardware generation. We give here a short
	523	%CA% history of the approaches that fall into the first category.
	524	%CA%
	525	%CA%% In the real world, we deal with a limited amount of processors,
	526	%CA%% and the communication between processors takes time, and is
	527	%CA%% critical for performance. %Whenever we have synchronisation-free
	528	%CA%% parallelism, like for embarrassingly parallel kernels, this is not an
	529	%CA%% issue. But in case of pipelined parallelism, we need to reduce
	530	%CA%% communications as much as possible.
	531	%CA%% So we also need to find parallelism together with a proper mapping
	532	%CA%% of operations and data on physical processors.
	533	%CA%
	534	%CA% As programs spend most of their time in loops, the community has
	535	%CA% focused on loop transformations that reveal parallelism.
	536	%CA%%unimodulaire
	537	%CA% The first approaches worked on perfect loop nests, where the tree
	538	%CA% formed by the nested loops is linear. In this program model, the
	539	%CA% loops can be seen as a basis that drives the way the iteration
	540	%CA% domain is described. Hence, a first idea was to change this
	541	%CA% basis such that at least one vector (one loop) is parallel. To ease
	542	%CA% code generation, the region spanned by the new vectors must
	543	%CA% have unit volume. %Otherwise, one would produce a homothetic
	544	%CA%% expansion of the iteration domain, which would force modulo operations
	545	%CA%% into the target code.
	546	%CA% For this reason, these transformations are called {\em unimodular
	547	%CA% transformations}.
	548	%CA%%tiling
	549	%CA%
	550	%CA% The next approaches include {\em loop tiling}, a simple
	551	%CA% partitioning of the iteration domain, whose initial purpose is to
	552	%CA% execute every partition on a different processor. %In the same way,
	553	%CA% The execution order is modified with a proper unimodular
	554	%CA% transformation, then the tiles are obtained by cutting the
	555	%CA% iteration domain with the hyperplanes directed by every vector of
	556	%CA% the new (unimodular) basis, at regular intervals. When the tiling
	557	%CA% hyperplanes are properly chosen, we can both improve data-locality
	558	%CA% on every processor, and reduce the communication between two
	559	%CA% different tiles (which will be mapped on processors). This last
	560	%CA% property implies that one tends to find a degree of parallelism as
	561	%CA% great as possible.
	562	%CA%
	563	%CA%%affine scheduling
	564	%CA% The previous approaches were restricted to kernels with perfect
	565	%CA% loop nests (linear loop tree), and unimodular transformations. The
	566	%CA% last generation of approaches broke with these limitations. We now
	567	%CA% choose a different basis for every assignment, without the
	568	%CA% unimodularity restriction. A dual way to present this is the
	569	%CA% notion of {\em affine schedule}, introduced by Feautrier [part1],
	570	%CA% that simply assigns an abstract execution date to every assignment
	571	%CA% execution. As an assignment execution is exactly characterised by
	572	%CA% the current value of the loop counters (iteration vector), the
	573	%CA% affine schedule is defined as an affine form of the iteration
	574	%CA% vector (hence the 'affine'). The affine property allows the use of
	575	%CA% integer programming techniques to compute the schedule. With this
	576	%CA% approach, additional techniques are required to allocate the
	577	%CA% parallel operations and the data to processors in an efficient way
	578	%CA% [griebl, feautrier].
	579	%CA%
	580	%CA%%modularity??
	581	%CA%%% As loop nests are no longer perfect, we deal with (transformed)
	582	%CA%%% iteration domains of different dimensions, which can possibly (and
	583	%CA%%% certainly) overlap. At this point, a new code generation technique
	584	%CA%%% was needed. The first attempt is due to Chamsky et al. [??], and
	585	%CA%%% was improved by Quillere et al. [QRW]. The code is now implemented
	586	%CA%%% in an efficient tool [cloog], that gave a new life to polyhedral
	587	%CA%%% techniques.
	588	%CA%
	589	%CA%%pluto's tiling
	590	%CA% The tiling techniques were extended to non-perfect loop nests with
	591	%CA% {\em affine partitioning}. Affine partitioning is to affine
	592	%CA% scheduling what (original) tiling was to unimodular
	593	%CA% transformations. An affine partitioning assigns to every assignment
	594	%CA% its coordinates in the basis defined by the normals to the tiling
	595	%CA% hyperplanes. Recently, a way to compute efficient hyperplanes was
	596	%CA% found [uday], with good data locality, and communications
	597	%CA% confined in a small neighborhood around every processor.
	598	%CA%
	599	%CA%\subsubsection{Source-level Memory Optimisation}
	600	%CA% The HLS process allows memory to be customised, which impacts the final
	601	%CA% circuit size and power consumption. Though most HLS tools already
	602	%CA% try to optimise memory usage, it is better to provide an independent
	603	%CA% source-level pass, that could be reused for different tools and in
	604	%CA% other contexts.
	605	%CA%
	606	%CA% There exist many approaches to evaluate and reduce the memory
	607	%CA% requirement of a program. The first approaches are concerned with
	608	%CA% {\em memory size estimation}, which can be defined as the maximum
	609	%CA% number of memory cells used at the same time [clauss,zhao]. These
	610	%CA% approaches provide an estimation as a symbolic expression of program
	611	%CA% parameters, which can be used further to guide loop optimisations.
	612	%CA% However, no explicit way to reduce the memory size is given. {\em
	613	%CA% Intra-array reuse} approaches break with this limitation, and
	614	%CA% collapse the array cells which are not alive at the same time. The
	615	%CA% collapse is done by means of a data layout transformation, specified
	616	%CA% with a linear (modular) mapping. The first approaches were
	617	%CA% developed at IMEC [balasa,catthoor], and basically try to linearize
	618	%CA% the arrays and fold them using a modulo operator. Then, Lefebvre et
	619	%CA% al. propose a solution to fold independently the array dimensions
	620	%CA% [lefebvre]. Finally, Darte et al. provide a general formalisation of
	621	%CA% the problem, together with a solution that subsumes the previous
	622	%CA% approaches [darte]. A first implementation was made with the tool
	623	%CA% {\sc Bee}, but there are still many limitations.
	624	%CA%
	625	%CA% \begin{itemize}
	626	%CA% \item The tool is restricted to regular programs, whereas more
	627	%CA% general programs could be handled with a conservative array liveness
	628	%CA% analysis.
	629	%CA%
	630	%CA% \item Programs depending on parameters (inputs) are not handled,
	631	%CA% which prevents handling, for example, the body of tiled loops.
	632	%CA%
	633	%CA% \item The new array layout can break spatial locality, and thus impact
	634	%CA% performance and power consumption. One would like to get a mapping
	635	%CA% that improves or, at least, preserves the spatial locality of the
	636	%CA% program.
	637	%CA%
	638	%CA% \item Finally, the final memory compaction strongly depends on the
	639	%CA% program schedule, and is naturally hindered by the
	640	%CA% parallelism. Consequently, there is a trade-off to find with
	641	%CA% automatic parallelization. An ideal solution would be to reduce
	642	%CA% memory usage, while preserving parallelism.
	643	%CA% \end{itemize}
	644
	645	\subsubsection{Interfaces}
	646	\begin{Large}\begin{verbatim}
	647	-- TO BE COMPLETED BY INSA: state of the art
	648	\end{verbatim}
	649	\end{Large}
	650	%
	651	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
	652	\subsection{Objectives and innovation aspects}
	653	\hspace{2cm}\begin{scriptsize}\begin{verbatim}
	654	% 2.2. OBJECTIVES AND AMBITIOUS/INNOVATIVE NATURE OF THE PROJECT
	655	% (2 pages maximum)
	656	% Describe the scientific/technical objectives of the project.
	657	% Present the expected scientific advance. Specify the originality and the
	658	% ambitious nature of the project.
	659	% Detail the scientific and technical barriers to be overcome by the project.
	660	% Describe, if applicable, the final product(s) developed at the end of the project,
	661	% showing the innovative nature of the project.
	662	% Present the expected results, proposing if possible success and evaluation
	663	% criteria suited to the type of project, allowing the results to be assessed at
	664	% the end of the project.
	665	% Where applicable (programmes requiring multidisciplinarity), demonstrate how
	666	% the scientific disciplines fit together.
	667	\end{verbatim}
	668	\end{scriptsize}
	669
	670	% the scientific/technical objectives of the project.
	671	The objective of the COACH project is to develop a complete framework for
	672	HPC (accelerating existing software applications)
	673	and embedded applications (implementing an application on a low-power standalone device).
	674	The design steps are presented in figure~\ref{coach-flow}.
	675	\begin{figure}[hbtp]\leavevmode\center
	676	\includegraphics[width=.8\linewidth]{flow}
	677	\caption{\label{coach-flow} COACH flow.}
	678	\end{figure}
	679	\begin{description}
	680	\item[HPC setup] Here the user splits the application into two parts: the host application,
	681	which remains on the PC, and the SoC application, which migrates to the SoC.
	682	The framework provides a simulation model for evaluating the partitioning.
	683	\item[SoC design] In this phase,
	684	the user obtains simulators of the SoC at different abstraction levels by giving the COACH framework
	685	a SoC description.
	686	This description consists of a process network corresponding to the SoC application,
	687	an OS, an instance of a generic hardware platform,
	688	and a mapping of the processes onto the platform components. The supported mappings are
	689	software (the process runs on a SoC processor),
	690	XXXpeci (the process runs on a SoC processor enhanced with dedicated instructions),
	691	and hardware (the process runs in a coprocessor generated by HLS and plugged into the SoC bus).
	692	\item[Application compilation] Once the SoC description is validated, COACH automatically generates
	693	an FPGA bitstream containing the hardware platform with the SoC application software, and
	694	an executable containing the host application. The user can launch the application by
	695	loading the bitstream onto the FPGA and running the executable on the PC.
	696	\end{description}
	697
	698	% the expected scientific advance. Specify the originality and the
	699	% ambitious nature of the project.
	700	The main scientific contribution of the project is to unify various synthesis techniques
	701	(same input and output formats), allowing the user to switch from one to another
	702	without engineering effort, and even to chain them, for example to run a polyhedral transformation
	703	before synthesis.
	704	Another advantage of this framework is that it provides different abstraction levels from
	705	a single description.
	706	Finally, this description is independent of the device family, and its hardware implementation
	707	is generated automatically.
	708
	709	% Detail the scientific and technical barriers to be overcome by the project.
	710	System design is a very complicated task, and in this project we try to simplify it
	711	as much as possible. For this purpose we have to deal with the following scientific
	712	and technological barriers.
	713	\begin{itemize}
	714	\item The main problem in HPC is the communication between the PC and the SoC.
	715	This problem has two aspects. The first is efficiency. The second is to
	716	eliminate the engineering effort needed to implement it at different abstraction levels.
	717	\item The COACH design flow follows a top-down approach. In such a case,
	718	the required performance of a coprocessor (operating frequency, maximum number of cycles for
	719	a given computation, power consumption, etc.) is imposed by the other system
	720	components. The challenge is to let the user control the synthesis
	721	process accurately. For instance, the operating frequency must not be a result of the RTL synthesis
	722	but a strict synthesis constraint.
	723	\item HLS tools are sensitive to the style in which the algorithm is written.
	724	In addition, they are not integrated into an architecture and system
	725	exploration tool.
	726	Consequently, engineering work is required to switch from one tool to another,
	727	to integrate the resulting simulation model into an architectural exploration tool,
	728	and to synthesize the generated RTL description.
	729	%CA Additional preprocessing, source-level transformations, are thus
	730	%CA required to improve the process.
	731	%CA Particularly, this includes parallelism exposure and efficient memory mapping.
	732	\item Most HLS tools translate a sequential algorithm into a coprocessor
	733	containing a single data-path and finite state machine (FSM). In this way,
	734	only fine-grained parallelism (instruction-level parallelism, ILP) is exploited.
	735	The challenge is to identify coarse-grained parallelism and to generate,
	736	from a sequential algorithm, a coprocessor containing multiple communicating
	737	tasks (data-paths and FSMs).
	738	\end{itemize}
	739
	740	%Present the expected results, proposing if possible success and evaluation
	741	%criteria suited to the type of project, allowing the results to be assessed at
	742	%the end of the project.
	743	The main result is the framework. Concretely, it is composed of:
	744	2 HPC communication schemes with their implementations,
	745	5 HLS tools (control-dominated HLS, data-dominated HLS, coarse-grained HLS,
	746	memory-optimisation HLS and ASIP),
	747	3 SystemC-based virtual prototyping environments extended with synthesizable
	748	RTL IP cores (generic, ALTERA/NIOS/AVALON, XILINX/MICROBLAZE/OPB),
	749	one design space exploration tool,
	750	one operating system (OS).
	751	\\
	752	The framework functionality will be demonstrated with XXX-EXAMPLE1, XXX-EXAMPLE2
	753	and XXX-EXAMPLE3 on 4 architectures (generic/XILINX, generic/ALTERA,
	754	proprietary/XILINX, proprietary/ALTERA).
	755
	756	%% \section{}
	757	%% %3. SCIENTIFIC AND TECHNICAL PROGRAMME, PROJECT ORGANISATION
	758	%% \subsection{}
	759	%% %3.1. SCIENTIFIC PROGRAMME AND PROJECT STRUCTURE
	760	%% %(2 pages maximum)
	761	%% %Present the scientific programme and justify the breakdown of the work
	762	%% %programme into tasks, consistent with the objectives pursued.
	763	%% %Use a diagram to show the links between the different tasks
	764	%% %(work breakdown structure)
	765	%% %The tasks represent the major phases of the project. Their number is limited.
	766	%% %Do not forget the activities and actions related to dissemination and
	767	%% %exploitation of results.
	768	%%
	769	%% %PUT A FIGURE HERE DESCRIBING THE TASKS AND THEIR INTERACTIONS (WITH THE FLOW
	770	%% %AS A WATERMARK?)
	771	%% \subsection{}
	772	%% %3.2. PROJECT MANAGEMENT
	773	%% %(2 pages maximum)
	774	%% %Specify the organisational aspects of the project and the coordination arrangements
	775	%% %(if possible, a separate coordination task: see task 0 of submission
	776	%% %document A).
	777	%% \subsection{}
	778	%% %3.3. DESCRIPTION OF THE WORK BY TASK
	779	%% %(ideally 1 or 2 pages per task)
	780	%% %For each task, describe:
	781	%% %- the objectives of the task and any success indicators,
	782	%% %- the task leader and the partners involved (this may be
	783	%% %shown graphically),
	784	%% %- the detailed work programme for the task,
	785	%% %- the deliverables of the task,
	786	%% %- the contributions of the partners ("who does what"),
	787	%% %- the description of the methods and technical choices, and of how
	788	%% %the solutions will be provided,
	789	%% %- the risks of the task and the fallback solutions envisaged.
	790
	791
	792
	793
	794
	795
