\section{Project context}
\hspace{2cm}\begin{scriptsize}\begin{verbatim}
% 1. CONTEXT AND POSITIONING OF THE PROJECT
% (1 page maximum) General presentation of the problem that the project proposes
% to address and of the working framework (fundamental research, industrial
% research or experimental development).
\end{verbatim}
\end{scriptsize}
An embedded system is an application integrated into one or several chips,
either to accelerate it or to embed it into a small device such as a personal 
digital assistant (PDA).
This topic has been investigated since the 1980s using Application-Specific Integrated Circuits (ASICs),
Digital Signal Processors (DSPs) and parallel computing on multiprocessor machines or networks.
More recently, since the end of the 1990s, other technologies have appeared, such as Very Long Instruction Word (VLIW)
processors, Application-Specific Instruction-Set Processors (ASIPs), Systems on Chip (SoCs) and
Multi-Processor SoCs (MPSoCs).
\\
Over the last decades, embedded systems were reserved for major industrial companies targeting high-volume markets,
because of the design and fabrication costs.
Nowadays, Field Programmable Gate Arrays (FPGAs), such as the Virtex-5 from Xilinx and the Stratix IV from Altera, 
can implement a SoC with multiple processors and several coprocessors for less than 10K euros
per item. In addition, High-Level Synthesis (HLS) is becoming more mature; it automates 
design and drastically decreases its cost in terms of manpower. Thus, both FPGAs and HLS 
tend to spread into HPC for small companies targeting low-volume markets.
\par
To obtain an efficient embedded system, the designer has to take the application characteristics into account
when choosing one of the above technologies.
This choice is not easy, and in most cases the designer has to try several technologies in order to retain the
best-suited one.
\\
The first objective of COACH is to provide an open-source framework for designing embedded systems
on FPGA devices.
The COACH framework allows the designer to explore various software/hardware partitions of the
target application, to run timing and functional simulations, and to automatically generate both
the software and the synthesizable description of the hardware.
The main topics of the project are:
\begin{itemize} 
\item
Design space exploration: It consists in analysing the application running on the FPGA, defining the target
technology (SoC, MPSoC, ASIP, ...) and partitioning the tasks between hardware and software according to the
technology choice. This exploration is driven mainly by throughput, latency and power consumption 
criteria. 
\item
Micro-architectural exploration: When hardware components are required, the HLS tools of the framework
generate them automatically. At this stage, the framework provides various HLS tools that allow the
exploration of the micro-architectural design space. The exploration criteria are again throughput, latency
and power consumption.
% FIXME
%CA At this stage, preliminary source-level transformations will be
%CA required to improve the efficiency of the target component.
%CA COACH will also provide such facilities, such as automatic parallelization
%CA and memory optimisation.
\item
Performance measurement: For each point of the design space, metrics are available for criteria
such as throughput, latency, power consumption, area, memory allocation and data locality. 
They are evaluated using virtual prototyping, estimation or analysis methodologies.
\item
Targeted hardware technology: The COACH description of a system is independent of the FPGA family.
Every point of the design exploration space can be implemented on any FPGA that has the required resources.
As a baseline, COACH handles both the Altera and Xilinx FPGA families.
\end{itemize}
As an extension of embedded system design, COACH also addresses High Performance Computing (HPC).
In HPC, the targeted application is an existing one running on a PC. COACH helps the designer
accelerate it by migrating its critical parts into a SoC implemented on an FPGA plugged into the PC bus.
\par
COACH results from the will of several laboratories to unify their know-how and skills in the
following domains: operating systems and hardware communication (TIMA, SITI), SoC and MPSoC (LIP6 and TIMA),
ASIP (IRISA) and HLS (LIP6, Lab-STIC and LIP). The project objective is to integrate these various 
domains into a single free framework (licence ...) that hides these domains and their 
different tools from the user as much as possible.


\subsection{Economic context and interest}
\hspace{2cm}\begin{scriptsize}\begin{verbatim}
% 1.1. ECONOMIC AND SOCIETAL CONTEXT AND STAKES
% (2 pages maximum)
% Describe the economic, social and regulatory context in which the project
% takes place, presenting an analysis of the social, economic, environmental
% and industrial stakes. Give quantified arguments where possible, for example
% the relevance and scope of the project with respect to economic demand (market
% analysis, trend analysis), analysis of the competition, cost-reduction
% indicators, market prospects (fields of application, ...). Indicators of
% environmental gains, life cycle.
\end{verbatim}
\end{scriptsize}
Microelectronics makes it possible to integrate complicated functions into products, to increase their
commercial attractiveness and to improve their competitiveness. The multimedia and communication
sectors have taken advantage of microelectronics thanks to the development of
design methodologies and tools for real-time embedded systems. Many other sectors could
benefit from microelectronics if these methodologies and tools were adapted to their features.
The Non-Recurring Engineering (NRE) costs involved in designing and manufacturing an ASIC are 
very high. An IC factory costs several billion euros, and fabricating a specific circuit costs
several million; for example, a conservative estimate for a 65nm ASIC project is 10 million USD. 
Consequently, it is generally unfeasible to design and fabricate ASICs in
low volumes, and ICs are designed to cover a broad spectrum of applications at the cost of
performance degradation.
\\
Today, FPGAs are becoming important actors in the computational domain that was originally dominated
by microprocessors and ASICs. Just like microprocessors, FPGA-based systems can be reprogrammed
on a per-application basis. At the same time, FPGAs offer significant performance benefits over
microprocessor implementations for a number of applications. Although these benefits are still
generally an order of magnitude smaller than those of equivalent ASIC implementations, the low cost 
(500 euros to 10K euros), fast time to market and flexibility of FPGAs make them an attractive 
choice for low-to-medium volume applications. 
Since their introduction in the mid-eighties, FPGAs have evolved from a simple, 
low-capacity gate array technology to devices (Altera Stratix III, Xilinx Virtex-5) that
provide a mix of coarse-grained data path units, memory blocks, microprocessor cores, 
on-chip A/D conversion, and gate counts in the millions. This high logic capacity makes it possible
to implement complex systems such as multi-processor platforms with application-dedicated coprocessors. 
Table~\ref{fpga_market} shows the estimated worldwide FPGA market for the coming years, covering 
various application domains. The ``high end'' lines concern only FPGAs whose logic capacity is high
enough to implement complex systems. 
This market is expanding significantly and is estimated at 914\,M\$ in 2012.
Using FPGAs limits the NRE costs to the design cost. This boosts the development of methodologies
and tools that automate design and reduce its cost.
\begin{table}\leavevmode\centering
\begin{tabular}{|l|l|l|l|}\hline
Segment	        & 2010	& 2011	& 2012 \\\hline\hline
Communications	& 1,867	& 1,946	& 2,096 \\
High end	& 467	& 511	& 550 \\\hline
Consumer	& 550	& 592	& 672 \\
High end	& 53	& 62	& 75 \\\hline
Automotive	& 243	& 286	& 358 \\
High end	& -	& -	& - \\\hline
Industrial	& 1,102	& 1,228	& 1,406 \\
High end	& 177	& 188	& 207 \\\hline
Military/Aero	& 566	& 636	& 717 \\
High end	& 56	& 65	& 82 \\\hline\hline
Total FPGA/PLD	& 4,659	& 5,015	& 5,583 \\
Total High-End  FPGA	& 753	& 826	& 914 \\\hline
\end{tabular}
\caption{\label{fpga_market} Gartner estimate of worldwide FPGA/PLD consumption (millions of \$)}
\end{table}
\par
Today, several companies (atipa, blue-arc, Bull, Chelsio, Convey, CRAY, DataDirect, DELL, hp, 
Wild Systems, IBM, Intel, Microsoft, Myricom, NEC, nvidia, etc.) are building systems in which the demand 
for very high performance (HPC) takes precedence over other requirements. They tend to use the
highest-performing devices, such as multi-core CPUs, GPUs, large FPGAs and custom ICs, together with the
most innovative architectures and algorithms. These companies are present in various ``traditional''
applications and market segments such as computing clusters (ad hoc), servers and storage, networking
and telecom, ASIC emulation and prototyping, military/aerospace, etc. FPGA providers currently estimate
the size of the HPC market at 214\,M\$. 
This market is dominated by solutions based on multi-core CPUs and GPUs, and the expansion 
of FPGA-based solutions is limited by the lack of flow automation. Today, no commercial 
or free tool covers the whole design process.
For instance, with SOPC Builder from Altera, users can select and parameterize IP components 
from an extensive drop-down list of communication, digital signal processor (DSP), microprocessor 
and bus interface cores, as well as incorporate their own IP. Designers can then generate 
a synthesized netlist, simulation test bench and custom software library that reflect the hardware 
configuration.
Nevertheless, SOPC Builder does not provide any facility to synthesize coprocessors
\emph{(Steven: I disagree: the C2H compiler bundled with SOPC Builder does a pretty good job at this)}
or to simulate the platform at a high design level (SystemC). 
In addition, SOPC Builder is proprietary and only works together with Altera's Quartus compilation
tool to implement designs on Altera devices (Stratix, Arria, Cyclone).
PICO [CITATION] and CATAPULT [CITATION] synthesize coprocessors from a C++ description.
Nevertheless, they can only deal with data-dominated applications and they do not handle the
platform level.
The Xilinx System Generator for DSP [http://www.xilinx.com/tools/sysgen.htm] is a plug-in to 
Simulink that enables designers to develop high-performance DSP systems for Xilinx FPGAs. 
Designers can design and simulate a system using MATLAB and Simulink. The tool will then 
automatically generate synthesizable Hardware Description Language (HDL) code mapped to Xilinx 
pre-optimized algorithms. 
However, this tool targets only DSP-based algorithms.
\\
Consequently, a designer developing an embedded system needs to master, for example,
SoCLib for design exploration,
SOPC Builder at the platform level, 
PICO for synthesizing the data-dominated coprocessors
and Quartus for design implementation.
This requires a significant tool-interfacing effort and makes the design process very complex 
and achievable only by designers skilled in many domains.
The COACH project integrates all these tools into a single framework and hides them from the user. 
The objective is to allow \textbf{pure software} developers to build embedded systems.
\par
The combination of a framework dedicated to software developers and an FPGA target makes it possible
to gain market share over HPC solutions based on multi-core CPUs and GPUs. 
Moreover, one can expect that small and even very small companies will be able to offer embedded 
systems and acceleration solutions for standard software applications at acceptable prices, since,
unlike ASIC-based solutions, they require no huge hardware investment.
\\
This new market may grow explosively, as micro-computing did in the eighties. That success was due 
to the low cost of the first micro-computers (compared to mainframes) and to the advent of high-level 
programming languages, which allowed a large number of programmers to launch start-ups in software
engineering.

\subsection{Project position}
\hspace{2cm}\begin{scriptsize}\begin{verbatim}
% 1.2. POSITIONING OF THE PROJECT
% (2 pages maximum)
% Specify:
% - positioning of the project with respect to the context described above:
%   with respect to competing, complementary or earlier projects and research,
%   patents and standards.
% - positioning of the project with respect to the thematic axes of the call.
% - positioning of the project at the European and international levels.
\end{verbatim}
\end{scriptsize}
The aim of this project is to propose an open-source framework for architecture synthesis,
mainly targeting field programmable gate array (FPGA) circuits.
\\% LIP6/TIMA
To evaluate the different architectures, the project uses the prototyping platform
of the SoCLIB ANR project (2006-2009).
\\% IRISA
The project will also borrow from the ROMA ANR project (2007-2009) and from the ongoing 
joint INRIA-STMicro Nano2012 project. In particular, we will adapt existing pattern 
extraction algorithms and datapath merging techniques to the synthesis of customized 
ASIP processors.
\\
\textcolor{gris75}{Steven: I suggest adding a link with the BioWic project: on the HPC
application side, we also hope to benefit from the experience in hardware acceleration of
bioinformatics algorithms/workflows gathered by the CAIRN group in the context of the ANR
BioWic project (2009-2011), so as to be able to validate the framework on 
real-life HPC applications.}

\par
%%% 1 -- COULD EACH OF YOU PLEASE ADD (IF POSSIBLE) ONE LINE
%%% 1 -- REFERRING TO AN ANR OR EUROPEAN PROJECT
%%% 1 -- European or ANR projects reused or continued
%%% 1 LIP6/TIMA/LAB-STIC OK
Regarding expertise in High-Level Synthesis (HLS), the project leverages know-how acquired over 15 years
with the GAUT project developed at the Lab-STIC laboratory and the UGH project developed at the LIP6 
and TIMA laboratories. \\
Regarding architecture synthesis skills, the project builds on know-how acquired over 10 years
with the COSY European project (1998-2000) and the DISYDENT project developed at LIP6.  \\
%%% 1 IRISA OK
Regarding Application-Specific Instruction-Set Processor (ASIP) design, the CAIRN group at INRIA Bretagne
Atlantique benefits from several years of expertise in the domain of retargetable compilers (Armor/Calife
since 1996, and the Gecos compilers since 2002).


% LIP FIXME: A BIT LONG AND OFF-TOPIC
%CA% The source-level transformations required by the HLS tools will be
%CA% designed in the {\em polyhedral model}, a general framework
%CA% initiated by Paul Feautrier 20 years ago.  The programs handled in
%CA% the polyhedral model are such that loop iterators describe a
%CA% polyhedron (hence the name). This includes most of the kernels used
%CA% in embedded applications. This property allows to design precise
%CA% analysis by means of integer programming techniques.
%CA% %communaute active & internationale
%CA% %transfert techno (Reservoir)
%CA% The polyhedral community is very active, and the technological
%CA% transfer has now started. Reservoir Labs inc., a company based in
%CA% New-York, is currently integrating the last polyhedral developments
%CA% in its commercial compiler.
%CA% %transfert techno (gcc)
%CA% Also, polyhedra are progressively migrating into the {\sc GNU Gcc}
%CA% compiler, via {\sc Graphite}, a module initially developed by
%CA% Sebastian Pop.
%CA% %outils existants
%CA% Several tools have been developed in the polyhedral community,
%CA% such as {\sc Piplib} (parameter integer programming library), and
%CA% {\sc Polylib}, a library providing set operations on polyhedra. Both
%CA% tools are almost mandatory in polyhedral tools, and have reached
%CA% a sufficient level of maturity to be considered as standard.
%syntol & bee ???
% FIN
% and on more than 15 years of experience on parallel hardware generation
% in the polyedral model in the CAIRN group (MMAlpha software
% developped in the group since 1996).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% 2 -- TO BE COMPLETED (SHORT)
%%% 2 -- For polyedric transformation and memory optimization ... LIP 
%%% 2 -- For ASIP IRISA
%%% 2 -- For ... CITI
%%% 2 -- For ... TIMA
\par
The SoCLIB ANR platform was developed by 11 laboratories and 6 companies. It makes it possible to
describe hardware architectures with a shared memory space and to deploy software
applications on them in order to evaluate their performance. 
The heart of this platform is a library containing simulation models (in SystemC)
of hardware IP cores such as processors, buses, networks, memories and I/O controllers.
The platform also provides embedded operating systems and software/hardware
communication components that are useful for implementing applications quickly.
However, the synthesizable descriptions of the IPs have to be provided by the users. \\
This project enhances SoCLib by providing synthesizable VHDL for standard IPs.
In addition, HLS tools such as UGH and GAUT automatically produce a synthesizable 
description of an IP (coprocessor) from a sequential algorithm.
%\par
%%% 2 IRISA ?
%%% 2 ASIP tool such as ... IRISA
%%% 2 ...
%%% 2 Coach uses pattern extractions from ROMA
%\par
%%% 2 LIP ?
\par
The different points proposed in this project cover the priorities defined by the commission 
experts in the field of Information Society Technologies (IST) for embedded
systems: ``Concepts, methods and tools for designing systems that deal with system complexity
and allow applications and various products to be mapped efficiently onto embedded platforms,
considering resource constraints (delays, power, memory, etc.), security and quality
of service''.
\\
Our team aims at covering all the steps of the design flow of architecture synthesis.
Our project overcomes the complexity of using the various synthesis tools and description 
languages required today to design architectures.

\section{Scientific and Technical Description}
\subsection{State of the art}
\hspace{2cm}\begin{scriptsize}\begin{verbatim}
% 2. SCIENTIFIC AND TECHNICAL DESCRIPTION
% 2.1. STATE OF THE ART
% (3 pages maximum)
% Describe the scientific context and stakes of the project by presenting
% a national and international state of the art that summarizes the current
% knowledge on the subject. Highlight any preliminary results.
% Include the necessary bibliographic references in appendix 7.1.
\end{verbatim}
\end{scriptsize}
Our project covers several critical domains of system design in order
to achieve high-performance computing. Starting from a high-level description, we aim 
at automatically generating both the hardware and software components of the system.

\subsubsection{High Performance Computing}
Accelerating high-performance computing (HPC) applications with field-programmable
gate arrays (FPGAs) can potentially improve performance. 
However, using FPGAs presents significant challenges [1].
First, the operating frequency of an FPGA is low compared to that of a high-end microprocessor.
Second, by Amdahl's law, the performance of an HPC/FPGA application is unusually sensitive 
to the implementation quality [2].
Finally, high-performance computing programmers are a highly sophisticated but scarce 
resource. Such programmers are expected to readily use new technology but lack the time 
to learn a completely new skill such as logic design [3]. 
\\
HPC/FPGA hardware is only now emerging and is in its early commercial stages, 
but the corresponding design techniques and tools have not yet caught up. 
Thus, much effort is required to develop design tools that translate high-level
language programs into FPGA configurations.
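\par
As an illustration of the sensitivity to implementation quality mentioned above, Amdahl's law
can be written out explicitly. The symbols $p$ (fraction of the execution time migrated to the FPGA)
and $s$ (speedup obtained on that fraction), as well as the numerical values below, are illustrative
assumptions and not measurements taken from the cited works:
\[
S(p,s) = \frac{1}{(1-p) + \frac{p}{s}} .
\]
For instance, with $p = 0.9$, a coprocessor reaching $s = 10$ gives $S = 1/(0.1 + 0.09) \approx 5.3$,
whereas a poorer implementation reaching only $s = 5$ gives $S = 1/(0.1 + 0.18) \approx 3.6$.
Even as $s \to \infty$, $S$ cannot exceed $1/(1-p) = 10$.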

\hspace{2cm}\begin{scriptsize}\begin{verbatim}
[1] M.B. Gokhale et al., Promises and Pitfalls of Reconfigurable
Supercomputing, Proc. 2006 Conf. Eng. of Reconfigurable
Systems and Algorithms, CSREA Press, 2006, pp. 11-20;
http://nis-www.lanl.gov/~maya/papers/ersa06_gokhale_paper.pdf
[2] D. Buell, Programming Reconfigurable Computers: Language
Lessons Learned, keynote address, Reconfigurable Systems
Summer Institute 2006, 12 July 2006;
http://gladiator.ncsa.uiuc.edu/PDFs/rssi06/presentations/00_Duncan_Buell.pdf
[3] T. Van Court et al., Achieving High Performance
with FPGA-Based Computing, Computer, vol. 40, no. 3, 
pp. 50-57, Mar. 2007, doi:10.1109/MC.2007.79
\end{verbatim}
\end{scriptsize}

\subsubsection{System Synthesis}
Today, several solutions for system design are proposed and commercialized. The most common are
those provided by Altera and Xilinx to promote their FPGA devices.
\\
The Xilinx System Generator for DSP [http://www.xilinx.com/tools/sysgen.htm] is a plug-in to 
Simulink that enables designers to develop high-performance DSP systems for Xilinx FPGAs. 
Designers can design and simulate a system using MATLAB and Simulink. The tool will then 
automatically generate synthesizable Hardware Description Language (HDL) code mapped to Xilinx 
pre-optimized algorithms. 
However, this tool targets only DSP-based algorithms and Xilinx FPGAs, and cannot handle a complete
SoC. Thus, it is not really a system synthesis tool.
\\
In contrast, SOPC Builder [CITATION] makes it possible to describe a system, to synthesize it, 
to program it into a target FPGA and to upload a software application. 
% FIXME (C2H from Altera: runs fast but uses a huge amount of resources)
Nevertheless, SOPC Builder does not provide any facility to synthesize coprocessors.
Users have to provide the synthesizable description with a suitable bus interface.
\\
In addition, Xilinx System Generator and SOPC Builder are closed worlds, since each one imposes
its own IPs, which are not interchangeable.
We can conclude that the existing commercial or free tools do not cover the whole system 
synthesis process in a fully automatic way. Moreover, they are bound to a particular device family
and to a particular IP library.

\subsubsection{High Level Synthesis}
High-Level Synthesis translates a sequential algorithmic description and a set of constraints 
(area, power, frequency, ...) into a micro-architecture at the Register Transfer Level (RTL).
Several academic and commercial tools are available today. 
The most common tools are SPARK [HLS1], GAUT [HLS2] and UGH [HLS3] in the academic world, 
and CatapultC [HLS4], PICO [HLS5] and Cynthesizer [HLS6] in the commercial world.
Despite their maturity, their usage is restricted for the following reasons:
\begin{itemize}
\item They do not accurately respect the frequency constraint when they target an FPGA device.
Their error is about 10 percent. This is a problem when the generated component is integrated
into a SoC, since it will slow down the whole system.
\item These tools take into account only one or a few constraints simultaneously, whereas realistic
designs are multi-constrained. 
Moreover, a low power consumption constraint is mandatory for embedded systems. 
However, it is not yet well handled by common synthesis tools.
\item The parallelism is extracted from the initial algorithm. To get more parallelism or to reduce
the amount of required memory, the user must rewrite the algorithm, although techniques such as
polyhedral transformations exist to increase the intrinsic parallelism.
\item Although they share the same input language (C/C++), they are sensitive to the style in
which the algorithm is written. Consequently, engineering work is required to switch from 
one tool to another.
\item The HLS tools are not integrated into an architecture and system exploration tool.
Thus, a designer who needs to accelerate a software part of the system must adapt it manually 
to the HLS input dialect and perform engineering work to exploit the synthesis result 
at the system level.
\end{itemize}
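As a rough illustration of the first limitation above, assume (hypothetically) a single-clock SoC
whose clock frequency is bounded by its slowest component $i$. If the SoC targets
$f_{\mathrm{target}} = 100$\,MHz and the HLS-generated coprocessor falls short of the constraint
by 10 percent, then
\[
f_{\mathrm{SoC}} = \min_i f_i = (1 - 0.10)\, f_{\mathrm{target}} = 90\,\mathrm{MHz},
\]
so every component of the SoC, and not only the generated coprocessor, runs about 10 percent
slower than intended.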
Given these limitations, it is necessary to create a new generation of tools that reduces the gap 
between the specification of a heterogeneous system and its hardware implementation.

\hspace{2cm}\begin{scriptsize}\begin{verbatim}
[HLS1] SPARK, University of California, San Diego
[HLS2] GAUT, UBS/Lab-STIC
[HLS3] UGH
[HLS4] CatapultC, Mentor
[HLS5] PICO, Synfora
[HLS6] Cynthesizer, Forte Design Systems
\end{verbatim}
\end{scriptsize}

\subsubsection{Application Specific Instruction Processors}

ASIPs (Application-Specific Instruction-Set Processors) are programmable processors in 
which both the instruction set and the micro-architecture have been tailored to a given
application domain (e.g. video processing), or to a specific application. 
This specialization usually offers a good compromise between performance (w.r.t. a pure software
implementation on an embedded CPU) and flexibility (w.r.t. an application-specific 
hardware coprocessor).
In spite of their obvious advantages, using/designing ASIPs remains a difficult
task, since it involves designing both a micro-architecture and a compiler for this
architecture. Besides, to our knowledge, there is still no available open-source
design flow\footnote{There are commercial tools such a } for ASIP design, even though such a tool would
be valuable in the context of a system-level design exploration tool.    

In this context, ASIP design based on Instruction Set Extensions (ISEs) has 
received a lot of interest [NIOSII,TENSILICA], %~\cite{NIOS2,ST70}
as it makes micro-architecture synthesis 
more tractable\footnote{ISEs rely on a template micro-architecture in which 
only a small fraction of the architecture has to be specialized.} and helps ASIP
designers focus on compilers, for which there are still many open problems 
[CODES04,FPGA08].
This approach, however, has a strong weakness: it also significantly reduces 
the opportunities for achieving good speedups (most speedups remain between 1.5x and 
2.5x), because ISE performance is generally limited by I/O constraints, as 
ISEs generally rely on the main CPU register file to access data.

% (
%automaticcaly extraction ISE candidates for application code \cite{CODES04}, 
%performing efficient instruction selection and/or storage resource (register) 
%allocation \cite{FPGA08}).  
 

To cope with this issue, recent approaches~[DAC09,CODES08] %\cite{DAC09,DAC08}
advocate the use of 
micro-architectural ISE models in which the coupling between the processor micro-architecture
and the ISE component is tightened so as to allow the ISE to overcome the register 
I/O limitations. However, these approaches tackle the problem from a compiler/simulation 
point of view and do not address the problem of generating synthesizable representations for 
these models. 

We therefore strongly believe that there is a need for an open framework which
would allow researchers and system designers to:
\begin{itemize}
\item Explore the various levels of interaction between the original CPU micro-architecture
and its extension (for example through a Domain Specific Language targeted at micro-architecture
specification and synthesis).
\item Retarget the compiler instruction-selection passes (or prototype new passes) so as
to take advantage of these ISEs.
\item Provide a complete system-level integration for using ASIPs as SoC building blocks 
(integration with application-specific blocks, MPSoC, etc.).
\end{itemize}

\hspace{2cm}
\begin{scriptsize}\begin{verbatim} 

[CODES08] Theo Kluter, Philip Brisk, Paolo Ienne, and Edoardo Charbon, Speculative DMA for
Architecturally Visible Storage in Instruction Set Extensions

[DAC09] Theo Kluter, Philip Brisk, Paolo Ienne, Edoardo Charbon, Way Stealing: Cache-assisted
Automatic Instruction Set Extensions.

[CODES04] Pan Yu, Tulika Mitra, Scalable Custom Instructions Identification for
Instruction Set Extensible Processors.

[FPGA08] Quang Dinh, Deming Chen, Martin D. F. Wong, Efficient ASIP Design for Configurable
Processors with Fine-Grained Resource Sharing.

[NIOSII] Nios II Custom Instruction User Guide

\end{verbatim}

\end{scriptsize}
%, either 
%because the target architecture is proprietary, or because the compiler 
%technology is closed/commercial.


% We propose to explore how to tighten the coupling of the extensions and 
% the underlyoing template micro-architecture.
% *  Thightne Even if such 
% an approach offers less flexiblity and forbids very tight coupling 
% between the extensions and the template micro-architecture, it makes the 
% design of the micro-architecture more tractable and amenable to a fully 
% automated flow.
% \\
% \\
% In the context of the COACH project, we propose to add to the 
% infra-structure a design flow targeted to automatic instruction set 
% extension for the MIPS-based CPU, which will come as a complement or an 
% alternative to the other proposed approaches (hardware accelerator, 
% multi processors).
% 

\subsubsection{Automatic Parallelization}
\begin{Large}\begin{verbatim}
-- TO BE COMPLETED BY LIP
\end{verbatim}
\end{Large}
%CA%   Parallel machines are often difficult and painful to program
%CA%   directly, and one would like the compiler to %do the job, that is to
%CA%   turn automatically a sequential program into a parallel form. This
%CA%   transformation is referred as {\em automatic parallelization}, and has
%CA%   been widely addressed since the 70s. Automatic parallelization
%CA%   relies on data dependences, which cannot be computed in general.%, as
%CA%   %one cannot predict at compile time the variable values on a given
%CA%   %execution point. 
%CA%   This negative result led researchers to (i) find a
%CA%   program model in which no approximation is needed (ie polyhedral
%CA%   model), (ii) make conservative approximations (iii) remark that
%CA%   variable values are known at runtime, and make the decisions during
%CA%   program execution. The latter approach is obviously not suitable
%CA%   there, as we target hardware generation. We will give there a short
%CA%   history of the approaches that fall in the first category.
%CA%
%CA%%   In the real world, we deal with a limited amount of processors,
%CA%%   and the communication between processors takes time, and is
%CA%%   critical for performance. %Whenever we have synchronisation-free
%CA%%   parallelism, like for embarrassingly parallel kernels, this is not an
%CA%%   issue. But in case of pipelined parallelism, we need to reduce
%CA%%   communications as much as possible. 
%CA%%   So we also need to find parallelism toghether with a proper mapping
%CA%%   of operations and data on physical processors.
%CA%
%CA%   As programs spend most of there time in loops, the community has
%CA%   focused on loop transformations that reveal parallelism. 
%CA%%unimodulaire
%CA%   The first approaches worked on perfect loop nests, where the tree
%CA%   formed by the nested loops is linear. In this program model, the
%CA%   loops can be seen as a basis that drive the way the iteration
%CA%   domain will be described. Hence, a first idea was to change this
%CA%   basis such that one vector (one loop) at least is parallel. To ease
%CA%   the code generation, the area of defined by the news vectors must
%CA%   be a unit volume. %Otherwise, one would produce an homothetic
%CA%%   expansion of the iteration domain, which will force to put modulos
%CA%%   in the target code. 
%CA%   For this reason, these transformations are called {\em unimodular
%CA%   transformations}.
%CA%%tiling
%CA%   
%CA%   The next approaches include {\em loop tiling}, a simple
%CA%   partitioning of the iteration domain, whose initial purpose is to
%CA%   execute every partition on a different processor. %In the same way,
%CA%   The execution order is modified with a proper unimodular
%CA%   transformation, then the tiles are obtained by cutting the
%CA%   iteration domain with the hyperplanes directed by every vector of
%CA%   the new (unimodular) basis, at regular intervals. When the tiling
%CA%   hyperplanes are properly chosen, we can both improve data-locality
%CA%   on every processor, and reduce the communication between two
%CA%   different tiles (which will be mapped on processors). This last
%CA%   property implying that one tend to find a degree of parallelism as
%CA%   great as possible.
%CA%
%CA%%affine scheduling
%CA%   The previous approaches were restricted to kernels with perfect
%CA%   loop nests (linear loop tree), and unimodular transformations. The
%CA%   last generation of approaches broke with these limitations. We now
%CA%   choose a different basis for every assignment, without the
%CA%   unimodularity restriction. A dual way to present this is the
%CA%   notion of {\em affine schedule}, introduced by Feautrier [part1],
%CA%   which simply assigns an abstract execution date to every assignment
%CA%   execution. As an assignment execution is exactly characterised by
%CA%   the current value of the loop counters (iteration vector), the
%CA%   affine schedule is defined as an affine form of the iteration
%CA%   vector (hence the 'affine'). The affine property allows the use of
%CA%   integer programming techniques to compute the schedule. With this
%CA%   approach, additional techniques are required to allocate the
%CA%   parallel operations and the data to processors in an efficient way
%CA%   [griebl, feautrier].
%CA%
%CA%%modularity??
%CA%%%    As loop nests are no longer perfect, we deal with (transformed)
%CA%%%    iteration domains of different dimensions, which can possibly (and
%CA%%%    certainly) overlap. At this point, a new code generation technique
%CA%%%    was needed. The first attempt is due to Chamsky et al. [??], and
%CA%%%    was improved by Quillere et al. [QRW]. The code is now implemented
%CA%%%    in an efficient tool [cloog], that gave a new life to polyhedral
%CA%%%    techniques.
%CA%
%CA%%pluto's tiling
%CA%   The tiling techniques were extended to non-perfect loop nests with
%CA%   {\em affine partitioning}. Affine partitioning is to affine
%CA%   scheduling what (original) tiling was to unimodular
%CA%   transformations. An affine partitioning assigns to every assignment
%CA%   its coordinates in the basis defined by the normals to the tiling
%CA%   hyperplanes. Recently, a way to compute efficient hyperplanes was
%CA%   found [uday], with good data locality, and communications
%CA%   confined to a small neighborhood around every processor.
%CA%
%CA%\subsubsection{Source-level Memory Optimisation}
%CA%  The HLS process allows memory to be customised, which impacts the final
%CA%  circuit size and power consumption. Though most HLS tools already
%CA%  try to optimise memory usage, it is better to provide an independent
%CA%  source-level pass that can be reused with different tools and in
%CA%  other contexts.
%CA%
%CA%  There exist many approaches to evaluate and reduce the memory
%CA%  requirement of a program. The first approaches are concerned with
%CA%  {\em memory size estimation}, which can be defined as the maximum
%CA%  number of memory cells used at the same time [clauss,zhao]. These
%CA%  approaches provide an estimation as a symbolic expression of program
%CA%  parameters, which can be used further to guide loop optimisations.
%CA%  However, no explicit way to reduce the memory size is given.  {\em
%CA%  Intra-array reuse} approaches break with this limitation, and
%CA%  collapse the array cells which are not alive at the same time. The
%CA%  collapse is done by means of a data layout transformation, specified
%CA%  with a linear (modular) mapping.  The first approaches were
%CA%  developed at IMEC [balasa,catthoor], and basically try to linearize
%CA%  the arrays and fold them using a modulo operator. Then, Lefebvre et
%CA%  al. propose a solution to fold independently the array dimensions
%CA%  [lefebvre]. Finally, Darte et al. provide a general formalisation of
%CA%  the problem, together with a solution that subsumes the previous
%CA%  approaches [darte]. A first implementation was made with the tool
%CA%  {\sc Bee}, but there are still many limitations.
%CA%
%CA%  \begin{itemize} 
%CA%  \item The tool is restricted to regular programs, whereas more
%CA%  general programs could be handled with a conservative array liveness
%CA%  analysis.
%CA%
%CA%  \item Programs depending on parameters (inputs) are not handled,
%CA%  which prevents handling, for example, the body of tiled loops.
%CA%
%CA%  \item The new array layout can break spatial locality, and thus impact
%CA%  performance and power consumption. One would like to get a mapping
%CA%  that improves or, at least, preserves the spatial locality of the
%CA%  program.
%CA%
%CA%  \item Finally, the final memory compaction strongly depends on the
%CA%  program schedule, and is naturally hindered by the
%CA%  parallelism. Consequently, there is a trade-off to find with
%CA%  automatic parallelization. An ideal solution would be to reduce
%CA%  memory usage, while preserving parallelism.  
%CA%  \end{itemize}

\subsubsection{Interfaces}
\begin{Large}\begin{verbatim}
-- TO BE COMPLETED BY INSA: state of the art
\end{verbatim}
\end{Large}
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Objectives and innovation aspects}
\hspace{2cm}\begin{scriptsize}\begin{verbatim}
% 2.2. OBJECTIVES AND AMBITIOUS/INNOVATIVE NATURE OF THE PROJECT
% (2 pages maximum)
% Describe the scientific/technical objectives of the project.
% Present the expected scientific advance. State the originality and the
% ambitious nature of the project.
% Detail the scientific and technical barriers to be lifted by the project.
% Where relevant, describe the final product(s) developed by the end of the
% project that show its innovative nature.
% Present the expected results, proposing if possible success and evaluation
% criteria suited to the type of project, so that the results can be assessed
% at the end of the project.
% Where applicable (programmes requiring multidisciplinarity), show how the
% scientific disciplines fit together.
\end{verbatim}
\end{scriptsize}

% the scientific/technical objectives of the project.
The objective of the COACH project is to develop a complete framework for
HPC (accelerating existing software applications)
and for embedded applications (implementing an application on a low-power standalone device).
The design steps are presented in figure~\ref{coach-flow}.
\begin{figure}[hbtp]\leavevmode\center
  \includegraphics[width=.8\linewidth]{flow}
  \caption{\label{coach-flow} COACH flow.}
\end{figure}
\begin{description}
\item[HPC setup] Here the user splits the application into two parts: the host application,
which remains on the PC, and the SoC application, which migrates to the SoC.
The framework provides a simulation model for evaluating the partitioning.
\item[SoC design] In this phase,
the user can obtain simulators of the SoC at different abstraction levels by giving the COACH framework
a SoC description.
This description consists of a process network corresponding to the SoC application,
an OS, an instance of a generic hardware platform
and a mapping of the processes onto the platform components. The supported mappings are
software (the process runs on a SoC processor),
XXXpeci (the process runs on a SoC processor enhanced with dedicated instructions),
and hardware (the process runs in a coprocessor generated by HLS and plugged into the SoC bus).
\item[Application compilation] Once the SoC description is validated, COACH automatically generates
an FPGA bitstream containing the hardware platform with the SoC application software, and
an executable containing the host application. The user can launch the application by
loading the bitstream onto the FPGA and running the executable on the PC.
\end{description}
 
% the expected scientific advance. State the originality and the
% ambitious nature of the project.
The main scientific contribution of the project is to unify various synthesis techniques
(same input and output formats), allowing the user to switch from one to another
without engineering effort and even to chain them, for example, to run a polyhedral transformation
before synthesis.
Another advantage of this framework is that it provides different abstraction levels from
a single description.
Finally, this description is independent of the device family and its hardware implementation
is automatically generated.

% Detail the scientific and technical barriers to be lifted by the project.
System design is a very complex task, and this project tries to simplify it
as much as possible. To this end, we have to overcome the following scientific
and technological barriers.
\begin{itemize}
\item The main problem in HPC is the communication between the PC and the SoC.
This problem has two aspects. The first is efficiency. The second is to
eliminate the engineering effort needed to implement it at different abstraction levels.
\item The COACH design flow follows a top-down approach. In such a case,
the required performance figures of a coprocessor (clock frequency, maximum number of cycles for
a given computation, power consumption, etc.) are imposed by the other system
components. The challenge is to let the user control the synthesis
process accurately. For instance, the clock frequency must not be a result of the RTL synthesis
but a strict synthesis constraint.
\item HLS tools are sensitive to the style in which the algorithm is written.
In addition, they are not integrated into an architecture and system
exploration tool.
Consequently, engineering work is required to switch from one tool to another,
to integrate the resulting simulation model into an architectural exploration tool
and to synthesize the generated RTL description.
%CA Additional preprocessing, source-level transformations, are thus
%CA required to improve the process.
%CA Particularly, this includes parallelism exposure and efficient memory mapping.
\item Most HLS tools translate a sequential algorithm into a coprocessor
containing a single data-path and a single finite state machine (FSM). In this way,
only fine-grained parallelism, i.e. instruction-level parallelism (ILP), is exploited.
The challenge is to identify the coarse-grained parallelism and to generate,
from a sequential algorithm, a coprocessor containing multiple communicating
tasks (data-paths and FSMs).
\end{itemize}
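To illustrate the last barrier, the following C sketch contrasts a sequential kernel with the same
computation restructured as two tasks communicating through a FIFO.
It is only a software model under stated assumptions: the names \texttt{fifo\_t}, \texttt{stage1},
\texttt{stage2}, \texttt{run\_sequential} and \texttt{run\_two\_tasks} are hypothetical and belong
neither to COACH nor to any HLS tool, and the round-robin loop merely stands in for two hardware
tasks that would run concurrently.
\begin{scriptsize}\begin{verbatim}
#include <stddef.h>

#define FIFO_DEPTH 4

/* Bounded FIFO standing in for a hardware channel between two tasks
   (hypothetical type, for illustration only). */
typedef struct {
    int    data[FIFO_DEPTH];
    size_t head, count;
} fifo_t;

static int fifo_push(fifo_t *f, int v) {
    if (f->count == FIFO_DEPTH) return 0;      /* full: producer stalls */
    f->data[(f->head + f->count) % FIFO_DEPTH] = v;
    f->count++;
    return 1;
}

static int fifo_pop(fifo_t *f, int *v) {
    if (f->count == 0) return 0;               /* empty: consumer stalls */
    *v = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return 1;
}

static int stage1(int x) { return x * x; }     /* hypothetical computation */
static int stage2(int x) { return x + 1; }     /* hypothetical computation */

/* Sequential form: the kind of code that classical HLS maps onto a
   single data-path controlled by a single FSM. */
void run_sequential(const int *in, int *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = stage2(stage1(in[i]));
}

/* Same computation as two tasks linked by a FIFO. Each loop iteration
   models one scheduling step in which both tasks may make progress. */
void run_two_tasks(const int *in, int *out, size_t n) {
    fifo_t ch = { {0}, 0, 0 };
    size_t produced = 0, consumed = 0;
    while (consumed < n) {
        int v;
        /* task A (producer): computes stage1 and writes to the channel */
        if (produced < n && fifo_push(&ch, stage1(in[produced])))
            produced++;
        /* task B (consumer): reads the channel and computes stage2 */
        if (fifo_pop(&ch, &v))
            out[consumed++] = stage2(v);
    }
}
\end{verbatim}
\end{scriptsize}
Both functions fill the output array with the same values. In a coarse-grained implementation,
each task would presumably become its own data-path and FSM, with the FIFO acting as the
communication channel between them.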

%Present the expected results, proposing if possible success and evaluation
%criteria suited to the type of project, so that the results can be assessed
%at the end of the project.
The main result is the framework. Concretely, it is composed of:
2 HPC communication schemes with their implementation,
5 HLS tools (control-dominated HLS, data-dominated HLS, coarse-grained HLS,
memory optimisation HLS and ASIP),
3 SystemC-based virtual prototyping environments extended with synthesizable
RTL IP cores (generic, ALTERA/NIOS/AVALON, XILINX/MICROBLAZE/OPB),
one design space exploration tool,
and one operating system (OS).
\\
The framework functionality will be demonstrated with XXX-EXAMPLE1, XXX-EXAMPLE2
and XXX-EXAMPLE3 on 4 architectures (generic/XILINX, generic/ALTERA,
proprietary/XILINX, proprietary/ALTERA).

%% \section{}
%% %3. SCIENTIFIC AND TECHNICAL PROGRAMME, PROJECT ORGANISATION
%% \subsection{}
%% %3.1. SCIENTIFIC PROGRAMME AND PROJECT STRUCTURE
%% %(2 pages maximum)
%% %Present the scientific programme and justify the breakdown of the work
%% %programme into tasks, consistently with the objectives pursued.
%% %Use a diagram to present the links between the different tasks
%% %(work breakdown structure).
%% %The tasks represent the main phases of the project. Their number is limited.
%% %Do not forget the activities and actions related to dissemination and
%% %exploitation of results.
%% 
%% %PUT A FIGURE HERE DESCRIBING THE TASKS AND THEIR INTERACTIONS (WITH THE FLOW
%% %AS A WATERMARK?)
%% \subsection{}
%% %3.2. PROJECT MANAGEMENT
%% %(2 pages maximum)
%% %State the organisational aspects of the project and the coordination arrangements
%% %(if possible with a dedicated coordination task: see task 0 of submission
%% %document A).
%% \subsection{}
%% %3.3. DESCRIPTION OF THE WORK BY TASK
%% %(ideally 1 or 2 pages per task)
%% %For each task, describe:
%% %- the objectives of the task and possible success indicators,
%% %- the task leader and the partners involved (this may be
%% %shown graphically),
%% %- the detailed work programme of the task,
%% %- the deliverables of the task,
%% %- the contributions of the partners ("who does what"),
%% %- the methods and technical choices, and the way in which
%% %the solutions will be provided,
%% %- the risks of the task and the fallback solutions envisaged.