Our project covers several critical domains of system design in order
to achieve high-performance computing. Starting from a high-level description, we aim
to automatically generate both the hardware and the software components of the system.

\subsubsection{High Performance Computing}
Accelerating high-performance computing (HPC) applications with field-programmable
gate arrays (FPGAs) can potentially improve performance.
However, using FPGAs presents significant challenges~\cite{hpc06a}.
First, the operating frequency of an FPGA is low compared to that of a high-end microprocessor.
Second, according to Amdahl's law, HPC/FPGA application performance is unusually sensitive
to the implementation quality~\cite{hpc06b}.
Finally, HPC programmers are a highly sophisticated but scarce
resource. Such programmers are expected to readily adopt new technology, but they lack the time
to learn a completely new skill such as logic design~\cite{hpc07a}.
\\
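As a reminder, Amdahl's law bounds the overall speedup $S$ obtained when a
fraction $f$ of the execution time is accelerated by a factor $a$ (the symbols
$S$, $f$ and $a$ are our own notation, used here only for illustration):
\[
  S = \frac{1}{(1-f) + \frac{f}{a}} .
\]
For instance, assuming $f = 0.9$, a kernel accelerated by $a = 10$ gives
$S = 1/(0.1 + 0.09) \approx 5.3$, whereas a less careful implementation reaching
only $a = 5$ gives $S = 1/(0.1 + 0.18) \approx 3.6$. Under these illustrative
figures, halving the quality of the hardware kernel costs about one third of the
overall speedup, and no implementation can exceed $1/(1-f) = 10$.
\\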
HPC/FPGA hardware is only now emerging and is still in its early commercial stages,
but the corresponding design techniques have not yet caught up.
Thus, much effort is required to develop design tools that translate high-level
language programs into FPGA configurations.

\subsubsection{System Synthesis}
Today, several solutions for system design are proposed and commercialized.
The most common are those provided by Altera and Xilinx to promote their
FPGA devices.
\\
The Xilinx System Generator for DSP~\cite{system-generateur-for-dsp} is a
plug-in to Simulink that enables designers to develop high-performance DSP
systems for Xilinx FPGAs.
Designers can design and simulate a system using MATLAB and Simulink. The
tool then automatically generates synthesizable Hardware Description
Language (HDL) code mapped to Xilinx pre-optimized algorithms.
However, this tool targets only DSP-based algorithms and Xilinx FPGAs, and it
cannot handle a complete SoC. Thus, it is not really a system synthesis tool.
\\
In contrast, SOPC Builder~\cite{spoc-builder} allows the designer to describe a
system, synthesize it, program it into a target FPGA and upload a
software application.
% FIXME(C2H from Altera: runs fast but uses a huge amount of resources)
Nevertheless, SOPC Builder does not provide any facility to synthesize
coprocessors. The system designer must provide the synthesizable description
together with a suitable bus interface.
\\
In addition, Xilinx System Generator and SOPC Builder are closed worlds,
since each one imposes its own IPs, which are not interchangeable.
We can conclude that the existing commercial or free tools do not
cover the whole system synthesis process in a fully automatic way. Moreover,
they are bound to a particular device family and to a particular IP library.

\subsubsection{High Level Synthesis}
High Level Synthesis (HLS) translates a sequential algorithmic description and a
set of constraints (area, power, frequency, ...) into a micro-architecture at
Register Transfer Level (RTL).
Several academic and commercial tools are available today. The most common
tools are SPARK~\cite{spark04}, GAUT~\cite{gaut08} and UGH~\cite{ugh08} in the
academic world, and CATAPULTC~\cite{catapult-c}, PICO~\cite{pico} and
CYNTHESIZER~\cite{cynthetizer} in the commercial world. Despite their
maturity, their usage is restricted by the following limitations:
\begin{itemize}
\item They do not accurately respect the frequency constraint when they target an FPGA device.
Their error is about 10 percent. This is a problem when the generated component is integrated
in a SoC, since it will slow down the whole system.
\item These tools take into account only one or a few constraints simultaneously, while realistic
designs are multi-constrained.
Moreover, a low power consumption constraint is mandatory for embedded systems.
However, it is not yet well handled by common synthesis tools.
\item The parallelism is extracted from the initial algorithm. To get more parallelism or to reduce
the amount of required memory, the user must rewrite it, although techniques such as polyhedral
transformations exist to increase the intrinsic parallelism.
\item Although they have the same input language (C/C++), they are sensitive to the style in
which the algorithm is written. Consequently, engineering work is required to switch from
one tool to another.
\item The HLS tools are not integrated into an architecture and system exploration tool.
Thus, a designer who needs to accelerate a software part of the system must adapt it manually
to the HLS input dialect and perform engineering work to exploit the synthesis result
at the system level.
\end{itemize}
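To make the first limitation concrete, consider a purely illustrative case (the
figures below are assumptions, not measurements, and we read the 10 percent
error as a clock period that is 10 percent longer than requested). Let
$T_{\mathrm{target}}$ be the requested clock period and $T_{\mathrm{real}}$ the
period actually reached by the synthesized component:
\[
  T_{\mathrm{real}} = 1.1 \, T_{\mathrm{target}}
  \quad\Longrightarrow\quad
  f_{\mathrm{real}} = \frac{f_{\mathrm{target}}}{1.1} \approx 0.91 \, f_{\mathrm{target}} .
\]
With $f_{\mathrm{target}} = 100$~MHz, the component only reaches about
$91$~MHz. If the SoC uses a single clock domain, every other component must
then be clocked at this lower frequency as well.
\par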
Given these limitations, it is necessary to create a new generation of tools that reduces the gap
between the specification of a heterogeneous system and its hardware implementation.

\subsubsection{Application Specific Instruction Processors}

ASIPs (Application-Specific Instruction-Set Processors) are programmable
processors in which both the instruction set and the micro-architecture have
been tailored to a given application domain (e.g.\ video processing), or to a
specific application. This specialization usually offers a good compromise
between performance (w.r.t.\ a pure software implementation on an embedded
CPU) and flexibility (w.r.t.\ an application-specific hardware coprocessor).
In spite of their obvious advantages, using and designing ASIPs remains a
difficult task, since it involves designing both a micro-architecture and a
compiler for this architecture. Besides, to our knowledge, there is still
no available open-source design flow\footnote{There are commercial tools
such a } for ASIP design, even though such a tool would be valuable in the
context of a system-level design exploration tool.
\par
In this context, ASIP design based on Instruction Set Extensions (ISEs) has
received a lot of interest~\cite{NIOS2,ST70}, as it makes micro-architecture synthesis
more tractable\footnote{ISEs rely on a template micro-architecture in which
only a small fraction of the architecture has to be specialized.} and helps ASIP
designers to focus on compilers, for which there are still many open
problems~\cite{CODES04,FPGA08}.
This approach, however, has a strong weakness: it also significantly reduces
the opportunities for achieving good speedups (most speedups remain between 1.5x and
2.5x), since ISE performance is generally limited by I/O constraints, as
ISEs generally rely on the main CPU register file to access data.
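\par
A simple way to picture this I/O bottleneck is the following back-of-the-envelope
bound. It is only an illustrative model, and the symbols $n_{\mathrm{in}}$,
$n_{\mathrm{out}}$, $r$ and $w$ are introduced here for the example. If an ISE
pattern consumes $n_{\mathrm{in}}$ operands and produces $n_{\mathrm{out}}$
results through a register file offering $r$ read ports and $w$ write ports,
moving its data takes at least
\[
  c_{\mathrm{I/O}} = \max\left( \left\lceil \frac{n_{\mathrm{in}}}{r} \right\rceil ,
                                \left\lceil \frac{n_{\mathrm{out}}}{w} \right\rceil \right)
\]
cycles, whatever the speed of the specialized datapath. Assuming a
register file with $r = 2$ and $w = 1$, a pattern with $n_{\mathrm{in}} = 6$
and $n_{\mathrm{out}} = 2$ needs at least $\max(3, 2) = 3$ cycles for operand
transfers alone.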
\par
% (
%automatically extracting ISE candidates from application code \cite{CODES04},
%performing efficient instruction selection and/or storage resource (register)
%allocation \cite{FPGA08}).
To cope with this issue, recent approaches~\cite{DAC09,DAC08} advocate the use of
micro-architectural ISE models in which the coupling between the processor micro-architecture
and the ISE component is tightened so as to allow the ISE to overcome the register
I/O limitations. However, these approaches tackle the problem from a compiler/simulation
point of view and do not address the problem of generating synthesizable representations for
these models.

We therefore strongly believe that there is a need for an open framework that
would allow researchers and system designers to:
\begin{itemize}
\item Explore the various levels of interaction between the original CPU micro-architecture
and its extension (for example through a Domain Specific Language targeted at micro-architecture
specification and synthesis).
\item Retarget the compiler instruction-selection passes (or prototype new passes) so as
to be able to take advantage of these ISEs.
\item Provide a complete system-level integration for using ASIPs as SoC building blocks
(integration with application-specific blocks, MPSoC, etc.).
\end{itemize}

\subsubsection{Automatic Parallelization}
% FIXME:LIP FIXME:PF FIXME:CA
% Paul, I am not sure this really is a state of the art
% Christophe, what you sent me is in obsolete/body.tex
\mustbecompleted{
Hardware is inherently parallel. On the other hand, high-level languages,
like C or Fortran, are abstractions of the processors of the 1970s, and
hence are sequential. One of the aims of an HLS tool is therefore to
extract hidden parallelism from the source program, and to infer enough
hardware operators for its efficient exploitation.
\\
Present-day HLS tools search for parallelism in linear pieces of code
acting only on scalars -- the so-called basic blocks. On the other hand,
it is well known that most programs, especially in the fields of signal
processing and image processing, spend most of their time executing loops
acting on arrays. Efficient use of the large amount of hardware available
in the next generation of FPGA chips requires parallelism far beyond
what can be extracted from basic blocks only.
\\
The Compsys team of LIP has built an automatic parallelizer, Syntol, which
handles restricted C programs -- the well-known polyhedral model --,
computes dependences and builds a symbolic schedule. The schedule is
a specification for a parallel program. The parallelism itself can be
expressed in several ways: as a system of threads, as data-parallel
operations, or as a pipeline. In the context of the COACH project, one
of the tasks will be to decide which form of parallelism is best suited
to hardware, and how to convey the results of Syntol to the actual
synthesis tools. One of the advantages of this approach is that the
resulting degree of parallelism can be easily controlled, e.g.\ by
adjusting the number of threads, as a means of exploring the
area/performance trade-off of the resulting design.
\\
Another point is that potentially parallel programs necessarily involve
arrays: two operations that write to the same location must be executed
in sequence. In synthesis, arrays translate to memory. However, in FPGAs,
the amount of on-chip memory is limited, and access to an external memory
has a high time penalty. Hence the importance of reducing the size of
temporary arrays to the minimum necessary to support the requested degree
of parallelism. Compsys has developed a stand-alone tool, Bee, based
on research by A. Darte, F. Baray and C. Alias, which can be extended
into a memory optimizer for COACH.
}
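\par
As an illustration of what such a symbolic schedule looks like, consider a
textbook example (it is not taken from Syntol, and the names $a$, $N$ and
$\theta$ are chosen for this sketch only). For the loop nest computing
$a[i][j] = a[i-1][j] + a[i][j-1]$ with $1 \le i, j \le N$, iteration $(i,j)$
depends on iterations $(i-1,j)$ and $(i,j-1)$. A valid schedule $\theta$ must
execute each iteration strictly after those it depends on:
\[
  \theta(i,j) \ge \theta(i-1,j) + 1 ,
  \qquad
  \theta(i,j) \ge \theta(i,j-1) + 1 .
\]
The choice $\theta(i,j) = i + j$ satisfies both constraints, since
$\theta(i,j) - \theta(i-1,j) = \theta(i,j) - \theta(i,j-1) = 1$. All
iterations with the same value of $i + j$ can then run in parallel, so the
nest completes in $2N - 1$ steps instead of the $N^2$ steps of the sequential
execution, provided that up to $N$ operators are available.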

\subsubsection{Interfaces}
\newcommand{\ip}{\sc ip}
\newcommand{\dma}{\sc dma}
\newcommand{\soc}{\sc SoC}
\newcommand{\mwmr}{\sc mwmr}
Designing the hardware/software interface has been a difficult task since the advent
of complex systems on chip. After the first co-design
environments~\cite{Coware,Polis,Ptolemy}, the Hardware Abstraction Layer
was defined so that software applications can be developed without knowledge of low-level
hardware implementation details. In~\cite{jerraya}, Yoo and Jerraya
propose an extensible {\sc api} instead of a unique hardware
abstraction layer. System-level communication frameworks have also been
introduced~\cite{JerrayaPetrot,mwmr}.
\par
A good abstraction of a hardware/software interface has been proposed
in~\cite{Jantsch}: it is composed of a software driver, a {\dma} and a
bus interface circuit. Automatic wrapping between bus protocols has
been the subject of many papers~\cite{Avnit,smith,Narayan,Alberto}. These works
do not use a {\dma}. In COACH, the hardware/software interface is defined at a
higher level and uses burst communication in the bus interface circuit to
improve communication performance.
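\par
The benefit of burst communication can be estimated with a simple cost model,
given here only as a sketch: the symbols $n$, $b$ and $\ell$ and the numerical
values are assumptions made for the example, not characteristics of the COACH
bus. If each bus transaction costs a fixed overhead of $\ell$ cycles
(arbitration and addressing) plus one cycle per transferred word, moving $n$
words takes
\[
  C_{\mathrm{single}} = n \, (\ell + 1)
  \qquad \mbox{and} \qquad
  C_{\mathrm{burst}} = \left\lceil \frac{n}{b} \right\rceil (\ell + b)
\]
cycles with single-word transfers and with bursts of $b$ words, respectively.
For $n = 64$, $\ell = 10$ and $b = 16$, this gives $64 \times 11 = 704$ cycles
against $4 \times 26 = 104$ cycles.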
\par
There are two important projects related to the efficient interfacing of
data-flow {\ip}s: the work of Park and Diniz~\cite{Park01} and the
Lip6 work on {\mwmr}~\cite{mwmr}. Park and Diniz~\cite{Park01} proposed
a generic interface that can be parameterized to connect different
data-flow {\ip}s. This approach does not require the communications to be
statically known and proposes a runtime resolution of conflicting
accesses to the bus. To our knowledge, this approach has not been pursued
further since 2003.
\par
{\mwmr}~\cite{mwmr} stands for both a computation model (multi-writer,
multi-reader {\sc fifo}) inherited from Kahn Process Networks and a bus
interface circuit protocol. Like the work of Park and Diniz, {\mwmr}
does not assume a static communication flow. This makes the
software driver simple to write, but introduces additional complexity due
to the mutual exclusion locks needed to protect the shared memory.
\par
In COACH, we propose to use recent work on hardware/software
interfaces~\cite{FR-vlsi} that relies on a {\em clever} {\dma} responsible for
managing data streams. One assumption is that the behavior of the {\ip}s can
be statically described. A similar choice has been made in the Faust
{\soc}~\cite{FAUST}, which includes the {\em smart memory engine} component.
Jantsch and O'Nils already noted in~\cite{Jantsch} the huge complexity
of writing this hardware/software interface. In COACH, the interface will be
generated automatically; this is one goal of the CITI
contribution to COACH.