% vim:set spell:
% vim:spell spelllang=en:

Our project covers several critical domains of system design in order
to achieve high-performance computing. Starting from a high-level description, we aim
to generate both the hardware and the software components of the system automatically.

\subsubsection{High Performance Computing}
% A market swallowed up by GPGPU architectures such as Nvidia's Fermi; CUDA programming language
The High-Performance Computing (HPC) world is composed of three main families of architectures:
many-core, GPGPU (General-Purpose computing on Graphics Processing Units) and FPGA.
The first two families dominate the market, benefiting
from the strength and influence of mass-market leaders (Intel, Nvidia).
%such as Intel for many-core CPU and Nvidia for GPGPU.
In this market, FPGA architectures are emerging and very promising.
By adapting the architecture to the software, % (the opposite is done in the other families)
FPGA architectures enable better performance
(typically accelerations between 10x and 100x)
while being smaller and consuming less energy (and producing less heat).
However, using FPGAs presents significant challenges~\cite{hpc06a}.
First, the operating frequency of an FPGA is low compared to that of a high-end microprocessor.
Second, as a consequence of Amdahl's law, HPC/FPGA application performance is unusually sensitive
to the implementation quality~\cite{hpc06b}.
% Thus, the performance strongly relies on the detected parallelism.
% (to sum up the last two points)
Finally, efficient design methodologies are required in order to
hide FPGA complexity and the underlying implementation subtleties from HPC users,
so that they do not have to change their habits and can reach a design productivity
equivalent to that of the other families~\cite{hpc07a}.
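
To make the second challenge concrete, recall the usual statement of Amdahl's law
(the notation below is ours and is introduced only for illustration): if a fraction $p$ of the
sequential execution time is offloaded to the FPGA and accelerated by a factor $s$,
the overall speedup is
\[
  S(p,s) = \frac{1}{(1-p) + p/s}, \qquad \lim_{s \to \infty} S(p,s) = \frac{1}{1-p}.
\]
As a purely illustrative calculation, $p = 0.9$ and $s = 100$ give
$S = 1/(0.1 + 0.009) \approx 9.2$, whereas $p = 0.99$ with the same $s$ gives
$S = 1/(0.01 + 0.0099) \approx 50.3$.
Under these assumed figures, raising the accelerated fraction from 90\% to 99\% matters far more
than the raw kernel acceleration, which is one way to read the sensitivity to implementation quality.
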

% State of the art: FPGA
HPC/FPGA hardware is only now emerging and in early commercial stages,
but design techniques have not yet caught up.
Industrial (Mitrionics~\cite{hpc08}, Gidel~\cite{hpc09}, Convey Computer~\cite{hpc10}) and academic (CHREC)
research on HPC/FPGA is mainly conducted in the USA.
None of the approaches developed in this research fully meets the
challenges described above. For example, Convey Computer proposes application-specific instruction set extensions of x86 cores in an FPGA accelerator,
but the generation of these extensions is not automated and requires hardware design skills.
Mitrionics has an elegant solution based on a compute engine specifically
developed for high-performance execution in FPGAs. Unfortunately, the design flow
is based on a new programming language (Mitrion-C), which implies significant designer effort and poor portability.
% tool relying on operator libraries (XtremeData),
% Should we mention the OpenFPGA consortium, whose goal is "to accelerate the incorporation of reconfigurable computing technology in high-performance and enterprise applications"?

Thus, much effort is required to develop design tools that translate high-level
language programs into FPGA configurations.
Moreover, as already noted in~\cite{hpc11}, Dynamic Partial Reconfiguration~\cite{hpc12}
(DPR, which allows part of the FPGA to be reconfigured while the rest keeps running)
appears very promising for improving HPC performance as well as reducing the required area.

\subsubsection{System Synthesis}
Today, several solutions for system design are proposed and commercialized.
However, the existing commercial or free tools do not
cover the whole system synthesis process in a fully automatic way. Moreover,
they are bound to a particular device family and to a particular IP library.
The most commonly used tools are provided by \altera and \xilinx to promote their
FPGA devices. These two representative tools used to synthesize SoCs on FPGAs
are introduced below.
\\
The \xilinx System Generator for DSP~\cite{system-generateur-for-dsp} is a
plug-in to Simulink that enables designers to develop high-performance DSP
systems for \xilinx FPGAs.
Designers can design and simulate a system using MATLAB and Simulink. The
tool then automatically generates synthesizable Hardware Description
Language (HDL) code mapped to \xilinx pre-optimized algorithms.
However, this tool only targets DSP-based algorithms and \xilinx FPGAs, and
cannot handle a complete SoC. Thus, it is not really a system synthesis tool.
\\
In contrast, SOPC Builder~\cite{spoc-builder} allows designers to describe a
system, to synthesize it, to program it into a target FPGA and to upload a
software application.
% FIXME(C2H from \altera, runs fast but uses a monstrous amount of resources)
Nevertheless, SOPC Builder does not provide any facilities to synthesize
coprocessors. The system designer must provide the synthesizable description
with the appropriate bus interface. Design space exploration is thus limited,
and SystemC simulation is possible neither at the transactional level nor at the
cycle-accurate level.
\\
In addition, \xilinx System Generator and SOPC Builder are closed worlds,
since each one imposes its own IPs, which are not interchangeable.
%By using SOPC Builder~\cite{spoc-builder} from \altera, designers can select and
%parameterize components from an extensive drop-down list of IP cores (I/O core, DSP,
%processor, bus core, ...) as well as incorporate their own IP.
%Designers can then generate a synthesized netlist, simulation test bench and custom
%software library that reflect the hardware configuration.
%Nevertheless, SOPC Builder does not provide any facilities to synthesize coprocessors and to
%simulate the platform at a high design level (systemC).
%In addition, SOPC Builder is proprietary and only works together with \altera's Quartus compilation
%tool to implement designs on \altera devices (Stratix, Arria, Cyclone).
%PICO~\cite{pico} and CATAPULT-C~\cite{catapult-c} allow to synthesize
%coprocessors from a C++ description.
%Nevertheless, they can only deal with data dominated applications and they do not handle
%the platform level.
%Similarly, the System Generator for DSP~\cite{system-generateur-for-dsp} is a plug-in to
%Simulink that enables designers to develop high-performance DSP systems for \xilinx FPGAs.
%Designers can design and simulate a system using MATLAB and Simulink. The tool will then
%automatically generate synthesizable Hardware Description Language (HDL) code mapped to
%\xilinx pre-optimized macro-cells.
%However, this tool targets only DSP based algorithms.
%\\
%Consequently, a designer developing an embedded system needs to master four different
%design environments:
%\begin{enumerate}
% \item a virtual prototyping environment such as SoCLib for system level exploration,
% \item an architecture compiler (such as SOPC Builder from \altera, or System generator
% from \xilinx) to define the hardware architecture,
% \item one or several HLS tools (such as PICO~\cite{pico} or CATAPULT-C~\cite{catapult-c}) for
% coprocessor synthesis,
% \item and finally backend synthesis tools (such as Quartus or Synopsys) for the bit-stream generation.
%\end{enumerate}
%Furthermore, mixing these tools requires an important interfacing effort and this makes
%the design process very complex and achievable only by designers skilled in many domains.

\subsubsection{High Level Synthesis}
High Level Synthesis (HLS) translates a sequential algorithmic description and a
set of constraints (area, power, frequency, ...) into a micro-architecture at
the Register Transfer Level (RTL).
Several academic and commercial tools are available today. The most common
tools are SPARK~\cite{spark04}, GAUT~\cite{gaut08} and UGH~\cite{ugh08} in the
academic world and CATAPULT-C~\cite{catapult-c}, PICO~\cite{pico} and
CYNTHESIZER~\cite{cynthetizer} in the commercial world. Despite their
maturity, their usage is restricted by the following limitations~\cite{IEEEDT,CATRENE,HLSBOOK}:
\begin{itemize}
\item HLS tools are not integrated into an architecture and system exploration tool.
Thus, a designer who needs to accelerate a software part of the system must adapt it manually
to the HLS input dialect and perform engineering work to exploit the synthesis result
at the system level.
\item Current HLS tools cannot target applications that are both control-oriented and data-oriented.
\item HLS tools take into account only one or a few constraints simultaneously, while realistic
designs are multi-constrained.
Moreover, a low power consumption constraint is mandatory for embedded systems,
yet it is handled poorly or not at all by the available synthesis tools.
\item Parallelism is extracted from the initial algorithmic specification.
To get more parallelism or to reduce the amount of memory required in the SoC, the user
must rewrite the algorithmic specification, even though techniques such as polyhedral
transformations exist to increase the intrinsic parallelism.
\item While they support limited loop transformations such as loop unrolling and loop
pipelining, current HLS tools do not support design space exploration, either
through automatic loop transformations or through memory mapping.
\item Despite having the same input language (C/C++), they are sensitive to the style in
which the algorithm is written. Consequently, engineering work is required to switch from
one tool to another.
\item They do not accurately respect the frequency constraint when they target an FPGA device.
Their error is about 10 percent. This is problematic when the generated component is integrated
in a SoC, since it will slow down the whole system.
\end{itemize}
In view of these limitations, it is necessary to create a new generation of tools that reduces the gap
between the specification of a heterogeneous system and its hardware implementation~\cite{HLSBOOK,IEEEDT}.

\subsubsection{Application Specific Instruction Processors}

ASIPs (Application-Specific Instruction-Set Processors) are programmable
processors in which both the instruction set and the micro-architecture have
been tailored to a given application domain (e.g.\ video processing), or to a
specific application. This specialization usually offers a good compromise
between performance (w.r.t.\ a pure software implementation on an embedded
CPU) and flexibility (w.r.t.\ an application-specific hardware coprocessor).
In spite of their obvious advantages, using or designing ASIPs remains a
difficult task, since it involves designing both a micro-architecture and a
compiler for this architecture. Besides, to our knowledge, there is still
no open-source design flow available for ASIP design, even though such a tool
would be valuable in the
context of a system-level design exploration tool.
\par
In this context, ASIP design based on Instruction Set Extensions (ISEs) has
received a lot of interest~\cite{NIOS2}, as it makes micro-architecture synthesis
more tractable\footnote{ISEs rely on a template micro-architecture in which
only a small fraction of the architecture has to be specialized.} and helps ASIP
designers to focus on compilers, for which there are still many open
problems~\cite{ARC08}.
This approach, however, has a severe weakness: it also significantly reduces
the opportunities for achieving good speedups (most speedups remain between 1.5x and
2.5x), because ISE performance is generally limited by I/O constraints, as
ISEs generally rely on the main CPU register file to access data.

% (
%automatically extracting ISE candidates from application code \cite{CODES04},
%performing efficient instruction selection and/or storage resource (register)
%allocation \cite{FPGA08}).
To cope with this issue, recent approaches~\cite{DAC09,CODES08,TVLSI06} advocate the use of
micro-architectural ISE models in which the coupling between the processor micro-architecture
and the ISE component is tightened so as to allow the ISE to overcome the register
I/O limitations. However, these approaches generally tackle the problem from a compiler/simulation
point of view and do not address the problem of generating synthesizable representations for
these models.

We therefore strongly believe that there is a need for an open framework which
would allow researchers and system designers to:
\begin{itemize}
\item Explore the various levels of interaction between the original CPU micro-architecture
and its extension (for example through a Domain Specific Language targeted at micro-architecture
specification and synthesis).
\item Retarget the compiler instruction-selection pass
(or prototype new passes) so as to be able to take advantage of these ISEs.
\item Provide a complete system-level integration for using ASIPs as SoC building blocks
(integration with application-specific blocks, MPSoC, etc.).
\end{itemize}

\subsubsection{Automatic Parallelization}

The problem of compiling sequential programs for parallel computers
has been studied since the advent of the first parallel architectures
in the 1970s. The basic approach consists in applying program transformations
which expose or increase the potential parallelism, while guaranteeing
the preservation of the program semantics. Most of these transformations
just reorder the operations of the program; some of them modify its
data structures. Dependences (exact or conservative) are checked to guarantee
the legality of the transformation.

This has led to the invention of many loop transformations (loop fusion,
loop splitting, loop skewing, loop interchange, loop unrolling, ...)
which interact in a complicated way. More recently, it has been noticed
that all of these are just changes of basis in the iteration domain of
the program. This has led to the introduction of the polyhedral
model~\cite{FP:96,DRV:2000}, in which the combination of two transformations is
simply a matrix product.
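
As an illustration (a sketch in our own notation, restricted to unimodular transformations of a
two-dimensional loop nest with iteration vector $(i,j)^T$), loop skewing and loop interchange
can be written as the matrices
\[
  S = \left(\begin{array}{cc} 1 & 0 \\ 1 & 1 \end{array}\right), \qquad
  P = \left(\begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array}\right),
\]
and applying the skewing followed by the interchange corresponds to the product
\[
  T = P\,S = \left(\begin{array}{cc} 1 & 1 \\ 1 & 0 \end{array}\right), \qquad
  T \left(\begin{array}{c} i \\ j \end{array}\right) = \left(\begin{array}{c} i+j \\ i \end{array}\right).
\]
The legality check then amounts to verifying that every dependence distance vector $d$ remains
lexicographically positive, $T d \succ 0$. For a hypothetical nest whose distance vectors are
$(1,0)^T$ and $(0,1)^T$, we get $T(1,0)^T = (1,1)^T$ and $T(0,1)^T = (1,0)^T$: both dependences
are carried by the new outer loop, so the new inner loop can be executed in parallel.
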

Since hardware is inherently parallel, finding parallelism in sequential
programs is an important prerequisite for HLS. The large FPGA chips of
today can accommodate much more parallelism than is available in basic blocks.
The polyhedral model is the ideal tool for finding more parallelism in
loops.

As a side effect, it has been observed that the polyhedral model is a useful
tool for many other optimizations, such as memory reduction and locality
improvement. Note, however,
that the polyhedral model \emph{stricto sensu} applies only to
very regular programs. Its extension to more general programs is
an active research subject.

%\subsubsection{High Performance Computing}
%Accelerating high-performance computing (HPC) applications with field-programmable
%gate arrays (FPGAs) can potentially improve performance.
%However, using FPGAs presents significant challenges~\cite{hpc06a}.
%First, the operating frequency of an FPGA is low compared to a high-end microprocessor.
%Second, based on Amdahl law, HPC/FPGA application performance is unusually sensitive
%to the implementation quality~\cite{hpc06b}.
%Finally, High-performance computing programmers are a highly sophisticated but scarce
%resource. Such programmers are expected to readily use new technology but lack the time
%to learn a completely new skill such as logic design~\cite{hpc07a} .
%\\
%HPC/FPGA hardware is only now emerging and in early commercial stages,
%but these techniques have not yet caught up.
%Thus, much effort is required to develop design tools that translate high level
%language programs to FPGA configurations.
