\section{Project context}
\hspace{2cm}\begin{scriptsize}\begin{verbatim}
% 1. PROJECT CONTEXT AND POSITIONING
% (1 page maximum) General presentation of the problem the project proposes
% to address and of the work framework (fundamental research, industrial
% research or experimental development).
\end{verbatim}
\end{scriptsize}
An embedded system is an application integrated into one or several chips, either
to accelerate it or to embed it into a small device such as a personal
digital assistant (PDA).
This topic has been investigated since the 1980s using Application-Specific Integrated Circuits (ASIC),
Digital Signal Processors (DSP) and parallel computing on multiprocessor machines or networks.
More recently, since the end of the 1990s, other technologies have appeared, such as Very Long Instruction Word (VLIW)
processors, Application-Specific Instruction-set Processors (ASIP), Systems on Chip (SoC) and
Multi-Processor SoCs (MPSoC).
\\
During the last decades, embedded system design was reserved to major industrial companies targeting high-volume markets,
because of the design and fabrication costs.
Nowadays, Field Programmable Gate Arrays (FPGA), such as the Virtex-5 from Xilinx and the Stratix IV from Altera,
can implement a SoC with multiple processors and several coprocessors for less than 10K euros
per device. In addition, High Level Synthesis (HLS) is becoming more mature and makes it possible to automate
design and to drastically decrease its cost in terms of manpower. Thus, FPGA and HLS together
are spreading to HPC, including for small companies targeting low-volume markets.
\par
To obtain an efficient embedded system, the designer has to take the application characteristics into account
when choosing among the technologies above.
This choice is not easy, and in most cases the designer has to try different technologies before retaining the
most suitable one.
\\
The first objective of COACH is to provide an open-source framework for designing embedded systems
on FPGA devices.
The COACH framework allows the designer to explore various software/hardware partitions of the
target application, to run timing and functional simulations, and to automatically generate both
the software and the synthesizable description of the hardware.
The main topics of the project are:
\begin{itemize}
\item
Design space exploration: this consists in analysing the application running on the FPGA, choosing the target
technology (SoC, MPSoC, ASIP, ...) and partitioning the tasks between hardware and software according to the
technology choice. This exploration is mainly driven by throughput, latency and power consumption
criteria.
\item
Micro-architectural exploration: when hardware components are required, the HLS tools of the framework
generate them automatically. At this stage, the framework provides various HLS tools allowing
exploration of the micro-architectural design space. The exploration criteria are again throughput, latency
and power consumption.
% FIXME
%CA At this stage, preliminary source-level transformations will be
%CA required to improve the efficiency of the target component.
%CA COACH will also provide such facilities, such as automatic parallelization
%CA and memory optimisation.
\item
Performance measurement: for each point of the design space exploration, metrics are available for criteria
such as throughput, latency, power consumption, area, memory allocation and data locality.
They are evaluated using virtual prototyping, estimation or analysis methodologies.
\item
Targeted hardware technology: the COACH description of a system is independent of the FPGA family.
Every point of the design exploration space can be implemented on any FPGA having the required resources.
In particular, COACH handles both the Altera and the Xilinx FPGA families.
\end{itemize}
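As an illustration of how such multi-criteria exploration can prune the design space, the C sketch
below keeps only the Pareto-optimal design points, i.e.\ the points that are not dominated by any
other point (no other point is at least as good on latency, power and area while being strictly
better on one of them). It is a hypothetical example, not part of the COACH framework: the
\texttt{design\_point} structure, its units and the simple quadratic-time filter are assumptions
made for this illustration.
\begin{scriptsize}\begin{verbatim}
#include <assert.h>
#include <stddef.h>

/* One explored design point (hypothetical fields and units). All three
   criteria are to be minimised; a throughput criterion could be handled
   by storing its opposite. */
typedef struct {
    double latency_us;   /* latency, microseconds          */
    double power_mw;     /* power consumption, milliwatts  */
    double area_luts;    /* FPGA area, look-up tables      */
} design_point;

/* 1 if a dominates b: a is no worse on every criterion and strictly
   better on at least one. */
static int dominates(const design_point *a, const design_point *b)
{
    int no_worse = a->latency_us <= b->latency_us &&
                   a->power_mw   <= b->power_mw &&
                   a->area_luts  <= b->area_luts;
    int better   = a->latency_us <  b->latency_us ||
                   a->power_mw   <  b->power_mw ||
                   a->area_luts  <  b->area_luts;
    return no_worse && better;
}

/* Sets keep[i] = 1 for every non-dominated point (the Pareto front) and
   0 otherwise; returns the number of points kept. O(n^2) comparisons. */
size_t pareto_front(const design_point *pts, size_t n, int *keep)
{
    size_t kept = 0;
    assert(n == 0 || (pts != NULL && keep != NULL));
    for (size_t i = 0; i < n; i++) {
        keep[i] = 1;
        for (size_t j = 0; j < n; j++) {
            if (j != i && dominates(&pts[j], &pts[i])) {
                keep[i] = 0;
                break;
            }
        }
        kept += (size_t)keep[i];
    }
    return kept;
}
\end{verbatim}
\end{scriptsize}
With such a filter, a point at 12\,$\mu$s, 210\,mW and 5200 LUTs is discarded as soon as another
point reaches, say, 10\,$\mu$s, 200\,mW and 5000 LUTs; the designer then only has to arbitrate among
the remaining trade-offs.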
As an extension of embedded system design, COACH also deals with High Performance Computing (HPC).
In HPC, the targeted application is an existing one running on a PC. COACH helps the designer
accelerate it by migrating critical parts into a SoC implemented on an FPGA plugged into the PC bus.
\par
COACH results from the will of several laboratories to unify their know-how and skills in the
following domains: operating systems and hardware communication (TIMA, CITI), SoC and MPSoC (LIP6 and TIMA),
ASIP (IRISA) and HLS (LIP6, Lab-STIC and LIP). The project objective is to integrate these various
domains into a single free framework (licence ...) that hides these domains and their
different tools from the user as much as possible.

\subsection{Economic context and interest}
\hspace{2cm}\begin{scriptsize}\begin{verbatim}
% 1.1. ECONOMIC AND SOCIETAL CONTEXT AND STAKES
% (2 pages maximum)
% Describe the economic, social and regulatory context in which the
% project is situated, presenting an analysis of the social, economic,
% environmental and industrial stakes. Where possible, give quantified
% arguments, for example: relevance and scope of the project with respect
% to economic demand (market analysis, trend analysis), competitive
% analysis, cost-reduction indicators, market prospects (fields of
% application, ...). Indicators of environmental gains, life cycle.
\end{verbatim}
\end{scriptsize}
Microelectronics makes it possible to integrate complex functions into products, to increase their
commercial attractiveness and to improve their competitiveness. The multimedia and communication
sectors have taken advantage of microelectronics thanks to the development of
design methodologies and tools for real-time embedded systems. Many other sectors could
benefit from microelectronics if these methodologies and tools were adapted to their features.
The Non-Recurring Engineering (NRE) costs involved in designing and manufacturing an ASIC are
very high: an IC fabrication plant costs several billion euros, and fabricating
a specific circuit costs several million; for example, a conservative estimate for a 65\,nm ASIC project is 10 million USD.
Consequently, it is generally unfeasible to design and fabricate ASICs in
low volumes, and ICs are designed to cover a broad spectrum of applications at the cost of
performance degradation.
\\
Today, FPGAs have become important actors in the computational domain originally dominated
by microprocessors and ASICs. Just like microprocessors, FPGA-based systems can be reprogrammed
on a per-application basis. At the same time, FPGAs offer significant performance benefits over
microprocessor implementations for a number of applications. Although their performance generally
remains an order of magnitude below that of equivalent ASIC implementations, the low cost
(500 euros to 10K euros), fast time to market and flexibility of FPGAs make them an attractive
choice for low-to-medium volume applications.
Since their introduction in the mid-1980s, FPGAs have evolved from a simple,
low-capacity gate array technology to devices (Altera Stratix III, Xilinx Virtex-5) that
provide a mix of coarse-grained datapath units, memory blocks, microprocessor cores,
on-chip A/D conversion, and gate counts in the millions. This high logic capacity makes it possible to implement
complex systems such as multi-processor platforms with application-dedicated coprocessors.
Table~\ref{fpga_market} shows the estimated worldwide FPGA market for the coming years, covering
various application domains. The ``high end'' lines concern only FPGAs with a high logic capacity, able
to implement complex systems.
This market is expanding significantly: its high-end segment is estimated at 914\,M\$ in 2012.
With FPGAs, NRE costs are reduced to design costs. This boosts the development of methodologies
and tools that automate design and reduce its cost.
\begin{table}\centering
\begin{tabular}{|l|r|r|r|}\hline
Segment & 2010 & 2011 & 2012 \\\hline\hline
Communications & 1,867 & 1,946 & 2,096 \\
\quad High end & 467 & 511 & 550 \\\hline
Consumer & 550 & 592 & 672 \\
\quad High end & 53 & 62 & 75 \\\hline
Automotive & 243 & 286 & 358 \\
\quad High end & - & - & - \\\hline
Industrial & 1,102 & 1,228 & 1,406 \\
\quad High end & 177 & 188 & 207 \\\hline
Military/Aerospace & 566 & 636 & 717 \\
\quad High end & 56 & 65 & 82 \\\hline\hline
Total FPGA/PLD & 4,659 & 5,015 & 5,583 \\
Total high-end FPGA & 753 & 826 & 914 \\\hline
\end{tabular}
\caption{\label{fpga_market} Gartner estimate of worldwide FPGA/PLD consumption (millions of US\$)}
\end{table}
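As a rough check of this growth, the high-end totals of the Gartner estimate (753\,M\$ in 2010 and
914\,M\$ in 2012) correspond to a compound annual growth rate of about 10\%, slightly above that of
the whole FPGA/PLD market (4,659\,M\$ to 5,583\,M\$ over the same period):
\[
\left(\frac{914}{753}\right)^{1/2} - 1 \approx 0.102,
\qquad
\left(\frac{5583}{4659}\right)^{1/2} - 1 \approx 0.095 .
\]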
\par
Today, several companies (Atipa, BlueArc, Bull, Chelsio, Convey, Cray, DataDirect, Dell, HP,
Wild Systems, IBM, Intel, Microsoft, Myricom, NEC, NVIDIA, etc.) build systems in which the demand
for very high performance (HPC) takes precedence over other requirements. They tend to use the highest-performing
devices, such as multi-core CPUs, GPUs, large FPGAs and custom ICs, together with the most innovative
architectures and algorithms. These companies are present in different ``traditional'' application and market
segments such as (ad-hoc) computing clusters, servers and storage, networking and telecom, ASIC
emulation and prototyping, military/aerospace, etc. FPGA vendors currently estimate the HPC market size
at 214\,M\$.
This market is dominated by solutions based on multi-core CPUs and GPUs, and the expansion
of FPGA-based solutions is limited by the lack of design-flow automation. Nowadays, there are neither commercial
nor free tools covering the whole design process.
For instance, with SOPC Builder from Altera, users can select and parameterize IP components
from an extensive drop-down list of communication, digital signal processor (DSP), microprocessor
and bus interface cores, and can also incorporate their own IP. Designers can then generate
a synthesized netlist, a simulation test bench and a custom software library that reflect the hardware
configuration.
Nevertheless, SOPC Builder provides no facilities to synthesize coprocessors\emph{ [I
(Steven) disagree: the C2H compiler bundled with SOPC Builder does a pretty good job at this]} or to
simulate the platform at a high design level (SystemC).
In addition, SOPC Builder is proprietary and only works together with Altera's Quartus compilation
tool to implement designs on Altera devices (Stratix, Arria, Cyclone).
PICO [CITATION] and Catapult C [CITATION] can synthesize coprocessors from a C/C++ description.
Nevertheless, they can only deal with data-dominated applications, and they do not handle the
platform level.
The Xilinx System Generator for DSP [http://www.xilinx.com/tools/sysgen.htm] is a plug-in to
Simulink that enables designers to develop high-performance DSP systems for Xilinx FPGAs.
Designers can design and simulate a system using MATLAB and Simulink. The tool then
automatically generates synthesizable Hardware Description Language (HDL) code mapped to Xilinx
pre-optimized algorithms.
However, this tool targets only DSP-based algorithms.
\\
Consequently, designers developing an embedded system need to master, for example,
SoCLib for design exploration,
SOPC Builder at the platform level,
PICO for synthesizing the data-dominated coprocessors
and Quartus for design implementation.
This requires a significant tool-interfacing effort and makes the design process very complex
and achievable only by designers skilled in many domains.
The COACH project integrates all these tools into a single framework that hides them from the user.
The objective is to allow \textbf{pure software} developers to build embedded systems.
\par
Combining a framework dedicated to software developers with an FPGA target allows FPGA-based solutions
to gain market share over HPC solutions based on multi-core CPUs and GPUs.
Moreover, one can expect that small and even very small companies will be able to offer embedded
systems and acceleration solutions for standard software applications at acceptable prices, thanks
to the elimination of the huge hardware investment required by ASIC-based solutions.
\\
This new market may explode as micro-computing did in the 1980s. That success was due
to the low cost of the first micro-computers (compared to mainframes) and to the advent of high-level
programming languages, which allowed a large number of programmers to launch software engineering
start-ups.

\subsection{Project position}
\hspace{2cm}\begin{scriptsize}\begin{verbatim}
% 1.2. PROJECT POSITIONING
% (2 pages maximum)
% Specify:
% - the positioning of the project with respect to the context developed
%   above: with respect to competing, complementary or previous projects
%   and research, patents and standards.
% - the positioning of the project with respect to the thematic axes of
%   the call for proposals.
% - the positioning of the project at the European and international levels.
\end{verbatim}
\end{scriptsize}
The aim of this project is to propose an open-source framework for architecture synthesis
targeting mainly field programmable gate array circuits (FPGA).
\\% LIP6/TIMA
To evaluate the different architectures, the project uses the prototyping platform
of the SoCLib ANR project (2006-2009).
\\% IRISA
The project will also borrow from the ROMA ANR project (2007-2009) and the ongoing
joint INRIA-STMicroelectronics Nano2012 project. In particular, we will adapt existing pattern
extraction algorithms and datapath merging techniques to the synthesis of customized
ASIPs.
\\
\textcolor{gris75}{Steven: I propose adding a link with the BioWic project: on the HPC
application side, we also hope to benefit from the experience in hardware acceleration of
bioinformatics algorithms/workflows gathered by the CAIRN group in the context of the ANR
BioWic project (2009-2011), so as to be able to validate the framework on
real-life HPC applications.}

\par
%%% 1 -- COULD EACH OF YOU PLEASE ADD (IF POSSIBLE) ONE LINE
%%% 1 -- REFERRING TO AN ANR OR EUROPEAN PROJECT
%%% 1 -- European or ANR projects reused or continued
%%% 1 LIP6/TIMA/LAB-STIC OK
Regarding expertise in High Level Synthesis (HLS), the project leverages know-how acquired over 15 years
with the GAUT project developed at the Lab-STIC laboratory and the UGH project developed at the LIP6
and TIMA laboratories. \\
Regarding architecture synthesis skills, the project builds on know-how acquired over 10 years
with the COSY European project (1998-2000) and the DISYDENT project developed at LIP6. \\
%%% 1 IRISA OK
Regarding Application-Specific Instruction-set Processor (ASIP) design, the CAIRN group at INRIA Bretagne
Atlantique benefits from several years of expertise in the domain of retargetable compilers (Armor/Calife
since 1996, and the Gecos compiler since 2002).


% LIP FIXME: A BIT LONG AND OFF-TOPIC
%CA% The source-level transformations required by the HLS tools will be
%CA% designed in the {\em polyhedral model}, a general framework
%CA% initiated by Paul Feautrier 20 years ago. The programs handled in
%CA% the polyhedral model are such that loop iterators describe a
%CA% polyhedron (hence the name). This includes most of the kernels used
%CA% in embedded applications. This property makes it possible to design
%CA% precise analyses by means of integer programming techniques.
%CA% %active & international community
%CA% %technology transfer (Reservoir)
%CA% The polyhedral community is very active, and technology
%CA% transfer has now started. Reservoir Labs Inc., a company based in
%CA% New York, is currently integrating the latest polyhedral developments
%CA% into its commercial compiler.
%CA% %technology transfer (gcc)
%CA% Also, polyhedra are progressively migrating into the {\sc GNU GCC}
%CA% compiler, via {\sc Graphite}, a module initially developed by
%CA% Sebastian Pop.
%CA% %existing tools
%CA% Several tools have been developed in the polyhedral community,
%CA% such as {\sc Piplib} (parametric integer programming library), and
%CA% {\sc Polylib}, a library providing set operations on polyhedra. Both
%CA% tools are almost mandatory in polyhedral tools, and have reached
%CA% a sufficient level of maturity to be considered as standard.
%syntol & bee ???
% END
% and on more than 15 years of experience on parallel hardware generation
% in the polyhedral model in the CAIRN group (MMAlpha software
% developed in the group since 1996).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% 2 -- TO BE COMPLETED (SHORT)
%%% 2 -- For polyhedral transformation and memory optimization ... LIP
%%% 2 -- For ASIP IRISA
%%% 2 -- For ... CITI
%%% 2 -- For ... TIMA
\par
The SoCLib ANR platform was developed by 11 laboratories and 6 companies. It makes it possible to
describe hardware architectures with a shared memory space and to deploy software
applications on them to evaluate their performance.
The heart of this platform is a library containing simulation models (in SystemC)
of hardware IP cores such as processors, buses, networks, memories and I/O controllers.
The platform also provides embedded operating systems and software/hardware
communication components useful to implement applications quickly.
However, the synthesizable descriptions of the IPs have to be provided by the users. \\
This project enhances SoCLib by providing synthesizable VHDL for standard IPs.
In addition, HLS tools such as UGH and GAUT automatically produce a synthesizable
description of an IP (coprocessor) from a sequential algorithm.
%\par
%%% 2 IRISA ?
%%% 2 ASIP tool such as ... IRISA
%%% 2 ...
%%% 2 COACH uses pattern extractions from ROMA
%\par
%%% 2 LIP ?
\par
The different points proposed in this project cover priorities defined by the Commission
experts in the field of Information Society Technologies (IST) for embedded
systems: ``Concepts, methods and tools for designing systems dealing with systems complexity
and allowing applications and various products to be efficiently deployed on embedded platforms,
considering resource constraints (delays, power, memory, etc.), security and quality of
service''.
\\
Our team aims at covering all the steps of the architecture synthesis design flow.
Our project overcomes the complexity of using the various synthesis tools and description
languages required today to design architectures.

\section{Scientific and Technical Description}
\subsection{State of the art}
\hspace{2cm}\begin{scriptsize}\begin{verbatim}
% 2. SCIENTIFIC AND TECHNICAL DESCRIPTION
% 2.1. STATE OF THE ART
% (3 pages maximum)
% Describe the scientific context and stakes in which the project is
% situated, presenting a national and international state of the art
% that surveys current knowledge on the subject. Show any preliminary
% results. Include the necessary bibliographic references in appendix 7.1.
\end{verbatim}
\end{scriptsize}
Our project covers several critical domains of system design in order
to achieve high performance computing. Starting from a high-level description, we aim
at automatically generating both the hardware and the software components of the system.

\subsubsection{High Performance Computing}
Accelerating high-performance computing (HPC) applications with field-programmable
gate arrays (FPGAs) can potentially improve performance.
However, using FPGAs presents significant challenges [1].
First, the operating frequency of an FPGA is low compared to that of a high-end microprocessor.
Second, because of Amdahl's law, HPC/FPGA application performance is unusually sensitive
to the implementation quality [2].
Finally, high-performance computing programmers are a highly sophisticated but scarce
resource. Such programmers are expected to readily use new technology but lack the time
to learn a completely new skill such as logic design [3].
\\
HPC/FPGA hardware is only now emerging and is still in its early commercial stages,
but programming techniques have not yet caught up.
Thus, much effort is required to develop design tools that translate high-level
language programs into FPGA configurations.
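Amdahl's law makes the second challenge concrete. If a fraction $p$ of the original run time is
migrated to the FPGA and that part is accelerated by a factor $s$, the overall speedup is
$S = 1/\left((1-p) + p/s\right)$, which can never exceed $1/(1-p)$. The C sketch below simply
evaluates this formula; the function name \texttt{amdahl\_speedup} is ours, and the figures in the
note that follows are illustrative values, not measurements.
\begin{scriptsize}\begin{verbatim}
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction p (0 <= p <= 1) of the
   original run time is accelerated by a factor s > 0 and the remaining
   fraction (1 - p) still runs at its original speed. */
double amdahl_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
\end{verbatim}
\end{scriptsize}
With $p = 0.9$, a kernel made 10$\times$ faster on the FPGA gives $S \approx 5.3$, and even a
100$\times$ faster kernel only gives $S \approx 9.2$, against an upper bound of 10: both the
unaccelerated code and the efficiency of the accelerated part weigh heavily on the final result.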

\hspace{2cm}\begin{scriptsize}\begin{verbatim}
[1] M.B. Gokhale et al., Promises and Pitfalls of Reconfigurable
    Supercomputing, Proc. 2006 Conf. Eng. of Reconfigurable Systems
    and Algorithms, CSREA Press, 2006, pp. 11-20;
    http://nis-www.lanl.gov/~maya/papers/ersa06_gokhale_paper.pdf
[2] D. Buell, Programming Reconfigurable Computers: Language Lessons
    Learned, keynote address, Reconfigurable Systems Summer Institute
    2006, 12 July 2006;
    http://gladiator.ncsa.uiuc.edu/PDFs/rssi06/presentations/00_Duncan_Buell.pdf
[3] T. Van Court et al., Achieving High Performance with FPGA-Based
    Computing, Computer, vol. 40, no. 3, pp. 50-57, Mar. 2007,
    doi:10.1109/MC.2007.79
\end{verbatim}
\end{scriptsize}

\subsubsection{System Synthesis}
Today, several system design solutions are proposed and commercialized. The most common are
those provided by Altera and Xilinx to promote their FPGA devices.
\\
The Xilinx System Generator for DSP [http://www.xilinx.com/tools/sysgen.htm] is a plug-in to
Simulink that enables designers to develop high-performance DSP systems for Xilinx FPGAs.
Designers can design and simulate a system using MATLAB and Simulink. The tool then
automatically generates synthesizable Hardware Description Language (HDL) code mapped to Xilinx
pre-optimized algorithms.
However, this tool targets only DSP-based algorithms and Xilinx FPGAs, and it cannot handle a complete
SoC. Thus, it is not really a system synthesis tool.
\\
In contrast, SOPC Builder [CITATION] makes it possible to describe a system, to synthesize it,
to program it into a target FPGA and to upload a software application.
% FIXME (C2H from Altera: runs fast but uses a monstrous amount of resources)
Nevertheless, SOPC Builder provides no facilities to synthesize coprocessors.
Users have to provide the synthesizable description with the appropriate bus interface.
\\
In addition, Xilinx System Generator and SOPC Builder are closed worlds, since each one imposes
its own IPs, which are not interchangeable.
We can conclude that the existing commercial or free tools do not cover the whole system
synthesis process in a fully automatic way. Moreover, they are bound to a particular device family
and IP library.

\subsubsection{High Level Synthesis}
High Level Synthesis translates a sequential algorithmic description and a set of constraints
(area, power, frequency, ...) into a micro-architecture at the Register Transfer Level (RTL).
Several academic and commercial tools are available today.
The most common are SPARK [HLS1], GAUT [HLS2] and UGH [HLS3] in the academic world,
and Catapult C [HLS4], PICO [HLS5] and Cynthesizer [HLS6] in the commercial world.
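To fix ideas, the following sketch shows the kind of sequential C description an HLS tool takes as
input: a small FIR filter written with fixed-size arrays, constant loop bounds and integer
arithmetic, a coding style that HLS tools generally handle well. It was written for this document
only; it uses no tool-specific pragma or directive, and the sizes \texttt{TAPS} and
\texttt{N\_OUT} are arbitrary assumptions. From such a description and a constraint set, the tool
chooses, for instance, how many multipliers to allocate for the inner loop and how to schedule
them, and produces the corresponding RTL.
\begin{scriptsize}\begin{verbatim}
#include <stdint.h>

#define TAPS  4                   /* filter length (hypothetical)       */
#define N_OUT 8                   /* output samples per call            */
#define N_IN  (N_OUT + TAPS - 1)  /* inputs, incl. TAPS-1 past samples  */

/* Direct-form FIR filter:
     y[i] = sum_{k=0}^{TAPS-1} h[k] * x[i + TAPS - 1 - k]
   where x[0 .. TAPS-2] hold the past samples. No dynamic allocation,
   no recursion, no pointer arithmetic: only static array accesses. */
void fir(const int16_t x[N_IN], const int16_t h[TAPS], int32_t y[N_OUT])
{
    for (int i = 0; i < N_OUT; i++) {
        int32_t acc = 0;
        for (int k = 0; k < TAPS; k++)  /* constant bound: unrollable */
            acc += (int32_t)h[k] * x[i + TAPS - 1 - k];
        y[i] = acc;
    }
}
\end{verbatim}
\end{scriptsize}
For example, with $x = 1, 2, \ldots, 11$ and $h = (1, 2, 3, 4)$, the first output is
$y_0 = 1\cdot 4 + 2\cdot 3 + 3\cdot 2 + 4\cdot 1 = 20$.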
Despite their maturity, their usage is restricted by the following limitations:
\begin{itemize}
\item They do not accurately respect the frequency constraint when they target an FPGA device;
their error is about 10 percent. This is a problem when the generated component is integrated
in a SoC, since it slows down the whole system.
\item These tools take into account only one or a few constraints simultaneously, while realistic
designs are multi-constrained.
Moreover, a low power consumption constraint is mandatory for embedded systems,
yet it is not well handled by common synthesis tools.
\item The parallelism is extracted from the initial algorithm as written. To obtain more parallelism or to reduce
the amount of required memory, the user must rewrite the algorithm, although techniques such as polyhedral
transformations exist to increase its intrinsic parallelism.
\item Although they share the same input language (C/C++), they are sensitive to the style in
which the algorithm is written. Consequently, engineering work is required to switch from
one tool to another.
\item The HLS tools are not integrated into an architecture and system exploration tool.
Thus, a designer who needs to accelerate a software part of the system must adapt it manually
to the HLS input dialect and perform engineering work to exploit the synthesis result
at the system level.
\end{itemize}
Given these limitations, a new generation of tools is needed to reduce the gap
between the specification of a heterogeneous system and its hardware implementation.
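The kind of polyhedral transformation mentioned above can be illustrated on a classic case. In the
first loop nest below, each iteration reads its north and west neighbours, so both loops carry a
dependence and no loop is parallel as written. Skewing the iteration space along the anti-diagonals
$t = i + j$ (a wavefront schedule) yields an equivalent nest whose inner loop carries no dependence.
This is a hand-written sketch with a hypothetical size \texttt{N}; a polyhedral tool would derive
such a schedule automatically from the dependence analysis instead of relying on the programmer.
\begin{scriptsize}\begin{verbatim}
#define N 6   /* hypothetical problem size */

/* Original nest: a[i][j] depends on a[i-1][j] (carried by the i loop)
   and on a[i][j-1] (carried by the j loop): no loop is parallel. */
void recur_orig(long a[N][N])
{
    for (int i = 1; i < N; i++)
        for (int j = 1; j < N; j++)
            a[i][j] = a[i - 1][j] + a[i][j - 1];
}

/* Skewed (wavefront) nest: t = i + j enumerates the anti-diagonals.
   Every point of diagonal t reads only values of diagonal t - 1, so
   the inner loop carries no dependence and its iterations may run in
   parallel. Row 0 and column 0 are inputs and are never written. */
void recur_wavefront(long a[N][N])
{
    for (int t = 2; t <= 2 * (N - 1); t++) {
        int lo = (t - (N - 1) > 1) ? t - (N - 1) : 1;  /* max(1, t-N+1) */
        int hi = (t - 1 < N - 1) ? t - 1 : N - 1;      /* min(N-1, t-1) */
        for (int i = lo; i <= hi; i++) {               /* parallel loop */
            int j = t - i;
            a[i][j] = a[i - 1][j] + a[i][j - 1];
        }
    }
}
\end{verbatim}
\end{scriptsize}
Both functions compute the same array (with ones on row 0 and column 0, entry $(i,j)$ equals
$(i+j)!/(i!\,j!)$), but in the skewed version all the iterations of the inner loop could be mapped
onto parallel hardware.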

\hspace{2cm}\begin{scriptsize}\begin{verbatim}
[HLS1] SPARK, University of California, San Diego
[HLS2] GAUT, UBS/Lab-STIC
[HLS3] UGH, LIP6/TIMA
[HLS4] Catapult C, Mentor Graphics
[HLS5] PICO, Synfora
[HLS6] Cynthesizer, Forte Design Systems
\end{verbatim}
\end{scriptsize}

\subsubsection{Application-Specific Instruction-set Processors}

ASIPs (Application-Specific Instruction-set Processors) are programmable processors in
which both the instruction set and the micro-architecture have been tailored to a given
application domain (e.g., video processing) or to a specific application.
This specialization usually offers a good compromise between performance (w.r.t.\ a pure software
implementation on an embedded CPU) and flexibility (w.r.t.\ an application-specific
hardware coprocessor).
In spite of their obvious advantages, using and designing ASIPs remains a difficult
task, since it involves designing both a micro-architecture and a compiler for this
architecture. Besides, to our knowledge, there is still no open-source
design flow\footnote{There are commercial tools such as ...} for ASIP design, even though such a tool would
be valuable in the context of a system-level design exploration tool.

In this context, ASIP design based on Instruction Set Extensions (ISEs) has
received a lot of interest [NIOSII,TENSILICA]%~\cite{NIOS2,ST70},
as it makes micro-architecture synthesis
more tractable\footnote{ISEs rely on a template micro-architecture in which
only a small fraction of the architecture has to be specialized.} and helps ASIP
designers focus on compilers, for which there are still many open problems
[CODES04,FPGA08].
This approach, however, has a strong weakness: it also significantly reduces
opportunities for achieving good speedups (most speedups remain between 1.5x and
2.5x), since ISE performance is generally limited by I/O constraints, as
ISEs usually rely on the main CPU register file to access data.
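The I/O constraint can be made explicit. In ISE identification, a candidate custom instruction is a
subgraph (a \emph{cut}) of the data-flow graph of a basic block; each value it reads from outside
costs a register-file read port, and each value it produces for the rest of the program costs a
write port. The C sketch below counts these inputs and outputs for a cut given as a bit mask and
checks them against the available ports. The graph encoding (two operands per node, negative
operands for live-in registers) and the names \texttt{dfg\_node}, \texttt{cut\_io} and
\texttt{cut\_fits} are hypothetical, and the convexity check that real identification algorithms
also require is deliberately omitted.
\begin{scriptsize}\begin{verbatim}
#include <assert.h>
#include <stdint.h>

/* Hypothetical data-flow graph node with two operands. An operand >= 0 is
   the index of the producing node; an operand r < 0 denotes the live-in
   register number -(r + 1). Assumes at most 32 nodes and 32 registers. */
typedef struct {
    int src[2];
    int live_out;                 /* 1 if the value is used after the block */
} dfg_node;

static int popcount32(uint32_t v)
{
    int c = 0;
    while (v) {                   /* clear the lowest set bit each round */
        v &= v - 1;
        c++;
    }
    return c;
}

/* Counts the register-file reads (*n_in) and writes (*n_out) needed by the
   candidate instruction `cut` (bit v set = node v belongs to the cut). */
void cut_io(const dfg_node *g, int n, uint32_t cut, int *n_in, int *n_out)
{
    uint32_t in_nodes = 0, in_regs = 0, out_nodes = 0;
    assert(n >= 0 && n <= 32);
    for (int v = 0; v < n; v++) {
        int v_in = (int)((cut >> v) & 1u);
        for (int k = 0; k < 2; k++) {
            int s = g[v].src[k];
            if (s < 0) {                          /* live-in register */
                if (v_in)
                    in_regs |= 1u << (-s - 1);
            } else {
                int s_in = (int)((cut >> s) & 1u);
                if (v_in && !s_in)                /* value enters cut */
                    in_nodes |= 1u << s;
                else if (!v_in && s_in)           /* value leaves cut */
                    out_nodes |= 1u << s;
            }
        }
        if (v_in && g[v].live_out)                /* live after block */
            out_nodes |= 1u << v;
    }
    *n_in = popcount32(in_nodes) + popcount32(in_regs);
    *n_out = popcount32(out_nodes);
}

/* 1 if the cut respects the register-file port limits. */
int cut_fits(const dfg_node *g, int n, uint32_t cut, int max_in, int max_out)
{
    int n_in, n_out;
    cut_io(g, n, cut, &n_in, &n_out);
    return n_in <= max_in && n_out <= max_out;
}
\end{verbatim}
\end{scriptsize}
For instance, the pattern $(a \times b + c) \gg d$ reads four registers when it is executed as a
single custom instruction; with a register file offering two read ports and one write port per
instruction, as is typical of simple RISC cores, such a cut does not fit, which illustrates the
I/O bottleneck described above.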

% (
%automatically extracting ISE candidates from application code \cite{CODES04},
%performing efficient instruction selection and/or storage resource (register)
%allocation \cite{FPGA08}).


To cope with this issue, recent approaches~[DAC09,DAC08] %\cite{DAC09,DAC08}
advocate the use of
micro-architectural ISE models in which the coupling between the processor micro-architecture
and the ISE component is tightened so as to allow the ISE to overcome the register-file
I/O limitations. However, these approaches tackle the problem from a compiler/simulation
point of view and do not address the generation of synthesizable representations for
these models.

We therefore strongly believe that there is a need for an open framework that
would allow researchers and system designers to:
\begin{itemize}
\item Explore the various levels of interaction between the original CPU micro-architecture
and its extension (for example through a Domain Specific Language targeted at micro-architecture
specification and synthesis).
\item Retarget the compiler instruction-selection passes (or prototype new passes) so as
to take advantage of these ISEs.
\item Provide a complete system-level integration for using ASIPs as SoC building blocks
(integration with application-specific blocks, MPSoC, etc.).
\end{itemize}
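As an illustration of the second item, the sketch below shows what retargeting instruction selection to an ISE can amount to in the simplest case: a greedy top-down tiler over expression trees that prefers a hypothetical multiply-accumulate extension \texttt{mac d, a, b, c} ($d = a \times b + c$) whenever the pattern matches, and falls back to base instructions otherwise. The tree encoding, the opcode names and the \texttt{mac} instruction are assumptions made for the example; production selectors (for instance dynamic-programming tree matchers) would pick the cheapest cover rather than the first match.

```python
from itertools import count

# A leaf is a variable name (str); an inner node is (opcode, left, right),
# with opcode in {"add", "sub", "mul"}.

def select(tree, code, temps):
    """Greedy top-down tiling: append 3-address instructions to `code` and
    return the name holding the value of `tree`.  The hypothetical ISE
    ("mac", d, a, b, c), meaning d = a*b + c, is tried first on every "add"."""
    if isinstance(tree, str):
        return tree
    op, lhs, rhs = tree
    if op == "add":
        for prod, acc in ((lhs, rhs), (rhs, lhs)):   # add is commutative
            if isinstance(prod, tuple) and prod[0] == "mul":
                a = select(prod[1], code, temps)
                b = select(prod[2], code, temps)
                c = select(acc, code, temps)
                dst = f"t{next(temps)}"
                code.append(("mac", dst, a, b, c))
                return dst
    # No ISE pattern matched: fall back to a base instruction.
    a = select(lhs, code, temps)
    b = select(rhs, code, temps)
    dst = f"t{next(temps)}"
    code.append((op, dst, a, b))
    return dst

def compile_expr(tree):
    """Return the instruction list selected for one expression tree."""
    code = []
    select(tree, code, count())
    return code

# y = x0*h0 + x1*h1 + x2*h2 (one step of a 3-tap FIR filter)
fir = ("add", ("add", ("mul", "x0", "h0"), ("mul", "x1", "h1")), ("mul", "x2", "h2"))
for instr in compile_expr(fir):
    print(instr)
```

On this FIR expression, the selector emits one multiply and two \texttt{mac} instructions instead of three multiplies and two additions.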

\hspace{2cm}
\begin{scriptsize}\begin{verbatim}

[CODES08] Theo Kluter, Philip Brisk, Paolo Ienne, and Edoardo Charbon. Speculative DMA
          for Architecturally Visible Storage in Instruction Set Extensions.

[DAC09]   Theo Kluter, Philip Brisk, Paolo Ienne, and Edoardo Charbon. Way Stealing:
          Cache-assisted Automatic Instruction Set Extensions.

[CODES04] Pan Yu and Tulika Mitra. Scalable Custom Instructions Identification for
          Instruction Set Extensible Processors.

[FPGA08]  Quang Dinh, Deming Chen, and Martin D. F. Wong. Efficient ASIP Design for
          Configurable Processors with Fine-Grained Resource Sharing.

[NIOSII]  Nios II Custom Instruction User Guide.

\end{verbatim}

\end{scriptsize}
%, either
%because the target architecture is proprietary, or because the compiler
%technology is closed/commercial.

% We propose to explore how to tighten the coupling of the extensions and
% the underlying template micro-architecture.
% * Thightne Even if such
% an approach offers less flexibility and forbids very tight coupling
% between the extensions and the template micro-architecture, it makes the
% design of the micro-architecture more tractable and amenable to a fully
% automated flow.
% \\
% \\
% In the context of the COACH project, we propose to add to the
% infrastructure a design flow targeted at automatic instruction set
% extension for the MIPS-based CPU, which will come as a complement or an
% alternative to the other proposed approaches (hardware accelerators,
% multiprocessors).
%

\subsubsection{Automatic Parallelization}
\begin{Large}\begin{verbatim}
-- TO BE COMPLETED (LIP)
\end{verbatim}
\end{Large}
%CA% Parallel machines are often difficult and painful to program
%CA% directly, and one would like the compiler to %do the job, that is to
%CA% turn automatically a sequential program into a parallel form. This
%CA% transformation is referred to as {\em automatic parallelization}, and has
%CA% been widely addressed since the 70s. Automatic parallelization
%CA% relies on data dependences, which cannot be computed in general.%, as
%CA% %one cannot predict at compile time the variable values on a given
%CA% %execution point.
%CA% This negative result led researchers to (i) find a
%CA% program model in which no approximation is needed (i.e., the polyhedral
%CA% model), (ii) make conservative approximations, or (iii) remark that
%CA% variable values are known at runtime, and make the decisions during
%CA% program execution. The latter approach is obviously not suitable
%CA% here, as we target hardware generation. We give here a short
%CA% history of the approaches that fall in the first category.
%CA%
%CA%% In the real world, we deal with a limited number of processors,
%CA%% and the communication between processors takes time, and is
%CA%% critical for performance. %Whenever we have synchronisation-free
%CA%% parallelism, like for embarrassingly parallel kernels, this is not an
%CA%% issue. But in case of pipelined parallelism, we need to reduce
%CA%% communications as much as possible.
%CA%% So we also need to find parallelism together with a proper mapping
%CA%% of operations and data on physical processors.
%CA%
%CA% As programs spend most of their time in loops, the community has
%CA% focused on loop transformations that reveal parallelism.
%CA%%unimodular
%CA% The first approaches worked on perfect loop nests, where the tree
%CA% formed by the nested loops is linear. In this program model, the
%CA% loops can be seen as a basis that drives the way the iteration
%CA% domain is described. Hence, a first idea was to change this
%CA% basis such that at least one vector (one loop) is parallel. To ease
%CA% code generation, the parallelepiped defined by the new vectors must
%CA% have unit volume. %Otherwise, one would produce a homothetic
%CA%% expansion of the iteration domain, which would force putting modulos
%CA%% in the target code.
%CA% For this reason, these transformations are called {\em unimodular
%CA% transformations}.
%CA%%tiling
%CA%
%CA% The next approaches include {\em loop tiling}, a simple
%CA% partitioning of the iteration domain, whose initial purpose is to
%CA% execute every partition on a different processor. %In the same way,
%CA% The execution order is modified with a proper unimodular
%CA% transformation, then the tiles are obtained by cutting the
%CA% iteration domain with the hyperplanes directed by every vector of
%CA% the new (unimodular) basis, at regular intervals. When the tiling
%CA% hyperplanes are properly chosen, we can both improve data locality
%CA% on every processor and reduce the communication between two
%CA% different tiles (which will be mapped on processors). This last
%CA% property implies that one tends to look for the largest possible
%CA% degree of parallelism.
%CA%
%CA%%affine scheduling
%CA% The previous approaches were restricted to kernels with perfect
%CA% loop nests (linear loop tree) and unimodular transformations. The
%CA% last generation of approaches broke with these limitations. We now
%CA% choose a different basis for every assignment, without the
%CA% unimodularity restriction. A dual way to present things is the
%CA% notion of {\em affine schedule}, introduced by Feautrier [part1],
%CA% which simply assigns an abstract execution date to every assignment
%CA% execution. As an assignment execution is exactly characterised by
%CA% the current value of the loop counters (iteration vector), the
%CA% affine schedule is defined as an affine form of the iteration
%CA% vector (hence the 'affine'). The affine property allows the use of
%CA% integer programming techniques to compute the schedule. With this
%CA% approach, additional techniques are required to allocate the
%CA% parallel operations and the data to processors in an efficient way
%CA% [griebl, feautrier].
%CA%
%CA%%modularity??
%CA%%% As loop nests are no longer perfect, we deal with (transformed)
%CA%%% iteration domains of different dimensions, which can (and usually
%CA%%% do) overlap. At this point, a new code generation technique
%CA%%% was needed. The first attempt is due to Chamsky et al. [??], and
%CA%%% was improved by Quillere et al. [QRW]. The code is now implemented
%CA%%% in an efficient tool [cloog], which gave a new life to polyhedral
%CA%%% techniques.
%CA%
%CA%%pluto's tiling
%CA% The tiling techniques were extended to non-perfect loop nests with
%CA% {\em affine partitioning}. Affine partitioning is to affine
%CA% scheduling what (original) tiling was to unimodular
%CA% transformations. An affine partitioning assigns to every assignment
%CA% its coordinates in the basis defined by the normals to the tiling
%CA% hyperplanes. Recently, a way to compute efficient hyperplanes was
%CA% found [uday], with good data locality and communications
%CA% confined to a small neighborhood around every processor.
%CA%
%CA%\subsubsection{Source-level Memory Optimisation}
%CA% The HLS process allows memory to be customised, which impacts the final
%CA% circuit size and power consumption. Though most HLS tools already
%CA% try to optimise memory usage, it is better to provide an independent
%CA% source-level pass that could be reused for different tools and in
%CA% other contexts.
%CA%
%CA% There exist many approaches to evaluate and reduce the memory
%CA% requirement of a program. The first approaches are concerned with
%CA% {\em memory size estimation}, which can be defined as the maximum
%CA% number of memory cells used at the same time [clauss,zhao]. These
%CA% approaches provide an estimation as a symbolic expression of program
%CA% parameters, which can be used further to guide loop optimisations.
%CA% However, no explicit way to reduce the memory size is given. {\em
%CA% Intra-array reuse} approaches break with this limitation, and
%CA% collapse the array cells which are not alive at the same time. The
%CA% collapse is done by means of a data layout transformation, specified
%CA% with a linear (modular) mapping. The first approaches were
%CA% developed at IMEC [balasa,catthoor], and basically try to linearize
%CA% the arrays and fold them using a modulo operator. Then, Lefebvre et
%CA% al. proposed a solution to fold the array dimensions independently
%CA% [lefebvre]. Finally, Darte et al. provided a general formalisation of
%CA% the problem, together with a solution that subsumes the previous
%CA% approaches [darte]. A first implementation was made with the tool
%CA% {\sc Bee}, but there are still many limitations.
%CA%
%CA% \begin{itemize}
%CA% \item The tool is restricted to regular programs, whereas more
%CA% general programs could be handled with a conservative array liveness
%CA% analysis.
%CA%
%CA% \item Programs depending on parameters (inputs) are not handled,
%CA% which prevents handling, for example, the body of tiled loops.
%CA%
%CA% \item The new array layout can break spatial locality, and thus impact
%CA% performance and power consumption. One would like to get a mapping
%CA% that improves or, at least, preserves the spatial locality of the
%CA% program.
%CA%
%CA% \item Finally, the final memory compaction strongly depends on the
%CA% program schedule, and is naturally hindered by
%CA% parallelism. Consequently, a trade-off must be found with
%CA% automatic parallelization. An ideal solution would be to reduce
%CA% memory usage while preserving parallelism.
%CA% \end{itemize}

\subsubsection{Interfaces}
\begin{Large}\begin{verbatim}
-- TO BE COMPLETED (INSA): state of the art
\end{verbatim}
\end{Large}
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Objectives and innovation aspects}
\hspace{2cm}\begin{scriptsize}\begin{verbatim}
% 2.2. OBJECTIVES AND AMBITIOUS/INNOVATIVE CHARACTER OF THE PROJECT
% (2 pages maximum)
% Describe the scientific/technical objectives of the project.
% Present the expected scientific advance. Specify the originality and the ambitious
% character of the project.
% Detail the scientific and technical obstacles to be overcome by carrying out the project.
% Where relevant, describe the final product(s) developed at the end of the project,
% showing the innovative character of the project.
% Present the expected results, proposing if possible success and evaluation criteria
% suited to the type of project, allowing the results to be evaluated at the
% end of the project.
% Where applicable (programmes requiring multidisciplinarity), demonstrate how the
% scientific disciplines are articulated.
\end{verbatim}
\end{scriptsize}

% the scientific/technical objectives of the project.
The objective of the COACH project is to develop a complete framework for
HPC (accelerating existing software applications)
and for embedded applications (implementing an application on a low-power standalone device).
The design steps are presented in Figure~\ref{coach-flow}.
\begin{figure}[hbtp]\centering
\includegraphics[width=.8\linewidth]{flow}
\caption{\label{coach-flow} COACH flow.}
\end{figure}
\begin{description}
\item[HPC setup] Here the user splits the application into two parts: the host application,
which remains on the PC, and the SoC application, which migrates to the SoC.
The framework provides a simulation model to evaluate this partitioning.
\item[SoC design] In this phase,
the user obtains simulators of the SoC at different abstraction levels by giving the COACH framework
a SoC description.
This description consists of a process network corresponding to the SoC application,
an OS, an instance of a generic hardware platform,
and a mapping of the processes onto the platform components. The supported mappings are
software (the process runs on a SoC processor),
XXXpeci (the process runs on a SoC processor enhanced with dedicated instructions),
and hardware (the process runs in a coprocessor generated by HLS and plugged on the SoC bus).
\item[Application compilation] Once the SoC description is validated, COACH automatically generates
an FPGA bitstream containing the hardware platform with the SoC application software, and
an executable containing the host application. The user can launch the application by
loading the bitstream on the FPGA and running the executable on the PC.
\end{description}
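The SoC description used in the SoC design phase can be pictured as a small data structure. The sketch below is a hypothetical model, not the COACH input format: it records the process network, the OS, the platform components and the mapping, and checks two consistency rules assumed for illustration only (software and extended-processor mappings target a processor, hardware mappings target a bus on which the generated coprocessor is plugged; each channel has a single writer). The enum value \texttt{EXTENDED} stands for the second mapping kind, whose name is left open above, and the OS and component names are placeholders.

```python
from dataclasses import dataclass, field
from enum import Enum

class Impl(Enum):
    SOFTWARE = "software"   # the process runs on a SoC processor
    EXTENDED = "extended"   # ... on a SoC processor enhanced with dedicated instructions
    HARDWARE = "hardware"   # ... in an HLS-generated coprocessor plugged on the SoC bus

@dataclass
class Process:
    name: str
    reads: list = field(default_factory=list)   # input channel names
    writes: list = field(default_factory=list)  # output channel names

@dataclass
class SocDescription:
    processes: list   # process network of the SoC application
    os: str           # operating system of the SoC processors
    platform: dict    # platform component name -> kind ("cpu" or "bus")
    mapping: dict     # process name -> (Impl, platform component name)

    def check(self):
        """Return the list of consistency errors (empty when the description is valid)."""
        errors, writers = [], {}
        for proc in self.processes:
            for ch in proc.writes:
                writers.setdefault(ch, []).append(proc.name)
            if proc.name not in self.mapping:
                errors.append(f"{proc.name}: not mapped")
                continue
            impl, comp = self.mapping[proc.name]
            expected = "bus" if impl is Impl.HARDWARE else "cpu"
            if self.platform.get(comp) != expected:
                errors.append(f"{proc.name}: {impl.value} mapping needs a {expected}, got {comp!r}")
        for ch, names in writers.items():
            if len(names) > 1:
                errors.append(f"channel {ch}: several writers {names}")
        return errors

# Hypothetical three-stage pipeline mapped on a two-processor platform.
soc = SocDescription(
    processes=[
        Process("producer", writes=["c0"]),
        Process("filter", reads=["c0"], writes=["c1"]),
        Process("consumer", reads=["c1"]),
    ],
    os="example-os",
    platform={"cpu0": "cpu", "cpu1": "cpu", "bus0": "bus"},
    mapping={
        "producer": (Impl.SOFTWARE, "cpu0"),
        "filter": (Impl.HARDWARE, "bus0"),
        "consumer": (Impl.EXTENDED, "cpu1"),
    },
)
print(soc.check())  # -> []
```

Keeping the mapping separate from the process network is what lets the same description be re-mapped (for example moving \texttt{filter} from hardware to software) without touching the application itself.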

% the expected scientific advance. Specify the originality and the ambitious
% character of the project.
The main scientific contribution of the project is to unify various synthesis techniques
(same input and output formats), allowing the user to switch from one to another without
engineering effort, and even to chain them, for example to run polyhedral transformations
before synthesis.
Another advantage of this framework is that it provides different abstraction levels from
a single description.
Finally, this description is independent of the device family, and its hardware implementation
is generated automatically.
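The benefit of shared input and output formats is that tools become composable. The sketch below models each tool as a transformation tagged with the format it consumes and the format it produces, and builds a flow by chaining tools whose formats match, for example a source-to-source polyhedral pass followed by HLS. The \texttt{Tool} class, the format tags and the stub transformations are hypothetical placeholders for the actual COACH tools.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Tool:
    name: str
    consumes: str               # input format tag, e.g. "C"
    produces: str               # output format tag, e.g. "C" or "RTL"
    run: Callable[[str], str]   # the transformation itself

def chain(*tools: Tool) -> Tool:
    """Compose tools into one flow, checking that each output format
    matches the input format expected by the next tool."""
    for up, down in zip(tools, tools[1:]):
        if up.produces != down.consumes:
            raise ValueError(f"{up.name} produces {up.produces}, "
                             f"but {down.name} expects {down.consumes}")

    def run(src: str) -> str:
        for tool in tools:
            src = tool.run(src)
        return src

    return Tool("+".join(t.name for t in tools), tools[0].consumes, tools[-1].produces, run)

# Stub tools standing in for a source-to-source polyhedral pass and an HLS tool.
poly = Tool("poly", "C", "C", lambda src: f"/* tiled */ {src}")
hls = Tool("hls", "C", "RTL", lambda src: f"-- RTL for: {src}")

flow = chain(poly, hls)
print(flow.name, flow.produces)   # poly+hls RTL
print(flow.run("kernel();"))      # -- RTL for: /* tiled */ kernel();
```

Because a chained flow is itself a \texttt{Tool}, flows can in turn be chained, while an ill-formed chain (HLS followed by a C-to-C pass) is rejected before anything runs.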

% Detail the scientific and technical obstacles to be overcome by carrying out the project.
System design is a very complex task, and in this project we aim to simplify it
as much as possible. For this purpose, we have to overcome the following scientific
and technological barriers.
\begin{itemize}
\item The main problem in HPC is the communication between the PC and the SoC.
This problem has two aspects: the first is efficiency; the second is to
eliminate the engineering effort needed to implement it at the different abstraction levels.
\item The COACH design flow follows a top-down approach. In such a case,
the required performance of a coprocessor (operating frequency, maximum number of cycles for
a given computation, power consumption, etc.) is imposed by the other system
components. The challenge is to allow the user to control the synthesis
process accurately. For instance, the operating frequency must not be a result of RTL synthesis
but a strict synthesis constraint.
\item HLS tools are sensitive to the style in which the algorithm is written.
In addition, they are not integrated into an architecture and system
exploration tool.
Consequently, engineering work is required to switch from one tool to another,
to integrate the resulting simulation model into an architecture exploration tool,
and to synthesize the generated RTL description.
%CA Additional preprocessing (source-level transformations) is thus
%CA required to improve the process.
%CA In particular, this includes parallelism exposure and efficient memory mapping.
\item Most HLS tools translate a sequential algorithm into a coprocessor
containing a single data-path and finite state machine (FSM). In this way,
only fine-grained parallelism (instruction-level parallelism, ILP) is exploited.
The challenge is to identify coarse-grained parallelism and to generate,
from a sequential algorithm, a coprocessor containing multiple communicating
tasks (data-paths and FSMs).
\end{itemize}
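To illustrate the second barrier, the sketch below treats the clock period as an input constraint of scheduling rather than an output of RTL synthesis: operations of a data-flow graph are scheduled as soon as possible and chained within a cycle, and an operation that would cross a clock edge is moved to the next cycle instead of stretching the period. This is a simplified sketch under stated assumptions: the operation delays are hypothetical, register and multiplexer delays are ignored, resource constraints are not modelled, and multi-cycle operations are rejected.

```python
def schedule(ops, period_ns):
    """ASAP scheduling with operation chaining under a fixed clock period.

    ops: list of (name, delay_ns, predecessor_names) in topological order.
    Returns {name: (cycle, start_ns, end_ns)}, times measured within the cycle.
    """
    placed = {}
    for name, delay, preds in ops:
        if delay > period_ns:
            # Multi-cycle operations are out of scope for this sketch.
            raise ValueError(f"{name}: {delay} ns does not fit in a {period_ns} ns cycle")
        cycle, t = 0, 0.0  # earliest (cycle, time-in-cycle) allowed by data dependences
        for p in preds:
            p_cycle, _, p_end = placed[p]
            if (p_cycle, p_end) > (cycle, t):
                cycle, t = p_cycle, p_end
        if t + delay > period_ns:
            # Chaining would cross the clock edge: start at the next cycle,
            # reading the predecessor's result from a register.
            cycle, t = cycle + 1, 0.0
        placed[name] = (cycle, t, t + delay)
    return placed

def n_cycles(placed):
    """Latency of the schedule in clock cycles."""
    return 1 + max(c for c, _, _ in placed.values())

# Hypothetical data-flow graph: a 3 ns multiply feeding a chain of 2 ns ALU operations.
DFG = [
    ("mul", 3, []),
    ("add1", 2, ["mul"]),
    ("add2", 2, ["add1"]),
    ("sub", 2, ["add2"]),
]

for period in (5, 10):
    print(f"{period} ns period -> {n_cycles(schedule(DFG, period))} cycle(s)")
```

With the assumed delays, a 5~ns period yields a 2-cycle schedule, whereas a 10~ns period lets the four operations chain into a single cycle: the period drives the cycle count, not the reverse.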

%Present the expected results, proposing if possible success and evaluation criteria
%suited to the type of project, allowing the results to be evaluated at the
%end of the project.
The main result is the framework. Concretely, it consists of:
\begin{itemize}
\item two HPC communication schemes with their implementation,
\item five HLS tools (control-dominated HLS, data-dominated HLS, coarse-grained HLS,
memory optimisation HLS, and ASIP),
\item three SystemC-based virtual prototyping environments extended with synthesizable
RTL IP cores (generic, ALTERA/NIOS/AVALON, XILINX/MICROBLAZE/OPB),
\item one design space exploration tool,
\item one operating system (OS).
\end{itemize}
The framework functionality will be demonstrated with XXX-EXAMPLE1, XXX-EXAMPLE2,
and XXX-EXAMPLE3 on four architectures (generic/XILINX, generic/ALTERA,
proprietary/XILINX, proprietary/ALTERA).

%% \section{}
%% %3. SCIENTIFIC AND TECHNICAL PROGRAMME, PROJECT ORGANISATION
%% \subsection{}
%% %3.1. SCIENTIFIC PROGRAMME AND PROJECT STRUCTURE
%% %(2 pages maximum)
%% %Present the scientific programme and justify the breakdown of the work
%% %programme into tasks, consistently with the objectives pursued.
%% %Use a diagram to present the links between the different tasks
%% %(technical organisation chart).
%% %The tasks represent the major phases of the project. Their number is limited.
%% %Do not forget the activities and actions related to dissemination and
%% %exploitation.
%%
%% %PUT A FIGURE HERE DESCRIBING THE TASKS AND THEIR INTERACTIONS (WITH THE FLOW
%% %IN THE BACKGROUND?)
%% \subsection{}
%% %3.2. PROJECT MANAGEMENT
%% %(2 pages maximum)
%% %Specify the organisational aspects of the project and the coordination arrangements
%% %(if possible, a dedicated coordination task: cf. task 0 of submission
%% %document A).
%% \subsection{}
%% %3.3. DESCRIPTION OF THE WORK BY TASK
%% %(ideally 1 or 2 pages per task)
%% %For each task, describe:
%% %- the objectives of the task and possible success indicators,
%% %- the task leader and the partners involved (possibly shown
%% %graphically),
%% %- the detailed work programme of the task,
%% %- the deliverables of the task,
%% %- the contributions of the partners (the "who does what"),
%% %- the description of the methods and technical choices, and of how
%% %the solutions will be delivered,
%% %- the risks of the task and the planned fallback solutions.