Synthetic Benchmarks
This section gives an overview of the performance of the compute nodes on synthetic benchmarks and on representative embedded applications.
CPU Memory Bandwidth
Measurement of the memory bandwidth between the CPU and the RAM with the triad micro-benchmark (C[i] = x * A[i] + B[i]). The bandwidth benchmark is used: it is dedicated to efficient CPU memory bandwidth measurements, in the spirit of the good old STREAM.
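For reference, the triad kernel boils down to the following loop. This is only a minimal, illustrative sketch (buffer sizes and the single timed run are assumptions of the sketch); the actual bandwidth benchmark adds warm-up runs, repetitions and careful buffer sizing.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Minimal triad sketch: C[i] = x * A[i] + B[i] (illustrative only). */
static void triad(float *restrict c, const float *restrict a,
                  const float *restrict b, float x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = x * a[i] + b[i];
}

int main(void)
{
    const size_t n = 1 << 25; /* 3 buffers of 128 MiB each: far larger than any LLC */
    float *a = malloc(n * sizeof *a);
    float *b = malloc(n * sizeof *b);
    float *c = malloc(n * sizeof *c);
    for (size_t i = 0; i < n; i++) { a[i] = 1.f; b[i] = 2.f; c[i] = 0.f; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    triad(c, a, b, 3.f, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    const double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* 2 loads + 1 store of n floats (write-allocate traffic is ignored here) */
    printf("triad: %.2f GB/s\n", 3.0 * n * sizeof(float) / sec / 1e9);
    free(a); free(b); free(c);
    return 0;
}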
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "CPU memory bandwidth depending on the SBCs",
"width": 750,
"height": 300,
"data": {
"values": [
{"category": "xu4", "group": "triad", "value": 4},
{"category": "rpi3", "group": "triad", "value": 2},
{"category": "tx2", "group": "triad", "value": 20},
{"category": "xagx", "group": "triad", "value": 64},
{"category": "xnano", "group": "triad", "value": 9},
{"category": "rpi4", "group": "triad", "value": 5},
{"category": "xnx", "group": "triad", "value": 32},
{"category": "m1u", "group": "triad", "value": 320},
{"category": "vim1s", "group": "triad", "value": 7},
{"category": "onx", "group": "triad", "value": 45},
{"category": "oagx", "group": "triad", "value": 73},
{"category": "onano", "group": "triad", "value": 26},
{"category": "opi5", "group": "triad", "value": 20},
{"category": "rpi5", "group": "triad", "value": 10},
{"category": "em780", "group": "triad", "value": 62},
{"category": "bpif3", "group": "triad", "value": 7},
{"category": "x7ti", "group": "triad", "value": 73}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "Throughput (GB/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group", "sort": "none"},
"color": {"field": "group", "title": "ubench"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative", "sort": "none"}
}
}]
}
CPU memory bandwidth depending on the SBC (higher is better).
CPU Peak Performance
Measurement of the CPU peak performance according to different operations:
FMA - Fused Multiply–Add, performs the following operation: \(d = a \times b + c\) on 64-bit, 32-bit or 16-bit floating-point numbers (referred to as f64, f32 & f16 here).
DPA4 - Performs the dot product of four 8-bit integers (i8) and accumulates the result in a 32-bit integer (i32): \(c^{i32} = c^{i32} + \sum^4_{s = 1}{ a_s^{i8} \times b_s^{i8}}\).
DPA2 - Performs the dot product of two 16-bit brain floats (bf16) and accumulates the result in a 32-bit float (f32): \(c^{f32} = c^{f32} + \sum^2_{s = 1}{ a_s^{bf16} \times b_s^{bf16}}\).
MADOT - Performs a small matrix multiplication. For instance, for RVV 1.0 256-bit + IME, MADOT i32/i8 performs \(C^{i32} = C^{i32} + A^{i8} B^{i8}\), where the \(A^{i8}\) dim is \(4 \times 8\), the \(B^{i8}\) dim is \(8 \times 4\) and the \(C^{i32}\) dim is \(4 \times 4\).
The cpufp benchmark is used. The charts below give the obtained performance depending on the targeted SBC and on the number of cores. For the multi-core results, the number of cores used is indicated by the -Nt suffix: for instance, for the Raspberry Pi 3, the label rpi3-A53-4t means that 4 cores are used. The results are split between low power SBCs (< 15 Watts) and medium power SBCs (> 15 Watts), and between single-core and multi-core runs.
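To give an idea of what is measured, the FMA test essentially times a long chain of independent multiply-add operations, so that the loop is limited by the FMA throughput rather than by its latency. The sketch below is only illustrative and is not the cpufp code (cpufp relies on architecture-specific kernels); one multiply-add is counted as 2 floating-point operations.
#include <stdio.h>
#include <time.h>

/* Illustrative single-core FMA f32 throughput loop (NOT the cpufp code).
 * Eight independent accumulator chains hide the FMA latency; with -O3 the
 * inner loop can be vectorized and contracted into FMA instructions
 * (e.g., with -ffp-contract=fast). */
int main(void)
{
    float acc[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    const float a = 0.999999f, b = 0.5f;
    const long iters = 100000000L; /* 100e6 iterations x 8 multiply-adds */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        for (int k = 0; k < 8; k++)
            acc[k] = acc[k] * a + b; /* one multiply-add per accumulator */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    const double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    float sum = 0.f;
    for (int k = 0; k < 8; k++) sum += acc[k]; /* keep the result alive */
    /* one multiply-add = 2 floating-point operations */
    printf("~%.1f Gflop/s (sum = %f)\n", 2.0 * 8 * iters / sec / 1e9, sum);
    return 0;
}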
Low Power SBCs (< 15 Watts)
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "Peak performance on CPU (mono-core) depending on the type of operation and on the SBC.",
"width": 1000,
"height": 300,
"data": {
"values": [
{"category":"rpi3-A53", "group": "FMA f64", "value": 5},
{"category":"rpi3-A53", "group": "FMA f32", "value": 9},
{"category":"rpi3-A53", "group": "FMA f16", "value": 0},
{"category":"rpi3-A53", "group": "DPA2 f32/bf16", "value": 0},
{"category":"rpi3-A53", "group": "DPA4 i32/i8", "value": 0},
{"category":"rpi3-A53", "group": "MADOT i32/i8", "value": 0},
{"category":"xnano-A57", "group": "FMA f64", "value": 6},
{"category":"xnano-A57", "group": "FMA f32", "value": 12},
{"category":"rpi4-A72", "group": "FMA f64", "value": 6},
{"category":"rpi4-A72", "group": "FMA f32", "value": 12},
{"category":"vim1s-A35", "group": "FMA f64", "value": 1.6},
{"category":"vim1s-A35", "group": "FMA f32", "value": 3.2},
{"category":"onano-A78", "group": "FMA f64", "value": 12},
{"category":"onano-A78", "group": "FMA f32", "value": 24},
{"category":"onano-A78", "group": "FMA f16", "value": 48},
{"category":"onano-A78", "group": "DPA4 i32/i8", "value": 97},
{"category":"opi5-A55", "group": "FMA f64", "value": 7},
{"category":"opi5-A55", "group": "FMA f32", "value": 14},
{"category":"opi5-A55", "group": "FMA f16", "value": 29},
{"category":"opi5-A55", "group": "DPA4 i32/i8", "value": 58},
{"category":"opi5-A76", "group": "FMA f64", "value": 18},
{"category":"opi5-A76", "group": "FMA f32", "value": 36},
{"category":"opi5-A76", "group": "FMA f16", "value": 71},
{"category":"opi5-A76", "group": "DPA4 i32/i8", "value": 143},
{"category":"rpi5-A76", "group": "FMA f64", "value": 19},
{"category":"rpi5-A76", "group": "FMA f32", "value": 38},
{"category":"rpi5-A76", "group": "FMA f16", "value": 77},
{"category":"rpi5-A76", "group": "DPA4 i32/i8", "value": 153},
{"category":"bpif3-X60", "group": "FMA f64", "value": 13},
{"category":"bpif3-X60", "group": "FMA f32", "value": 25},
{"category":"bpif3-X60", "group": "FMA f16", "value": 53},
{"category":"bpif3-X60", "group": "MADOT i32/i8", "value": 408}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "Performance (Gop/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group", "sort": "none"},
"color": {"field": "group", "title": "Operation"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative", "sort": "none"}
}
}]
}
Peak performance on CPU (single-core, low power SBCs) depending on the type of operation and on the SBC (higher is better).
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "Peak performance on CPU depending on the type of operation and on the SBC.",
"width": 1000,
"height": 300,
"data": {
"values": [
{"category":"rpi3-A53-4t", "group": "FMA f64", "value": 19},
{"category":"rpi3-A53-4t", "group": "FMA f32", "value": 38},
{"category":"rpi3-A53-4t", "group": "FMA f16", "value": 0},
{"category":"rpi3-A53-4t", "group": "DPA2 f32/bf16", "value": 0},
{"category":"rpi3-A53-4t", "group": "DPA4 i32/i8", "value": 0},
{"category":"rpi3-A53-4t", "group": "MADOT i32/i8", "value": 0},
{"category":"xnano-A57-4t", "group": "FMA f64", "value": 24},
{"category":"xnano-A57-4t", "group": "FMA f32", "value": 47},
{"category":"rpi4-A72-4t", "group": "FMA f64", "value": 24},
{"category":"rpi4-A72-4t", "group": "FMA f32", "value": 48},
{"category":"vim1s-A35", "group": "FMA f64", "value": 6.3},
{"category":"vim1s-A35", "group": "FMA f32", "value": 12.7},
{"category":"onano-A78-6t", "group": "FMA f64", "value": 72},
{"category":"onano-A78-6t", "group": "FMA f32", "value": 145},
{"category":"onano-A78-6t", "group": "FMA f16", "value": 290},
{"category":"onano-A78-6t", "group": "DPA4 i32/i8", "value": 579},
{"category":"opi5-A55-4t", "group": "FMA f64", "value": 29},
{"category":"opi5-A55-4t", "group": "FMA f32", "value": 58},
{"category":"opi5-A55-4t", "group": "FMA f16", "value": 115},
{"category":"opi5-A55-4t", "group": "DPA4 i32/i8", "value": 231},
{"category":"opi5-A76-4t", "group": "FMA f64", "value": 71},
{"category":"opi5-A76-4t", "group": "FMA f32", "value": 142},
{"category":"opi5-A76-4t", "group": "FMA f16", "value": 284},
{"category":"opi5-A76-4t", "group": "DPA4 i32/i8", "value": 568},
{"category":"rpi5-A76-4t", "group": "FMA f64", "value": 77},
{"category":"rpi5-A76-4t", "group": "FMA f32", "value": 154},
{"category":"rpi5-A76-4t", "group": "FMA f16", "value": 307},
{"category":"rpi5-A76-4t", "group": "DPA4 i32/i8", "value": 614},
{"category":"bpif3-X60-8t", "group": "FMA f64", "value": 106},
{"category":"bpif3-X60-8t", "group": "FMA f32", "value": 214},
{"category":"bpif3-X60-8t", "group": "FMA f16", "value": 426},
{"category":"bpif3-X60-8t", "group": "MADOT i32/i8", "value": 1635}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "Performance (Gop/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group", "sort": "none"},
"color": {"field": "group", "title": "Operation"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative", "sort": "none"}
}
}]
}
Peak performance on CPU (multi-core, low power SBCs) depending on the type of operation and on the SBC (higher is better).
Medium Power SBCs (> 15 Watts)
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "Peak performance on CPU (mono-core) depending on the type of operation and on the SBC.",
"width": 1300,
"height": 300,
"data": {
"values": [
{"category":"tx2-A57", "group": "FMA f64", "value": 8},
{"category":"tx2-A57", "group": "FMA f32", "value": 16},
{"category":"tx2-A57", "group": "FMA f16", "value": 0},
{"category":"tx2-A57", "group": "DPA2 f32/bf16", "value": 0},
{"category":"tx2-A57", "group": "DPA4 i32/i8", "value": 0},
{"category":"tx2-Denver", "group": "FMA f64", "value": 8},
{"category":"tx2-Denver", "group": "FMA f32", "value": 15},
{"category":"xagx-Carmel", "group": "FMA f64", "value": 17},
{"category":"xagx-Carmel", "group": "FMA f32", "value": 33},
{"category":"xagx-Carmel", "group": "FMA f16", "value": 66},
{"category":"xnx-Carmel", "group": "FMA f64", "value": 14},
{"category":"xnx-Carmel", "group": "FMA f32", "value": 28},
{"category":"xnx-Carmel", "group": "FMA f16", "value": 56},
{"category":"m1u-Icestorm", "group": "FMA f64", "value": 16},
{"category":"m1u-Icestorm", "group": "FMA f32", "value": 33},
{"category":"m1u-Icestorm", "group": "FMA f16", "value": 66},
{"category":"m1u-Icestorm", "group": "DPA4 i32/i8", "value": 132},
{"category":"m1u-Firestorm", "group": "FMA f64", "value": 52},
{"category":"m1u-Firestorm", "group": "FMA f32", "value": 103},
{"category":"m1u-Firestorm", "group": "FMA f16", "value": 206},
{"category":"m1u-Firestorm", "group": "DPA4 i32/i8", "value": 412},
{"category":"onx-A78", "group": "FMA f64", "value": 16},
{"category":"onx-A78", "group": "FMA f32", "value": 32},
{"category":"onx-A78", "group": "FMA f16", "value": 64},
{"category":"onx-A78", "group": "DPA4 i32/i8", "value": 127},
{"category":"oagx-A78", "group": "FMA f64", "value": 18},
{"category":"oagx-A78", "group": "FMA f32", "value": 35},
{"category":"oagx-A78", "group": "FMA f16", "value": 70},
{"category":"oagx-A78", "group": "DPA4 i32/i8", "value": 140},
{"category":"em780-7840u", "group": "FMA f64", "value": 62},
{"category":"em780-7840u", "group": "FMA f32", "value": 124},
{"category":"em780-7840u", "group": "DPA2 f32/bf16", "value": 248},
{"category":"em780-7840u", "group": "DPA4 i32/i8", "value": 497},
{"category":"x7ti-lpe", "group": "FMA f64", "value": 19},
{"category":"x7ti-lpe", "group": "FMA f32", "value": 40},
{"category":"x7ti-lpe", "group": "DPA2 f32/bf16", "value": 80},
{"category":"x7ti-lpe", "group": "DPA4 i32/i8", "value": 160},
{"category":"x7ti-e", "group": "FMA f64", "value": 30},
{"category":"x7ti-e", "group": "FMA f32", "value": 60},
{"category":"x7ti-e", "group": "DPA2 f32/bf16", "value": 120},
{"category":"x7ti-e", "group": "DPA4 i32/i8", "value": 242},
{"category":"x7ti-p", "group": "FMA f64", "value": 56},
{"category":"x7ti-p", "group": "FMA f32", "value": 113},
{"category":"x7ti-p", "group": "DPA2 f32/bf16", "value": 303},
{"category":"x7ti-p", "group": "DPA4 i32/i8", "value": 600}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "Performance (Gop/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group", "sort": "none"},
"color": {"field": "group", "title": "Operation"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative", "sort": "none"}
}
}]
}
Peak performance on CPU (single-core, medium power SBCs) depending on the type of operation and on the SBC (higher is better).
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "Peak performance on CPU depending on the type of operation and on the SBC.",
"width": 1300,
"height": 300,
"data": {
"values": [
{"category":"tx2-A57-4t", "group": "FMA f64", "value": 33},
{"category":"tx2-A57-4t", "group": "FMA f32", "value": 65},
{"category":"tx2-A57-4t", "group": "FMA f16", "value": 0},
{"category":"tx2-A57-4t", "group": "DPA2 f32/bf16", "value": 0},
{"category":"tx2-A57-4t", "group": "DPA4 i32/i8", "value": 0},
{"category":"tx2-Denver-2t", "group": "FMA f64", "value": 16},
{"category":"tx2-Denver-2t", "group": "FMA f32", "value": 31},
{"category":"xagx-Carmel-8t", "group": "FMA f64", "value": 133},
{"category":"xagx-Carmel-8t", "group": "FMA f32", "value": 264},
{"category":"xagx-Carmel-8t", "group": "FMA f16", "value": 530},
{"category":"xnx-Carmel-6t", "group": "FMA f64", "value": 84},
{"category":"xnx-Carmel-6t", "group": "FMA f32", "value": 167},
{"category":"xnx-Carmel-6t", "group": "FMA f16", "value": 334},
{"category":"m1u-Icestorm-4t", "group": "FMA f64", "value": 66},
{"category":"m1u-Icestorm-4t", "group": "FMA f32", "value": 132},
{"category":"m1u-Icestorm-4t", "group": "FMA f16", "value": 263},
{"category":"m1u-Icestorm-4t", "group": "DPA4 i32/i8", "value": 527},
{"category":"m1u-Firestorm-16t", "group": "FMA f64", "value": 775},
{"category":"m1u-Firestorm-16t", "group": "FMA f32", "value": 1551},
{"category":"m1u-Firestorm-16t", "group": "FMA f16", "value": 3102},
{"category":"m1u-Firestorm-16t", "group": "DPA4 i32/i8", "value": 6201},
{"category":"onx-A78-8t", "group": "FMA f64", "value": 125},
{"category":"onx-A78-8t", "group": "FMA f32", "value": 252},
{"category":"onx-A78-8t", "group": "FMA f16", "value": 504},
{"category":"onx-A78-8t", "group": "DPA4 i32/i8", "value": 1010},
{"category":"oagx-A78-12t", "group": "FMA f64", "value": 210},
{"category":"oagx-A78-12t", "group": "FMA f32", "value": 421},
{"category":"oagx-A78-12t", "group": "FMA f16", "value": 842},
{"category":"oagx-A78-12t", "group": "DPA4 i32/i8", "value": 1684},
{"category":"em780-7840u-8t", "group": "FMA f64", "value": 442},
{"category":"em780-7840u-8t", "group": "FMA f32", "value": 872},
{"category":"em780-7840u-8t", "group": "DPA2 f32/bf16", "value": 1854},
{"category":"em780-7840u-8t", "group": "DPA4 i32/i8", "value": 3560},
{"category":"x7ti-lpe-2t", "group": "FMA f64", "value": 40},
{"category":"x7ti-lpe-2t", "group": "FMA f32", "value": 79},
{"category":"x7ti-lpe-2t", "group": "DPA2 f32/bf16", "value": 160},
{"category":"x7ti-lpe-2t", "group": "DPA4 i32/i8", "value": 319},
{"category":"x7ti-e-8t", "group": "FMA f64", "value": 210},
{"category":"x7ti-e-8t", "group": "FMA f32", "value": 420},
{"category":"x7ti-e-8t", "group": "DPA2 f32/bf16", "value": 841},
{"category":"x7ti-e-8t", "group": "DPA4 i32/i8", "value": 1683},
{"category":"x7ti-p-6t", "group": "FMA f64", "value": 430},
{"category":"x7ti-p-6t", "group": "FMA f32", "value": 853},
{"category":"x7ti-p-6t", "group": "DPA2 f32/bf16", "value": 1704},
{"category":"x7ti-p-6t", "group": "DPA4 i32/i8", "value": 3430}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "Performance (Gop/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group", "sort": "none"},
"color": {"field": "group", "title": "Operation"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative", "sort": "none"}
}
}]
}
Peak performance on CPU (multi-core, medium power SBCs) depending on the type of operation and on the SBC (higher is better).
GPU Memory Bandwidth
Measurement of the memory bandwidth between the GPU and its global
memory with the clpeak
benchmark. On the Nvidia Jetson platforms,
PoCL has been installed to enable OpenCL support (see
the PoCL Installation on Jetson
section).
Integrated GPUs
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "GPU memory bandwidth depending on the SBC.",
"width": 1500,
"height": 300,
"data": {
"values": [
{"category": "xu4-Mali-T628-MP6", "group": "float32x1", "value": 5.2},
{"category": "xu4-Mali-T628-MP6", "group": "float32x2", "value": 6.9},
{"category": "xu4-Mali-T628-MP6", "group": "float32x4", "value": 7.0},
{"category": "xu4-Mali-T628-MP6", "group": "float32x8", "value": 6.9},
{"category": "tx2-Pascal-2SMX", "group": "float32x1", "value": 37},
{"category": "tx2-Pascal-2SMX", "group": "float32x2", "value": 46},
{"category": "tx2-Pascal-2SMX", "group": "float32x4", "value": 46},
{"category": "tx2-Pascal-2SMX", "group": "float32x8", "value": 34},
{"category": "xagx-Volta-8SMX", "group": "float32x1", "value": 110},
{"category": "xagx-Volta-8SMX", "group": "float32x2", "value": 109},
{"category": "xagx-Volta-8SMX", "group": "float32x4", "value": 109},
{"category": "xagx-Volta-8SMX", "group": "float32x8", "value": 91},
{"category": "xnano-Maxwell-1SMX", "group": "float32x1", "value": 18},
{"category": "xnano-Maxwell-1SMX", "group": "float32x2", "value": 21},
{"category": "xnano-Maxwell-1SMX", "group": "float32x4", "value": 21},
{"category": "xnano-Maxwell-1SMX", "group": "float32x8", "value": 20},
{"category": "xnx-Volta-6SMX", "group": "float32x1", "value": 47},
{"category": "xnx-Volta-6SMX", "group": "float32x2", "value": 49},
{"category": "xnx-Volta-6SMX", "group": "float32x4", "value": 49},
{"category": "xnx-Volta-6SMX", "group": "float32x8", "value": 44},
{"category": "m1u-macos-48c", "group": "float32x1", "value": 699},
{"category": "m1u-macos-48c", "group": "float32x2", "value": 717},
{"category": "m1u-macos-48c", "group": "float32x4", "value": 729},
{"category": "m1u-macos-48c", "group": "float32x8", "value": 703},
{"category": "m1u-linux-48c", "group": "float32x1", "value": 500},
{"category": "m1u-linux-48c", "group": "float32x2", "value": 514},
{"category": "m1u-linux-48c", "group": "float32x4", "value": 523},
{"category": "m1u-linux-48c", "group": "float32x8", "value": 524},
{"category": "vim1s-Mali-G31-MP2", "group": "float32x1", "value": 3.5},
{"category": "vim1s-Mali-G31-MP2", "group": "float32x2", "value": 4.3},
{"category": "vim1s-Mali-G31-MP2", "group": "float32x4", "value": 4.2},
{"category": "vim1s-Mali-G31-MP2", "group": "float32x8", "value": 3.5},
{"category": "onx-Ampere-8SMX", "group": "float32x1", "value": 87},
{"category": "onx-Ampere-8SMX", "group": "float32x2", "value": 94},
{"category": "onx-Ampere-8SMX", "group": "float32x4", "value": 94},
{"category": "onx-Ampere-8SMX", "group": "float32x8", "value": 94},
{"category": "oagx-Ampere-16SMX", "group": "float32x1", "value": 174},
{"category": "oagx-Ampere-16SMX", "group": "float32x2", "value": 178},
{"category": "oagx-Ampere-16SMX", "group": "float32x4", "value": 179},
{"category": "oagx-Ampere-16SMX", "group": "float32x8", "value": 180},
{"category": "onano-Ampere-8SMX", "group": "float32x1", "value": 63},
{"category": "onano-Ampere-8SMX", "group": "float32x2", "value": 64},
{"category": "onano-Ampere-8SMX", "group": "float32x4", "value": 64},
{"category": "onano-Ampere-8SMX", "group": "float32x8", "value": 64},
{"category": "opi5-Mali-G610-MP4", "group": "float32x1", "value": 24},
{"category": "opi5-Mali-G610-MP4", "group": "float32x2", "value": 26},
{"category": "opi5-Mali-G610-MP4", "group": "float32x4", "value": 26},
{"category": "opi5-Mali-G610-MP4", "group": "float32x8", "value": 20},
{"category": "em780-Radeon-780M", "group": "float32x1", "value": 72},
{"category": "em780-Radeon-780M", "group": "float32x2", "value": 76},
{"category": "em780-Radeon-780M", "group": "float32x4", "value": 79},
{"category": "em780-Radeon-780M", "group": "float32x8", "value": 80},
{"category": "x7ti-Alchemist-8Xe", "group": "float32x1", "value": 73},
{"category": "x7ti-Alchemist-8Xe", "group": "float32x2", "value": 74},
{"category": "x7ti-Alchemist-8Xe", "group": "float32x4", "value": 75},
{"category": "x7ti-Alchemist-8Xe", "group": "float32x8", "value": 78}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "Throughput (GB/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group", "sort": "none"},
"color": {"field": "group", "title": "Datatype"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative", "sort": "none"}
}
}]
}
GPU memory bandwidth depending on the SBC (integrated GPUs, higher is better).
Discrete GPUs
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "GPU memory bandwidth depending on the SBC.",
"width": 500,
"height": 300,
"data": {
"values": [
{"category": "GeForce-RTX-3090", "group": "float32x 1", "value": 817},
{"category": "GeForce-RTX-3090", "group": "float32x 2", "value": 842},
{"category": "GeForce-RTX-3090", "group": "float32x 4", "value": 856},
{"category": "GeForce-RTX-3090", "group": "float32x 8", "value": 788},
{"category": "GeForce-RTX-3090", "group": "float32x16", "value": 845},
{"category": "Radeon-RX-7900-XTX", "group": "float32x 1", "value": 601},
{"category": "Radeon-RX-7900-XTX", "group": "float32x 2", "value": 623},
{"category": "Radeon-RX-7900-XTX", "group": "float32x 4", "value": 642},
{"category": "Radeon-RX-7900-XTX", "group": "float32x 8", "value": 666},
{"category": "Radeon-RX-7900-XTX", "group": "float32x16", "value": 683}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "Throughput (GB/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group", "sort": "none"},
"color": {"field": "group", "title": "Datatype"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative", "sort": "none"}
}
}]
}
GPU memory bandwidth of the discrete GPUs (higher is better).
GPU Peak Performance
Measurement of the GPU peak performance. The clpeak benchmark is used: it is an OpenCL benchmark that executes a compute-intensive program to estimate the peak performance. On the Nvidia Jetson platforms, PoCL has been installed to enable OpenCL support (see the PoCL Installation on Jetson section).
Integrated GPUs
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "GPU peak performance (32-bit float) depending on the SBC.",
"width": 1300,
"height": 300,
"data": {
"values": [
{"category": "xu4-Mali-T628-MP6", "group": "float64", "value": 26},
{"category": "xu4-Mali-T628-MP6", "group": "float32", "value": 56},
{"category": "xu4-Mali-T628-MP6", "group": "float16", "value": 114},
{"category": "tx2-Pascal-2SMX", "group": "float64", "value": 21},
{"category": "tx2-Pascal-2SMX", "group": "float32", "value": 652},
{"category": "tx2-Pascal-2SMX", "group": "float16", "value": 0},
{"category": "xagx-Volta-8SMX", "group": "float64", "value": 44},
{"category": "xagx-Volta-8SMX", "group": "float32", "value": 1404},
{"category": "xagx-Volta-8SMX", "group": "float16", "value": 0},
{"category": "xnano-Maxwell-1SMX", "group": "float64", "value": 7},
{"category": "xnano-Maxwell-1SMX", "group": "float32", "value": 230},
{"category": "xnano-Maxwell-1SMX", "group": "float16", "value": 0},
{"category": "xnx-Volta-6SMX", "group": "float64", "value": 27},
{"category": "xnx-Volta-6SMX", "group": "float32", "value": 847},
{"category": "xnx-Volta-6SMX", "group": "float16", "value": 0},
{"category": "m1u-macos-48c", "group": "float64", "value": 0},
{"category": "m1u-macos-48c", "group": "float32", "value": 7706},
{"category": "m1u-macos-48c", "group": "float16", "value": 0},
{"category": "m1u-linux-48c", "group": "float64", "value": 0},
{"category": "m1u-linux-48c", "group": "float32", "value": 7120},
{"category": "m1u-linux-48c", "group": "float16", "value": 6145},
{"category": "vim1s-Mali-G31-MP2", "group": "float64", "value": 0},
{"category": "vim1s-Mali-G31-MP2", "group": "float32", "value": 13},
{"category": "vim1s-Mali-G31-MP2", "group": "float16", "value": 27},
{"category": "onx-Ampere-8SMX", "group": "float64", "value": 30},
{"category": "onx-Ampere-8SMX", "group": "float32", "value": 1844},
{"category": "onx-Ampere-8SMX", "group": "float16", "value": 3520},
{"category": "oagx-Ampere-16SMX", "group": "float64", "value": 83},
{"category": "oagx-Ampere-16SMX", "group": "float32", "value": 5211},
{"category": "oagx-Ampere-16SMX", "group": "float16", "value": 9957},
{"category": "onano-Ampere-8SMX", "group": "float64", "value": 20},
{"category": "onano-Ampere-8SMX", "group": "float32", "value": 1255},
{"category": "onano-Ampere-8SMX", "group": "float16", "value": 2397},
{"category": "opi5-Mali-G610-MP4", "group": "float64", "value": 0},
{"category": "opi5-Mali-G610-MP4", "group": "float32", "value": 474},
{"category": "opi5-Mali-G610-MP4", "group": "float16", "value": 917},
{"category": "em780-Radeon-780M", "group": "float64", "value": 86},
{"category": "em780-Radeon-780M", "group": "float32", "value": 2522},
{"category": "em780-Radeon-780M", "group": "float16", "value": 4574},
{"category": "x7ti-Alchemist-8Xe", "group": "float64", "value": 149},
{"category": "x7ti-Alchemist-8Xe", "group": "float32", "value": 4774},
{"category": "x7ti-Alchemist-8Xe", "group": "float16", "value": 9473}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "Peak Performance (Gflop/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group", "sort": "none"},
"color": {"field": "group", "title": "Datatype"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative", "sort": "none"}
}
}]
}
GPU peak performance depending on the SBC (integrated GPUs, higher is better).
Discrete GPUs
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "GPU peak performance (32-bit float) depending on the SBC.",
"width": 500,
"height": 300,
"data": {
"values": [
{"category": "GeForce-RTX-3090", "group": "float64", "value": 629},
{"category": "GeForce-RTX-3090", "group": "float32", "value": 36038},
{"category": "GeForce-RTX-3090", "group": "float16", "value": 39636},
{"category": "Radeon-RX-7900-XTX", "group": "float64", "value": 907},
{"category": "Radeon-RX-7900-XTX", "group": "float32", "value": 23952},
{"category": "Radeon-RX-7900-XTX", "group": "float16", "value": 40445}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "Peak Performance (Gflop/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group", "sort": "none"},
"color": {"field": "group", "title": "Datatype"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative", "sort": "none"}
}
}]
}
GPU peak performance of the discrete GPUs (higher is better).
Compute-Intensive n-body Code
MUrB is an \(n\)-body code that simulates Newtonian gravitational interactions. This type of code is known to be mostly compute-bound because there are \(O(n^2)\) computations for \(n\) data. The CPU code is vectorized thanks to the MIPP SIMD wrapper and multi-threaded with OpenMP (all the available cores are used for the benchmark). On GPU, an OpenCL and a CUDA implementation are evaluated. In all cases, the computations are performed using the float32 datatype.
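To illustrate why the kernel is compute-bound, here is a scalar sketch of the acceleration computation of one simulation step. It is not the MUrB code (MUrB vectorizes this loop with MIPP, multi-threads it with OpenMP and provides CUDA/OpenCL ports); the function name nbody_accel and the softening factor soft are assumptions of this sketch.
#include <math.h>
#include <stdio.h>
#include <stddef.h>

/* Scalar O(n^2) n-body acceleration sketch in float32 (illustrative only).
 * The softening factor `soft` avoids the singularity when two bodies are
 * very close and makes the j == i term vanish (dx = dy = dz = 0). */
static void nbody_accel(size_t n, const float *x, const float *y, const float *z,
                        const float *m, float *ax, float *ay, float *az, float soft)
{
    for (size_t i = 0; i < n; i++) {
        float axi = 0.f, ayi = 0.f, azi = 0.f;
        for (size_t j = 0; j < n; j++) {            /* ~20 flops per interaction */
            const float dx = x[j] - x[i], dy = y[j] - y[i], dz = z[j] - z[i];
            const float d2 = dx * dx + dy * dy + dz * dz + soft * soft;
            const float inv = 1.f / sqrtf(d2);
            const float s = m[j] * inv * inv * inv; /* G folded into the masses */
            axi += s * dx; ayi += s * dy; azi += s * dz;
        }
        ax[i] = axi; ay[i] = ayi; az[i] = azi;
    }
}

int main(void)
{
    enum { N = 4 };
    float x[N] = {0, 1, 2, 3}, y[N] = {0}, z[N] = {0}, m[N] = {1, 1, 1, 1};
    float ax[N], ay[N], az[N];
    nbody_accel(N, x, y, z, m, ax, ay, az, 0.01f);
    printf("a[0] = (%g, %g, %g)\n", ax[0], ay[0], az[0]);
    return 0;
}
With roughly 20 floating-point operations per pair of bodies and only a few bytes of data loaded per body, the arithmetic intensity is high, which is why the performance is bound by compute rather than by memory bandwidth.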
CPU MIPP implementation
The following command line is used:
murb --nv -i 100 -n 20000 --gf --im cpu+simd # on CPU with SIMD (MIPP wrapper)
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "MUrB: achieved performance on CPU depending on the SBC.",
"width": 500,
"height": 300,
"data": {
"values": [
{"category": "xu4", "group": "MIPP", "value": 11},
{"category": "rpi3", "group": "MIPP", "value": 10},
{"category": "tx2", "group": "MIPP", "value": 53},
{"category": "xagx", "group": "MIPP", "value": 151},
{"category": "xnano", "group": "MIPP", "value": 20},
{"category": "rpi4", "group": "MIPP", "value": 22},
{"category": "xnx", "group": "MIPP", "value": 95},
{"category": "m1u", "group": "MIPP", "value": 838},
{"category": "vim1s", "group": "MIPP", "value": 10},
{"category": "onx", "group": "MIPP", "value": 103},
{"category": "oagx", "group": "MIPP", "value": 172},
{"category": "onano", "group": "MIPP", "value": 60},
{"category": "opi5", "group": "MIPP", "value": 80},
{"category": "rpi5", "group": "MIPP", "value": 58},
{"category": "em780", "group": "MIPP", "value": 633},
{"category": "x7ti", "group": "MIPP", "value": 627}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "MUrB Performance (Gflop/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group", "sort": "none"},
"color": {"field": "group", "title": "API"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative", "sort": "none"}
}
}]
}
MUrB: Achieved performance on CPU depending on the SBC (higher is better).
GPU CUDA & OpenCL implementations
Integrated GPUs
The following command lines are used:
murb --nv -i 1500 -n 30000 --gf --im cuda+rsqrt4 # on GPU with CUDA API
murb --nv -i 1500 -n 30000 --gf --im ocl+rsqrt4 # on GPU with OpenCL API
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "MUrB: achieved performance on GPU depending on the SBC.",
"width": 1000,
"height": 300,
"data": {
"values": [
{"category":"xu4-Mali-T628-MP6", "group": "OCL", "value": 13},
{"category":"xu4-Mali-T628-MP6", "group": "CUDA", "value": null},
{"category":"tx2-Pascal-2SMX", "group": "OCL", "value": 274},
{"category":"tx2-Pascal-2SMX", "group": "CUDA", "value": 276},
{"category":"xagx-Volta-8SMX", "group": "OCL", "value": 735},
{"category":"xagx-Volta-8SMX", "group": "CUDA", "value": 736},
{"category":"xnano-Maxwell-1SMX", "group": "OCL", "value": 109},
{"category":"xnano-Maxwell-1SMX", "group": "CUDA", "value": 108},
{"category":"xnx-Volta-6SMX", "group": "OCL", "value": 446},
{"category":"xnx-Volta-6SMX", "group": "CUDA", "value": 454},
{"category":"m1u-macos-48c", "group": "OCL", "value": 2104},
{"category":"m1u-linux-48c", "group": "OCL", "value": 1558},
{"category":"vim1s-Mali-G31-MP2", "group": "OCL", "value": 8},
{"category":"onx-Ampere-8SMX", "group": "OCL", "value": 594},
{"category":"onx-Ampere-8SMX", "group": "CUDA", "value": 595},
{"category":"oagx-Ampere-16SMX", "group": "OCL", "value": 1572},
{"category":"oagx-Ampere-16SMX", "group": "CUDA", "value": 1629},
{"category":"onano-Ampere-8SMX", "group": "OCL", "value": 423},
{"category":"onano-Ampere-8SMX", "group": "CUDA", "value": 437},
{"category":"opi5-Mali-G610-MP4", "group": "OCL", "value": 148},
{"category":"opi5-Mali-G610-MP4", "group": "CUDA", "value": null},
{"category":"em780-Radeon-780M", "group": "OCL", "value": 1292},
{"category":"em780-Radeon-780M", "group": "CUDA", "value": null},
{"category":"x7ti-Alchemist-8Xe", "group": "OCL", "value": 2143}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "MUrB Performance (Gflop/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group"},
"color": {"field": "group", "title": "API"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative"}
}
}]
}
MUrB: Achieved performance on the integrated GPUs depending on the SBC (higher is better).
Discrete GPUs
For the Nvidia GeForce RTX 3090, the code is run with the following command lines:
murb --nv -i 750 -n 200000 --gf --im cuda+locu2 --wg 32 # for CUDA API
murb --nv -i 750 -n 200000 --gf --im ocl+locu2 --wg 32 # for OpenCL API
For the AMD Radeon RX 7900 XTX, the code is run with the following command line:
murb --nv -i 750 -n 200000 --gf --im ocl+rsqrt2 --wg 32
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "MUrB: achieved performance on GPU depending on the SBC.",
"width": 300,
"height": 300,
"data": {
"values": [
{"category":"GeForce-RTX-3090", "group": "OCL", "value": 12878},
{"category":"GeForce-RTX-3090", "group": "CUDA", "value": 12513},
{"category":"Radeon-RX-7900-XTX", "group": "OCL", "value": 19692}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "MUrB Performance (Gflop/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group"},
"color": {"field": "group", "title": "API"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative"}
}
}]
}
MUrB: Achieved performance on the discrete GPUs (higher is better).
Meteor Detection Application
FMDT is an application that detects moving meteors in the sky. The most optimized version of FMDT is executed on Full HD frames with a {1, 4, 1} pipeline. In total there are 6 active threads, 4 of which are heavily loaded. The application relies on the StreamPU multi-threading runtime and on the FLSL algorithm for labeling (CPU-only code).
The following command line is used:
fmdt-detect-rt-opt-pip --vid-in-path ../2022_05_31_tauh_34_meteors.mp4 --vid-in-buff --vid-in-loop 30 --rt-stats --ccl-impl LSLM --pip-threads '[1,4,1]'
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "FMDT: achieved number of FPS on the SBC.",
"width": 500,
"height": 300,
"data": {
"values": [
{"category": "xu4", "group": "SPU{1,4,1}+FLSL", "value": 237},
{"category": "rpi3", "group": "SPU{1,4,1}+FLSL", "value": 82},
{"category": "tx2", "group": "SPU{1,4,1}+FLSL", "value": 719},
{"category": "xagx", "group": "SPU{1,4,1}+FLSL", "value": 2033},
{"category": "xnano", "group": "SPU{1,4,1}+FLSL", "value": 356},
{"category": "rpi4", "group": "SPU{1,4,1}+FLSL", "value": 152},
{"category": "xnx", "group": "SPU{1,4,1}+FLSL", "value": 1563},
{"category": "m1u", "group": "SPU{1,4,1}+FLSL", "value": 5714},
{"category": "vim1s", "group": "SPU{1,4,1}+FLSL", "value": 230},
{"category": "onx", "group": "SPU{1,4,1}+FLSL", "value": 1430},
{"category": "oagx", "group": "SPU{1,4,1}+FLSL", "value": 1812},
{"category": "onano", "group": "SPU{1,4,1}+FLSL", "value": 1234},
{"category": "opi5", "group": "SPU{1,4,1}+FLSL", "value": 873},
{"category": "rpi5", "group": "SPU{1,4,1}+FLSL", "value": 187},
{"category": "em780", "group": "SPU{1,4,1}+FLSL", "value": 3118},
{"category": "x7ti", "group": "SPU{1,4,1}+FLSL", "value": 4235}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "Frames Per Second (FPS)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group"},
"color": {"field": "group", "title": "Version"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative"}
}
}]
}
FMDT: Achieved number of FPS on the SBC (higher is better).
Forward Error Correction Simulation
AFF3CT is a software package dedicated to Forward Error Correction (or channel coding) simulations, for instance the simulation of the physical layer of digital telecommunication standards such as 5G.
The following command line is used:
aff3ct -p 8 --sim-type BFER -m 4.5 -M 4.5 -C POLAR -K 1755 -N 2048 --src-type AZCW --crc-type 32-GZIP --crc-implem FAST --enc-fb-gen-method GA --chn-type AWGN --chn-implem FAST --qnt-type POW2 --qnt-implem FAST --qnt-bits 6 --qnt-dec 1 --dec-type ASCL --dec-implem FAST --dec-simd INTRA -L 32 --dec-polar-nodes '{R0,R0L,R1,REP_2-8,REPL,SPC_4}' --sim-stop-time 60
It is a simulation of a Polar code (2048,1755) decoded with an ASCL decoder (see https://aff3ct.github.io/#performances for more details). The reported metric is the information throughput (in Mb/s in the chart below). The code uses all the CPU cores available on the node (thanks to the StreamPU runtime) and is vectorized with the MIPP SIMD wrapper.
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "AFF3CT: achieved information throughput depending on the SBCs.",
"width": 500,
"height": 300,
"data": {
"values": [
{"category": "xu4", "group": "Polar-ASCL+SPU+MIPP", "value": 122},
{"category": "rpi3", "group": "Polar-ASCL+SPU+MIPP", "value": 47},
{"category": "tx2", "group": "Polar-ASCL+SPU+MIPP", "value": 195},
{"category": "xagx", "group": "Polar-ASCL+SPU+MIPP", "value": 465},
{"category": "xnano", "group": "Polar-ASCL+SPU+MIPP", "value": 78},
{"category": "rpi4", "group": "Polar-ASCL+SPU+MIPP", "value": 92},
{"category": "xnx", "group": "Polar-ASCL+SPU+MIPP", "value": 326},
{"category": "m1u", "group": "Polar-ASCL+SPU+MIPP", "value": 3196},
{"category": "vim1s", "group": "Polar-ASCL+SPU+MIPP", "value": 55},
{"category": "onx", "group": "Polar-ASCL+SPU+MIPP", "value": 550},
{"category": "oagx", "group": "Polar-ASCL+SPU+MIPP", "value": 907},
{"category": "onano", "group": "Polar-ASCL+SPU+MIPP", "value": 315},
{"category": "opi5", "group": "Polar-ASCL+SPU+MIPP", "value": 343},
{"category": "rpi5", "group": "Polar-ASCL+SPU+MIPP", "value": 245},
{"category": "em780", "group": "Polar-ASCL+SPU+MIPP", "value": 2626},
{"category": "x7ti", "group": "Polar-ASCL+SPU+MIPP", "value": 2415}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "Information Throughput (Mb/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group"},
"color": {"field": "group", "title": "Simulation"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative"}
}
}]
}
AFF3CT: Achieved information throughput depending on the SBC (higher is better).
Summary Table
The following table summarizes the different benchmarks and applications: