
Synthetic Benchmarks

This section gives an overview of the performance of the compute nodes on synthetic benchmarks and on representative embedded applications.

CPU Memory Bandwidth

Measurement of the memory bandwidth between the CPU and the RAM with the triad micro-kernel (C[i] = x * A[i] + B[i]). The bandwidth benchmark is used: it is dedicated to efficient CPU memory bandwidth measurements, in the spirit of the classic STREAM.
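The triad kernel is simple enough to sketch. Below is a minimal, illustrative Python version with the usual STREAM-style byte accounting (three values moved per element: read A, read B, write C). This is only a sketch of the kernel's semantics; an interpreted loop will not come close to the hardware bandwidth that the actual benchmark measures.

```python
import time

def triad(A, B, x):
    """STREAM-style triad: C[i] = x * A[i] + B[i]."""
    return [x * a + b for a, b in zip(A, B)]

def triad_bandwidth_gbs(n, reps=10, bytes_per_elem=8):
    """Estimate bandwidth in GB/s: 3 values of bytes_per_elem move per element."""
    A = [1.0] * n
    B = [2.0] * n
    t0 = time.perf_counter()
    for _ in range(reps):
        C = triad(A, B, 3.0)
    dt = time.perf_counter() - t0
    moved = 3 * bytes_per_elem * n * reps
    return moved / dt / 1e9
```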

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "description": "CPU memory bandwidth depending on the SBCs", "width": 750, "height": 300, "data": { "values": [ {"category": "xu4", "group": "triad", "value": 4}, {"category": "rpi3", "group": "triad", "value": 2}, {"category": "tx2", "group": "triad", "value": 20}, {"category": "xagx", "group": "triad", "value": 64}, {"category": "xnano", "group": "triad", "value": 9}, {"category": "rpi4", "group": "triad", "value": 5}, {"category": "xnx", "group": "triad", "value": 32}, {"category": "m1u", "group": "triad", "value": 320}, {"category": "vim1s", "group": "triad", "value": 7}, {"category": "onx", "group": "triad", "value": 45}, {"category": "oagx", "group": "triad", "value": 73}, {"category": "onano", "group": "triad", "value": 26}, {"category": "opi5", "group": "triad", "value": 20}, {"category": "rpi5", "group": "triad", "value": 10}, {"category": "em780", "group": "triad", "value": 62}, {"category": "bpif3", "group": "triad", "value": 7}, {"category": "x7ti", "group": "triad", "value": 73} ] }, "mark": "bar", "encoding": { "x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}}, "y": {"field": "value", "type": "quantitative", "axis": {"title": "Throughput (GB/s)"}, "scale": {"type": "linear"}}, "xOffset": {"field": "group", "sort": "none"}, "color": {"field": "group", "title": "ubench"} }, "layer": [{ "mark": "bar" }, { "mark": { "type": "text", "align": "center", "baseline": "middle", "dy": -10 }, "encoding": { "text": {"field": "value", "type": "quantitative", "sort": "none"} } }] }

CPU memory bandwidth depending on the SBC (higher is better).

CPU Peak Performance

Measurement of the CPU peak performance according to different operations:

  • FMA - Fused Multiply-Add, performs the operation \(d = a \times b + c\) on 64-bit, 32-bit or 16-bit floating-point numbers (referred to as f64, f32 and f16 here).
  • DPA4 - Performs the dot product of four 8-bit integers (i8) and accumulates the result in a 32-bit integer (i32): \(c^{i32} = c^{i32} + \sum^4_{s = 1}{ a_s^{i8} \times b_s^{i8}}\).
  • DPA2 - Performs the dot product of two 16-bit brain floats (bf16) and accumulates the result in a 32-bit float (f32): \(c^{f32} = c^{f32} + \sum^2_{s = 1}{ a_s^{bf16} \times b_s^{bf16}}\).
  • MADOT - Performs a small matrix multiplication. For instance, for RVV 1.0 256-bit + IME, MADOT i32/i8 performs \(C^{i32} = C^{i32} + A^{i8}B^{i8}\) where \(A^{i8}\) is \(4 \times 8\), \(B^{i8}\) is \(8 \times 4\) and \(C^{i32}\) is \(4 \times 4\).

The cpufp benchmark is used. The charts below give the obtained performance depending on the targeted SBC and on the number of cores. For multi-core runs, the number of cores used is indicated by the -nt suffix, where n is the core count: for instance, the rpi3-A53-4t label means that 4 cores of the Raspberry Pi 3 are used.
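The scalar semantics of the operations listed above can be written out explicitly. The small Python reference below is purely illustrative (it is not cpufp, and real hardware performs the FMA with a single rounding); the MADOT shapes follow the \(4 \times 8\) by \(8 \times 4\) layout described for the RVV IME variant.

```python
def fma(a, b, c):
    """Fused multiply-add: d = a * b + c (single rounding in hardware)."""
    return a * b + c

def dpa4(c, a, b):
    """DPA4: accumulate the dot product of four i8 values into an i32."""
    assert len(a) == len(b) == 4
    return c + sum(ai * bi for ai, bi in zip(a, b))

def madot(C, A, B):
    """MADOT: C (4x4, i32) += A (4x8, i8) x B (8x4, i8)."""
    return [[C[i][j] + sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(4)] for i in range(4)]
```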

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "description": "Peak performance on CPU (mono-core) depending on the type of operation and on the SBC.", "width": 1000, "height": 300, "data": { "values": [ {"category":"rpi3-A53", "group": "FMA f64", "value": 5}, {"category":"rpi3-A53", "group": "FMA f32", "value": 9}, {"category":"rpi3-A53", "group": "FMA f16", "value": 0}, {"category":"rpi3-A53", "group": "DPA2 f32/bf16", "value": 0}, {"category":"rpi3-A53", "group": "DPA4 i32/i8", "value": 0}, {"category":"rpi3-A53", "group": "MADOT i32/i8", "value": 0}, {"category":"xnano-A57", "group": "FMA f64", "value": 6}, {"category":"xnano-A57", "group": "FMA f32", "value": 12}, {"category":"rpi4-A72", "group": "FMA f64", "value": 6}, {"category":"rpi4-A72", "group": "FMA f32", "value": 12}, {"category":"vim1s-A35", "group": "FMA f64", "value": 1.6}, {"category":"vim1s-A35", "group": "FMA f32", "value": 3.2}, {"category":"onano-A78", "group": "FMA f64", "value": 12}, {"category":"onano-A78", "group": "FMA f32", "value": 24}, {"category":"onano-A78", "group": "FMA f16", "value": 48}, {"category":"onano-A78", "group": "DPA4 i32/i8", "value": 97}, {"category":"opi5-A55", "group": "FMA f64", "value": 7}, {"category":"opi5-A55", "group": "FMA f32", "value": 14}, {"category":"opi5-A55", "group": "FMA f16", "value": 29}, {"category":"opi5-A55", "group": "DPA4 i32/i8", "value": 58}, {"category":"opi5-A76", "group": "FMA f64", "value": 18}, {"category":"opi5-A76", "group": "FMA f32", "value": 36}, {"category":"opi5-A76", "group": "FMA f16", "value": 71}, {"category":"opi5-A76", "group": "DPA4 i32/i8", "value": 143}, {"category":"rpi5-A76", "group": "FMA f64", "value": 19}, {"category":"rpi5-A76", "group": "FMA f32", "value": 38}, {"category":"rpi5-A76", "group": "FMA f16", "value": 77}, {"category":"rpi5-A76", "group": "DPA4 i32/i8", "value": 153}, {"category":"bpif3-X60", "group": "FMA f64", "value": 13}, {"category":"bpif3-X60", "group": "FMA f32", "value": 25}, 
{"category":"bpif3-X60", "group": "FMA f16", "value": 53}, {"category":"bpif3-X60", "group": "MADOT i32/i8", "value": 408} ] }, "mark": "bar", "encoding": { "x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}}, "y": {"field": "value", "type": "quantitative", "axis": {"title": "Performance (Gop/s)"}, "scale": {"type": "linear"}}, "xOffset": {"field": "group", "sort": "none"}, "color": {"field": "group", "title": "Operation"} }, "layer": [{ "mark": "bar" }, { "mark": { "type": "text", "align": "center", "baseline": "middle", "dy": -10 }, "encoding": { "text": {"field": "value", "type": "quantitative", "sort": "none"} } }] }
Peak performance on CPU (mono-core) depending on the type of operation and on the SBC (higher is better).

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "description": "Peak performance on CPU depending on the type of operation and on the SBC.", "width": 1000, "height": 300, "data": { "values": [ {"category":"rpi3-A53-4t", "group": "FMA f64", "value": 19}, {"category":"rpi3-A53-4t", "group": "FMA f32", "value": 38}, {"category":"rpi3-A53-4t", "group": "FMA f16", "value": 0}, {"category":"rpi3-A53-4t", "group": "DPA2 f32/bf16", "value": 0}, {"category":"rpi3-A53-4t", "group": "DPA4 i32/i8", "value": 0}, {"category":"rpi3-A53-4t", "group": "MADOT i32/i8", "value": 0}, {"category":"xnano-A57-4t", "group": "FMA f64", "value": 24}, {"category":"xnano-A57-4t", "group": "FMA f32", "value": 47}, {"category":"rpi4-A72-4t", "group": "FMA f64", "value": 24}, {"category":"rpi4-A72-4t", "group": "FMA f32", "value": 48}, {"category":"vim1s-A35", "group": "FMA f64", "value": 6.3}, {"category":"vim1s-A35", "group": "FMA f32", "value": 12.7}, {"category":"onano-A78-6t", "group": "FMA f64", "value": 72}, {"category":"onano-A78-6t", "group": "FMA f32", "value": 145}, {"category":"onano-A78-6t", "group": "FMA f16", "value": 290}, {"category":"onano-A78-6t", "group": "DPA4 i32/i8", "value": 579}, {"category":"opi5-A55-4t", "group": "FMA f64", "value": 29}, {"category":"opi5-A55-4t", "group": "FMA f32", "value": 58}, {"category":"opi5-A55-4t", "group": "FMA f16", "value": 115}, {"category":"opi5-A55-4t", "group": "DPA4 i32/i8", "value": 231}, {"category":"opi5-A76-4t", "group": "FMA f64", "value": 71}, {"category":"opi5-A76-4t", "group": "FMA f32", "value": 142}, {"category":"opi5-A76-4t", "group": "FMA f16", "value": 284}, {"category":"opi5-A76-4t", "group": "DPA4 i32/i8", "value": 568}, {"category":"rpi5-A76-4t", "group": "FMA f64", "value": 77}, {"category":"rpi5-A76-4t", "group": "FMA f32", "value": 154}, {"category":"rpi5-A76-4t", "group": "FMA f16", "value": 307}, {"category":"rpi5-A76-4t", "group": "DPA4 i32/i8", "value": 614}, {"category":"bpif3-X60-8t", "group": 
"FMA f64", "value": 106}, {"category":"bpif3-X60-8t", "group": "FMA f32", "value": 214}, {"category":"bpif3-X60-8t", "group": "FMA f16", "value": 426}, {"category":"bpif3-X60-8t", "group": "MADOT i32/i8", "value": 1635} ] }, "mark": "bar", "encoding": { "x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}}, "y": {"field": "value", "type": "quantitative", "axis": {"title": "Performance (Gop/s)"}, "scale": {"type": "linear"}}, "xOffset": {"field": "group", "sort": "none"}, "color": {"field": "group", "title": "Operation"} }, "layer": [{ "mark": "bar" }, { "mark": { "type": "text", "align": "center", "baseline": "middle", "dy": -10 }, "encoding": { "text": {"field": "value", "type": "quantitative", "sort": "none"} } }] }
Peak performance on CPU (multi-core) depending on the type of operation and on the SBC (higher is better).

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "description": "Peak performance on CPU (mono-core) depending on the type of operation and on the SBC.", "width": 1300, "height": 300, "data": { "values": [ {"category":"tx2-A57", "group": "FMA f64", "value": 8}, {"category":"tx2-A57", "group": "FMA f32", "value": 16}, {"category":"tx2-A57", "group": "FMA f16", "value": 0}, {"category":"tx2-A57", "group": "DPA2 f32/bf16", "value": 0}, {"category":"tx2-A57", "group": "DPA4 i32/i8", "value": 0}, {"category":"tx2-Denver", "group": "FMA f64", "value": 8}, {"category":"tx2-Denver", "group": "FMA f32", "value": 15}, {"category":"xagx-Carmel", "group": "FMA f64", "value": 17}, {"category":"xagx-Carmel", "group": "FMA f32", "value": 33}, {"category":"xagx-Carmel", "group": "FMA f16", "value": 66}, {"category":"xnx-Carmel", "group": "FMA f64", "value": 14}, {"category":"xnx-Carmel", "group": "FMA f32", "value": 28}, {"category":"xnx-Carmel", "group": "FMA f16", "value": 56}, {"category":"m1u-Icestorm", "group": "FMA f64", "value": 16}, {"category":"m1u-Icestorm", "group": "FMA f32", "value": 33}, {"category":"m1u-Icestorm", "group": "FMA f16", "value": 66}, {"category":"m1u-Icestorm", "group": "DPA4 i32/i8", "value": 132}, {"category":"m1u-Firestorm", "group": "FMA f64", "value": 52}, {"category":"m1u-Firestorm", "group": "FMA f32", "value": 103}, {"category":"m1u-Firestorm", "group": "FMA f16", "value": 206}, {"category":"m1u-Firestorm", "group": "DPA4 i32/i8", "value": 412}, {"category":"onx-A78", "group": "FMA f64", "value": 16}, {"category":"onx-A78", "group": "FMA f32", "value": 32}, {"category":"onx-A78", "group": "FMA f16", "value": 64}, {"category":"onx-A78", "group": "DPA4 i32/i8", "value": 127}, {"category":"oagx-A78", "group": "FMA f64", "value": 18}, {"category":"oagx-A78", "group": "FMA f32", "value": 35}, {"category":"oagx-A78", "group": "FMA f16", "value": 70}, {"category":"oagx-A78", "group": "DPA4 i32/i8", "value": 140}, 
{"category":"em780-7840u", "group": "FMA f64", "value": 62}, {"category":"em780-7840u", "group": "FMA f32", "value": 124}, {"category":"em780-7840u", "group": "DPA2 f32/bf16", "value": 248}, {"category":"em780-7840u", "group": "DPA4 i32/i8", "value": 497}, {"category":"x7ti-lpe", "group": "FMA f64", "value": 19}, {"category":"x7ti-lpe", "group": "FMA f32", "value": 40}, {"category":"x7ti-lpe", "group": "DPA2 f32/bf16", "value": 80}, {"category":"x7ti-lpe", "group": "DPA4 i32/i8", "value": 160}, {"category":"x7ti-e", "group": "FMA f64", "value": 30}, {"category":"x7ti-e", "group": "FMA f32", "value": 60}, {"category":"x7ti-e", "group": "DPA2 f32/bf16", "value": 120}, {"category":"x7ti-e", "group": "DPA4 i32/i8", "value": 242}, {"category":"x7ti-p", "group": "FMA f64", "value": 56}, {"category":"x7ti-p", "group": "FMA f32", "value": 113}, {"category":"x7ti-p", "group": "DPA2 f32/bf16", "value": 303}, {"category":"x7ti-p", "group": "DPA4 i32/i8", "value": 600} ] }, "mark": "bar", "encoding": { "x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}}, "y": {"field": "value", "type": "quantitative", "axis": {"title": "Performance (Gop/s)"}, "scale": {"type": "linear"}}, "xOffset": {"field": "group", "sort": "none"}, "color": {"field": "group", "title": "Operation"} }, "layer": [{ "mark": "bar" }, { "mark": { "type": "text", "align": "center", "baseline": "middle", "dy": -10 }, "encoding": { "text": {"field": "value", "type": "quantitative", "sort": "none"} } }] }
Peak performance on CPU (mono-core) depending on the type of operation and on the SBC (higher is better).

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "description": "Peak performance on CPU depending on the type of operation and on the SBC.", "width": 1300, "height": 300, "data": { "values": [ {"category":"tx2-A57-4t", "group": "FMA f64", "value": 33}, {"category":"tx2-A57-4t", "group": "FMA f32", "value": 65}, {"category":"tx2-A57-4t", "group": "FMA f16", "value": 0}, {"category":"tx2-A57-4t", "group": "DPA2 f32/bf16", "value": 0}, {"category":"tx2-A57-4t", "group": "DPA4 i32/i8", "value": 0}, {"category":"tx2-Denver-2t", "group": "FMA f64", "value": 16}, {"category":"tx2-Denver-2t", "group": "FMA f32", "value": 31}, {"category":"xagx-Carmel-8t", "group": "FMA f64", "value": 133}, {"category":"xagx-Carmel-8t", "group": "FMA f32", "value": 264}, {"category":"xagx-Carmel-8t", "group": "FMA f16", "value": 530}, {"category":"xnx-Carmel-6t", "group": "FMA f64", "value": 84}, {"category":"xnx-Carmel-6t", "group": "FMA f32", "value": 167}, {"category":"xnx-Carmel-6t", "group": "FMA f16", "value": 334}, {"category":"m1u-Icestorm-4t", "group": "FMA f64", "value": 66}, {"category":"m1u-Icestorm-4t", "group": "FMA f32", "value": 132}, {"category":"m1u-Icestorm-4t", "group": "FMA f16", "value": 263}, {"category":"m1u-Icestorm-4t", "group": "DPA4 i32/i8", "value": 527}, {"category":"m1u-Firestorm-16t", "group": "FMA f64", "value": 775}, {"category":"m1u-Firestorm-16t", "group": "FMA f32", "value": 1551}, {"category":"m1u-Firestorm-16t", "group": "FMA f16", "value": 3102}, {"category":"m1u-Firestorm-16t", "group": "DPA4 i32/i8", "value": 6201}, {"category":"onx-A78-8t", "group": "FMA f64", "value": 125}, {"category":"onx-A78-8t", "group": "FMA f32", "value": 252}, {"category":"onx-A78-8t", "group": "FMA f16", "value": 504}, {"category":"onx-A78-8t", "group": "DPA4 i32/i8", "value": 1010}, {"category":"oagx-A78-12t", "group": "FMA f64", "value": 210}, {"category":"oagx-A78-12t", "group": "FMA f32", "value": 421}, {"category":"oagx-A78-12t", "group": "FMA f16", 
"value": 842}, {"category":"oagx-A78-12t", "group": "DPA4 i32/i8", "value": 1684}, {"category":"em780-7840u-8t", "group": "FMA f64", "value": 442}, {"category":"em780-7840u-8t", "group": "FMA f32", "value": 872}, {"category":"em780-7840u-8t", "group": "DPA2 f32/bf16", "value": 1854}, {"category":"em780-7840u-8t", "group": "DPA4 i32/i8", "value": 3560}, {"category":"x7ti-lpe-2t", "group": "FMA f64", "value": 40}, {"category":"x7ti-lpe-2t", "group": "FMA f32", "value": 79}, {"category":"x7ti-lpe-2t", "group": "DPA2 f32/bf16", "value": 160}, {"category":"x7ti-lpe-2t", "group": "DPA4 i32/i8", "value": 319}, {"category":"x7ti-e-8t", "group": "FMA f64", "value": 210}, {"category":"x7ti-e-8t", "group": "FMA f32", "value": 420}, {"category":"x7ti-e-8t", "group": "DPA2 f32/bf16", "value": 841}, {"category":"x7ti-e-8t", "group": "DPA4 i32/i8", "value": 1683}, {"category":"x7ti-p-6t", "group": "FMA f64", "value": 430}, {"category":"x7ti-p-6t", "group": "FMA f32", "value": 853}, {"category":"x7ti-p-6t", "group": "DPA2 f32/bf16", "value": 1704}, {"category":"x7ti-p-6t", "group": "DPA4 i32/i8", "value": 3430} ] }, "mark": "bar", "encoding": { "x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}}, "y": {"field": "value", "type": "quantitative", "axis": {"title": "Performance (Gop/s)"}, "scale": {"type": "linear"}}, "xOffset": {"field": "group", "sort": "none"}, "color": {"field": "group", "title": "Operation"} }, "layer": [{ "mark": "bar" }, { "mark": { "type": "text", "align": "center", "baseline": "middle", "dy": -10 }, "encoding": { "text": {"field": "value", "type": "quantitative", "sort": "none"} } }] }
Peak performance on CPU (multi-core) depending on the type of operation and on the SBC (higher is better).

GPU Memory Bandwidth

Measurement of the memory bandwidth between the GPU and its global memory with the clpeak benchmark. On the Nvidia Jetson platforms, PoCL has been installed to enable OpenCL support (see the PoCL Installation on Jetson section).

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "description": "GPU memory bandwidth depending on the SBC.", "width": 1500, "height": 300, "data": { "values": [ {"category": "xu4-Mali-T628-MP6", "group": "float32x1", "value": 5.2}, {"category": "xu4-Mali-T628-MP6", "group": "float32x2", "value": 6.9}, {"category": "xu4-Mali-T628-MP6", "group": "float32x4", "value": 7.0}, {"category": "xu4-Mali-T628-MP6", "group": "float32x8", "value": 6.9}, {"category": "tx2-Pascal-2SMX", "group": "float32x1", "value": 37}, {"category": "tx2-Pascal-2SMX", "group": "float32x2", "value": 46}, {"category": "tx2-Pascal-2SMX", "group": "float32x4", "value": 46}, {"category": "tx2-Pascal-2SMX", "group": "float32x8", "value": 34}, {"category": "xagx-Volta-8SMX", "group": "float32x1", "value": 110}, {"category": "xagx-Volta-8SMX", "group": "float32x2", "value": 109}, {"category": "xagx-Volta-8SMX", "group": "float32x4", "value": 109}, {"category": "xagx-Volta-8SMX", "group": "float32x8", "value": 91}, {"category": "xnano-Maxwell-1SMX", "group": "float32x1", "value": 18}, {"category": "xnano-Maxwell-1SMX", "group": "float32x2", "value": 21}, {"category": "xnano-Maxwell-1SMX", "group": "float32x4", "value": 21}, {"category": "xnano-Maxwell-1SMX", "group": "float32x8", "value": 20}, {"category": "xnx-Volta-6SMX", "group": "float32x1", "value": 47}, {"category": "xnx-Volta-6SMX", "group": "float32x2", "value": 49}, {"category": "xnx-Volta-6SMX", "group": "float32x4", "value": 49}, {"category": "xnx-Volta-6SMX", "group": "float32x8", "value": 44}, {"category": "m1u-macos-48c", "group": "float32x1", "value": 699}, {"category": "m1u-macos-48c", "group": "float32x2", "value": 717}, {"category": "m1u-macos-48c", "group": "float32x4", "value": 729}, {"category": "m1u-macos-48c", "group": "float32x8", "value": 703}, {"category": "m1u-linux-48c", "group": "float32x1", "value": 500}, {"category": "m1u-linux-48c", "group": "float32x2", "value": 514}, {"category": "m1u-linux-48c", 
"group": "float32x4", "value": 523}, {"category": "m1u-linux-48c", "group": "float32x8", "value": 524}, {"category": "vim1s-Mali-G31-MP2", "group": "float32x1", "value": 3.5}, {"category": "vim1s-Mali-G31-MP2", "group": "float32x2", "value": 4.3}, {"category": "vim1s-Mali-G31-MP2", "group": "float32x4", "value": 4.2}, {"category": "vim1s-Mali-G31-MP2", "group": "float32x8", "value": 3.5}, {"category": "onx-Ampere-8SMX", "group": "float32x1", "value": 87}, {"category": "onx-Ampere-8SMX", "group": "float32x2", "value": 94}, {"category": "onx-Ampere-8SMX", "group": "float32x4", "value": 94}, {"category": "onx-Ampere-8SMX", "group": "float32x8", "value": 94}, {"category": "oagx-Ampere-16SMX", "group": "float32x1", "value": 174}, {"category": "oagx-Ampere-16SMX", "group": "float32x2", "value": 178}, {"category": "oagx-Ampere-16SMX", "group": "float32x4", "value": 179}, {"category": "oagx-Ampere-16SMX", "group": "float32x8", "value": 180}, {"category": "onano-Ampere-8SMX", "group": "float32x1", "value": 63}, {"category": "onano-Ampere-8SMX", "group": "float32x2", "value": 64}, {"category": "onano-Ampere-8SMX", "group": "float32x4", "value": 64}, {"category": "onano-Ampere-8SMX", "group": "float32x8", "value": 64}, {"category": "opi5-Mali-G610-MP4", "group": "float32x1", "value": 24}, {"category": "opi5-Mali-G610-MP4", "group": "float32x2", "value": 26}, {"category": "opi5-Mali-G610-MP4", "group": "float32x4", "value": 26}, {"category": "opi5-Mali-G610-MP4", "group": "float32x8", "value": 20}, {"category": "em780-Radeon-780M", "group": "float32x1", "value": 72}, {"category": "em780-Radeon-780M", "group": "float32x2", "value": 76}, {"category": "em780-Radeon-780M", "group": "float32x4", "value": 79}, {"category": "em780-Radeon-780M", "group": "float32x8", "value": 80}, {"category": "x7ti-Alchemist-8Xe", "group": "float32x1", "value": 73}, {"category": "x7ti-Alchemist-8Xe", "group": "float32x2", "value": 74}, {"category": "x7ti-Alchemist-8Xe", "group": "float32x4", "value": 
75}, {"category": "x7ti-Alchemist-8Xe", "group": "float32x8", "value": 78} ] }, "mark": "bar", "encoding": { "x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}}, "y": {"field": "value", "type": "quantitative", "axis": {"title": "Throughput (GB/s)"}, "scale": {"type": "linear"}}, "xOffset": {"field": "group", "sort": "none"}, "color": {"field": "group", "title": "Datatype"} }, "layer": [{ "mark": "bar" }, { "mark": { "type": "text", "align": "center", "baseline": "middle", "dy": -10 }, "encoding": { "text": {"field": "value", "type": "quantitative", "sort": "none"} } }] }
GPU memory bandwidth depending on the SBC (higher is better).

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "description": "GPU memory bandwidth depending on the SBC.", "width": 500, "height": 300, "data": { "values": [ {"category": "GeForce-RTX-3090", "group": "float32x 1", "value": 817}, {"category": "GeForce-RTX-3090", "group": "float32x 2", "value": 842}, {"category": "GeForce-RTX-3090", "group": "float32x 4", "value": 856}, {"category": "GeForce-RTX-3090", "group": "float32x 8", "value": 788}, {"category": "GeForce-RTX-3090", "group": "float32x16", "value": 845}, {"category": "Radeon-RX-7900-XTX", "group": "float32x 1", "value": 601}, {"category": "Radeon-RX-7900-XTX", "group": "float32x 2", "value": 623}, {"category": "Radeon-RX-7900-XTX", "group": "float32x 4", "value": 642}, {"category": "Radeon-RX-7900-XTX", "group": "float32x 8", "value": 666}, {"category": "Radeon-RX-7900-XTX", "group": "float32x16", "value": 683} ] }, "mark": "bar", "encoding": { "x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}}, "y": {"field": "value", "type": "quantitative", "axis": {"title": "Throughput (GB/s)"}, "scale": {"type": "linear"}}, "xOffset": {"field": "group", "sort": "none"}, "color": {"field": "group", "title": "Datatype"} }, "layer": [{ "mark": "bar" }, { "mark": { "type": "text", "align": "center", "baseline": "middle", "dy": -10 }, "encoding": { "text": {"field": "value", "type": "quantitative", "sort": "none"} } }] }
GPU memory bandwidth depending on the SBC (higher is better).

GPU Peak Performance

Measurement of the GPU peak performance with the clpeak benchmark, an OpenCL tool that executes a compute-intensive kernel to estimate peak performance. On the Nvidia Jetson platforms, PoCL has been installed to enable OpenCL support (see the PoCL Installation on Jetson section).
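The measured peaks can be sanity-checked against the theoretical value: compute units × FP32 lanes per unit × 2 (an FMA counts as two flops) × clock. For example, for the Xavier AGX Volta iGPU, assuming 8 SMs, 64 FP32 lanes per SM and a ~1.377 GHz max clock (figures taken from public specs, not from this document):

```python
def peak_gflops(units, lanes, clock_ghz, flops_per_lane=2):
    """Theoretical peak: units * lanes * flops/cycle/lane * clock (GHz) -> Gflop/s."""
    return units * lanes * flops_per_lane * clock_ghz

# Xavier AGX iGPU: 8 SMs x 64 FP32 lanes x 2 (FMA) x ~1.377 GHz ~= 1410 Gflop/s
xagx_peak = peak_gflops(8, 64, 1.377)
```

This lands close to the ~1400 Gflop/s f32 value measured with clpeak, which suggests the benchmark reaches near-peak on that board.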

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "description": "GPU peak performance (32-bit float) depending on the SBC.", "width": 1300, "height": 300, "data": { "values": [ {"category": "xu4-Mali-T628-MP6", "group": "float64", "value": 26}, {"category": "xu4-Mali-T628-MP6", "group": "float32", "value": 56}, {"category": "xu4-Mali-T628-MP6", "group": "float16", "value": 114}, {"category": "tx2-Pascal-2SMX", "group": "float64", "value": 21}, {"category": "tx2-Pascal-2SMX", "group": "float32", "value": 652}, {"category": "tx2-Pascal-2SMX", "group": "float16", "value": 0}, {"category": "xagx-Volta-8SMX", "group": "float64", "value": 44}, {"category": "xagx-Volta-8SMX", "group": "float32", "value": 1404}, {"category": "xagx-Volta-8SMX", "group": "float16", "value": 0}, {"category": "xnano-Maxwell-1SMX", "group": "float64", "value": 7}, {"category": "xnano-Maxwell-1SMX", "group": "float32", "value": 230}, {"category": "xnano-Maxwell-1SMX", "group": "float16", "value": 0}, {"category": "xnx-Volta-6SMX", "group": "float64", "value": 27}, {"category": "xnx-Volta-6SMX", "group": "float32", "value": 847}, {"category": "xnx-Volta-6SMX", "group": "float16", "value": 0}, {"category": "m1u-macos-48c", "group": "float64", "value": 0}, {"category": "m1u-macos-48c", "group": "float32", "value": 7706}, {"category": "m1u-macos-48c", "group": "float16", "value": 0}, {"category": "m1u-linux-48c", "group": "float64", "value": 0}, {"category": "m1u-linux-48c", "group": "float32", "value": 7120}, {"category": "m1u-linux-48c", "group": "float16", "value": 6145}, {"category": "vim1s-Mali-G31-MP2", "group": "float64", "value": 0}, {"category": "vim1s-Mali-G31-MP2", "group": "float32", "value": 13}, {"category": "vim1s-Mali-G31-MP2", "group": "float16", "value": 27}, {"category": "onx-Ampere-8SMX", "group": "float64", "value": 30}, {"category": "onx-Ampere-8SMX", "group": "float32", "value": 1844}, {"category": "onx-Ampere-8SMX", "group": "float16", "value": 3520}, 
{"category": "oagx-Ampere-16SMX", "group": "float64", "value": 83}, {"category": "oagx-Ampere-16SMX", "group": "float32", "value": 5211}, {"category": "oagx-Ampere-16SMX", "group": "float16", "value": 9957}, {"category": "onano-Ampere-8SMX", "group": "float64", "value": 20}, {"category": "onano-Ampere-8SMX", "group": "float32", "value": 1255}, {"category": "onano-Ampere-8SMX", "group": "float16", "value": 2397}, {"category": "opi5-Mali-G610-MP4", "group": "float64", "value": 0}, {"category": "opi5-Mali-G610-MP4", "group": "float32", "value": 474}, {"category": "opi5-Mali-G610-MP4", "group": "float16", "value": 917}, {"category": "em780-Radeon-780M", "group": "float64", "value": 86}, {"category": "em780-Radeon-780M", "group": "float32", "value": 2522}, {"category": "em780-Radeon-780M", "group": "float16", "value": 4574}, {"category": "x7ti-Alchemist-8Xe", "group": "float64", "value": 149}, {"category": "x7ti-Alchemist-8Xe", "group": "float32", "value": 4774}, {"category": "x7ti-Alchemist-8Xe", "group": "float16", "value": 9473} ] }, "mark": "bar", "encoding": { "x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}}, "y": {"field": "value", "type": "quantitative", "axis": {"title": "Peak Performance (Gflop/s)"}, "scale": {"type": "linear"}}, "xOffset": {"field": "group", "sort": "none"}, "color": {"field": "group", "title": "Datatype"} }, "layer": [{ "mark": "bar" }, { "mark": { "type": "text", "align": "center", "baseline": "middle", "dy": -10 }, "encoding": { "text": {"field": "value", "type": "quantitative", "sort": "none"} } }] }
GPU peak performance depending on the SBC (higher is better).

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "description": "GPU peak performance (32-bit float) depending on the SBC.", "width": 500, "height": 300, "data": { "values": [ {"category": "GeForce-RTX-3090", "group": "float64", "value": 629}, {"category": "GeForce-RTX-3090", "group": "float32", "value": 36038}, {"category": "GeForce-RTX-3090", "group": "float16", "value": 39636}, {"category": "Radeon-RX-7900-XTX", "group": "float64", "value": 907}, {"category": "Radeon-RX-7900-XTX", "group": "float32", "value": 23952}, {"category": "Radeon-RX-7900-XTX", "group": "float16", "value": 40445} ] }, "mark": "bar", "encoding": { "x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}}, "y": {"field": "value", "type": "quantitative", "axis": {"title": "Peak Performance (Gflop/s)"}, "scale": {"type": "linear"}}, "xOffset": {"field": "group", "sort": "none"}, "color": {"field": "group", "title": "Datatype"} }, "layer": [{ "mark": "bar" }, { "mark": { "type": "text", "align": "center", "baseline": "middle", "dy": -10 }, "encoding": { "text": {"field": "value", "type": "quantitative", "sort": "none"} } }] }
GPU peak performance depending on the SBC (higher is better).

Compute Intensive n-body Code

MUrB is an \(n\)-body code simulating Newtonian gravitational equations. This type of code is known to be mostly compute-bound because it performs \(O(n^2)\) computations on \(n\) data. The CPU code is vectorized thanks to the MIPP SIMD wrapper and multi-threaded with OpenMP (all the available cores are used for the benchmark). On GPU, both OpenCL and CUDA implementations are evaluated. In all cases, the computations are performed using the float32 datatype.
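The \(O(n^2)\) structure is visible in the force-accumulation loop. Below is a minimal Python sketch of one step of pairwise Newtonian accelerations with softening; this is illustrative only, not MUrB's actual kernels, which are SIMD/GPU code built around reciprocal-square-root variants (the rsqrt suffixes in the command lines below).

```python
def accelerations(pos, mass, G=6.674e-11, soft=1e-9):
    """O(n^2) pairwise Newtonian accelerations with a softening term."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        xi, yi, zi = pos[i]
        for j in range(n):
            if i == j:
                continue
            dx, dy, dz = pos[j][0] - xi, pos[j][1] - yi, pos[j][2] - zi
            inv_r3 = (dx * dx + dy * dy + dz * dz + soft) ** -1.5
            s = G * mass[j] * inv_r3
            acc[i][0] += s * dx
            acc[i][1] += s * dy
            acc[i][2] += s * dz
    return acc
```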

The following command line is used:

murb --nv -i 100 -n 20000 --gf --im cpu+simd # on CPU with SIMD (MIPP wrapper)

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "description": "MUrB: achieved performance on CPU depending on the SBC.", "width": 500, "height": 300, "data": { "values": [ {"category": "xu4", "group": "MIPP", "value": 11}, {"category": "rpi3", "group": "MIPP", "value": 10}, {"category": "tx2", "group": "MIPP", "value": 53}, {"category": "xagx", "group": "MIPP", "value": 151}, {"category": "xnano", "group": "MIPP", "value": 20}, {"category": "rpi4", "group": "MIPP", "value": 22}, {"category": "xnx", "group": "MIPP", "value": 95}, {"category": "m1u", "group": "MIPP", "value": 838}, {"category": "vim1s", "group": "MIPP", "value": 10}, {"category": "onx", "group": "MIPP", "value": 103}, {"category": "oagx", "group": "MIPP", "value": 172}, {"category": "onano", "group": "MIPP", "value": 60}, {"category": "opi5", "group": "MIPP", "value": 80}, {"category": "rpi5", "group": "MIPP", "value": 58}, {"category": "em780", "group": "MIPP", "value": 633}, {"category": "x7ti", "group": "MIPP", "value": 627} ] }, "mark": "bar", "encoding": { "x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}}, "y": {"field": "value", "type": "quantitative", "axis": {"title": "MUrB Performance (Gflop/s)"}, "scale": {"type": "linear"}}, "xOffset": {"field": "group", "sort": "none"}, "color": {"field": "group", "title": "API"} }, "layer": [{ "mark": "bar" }, { "mark": { "type": "text", "align": "center", "baseline": "middle", "dy": -10 }, "encoding": { "text": {"field": "value", "type": "quantitative", "sort": "none"} } }] }
MUrB: Achieved performance on CPU depending on the SBC (higher is better).

The following command lines are used:

murb --nv -i 1500 -n 30000 --gf --im cuda+rsqrt4 # on GPU with CUDA API
murb --nv -i 1500 -n 30000 --gf --im ocl+rsqrt4  # on GPU with OpenCL API

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "description": "MUrB: achieved performance on GPU depending on the SBC.", "width": 1000, "height": 300, "data": { "values": [ {"category":"xu4-Mali-T628-MP6", "group": "OCL", "value": 13}, {"category":"xu4-Mali-T628-MP6", "group": "CUDA", "value": null}, {"category":"tx2-Pascal-2SMX", "group": "OCL", "value": 274}, {"category":"tx2-Pascal-2SMX", "group": "CUDA", "value": 276}, {"category":"xagx-Volta-8SMX", "group": "OCL", "value": 735}, {"category":"xagx-Volta-8SMX", "group": "CUDA", "value": 736}, {"category":"xnano-Maxwell-1SMX", "group": "OCL", "value": 109}, {"category":"xnano-Maxwell-1SMX", "group": "CUDA", "value": 108}, {"category":"xnx-Volta-6SMX", "group": "OCL", "value": 446}, {"category":"xnx-Volta-6SMX", "group": "CUDA", "value": 454}, {"category":"m1u-macos-48c", "group": "OCL", "value": 2104}, {"category":"m1u-linux-48c", "group": "OCL", "value": 1558}, {"category":"vim1s-Mali-G31-MP2", "group": "OCL", "value": 8}, {"category":"onx-Ampere-8SMX", "group": "OCL", "value": 594}, {"category":"onx-Ampere-8SMX", "group": "CUDA", "value": 595}, {"category":"oagx-Ampere-16SMX", "group": "OCL", "value": 1572}, {"category":"oagx-Ampere-16SMX", "group": "CUDA", "value": 1629}, {"category":"onano-Ampere-8SMX", "group": "OCL", "value": 423}, {"category":"onano-Ampere-8SMX", "group": "CUDA", "value": 437}, {"category":"opi5-Mali-G610-MP4", "group": "OCL", "value": 148}, {"category":"opi5-Mali-G610-MP4", "group": "CUDA", "value": null}, {"category":"em780-Radeon-780M", "group": "OCL", "value": 1292}, {"category":"em780-Radeon-780M", "group": "CUDA", "value": null}, {"category":"x7ti-Alchemist-8Xe", "group": "OCL", "value": 2143} ] }, "mark": "bar", "encoding": { "x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}}, "y": {"field": "value", "type": "quantitative", "axis": {"title": "MUrB Performance (Gflop/s)"}, "scale": {"type": 
"linear"}}, "xOffset": {"field": "group"}, "color": {"field": "group", "title": "API"} }, "layer": [{ "mark": "bar" }, { "mark": { "type": "text", "align": "center", "baseline": "middle", "dy": -10 }, "encoding": { "text": {"field": "value", "type": "quantitative"} } }] }
MUrB: Achieved performance on GPU depending on the SBC (higher is better).
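As a rough sanity check on the Gflop/s figures, the achieved performance of an all-pairs n-body step can be estimated from the run time. This is a hedged sketch: the 20 flops per body-body interaction is a common n-body accounting convention, not necessarily MUrB's exact internal count, and the elapsed time below is a hypothetical measurement.

```python
# Hedged sketch: estimate all-pairs n-body Gflop/s from a measured
# run time. The 20 flops/interaction figure is an assumption (a usual
# n-body convention); MUrB's internal accounting may differ.

def nbody_gflops(n_bodies, n_iters, elapsed_s, flops_per_interaction=20):
    """All-pairs n-body: n^2 body-body interactions per iteration."""
    total_flops = flops_per_interaction * n_bodies**2 * n_iters
    return total_flops / elapsed_s / 1e9

# With the chart's settings (-i 1500 -n 30000) and a hypothetical
# elapsed time of 60 s:
print(round(nbody_gflops(30_000, 1_500, 60.0), 1))  # -> 450.0
```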

For the Nvidia GeForce RTX 3090, the code is run with the following command lines:

murb --nv -i 750 -n 200000 --gf --im cuda+locu2 --wg 32 # for CUDA   API
murb --nv -i 750 -n 200000 --gf --im ocl+locu2  --wg 32 # for OpenCL API

For the AMD Radeon RX 7900 XTX, the code is run with the following command line:

murb --nv -i 750 -n 200000 --gf --im ocl+rsqrt2 --wg 32

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "description": "MUrB: achieved performance depending on the GPU.", "width": 300, "height": 300, "data": { "values": [ {"category":"GeForce-RTX-3090", "group": "OCL", "value": 12878}, {"category":"GeForce-RTX-3090", "group": "CUDA", "value": 12513}, {"category":"Radeon-RX-7900-XTX", "group": "OCL", "value": 19692} ] }, "mark": "bar", "encoding": { "x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "GPU", "labelAngle": -65}}, "y": {"field": "value", "type": "quantitative", "axis": {"title": "MUrB Performance (Gflop/s)"}, "scale": {"type": "linear"}}, "xOffset": {"field": "group"}, "color": {"field": "group", "title": "API"} }, "layer": [{ "mark": "bar" }, { "mark": { "type": "text", "align": "center", "baseline": "middle", "dy": -10 }, "encoding": { "text": {"field": "value", "type": "quantitative"} } }] }
MUrB: Achieved performance on high-end desktop GPUs (higher is better).

Fast Meteor Detection Toolbox

FMDT is an application that detects moving meteors in the sky. The most optimized version of FMDT is executed, on Full HD frames and with a {1, 4, 1} pipeline. In total, 6 threads are active, 4 of which are heavily loaded. The application relies on the StreamPU multi-threading runtime and on the FLSL labeling algorithm (CPU-only code).
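The {1, 4, 1} decomposition replicates the compute-heavy middle stage on 4 threads while each I/O stage gets 1 thread (6 threads in total). In such a replicated pipeline, the sustained frame rate is bounded by the slowest stage once its per-frame time is divided by its replica count. A minimal model, with purely illustrative per-frame stage times (not FMDT measurements):

```python
# Hedged model of a replicated 3-stage pipeline like FMDT's {1,4,1}:
# the steady-state rate is set by the slowest stage after dividing
# each stage's per-frame time by its replica count. The stage times
# below are made up for illustration.

def pipeline_fps(stage_ms, replicas):
    """Steady-state frames/s of a pipeline with replicated stages."""
    bottleneck_ms = max(t / r for t, r in zip(stage_ms, replicas))
    return 1000.0 / bottleneck_ms

# Middle stage 4x heavier, but replicated on 4 threads: all three
# stages become equally fast, so no stage starves the others.
print(round(pipeline_fps([1.0, 4.0, 1.0], [1, 4, 1]), 1))  # -> 1000.0
```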

The following command line is used:

fmdt-detect-rt-opt-pip --vid-in-path ../2022_05_31_tauh_34_meteors.mp4 --vid-in-buff --vid-in-loop 30 --rt-stats --ccl-impl LSLM --pip-threads '[1,4,1]'

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "description": "FMDT: achieved number of FPS on the SBC.", "width": 500, "height": 300, "data": { "values": [ {"category": "xu4", "group": "SPU{1,4,1}+FLSL", "value": 237}, {"category": "rpi3", "group": "SPU{1,4,1}+FLSL", "value": 82}, {"category": "tx2", "group": "SPU{1,4,1}+FLSL", "value": 719}, {"category": "xagx", "group": "SPU{1,4,1}+FLSL", "value": 2033}, {"category": "xnano", "group": "SPU{1,4,1}+FLSL", "value": 356}, {"category": "rpi4", "group": "SPU{1,4,1}+FLSL", "value": 152}, {"category": "xnx", "group": "SPU{1,4,1}+FLSL", "value": 1563}, {"category": "m1u", "group": "SPU{1,4,1}+FLSL", "value": 5714}, {"category": "vim1s", "group": "SPU{1,4,1}+FLSL", "value": 230}, {"category": "onx", "group": "SPU{1,4,1}+FLSL", "value": 1430}, {"category": "oagx", "group": "SPU{1,4,1}+FLSL", "value": 1812}, {"category": "onano", "group": "SPU{1,4,1}+FLSL", "value": 1234}, {"category": "opi5", "group": "SPU{1,4,1}+FLSL", "value": 873}, {"category": "rpi5", "group": "SPU{1,4,1}+FLSL", "value": 187}, {"category": "em780", "group": "SPU{1,4,1}+FLSL", "value": 3118}, {"category": "x7ti", "group": "SPU{1,4,1}+FLSL", "value": 4235} ] }, "mark": "bar", "encoding": { "x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}}, "y": {"field": "value", "type": "quantitative", "axis": {"title": "Frames Per Second (FPS)"}, "scale": {"type": "linear"}}, "xOffset": {"field": "group"}, "color": {"field": "group", "title": "Version"} }, "layer": [{ "mark": "bar" }, { "mark": { "type": "text", "align": "center", "baseline": "middle", "dy": -10 }, "encoding": { "text": {"field": "value", "type": "quantitative"} } }] }

FMDT: Achieved number of FPS on the SBC (higher is better).

A Fast Forward Error Correction Toolbox

AFF3CT is a software toolbox dedicated to Forward Error Correction (channel coding) simulations, for instance simulating the physical layer of digital communication standards such as 5G.

The following command line is used:

aff3ct -p 8 --sim-type BFER -m 4.5 -M 4.5 -C POLAR -K 1755 -N 2048 --src-type AZCW --crc-type 32-GZIP --crc-implem FAST --enc-fb-gen-method GA --chn-type AWGN --chn-implem FAST --qnt-type POW2 --qnt-implem FAST --qnt-bits 6 --qnt-dec 1 --dec-type ASCL --dec-implem FAST --dec-simd INTRA -L 32 --dec-polar-nodes '{R0,R0L,R1,REP_2-8,REPL,SPC_4}' --sim-stop-time 60
It is a simulation of a (2048,1755) Polar code decoded with an ASCL decoder (see https://aff3ct.github.io/#performances for more details). The reported metric is the information throughput in Mb/s. The code uses all the CPU cores available on the node (thanks to the StreamPU runtime) and is vectorized using the MIPP SIMD wrapper.
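The information throughput counts only the K information bits of each decoded frame, not the N coded bits. A small sketch of that accounting, where the decoded-frames-per-second figure is a hypothetical measurement:

```python
# Hedged sketch: information throughput of an (N, K) code simulation.
# Only the K information bits per frame count toward the metric.
# `frames_per_s` is a hypothetical measured decoding rate.

K, N = 1755, 2048   # Polar code dimensions from the command line
rate = K / N        # code rate R = K/N

def info_throughput_mbps(frames_per_s, k=K):
    return frames_per_s * k / 1e6   # information bits/s -> Mb/s

print(round(rate, 3))                        # -> 0.857
print(round(info_throughput_mbps(200_000)))  # -> 351
```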

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "description": "AFF3CT: achieved information throughput depending on the SBCs.", "width": 500, "height": 300, "data": { "values": [ {"category": "xu4", "group": "Polar-ASCL+SPU+MIPP", "value": 122}, {"category": "rpi3", "group": "Polar-ASCL+SPU+MIPP", "value": 47}, {"category": "tx2", "group": "Polar-ASCL+SPU+MIPP", "value": 195}, {"category": "xagx", "group": "Polar-ASCL+SPU+MIPP", "value": 465}, {"category": "xnano", "group": "Polar-ASCL+SPU+MIPP", "value": 78}, {"category": "rpi4", "group": "Polar-ASCL+SPU+MIPP", "value": 92}, {"category": "xnx", "group": "Polar-ASCL+SPU+MIPP", "value": 326}, {"category": "m1u", "group": "Polar-ASCL+SPU+MIPP", "value": 3196}, {"category": "vim1s", "group": "Polar-ASCL+SPU+MIPP", "value": 55}, {"category": "onx", "group": "Polar-ASCL+SPU+MIPP", "value": 550}, {"category": "oagx", "group": "Polar-ASCL+SPU+MIPP", "value": 907}, {"category": "onano", "group": "Polar-ASCL+SPU+MIPP", "value": 315}, {"category": "opi5", "group": "Polar-ASCL+SPU+MIPP", "value": 343}, {"category": "rpi5", "group": "Polar-ASCL+SPU+MIPP", "value": 245}, {"category": "em780", "group": "Polar-ASCL+SPU+MIPP", "value": 2626}, {"category": "x7ti", "group": "Polar-ASCL+SPU+MIPP", "value": 2415} ] }, "mark": "bar", "encoding": { "x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}}, "y": {"field": "value", "type": "quantitative", "axis": {"title": "Information Throughput (Mb/s)"}, "scale": {"type": "linear"}}, "xOffset": {"field": "group"}, "color": {"field": "group", "title": "Simulation"} }, "layer": [{ "mark": "bar" }, { "mark": { "type": "text", "align": "center", "baseline": "middle", "dy": -10 }, "encoding": { "text": {"field": "value", "type": "quantitative"} } }] }

AFF3CT: Achieved information throughput depending on the SBC (higher is better).

Summary Table

The following table summarizes the different benchmarks and apps:

| SBC | bandwidth RAM CPU triad (GB/s) | cpufp CPU f32 FMA peak (GFlop/s) | clpeak RAM GPU bandwidth (GB/s) | clpeak GPU f32 peak (GFlop/s) | MUrB CPU (GFlop/s) | MUrB GPU OpenCL (GFlop/s) | MUrB GPU CUDA (GFlop/s) | FMDT FullHD (FPS) | AFF3CT Sim. Polar (Mb/s) |
|-------|-----|------|-----|------|-----|------|------|------|------|
| xu4   | 4   |      | 7   | 56   | 11  | 13   |      | 237  | 122  |
| rpi3  | 2   | 38   |     |      | 10  |      |      | 82   | 47   |
| tx2   | 20  | 96   | 46  | 652  | 53  | 274  | 276  | 719  | 195  |
| xagx  | 64  | 264  | 109 | 1404 | 151 | 735  | 736  | 2033 | 465  |
| xnano | 9   | 47   | 21  | 230  | 20  | 109  | 108  | 356  | 78   |
| rpi4  | 5   | 48   |     |      | 22  |      |      | 152  | 92   |
| xnx   | 32  | 167  | 50  | 847  | 95  | 446  | 454  | 1563 | 326  |
| m1u   | 320 | 1654 | 729 | 7706 | 838 | 1558 |      | 5714 | 3196 |
| vim1s | 7   | 13   | 4   | 13   | 10  | 8    |      | 230  | 55   |
| onx   | 45  | 252  | 94  | 1844 | 103 | 594  | 595  | 1430 | 550  |
| oagx  | 73  | 421  | 180 | 5211 | 172 | 1572 | 1629 | 1812 | 907  |
| onano | 26  | 145  | 64  | 1255 | 60  | 423  | 437  | 1234 | 315  |
| opi5  | 20  | 200  | 26  | 474  | 80  | 148  |      | 873  | 343  |
| rpi5  | 10  | 153  |     |      | 58  |      |      | 187  | 245  |
| em780 | 62  | 872  | 83  | 2522 | 633 | 1292 |      | 3118 | 2626 |
| bpif3 | 7   | 214  |     |      |     |      |      |      |      |
| x7ti  | 73  | 1352 | 80  | 4774 | 627 | 2143 |      | 4235 | 2415 |