Synthetic Benchmarks
This section gives an overview of the performance of the compute nodes on synthetic benchmarks and on representative embedded applications.
CPU Memory Bandwidth
Measurement of the memory bandwidth between the CPU and the RAM with the triad micro-benchmark (C[i] = x * A[i] + B[i]). The bandwidth benchmark is used: it is dedicated to efficient CPU memory bandwidth measurements, in the spirit of the good old STREAM.
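For reference, the triad kernel boils down to the following loop. This is only a minimal, illustrative sketch (buffer sizes and the single timed run are assumptions of the sketch); the actual bandwidth benchmark adds warm-up runs, repetitions and careful buffer sizing.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Minimal triad sketch: C[i] = x * A[i] + B[i] (illustrative only). */
static void triad(float *restrict c, const float *restrict a,
                  const float *restrict b, float x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = x * a[i] + b[i];
}

int main(void)
{
    const size_t n = 1 << 25; /* 3 buffers of 128 MiB each: far larger than any LLC */
    float *a = malloc(n * sizeof *a);
    float *b = malloc(n * sizeof *b);
    float *c = malloc(n * sizeof *c);
    for (size_t i = 0; i < n; i++) { a[i] = 1.f; b[i] = 2.f; c[i] = 0.f; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    triad(c, a, b, 3.f, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    const double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* 2 loads + 1 store of n floats (write-allocate traffic is ignored here) */
    printf("triad: %.2f GB/s\n", 3.0 * n * sizeof(float) / sec / 1e9);
    free(a); free(b); free(c);
    return 0;
}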
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "CPU memory bandwidth depending on the SBCs",
"width": 750,
"height": 300,
"data": {
"values": [
{"category": "xu4", "group": "triad", "value": 4},
{"category": "rpi3", "group": "triad", "value": 2},
{"category": "tx2", "group": "triad", "value": 20},
{"category": "xagx", "group": "triad", "value": 64},
{"category": "xnano", "group": "triad", "value": 9},
{"category": "rpi4", "group": "triad", "value": 5},
{"category": "xnx", "group": "triad", "value": 32},
{"category": "m1u", "group": "triad", "value": 320},
{"category": "vim1s", "group": "triad", "value": 7},
{"category": "onx", "group": "triad", "value": 45},
{"category": "oagx", "group": "triad", "value": 73},
{"category": "onano", "group": "triad", "value": 26},
{"category": "opi5", "group": "triad", "value": 20},
{"category": "rpi5", "group": "triad", "value": 10},
{"category": "em780", "group": "triad", "value": 62},
{"category": "bpif3", "group": "triad", "value": 7},
{"category": "x7ti", "group": "triad", "value": 73}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "Throughput (GB/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group", "sort": "none"},
"color": {"field": "group", "title": "ubench"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative", "sort": "none"}
}
}]
}
CPU memory bandwidth depending on the SBC (higher is better).
CPU Peak Performance
Measurement of the CPU peak performance according to different operations:
FMA - Fused Multiply–Add, performs the following operation: \(d = a \times b + c\) on 64-bit, 32-bit or 16-bit floating-point numbers (referred to as f64, f32 & f16 here).
DPA4 - Performs the dot product of four 8-bit integers (i8) and accumulates the result in a 32-bit integer (i32): \(c^{i32} = c^{i32} + \sum^4_{s = 1}{ a_s^{i8} \times b_s^{i8}}\).
DPA2 - Performs the dot product of two 16-bit brain floats (bf16) and accumulates the result in a 32-bit float (f32): \(c^{f32} = c^{f32} + \sum^2_{s = 1}{ a_s^{bf16} \times b_s^{bf16}}\).
MADOT - Performs a small matrix multiplication. For instance, for RVV 1.0 256-bit + IME, MADOT i32/i8 performs \(C^{i32} = C^{i32} + A^{i8} B^{i8}\), where the \(A^{i8}\) dim is \(4 \times 8\), the \(B^{i8}\) dim is \(8 \times 4\) and the \(C^{i32}\) dim is \(4 \times 4\).
The cpufp benchmark is used. The charts below give the obtained performance depending on the targeted SBC and on the number of cores. For the multi-core results, the number of cores used is indicated by the -Nt suffix: for instance, for the Raspberry Pi 3, the label rpi3-A53-4t means that 4 cores are used. The results are split between low power SBCs (< 15 Watts) and medium power SBCs (> 15 Watts), and between single-core and multi-core runs.
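To give an idea of what is measured, the FMA test essentially times a long chain of independent multiply-add operations, so that the loop is limited by the FMA throughput rather than by its latency. The sketch below is only illustrative and is not the cpufp code (cpufp relies on architecture-specific kernels); one multiply-add is counted as 2 floating-point operations.
#include <stdio.h>
#include <time.h>

/* Illustrative single-core FMA f32 throughput loop (NOT the cpufp code).
 * Eight independent accumulator chains hide the FMA latency; with -O3 the
 * inner loop can be vectorized and contracted into FMA instructions
 * (e.g., with -ffp-contract=fast). */
int main(void)
{
    float acc[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    const float a = 0.999999f, b = 0.5f;
    const long iters = 100000000L; /* 100e6 iterations x 8 multiply-adds */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        for (int k = 0; k < 8; k++)
            acc[k] = acc[k] * a + b; /* one multiply-add per accumulator */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    const double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    float sum = 0.f;
    for (int k = 0; k < 8; k++) sum += acc[k]; /* keep the result alive */
    /* one multiply-add = 2 floating-point operations */
    printf("~%.1f Gflop/s (sum = %f)\n", 2.0 * 8 * iters / sec / 1e9, sum);
    return 0;
}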
Low Power SBCs (< 15 Watts)
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "Peak performance on CPU (mono-core) depending on the type of operation and on the SBC.",
"width": 1000,
"height": 300,
"data": {
"values": [
{"category":"rpi3-A53", "group": "FMA f64", "value": 5},
{"category":"rpi3-A53", "group": "FMA f32", "value": 9},
{"category":"rpi3-A53", "group": "FMA f16", "value": 0},
{"category":"rpi3-A53", "group": "DPA2 f32/bf16", "value": 0},
{"category":"rpi3-A53", "group": "DPA4 i32/i8", "value": 0},
{"category":"rpi3-A53", "group": "MADOT i32/i8", "value": 0},
{"category":"xnano-A57", "group": "FMA f64", "value": 6},
{"category":"xnano-A57", "group": "FMA f32", "value": 12},
{"category":"rpi4-A72", "group": "FMA f64", "value": 6},
{"category":"rpi4-A72", "group": "FMA f32", "value": 12},
{"category":"vim1s-A35", "group": "FMA f64", "value": 1.6},
{"category":"vim1s-A35", "group": "FMA f32", "value": 3.2},
{"category":"onano-A78", "group": "FMA f64", "value": 12},
{"category":"onano-A78", "group": "FMA f32", "value": 24},
{"category":"onano-A78", "group": "FMA f16", "value": 48},
{"category":"onano-A78", "group": "DPA4 i32/i8", "value": 97},
{"category":"opi5-A55", "group": "FMA f64", "value": 7},
{"category":"opi5-A55", "group": "FMA f32", "value": 14},
{"category":"opi5-A55", "group": "FMA f16", "value": 29},
{"category":"opi5-A55", "group": "DPA4 i32/i8", "value": 58},
{"category":"opi5-A76", "group": "FMA f64", "value": 18},
{"category":"opi5-A76", "group": "FMA f32", "value": 36},
{"category":"opi5-A76", "group": "FMA f16", "value": 71},
{"category":"opi5-A76", "group": "DPA4 i32/i8", "value": 143},
{"category":"rpi5-A76", "group": "FMA f64", "value": 19},
{"category":"rpi5-A76", "group": "FMA f32", "value": 38},
{"category":"rpi5-A76", "group": "FMA f16", "value": 77},
{"category":"rpi5-A76", "group": "DPA4 i32/i8", "value": 153},
{"category":"bpif3-X60", "group": "FMA f64", "value": 13},
{"category":"bpif3-X60", "group": "FMA f32", "value": 25},
{"category":"bpif3-X60", "group": "FMA f16", "value": 53},
{"category":"bpif3-X60", "group": "MADOT i32/i8", "value": 408}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "Performance (Gop/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group", "sort": "none"},
"color": {"field": "group", "title": "Operation"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative", "sort": "none"}
}
}]
}
Peak performance on CPU (single-core, low power SBCs) depending on the type of operation and on the SBC (higher is better).
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "Peak performance on CPU depending on the type of operation and on the SBC.",
"width": 1000,
"height": 300,
"data": {
"values": [
{"category":"rpi3-A53-4t", "group": "FMA f64", "value": 19},
{"category":"rpi3-A53-4t", "group": "FMA f32", "value": 38},
{"category":"rpi3-A53-4t", "group": "FMA f16", "value": 0},
{"category":"rpi3-A53-4t", "group": "DPA2 f32/bf16", "value": 0},
{"category":"rpi3-A53-4t", "group": "DPA4 i32/i8", "value": 0},
{"category":"rpi3-A53-4t", "group": "MADOT i32/i8", "value": 0},
{"category":"xnano-A57-4t", "group": "FMA f64", "value": 24},
{"category":"xnano-A57-4t", "group": "FMA f32", "value": 47},
{"category":"rpi4-A72-4t", "group": "FMA f64", "value": 24},
{"category":"rpi4-A72-4t", "group": "FMA f32", "value": 48},
{"category":"vim1s-A35", "group": "FMA f64", "value": 6.3},
{"category":"vim1s-A35", "group": "FMA f32", "value": 12.7},
{"category":"onano-A78-6t", "group": "FMA f64", "value": 72},
{"category":"onano-A78-6t", "group": "FMA f32", "value": 145},
{"category":"onano-A78-6t", "group": "FMA f16", "value": 290},
{"category":"onano-A78-6t", "group": "DPA4 i32/i8", "value": 579},
{"category":"opi5-A55-4t", "group": "FMA f64", "value": 29},
{"category":"opi5-A55-4t", "group": "FMA f32", "value": 58},
{"category":"opi5-A55-4t", "group": "FMA f16", "value": 115},
{"category":"opi5-A55-4t", "group": "DPA4 i32/i8", "value": 231},
{"category":"opi5-A76-4t", "group": "FMA f64", "value": 71},
{"category":"opi5-A76-4t", "group": "FMA f32", "value": 142},
{"category":"opi5-A76-4t", "group": "FMA f16", "value": 284},
{"category":"opi5-A76-4t", "group": "DPA4 i32/i8", "value": 568},
{"category":"rpi5-A76-4t", "group": "FMA f64", "value": 77},
{"category":"rpi5-A76-4t", "group": "FMA f32", "value": 154},
{"category":"rpi5-A76-4t", "group": "FMA f16", "value": 307},
{"category":"rpi5-A76-4t", "group": "DPA4 i32/i8", "value": 614},
{"category":"bpif3-X60-8t", "group": "FMA f64", "value": 106},
{"category":"bpif3-X60-8t", "group": "FMA f32", "value": 214},
{"category":"bpif3-X60-8t", "group": "FMA f16", "value": 426},
{"category":"bpif3-X60-8t", "group": "MADOT i32/i8", "value": 1635}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "Performance (Gop/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group", "sort": "none"},
"color": {"field": "group", "title": "Operation"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative", "sort": "none"}
}
}]
}
Peak performance on CPU (multi-core, low power SBCs) depending on the type of operation and on the SBC (higher is better).
Medium Power SBCs (> 15 Watts)
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "Peak performance on CPU (mono-core) depending on the type of operation and on the SBC.",
"width": 1300,
"height": 300,
"data": {
"values": [
{"category":"tx2-A57", "group": "FMA f64", "value": 8},
{"category":"tx2-A57", "group": "FMA f32", "value": 16},
{"category":"tx2-A57", "group": "FMA f16", "value": 0},
{"category":"tx2-A57", "group": "DPA2 f32/bf16", "value": 0},
{"category":"tx2-A57", "group": "DPA4 i32/i8", "value": 0},
{"category":"tx2-Denver", "group": "FMA f64", "value": 8},
{"category":"tx2-Denver", "group": "FMA f32", "value": 15},
{"category":"xagx-Carmel", "group": "FMA f64", "value": 17},
{"category":"xagx-Carmel", "group": "FMA f32", "value": 33},
{"category":"xagx-Carmel", "group": "FMA f16", "value": 66},
{"category":"xnx-Carmel", "group": "FMA f64", "value": 14},
{"category":"xnx-Carmel", "group": "FMA f32", "value": 28},
{"category":"xnx-Carmel", "group": "FMA f16", "value": 56},
{"category":"m1u-Icestorm", "group": "FMA f64", "value": 16},
{"category":"m1u-Icestorm", "group": "FMA f32", "value": 33},
{"category":"m1u-Icestorm", "group": "FMA f16", "value": 66},
{"category":"m1u-Icestorm", "group": "DPA4 i32/i8", "value": 132},
{"category":"m1u-Firestorm", "group": "FMA f64", "value": 52},
{"category":"m1u-Firestorm", "group": "FMA f32", "value": 103},
{"category":"m1u-Firestorm", "group": "FMA f16", "value": 206},
{"category":"m1u-Firestorm", "group": "DPA4 i32/i8", "value": 412},
{"category":"onx-A78", "group": "FMA f64", "value": 16},
{"category":"onx-A78", "group": "FMA f32", "value": 32},
{"category":"onx-A78", "group": "FMA f16", "value": 64},
{"category":"onx-A78", "group": "DPA4 i32/i8", "value": 127},
{"category":"oagx-A78", "group": "FMA f64", "value": 18},
{"category":"oagx-A78", "group": "FMA f32", "value": 35},
{"category":"oagx-A78", "group": "FMA f16", "value": 70},
{"category":"oagx-A78", "group": "DPA4 i32/i8", "value": 140},
{"category":"em780-7840u", "group": "FMA f64", "value": 62},
{"category":"em780-7840u", "group": "FMA f32", "value": 124},
{"category":"em780-7840u", "group": "DPA2 f32/bf16", "value": 248},
{"category":"em780-7840u", "group": "DPA4 i32/i8", "value": 497},
{"category":"x7ti-lpe", "group": "FMA f64", "value": 19},
{"category":"x7ti-lpe", "group": "FMA f32", "value": 40},
{"category":"x7ti-lpe", "group": "DPA2 f32/bf16", "value": 80},
{"category":"x7ti-lpe", "group": "DPA4 i32/i8", "value": 160},
{"category":"x7ti-e", "group": "FMA f64", "value": 30},
{"category":"x7ti-e", "group": "FMA f32", "value": 60},
{"category":"x7ti-e", "group": "DPA2 f32/bf16", "value": 120},
{"category":"x7ti-e", "group": "DPA4 i32/i8", "value": 242},
{"category":"x7ti-p", "group": "FMA f64", "value": 56},
{"category":"x7ti-p", "group": "FMA f32", "value": 113},
{"category":"x7ti-p", "group": "DPA2 f32/bf16", "value": 303},
{"category":"x7ti-p", "group": "DPA4 i32/i8", "value": 600}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "Performance (Gop/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group", "sort": "none"},
"color": {"field": "group", "title": "Operation"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative", "sort": "none"}
}
}]
}
Peak performance on CPU (single-core, medium power SBCs) depending on the type of operation and on the SBC (higher is better).
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "Peak performance on CPU depending on the type of operation and on the SBC.",
"width": 1300,
"height": 300,
"data": {
"values": [
{"category":"tx2-A57-4t", "group": "FMA f64", "value": 33},
{"category":"tx2-A57-4t", "group": "FMA f32", "value": 65},
{"category":"tx2-A57-4t", "group": "FMA f16", "value": 0},
{"category":"tx2-A57-4t", "group": "DPA2 f32/bf16", "value": 0},
{"category":"tx2-A57-4t", "group": "DPA4 i32/i8", "value": 0},
{"category":"tx2-Denver-2t", "group": "FMA f64", "value": 16},
{"category":"tx2-Denver-2t", "group": "FMA f32", "value": 31},
{"category":"xagx-Carmel-8t", "group": "FMA f64", "value": 133},
{"category":"xagx-Carmel-8t", "group": "FMA f32", "value": 264},
{"category":"xagx-Carmel-8t", "group": "FMA f16", "value": 530},
{"category":"xnx-Carmel-6t", "group": "FMA f64", "value": 84},
{"category":"xnx-Carmel-6t", "group": "FMA f32", "value": 167},
{"category":"xnx-Carmel-6t", "group": "FMA f16", "value": 334},
{"category":"m1u-Icestorm-4t", "group": "FMA f64", "value": 66},
{"category":"m1u-Icestorm-4t", "group": "FMA f32", "value": 132},
{"category":"m1u-Icestorm-4t", "group": "FMA f16", "value": 263},
{"category":"m1u-Icestorm-4t", "group": "DPA4 i32/i8", "value": 527},
{"category":"m1u-Firestorm-16t", "group": "FMA f64", "value": 775},
{"category":"m1u-Firestorm-16t", "group": "FMA f32", "value": 1551},
{"category":"m1u-Firestorm-16t", "group": "FMA f16", "value": 3102},
{"category":"m1u-Firestorm-16t", "group": "DPA4 i32/i8", "value": 6201},
{"category":"onx-A78-8t", "group": "FMA f64", "value": 125},
{"category":"onx-A78-8t", "group": "FMA f32", "value": 252},
{"category":"onx-A78-8t", "group": "FMA f16", "value": 504},
{"category":"onx-A78-8t", "group": "DPA4 i32/i8", "value": 1010},
{"category":"oagx-A78-12t", "group": "FMA f64", "value": 210},
{"category":"oagx-A78-12t", "group": "FMA f32", "value": 421},
{"category":"oagx-A78-12t", "group": "FMA f16", "value": 842},
{"category":"oagx-A78-12t", "group": "DPA4 i32/i8", "value": 1684},
{"category":"em780-7840u-8t", "group": "FMA f64", "value": 442},
{"category":"em780-7840u-8t", "group": "FMA f32", "value": 872},
{"category":"em780-7840u-8t", "group": "DPA2 f32/bf16", "value": 1854},
{"category":"em780-7840u-8t", "group": "DPA4 i32/i8", "value": 3560},
{"category":"x7ti-lpe-2t", "group": "FMA f64", "value": 40},
{"category":"x7ti-lpe-2t", "group": "FMA f32", "value": 79},
{"category":"x7ti-lpe-2t", "group": "DPA2 f32/bf16", "value": 160},
{"category":"x7ti-lpe-2t", "group": "DPA4 i32/i8", "value": 319},
{"category":"x7ti-e-8t", "group": "FMA f64", "value": 210},
{"category":"x7ti-e-8t", "group": "FMA f32", "value": 420},
{"category":"x7ti-e-8t", "group": "DPA2 f32/bf16", "value": 841},
{"category":"x7ti-e-8t", "group": "DPA4 i32/i8", "value": 1683},
{"category":"x7ti-p-6t", "group": "FMA f64", "value": 430},
{"category":"x7ti-p-6t", "group": "FMA f32", "value": 853},
{"category":"x7ti-p-6t", "group": "DPA2 f32/bf16", "value": 1704},
{"category":"x7ti-p-6t", "group": "DPA4 i32/i8", "value": 3430}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "Performance (Gop/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group", "sort": "none"},
"color": {"field": "group", "title": "Operation"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative", "sort": "none"}
}
}]
}
Peak performance on CPU (multi-core, medium power SBCs) depending on the type of operation and on the SBC (higher is better).
GPU Memory Bandwidth
Measurement of the memory bandwidth between the GPU and its global
memory with the clpeak
benchmark. On the Nvidia Jetson platforms,
PoCL has been installed to enable OpenCL support (see
the PoCL Installation on Jetson
section).
Integrated GPUs
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "GPU memory bandwidth depending on the SBC.",
"width": 1500,
"height": 300,
"data": {
"values": [
{"category": "xu4-Mali-T628-MP6", "group": "float32x1", "value": 5.2},
{"category": "xu4-Mali-T628-MP6", "group": "float32x2", "value": 6.9},
{"category": "xu4-Mali-T628-MP6", "group": "float32x4", "value": 7.0},
{"category": "xu4-Mali-T628-MP6", "group": "float32x8", "value": 6.9},
{"category": "tx2-Pascal-2SMX", "group": "float32x1", "value": 37},
{"category": "tx2-Pascal-2SMX", "group": "float32x2", "value": 46},
{"category": "tx2-Pascal-2SMX", "group": "float32x4", "value": 46},
{"category": "tx2-Pascal-2SMX", "group": "float32x8", "value": 34},
{"category": "xagx-Volta-8SMX", "group": "float32x1", "value": 110},
{"category": "xagx-Volta-8SMX", "group": "float32x2", "value": 109},
{"category": "xagx-Volta-8SMX", "group": "float32x4", "value": 109},
{"category": "xagx-Volta-8SMX", "group": "float32x8", "value": 91},
{"category": "xnano-Maxwell-1SMX", "group": "float32x1", "value": 18},
{"category": "xnano-Maxwell-1SMX", "group": "float32x2", "value": 21},
{"category": "xnano-Maxwell-1SMX", "group": "float32x4", "value": 21},
{"category": "xnano-Maxwell-1SMX", "group": "float32x8", "value": 20},
{"category": "xnx-Volta-6SMX", "group": "float32x1", "value": 47},
{"category": "xnx-Volta-6SMX", "group": "float32x2", "value": 49},
{"category": "xnx-Volta-6SMX", "group": "float32x4", "value": 49},
{"category": "xnx-Volta-6SMX", "group": "float32x8", "value": 44},
{"category": "m1u-macos-48c", "group": "float32x1", "value": 699},
{"category": "m1u-macos-48c", "group": "float32x2", "value": 717},
{"category": "m1u-macos-48c", "group": "float32x4", "value": 729},
{"category": "m1u-macos-48c", "group": "float32x8", "value": 703},
{"category": "m1u-linux-48c", "group": "float32x1", "value": 500},
{"category": "m1u-linux-48c", "group": "float32x2", "value": 514},
{"category": "m1u-linux-48c", "group": "float32x4", "value": 523},
{"category": "m1u-linux-48c", "group": "float32x8", "value": 524},
{"category": "vim1s-Mali-G31-MP2", "group": "float32x1", "value": 3.5},
{"category": "vim1s-Mali-G31-MP2", "group": "float32x2", "value": 4.3},
{"category": "vim1s-Mali-G31-MP2", "group": "float32x4", "value": 4.2},
{"category": "vim1s-Mali-G31-MP2", "group": "float32x8", "value": 3.5},
{"category": "onx-Ampere-8SMX", "group": "float32x1", "value": 87},
{"category": "onx-Ampere-8SMX", "group": "float32x2", "value": 94},
{"category": "onx-Ampere-8SMX", "group": "float32x4", "value": 94},
{"category": "onx-Ampere-8SMX", "group": "float32x8", "value": 94},
{"category": "oagx-Ampere-16SMX", "group": "float32x1", "value": 174},
{"category": "oagx-Ampere-16SMX", "group": "float32x2", "value": 178},
{"category": "oagx-Ampere-16SMX", "group": "float32x4", "value": 179},
{"category": "oagx-Ampere-16SMX", "group": "float32x8", "value": 180},
{"category": "onano-Ampere-8SMX", "group": "float32x1", "value": 63},
{"category": "onano-Ampere-8SMX", "group": "float32x2", "value": 64},
{"category": "onano-Ampere-8SMX", "group": "float32x4", "value": 64},
{"category": "onano-Ampere-8SMX", "group": "float32x8", "value": 64},
{"category": "opi5-Mali-G610-MP4", "group": "float32x1", "value": 24},
{"category": "opi5-Mali-G610-MP4", "group": "float32x2", "value": 26},
{"category": "opi5-Mali-G610-MP4", "group": "float32x4", "value": 26},
{"category": "opi5-Mali-G610-MP4", "group": "float32x8", "value": 20},
{"category": "em780-Radeon-780M", "group": "float32x1", "value": 72},
{"category": "em780-Radeon-780M", "group": "float32x2", "value": 76},
{"category": "em780-Radeon-780M", "group": "float32x4", "value": 79},
{"category": "em780-Radeon-780M", "group": "float32x8", "value": 80},
{"category": "x7ti-Alchemist-8Xe", "group": "float32x1", "value": 73},
{"category": "x7ti-Alchemist-8Xe", "group": "float32x2", "value": 74},
{"category": "x7ti-Alchemist-8Xe", "group": "float32x4", "value": 75},
{"category": "x7ti-Alchemist-8Xe", "group": "float32x8", "value": 78}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "Throughput (GB/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group", "sort": "none"},
"color": {"field": "group", "title": "Datatype"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative", "sort": "none"}
}
}]
}
GPU memory bandwidth depending on the SBC (integrated GPUs, higher is better).
Discrete GPUs
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "GPU memory bandwidth depending on the SBC.",
"width": 500,
"height": 300,
"data": {
"values": [
{"category": "GeForce-RTX-3090", "group": "float32x 1", "value": 817},
{"category": "GeForce-RTX-3090", "group": "float32x 2", "value": 842},
{"category": "GeForce-RTX-3090", "group": "float32x 4", "value": 856},
{"category": "GeForce-RTX-3090", "group": "float32x 8", "value": 788},
{"category": "GeForce-RTX-3090", "group": "float32x16", "value": 845},
{"category": "Radeon-RX-7900-XTX", "group": "float32x 1", "value": 601},
{"category": "Radeon-RX-7900-XTX", "group": "float32x 2", "value": 623},
{"category": "Radeon-RX-7900-XTX", "group": "float32x 4", "value": 642},
{"category": "Radeon-RX-7900-XTX", "group": "float32x 8", "value": 666},
{"category": "Radeon-RX-7900-XTX", "group": "float32x16", "value": 683}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "Throughput (GB/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group", "sort": "none"},
"color": {"field": "group", "title": "Datatype"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative", "sort": "none"}
}
}]
}
GPU memory bandwidth of the discrete GPUs (higher is better).
GPU Peak Performance
Measurement of the GPU peak performance. The clpeak benchmark is used: it is an OpenCL benchmark that executes a compute-intensive program to estimate the peak performance. On the Nvidia Jetson platforms, PoCL has been installed to enable OpenCL support (see the PoCL Installation on Jetson section).
Integrated GPUs
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "GPU peak performance (32-bit float) depending on the SBC.",
"width": 1300,
"height": 300,
"data": {
"values": [
{"category": "xu4-Mali-T628-MP6", "group": "float64", "value": 26},
{"category": "xu4-Mali-T628-MP6", "group": "float32", "value": 56},
{"category": "xu4-Mali-T628-MP6", "group": "float16", "value": 114},
{"category": "tx2-Pascal-2SMX", "group": "float64", "value": 21},
{"category": "tx2-Pascal-2SMX", "group": "float32", "value": 652},
{"category": "tx2-Pascal-2SMX", "group": "float16", "value": 0},
{"category": "xagx-Volta-8SMX", "group": "float64", "value": 44},
{"category": "xagx-Volta-8SMX", "group": "float32", "value": 1404},
{"category": "xagx-Volta-8SMX", "group": "float16", "value": 0},
{"category": "xnano-Maxwell-1SMX", "group": "float64", "value": 7},
{"category": "xnano-Maxwell-1SMX", "group": "float32", "value": 230},
{"category": "xnano-Maxwell-1SMX", "group": "float16", "value": 0},
{"category": "xnx-Volta-6SMX", "group": "float64", "value": 27},
{"category": "xnx-Volta-6SMX", "group": "float32", "value": 847},
{"category": "xnx-Volta-6SMX", "group": "float16", "value": 0},
{"category": "m1u-macos-48c", "group": "float64", "value": 0},
{"category": "m1u-macos-48c", "group": "float32", "value": 7706},
{"category": "m1u-macos-48c", "group": "float16", "value": 0},
{"category": "m1u-linux-48c", "group": "float64", "value": 0},
{"category": "m1u-linux-48c", "group": "float32", "value": 7120},
{"category": "m1u-linux-48c", "group": "float16", "value": 6145},
{"category": "vim1s-Mali-G31-MP2", "group": "float64", "value": 0},
{"category": "vim1s-Mali-G31-MP2", "group": "float32", "value": 13},
{"category": "vim1s-Mali-G31-MP2", "group": "float16", "value": 27},
{"category": "onx-Ampere-8SMX", "group": "float64", "value": 30},
{"category": "onx-Ampere-8SMX", "group": "float32", "value": 1844},
{"category": "onx-Ampere-8SMX", "group": "float16", "value": 3520},
{"category": "oagx-Ampere-16SMX", "group": "float64", "value": 83},
{"category": "oagx-Ampere-16SMX", "group": "float32", "value": 5211},
{"category": "oagx-Ampere-16SMX", "group": "float16", "value": 9957},
{"category": "onano-Ampere-8SMX", "group": "float64", "value": 20},
{"category": "onano-Ampere-8SMX", "group": "float32", "value": 1255},
{"category": "onano-Ampere-8SMX", "group": "float16", "value": 2397},
{"category": "opi5-Mali-G610-MP4", "group": "float64", "value": 0},
{"category": "opi5-Mali-G610-MP4", "group": "float32", "value": 474},
{"category": "opi5-Mali-G610-MP4", "group": "float16", "value": 917},
{"category": "em780-Radeon-780M", "group": "float64", "value": 86},
{"category": "em780-Radeon-780M", "group": "float32", "value": 2522},
{"category": "em780-Radeon-780M", "group": "float16", "value": 4574},
{"category": "x7ti-Alchemist-8Xe", "group": "float64", "value": 149},
{"category": "x7ti-Alchemist-8Xe", "group": "float32", "value": 4774},
{"category": "x7ti-Alchemist-8Xe", "group": "float16", "value": 9473}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "Peak Performance (Gflop/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group", "sort": "none"},
"color": {"field": "group", "title": "Datatype"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative", "sort": "none"}
}
}]
}
GPU peak performance depending on the SBC (integrated GPUs, higher is better).
Discrete GPUs
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "GPU peak performance (32-bit float) depending on the SBC.",
"width": 500,
"height": 300,
"data": {
"values": [
{"category": "GeForce-RTX-3090", "group": "float64", "value": 629},
{"category": "GeForce-RTX-3090", "group": "float32", "value": 36038},
{"category": "GeForce-RTX-3090", "group": "float16", "value": 39636},
{"category": "Radeon-RX-7900-XTX", "group": "float64", "value": 907},
{"category": "Radeon-RX-7900-XTX", "group": "float32", "value": 23952},
{"category": "Radeon-RX-7900-XTX", "group": "float16", "value": 40445}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "Peak Performance (Gflop/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group", "sort": "none"},
"color": {"field": "group", "title": "Datatype"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative", "sort": "none"}
}
}]
}
GPU peak performance of the discrete GPUs (higher is better).
Compute-Intensive n-body Code
MUrB is an \(n\)-body code that simulates Newtonian gravitational interactions. This type of code is known to be mostly compute-bound because there are \(O(n^2)\) computations for \(n\) data. The CPU code is vectorized thanks to the MIPP SIMD wrapper and multi-threaded with OpenMP (all the available cores are used for the benchmark). On GPU, an OpenCL and a CUDA implementation are evaluated. In all cases, the computations are performed using the float32 datatype.
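To illustrate why the kernel is compute-bound, here is a scalar sketch of the acceleration computation of one simulation step. It is not the MUrB code (MUrB vectorizes this loop with MIPP, multi-threads it with OpenMP and provides CUDA/OpenCL ports); the function name nbody_accel and the softening factor soft are assumptions of this sketch.
#include <math.h>
#include <stdio.h>
#include <stddef.h>

/* Scalar O(n^2) n-body acceleration sketch in float32 (illustrative only).
 * The softening factor `soft` avoids the singularity when two bodies are
 * very close and makes the j == i term vanish (dx = dy = dz = 0). */
static void nbody_accel(size_t n, const float *x, const float *y, const float *z,
                        const float *m, float *ax, float *ay, float *az, float soft)
{
    for (size_t i = 0; i < n; i++) {
        float axi = 0.f, ayi = 0.f, azi = 0.f;
        for (size_t j = 0; j < n; j++) {            /* ~20 flops per interaction */
            const float dx = x[j] - x[i], dy = y[j] - y[i], dz = z[j] - z[i];
            const float d2 = dx * dx + dy * dy + dz * dz + soft * soft;
            const float inv = 1.f / sqrtf(d2);
            const float s = m[j] * inv * inv * inv; /* G folded into the masses */
            axi += s * dx; ayi += s * dy; azi += s * dz;
        }
        ax[i] = axi; ay[i] = ayi; az[i] = azi;
    }
}

int main(void)
{
    enum { N = 4 };
    float x[N] = {0, 1, 2, 3}, y[N] = {0}, z[N] = {0}, m[N] = {1, 1, 1, 1};
    float ax[N], ay[N], az[N];
    nbody_accel(N, x, y, z, m, ax, ay, az, 0.01f);
    printf("a[0] = (%g, %g, %g)\n", ax[0], ay[0], az[0]);
    return 0;
}
With roughly 20 floating-point operations per pair of bodies and only a few bytes of data loaded per body, the arithmetic intensity is high, which is why the performance is bound by compute rather than by memory bandwidth.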
CPU MIPP implementation
The following command line is used:
murb --nv -i 100 -n 20000 --gf --im cpu+simd # on CPU with SIMD (MIPP wrapper)
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "MUrB: achieved performance on CPU depending on the SBC.",
"width": 500,
"height": 300,
"data": {
"values": [
{"category": "xu4", "group": "MIPP", "value": 11},
{"category": "rpi3", "group": "MIPP", "value": 10},
{"category": "tx2", "group": "MIPP", "value": 53},
{"category": "xagx", "group": "MIPP", "value": 151},
{"category": "xnano", "group": "MIPP", "value": 20},
{"category": "rpi4", "group": "MIPP", "value": 22},
{"category": "xnx", "group": "MIPP", "value": 95},
{"category": "m1u", "group": "MIPP", "value": 838},
{"category": "vim1s", "group": "MIPP", "value": 10},
{"category": "onx", "group": "MIPP", "value": 103},
{"category": "oagx", "group": "MIPP", "value": 172},
{"category": "onano", "group": "MIPP", "value": 60},
{"category": "opi5", "group": "MIPP", "value": 80},
{"category": "rpi5", "group": "MIPP", "value": 58},
{"category": "em780", "group": "MIPP", "value": 633},
{"category": "x7ti", "group": "MIPP", "value": 627}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "MUrB Performance (Gflop/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group", "sort": "none"},
"color": {"field": "group", "title": "API"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative", "sort": "none"}
}
}]
}
MUrB: Achieved performance on CPU depending on the SBC (higher is better).
GPU CUDA & OpenCL implementations
Integrated GPUs
The following command lines are used:
murb --nv -i 1500 -n 30000 --gf --im cuda+rsqrt4 # on GPU with CUDA API
murb --nv -i 1500 -n 30000 --gf --im ocl+rsqrt4 # on GPU with OpenCL API
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "MUrB: achieved performance on GPU depending on the SBC.",
"width": 1000,
"height": 300,
"data": {
"values": [
{"category":"xu4-Mali-T628-MP6", "group": "OCL", "value": 13},
{"category":"xu4-Mali-T628-MP6", "group": "CUDA", "value": null},
{"category":"tx2-Pascal-2SMX", "group": "OCL", "value": 274},
{"category":"tx2-Pascal-2SMX", "group": "CUDA", "value": 276},
{"category":"xagx-Volta-8SMX", "group": "OCL", "value": 735},
{"category":"xagx-Volta-8SMX", "group": "CUDA", "value": 736},
{"category":"xnano-Maxwell-1SMX", "group": "OCL", "value": 109},
{"category":"xnano-Maxwell-1SMX", "group": "CUDA", "value": 108},
{"category":"xnx-Volta-6SMX", "group": "OCL", "value": 446},
{"category":"xnx-Volta-6SMX", "group": "CUDA", "value": 454},
{"category":"m1u-macos-48c", "group": "OCL", "value": 2104},
{"category":"m1u-linux-48c", "group": "OCL", "value": 1558},
{"category":"vim1s-Mali-G31-MP2", "group": "OCL", "value": 8},
{"category":"onx-Ampere-8SMX", "group": "OCL", "value": 594},
{"category":"onx-Ampere-8SMX", "group": "CUDA", "value": 595},
{"category":"oagx-Ampere-16SMX", "group": "OCL", "value": 1572},
{"category":"oagx-Ampere-16SMX", "group": "CUDA", "value": 1629},
{"category":"onano-Ampere-8SMX", "group": "OCL", "value": 423},
{"category":"onano-Ampere-8SMX", "group": "CUDA", "value": 437},
{"category":"opi5-Mali-G610-MP4", "group": "OCL", "value": 148},
{"category":"opi5-Mali-G610-MP4", "group": "CUDA", "value": null},
{"category":"em780-Radeon-780M", "group": "OCL", "value": 1292},
{"category":"em780-Radeon-780M", "group": "CUDA", "value": null},
{"category":"x7ti-Alchemist-8Xe", "group": "OCL", "value": 2143}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "MUrB Performance (Gflop/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group"},
"color": {"field": "group", "title": "API"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative"}
}
}]
}
MUrB: Achieved performance on the integrated GPUs depending on the SBC (higher is better).
Discrete GPUs
For the Nvidia GeForce RTX 3090, the code is run with the following command lines:
murb --nv -i 750 -n 200000 --gf --im cuda+locu2 --wg 32 # for CUDA API
murb --nv -i 750 -n 200000 --gf --im ocl+locu2 --wg 32 # for OpenCL API
For the AMD Radeon RX 7900 XTX, the code is run with the following command line:
murb --nv -i 750 -n 200000 --gf --im ocl+rsqrt2 --wg 32
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "MUrB: achieved performance on GPU depending on the SBC.",
"width": 300,
"height": 300,
"data": {
"values": [
{"category":"GeForce-RTX-3090", "group": "OCL", "value": 12878},
{"category":"GeForce-RTX-3090", "group": "CUDA", "value": 12513},
{"category":"Radeon-RX-7900-XTX", "group": "OCL", "value": 19692}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "MUrB Performance (Gflop/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group"},
"color": {"field": "group", "title": "API"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative"}
}
}]
}
MUrB: Achieved performance on the discrete GPUs (higher is better).
Meteor Detection Application
FMDT is an application that detects moving meteors in the sky. The most optimized version of FMDT is executed on Full HD frames with a {1, 4, 1} pipeline. In total there are 6 active threads, 4 of which are heavily loaded. The application relies on the StreamPU multi-threading runtime and on the FLSL algorithm for labeling (CPU-only code).
The following command line is used:
fmdt-detect-rt-opt-pip --vid-in-path ../2022_05_31_tauh_34_meteors.mp4 --vid-in-buff --vid-in-loop 30 --rt-stats --ccl-impl LSLM --pip-threads '[1,4,1]'
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "FMDT: achieved number of FPS on the SBC.",
"width": 500,
"height": 300,
"data": {
"values": [
{"category": "xu4", "group": "SPU{1,4,1}+FLSL", "value": 237},
{"category": "rpi3", "group": "SPU{1,4,1}+FLSL", "value": 82},
{"category": "tx2", "group": "SPU{1,4,1}+FLSL", "value": 719},
{"category": "xagx", "group": "SPU{1,4,1}+FLSL", "value": 2033},
{"category": "xnano", "group": "SPU{1,4,1}+FLSL", "value": 356},
{"category": "rpi4", "group": "SPU{1,4,1}+FLSL", "value": 152},
{"category": "xnx", "group": "SPU{1,4,1}+FLSL", "value": 1563},
{"category": "m1u", "group": "SPU{1,4,1}+FLSL", "value": 5714},
{"category": "vim1s", "group": "SPU{1,4,1}+FLSL", "value": 230},
{"category": "onx", "group": "SPU{1,4,1}+FLSL", "value": 1430},
{"category": "oagx", "group": "SPU{1,4,1}+FLSL", "value": 1812},
{"category": "onano", "group": "SPU{1,4,1}+FLSL", "value": 1234},
{"category": "opi5", "group": "SPU{1,4,1}+FLSL", "value": 873},
{"category": "rpi5", "group": "SPU{1,4,1}+FLSL", "value": 187},
{"category": "em780", "group": "SPU{1,4,1}+FLSL", "value": 3118},
{"category": "x7ti", "group": "SPU{1,4,1}+FLSL", "value": 4235}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "Frames Per Second (FPS)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group"},
"color": {"field": "group", "title": "Version"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative"}
}
}]
}
FMDT: Achieved number of FPS on the SBC (higher is better).
Forward Error Correction Simulation
AFF3CT is a software package dedicated to Forward Error Correction (or channel coding) simulations, for instance the simulation of the physical layer of digital telecommunication standards such as 5G.
The following command line is used:
aff3ct -p 8 --sim-type BFER -m 4.5 -M 4.5 -C POLAR -K 1755 -N 2048 --src-type AZCW --crc-type 32-GZIP --crc-implem FAST --enc-fb-gen-method GA --chn-type AWGN --chn-implem FAST --qnt-type POW2 --qnt-implem FAST --qnt-bits 6 --qnt-dec 1 --dec-type ASCL --dec-implem FAST --dec-simd INTRA -L 32 --dec-polar-nodes '{R0,R0L,R1,REP_2-8,REPL,SPC_4}' --sim-stop-time 60
It is a simulation of a Polar code (2048,1755) decoded with an ASCL decoder (see https://aff3ct.github.io/#performances for more details). The reported metric is the information throughput (in Mb/s in the chart below). The code uses all the CPU cores available on the node (thanks to the StreamPU runtime) and is vectorized with the MIPP SIMD wrapper.
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"description": "AFF3CT: achieved information throughput depending on the SBCs.",
"width": 500,
"height": 300,
"data": {
"values": [
{"category": "xu4", "group": "Polar-ASCL+SPU+MIPP", "value": 122},
{"category": "rpi3", "group": "Polar-ASCL+SPU+MIPP", "value": 47},
{"category": "tx2", "group": "Polar-ASCL+SPU+MIPP", "value": 195},
{"category": "xagx", "group": "Polar-ASCL+SPU+MIPP", "value": 465},
{"category": "xnano", "group": "Polar-ASCL+SPU+MIPP", "value": 78},
{"category": "rpi4", "group": "Polar-ASCL+SPU+MIPP", "value": 92},
{"category": "xnx", "group": "Polar-ASCL+SPU+MIPP", "value": 326},
{"category": "m1u", "group": "Polar-ASCL+SPU+MIPP", "value": 3196},
{"category": "vim1s", "group": "Polar-ASCL+SPU+MIPP", "value": 55},
{"category": "onx", "group": "Polar-ASCL+SPU+MIPP", "value": 550},
{"category": "oagx", "group": "Polar-ASCL+SPU+MIPP", "value": 907},
{"category": "onano", "group": "Polar-ASCL+SPU+MIPP", "value": 315},
{"category": "opi5", "group": "Polar-ASCL+SPU+MIPP", "value": 343},
{"category": "rpi5", "group": "Polar-ASCL+SPU+MIPP", "value": 245},
{"category": "em780", "group": "Polar-ASCL+SPU+MIPP", "value": 2626},
{"category": "x7ti", "group": "Polar-ASCL+SPU+MIPP", "value": 2415}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "category", "type": "nominal", "sort": "none", "axis": {"title": "Single Board Computer", "labelAngle": -65}},
"y": {"field": "value", "type": "quantitative", "axis": {"title": "Information Throughput (Mb/s)"}, "scale": {"type": "linear"}},
"xOffset": {"field": "group"},
"color": {"field": "group", "title": "Simulation"}
},
"layer": [{
"mark": "bar"
}, {
"mark": {
"type": "text",
"align": "center",
"baseline": "middle",
"dy": -10
},
"encoding": {
"text": {"field": "value", "type": "quantitative"}
}
}]
}
AFF3CT: Achieved information throughput depending on the SBC (higher is better).
Summary Table
The following table summarizes the different benchmarks and applications: