Benchmarking CPUs (not yet a research article)

With this commit I have added 7-zip benchmark reporting to Armbian. It will be available after the next updates and with the next batch of new images.

Why not just recommend running 'apt install p7zip ; 7zr b'? Because 'fire and forget' benchmarking is always BS. You need some monitoring in parallel to know whether your system was really idle and at which clockspeeds the CPU cores were operating (throttling occurring or not?).

Recent 7-zip versions contain their own routine to 'pre-heat' the system prior to starting the benchmark (to let cpufreq scaling switch from low clockspeeds to the highest ones and, e.g. on Intel systems, let the system enter TurboBoost modes). This 7-zip code runs single threaded, so depending on the kernel's scheduler it sometimes ends up on the 'wrong' CPU core (e.g. a little core on big.LITTLE SoCs).

On a NanoPC T4 with conservative settings (limiting big CPU cores to 1.8 GHz and little cores to 1.4 GHz) this looks like this:

We get an overall score of above 6100 and 7-zip's 'CPU Freq' line reports CPU0 (a little core) being clocked at 1.4 GHz. But since this is a big.LITTLE design we need the monitoring output that gets displayed below 7-zip benchmark numbers.

By looking at the 2nd line we see that the system was totally idle prior to starting the benchmark (I implemented a 10 second sleep between starting the monitoring and firing up the benchmark for this reason -- to check whether the system was already busy or not). For comparison, here are 7-zip numbers of another RK3399 board that allows the CPU cores to clock slightly higher (2.0/1.5 GHz): ODROID-N1 scored 6500.
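The idea of 'monitoring in parallel' can be sketched in a few lines of Python. This is only a minimal illustration and not how armbianmonitor does it: the function names are made up, and the sysfs paths are the usual Linux cpufreq/thermal locations, which I assume exist on the board in question (some vendor kernels use different thermal nodes or units).

```python
import subprocess
import threading
import time
from pathlib import Path


def read_int(path):
    """Return the integer stored in a sysfs file, or None if it is unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def sample(cpu_root="/sys/devices/system/cpu",
           thermal="/sys/class/thermal/thermal_zone0/temp"):
    """Take one snapshot: per-core clockspeed in MHz plus temperature in degrees C."""
    freqs = {}
    for cpu_dir in sorted(Path(cpu_root).glob("cpu[0-9]*")):
        khz = read_int(cpu_dir / "cpufreq" / "scaling_cur_freq")
        if khz is not None:
            freqs[cpu_dir.name] = khz // 1000
    # Assumption: the thermal node reports millidegrees Celsius.
    milli_c = read_int(thermal)
    temp = None if milli_c is None else milli_c / 1000
    return freqs, temp


def monitor_while(cmd, idle_seconds=10, interval=5, **paths):
    """Sample for idle_seconds first, then run cmd while sampling continues."""
    stop = threading.Event()
    log = []

    def loop():
        while True:
            log.append((time.time(), *sample(**paths)))
            if stop.wait(interval):
                break

    worker = threading.Thread(target=loop, daemon=True)
    worker.start()
    time.sleep(idle_seconds)  # idle baseline before the benchmark starts
    result = subprocess.run(cmd, capture_output=True, text=True)
    stop.set()
    worker.join()
    return result.stdout, log


# Example (needs p7zip installed):
#   output, log = monitor_while(["7zr", "b"])
#   for stamp, freqs, temp in log:
#       print(stamp, freqs, temp)
```

The samples taken before the benchmark starts show whether the system was idle, and the later ones show whether clockspeeds dropped while the temperature climbed.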

Please keep in mind that benchmarks that run fully multi threaded are NOT representative of most workloads running on computers (those are mostly single threaded). Also please keep in mind that while 7-zip is far less affected by different compiler settings than the infamous sysbench, it is still affected somewhat. So when you see 7-zip benchmark numbers generated a few years ago, when the 7z binary had been built with GCC 4.x, you will most probably see higher scores with today's software and a binary built by GCC 7.x.

PineH64 scores close to 2600 which, as said already, is irrelevant for the SoC's performance since with the current mainline kernel the SoC runs at a fixed clockspeed (reported by 7-zip as ~910 MHz -- it's 912 MHz). We know that the chip can clock up to 1.8 GHz (confirmed by @wtarreau's cool mhz tool) so once cpufreq/dvfs is working we will see almost twice these 2600. Why only 'almost twice'? Because the 7-zip benchmark also depends on memory bandwidth (like any real workload -- that's just another reason why sysbench sucks as a CPU benchmark since sysbench does not depend on memory bandwidth at all).
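As a rough sanity check of the 'almost twice' claim, here is the arithmetic as a small Python sketch. It assumes the score scales perfectly linearly with clockspeed, which is only an upper bound: memory bandwidth does not scale with the CPU clock, so the real number will end up below it. The function name is made up for illustration.

```python
def linear_scaling_bound(score, mhz_now, mhz_target):
    """Upper bound for a score if it scaled perfectly with CPU clockspeed."""
    return score * mhz_target / mhz_now


# PineH64: ~2600 at a fixed 912 MHz, chip known to reach 1.8 GHz
bound = linear_scaling_bound(2600, 912, 1800)
print(f"clock ratio: {1800 / 912:.2f}x, upper bound: {round(bound)}")
```

The clock ratio is a bit below 2x to begin with, and the memory bandwidth dependency pushes the achievable score further below the bound.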

Those numbers are made with 4 tasks stressing all 4 CPU cores in parallel. Most real world workloads look different and are single threaded. So even if an H6 clocked at 1.8 GHz produces 7-zip scores above 5000, any big.LITTLE ARM SoC with a similar score will be way faster in reality, since single threaded loads running on the big cores perform much faster.

Throttling occurred. The board with the vendor's standard heatsink starts to overheat badly when running demanding loads. Active cooling is needed and that's why monitoring when running benchmarks is that important!

This is an octa-core Cortex-A53 SoC showing with this benchmark a score of well above 7000 when no throttling happens. Once again: such multithreaded results are BS wrt most real world workloads. An RK3399 board like an ODROID-N1 scoring 'just' 6500 will be the faster performing board with almost all usual workloads since it is equipped with 2 fast A72 cores while the Fire3 only has 8 slow A53 cores. Most workloads do not scale linearly with the count of CPU cores. This has to be taken into account.

(for A20, H3, H5, H6, RK3328 and S5P6818 results see above, for A64 see here and there, for i.MX6 see here (search for my Wandboard), for RK3288 see MiQi results here and for S905 see ODROID-C2 numbers there)

All the SoCs above are quad-core except A20 (dual) and S5P6818 (octa). And it's all about the type of CPU cores: A20 and H3 are Cortex-A7, A64/H5/H6/RK3328/S5P6818/S905 are Cortex-A53, i.MX6 is Cortex-A9 and RK3288 is Cortex-A17. So let's look at all the results, take count of cores into account and MHz also. The following is a table of 7-zip-MIPS per single core at 1GHz clockspeed:

A7: 475
A9: 525
A53: 625
A15: 700
A17: 750
A72: 850

(yeah, A15 and A72 also exist -- see below).

So that's roughly what you can expect from each individual Cortex core running at 1 GHz. As expected if a SoC contains more cores specific workloads that benefit from parallel code execution get faster (once again: most workloads are single-threaded!). Also as expected clockspeeds matter: if you buy an H3 or H5 board without voltage regulation limiting the maximum clockspeed then obviously this board will perform slower compared to another H3/H5 board with sophisticated voltage regulation allowing the CPU cores to clock much higher.
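To show how the per-core table can be used, here is a minimal Python sketch that turns it into a rough multi-threaded estimate (cores x GHz x 7-zip MIPS per core at 1 GHz). The function name is made up, and the model deliberately ignores memory bandwidth and throttling, so treat the results as ballpark figures only. The core counts and clockspeeds in the examples are the ones quoted for RK3399 and ODROID-XU4 further down in this thread.

```python
# 7-zip MIPS per single core at 1 GHz, taken from the table above
MIPS_PER_CORE_AT_1GHZ = {
    "A7": 475, "A9": 525, "A53": 625, "A15": 700, "A17": 750, "A72": 850,
}


def estimate_score(core, count, ghz):
    """Naive estimate: per-core MIPS at 1 GHz scaled by core count and clock."""
    return MIPS_PER_CORE_AT_1GHZ[core] * count * ghz


examples = [
    ("RK3399 big cores", "A72", 2, 2.0, 3350),
    ("RK3399 little cores", "A53", 4, 1.5, 3900),
    ("ODROID-XU4 big cores", "A15", 4, 1.8, 4950),
    ("ODROID-XU4 little cores", "A7", 4, 1.4, 2725),
]
for label, core, count, ghz, measured in examples:
    print(f"{label}: estimated {round(estimate_score(core, count, ghz))}, "
          f"measured ~{measured}")
```

The estimates land within a few percent of the measured numbers, which is about as much precision as a table of rounded values can give.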

What also matters with this benchmark and most if not all real world workloads: memory bandwidth. Boards with just a 16-bit memory interface are slower than those with 32- or even 64-bit memory interfaces (something that the incapable sysbench pseudo cpu test is not able to report since the whole execution happens inside the CPU cores). Boards that use 'better' DRAM (DDR4 vs DDR3) can be faster as long as suitable software/settings are available (and that's often not the case -- for example we're still waiting for Rockchip releasing new BLOBs with faster DRAM initialization for (LP)DDR4 equipped Rockchip boards).

Speaking about software/settings, it should also be obvious that these days we always have to take care of heat dissipation with the ARM SoCs in use. Heat has to be dissipated to prevent damaging the SoCs through overheating under load. Without fully functioning cpufreq/dvfs/thermal drivers we can not allow the CPU cores to clock at their upper limits since we need working throttling to protect the chips. And that's why the results for Allwinner H5 and H6 boards look that bad: the linux-sunxi community is still working on upstreaming driver support and/or we at Armbian have not yet incorporated the latest patches flying around into the build system. Once cpufreq/dvfs/thermal is ready for those newer Allwinner SoCs, H5 boards will get 1.5 times as fast and H6 boards almost twice as fast as today.

Software/settings matter. Always. That's why it's so disappointing to see all those benchmark numbers flying around not taking this into account.

What about Cortex-A15 and A72? When we look at boards Armbian supports we find those cores in SoCs implementing big.LITTLE: ODROID XU4/HC1/HC2 use Exynos 5422 which consists of 4 fast A15 and 4 slow A7 cores (32-bit ARMv7). Boards based on RK3399 have 2 fast A72 cores and 4 slow A53 cores (64-bit ARMv8).

Does it make sense to run the very same 7-zip benchmark on those big.LITTLE designs? Not that much since we can not easily draw any conclusion for normal workloads from such a benchmark number. For example when executing '7z b' on all 6 cores of an ODROID-N1 at the same time we get an overall score of ~6550 7-zip MIPS. When limiting benchmark execution to only the 2 fast A72 cores at 2 GHz we get ~3350 (that's ~1700 7-zip MIPS per core), when we execute the benchmark on the 4 little cores only (1.5GHz) then it's ~3900 (~975 7-zip MIPS per core). A single threaded task running on one of the two big cores will perform almost twice as fast compared to running on a little core. That's important to keep in mind since based on the workload running in reality some of the benchmark numbers are simply misleading or just... numbers without meaning.
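Limiting the benchmark to one cluster is usually done by pinning it with taskset. The following Python sketch only builds such a command line, it does not run anything. Assumptions: on RK3399 the little A53 cores are cpu0-3 and the big A72 cores are cpu4-5 (check your own board before relying on this), the 7z binary is called '7z', and the thread count is set with 7-zip's -mmt switch. The helper name is made up.

```python
import shlex


def pinned_benchmark_cmd(cores, benchmark=("7z", "b")):
    """Build a taskset command that pins the benchmark to the given CPU cores.

    The thread count passed via -mmt is set to the number of pinned cores so
    the benchmark does not start more threads than cores it may run on.
    """
    core_list = ",".join(str(core) for core in cores)
    return ["taskset", "-c", core_list, *benchmark, f"-mmt{len(cores)}"]


# Assumed RK3399 core numbering: 0-3 little (A53), 4-5 big (A72)
print(shlex.join(pinned_benchmark_cmd([4, 5])))
print(shlex.join(pinned_benchmark_cmd([0, 1, 2, 3])))
```

The resulting list can be passed to subprocess.run() as is, ideally with monitoring running in parallel.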

Same with ODROID-XU4/HC1/HC2: when running on the A15 big cores ~4950 7-zip MIPS are reported at around 1.8GHz (~1250 7-zip MIPS per big core), when running only on the little A7 cores at 1.4 GHz it's ~2725 (~675 per core). Same situation as with RK3399: single threaded stuff moved to the big cores performs almost twice as fast as on the little cores. I never measured 7-zip running on all cores together since 'numbers without meaning' but I would assume we get something similar as with RK3399: not the addition of big+little numbers (3350+3900=7250 vs. 6550 in reality) but something lower since all cores have to fight for memory bandwidth.
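The per-core figures and the 'sum vs. reality' comparison can be reproduced with a few lines of Python. All scores are the approximate numbers quoted above; the data layout and function names are just for illustration.

```python
# (cluster score, number of cores) as quoted above; "all" is the score with
# every core busy at the same time (never measured on the XU4)
runs = {
    "ODROID-N1": {"big": (3350, 2), "little": (3900, 4), "all": 6550},
    "ODROID-XU4": {"big": (4950, 4), "little": (2725, 4), "all": None},
}


def per_core(score, cores):
    """7-zip MIPS contributed by a single core of a cluster."""
    return score / cores


def big_vs_little(run):
    """How much faster one big core is compared to one little core."""
    return per_core(*run["big"]) / per_core(*run["little"])


def combined_efficiency(run):
    """Measured all-core score relative to the sum of both cluster scores."""
    if run["all"] is None:
        return None
    return run["all"] / (run["big"][0] + run["little"][0])


for board, run in runs.items():
    line = f"{board}: big core is {big_vs_little(run):.2f}x a little core"
    efficiency = combined_efficiency(run)
    if efficiency is not None:
        line += f", all cores reach {efficiency:.0%} of big+little"
    print(line)
```

A single big core comes out at roughly 1.7x to 1.8x a little core on both SoCs, and on the ODROID-N1 about a tenth of the theoretical big+little sum is lost when all cores run together.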

Other things to keep in mind:

When looking at the above benchmark numbers we see A53 cores performing with this specific benchmark 30% better compared to A7 cores at the same clockspeed (so there's a slight advantage ARMv8/64-bit has over ARMv7/32-bit). But as soon as we use other software/benchmarks that make heavy use of NEON optimizations we usually see a performance increase much higher (A53 usually performing twice as fast as A7 -- can be easily checked with cpuminer). So as always it depends on the use case.

Speaking of 'use case' we should also keep in mind that all those ARM SoCs have special engines for this and that. Almost all ARMv8/64-bit SoCs for example contain a cryptographic acceleration engine called 'ARMv8 Crypto Extensions' that makes a massive difference with AES for example compared to 32-bit/ARMv7 SoCs that have to do crypto stuff on the CPU cores (see here for numbers). So again, it's about the use case: if you're interested in VPN stuff or disk encryption, looking at generic CPU benchmarks is BS since you want an ARMv8 SoC with crypto support (almost all have it, the only exceptions are RPi 3/3+, ODROID-C2 and NanoPi K2).
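Whether a given ARM board exposes the Crypto Extensions can usually be read from the 'Features' line in /proc/cpuinfo, where the kernel lists 'aes' (plus 'pmull', 'sha1', 'sha2') when they are present. A minimal Python sketch, with a made-up function name; the two sample lines are illustrative and not copied from real boards:

```python
def has_crypto_extensions(cpuinfo_text):
    """Return True if an ARM 'Features' line in cpuinfo text lists 'aes'."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "features":
            return "aes" in value.split()
    return False


# Illustrative samples only
with_crypto = "Features\t: fp asimd evtstrm aes pmull sha1 sha2 crc32"
without_crypto = "Features\t: fp asimd evtstrm crc32"

print(has_crypto_extensions(with_crypto))
print(has_crypto_extensions(without_crypto))
```

On a real board the text would come from reading /proc/cpuinfo, e.g. has_crypto_extensions(open("/proc/cpuinfo").read()).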

With many use cases CPU performance isn't that important. With Marvell based boards (EspressoBin, Clearfogs, Helios4) CPU benchmarks look rather low, but these SoCs are designed for highest I/O and networking throughput, and even if the SoC in question scores low in CPU benchmarks those boards outperform everything else when it comes to fast storage and network.

Even if the CPU cores were running at 1600 MHz the numbers are still too high compared to the i.MX6 numbers generated above with a Wandboard (i.MX6 is also Cortex-A9). So maybe someone with one of the i.MX6 boards supported by Armbian can provide recent results with 'armbianmonitor -z'?

Is it worth to move this thread to the research guides & tutorials section? There's IMO a bit too often a 'results irrelevant' inside (most likely because DVFS doesn't work on those boards?), but can we summarize this work to make something relevant out of it? 'CPU' benchmarking seems to be something people are interested in. I would like to see a nice 'research guide' come out of it. Maybe with a short example of why multicore numbers are often 'worthless' in real world scenarios and how to properly interpret those 'numbers'.

Is it worth to move this thread to the research guides & tutorials section?

That was my intention (to split everything starting from post #5 above to a new thread I'm preparing an intro post for within the next 2 weeks). Top post will then focus on an overview about CPU performance capabilities of different boards and when (or with which use cases) this matters.

So for whatever reasons we have a nice mismatch between clockspeeds reported via sysfs and real clockspeeds with Armada 38x

Please note that the operating points are usually fed via the DT while the operating frequency is defined by the jumpers on the board. It's very possible that the DT doesn't reference the correct frequencies here. From what I've seen till now, the Armada 38x has limited ability to do frequency scaling, something like full speed or half speed possibly. When I was running mine at 1.6 GHz, I remember seeing only 1600 or 800 being effectively used. I didn't check since I upgraded to 2 GHz (well 1.992 to be precise) but I suspect I'm now doing either 2000 or 1000 and nothing else. Thus if you have a smaller number of operating points it would be possible that they are incorrectly mapped. Just my two cents :-)

Too late... I think it makes more sense to split it now, otherwise it will be a hard nut to distinguish between what should be part of the research article and what's left over from the original thread (this splitting often leads to a headache for the one who has to do it). Feel free to change the title as soon as you think it's ready.