Wednesday, August 19, 2015

Accessing unaligned memory

Thanks to Herman Brule, I recently received access to real ARM hardware systems, in order to test C code and tune it for performance. It proved a great experience, with a lot of lessons learned.

It started with the finding that xxhash speed was rubbish on ARM systems. To investigate, 2 systems were benchmarked : first, an ARMv6-J, and then an ARMv7-A.

This was an unwelcome surprise, and among the multiple potential reasons, accessing unaligned data turned out to be the most critical one.

Since my latest blog entry on this issue, I converted unaligned-access code to the QEMU-promoted solution using `memcpy()`. Compared with the earlier method (`pack` statement), the `memcpy()` version has a big advantage : it's highly portable. It's also supposed to be correctly optimized by the compiler, ending up as a trivial `unaligned load` instruction on CPU architectures which support this feature.
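As a minimal sketch of the `memcpy()` approach described above (the helper names `read32_memcpy` and `check_read32_memcpy` are made up for illustration, they are not taken from xxhash), a 32-bit read from an arbitrary address could look like this:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* read32_memcpy is a hypothetical helper name, not taken from xxhash. */
static uint32_t read32_memcpy(const void *ptr)
{
    uint32_t value;
    /* Copying through memcpy() is valid for any alignment of ptr.
     * A good compiler may reduce this to a single load instruction. */
    memcpy(&value, ptr, sizeof value);
    return value;
}

/* Small self-check: store a value at an odd offset, then read it back. */
static void check_read32_memcpy(void)
{
    unsigned char buffer[8] = {0};
    const uint32_t expected = 0x12345678u;
    memcpy(buffer + 1, &expected, sizeof expected);
    assert(read32_memcpy(buffer + 1) == expected);
}
```

Whether the call really collapses into one load depends on the compiler and target, which is exactly the point examined in the rest of this post.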

Thanks to these new tools, the issue could be summarized into a selection between 3 possibilities to access unaligned memory :

1. Using `memcpy()` : this is the most portable and safe one.
It's also efficient in a large number of situations. For example, on all tested targets, clang translates `memcpy()` into a single `load` instruction when hardware supports it. gcc is also good on most targets tested (x86, x64, arm64, ppc), with just 32-bit ARM standing out.
The issue here is that your mileage will vary depending on the specific compiler / target combination. And it's difficult, if not impossible, to test and check all possible combinations. But at least, `memcpy()` is a good generic backup, a safe harbour to be compared to.

2. `pack` statement : the problem is that it's a compiler-specific extension. It tends to be present on most compilers, but with multiple different, and incompatible, semantics. Therefore, it's a pain for portability and maintenance.

That being said, in a number of cases where `memcpy()` doesn't produce optimal code, `pack` tends to do a better job. So it's possible to `special case` these situations, and leave the rest to `memcpy()`.

The most important use case was gcc with ARMv7, basically the most important 32-bit ARM version nowadays (included in the current crop of smartphones and tablets).
Here, using `pack` for unaligned memory improved performance from 120 MB/s to 765 MB/s compared to `memcpy()`. That's definitely too large a difference to be missed.

Unfortunately, on gcc with ARMv6, this solution was still as bad as `memcpy()`.

3. direct `u32` access : the only solution I could find for gcc on ARMv6.
This solution is not recommended, as it basically "lies" to the compiler by pretending data is properly aligned, thus generating a fast `load` instruction. It works when the target cpu is hardware compatible with unaligned memory access, and the compiler does not risk generating opcodes which are only compatible with strictly-aligned memory accesses.
This is exactly the situation of ARMv6.
Don't use it for ARMv7 though : although the cpu is compatible with unaligned loads, the compiler can also issue the `load multiple` instruction, which is a strict-align only opcode. So the resulting binary would crash.

In this case too, the performance gain is too large to be neglected : on unaligned memory access, read speed went up from 75 MB/s to 390 MB/s compared to `memcpy()` or `pack`. That's more than 5 times faster.
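For reference, methods 2 and 3 might be sketched as below. This is an illustration under assumptions, not the exact xxhash code : the names `unalign32`, `read32_packed`, `read32_direct` and `check_alternate_reads` are made up, and the `__attribute__((packed))` syntax is the gcc / clang flavour of `pack` (other compilers spell it differently).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Method 2: a packed type tells the compiler the field may be misaligned.
 * Assumption: gcc / clang attribute syntax; other compilers differ. */
typedef union { uint32_t u32; } __attribute__((packed)) unalign32;

static uint32_t read32_packed(const void *ptr)
{
    return ((const unalign32 *)ptr)->u32;
}

/* Method 3: direct access. This "lies" about alignment: it is undefined
 * behaviour in C when ptr is misaligned, and only works when both the
 * hardware and the generated opcodes tolerate unaligned addresses. */
static uint32_t read32_direct(const void *ptr)
{
    return *(const uint32_t *)ptr;
}

/* Self-check: the packed read is exercised at an odd offset, while the
 * direct read is only exercised on an aligned address to stay well defined. */
static void check_alternate_reads(void)
{
    unsigned char buffer[8] = {0};
    uint32_t expected = 0x12345678u;
    memcpy(buffer + 1, &expected, sizeof expected);
    assert(read32_packed(buffer + 1) == expected);
    assert(read32_direct(&expected) == expected);
}
```

The direct version is deliberately not tested on a misaligned address here, since whether that works is precisely the compiler- and target-specific question discussed above.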

So there you have it, a complex setup, which tries to select the best possible method depending on compiler and target. Current findings can be summarized as below :

The good news is that there is a safe default method, which tends to work well in a majority of situations. Now, it's only a matter of special-casing specific combinations, to use an alternate method.
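The special-casing could be expressed with preprocessor checks, roughly along these lines. This is only a sketch : `DEMO_MEMORY_ACCESS`, `demo_read32` and `check_demo_read32` are hypothetical names, and the lists of `__ARM_ARCH_*` macros are illustrative rather than exhaustive (they only cover the kind of targets benchmarked here).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* DEMO_MEMORY_ACCESS is a hypothetical macro name:
 * 0 = memcpy() (safe default), 1 = packed type, 2 = direct access. */
#ifndef DEMO_MEMORY_ACCESS
#  if defined(__GNUC__) && !defined(__clang__) && \
      (defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__))
#    define DEMO_MEMORY_ACCESS 2   /* gcc on ARMv6: direct access */
#  elif defined(__GNUC__) && !defined(__clang__) && \
      (defined(__ARM_ARCH_7__) || defined(__ARM_ARCH_7A__))
#    define DEMO_MEMORY_ACCESS 1   /* gcc on ARMv7: packed type */
#  else
#    define DEMO_MEMORY_ACCESS 0   /* everything else: memcpy() */
#  endif
#endif

#if DEMO_MEMORY_ACCESS == 2
static uint32_t demo_read32(const void *ptr)
{
    return *(const uint32_t *)ptr;   /* not portable, see method 3 */
}
#elif DEMO_MEMORY_ACCESS == 1
typedef union { uint32_t u32; } __attribute__((packed)) demo_unalign32;
static uint32_t demo_read32(const void *ptr)
{
    return ((const demo_unalign32 *)ptr)->u32;
}
#else
static uint32_t demo_read32(const void *ptr)
{
    uint32_t value;
    memcpy(&value, ptr, sizeof value);
    return value;
}
#endif

/* Self-check: read back a value stored at an odd offset. */
static void check_demo_read32(void)
{
    unsigned char buffer[8] = {0};
    const uint32_t expected = 0x12345678u;
    memcpy(buffer + 1, &expected, sizeof expected);
    assert(demo_read32(buffer + 1) == expected);
}
```

Keeping the macro overridable (the outer `#ifndef`) lets a user force the `memcpy()` path with `-DDEMO_MEMORY_ACCESS=0` if an automatic choice turns out to be wrong for their compiler.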

Of course, a better solution would be for all compilers, and gcc specifically, to properly translate `memcpy()` into efficient assembly for all targets. But that's wishful thinking, clearly outside of our responsibility. Even if it does improve some day, we nonetheless need an efficient solution now, for the current crop of compilers.

9 comments:

You mention gcc 4.8 when comparing compilers, then 4.7 in that table at the end. Are you really using different versions? GCC 4.8 is getting a bit old… I'd be interested in seeing the results for 5.1, or even 4.9.

Did you file a bug about the memcpy performance on ARM? It would be very nice to be able to track the issue…

If you want, I could provide SSH access to an ARM board or two of mine… I have ARMv6 and 7 boards with gcc 4.9, and I could probably get 5.1 on at least the ARMv7 fairly easily.

That said, I'm more interested in getting a bug report to the GCC people so the situation for memcpy can be improved where possible. Getting GCC fixed benefits everyone, and if memcpy is universally optimal you'll eventually be able to remove a bunch of portability cruft.

ARMv7 is an architecture (i.e. an instruction set). It would be better to report the microarchitecture (A7, A15 and so on) and frequency. Likewise, ARMv8 is the new architecture (containing both 32-bit and 64-bit instruction sets) that is implemented in the A53/A57/A72 and Apple microarchitectures.