Porting to the NEON intrinsics from experience

Hey you. Yes, you. Did you, inspired by my introduction to NEON on iPhone, write ARM NEON code, or are you maintaining ARM NEON code in some way? Is this NEON code written as ARM32 assembly? If you answered yes to both questions, then I hope you realize that any app that has your NEON code as a dependency is currently unable of taking advantage of ARM64 on supported hardware (now there may or may not be any real benefit for the app from doing so, but that is beside the point). ARM64, at the very least, is the future, so you will have to do something about that code so that it can run in ARM64 mode, but porting it to ARM64 assembly is not going to be straightforward, as the structure of the NEON register file has changed in ARM64 mode. Rather, I propose here porting your NEON ARM32 assembly algorithms to NEON intrinsics which can compile to both ARM32 and ARM64, and present here the outcome of my experience doing such a port, so that you can learn from it.

An introduction to the ARM NEON intrinsic support

The good thing about ARM NEON intrinsics is that they apply equally well in ARM32 and ARM64 mode, in fact you don’t have to follow any specific rule to support both with the same intrinsics source file: correct NEON intrinsics code that works on ARM32 will also work on ARM64 for free. At the most fundamental level, NEON intrinsics code is simply a C source file that includes <arm_neon.h> and uses a number of specific functions and types. The documentation for the ARM NEON intrinsics can be found here, on the ARM Information Center. This documentation ostensibly covers ARM DS-5, but in fact for iOS clang implements the same support; if you target other platforms in addition to or instead of iOS, you will have to check your toolchain compiler documentation, but if it supports any ARM NEON intrinsics at all it ought to have the same support as ARM DS-5.

Unfortunately, this document pretty much only documents the intrinsic function names and the types: for documentation on the operations these functions perform, it is still necessary to refer to the NEON instructions descriptions in the ARM instruction set document (don’t worry about the “registered ARM customers” part, you only need to create an account and agree to a license in order to download the PDF); furthermore, most material online (including my introduction to NEON on iPhone, if you need to get up to speed with NEON) will discuss NEON in terms of the instruction names rather than in terms of the C intrinsics, so it is a good idea to get used to locating the intrinsic function that corresponds to a given instruction; the most straightforward way is to open arm_neon.h in Xcode (add it as an include, compile once to refresh the index, then open it as one of this file’s includes in the “Related Files” menu), and just do a search for the instruction name: this will turn up the different intrinsic function variants that implement the instruction’s functionality, as the intrinsic function name is based on the instruction name. There is a trick situation, however, as for some instructions there is no matching intrinsic, these cases are documented here, with what you should do to get the equivalent functionality.

The converse also exists, where some intrinsics provide a functionality not provided by a particular instruction, or where the name does not match any instruction, such as:

In particular, the last two are what you will use in replacement of the parts of your ARM32 NEON algorithm where you would put results in, say, d6 and d7, and then the next operation would use q3, which is aliased to these two D registers. Indeed, it is important to realize (in particular if you are coming from NEON assembly coding) that these intrinsics work functionally, rather than procedurally over a register file; notably, the input variables are never modified. So stop worrying about placement and just write your NEON intrinsic code in functional fashion: factor_vec = vrsqrteq_f32(vmlaq_f32(vmulq_f32(x_vec, x_vec), y_vec, y_vec)); (assuming the initial reciprocal square root estimate is enough for your purposes). Things should come naturally once you integrate this way of thinking.

Variables should be reserved for results that you want to use more than once. Those need to be typed correctly, as the whole system is typed, with such fun variable type names as uint8x16_t; this explains the various vcombine_tnn variants, from vcombine_s8 to vcombine_p16, which in fact all come down to the same thing: the sole purpose of the variants is to preserve the correct element typing between the inputs and the output. I personally welcome the discipline: even if you think you know what you are doing, it’s so easy to get it subtly wrong in the middle of your algorithm, and you are left wondering at the end where you wrongly took a left turn (it was at Albuquerque. It is always at Albuquerque).

Less pleasant to use are the types that represent an array of vectors, of the form uint8x16x4_t for instance. Indeed, some intrinsics rely on these types, such as the transpositions ones, but also the deinterleaving loads and stores vld#/vst# (I presented them in my introduction to NEON on iPhone), which are just as indispensable when using intrinsics as they are when programming in assembly, and so when using these intrinsics you have to contend with these variables that represent multiple vectors at once (and that you of course cannot directly use as the input of another intrinsic); fortunately taking the individual vectors of those (for further calculations) is done using normal C array subscripting: coords_vec_arr.val[1], but this makes expressions less straightforward and elegant than they could otherwise have been.

Note that loading and storing vectors to memory without deinterleaving is not performed with an intrinsic, but simply by casting the pointer (typically one to your input/output element arrays) to a pointer to the correct vector type, and dereferencing that; this will result in the correct vector loads and stores being generated.

In practice

I am not going to share the code I ported or the actual benchmark results, but I can share the experience of porting a non-trivial NEON algorithm from ARM32 assembly to NEON intrinsics.

First, if the assembly code is competently commented (in particular with a clear register map), porting it is just a matter of following the algorithm main sequence and is rather straightforward, translating instructions one by one, with the addition of the occasional vcombine when two D vectors become a Q vector; your activity will mostly consist in finding the correct name for the intrinsic function for the given input element type, and finding variable names for these previously unnamed intermediate results (again, for these intermediate results which are only used once, save yourself the trouble of defining a variable and directly use the intrinsic output as the input for the next intrinsic). This was completed quickly.

But this is only the start. The next order of business is running the original algorithm and the new one on test inputs, and compare the results. For integer-only algorithms such as the one I ported, the results must match bit for bit between the original algorithm, the new one compiled as ARM32, and the new one compiled as ARM64; in my case they did. For algorithms that involve floating-point calculations they might not match bit for bit because of the different rounding control in ARM64, so compare within a tolerance that is appropriate for your purposes.

Once this check is done, you might wish to take a look at the assembly code generated from your intrinsics. In my case I discovered the ARM32 compiled version needed more vector storage than there are available registers, and as a result was performing various extra vectors loads and stores from memory at various points in the algorithm. The reason for this is that the automatic register allocation clang performed (at least in this case) just could not compare with my elaborate work in the original ARM32 NEON assembly code to tightly squeeze the necessary work data to never take more than 12 Q vectors at any given time (even avoiding the use of q4-q7, saving the trouble of having to preserve and restore them); also, it appears that, with clang, the intrinsics that use a scalar as one input do not actually generate the scalar-using instruction, but instead require the scalar to be splat on a vector register, harming register usage.

I have not been able to improve the situation by changing the way the intrinsic code was written; it seems it is the compiler which is going to have to improve. However, the ARM64 compiled version had no need for temporary storage beyond the NEON registers: twice as many vector registers are available in this mode, easing the pressure on the compiler register allocator.

But in the end what really matters is the actual performance of the code, so even if you take a look at the compiled code it is only by benchmarking the code (again, comparing between the original algorithm, the new version compiled as ARM32, and the new version compiled as ARM64) that you can reasonably decide which improvements are necessary. Don’t skimp on that part, you could be surprised. In my case, it turned out that the “inefficient”, ARM32 compiled version of the ported algorithm performed just as well as the original NEON ARM32 assembly. The probable reason is that my algorithm (and likely yours too) is in fact memory bandwidth constrained, and taking more time to perform the computations does not really matter when you then have to wait for the memory transfers to or from the level 3 cache or main memory to complete anyway.

As a result, in my case I could just replace the original algorithm by the new one without any performance regression. But that might not always be the case, and so if doing so would result in a performance regression, one course of action would be to keep using the original NEON assembly version in ARM32 mode, and use the new intrinsic-based algorithm only in ARM64 mode; use conditional compilation to select which code is used in each mode (I have a preprocessor macro defined for this purpose in the Xcode build settings, whose value depends on an architecture-dependent build setting). Fortunately, given the number of NEON registers available in ARM64, you should never see a performance regression on ARM64 capable hardware between the original ARM32 NEON assembly algorithm and the new one compiled as ARM64.

It worked

So your mileage may vary, certainly. But in my experience porting a NEON algorithm from ARM32 assembly to C intrinsics gave an adequate result, and was a quick and straightforward process, while writing an ARM64 assembly version would have been much more time consuming and would have required maintaining both versions in the future. And remember, no app that depends on your NEON algorithms can ship as a 64-bit capable app as long as you only have an ARM32 assembly version of these algorithms; if they haven’t been ported already, by now you’d better get started.