From ARM NEON to Intel SSE: the automatic porting solution, tips and tricks

I love ARM. Yes, I do. - Why do I work for Intel then? Because I love Intel even more. That's why I'd like to help independent software vendors to port their products from ARM to Intel Architecture. But if and only if they would like to do it.

Why would they like it? The answer is simple. Intel CPUs are now in smartphones and tablets, while Android, Windows 8, and some other operating systems support both ARM and x86, increasing developers' options enormously. In most cases porting to x86 is very easy: the work varies from zero (for managed code and generic native code running via the Intel Houdini binary translator) to a simple code rebuild with the corresponding compiler. But for some applications this is not true.

Modern ARM CPUs widely used in mobile devices (iPhone, iPad, Microsoft Surface, Samsung devices, and millions of others) have a 64/128-bit SIMD instruction set (aka NEON, or MPE, Media Processing Engine), first defined as part of the ARM® Architecture, Version 7 (ARMv7). NEON is used by numerous developers for performance-critical tasks, either via assembler or via the NEON intrinsics set supported by modern compilers such as gcc, RVCT, and Microsoft's. NEON can be found in such famous open-source projects as FFmpeg, VP8, OpenCV, etc. For such projects, achieving maximal performance on x86 requires porting the ARM NEON instructions or intrinsics to Intel SIMD (SSE): namely, to Intel SSSE3 for devices based on the first generation of the Intel Atom CPU, and to Intel SSE4 for the second and later generations available since 2013.

However, the x86 SIMD and NEON instruction sets, and therefore their intrinsic functions, are different; there is no one-to-one correspondence between them, so the porting task is not trivial. Or, to be more precise, it was not trivial before the publication of this post.

Attached is the automatic solution for porting intrinsics-based ARM NEON source code to Intel x86 SIMD (SSE up to 4.2). Why intrinsics? x86 SIMD intrinsic functions are supported by all widespread C/C++ compilers: the Intel compiler, the Microsoft compiler, gcc, etc. These intrinsics are very mature, and their performance is equal to or even greater than that of pure assembler, while the usability is much better.
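To make the approach concrete, here is a minimal sketch of the idea behind such a header: the NEON vector types are mapped onto the 128-bit SSE type, and each NEON intrinsic becomes a small inline SSE wrapper. The mapping shown is my illustration, not the actual header contents; `vaddq_u8` is a real NEON intrinsic, and `_mm_add_epi8` is its natural SSE2 counterpart.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* NEON's 128-bit vector of 16 x uint8 mapped onto the SSE register type */
typedef __m128i uint8x16_t;

/* One-to-one case: NEON vaddq_u8 (per-byte wrapping add) corresponds
   directly to the SSE2 _mm_add_epi8 instruction. */
static inline uint8x16_t vaddq_u8(uint8x16_t a, uint8x16_t b)
{
    return _mm_add_epi8(a, b);
}
```

Code written against the NEON intrinsic name then compiles unchanged for x86; the wrapper disappears entirely after inlining.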

Why Intel SSE only, and not MMX? While the Intel MMX instruction set (64-bit data processing instructions) could be used to substitute for 64-bit NEON instructions, it is not recommended: MMX performance is commonly the same as or lower than that of the SSE instructions, and the MMX-specific problem of sharing the floating-point registers with the serial code can cause a lot of problems in software if not treated properly. Moreover, MMX is NOT supported on the 64-bit systems that are coming to mobile devices.

By default, SSE up to SSSE3 is used for porting. If you uncomment the corresponding "#define USE_SSE4" line (or, for gcc, build with the SSE4 compilation flag), then SSE up to SSE4 is used for porting instead.

Though the solution targets intrinsics, it can also assist with pure assembler porting. Namely, for each NEON function the corresponding NEON asm instruction is provided; the corresponding x86 intrinsics code can be used directly, and the asm code can be copied from the compiler's intermediate output.

The solution is shaped as a C/C++ language header to be included in the ported sources instead of the standard "arm_neon.h", and it provides fully automatic porting.

Main ARM NEON to x86 SIMD porting challenges:

64-bit processing functions. Only 128-bit SSE registers are used for x86 vector operations, which means that for each 64-bit processing function we need to somehow load the data into an SSE (xmm) register and then store it back. This impacts not only code quality but performance as well. Different load-store techniques are preferable for different compilers, and for some functions serial processing is faster.
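A hedged sketch of the load-process-store pattern: the function name below is mine, but it shows how a hypothetical port of the 64-bit NEON `vadd_u16` (four uint16 lanes in a 64-bit register) would have to move data through the low half of an xmm register.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Port of a 64-bit NEON-style add of 4 x uint16: the 64-bit data must be
   loaded into the low half of an SSE register, processed there, and the
   low 64 bits stored back. The extra load/store is the overhead the text
   above describes. */
static inline void add4_u16(const uint16_t *a, const uint16_t *b, uint16_t *res)
{
    __m128i va = _mm_loadl_epi64((const __m128i *)a); /* load 64 bits, zero upper */
    __m128i vb = _mm_loadl_epi64((const __m128i *)b);
    _mm_storel_epi64((__m128i *)res, _mm_add_epi16(va, vb)); /* store low 64 bits */
}
```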

Some x86 intrinsic functions require immediate parameters rather than variables, resulting in a compile-time "catastrophic error" when called from a wrapper function. Fortunately this happens only for some compilers, and only in non-optimized (Debug) builds. The solution is to replace such functions in debug mode with a switch over the immediate parameter, using branches (cases) with literal constants.
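The "switch on immediate" workaround can be sketched as follows; `_mm_slli_si128` is a real SSE2 intrinsic whose byte-shift count must be a compile-time immediate, and the wrapper name is mine.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* _mm_slli_si128 requires its byte-shift count to be an immediate, so a
   wrapper taking a runtime count branches to calls with literal constants. */
static inline __m128i shift_bytes_left(__m128i v, int n)
{
    switch (n) {
    case 0:  return v;
    case 1:  return _mm_slli_si128(v, 1);
    case 2:  return _mm_slli_si128(v, 2);
    case 3:  return _mm_slli_si128(v, 3);
    case 4:  return _mm_slli_si128(v, 4);
    case 5:  return _mm_slli_si128(v, 5);
    case 6:  return _mm_slli_si128(v, 6);
    case 7:  return _mm_slli_si128(v, 7);
    case 8:  return _mm_slli_si128(v, 8);
    case 9:  return _mm_slli_si128(v, 9);
    case 10: return _mm_slli_si128(v, 10);
    case 11: return _mm_slli_si128(v, 11);
    case 12: return _mm_slli_si128(v, 12);
    case 13: return _mm_slli_si128(v, 13);
    case 14: return _mm_slli_si128(v, 14);
    case 15: return _mm_slli_si128(v, 15);
    default: return _mm_setzero_si128(); /* shifting by >= 16 bytes clears all */
    }
}
```

When the count is a compile-time constant at the call site, the optimizer folds the switch away and only the single intended instruction remains.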

Not all arithmetic operations are available for 8-bit data in x86 SIMD; in particular, there is no shift for such data. The common solution is to convert the 8-bit data to 16 bits, process it, and then pack it back to 8 bits. However, in some cases it is possible to avoid such conversions with tricks like the one used for the vector right shift (the vshrq_n_u8 function).
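As an illustration of such a trick, here is a sketch of how a NEON-style 8-bit right shift (in the spirit of vshrq_n_u8) can be built from a 16-bit shift plus a mask, with no widening and narrowing; the function name and the fixed shift count of 3 are my choices for the example.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* There is no 8-bit shift in SSE. Trick: shift the 16-bit lanes, then
   mask off the top bits of each byte, which after the shift hold bits
   leaked in from the neighbouring byte. */
static inline __m128i shr_u8_by3(__m128i v)
{
    __m128i shifted = _mm_srli_epi16(v, 3);
    __m128i mask = _mm_set1_epi8(0xFF >> 3);  /* 0x1F: keep low 5 bits */
    return _mm_and_si128(shifted, mask);
}
```

The same pattern works for any constant shift count n, with the mask `0xFF >> n`.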

For some functions whose x86 implementation contains more than one instruction, intermediate overflow is possible. The solution is to use an overflow-safe algorithm implementation, even if it is slower. Say we need to calculate the average of a and b, i.e. (a+b)/2; the calculation should then be done as (a/2 + b/2).
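A sketch of this on unsigned 16-bit lanes, assuming a NEON-style halving add is the target: each operand is halved first so the intermediate sum cannot overflow, and (going one step beyond the a/2 + b/2 formula above) the carry lost when both low bits are 1 is added back to make the result exact. The function name is mine.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Overflow-safe (a+b)/2 per uint16 lane. A naive _mm_add_epi16 followed
   by a shift could wrap the intermediate sum; halving each operand first
   avoids that, and the carry term restores the bit lost to truncation. */
static inline __m128i halving_add_u16(__m128i a, __m128i b)
{
    __m128i half_a = _mm_srli_epi16(a, 1);               /* a/2 */
    __m128i half_b = _mm_srli_epi16(b, 1);               /* b/2 */
    __m128i carry  = _mm_and_si128(_mm_and_si128(a, b),  /* 1 iff both odd */
                                   _mm_set1_epi16(1));
    return _mm_add_epi16(_mm_add_epi16(half_a, half_b), carry);
}
```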

For some NEON functions there exist corresponding x86 SIMD functions, but their behavior differs when the function parameters are "out of range". Such cases need special processing; table lookup is a good example. While in the NEON specification out-of-range indices return 0, for Intel SIMD we need to set the most significant bit of the index to 1 to get a zero return.
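A sketch of that special processing, assuming a 16-byte table: `_mm_shuffle_epi8` (SSSE3) uses only the low 4 bits of each index byte but returns 0 for any lane whose index byte has its most significant bit set, so out-of-range indices are forced to have that bit set before the shuffle. The function name is mine; the target pragma is a build-convenience assumption for compilers not invoked with an SSSE3 flag.

```c
#if defined(__GNUC__) && !defined(__SSSE3__)
#pragma GCC target("ssse3")
#endif
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */
#include <stdint.h>

/* 16-byte table lookup with NEON semantics (out-of-range index -> 0). */
static inline __m128i tbl_lookup(__m128i table, __m128i idx)
{
    /* 0xFF for index bytes > 15; index bytes >= 0x80 are negative under
       the signed compare, stay unchanged, and already have the MSB set,
       so _mm_shuffle_epi8 zeroes those lanes anyway. */
    __m128i out_of_range = _mm_cmpgt_epi8(idx, _mm_set1_epi8(15));
    return _mm_shuffle_epi8(table, _mm_or_si128(idx, out_of_range));
}
```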

For some NEON functions there exist corresponding x86 SIMD functions, but their rounding rules differ, so we need to compensate by adding 1 to or subtracting 1 from the final result.
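For example, NEON's rounding shifts (the vrshr family) round the shifted result, while the SSE shifts truncate. One way to compensate, sketched below with a fixed shift of 2 and a function name of my choosing, is to add the rounding bias before shifting; note the bias addition can itself wrap for values near the top of the range, which a full implementation has to handle.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Rounding right shift by 2 on uint16 lanes: add the bias 1 << (n-1)
   before the truncating SSE shift, so the result matches round-to-
   nearest instead of truncation. */
static inline __m128i rounding_shr_u16_by2(__m128i v)
{
    __m128i bias = _mm_set1_epi16(1 << 1);  /* 1 << (2 - 1) */
    return _mm_srli_epi16(_mm_add_epi16(v, bias), 2);
}
```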

For some functions an x86 SIMD implementation is not possible or not effective. Examples of such functions are the shift of a vector by another vector, and some arithmetic operations on 64-bit and 32-bit data. The only solution here is a serial code implementation.

Performance

First, it is necessary to note that the exact porting solution selected for each function is based on common sense and on the x86 SIMD latency and throughput data for the latest Intel Atom CPU. For some CPUs and conditions, however, a better solution might be possible.

Solution performance was tested on several projects, demonstrating very similar results, which leads to the first and very important conclusion:

For most cases of x86 porting, expect a performance increase ratio similar to the ARM NEON vectorized/serial code ratio, provided 128-bit processing NEON functions are used.

Unfortunately, the situation is different for 64-bit processing NEON functions (even for those taking 64-bit input and returning 128 bits, or vice versa). For them the speedup is significantly lower.

So the second very important conclusion is:

Avoid 64-bit processing NEON functions; try to use the 128-bit versions even if your data are 64-bit. If you do use 64-bit NEON functions, expect the corresponding performance penalty.

Other porting considerations and best known methods are:

Use 16-byte data alignment for faster loads and stores

Avoid NEON functions working with constants. They give no gain, only a performance penalty for the constants' load/propagation. If constants are necessary, try to move the constants' initialization out of hotspot loops, and, if applicable, replace them with logical and compare operations.
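The hoisting advice above can be sketched as follows (function name and the added constant are mine): the vector constant is materialized once before the loop instead of being rebuilt on every iteration.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <stddef.h>

/* Add 42 to every uint16 element; n is assumed to be a multiple of 8.
   The constant vector is initialized once, outside the hot loop. */
static void add_42_to_all(uint16_t *data, size_t n)
{
    const __m128i c42 = _mm_set1_epi16(42);  /* hoisted out of the loop */
    for (size_t i = 0; i < n; i += 8) {
        __m128i v = _mm_loadu_si128((const __m128i *)(data + i));
        _mm_storeu_si128((__m128i *)(data + i), _mm_add_epi16(v, c42));
    }
}
```

Optimizing compilers often hoist such constants themselves, but writing the initialization outside the loop makes the intent explicit and does not rely on it.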

Try to avoid functions marked as "serially implemented", because they need to store data from vector registers to memory, process it serially, and load it again. You may be able to change the data type or the algorithm used to make the whole port vectorized rather than serial.

Once again: just include the file below in your project instead of the "arm_neon.h" header, and your code will be ported without any other changes necessary!

Upd. The NEONvsSSE.h file was updated on June 16, 2015 with a minor bugfix and better compiler compatibility.