Motivation for hand-optimized Assembly code:

Last updated on 5th July, 2013 by Shervin Emami. Posted originally on 8th Oct, 2010.

There's a popular saying that "in 90% of cases, a modern compiler writes faster code than a typical Assembly programmer would". But anyone who has actually tested this theory knows how wrong this statement is! Hand-written Assembly code is ALWAYS faster and/or smaller than the equivalent compiled code, as long as the programmer understands the many intricate details of the CPU they are developing for. eg: I wrote both an optimized C function and an optimized Assembly function (using NEON SIMD instructions) to shrink an image by 4, and my Assembly code was 30x faster than my best C code! Even the best C compilers are still terrible at using SIMD acceleration, a feature that is available on most modern CPUs and can allow code to run between 4 and 50 times faster, yet is rarely used properly!

ARM's RVDS compiler typically generates code that is up to 2x faster than any other C compiler for ARM, but on most ARM devices, hand-written Assembly code can often be 10x faster! (Assuming you use SIMD vectorization such as ARM's NEON Media Processing Engine or Intel's MMX/SSE/AVX). This is similar to the speedups you can expect from GPGPU acceleration (using NVidia's CUDA or OpenCL), but on a small mobile device rather than an expensive desktop video card! And luckily the iPhone, iPad, iPod, Raspberry Pi, ODROID and Android phones & tablets nearly all use ARM CPUs with NEON vector processing, so you can use the same Assembly code in apps for the official iPhone App Store and the Android Market (with NDK) and Raspberry Pi. And with the recent popularity of ARM CPUs in portable devices, this is likely to continue for several generations of smartphones, tablets, and ultra-portables (eg: in the NVidia Tegra3 "Kal-el", TI OMAP4, Qualcomm Snapdragon S4 "Krait", Apple iPad2 & iPhone5, etc). Obviously you shouldn't write a whole app using Assembly language, but if you need certain loops to run as fast as possible, then a few sections of Assembly language might be exactly what you need!

Modern processor architectures are much more complicated now than they were at the start of the PC era, which definitely makes efficient Assembly code hard to write by hand, but it also makes efficient code hard for a compiler to generate, and so there is significant room for improvement in efficient code design.

UPDATE: Note that Cortex-A9 and Cortex-A15 CPUs are much more advanced than Cortex-A5, Cortex-A7 & Cortex-A8, so the advantages of Assembly code & NEON SIMD will be less important on Cortex-A9 & Cortex-A15 than on simpler devices such as Cortex-A8.

There are already some free libraries of hand-optimized code for Intel x86 and ARM CPUs, so for some tasks you can simply use one of these existing libraries from your C/C++ code without doing any Assembly language code yourself.

If you use ARM's DS-5 or RVDS 4 compiler, you can enable auto vectorization so it will try to optimize your C code using NEON, perhaps generating code that runs twice as fast as normal.

Or if you use GCC or LLVM or CodeSourcery you can also enable auto vectorization, but it rarely makes any improvement (in XCode 3, it would be "GCC 4.2 - Language" -> "Other C flags"):
"-O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -ftree-vectorize -ffast-math -funsafe-math-optimizations -fsingle-precision-constant"

There are a few ways to use Assembly code in your project:

use GCC's inline assembler, embedding small asm() blocks directly inside your C/C++ code, or

write standalone Assembly functions in a '.s' file and simply add it to your XCode sources, or

write standalone Assembly functions for an external assembler. You write Assembly code in a '.s' file and generate a '.o' object file, then link the object file with your project in XCode.

So if you are just trying to write a few Assembly instructions then the inline assembler would be the easiest way, but if you plan on writing many Assembly functions then I'd recommend a standalone Assembly file for GCC, or an external assembler such as FASMARM.

Once you have set up your assembler environment, you need to learn how to write ARM Assembly code, since iPhones and pretty much all portable devices (smartphones, tablets, smartwatches & Raspberry Pi / Linux dev boards) use the same ARM instruction set. Some good intro tutorials to learn ARM Assembly are:

There are also the books ARM System Developer's Guide and ARM Assembly Language. These are a good way to learn the basics of ARM Assembly from scratch, and then you can target specific features for your device such as NEON or Thumb-2 or multi-core.

When it comes to Assembly programming, the official instruction set reference manual is usually the main source of information for everything you will write, so you should go to the ARM website and download the ARM and Thumb-2 Quick Reference Card (6 pages long) as well as the 2 full documents for your exact CPU. For example, the iPhone 3GS and iPhone 4 both have an ARMv7-A Cortex-A8 CPU, so you can download the ARM Architecture Reference Manual ARMv7-A and ARMv7-R Edition (2000 pages long) that tells you exactly which instructions are available and exactly how they work, and the Cortex-A8 Technical Reference Manual (700 pages long) that explains the instruction timing, etc. for your specific CPU. There is also a recent ARM Cortex-A Programmer's Guide, containing useful info and comparisons of Cortex-A8, Cortex-A9, Cortex-A5 and Cortex-A15 CPUs.

UPDATE: Note that the Cortex-A5 & Cortex-A7 CPUs in recent ARM devices such as the Raspberry Pi 2 and ODROID-C1 all use the ARMv7 instruction set and are similar to the ARM Cortex-A8 and Cortex-A9 CPUs. Whereas the original Raspberry Pi 1 and the original iPhone use the older ARMv6 instruction set and an old ARM11 CPU, so they are quite different to all modern ARM CPUs.

It is important to understand that many ARM CPUs include the NEON Advanced SIMD coprocessor (aka NEON or the Media Processing Engine), and so if you expect to run operations that can take advantage of SIMD architecture (eg: heavily data-parallel tasks), then you should make it a big priority to learn how to use NEON effectively! As mentioned above, the official ARM Architecture Reference Manual and ARM Cortex-A8 Reference Manual are the most important sources of info, but there are other places for quicker info such as:

Note: The Assembler in GCC (GNU "as" / "gas", or GCC invoked with "-x assembler-with-cpp") can have certain peculiarities, such as:

All Assembly instructions should be in lower-case, so CAPITALS are not allowed!

The macro features are not nearly as powerful as other assemblers such as NASM.

Some versions of GCC / BINUTILS have a bug in parsing NEON alignment, so you may need a space after the comma. eg: vld1.8 {q0}, [r0, :64] instead of: vld1.8 {q0}, [r0,:64]

I found that NEON can only do simple addressing modes (as mentioned in the specs), but GCC does not give an error if you use an invalid one!
eg: vld1.32 {q0}, [r1,r2] // Load vector from mem[r1+r2]
or: vld1.32 {q0}, [r1,r2,lsl#2] // Load vector from mem[r1+r2*4]
silently compiles to:
vld1.32 {q0}, [r1] // Load vector from mem[r1]

To actually use Assembly code in your XCode project, I recommend creating a .H header file with function headers that can be included by your iPhone code. For example:

In your Objective-C or C or C++ files:

#include "libASM.h"
....
// Add the 4 pairs of numbers using NEON SIMD.
int arrA[4] = {5,10,15,20};
int arrB[4] = {1,2,3,4};
int *arrOut;
// NOTE: The same array is used in this function for arrA and the return value
// (ie: it will overwrite the data in arrA), but I'm just using a separate "arrOut"
// pointer to show how to return data from the Assembly function using r0.
arrOut = addFourIntsUsingNeon(arrA, arrB);
printf("arrOut = {%d, %d, %d, %d}\n", arrOut[0], arrOut[1], arrOut[2], arrOut[3]);

If you have a syntax error in your Assembly code, then XCode will fail with the error message:
Command /Developer/Platforms/iPhoneOS.platform/Developer/usr/bin/gcc-4.2 failed with exit code 1
But to see what the actual error message is, click on the small icon on the right-side of that line, which expands the message. At the bottom, it should now show the GNU assembler's error message, such as:
/iPhone/TestAssembly/Classes/libASM.s:341:bad instruction `xor r0, r0, r0'

The ARM Cortex-A8 single-core CPU is used on many smartphones and portable devices, such as the Apple iPhone 3GS, iPhone 4, iPad, iPod Touch (3rd Gen), the Palm Pre, Motorola Droid, Nokia N900, the BeagleBoard (palm-sized Linux computer), the Gumstix Overo (finger-sized computer) and Pandora (open-source gaming console). So I have collected the following information while developing Assembly language code for the iPhone, but it can also be useful for other mobile devices with ARM CPUs:

ARM functions can use the 32-bit registers R0-R12 for general purpose (R13 is SP, R14 is LR and R15 is PC), but must restore R4-R11 before returning (and preserve R14/LR if they call other functions). In iOS, R7 is the frame pointer so should never be used, but R9 and R12 can be used without preserving. Also in iOS, a 'bool' is typically 1 byte and data is little-endian (ARM can potentially support both little-endian & big-endian).

When interfacing with C programs, the first 4 integer parameters of a function are passed in r0-r3. The return value comes back in r0, with r1-r3 used for larger return values.

The Cortex-A8 has dual-pipeline execution, and most instructions take 1 cycle, so 2 instructions can potentially run in 1 cycle. But any load/store/multiply/branch/register-reuse must wait for the next cycle, and using a just-loaded register as a source causes a 2-cycle stall. LDM can load 2 memory words per cycle but only runs in pipeline0, and only frees pipeline1 on its final cycle.

The Cortex-A8 has a 13-stage pipeline, which means that if a branch-prediction fails (eg: an 'else' statement), there is a 13-cycle penalty! And accessing memory that is not in the L2 cache takes at least 25 cycles!

The cache line size of Cortex-A8 is 16 words (64 bytes), so you should ideally align everything to 64 bytes.

The NEON coprocessor mainly runs at 1 SIMD instruction per cycle, 5 cycles behind the ARM unit. Data in the Level-1 cache is accessed instantly by NEON. The NEON coprocessor can access ARM registers instantly, but for ARM to access a NEON register or the same memory location, there is at least a 20-cycle penalty! But during the penalty, it's possible to continue processing as long as nothing requires the NEON result in the ARM registers.

The ARM Cortex-A9 CPU can be single-core, dual-core or quad-core, and features speculative Out-of-Order Execution (allowing high-level code such as C/C++ to automatically run more efficiently), yet uses very little battery power. So the ARM Cortex-A9 is used in most of the latest multi-core devices, such as the Apple iPad2 (Apple A5 processor), LG Optimus 2X (nVidia Tegra2), Samsung Galaxy S II (Samsung Exynos 4210), Sony NGP PSP2, and the PandaBoard (TI OMAP4430). Here are some notes I made when reading the ARM Cortex-A Programmer's Guide:

Cortex-A9 has many advanced features for a RISC CPU, such as speculative data accesses, branch prediction, multi-issuing of instructions, hardware cache coherency, out-of-order execution and register renaming. Cortex-A8 does not have these, except for dual-issuing instructions and branch prediction. Therefore Assembly code optimizations & NEON SIMD are not as important on Cortex-A9 anymore.

Cortex-A9 MPCore has separate L1 Data and Instruction caches for each core, with hardware cache coherency for the L1 Data cache but not the L1 Instruction cache. Any L2 cache is shared externally between all the cores.

Cortex-A9 must use the PreLoad Engine in the external L2 cache controller (if it has one), whereas Cortex-A8 has an internal PLE for its L2 cache.

Cortex-A9 has a full VFPv3 FPU, whereas Cortex-A8 only has VFPLite. The main difference is that most float operations take 1 cycle on Cortex-A9 but take 10 cycles on Cortex-A8! Therefore VFP is very slow on Cortex-A8 but decent on Cortex-A9.

Cortex-A8 had the NEON unit behind the ARM unit, so NEON had fast access to ARM registers & memory, but it took a 20-cycle delay for any registers or flags from NEON to reach the ARM! This often occurs with function return values (unless the "hardfp" calling convention or function inlining is used).

Cortex-A8 had a separate load/store unit for NEON and one for ARM, so if they were both loading or storing addresses in the same cache line, it added a delay of about 20 cycles.

All Cortex-A8 CPUs have a NEON SIMD unit, whereas some Cortex-A9 CPUs don't have a NEON SIMD unit (eg: nVidia Tegra 2 does not have NEON, but nVidia Tegra 3 does).

Notes on ARM Cortex-A9 or any ARM Cortex-A in general:

Cortex-A9 has a 4-way set associative L1 Data Cache using 32 bytes per cache line (16kB, 32kB or 64kB of L1 cache, which is 512, 1024 or 2048 L1 cache lines).

Cortex-A9 MPCore can't clean or invalidate both L1 & external L2 at the same time, so incoherency can occur unless it is done in the correct order by software: to clean, clean the L1 cache first then L2; to invalidate, invalidate the L2 cache first then L1.

Cortex-A9 contains a "Fast Loop Mode" where very small loops (under 64 bytes of code and possibly cache line aligned) can run completely in the CPU decode & prefetch stages without accessing the instruction cache.

Cortex-A9 has support for Automatic Data Prefetching (if enabled by the OS), so that if you are accessing 1 or 2 arrays sequentially, it will detect this and prefetch the next data to cache before you will need it.

Cortex-A9 can detect when the instruction STM is used for memset() & memcpy(), and optimize the cache access by not loading data into cache if it will be overwritten anyway.

Cortex-A9 MPCore has a separate NEON module for each core. eg: a quad-core Cortex-A9 has 4 NEON units!

If the TLB does not have a page in its table, then a "page table walk" needs 2 or 3 memory accesses instead of 1.

"char" variables on ARM may default to unsigned chars, whereas they default to signed chars on x86, so this can cause runtime errors if not expected.

The first 4 arguments to a function are passed directly in the first 4 32-bit registers, whereas the remaining arguments use stack memory and so are slower. But C++ automatically uses the 1st argument as the "this" pointer, so only 3 of a method's arguments can go in registers.

64-bit arguments are more tricky and limiting due to the 8-byte alignment requirement.

If a function will call another function, it needs to maintain 8-byte stack alignment, so it should PUSH/POP an even number of registers. Leaf functions don't need 8-byte stack alignment.

When passing arguments with NEON Advanced SIMD using the "hardfp" calling convention, registers q0-q3 (s0-s15 or d0-d7) are used. Registers q4-q7 (s16-s31 or d8-d15) must be preserved if modified.

Newer C99 compilers allow the "restrict" keyword to say that pointers do not overlap other pointers, allowing compiler optimizations.

Cortex-A8 & Cortex-A9 do not have a hardware integer divide instruction, so any integer division becomes a slow (~50 cycle) library function call or a floating-point divide. But shifts left or right are often free.

Since the Branch Target Address Cache (BTAC) is based on 16-byte sizes and only allows 2 branches per line, if any code has more than 2 branches within 16-bytes of code, then it is likely to flush the instruction pipeline.

Since Cortex-A9 does Register Renaming at up to 2 registers per cycle, LDM or STM instructions with 5 or more registers can cause pipeline stalls.

Conditional Execution of ARM mode (not Thumb) allowed speedups in older CPUs but now it is often faster to use branches, because conditional instructions may need unwinding.

Good info on optimizing memset() & memcpy() is given in section 17.19 of the ARM Programmer's Guide: use LDM & STM of a whole cache line (where an aligned store is more important than an aligned load), and insert up to 4 PLDs, roughly 3 cache lines ahead of the current cache line.

Some info on optimizing float operations with VFP is given in Chapter 18 of the ARM Programmer's Guide.

The Cortex-A9 has a big delay when switching between VFP and NEON instructions.

NEON can't process 64-bit floats, divisions or square roots, so they are done with VFP instead.

NEON can be detected at compile time by checking: #ifdef __ARM_NEON__

NEON can be detected at runtime on Linux by checking the CPU flags, by running "cat /proc/cpuinfo" or searching the file "/proc/self/auxv" for AT_HWCAP to check for the HWCAP_NEON bit (4096).

Cortex-A9 MPCore uses the MESI protocol to keep all L1 caches coherent. Unfortunately, if one thread is often writing to a piece of data and another thread is often reading from a different piece of data on the same cache line, that cache line is constantly transferred back and forth between the cores (cache-line thrashing, also known as false sharing).

The ARM DS-5 development suite generates faster code than the GCC/LLVM compilers and has a more powerful debugger (based on the Eclipse IDE) that can analyze the system non-intrusively using CoreSight or JTAG.

The ARM "Vector Floating Point" (VFP) module was originally intended for SIMD vector operations, but its vector mode never caught on; in practice the VFP unit is just a scalar FPU for 32-bit floats and 64-bit doubles.

Many image processing operations can be performed very efficiently using NEON SIMD acceleration, since they operate locally on a small neighborhood of pixels at a time and can scan through the whole image in the same serial manner that it is stored in memory (ie: row-major form). But some tasks such as rotation require sparse or discontinuous memory access, and therefore may be tricky to implement in SIMD and still achieve a high performance boost.

This animation details how to rotate an image by 90 degrees (turn it onto its side) efficiently using NEON instructions. You can watch the data move instead of just reading about it:

Note: The very last slide in this animation shows the pixels as rotated 90 degrees counter-clockwise instead of 90 degrees clockwise! (Thanks to John Driscoll for pointing it out).

Here are some speed results I obtained from 2 different types of image processing functions: Rotating an image, and Shrinking an image. These two operations behave quite differently, since image shrinking can be done from top-to-bottom so memory is accessed in a serial manner, whereas rotation requires accessing memory in "column-major" format that causes major delays in memory access rather than CPU delays.