Keep in mind that mixing NEON and ARM load/stores can sometimes stall significantly. See [http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/ this link] for more info.

On Cortex-A9, there is a much higher performance floating point unit which can sustain 1 cycle/instr throughput, with low result latencies.

Revision as of 17:14, 6 April 2009

gcc compiler

The CodeSourcery ARM GNU/Linux toolchain is the gcc version with support for the latest ARM architecture. Mainline gcc also has stable ARM support. Enhancements are made in the CodeSourcery version first and are then pushed back to mainline.

ARM Cortex Floating Point

There are two types of instructions in the ARM v7 ISA that handle floating point:

1) VFPv3 floating point instruction set (used for single/double precision scalar operations).
These are used by gcc for C floating point operations on 'float' and 'double'.

2) NEON vectorized single precision operations (2 values in a D-register, or 4 values in a Q-register).
These can be used by gcc when -ftree-vectorize is enabled, -mfpu=neon is specified, and the code can be vectorized. In other cases the VFPv3 scalar ops will be used.
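As a rough illustration of the kind of code gcc may be able to vectorize for NEON, here is a minimal sketch of a single precision loop. The function name saxpy and its parameters are hypothetical and do not come from any library. The restrict qualifiers tell the compiler the arrays do not overlap, which is usually needed before it will vectorize. Whether NEON instructions are actually emitted depends on the gcc version and flags, so check the generated assembly.

```c
#include <stddef.h>

/*
 * Hypothetical example: y[i] = a * x[i] + y[i] in single precision.
 * A simple counted loop with no aliasing between x and y is a
 * typical candidate for gcc's auto-vectorizer.
 */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

A possible command line, assuming a Cortex-A8 target and the softfp ABI, is: gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -ftree-vectorize -ffast-math -c saxpy.c. The gcc documentation notes that floating point operations are not auto-vectorized for NEON unless -funsafe-math-optimizations is given (it is implied by -ffast-math), because NEON arithmetic is not fully IEEE 754 compliant.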

ARM Cortex-A processors have separate floating point pipelines that handle these different instructions.

On Cortex-A8, the designers' focus was on the NEON unit performance which can sustain 1 cycle/instr throughput (processing 2 single-precision values at once). The scalar VFPv3 FPU cannot achieve this level of performance (cycle timings are in the Cortex-A8 TRM download), but it is still a lot better than doing floating point using integer instructions.

If you need the highest performance floating point on Cortex-A8, you need to use single precision and ensure the code uses the NEON vectorized instructions:

use gcc with -ftree-vectorize (possibly modify source code to make it vector friendly)