ARM VFP Vector Programming, Part 1: Introduction

The ARM VFP co-processor is most commonly used for individual floating-point computations, in the so-called “scalar mode.” In Flynn’s taxonomy, this is known as SISD, or “single instruction, single data.” This design is the basic model for most low-level assembly and for the output of most high-level compilers. Where different data items must be treated differently, for whatever reason, SISD is the norm.

However, when a block of identical operations is carried out on a sequence of data points, it is possible to fetch several of the source data and perform the operation on all of them at once. This may be something as simple as adding the values of two arrays and storing the results into a third array, which may be part of a sophisticated analysis of a digital image. A close examination of the processing can show where a Single Instruction, Multiple Data (SIMD) design can boost a program’s performance.

VFP Configuration

The ARM VFP contains two math co-processors, one for single-precision operations, and one for double-precision. They share their register files, as well as their Floating-Point Status and Control Register (FPSCR). The FPSCR controls the behavior of the VFP, such as the rounding mode, and holds status such as the exception flags and the results of individual comparison operations. Two aspects of VFP programming controlled by the FPSCR are the vector Length and Stride.

Programmers who are familiar with Intel’s MMX and SSE paradigms are accustomed to having separate instructions based on the types and widths of operands in the vector registers. The ARM VFP instead uses the same instructions for vector and scalar operations, and selects sets of registers within banks. Each VFP co-processor has four banks, meaning the double-precision co-processor sees 4 registers per bank, and the single-precision co-processor sees 8 registers per bank. (Remember, the ARM math co-processors share their computational register files.) Thus, the banks look like this:

Bank    Double-precision registers    Single-precision registers
0       D0-D3                         S0-S7
1       D4-D7                         S8-S15
2       D8-D11                        S16-S23
3       D12-D15                       S24-S31
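The bank layout is simple integer arithmetic on the register number. As a rough sketch, here is a small Python model of it; the names `BANK_SIZE`, `bank_of` and `bank_registers` are made up for this illustration and are not part of any ARM tool or library.

```python
# Illustrative model of the VFP bank layout (hypothetical helper names).
BANK_SIZE = {"S": 8, "D": 4}  # registers per bank, keyed by register prefix


def bank_of(reg):
    """Return the bank (0-3) that a register name such as 'S14' or 'D9' is in."""
    kind, index = reg[0].upper(), int(reg[1:])
    if kind not in BANK_SIZE or not 0 <= index < 4 * BANK_SIZE[kind]:
        raise ValueError("not a VFP register in this model: " + reg)
    return index // BANK_SIZE[kind]


def bank_registers(kind, bank):
    """List the register names in one bank, e.g. ('D', 2) gives D8..D11."""
    size = BANK_SIZE[kind]
    return [kind + str(bank * size + i) for i in range(size)]
```

For instance, `bank_of("S14")` lands in bank 1, which matches the S8-S15 row.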

When the vector Length is greater than 1 (so its field in the FPSCR is not all 0’s), an instruction that can operate on multiple operands will do so, subject to a couple of conditions. The first condition is that the destination register is not in bank 0; any instruction specifying a destination register of D0-D3 or S0-S7 is automatically scalar, no matter what the current Length and Stride are.

The second condition is a little more complex. Remember that typical RISC instructions have three register fields: the destination, the first operand, and the second operand. In the ARM VFP, if the destination is outside bank 0 but the second operand is in bank 0, then the operation is called “mixed.” The Length and Stride apply to the first operand and the destination, but the second operand comes from a single register. This is like saying {1,2,3}+5={6,7,8}: every element of the vector {1,2,3} is shifted by the same amount, giving the vector {6,7,8}. Such a mode is convenient for operations like adjusting the silence level in an audio file; in this case, the uncorrected silence level would be at -5. The same concept applies to multiplication: in {1,2,3}×2={2,4,6}, every element of the vector {1,2,3} is doubled, giving the vector {2,4,6}. In an audio file, this would correspond to amplification.
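To make the two conditions concrete, here is a minimal Python sketch of the decision as described above. It is a model of the rules in this article, not of real hardware, and the function names (`classify`, `mixed_add`, `mixed_mul`) are invented for the example.

```python
# Illustrative model only: decide how the VFP would treat one instruction.
def classify(dest_bank, second_operand_bank, length):
    """Return 'scalar', 'mixed' or 'vector' for the given banks and Length."""
    if length == 1 or dest_bank == 0:
        return "scalar"   # Length of 1, or destination in bank 0
    if second_operand_bank == 0:
        return "mixed"    # vector first operand, single-register second operand
    return "vector"       # every operand is a vector


def mixed_add(vector, scalar):
    """Mixed addition: the same scalar is added to every element."""
    return [x + scalar for x in vector]


def mixed_mul(vector, scalar):
    """Mixed multiplication: every element is scaled by the same factor."""
    return [x * scalar for x in vector]
```

With these definitions, `mixed_add([1, 2, 3], 5)` reproduces the silence-level example and `mixed_mul([1, 2, 3], 2)` the amplification example.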

The instructions affected by the Length and Stride settings are the following:

Add, subtract, multiply, divide

Copy

Absolute value

Negate

Square root

Combined multiply/negate/add/subtract

The remaining operations are unaffected by the Length and Stride values, even those instructions that operate on multiple registers by design. In particular, pushing and popping multiple VFP registers on the stack always involves consecutive registers; the assembler syntax names the first and last registers, and the instruction encoding stores the first register and a transfer count.

Programming Note

Low-level programmers should take care: activating banks on the ARM VFP will put the VFP into a state that is incompatible with the ARM EABI! The Linux kernel does manage the FPSCR during a task switch, so one task using ARM VFP vectors won’t interfere with another task in “scalar mode.” However, enabling ARM VFP banks in an assembly function requires care: save the registers, and disable the banks again before leaving the pure-assembly routine. The GCC compiler does not yet have support for vector operations in the ARM VFP.

Explaining Length and Stride

The Length field in the FPSCR is a 3-bit field (bits 18-16) containing Length-1, so lengths {1…8} are represented as {0…7} in the FPSCR. For vector or mixed operations, it corresponds to the number of operands handled from/to each bank. The Stride field is not so simple; it is a 2-bit field (bits 21-20), containing binary 00 for a stride of 1, or 11 for a stride of 2. The binary values 01 and 10 are not defined in the ARM literature, and so should be considered UNPREDICTABLE and therefore unusable. What does this mean for a vector operation?

The Length field is exactly what its name implies. If a vector instruction should operate on 5 values, the Length field should contain 4 (binary 100). So the instruction
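The field encoding can be sketched as plain bit manipulation on a 32-bit FPSCR value. The Python below is only a model for checking the arithmetic; it assumes the Stride field sits in bits 21-20 (as given in the ARM architecture documentation), and the helper names `set_len_stride` and `get_len_stride` are invented here. On real hardware the value would be read and written with the VFP system-register transfer instructions, which this sketch does not cover.

```python
# Illustrative model of the FPSCR Length and Stride fields (hypothetical names).
LEN_SHIFT, LEN_MASK = 16, 0b111        # Length-1 in bits 18-16
STRIDE_SHIFT, STRIDE_MASK = 20, 0b11   # Stride code in bits 21-20 (assumed position)
STRIDE_CODES = {1: 0b00, 2: 0b11}      # codes 01 and 10 are treated as unusable


def set_len_stride(fpscr, length, stride):
    """Return fpscr with its Length and Stride fields replaced."""
    if not 1 <= length <= 8:
        raise ValueError("Length must be 1..8")
    if stride not in STRIDE_CODES:
        raise ValueError("Stride must be 1 or 2")
    # Clear both fields, leaving every other FPSCR bit untouched.
    fpscr &= ~((LEN_MASK << LEN_SHIFT) | (STRIDE_MASK << STRIDE_SHIFT))
    fpscr |= (length - 1) << LEN_SHIFT
    fpscr |= STRIDE_CODES[stride] << STRIDE_SHIFT
    return fpscr & 0xFFFFFFFF


def get_len_stride(fpscr):
    """Decode (Length, Stride) from an FPSCR value."""
    length = ((fpscr >> LEN_SHIFT) & LEN_MASK) + 1
    code = (fpscr >> STRIDE_SHIFT) & STRIDE_MASK
    for stride, stride_code in STRIDE_CODES.items():
        if stride_code == code:
            return length, stride
    raise ValueError("Stride code 01 or 10 is UNPREDICTABLE")
```

A Length of 5 with a Stride of 1 therefore sets only bit 18, giving the value 0x00040000 when every other FPSCR bit is zero.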

FADDS S8, S18, S25

would perform the following vector addition, assuming the Stride value is 1 (binary 00):

{S8..S12} := {S18..S22} + {S25..S29}

That is to say, S18+S25 would be stored in S8, S19+S26 would be stored in S9, and so on, for a total of five additions. The instruction mnemonic specifies the first vector elements for the instruction, and the Length and Stride fields determine which vector elements get used/modified.
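As a sanity check on the data flow, the same five additions can be modeled in Python on a list standing in for S0..S31. This is a sketch under simplifying assumptions: a Stride of 1, no wrap-around within a bank, and a made-up function name `vector_fadds`.

```python
# Illustrative model of a vector FADDS with Stride 1 and no bank wrap-around.
def vector_fadds(regs, dest, src1, src2, length):
    """regs[dest+i] = regs[src1+i] + regs[src2+i] for i in 0..length-1."""
    for i in range(length):
        regs[dest + i] = regs[src1 + i] + regs[src2 + i]


# Let each register S0..S31 start out holding its own index as a float.
regs = [float(i) for i in range(32)]

# FADDS S8, S18, S25 with Length = 5
vector_fadds(regs, 8, 18, 25, 5)
```

After the call, S8 holds 18+25, S9 holds 19+26, and so on up to S12; S13 and the source registers are untouched.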

If the first vector element, Length and Stride would cause a vector list to exceed its bank, the vector list will wrap around to the beginning of the bank. Thus, keeping the same Length and Stride values for 5 consecutive elements, a vector starting at S14 would contain the registers {S14, S15, S8, S9, S10}.

However, if a Length and Stride would cause the same register to appear more than once in a vector operand, such a combination of Length and Stride is specified as UNPREDICTABLE in the ARM documentation. This means the product of Length and Stride should not exceed the length of the bank (4 for double-precision, 8 for single-precision).
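The wrap-around rule and the Length-times-Stride limit can both be expressed in a few lines. The following Python sketch models register selection only; `vector_registers` is a hypothetical helper, and `bank_size` is 8 for single-precision registers or 4 for double-precision.

```python
# Illustrative model: which registers make up a vector operand.
def vector_registers(base, length, stride, bank_size=8):
    """Return the register numbers of a vector that starts at `base`."""
    if length * stride > bank_size:
        # A register would appear twice in the operand.
        raise ValueError("UNPREDICTABLE: Length x Stride exceeds the bank size")
    bank_start = (base // bank_size) * bank_size
    offset = base - bank_start
    # Step by `stride`, wrapping around to the start of the same bank.
    return [bank_start + (offset + i * stride) % bank_size
            for i in range(length)]
```

Calling `vector_registers(14, 5, 1)` gives the S14, S15, S8, S9, S10 sequence described above, while a Length of 5 with a Stride of 2 is rejected.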

An instruction’s vector operands are under no constraints relative to one another, save for the Length and Stride values in the FPSCR, which apply to all the vectors in an instruction. A vector or mixed instruction may specify any arbitrary base register for any vector, and expect that the vector/mixed operation will be carried out successfully (barring an unmasked floating-point exception).

Can ARM Vector Programming Benefit?

It can, but with a few caveats. Divisions and square roots are iterative processes on nearly any CPU architecture, and the ARM is no exception. If a vector operation involves divisions or square roots, the cost of execution latency is likely to outweigh any possible benefit of using vector banks.

Another factor (pun intended) is the instruction scheduler in a particular compiler. The scheduler in GCC 4.7.2 was not as good as the current scheduler in GCC 4.8.1, so the benefit of hand-coded assembly has decreased between these versions. (Side note: a non-VFP integer-based comparison, which under GCC 4.7.2 showed some benefit from using assembly code, showed zero benefit under GCC 4.8.1. The instruction scheduler improved that much.)

Finally, note that ARM Ltd. is no longer developing CPU architectures using VFP. VFP development is officially deprecated, and therefore discouraged. However, this is part of what made the Broadcom BCM2835 CPU desirable for the Raspberry Pi. From a cost-management perspective, using slightly-outdated goods in the bill of materials is one way to reduce the cost per unit. From a reliability perspective, the bugs are known and documented, and yes, the Linux kernel for the Raspberry Pi includes a couple of workarounds for known ARM11-family hardware bugs.

So what good is the Vector in Vector Floating Point? It’s good for the same thing as the Raspberry Pi itself: tinkering, trying it out, seeing what it can do (or can’t do). In Part 2, I’ll show how to enable and disable vector banks in the VFP, with a quick-and-dirty basic demonstration of “making it work.” I’ll also provide an example of VFP code with real-world implications in signal processing.