Automatic vectorization for the masses

Automatic vectorization is a neat technique, which enable programmers to use the "streaming multimedia" SIMD instructions such as SSE or Altivec at moderate costs. Programmers, however, should not assume that they just can program stuff and the optimizer will do everything for them...

I lately tried to use the automatic vectorization features found in compilers such as icc and gcc. Here's a technical discussion of vectorization and associated difficulties.

What is vectorization? This is a technique once only found on scientific supercomputers, but now found in all current processors (MMX/SSE instruction set on Intel/AMD, Altivec on PowerPC). It consists in the possibility of performing the same operation on n simultaneous sets of operands, instead of 1 set of operands as in normal instructions. As an example, on the Pentium 4 and later processors, you have instructions that take as arguments two "vectors" (a1, a2, a3, a4) and (b1, b2, b3, b4) of single-precision IEEE-754 numbers and computes (a1+b1, a2+b2, a3+b3, a4+b4). This is fast than using 4 operations to compute a1+b1, then a2+b2, a3+b3, a4+b4 etc.

This is especially important for repetitive operations such as matrix computations, combination of images (transparency...), sound processing. Typically, such computations will have some inner loops that do something on operands a[i], b[i], c[i] etc. Vectorization allows them to do them at once on, say, (a[4*i], b[4*i], c[4*i]), (a[4*i+1], b[4*i+1], c[4*i+1]), (a[4*i+2], b[4*i+2], c[4*i+2]), (a[4*i+3], b[4*i+3], c[4*i+3]). The speed-up can be as much as 4 times, in this example.

Now, this seems all fine and dandy, but as you may have guessed, there is another side to the coin and many constraints. The main ones:

The usual way of branching between several possibilities is by tests and jumps. But since the control flow is the same for all operands in the vector (single instruction - multiple data, or SIMD), we cannot do that.

Typically, instructions for memory read/writes will impose that successive elements in the vector will be at successive locations in memory.

Perhaps the whole vector in memory has to be specifically aligned. (Recall that alignment is when the architecture imposes that the address of some data is a multiple of some known power of 2.)

First, there are ways to take care of the branching issue. Often, sides of the branching operation will be expressions with no side effects (such as: x > 3 : x + y : y * 2 or equivalently if (x > 3) z = x+y; else x=y*2; ). In such case, one computes both sides of the expression and the truth value of the condition, then uses and/or/not bitwise operations to combine the results according to the condition. The SSE instruction set includes operations that, say, take two vectors of single-precision numbers (a1, a2, a3, a4) and (b1, b2, b3, b4) and output (a1 > b1, a2 > b2, a3 > b3, a4 > b4) where each boolean condition is represented by a word of identical bits, 0 for false and 1 for true.

The problem of alignment can be annoying : typically, one can suppose that functions will be called with operands aligned correctly for scalar (normal) operations, but not for vector memory accesses. This can be alleviated by either specifying extra alignment restrictions, either adjusting operations according to alignment (which can be painful). Techniques such as "loop peeling" help in that matter.

Now, handling all his by hand is quite painful. Furthermore, such code will be suitable only for a particular architecture. It would be better if the compiler could do it for us. Fortunately, compiler designers have implemented "vectorizers" that take code and rewrite it in a vector fashion.

First of all, let us see some difficulties in doing so. Let us for instance consider the following code:

This code won't be vectorized if one uses gcc -O2 -ftree-vectorize. The reason is that it cannot be vectorized (at least, not straightforwardly; there are ways to deal with this situation). The reason is that we do not know whether r points to, say, a[1]; if it does, we cannot compute r[0]=a[0]+b[0] and r[1]=a[1]+b[1] straightforwardly in parallel.

In order to work around this, we can tell the compiler that r points to something distinct from a and b, and C99 introduced the 'restrict' keyword for this. "restrict" means, basically, that r points to something not accessible through other variables or pointers. In order to allow vectorization by gcc, we can use: float * restrict r (replace 'restrict' by '__restrict__' if you're not using -std=c99).

Intel's compiler, icc, works around this by adding some code testing for overlap. It does not need the 'restrict' keyword (but it generates shorter code if the keyword is present, because it will not have to do tests).

Such a reorganization assumes that we suppose the operator (+ here) to be associative and commutative, i.e. that a+(b+c)=(a+b)+c and a+b=b+a. This is true of the mathematical plus operation, but it is not true of the floating-point plus operation. (Recall that (1e12 + 1) - 1e12 = 0 in floating-point...)

gcc will not vectorize this loop by default, because it would imply expression reorderings that do not respect floating-point semantics. (Arbitrary reorderings of expressions can have dramatic consequences, as William Kahan pointed out, and it is probably wise that compilers should not be able to do them outside the express consent of the programmer.) However, you can allow it to perform such reorderings using -ffast-math (or -funsafe-math-optimizations).

icc doesn't have such qualms: by default, it assumes a lax floating-point model, violating the C99 spect (you can get compliant behavior using -fp-model precise).

To summarize usage:

With gcc, use a very recent gcc (4.2 beta versions), and use -O2 -ftree-vectorize -ftree-vectorizer-verbose=3 ; add -ffast-math if your code is not very sensitive to C99/IEEE-754 conformance, and always add -march=pentium3 or pentium4 if you're on IA32.

With icc, use -xW -O2 -vec-report3 on AMD64 or Pentium 4.

If the compiler refuses to vectorize your code, increase the level of verbosity (here, 3) and look at the messages. Watch for indications that the compiler cannot determine the dependences between variables. You may need to use restricted pointers (see above). Do not hesitate to simplify pointer computations (move out of the loops some pointers computations).

Beware that vectorization has limited ways to cope with tests. While you can vectorize branching between several cases using boolean masks, this assumes that the operations do not have side effects (even nonobvious, such as fetching from memory using a pointer, because of the possible segmentation violation). Beware that "semantically identical" code (in a certain sense) may be compiled differently: for instance, a strictly compliant C compiler will compile (x <= y ? x : y) and (x < y ? x : y) differently (because of NaNs, +0/-0 etc.), but the second expression is compiled into a single SSE instruction!

Now, you may also wonder about the reliability of all these optimizations. I have bad news in that respect: ever since I tried the optimizer in gcc, I found 3 bugs, one of which resulted in incorrect code being emitted (segfault at runtime): bugs 25371, 25413 and 25881 in gcc bugzilla.

To summarize: vectorization is neat, but don't hold your breath...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser
code is live. It needs further work but already handles most
markup better than the original parser.