An Introduction to GCC Compiler Intrinsics in Vector Processing

Speed is essential in multimedia, graphics and signal
processing. Sometimes programmers resort to assembly language to get
every last bit of speed out of their machines. GCC offers an intermediate
between assembly and standard C that can get you more speed and processor
features without having to go all the way to assembly language: compiler
intrinsics. This article discusses GCC's compiler intrinsics, emphasizing
vector processing on three platforms: X86 (using MMX, SSE and SSE2);
Motorola, now Freescale (using Altivec); and ARM Cortex-A (using Neon).
We conclude with some debugging tips and references.

So, What Are Compiler Intrinsics?

Compiler intrinsics (sometimes called "builtins") are like the
library functions you're used to, except they're built in to the
compiler. They may be faster than regular library functions (the
compiler knows more about them so it can optimize better) or handle
a smaller input range than the library functions. Intrinsics also
expose processor-specific functionality so you can use them as an
intermediate between standard C and assembly language. This gives you
the ability to get to assembly-like functionality, but still let the
compiler handle details like type checking, register allocation,
instruction scheduling and call stack maintenance. Some builtins are
portable; others are processor-specific. You can find lists of the
portable and target-specific intrinsics in the GCC info pages and the
include files (more about that below). This article focuses on the
intrinsics useful for vector processing.

Vectors and Scalars

In this article, a vector is an ordered collection of numbers, like
an array. If all the elements of a vector are measures of the same
thing, it's said to be a uniform vector. Non-uniform vectors have
elements that represent different things, and their elements have to be
processed differently. In software, vectors have their own types and
operations. A scalar is a single value, a vector of size one. Code
that uses vector types and operations is said to be vector code. Code
that uses only scalar types and operations is said to be scalar code.

Vector Processing Concepts

Vector processing is in the category of Single Instruction, Multiple
Data (SIMD). In SIMD, the same operation happens to all the data (the
values in the vector) at the same time. Each value in the vector is
computed independently. Vector operations include logic and math. Math
within a single vector is called horizontal math. Math
between two vectors is called vertical math.

Instead of writing: 10 x 2 = 20, express it vertically as:

10
x 2
------
20

In vertical math, vectors are lines of these values, and multiple
operations happen at the same time:

10 11 12 13
x 2  2  2  2
------------
20 22 24 26

Saturation arithmetic is like normal arithmetic except that when the
result of an operation would overflow or underflow an element in the
vector, the result is clamped at the end of the range and not allowed to
wrap around. (For instance, 255 is the largest unsigned character. In
saturation arithmetic on unsigned characters, 250 + 10 = 255.)
Regular arithmetic would let the value wrap around zero and become
smaller. Saturation arithmetic is useful, for example, when a pixel
comes out slightly brighter than maximum brightness: it should stay at
maximum brightness, not wrap around and turn dark.

We're including integer math in the discussion. While integer math
is not unique to vector processing, it's useful to have in case your
vector hardware is integer-only or if integer math is much faster
than floating-point math. Integer math will be an approximation of
floating-point math, but you might be able to get a faster answer that
is acceptably close.

The first option in integer math is rearranging operations. If your
formula is simple enough, you might be able to rearrange operations to
preserve precision. For instance, you could rearrange:

F = (9/5)C + 32

into:

F = (9*C)/5 + 32

So long as 9 * C doesn't overflow the type you're using, precision
is preserved. It's lost when you divide by 5; so do that after the
multiplication. Rearrangement may not work for more complex
formulas.

The second choice is scaled math.
In the scaled math option, you decide how much precision you want, then
multiply both sides of your equation by a constant, round or truncate
your coefficients to integers, and work with that. The final step to get
your answer is to divide by that constant. For instance, to convert
Celsius to Fahrenheit with one decimal digit of precision, multiply the
formula by 10 (F * 10 = 18 * C + 320), do the math in integers, then
divide the result by 10.

If you multiply by a power of 2 instead of 10, you change that final
division into a shift, which is almost certainly faster, though harder to
understand. (So don't do this gratuitously.)

The third choice for integer math is shift-and-add.
Shift-and-add is another method based on the idea that a floating-point
multiplication can be implemented with a number of shifts and adds. So
our troublesome 1.8C can be approximated as:

1.0C + 0.5C + 0.25C + ... OR C + (C >> 1) + (C >> 2) + ...

Again, it's almost certainly faster, but harder to understand.

There are examples of integer math in
samples/simple/temperatures*.c, and shift-and-add in
samples/colorconv2/scalar.c.

Vector Types, the Compiler and the Debugger

To use your processor's vector hardware, tell the compiler to use
intrinsics to generate SIMD code, include the file that defines
the vector types, and use a vector type to put your data into
vector form.

The compiler's SIMD command-line arguments are listed in Table 1.
(This article covers only these, but GCC offers much more.)

Table 1. GCC Command-Line Options to Generate SIMD Code

Processor             Options
X86 MMX/SSE1/SSE2     -mfpmath=sse -mmmx -msse -msse2
ARM Neon              -mfpu=neon -mfloat-abi=softfp
Freescale Altivec     -maltivec -mabi=altivec

Here are the include files you need:

arm_neon.h - ARM Neon types & intrinsics

altivec.h - Freescale Altivec types & intrinsics

mmintrin.h - X86 MMX

xmmintrin.h - X86 SSE1

emmintrin.h - X86 SSE2

X86: MMX, SSE, SSE2 Types and Debugging

The X86 compatibles with MMX, SSE1 and SSE2 have the following types:

MMX: __m64, 64 bits of integers broken down as eight 8-bit integers,
four 16-bit shorts or two 32-bit integers.

SSE1: __m128, 128 bits holding four 32-bit floats.

SSE2: __m128i, 128 bits of packed integers (sixteen 8-bit, eight
16-bit, four 32-bit or two 64-bit), and __m128d, two 64-bit doubles.
Because the debugger doesn't know how you're using these types,
printing X86 vector variables in gdb/ddd shows you the
packed form of the vector instead of the collection of elements. To
get to the individual elements, tell the debugger how to decode the
packed form with "print (type[]) x". For instance, if you have:

__m64 avariable; /* storing 4 shorts */

You can tell ddd to list individual elements as shorts saying:

print (short[]) avariable

If you're working with char vectors and want gdb to print the
vector's elements as numbers instead of characters, you can tell it to
using the "/" option. For instance:

print/d acharvector

will print the contents of acharvector as a series of decimal
values.

PowerPC Altivec Types and Debugging

PowerPC Processors with Altivec (also known as VMX and Velocity
Engine) add the keyword "vector" to their types. They're all 16
bytes long. The following are some Altivec vector types:

vector unsigned char: 16 unsigned chars

vector signed char: 16 signed chars

vector bool char: 16 unsigned chars (0 false, 255 true)

vector unsigned short: 8 unsigned shorts

vector signed short: 8 signed shorts

vector bool short: 8 unsigned shorts (0 false, 65535 true)

vector unsigned int: 4 unsigned ints

vector signed int: 4 signed ints

vector bool int: 4 unsigned ints (0 false, 2^32 - 1 true)

vector float: 4 floats

The debugger prints these vectors as collections of individual
elements.

ARM Neon Types and Debugging

On ARM processors that have Neon extensions available, the Neon
types follow the pattern [type]x[elementcount]_t. Types include
those in the following list:

uint64x1_t - single 64-bit unsigned integer

uint32x2_t - pair of 32-bit unsigned integers

uint16x4_t - four 16-bit unsigned integers

uint8x8_t - eight 8-bit unsigned integers

int32x2_t - pair of 32-bit signed integers

int16x4_t - four 16-bit signed integers

int8x8_t - eight 8-bit signed integers

int64x1_t - single 64-bit signed integer

float32x2_t - pair of 32-bit floats

uint32x4_t - four 32-bit unsigned integers

uint16x8_t - eight 16-bit unsigned integers

uint8x16_t - 16 8-bit unsigned integers

int32x4_t - four 32-bit signed integers

int16x8_t - eight 16-bit signed integers

int8x16_t - 16 8-bit signed integers

uint64x2_t - pair of 64-bit unsigned integers

int64x2_t - pair of 64-bit signed integers

float32x4_t - four 32-bit floats

The debugger prints these vectors as collections of individual
elements.

There are examples of these in the samples/simple directory.

Now that we've covered the vector types, let's talk about vector
programs.

As Ian Ollmann points out, vector programs are blitters: they load
data from memory, process it, then store it back to memory
elsewhere. Moving data between memory and vector registers is
necessary, but it's overhead. Taking big bites of data from memory,
processing them, then writing the results back in bulk minimizes that
overhead.

Alignment is another aspect of data movement to watch for. Use GCC's
"aligned" attribute to align data sources and destinations on 16-byte
boundaries for best performance. For instance:

float anarray[4] __attribute__((aligned(16))) = {1.2, 3.5, 1.7, 2.8};

Failure to align can result in getting the right answer, silently
getting the wrong answer or crashing. Techniques are available
for handling unaligned data, but they are slower than using aligned
data. There are examples of these in the sample code.

The sample code uses intrinsics for vector operations on X86,
Altivec and Neon. These intrinsics follow naming conventions to make
them easier to decode. Here are the naming conventions:

Table 2. Subset of vector operators and intrinsics used in the examples.

Operation           Altivec          Neon                   MMX/SSE/SSE2

loading vector      vec_ld           vld1q_f32              _mm_set_epi16
                    vec_splat        vld1q_s16              _mm_set1_epi16
                    vec_splat_s16    vsetq_lane_f32         _mm_set1_pi16
                    vec_splat_s32    vld1_u8                _mm_set_pi16
                    vec_splat_s8     vdupq_lane_s16         _mm_load_ps
                    vec_splat_u16    vdupq_n_s16            _mm_set1_ps
                    vec_splat_u32    vmovq_n_f32            _mm_loadh_pi
                    vec_splat_u8     vset_lane_u8           _mm_loadl_pi

storing vector      vec_st           vst1_u8                _mm_store_ps
                                     vst1q_s16
                                     vst1q_f32
                                     vst1_s16

add                 vec_madd         vaddq_s16              _mm_add_epi16
                    vec_mladd        vaddq_f32              _mm_add_pi16
                    vec_adds         vmlaq_n_f32            _mm_add_ps

subtract            vec_sub          vsubq_s16

multiply            vec_madd         vmulq_n_s16            _mm_mullo_epi16
                    vec_mladd        vmulq_s16              _mm_mullo_pi16
                                     vmulq_f32              _mm_mul_ps
                                     vmlaq_n_f32

arithmetic shift    vec_sra          vshrq_n_s16            _mm_srai_epi16
                    vec_srl                                 _mm_srai_pi16
                    vec_sr

byte permutation    vec_perm         vtbl1_u8               _mm_shuffle_pi16
                    vec_sel          vtbx1_u8               _mm_shuffle_ps
                    vec_mergeh       vget_high_s16
                    vec_mergel       vget_low_s16
                                     vdupq_lane_s16
                                     vdupq_n_s16
                                     vmovq_n_f32
                                     vbsl_u8

type conversion     vec_cts          vmovl_u8               _mm_packs_pu16
                    vec_unpackh      vreinterpretq_s16_u16
                    vec_unpackl      vcvtq_u32_f32
                    vec_cts          vqmovn_s32             _mm_cvtps_pi16
                    vec_ctu          vqmovun_s16            _mm_packus_epi16
                                     vqmovn_u16
                                     vcvtq_f32_s32
                                     vmovl_s16
                                     vmovq_n_f32

vector combination  vec_pack         vcombine_u16
                    vec_packsu       vcombine_u8
                                     vcombine_s16

maximum                                                     _mm_max_ps

minimum                                                     _mm_min_ps

vector logic                                                _mm_andnot_ps
                                                            _mm_and_ps
                                                            _mm_or_ps

rounding            vec_trunc

misc                                                        _mm_empty

Suggestions for Writing Vector Code

Examine the Tradeoffs

Writing vector code with intrinsics forces you to make
trade-offs. Your program will have a balance between scalar and
vector operations. Do you have enough work for the vector hardware to
make using it worthwhile? You must balance the portability of C
against the need for speed and the complexity of vector code,
especially if you maintain code paths for scalar and vector code. You
must judge the need for speed versus accuracy. It may be that integer
math will be fast enough and accurate enough to meet the need. One way
to make those decisions is to test: write your program with a scalar
code path and a vector code path and compare the two.

Data Structures

Start by laying out your data structures assuming that you'll be
using intrinsics. This means getting data items aligned. If you can
arrange the data for uniform vectors, do that.

Write Portable Scalar Code and Profile

Next, write your portable scalar code and profile it. This will be
your reference code for correctness and the baseline to time your
vector code. Profiling the code will show where the bottlenecks
are. Make vector versions of the bottlenecks.

Write Vector Code

When you're writing that vector code, group the non-portable code
into separate files by architecture. Write a separate Makefile for
each architecture. That makes it easy to select the files you want to
compile and supply arguments to the compiler for each
architecture. Minimize the intermixing of scalar and vector code.

Use Compiler-Supplied Symbols if you #ifdef

For files that are common to more than one architecture, but have
architecture-specific parts, you can #ifdef with symbols supplied by
the compiler when SIMD instructions are available. These are:

__MMX__ -- X86 MMX

__SSE__ -- X86 SSE

__SSE2__ -- X86 SSE2

__VEC__ -- altivec functions

__ARM_NEON__ -- neon functions

To see the baseline macros defined for other processors:

touch emptyfile.c
gcc -E -dD emptyfile.c | more

To see what's added for SIMD, repeat that with the SIMD command-line
arguments for your compiler (see Table 1). For example:

gcc -E -dD -mfpmath=sse -mmmx -msse -msse2 emptyfile.c | more

Check Processor at Runtime

Next, your code should check your processor at runtime to see if
you have vector support for it. If you don't have a vector code path
for that processor, fall back to your scalar code. If you have vector
support, and the vector support is faster, use the vector code
path. Test processor features on X86 with the cpuid instruction from
<cpuid.h>. (You saw examples of that in samples/simple/x86/*c.) We
couldn't find something that well established for Altivec and Neon, so
the examples there parse /proc/cpuinfo. (Serious code might insert a
test SIMD instruction. If the processor throws a SIGILL signal when it
encounters that test instruction, you do not have that feature.)

Test, Test, Test

Test everything. Test for timing: see if your scalar or
vector code is faster. Test for correct results: compare the
results of your vector code against the results of your scalar
code. Test at different optimization levels: the behavior of the
programs can change at different levels of optimization. Test against
integer math versions of your code. Finally, watch for compiler
bugs. GCC's SIMD and intrinsics are a work in progress.

This gets us to our last code sample. In samples/colorconv2 is a
colorspace conversion library that takes images in non-planar YUV422
and turns them into RGBA. It runs on PowerPCs using Altivec; ARM
Cortex-A using Neon; and X86 using MMX, SSE and SSE2. (We tested on
PowerMac G5 running Fedora 12, a Beagleboard running Angstrom
2009.X-test-20090508 and a Pentium 3 running Fedora 10.) Colorconv
detects CPU features and uses code for them. It falls back to scalar
code if no supported features are detected.

To build, untar the sources file and run make. Make uses the
"uname" command to look for an architecture specific Makefile.
(Unfortunately, Angstrom's uname on Beagleboard returns "unknown", so
that's what the directory is called.)

Test programs are built along with the library. Testrange compares
the results of the scalar code to the vector code over the entire
range of inputs. Testcolorconv runs timing tests comparing the code
paths it has available (intrinsics and scalar code) so you can see
which runs faster.

Finally, here are some performance tips.

First, get a recent compiler and use the best code generation
options. (Check the info pages that come with your compiler for things
like the -mcpu option.)

Second, profile your code. Humans are bad at guessing where the
bottlenecks are. Fix the bottlenecks, not other parts.

Third, get the most work you can from each vector operation by using
the vector with the narrowest type elements that your data will fit
into. Get the most work you can in each time slice by having enough
work that you keep your vector hardware busy. Take big bites of
data. If your vector hardware can handle a lot of vectors at the same
time, use them. However, exceeding the number of vector registers you
have available will slow things down. (Check your processor's
documentation.)

Fourth, don't re-invent the wheel. Intel, Freescale and ARM all
offer libraries and code samples to help you get the most from their
processors. These include Intel's Integrated Performance Primitives,
Freescale's libmotovec and ARM's OpenMAX.

Summary

In summary, GCC offers intrinsics that allow you to get more from
your processor without the work of going all the way to assembly. We
have covered basic types and some of the vector math functions. When
you use intrinsics, make sure you test thoroughly. Test for speed and
correctness against a scalar version of your code. Different features
of each processor and how well they operate means that this is a wide
open field. The more effort you put into it, the more you will get out.

References:

The GCC include files that map intrinsics to compiler built-ins
(e.g., arm_neon.h) and the GCC info pages that explain those built-ins.