SIMD Vector Types

In Bristol, N3571 (A Proposal to add Single Instruction Multiple Data Computation to the Standard Library) was discussed and the following straw poll was taken:
“Should C++ include a fixed length vector type to abstract vector registers”

2/2/1/4/6 SF/WF/N/WA/SA

Consensus not to move forward

This poll assumed a slightly different approach than the one presented in this proposal.
It did not make sense to take over the discussion at the time to talk about this approach.
In and after that session I was told that I should write a new proposal.

For many years the number of SIMD instructions has kept growing.
Newer microarchitectures introduce new instructions to optimize certain (common or specialized) operations.
Additionally, the size of SIMD registers has increased and may increase further in the future.

The typical minimal set of SIMD instructions for a given scalar data type comes down to the following:

Store instructions: store from a SIMD register to N successive scalar values at a given address. SSE examples:

movaps %xmm0,(%rax)
movups %xmm1,0x4(%rax)

Arithmetic instructions. SSE examples:

addps %xmm0,%xmm1
mulps %xmm1,%xmm1
divps %xmm1,%xmm0

Compare instructions. SSE examples:

cmpeqps %xmm0,%xmm1

Bitwise instructions. SSE examples:

andps %xmm0,%xmm1
xorps %xmm0,%xmm0
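
In C and C++ these instructions are exposed via compiler intrinsics.
A minimal sketch of the above SSE examples, written with the xmmintrin.h intrinsics (mem is assumed to be 16-byte aligned):

#include <xmmintrin.h>

void sketch(float *mem) {
  __m128 a = _mm_load_ps(mem);       // aligned load
  __m128 b = _mm_loadu_ps(mem + 1);  // unaligned load
  a = _mm_add_ps(a, b);              // addps
  b = _mm_mul_ps(b, b);              // mulps
  a = _mm_div_ps(a, b);              // divps
  b = _mm_cmpeq_ps(a, b);            // cmpeqps
  a = _mm_and_ps(a, b);              // andps
  b = _mm_xor_ps(b, b);              // xorps
  _mm_store_ps(mem, a);              // aligned store (movaps)
  _mm_storeu_ps(mem + 1, b);         // unaligned store (movups)
}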

The set of available operations can differ considerably between different microarchitectures of the same CPU family.
Furthermore there are different SIMD register sizes.
Future extensions will certainly add more instructions and larger SIMD registers.

There is no need to motivate SIMD programming.
It is very much needed; the only open question is: “How?”

There have been several approaches to vectorization.
I'd like to discuss only the merits of SIMD types here.

SIMD registers and operations are the low-level ingredients to SIMD programming.
Higher-level abstractions can be built on top of these.
If the lowest-level access to SIMD is not provided, users of C++ will be constrained to work within the limits of the provided abstraction.

In some cases the compiler might generate better code if only the intent is stated instead of an exact sequence of operations.
Thus higher-level abstractions might seem preferable to low-level SIMD types.
In my experience this is not the case because programming with SIMD types makes intent very clear and compilers can optimize sequences of SIMD operations just like they can for scalar operations.
SIMD types do not lead to an easy and obvious answer to efficient and easily usable data structures, though.

One major benefit from SIMD types is that the programmer can gain an intuition for SIMD.
This subsequently influences further design of data structures and algorithms to better suit SIMD architectures.

There are already many users of SIMD intrinsics (and thus a primitive form of SIMD types).
Providing a cleaner and portable SIMD API would provide many of them with a better alternative.
Thus SIMD types in C++ would capture existing practice.

The challenge remains in providing portable SIMD types and operations.

C++ has no means to use SIMD operations directly.
There are indirect uses through loop vectorization or optimized algorithms (that use extensions to C/C++ or assembler for their implementation).

All compiler vendors (that I have worked with) add intrinsics support to their compiler products to make these operations accessible from C.
These intrinsics are inherently non-portable and most of the time bound very directly to a specific instruction.
(Compilers are able to statically evaluate and optimize SIMD code written via intrinsics, though.)

Algorithms on Large Datasets

Solutions that encode data-parallel operations via the application of operations or algorithms on a larger set of data quickly lead to inefficient use of the CPU's caches.
Consider valarray:
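
Assume, for illustration, code of the following form (the concrete listing is an assumption, reconstructed from the line references below):

1 std::valarray<float> data(N);
2 ...
3 data *= 2.f;
4 data += 1.f;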

§ 6.2 [stmt.expr]: "All side effects from an expression statement are completed before the next statement is executed." does not necessarily stop the compiler from combining the operations on lines 3 and 4.
Still, for a larger number of expressions and containers, it cannot reasonably be expected/required that compilers will be able to combine the statements.
Thus, we must assume that line 3 will iterate over N values and only afterwards is line 4 allowed to iterate over the same memory again.
But modern CPUs require small working-sets to make efficient use of their caches.
Therefore N should not be too large.
General and exact bounds for N do not exist, though.

Any solution for achieving smaller working sets requires some form of loop construct in current C++.

The following is not a solution for this example unless expression templates were used in the implementation:

...
3 data = data * 2.f + 1.f;

The Array Building Blocks Approach

For the reasons sketched in the section above, Intel experimented with an approach that splits the algorithm into optimized working sets at runtime.
This required special sections of code that generated code at runtime, which was subsequently executed to do the actual work.
The semantics in different sections of the code were thus slightly different, something that can be very surprising and hard to understand for developers.

Whenever you take the approach of expressing the algorithm on the whole large data, you will either end up with inefficient cache usage or a solution resembling Intel Array Building Blocks.
(I'd be happy to learn of another — better — solution, of course.)

Cache Efficient on Large Data == Loops

Therefore loops still remain the state of the art for cache efficient processing of large data sets.
The loop count can be reduced by executing the loop body simultaneously on SIMD vectors.
(Additionally the loop can be divided onto multiple cores.)
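
A minimal sketch of such a loop, written with the float_v type proposed below (assuming n is a multiple of the SIMD width and the memory is suitably aligned):

#include <cstddef>

void scale(float *data, std::size_t n) {
  for (std::size_t i = 0; i < n; i += float_v::size) {
    float_v v(data + i);  // load float_v::size consecutive values
    v = v * 2.f + 1.f;    // one statement computes float_v::size results
    v.store(data + i);    // store them back
  }
}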

Portability Considerations

The main portability concern stems from different SIMD register widths for different targets.
This is a real problem mainly when SIMD types are used in I/O.
But this is comparable to differing endianness.
It simply requires software to use a portable interchange format (e.g. SoA or AoS of scalar types).

Typically the compiler is told the target microarchitecture via flags.
Obviously this will create code that is not guaranteed to run on older/incompatible CPUs.
An implementation might decide to compile the code for several targets, though, and select the best one to use at runtime.
But it is also possible to achieve runtime selection of the target microarchitecture without help from the compiler (e.g. Krita uses Vc this way).

This is a pure library proposal.
It does depend on implementation-specific language extensions, though (e.g. SIMD intrinsics/builtins).
The proposal builds upon the experience from the Vc library [1, 2].

Room for Follow-up Proposals

This proposal focuses on the core class only.
Therefore a lot of interesting functionality that is present in Vc will not be discussed here.
Follow-up proposals can address the following issues:

Mask types and solutions for conditional assignment / predicates

Load/store optimizations: streaming and prefetching

Gather/scatter

Shuffles, swizzles, vector shift/rotate

Portable optimized (de)interleaving

Iterators / Ranges (make iteration with SIMD types easier)

Containers for SIMD

“valarray<simd_vector<T>, N>”

Support for load/store with half-precision floats

“Fix” allocators to support SIMD types (over-aligned) throughout the C++ standard (new, std::allocator)

Types

Provide at least the following new types:

int16_v

int32_v

uint16_v

uint32_v

float_v

double_v

These types should probably be provided as well (they are not provided in Vc — I never had a need for them):

int8_v

uint8_v

int64_v

uint64_v

Each class has a single data member, an implementation-specific object to access the SIMD register (this could be __m256 with AVX intrinsics).
If the type is not supported for SIMD on the target platform, the data member will be a single scalar value.
For example double_v has one double member for ARM NEON.
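
A minimal sketch of this layout (the trait name and the AVX mapping are assumptions for illustration):

template <typename T> struct simd_register;  // maps T to the register type
template <> struct simd_register<float> {
  using type = __m256;  // with AVX intrinsics; a plain float on targets without float SIMD
};

template <typename T> class simd_vector {
  typename simd_register<T>::type data;  // the single data member
};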

The sizes of these SIMD types therefore depend on the natural size suggested by the architecture of the execution environment.
This is similar to § 3.9.1 [basic.fundamental] p2: “Plain ints have the natural size suggested by the architecture of the execution environment”.

These types should be instantiations of a SIMD vector class template: typedef simd_vector<float> float_v;.
This makes generic code slightly easier to create, without the need for SFINAE:
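
For example, a single function template can cover all vector types (the helper function is hypothetical):

template <typename T>
simd_vector<T> axpy(simd_vector<T> a, simd_vector<T> x, simd_vector<T> y) {
  return a * x + y;  // works for float_v, double_v, int32_v, ...
}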

The SIMD types all need to be inside a namespace whose name depends on the ABI/target.
For instance the symbols for SSE and AVX SIMD types must be different so that incompatible object files/libraries fail to link.

Conversions

Implicit conversion between vectors and implicit conversion from scalars to vectors follows the rules of conversion between scalar types as closely as possible.
The rules are:

If U implicitly converts to T then implicit conversion from U to simd_vector<T> also works.

If simd_vector<U>::size == simd_vector<T>::size is portably guaranteed, then implicit conversion from simd_vector<U> to simd_vector<T> also works.

In Vc the following guarantees are made:

float_v::size == int32_v::size == uint32_v::size

int16_v::size == uint16_v::size

It is very convenient to have an integer vector of the same size as a float vector.
But on the other hand this does not map naturally to all targets (e.g. AVX).
A solution that is closer to reality is to only guarantee the equality for integer vectors:

int64_v::size == uint64_v::size

int32_v::size == uint32_v::size

int16_v::size == uint16_v::size

int8_v::size == uint8_v::size

The convenience can be restored with an abstraction above the simd_vector<T> types.

Explicit conversions between vectors of possibly different size are allowed and will either drop the last values or fill the remaining values in the destination with zeros.
(Together with vector shift/rotate functions and a bit of care this allows portable code that casts between int and float vectors.)
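
A minimal sketch of these rules in use (the explicit-conversion syntax is an assumption; this proposal does not fix it):

void g(int32_v i, float_v f) {
  uint32_v a = i;   // implicit: int32_v::size == uint32_v::size is guaranteed
  int32_v b = 1;    // implicit: int converts to int32_v (broadcast, as in Vc)
  // int32_v c = f; // not implicit: float_v::size == int32_v::size is not guaranteed
  int32_v d(f);     // explicit (syntax assumed): drops excess values or zero-fills,
                    // depending on the relative sizes
}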

Loads / Stores

The most important portable way to efficiently fill a complete SIMD vector is via loads.
A load requires a pointer to at least size consecutive values in memory.
Whether load constructors should be specified needs to be discussed.
Their use does not explicitly state the operation:

float *data = ...;
float_v a(data); // it is not obvious what this line of code does
float_v b;
b.load(data); // this is a lot clearer

On the other hand there should be a way to fill a vector on construction (e.g. to ease the use of const).

As for loads, the most important portable way to efficiently store the results from a SIMD vector is via store functions.

The overloads of load/store with value_type* exist to support arguments that have an implicit conversion to value_type*.

Loads and stores can optionally do a conversion from/to memory of a different type (e.g. float_v fun(short *mem) { return float_v(mem); }).
This feature is important because:

hardware may provide special support for converting loads/stores (e.g. Intel MIC)

the pattern is otherwise much harder to use (e.g. using float for storage and double_v for calculation)
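
A minimal sketch of the second pattern (the store interface is assumed to mirror the loads shown above):

double_v step(float *mem) {
  double_v x(mem);   // converting load: reads double_v::size floats
  x = x * 0.5 + 1.;  // compute in double precision
  x.store(mem);      // converting store: writes the results back as float
  return x;
}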

Binary Operators

All arithmetic, logical, and bitwise operators are overloaded for combinations of different simd_vector<T> and builtin scalar types.

It is possible to get correct automatic conversion.
The determine_return_type type decides which type combinations are allowed and what conversions are done.

Consider:

void f(int32_v x, unsigned int y) {
  auto z = x * y;
  ...
}

We expect the type of z to be uint32_v, analogous to 1 * 1u yielding an unsigned int.

On the other hand the following example must always fail to compile:

void f(int32_v x, double_v y) {
  auto z = x * y;
  ...
}

While 1 * 1. is well-defined and yields a double, the issue with SIMD vectors is that their number of entries is not guaranteed to match.
(Consider x holding four values and y holding two values.
The semantics of a multiplication of x with y is unclear.
It could mean [x0 * y0, x1 * y0, x2 * y1, x3 * y1], or [x0 * y0, x1 * y1, x2, x3], or [x0 * y0, x1 * y1, x2 * y0, x3 * y1], or [x0 * y0, x1 * y1], or [x2 * y0, x3 * y1].
Or, for a different target, it could even simply mean [x0 * y0].)

Via SFINAE the operators can be defined such that only the following type combinations work:
simd_vector<T> × From, where is_convertible<From, simd_vector<T>> && !is_narrowing_float_conversion<From, T>
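
A minimal sketch of such a constrained operator (the exact trait wiring and the parameters of determine_return_type are assumptions):

#include <type_traits>

template <typename T, typename From>
typename std::enable_if<
    std::is_convertible<From, simd_vector<T>>::value &&
        !is_narrowing_float_conversion<From, T>::value,
    typename determine_return_type<simd_vector<T>, From>::type>::type
operator*(simd_vector<T> a, From b);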

A follow-up proposal will define the determine_return_type class.
This proposal could also discuss operator implementations for better error reporting (via static_assert) if a simd_vector is combined with an incompatible second operand.

Math Functions

All the functions in <cmath> can/should be overloaded to accept SIMD vectors as input.
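
A sketch of the intended overload set (exact signatures are assumptions):

template <typename T> simd_vector<T> sqrt(simd_vector<T> x);  // element-wise
template <typename T> simd_vector<T> sin(simd_vector<T> x);
template <typename T> simd_vector<T> atan2(simd_vector<T> y, simd_vector<T> x);

// usage: float_v r = sqrt(x * x + y * y);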

Examples

Convert Many Points to Polar Coordinates

With the simd_vector<T> class it is possible to explicitly vectorize simple loops:
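
A minimal sketch of such a loop (array names are assumptions; sqrt and atan2 as overloaded above):

#include <cstddef>

void polar(float *x_mem, float *y_mem, float *r_mem, float *phi_mem, std::size_t n) {
  for (std::size_t i = 0; i < n; i += float_v::size) {
    float_v x(x_mem + i);  // load float_v::size x values
    float_v y(y_mem + i);  // load float_v::size y values
    float_v r = sqrt(x * x + y * y);
    float_v phi = atan2(y, x);
    r.store(r_mem + i);
    phi.store(phi_mem + i);
  }
}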

It already shows a few issues that will be solved in follow-up proposals:

The alignment of the float arrays is not guaranteed to work for aligned SIMD loads and stores.

The loads and stores are not important to the algorithm, yet they take up most of the code in the loop.

If the size of the arrays is not a multiple of the SIMD width, out-of-bounds accesses would result, or an extra scalar epilogue would be required.

If this example is compiled for an x86 target with SSE2 instructions the loop body calculates four results in parallel.
If it is compiled for an x86 target with AVX instructions the loop body calculates eight results in parallel.
If the target does not support float SIMD vectors the loop body calculates just one result.

Kalman-Filter

This is a (slightly reduced) Kalman-Filter example from CERN/RHIC track reconstruction.
It is used to fit the track parameters of several particles in parallel.

In the Filter function a new measurement (m) is added to the Track state.
In the data structures used here, the objects each contain the data of float_v::size tracks.
So instead of working with one track at a time, the code explicitly states that multiple tracks can be filtered together and that their data is stored interleaved in memory.
(With tn denoting the corresponding track, the memory layout of a Track object looks like this with e.g. SSE:
[xt0 xt1 xt2 xt3
yt0 yt1 yt2 yt3
txt0 ...
]
With AVX:
[xt0 xt1 xt2 xt3
xt4 xt5 xt6 xt7
yt0 yt1 yt2 yt3
yt4 ...
])