All Your Base Are Belong To Us

April 22, 2014

tl;dr

A couple of weeks ago at Build, the .NET/CLR team announced a preview release of a library, Microsoft.Bcl.Simd, that exposes a set of JIT intrinsics on top of CPU vector instructions (a.k.a. SIMD). This library relies on RyuJIT, another preview technology that aims to replace the existing JIT compiler. When using Microsoft.Bcl.Simd, you program against a vector abstraction that is then translated at runtime to the appropriate SIMD instructions your processor supports.

In this post, I’d like to take a look at what exactly this SIMD support is about, and show you some examples of what kind of speedups and optimizations can be expected from using it.

Introduction to SIMD

Let’s start with the basics. What is SIMD (Single Instruction, Multiple Data) and why should I care? It turns out that modern processors can perform vector operations, which operate on more than a single word of data at a time. For example, SSE-compliant processors have vector instructions for adding two vectors of four floating-point numbers each in the same time (latency and throughput) as adding two single floating-point numbers. Vector operations are pretty incredible, and if your compiler can take advantage of them, you can potentially gain 4-8x speedups for certain kinds of operations on modern hardware.

For example, consider the following method that performs pointwise addition of two floating-point arrays a and b, in place:
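A minimal sketch of such a method (the method and parameter names here are my assumption, not taken from a library):

```csharp
static class VectorAddDemo
{
    // Adds b into a, element by element: one scalar addition per iteration.
    public static void AddScalar(float[] a, float[] b)
    {
        for (int i = 0; i < a.Length; ++i)
            a[i] += b[i];
    }
}
```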

Each loop iteration performs a single ADD instruction. This instruction has a latency of 1 cycle and throughput of 3 instructions per cycle on Intel i* processors. The vectorized version of this method might look as follows (I’m assuming for simplicity that the size of the arrays is divisible by 4):
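A sketch of what the vectorized loop could compile to, in SSE assembly (register assignments are assumed here for illustration, not taken from actual compiler output):

```asm
loop:
  movups xmm0, [rcx+rax*4]   ; load a[i..i+3] into a 128-bit register
  movups xmm1, [rdx+rax*4]   ; load b[i..i+3]
  addps  xmm0, xmm1          ; four single-precision additions in one instruction
  movups [rcx+rax*4], xmm0   ; store the four sums back into a
  add    rax, 4              ; i += 4
  cmp    rax, r8             ; r8 holds the array length
  jl     loop
```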

In this version, every loop iteration performs a single ADDPS instruction, which also has a latency of 1 cycle and throughput of 3 instructions per cycle on Intel i* processors. The result could be a 4x speedup, because we’re issuing 4x fewer instructions with the same throughput and latency.

Taxonomy of vectorizing compilers/runtimes

Most Windows developers use C/C++’s approach to vectorization. For years, Visual C++ has provided a set of intrinsic data types and functions that are recognized by the compiler and translated to vector instructions. For example, the preceding method could be written as follows using the Microsoft-specific vector intrinsics:
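A sketch of how that might look (the function name and loop structure are assumed; `__m128` and the `_mm_loadu_ps`, `_mm_add_ps`, and `_mm_storeu_ps` intrinsics come from `<xmmintrin.h>`):

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

// Hedged sketch: adds b into a in place, four floats per iteration.
// Assumes n is divisible by 4.
void add_pointwise_sse(float* a, const float* b, int n)
{
    for (int i = 0; i < n; i += 4)
    {
        __m128 va = _mm_loadu_ps(a + i);   // load a[i..i+3]
        __m128 vb = _mm_loadu_ps(b + i);   // load b[i..i+3]
        va = _mm_add_ps(va, vb);           // four additions in one ADDPS
        _mm_storeu_ps(a + i, va);          // store back into a
    }
}
```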

(It’s interesting to note how cryptic the compiler-generated code is. It starts by computing the apparently absurd value b - a, so that it can later increment the first array’s base pointer directly instead of using another register. I suspect the reason is that memory-addressing operands of the form [reg0+imm*reg1] can’t multiply reg1 by 16, which is required here to skip appropriately within the array.)

Clearly, manual vectorization leaves much to be desired. It means the developer is responsible for recognizing when vectorization can be beneficial, and for taking a dependency on the platform’s vector register size. For example, the preceding loop would not be able to use AVX instructions, such as VADDPS, which operates on eight 32-bit floats (a 256-bit value) at a time.

Therefore, some modern C/C++ compilers, Visual C++ included, can perform automatic vectorization of certain operations. Tight loops such as ours are great candidates for automatic vectorization. There are many difficult problems with automatic vectorization, however. The compiler must be able to prove that there are no interdependencies between the loop’s iterations, that there is no pointer aliasing, and so on. The Microsoft.Bcl.Simd package is not (at present) an automatic vectorization framework. It is instead a set of library types and intrinsics that are recognized by the JIT compiler, much like the __m128 type is recognized by the Visual C++ compiler.

Configuring Microsoft.Bcl.Simd

To use Microsoft.Bcl.Simd, you need to install it as a NuGet package. On its own, however, this package only contains a bunch of library types and methods, such as Vector4f that represents four floating-point numbers packed in a vector register. To get the runtime performance benefits, you need a JIT compiler that can recognize these types and emit the appropriate CPU instructions at runtime — and that’s RyuJIT.

After you install RyuJIT, you have both JIT compilers on your system, and the standard one (clrjit.dll) will be used unless you opt in to RyuJIT by setting an environment variable:

SET COMPLUS_AltJit=*

To enable the SIMD-specific extensions in RyuJIT, you need to set another environment variable as well:

SET COMPLUS_FeatureSIMD=1

Now we can actually explore some code that uses the types in the Microsoft.Bcl.Simd package.

SIMD pointwise vector addition

To vectorize the loop we had earlier that adds two vectors pointwise, we can use the various .NET SIMD vector types. The fixed-size Vector2f, Vector3f, and Vector4f are designed for situations where you already have a vector-like representation of your data, such as points in 2- or 3-dimensional space. There is also a generic Vector&lt;T&gt; type that works with various integral and floating-point types and is sized according to your hardware’s capabilities. Although the current preview only supports 128-bit vector operations, a future release will hopefully also support 256-bit vector operations (AVX) if the hardware provides them. That’s where Vector&lt;T&gt; is useful: you don’t have to commit to a vector size, which really depends on what your hardware can offer.

So let’s use Vector<T> to vectorize our pointwise vector addition, assuming again that the array sizes are divisible by the “SIMD length” — the size of a vector register:
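Here is a sketch of what that might look like (names such as s and va are my assumption; Vector&lt;float&gt; and CopyTo come from the package’s System.Numerics namespace):

```csharp
using System.Numerics;

static class VectorAddDemo
{
    // Hedged sketch: adds b into a in place, Vector<float>.Count elements at a time.
    // Assumes a.Length is divisible by Vector<float>.Count (the "SIMD length").
    public static void AddSimd(float[] a, float[] b)
    {
        int s = Vector<float>.Count;
        for (int i = 0; i < a.Length; i += s)
        {
            var va = new Vector<float>(a, i);  // load s floats from a
            var vb = new Vector<float>(b, i);  // load s floats from b
            (va + vb).CopyTo(a, i);            // vector add, store back into a
        }
    }
}
```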

In the preceding code, initializing a Vector&lt;float&gt;, adding two vectors together, and copying the results back to the source array are all JIT intrinsics that should use vector registers and not incur any method calls. In the current preview, unfortunately, the CopyTo method is not recognized by the JIT as an intrinsic, and incurs a pretty sizeable overhead.

So, how about we take this vectorized version for a spin? Unfortunately, we’re in for a disappointment. The vectorized version, even after getting rid of CopyTo, is actually slower than the scalar one. For example, on my Core i7-3720QM processor, the results are:

Add Standard: 0.262ms
Add SIMD: 0.300ms

Let’s take a look at the actual instructions generated and try to understand why that’s the case. Here’s what the compiler produced in the scalar case (I have annotated, reordered, and removed some instructions for brevity and clarity):
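What follows is a minimal sketch of a scalar loop of this shape, with offsets and register assignments assumed for illustration rather than copied from an actual listing:

```asm
; (the length comparison that allows eliding the range checks happens before the loop)
loop:
  movss xmm0, dword ptr [rcx+rax*4+10h]  ; load a[i] (10h: assumed offset of array data)
  addss xmm0, dword ptr [rdx+rax*4+10h]  ; scalar add of b[i]: only 32 bits of XMM0 used
  movss dword ptr [rcx+rax*4+10h], xmm0  ; store back into a[i]
  inc   eax                              ; ++i
  cmp   eax, r8d                         ; i < a.Length
  jl    loop
```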

As you can see, the scalar version is pretty well-optimized, and there is special treatment for the case when a.Length <= b.Length, which implies that no range checks are required when going through the loop. Another thing to note is that even though the compiler uses vector registers (specifically XMM0 is a 128-bit register), it only stores 32-bit values in them, so there is no automatic vectorization magic taking place.

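For the SIMD version, here is a hedged sketch of the shape of the generated code (registers, offsets, and labels are assumed for illustration, not actual RyuJIT output):

```asm
loop:
  cmp    eax, dword ptr [rcx+8]          ; range check: i against a.Length
  jae    RangeCheckFail
  cmp    eax, dword ptr [rdx+8]          ; range check: i against b.Length
  jae    RangeCheckFail
  ; (similar checks for i + s follow)
  movups xmm0, [rcx+rax*4+10h]           ; load 4 floats of a into va
  movups xmm1, [rdx+rax*4+10h]           ; load 4 floats of b
  addps  xmm0, xmm1                      ; the actual vector addition
  movups xmmword ptr [rsp+20h], xmm0     ; va spilled to a stack location
  movss  xmm1, dword ptr [rsp+20h]       ; copied back one 32-bit MOVSS at a time
  movss  dword ptr [rcx+rax*4+10h], xmm1
  ; (three more MOVSS pairs for the remaining components)
  add    eax, 4
  cmp    eax, r8d
  jl     loop
```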
The code generated for the vectorized version is much longer than the scalar version’s, and it contains several problematic sections that make it slower. The first problem is that the compiler no longer elides the range checks when accessing the arrays: there are numerous checks to verify that i is within the bounds of both a and b, that i + s is within the bounds of a, and so on. Additionally, va is treated like a stack location when copying from it back to the original array a: the copies are executed from the stack using a single MOVSS instruction for each 32-bit component of the 128-bit register. All in all, these factors produce a subpar running time.

(It’s important to understand that this is not an inevitable result. The Visual C++ compiler can take manually vectorized code and produce much faster running times. It’s only a matter of sufficient tuning in the RyuJIT compiler, which is still in preview, so I definitely hope there will be some improvement before it’s released.)

Vectorized min-max

Let’s take a look at a slightly more complex algorithm that can benefit even more from using vector instructions. Suppose we have an array of integers and we need to find the minimum and maximum elements in the array. The following scalar method will do the job:
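A sketch of such a scalar method (names assumed):

```csharp
using System;

static class MinMaxDemo
{
    // Hedged sketch: finds the minimum and maximum elements with a scalar loop.
    public static void MinMaxScalar(int[] arr, out int min, out int max)
    {
        min = int.MaxValue;
        max = int.MinValue;
        for (int i = 0; i < arr.Length; ++i)
        {
            min = Math.Min(min, arr[i]);
            max = Math.Max(max, arr[i]);
        }
    }
}
```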

This loop can also be vectorized if we make the following simple observation: if we find the minimum and maximum of all the even elements, and the minimum and maximum of all the odd elements, then the global minimum and maximum can be derived from these local ones. The same idea applies if we partition the array into more than two sets, and that’s where vectorization comes in. Here’s what we can do (again, under the simplifying assumption that the input array’s length is divisible by the SIMD vector length):
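A sketch of the vectorized version (variable names such as va, vmin, and vLessThan are assumed; Vector.LessThan, Vector.GreaterThan, and Vector.ConditionalSelect come from the System.Numerics namespace):

```csharp
using System;
using System.Numerics;

static class MinMaxDemo
{
    // Hedged sketch: tracks per-lane minima and maxima, then reduces to scalars.
    // Assumes arr.Length is divisible by Vector<int>.Count.
    public static void MinMaxSimd(int[] arr, out int min, out int max)
    {
        var vmin = new Vector<int>(arr, 0);
        var vmax = vmin;
        for (int i = Vector<int>.Count; i < arr.Length; i += Vector<int>.Count)
        {
            var va = new Vector<int>(arr, i);
            Vector<int> vLessThan = Vector.LessThan(va, vmin);
            vmin = Vector.ConditionalSelect(vLessThan, va, vmin);
            Vector<int> vGreaterThan = Vector.GreaterThan(va, vmax);
            vmax = Vector.ConditionalSelect(vGreaterThan, va, vmax);
        }
        // Reduce the per-lane minima/maxima to the global ones.
        min = int.MaxValue;
        max = int.MinValue;
        for (int i = 0; i < Vector<int>.Count; ++i)
        {
            min = Math.Min(min, vmin[i]);
            max = Math.Max(max, vmax[i]);
        }
    }
}
```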

The crux of the matter is the pair of Vector.LessThan and Vector.ConditionalSelect functions. Vector.LessThan performs a pointwise comparison of two vector registers and puts the Boolean results in another vector register: vLessThan will have all-zero lanes where va was not less than vmin, and all-ones lanes where it was. Vector.ConditionalSelect can then use the results of the comparison to conditionally place values from va or vmin back into vmin, based on which were smaller. The same applies to Vector.GreaterThan and the rest of the code.

The code is slightly less readable than a simple if statement or a Math.Min call, because we had to translate these conditional operations that only operate on a single scalar value to vector operations that can be translated to CPU instructions that operate on vector registers. The speedup, though, is considerable. On my box, I get a clean 4x speedup from using the vector version:

MinMax Standard: 1.136ms
MinMax SIMD: 0.274ms

More examples

If you are looking for more complex and realistic examples of using vector operations to improve performance, you should check out the SIMD Sample on MSDN Code Gallery, which shows how to vectorize a Mandelbrot fractal generator and a ray tracer.

Summary

As you can see, modern processors are equipped with powerful vector units that can significantly improve the performance of certain algorithms without resorting to explicit parallelism in the form of multiple threads. With Microsoft.Bcl.Simd and RyuJIT, .NET applications can also take advantage of vectorization, something that until now was available only through interop with native code.

I am posting short links and updates on Twitter as well as on this blog. You can follow me: @goldshtn
