Building a C++ SIMD abstraction (2/N) – Status Quo (My Perspective)

Before looking at what SIMD abstraction I’ve come up with (that, by the way, isn’t super novel), I think it’s important to look at the status quo of existing methods for vectorizing C++ code. If you haven’t yet, have a look at my previous post for motivation on why vectorizing code is useful.

In my first post, I called out some aspects of programming that I use to evaluate how well a technique or library will fit my problem. While not a comprehensive list, they include:

Readability

Performance

Ease of writing correct code

Level of control (related to performance)

Ability to compose with existing C++ code

Portability

These criteria are of course colored by my background and style, but hopefully they can illuminate why I think there's a real gap in the space of SIMD vectorization offerings in C++. As I look at different methods of writing vectorized code, I will draw upon these in various combinations to "grade" options, where each solution will have strengths and weaknesses to consider.

The categories I will (briefly) look at are:

Hand-coded intrinsics/assembly

Annotated C++ directives

Language extensions + non-standard tool chains

Libraries providing SIMD enabled types

(honorable mention) Parallel STL algorithms

Let’s dive in!

1. Hand-coded intrinsics/assembly

If you tremble when you think about this, know that I am right there with you. The implication of using these as a first-resort method for accessing CPU vector instructions is that "the compiler can't do what I want it to do, so I'll hand-compile the code myself". For example:
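To give a flavor of what this looks like, here is a small stand-in kernel in the same style, using SSE intrinsics. The kernel and its expression are my own hypothetical example, not taken from any real codebase:

```cpp
#include <immintrin.h>  // SSE intrinsics (baseline on any x86-64 compiler)

// Hypothetical example kernel: processes 4 floats per __m128 register.
// Assumes n is a multiple of 4.
void kernel(const float* a, const float* b, const float* c, float* out, int n) {
  for (int i = 0; i < n; i += 4) {
    __m128 va = _mm_loadu_ps(a + i);  // unaligned load of 4 floats
    __m128 vb = _mm_loadu_ps(b + i);
    __m128 vc = _mm_loadu_ps(c + i);
    __m128 t  = _mm_mul_ps(_mm_add_ps(va, vb), vc);
    t         = _mm_div_ps(t, _mm_sub_ps(vc, vb));
    _mm_storeu_ps(out + i, t);        // store 4 results
  }
}
```

Even for this tiny expression, the reader has to mentally decompile nested function calls back into algebra.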

Can you, within 3 seconds, figure out what mathematical expression this code is calculating? If so, you're probably a rare breed and not likely to be interested in abstractions that do the heavy lifting for you. For the rest of us (the vast majority, I would guess), this is a struggle to look at when dealing with non-trivial amounts of code. To be honest, the above code example really isn't that complicated, but the approach does not scale well at all to complicated expressions.

The reason this ends up more difficult than it needs to be is that regular algebraic math gets expressed in somewhat unnatural syntax (intrinsics), or worse, very unnatural syntax (assembly). The skill of being able to read assembly when you need to is different from requiring it of anyone who reads the code (i.e. people who implement all of their code with intrinsics). Here is how I view the trade-offs:

+Performance/Control

This solution is almost always selected for cases where control of code generation is absolutely required and (hopefully) other solutions have been completely ruled out. However, it should be mentioned that without some form of regular benchmarking, which should be automated (!), hand-coded solutions can end up becoming performance regressions once compilers are finally able to generate better code than was written by a human. Most super low-level performance experts will probably push back on this argument, but that further proves the point that this only works well with a small number of programmers who are "fit" and "able" to cope with the complexity.

-Readability/Writing correct code/Portability

When you add control to any design decision, you add complexity, because you must then possess more knowledge about the system, whether it be the software system or the runtime hardware. As I mentioned above, what is naturally expressed in C++ as operator-based expressions turns into nothing but function calls. This can make non-trivial expressions very difficult to reason about: not impossible, but certainly not easy.

Writing correct code can also be a challenge because the correctness of the abstract mathematical expressions in the implementation is now very tightly coupled to the exact instructions selected to implement them. Furthermore, when you add the mental indirection of needing to re-map unnatural syntax into natural syntax (how our brains view math), there's ample opportunity to make mistakes.

Lastly, portability is completely off the table when you hand-code anything needing SIMD: the kernels you write this way target only a particular instruction set. Thus you must maintain multiple versions of the kernel if you need to support more than one instruction set (a very common problem). On the other hand, intrinsics (at least in the world of x86) are available across gcc/clang/icc/MSVC, which generally means you at least have some amount of compiler portability.

~Composing with existing C++

I consider this a wash: most code I've seen using intrinsics tends to stay away from using them at function/class interface boundaries. Obviously I can't speak for all code out in the wild, but at least the use of intrinsics in a function body won't interfere with the function's interface, so it's more "opt-in" how invasive intrinsics become.

So when should I use intrinsics/assembly?

I recommend using intrinsics or assembly only when the following are true:

you have demonstrated that the kernel needing intrinsics is indeed a performance bottleneck

the size of the code written in intrinsics is small: something like a single function

you are equipped with unit tests that can verify the correctness of intrinsics-based code

you are equipped with benchmarks that can monitor when your hand-written code is no longer worth having (over using another alternative)

2. Annotated C++ directives

On the other end of the spectrum, there are ways of decorating C++ code with preprocessor directives. Options here include OpenMP SIMD directives, OpenACC, and other, more esoteric options like ‘ivdep’ found in the Intel Compiler. All of these options share similar traits, so I’ll talk about them generally, realizing that they all still have tangible differences. For example:
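As a stand-in for the loop in question (the specific loop body is my own hypothetical example), an OpenMP SIMD version looks like:

```cpp
#include <cmath>

// Hypothetical example loop. Compile with -fopenmp-simd (gcc/clang) or
// -qopenmp-simd (icc); without such a flag the pragma is simply ignored.
void normalize(float* out, const float* x, int n) {
  #pragma omp simd
  for (int i = 0; i < n; ++i)
    out[i] = x[i] / std::sqrt(x[i] * x[i] + 1.0f);  // note the call into the math library
}
```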

This code is very readable from the perspective of syntax. The actual C++ is unchanged; the pragma is used to tell the compiler "please vectorize this loop, I know it is safe to do so". Technically the compiler is free to vectorize loops even without the decorations, but it's so easy to pessimize that optimization that I don't really consider it an actual strategy for vectorizing C++.

Trivia question: what is in the loop that might cause the compiler to back out doing any vectorization?

Let’s look at the trade-offs:

+Writing correct code/Readability/Portability

The pragma-based approach, to my knowledge, came about because people wanted the ability to affect the performance of code (both threading and vectorization) without a complete rewrite. In that sense, it delivers: in the common case, code decorated with pragmas will still behave the same way as the non-decorated code. Thus the correctness of your implementation is entirely a function of how correct your existing code is, meaning the pragmas themselves don't detract from correctness.

Readability is in the same boat as correctness: your code is largely as readable as a non-vectorized version. There is some amount of noise added by the decorations, but this is sufficiently “out of the way” to keep it less of a distraction from reading the actual C++.

Finally, portability I consider a “plus” because the kernels you write in this way are not mapped at all to a particular instruction set. However, this should not be confused with performance portability, which I’ll mention below.

-Level of control/Composing with existing C++

The big drawback here is that you are largely at the mercy of what you can say to your compiler. There are two aspects to this drawback. First, it's a struggle to find any pragma-based approach that is portable among compilers. In other words, you may always get correct/portable code that will compile, but the performance you get will likely vary widely. The latest gcc, clang, and Intel compilers all implement OpenMP SIMD directives, but gcc and clang's vectorized code generation leaves much to be desired (and you don't have those directives in MSVC anyway).

Furthermore, pragma-based approaches either require everything called inside a vectorized loop to be inlined, or require library interface functions to be equivalently decorated to provide vector versions of the library (never mind linking the correct ISAs together). If a library doesn't provide these, but instead provides separate interfaces for different ISA-specific SIMD widths (e.g. Embree ray intersection functions), there is nothing that lets me select function calls based on the particular instruction set being compiled for. Even if there were, I can't safely cast between scalar types and their vectorized counterparts (i.e. 'int' into '__m256i'): as soon as I reinterpret_cast (oh no!) an 'int' inside a loop to a '__m256i', the compiler will treat my loop differently and almost always turn off vectorizing optimizations.

~Performance

Performance with pragmas can go either way, making it a wash in my mind. The serious performance portability issues make it difficult to maintain good performance across standard tool chains, though it can certainly be demonstrated that, when the pragmas do work, you get fast(ish) code.

One downside of performance with pragma approaches is that you can very easily introduce performance regressions. Someone could sit down and verify that a vectorized loop indeed gets the expected vector code generation. Later, however, someone else could make just a change or two to the loop that accidentally prevents the optimizer from doing any vectorization: the code will compile and it will probably execute correctly. The only recourse you have is hoping that future programmers will hunt down good code generation as you do. With other techniques it is harder (but again, not impossible) to "stumble" into performance regressions like that.

So when should I use annotations/directives to vectorize code?

My technical criteria for choosing this approach are as follows:

The kernel in question can be entirely inlined

The kernel in question does not call into any external libraries

The kernel is relatively small (easier for the compiler to reason about)

Code optimizations are being done in a short enough time frame that other options will take too long (i.e. just experimenting or making a proof-of-concept)

3. Language extensions + non-standard tool chains

There have been some attempts at creating language extensions to C++ that better enable vectorizing optimizations in the compiler (I see you, Cilk+), but the biggest impact here has been the use of non-standard compilers (CUDA, OpenCL, SYCL, ispc, etc.) to write code that breaks from the C++ standard in order to express data parallelism.

I think it’s important to point out that CUDA, OpenCL, SYCL, and ispc all express the same thing: SPMD, or Single Program Multiple Data. With SPMD, you write normal “scalar” code, where each value of a data type (i.e. ‘int’, ‘float’, etc.) is considered varying. This simply means that computations are done in batches of size N at a time; if a value in a batch is the same for every element of the batch, it is marked uniform. By default these solutions make everything ‘varying’, and they each have their own way of specifying which values are ‘uniform’. All four of these technologies prove that the execution model works for CPUs, GPUs, and other accelerators like FPGAs.
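The execution model itself can be sketched in plain C++ (a conceptual sketch only; the batch width and names are hypothetical): the inner "lane" loop is what an SPMD compiler maps onto SIMD lanes, with 'a' playing the role of a uniform value and the array elements being varying.

```cpp
constexpr int kLanes = 8;  // batch ("gang"/"warp") width; hypothetical

// SPMD-style saxpy, written as if it were scalar code per program instance.
// 'a' is uniform (same for every lane); x/y/out elements are varying (per-lane).
// Assumes n is a multiple of kLanes.
void spmd_saxpy(float a, const float* x, const float* y, float* out, int n) {
  for (int i = 0; i < n; i += kLanes)          // one batch of program instances
    for (int lane = 0; lane < kLanes; ++lane)  // what an SPMD compiler maps to SIMD lanes
      out[i + lane] = a * x[i + lane] + y[i + lane];
}
```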

When will someone finally generate an x86 backend for the CUDA compiler inside clang/llvm? Just saying…

Now to the trade-offs:

+Readability/Performance

SPMD code really shines with readability because the syntax looks just like regular code. I can simply say what is ‘varying’ and what is ‘uniform’ and let everything else be regular-looking code. A kernel written in plain C++ and then vectorized with any of the aforementioned options will look remarkably close to the original.

Performance is also demonstrably great with this programming model. Just look at how various vendors achieve great performance on their hardware: usually one of these options is the answer. The serious downside is that portability is only possible with OpenCL, and performance portability is very difficult (if possible at all with a single implementation). What encourages me is that the SPMD programming model seems to have permeated much of the GPU (and even some of the CPU) throughput problem space; now we just have to figure out how to get these technologies to play nicer with each other, which sadly may never come to be.

-Composing with existing C++/Portability

The biggest problem here is that this option requires invasive tool chain changes to work properly. OpenCL and ispc require separate source trees, while CUDA and SYCL at least let you use a “single source” model and bring together your “host” and “kernel” code. Minor things like ispc’s ability to digest carefully written C++ headers help a little, but not enough: you simply have to write code that isn’t truly C++ to use these options. This has improved over the years, but it is still a hurdle one must clear.

Portability is also a problem: while all of these support the SPMD programming model, the ways they each spell it are not compatible at all. Furthermore, the portability outlook for any one of them is abysmal. There isn’t yet an option that lets me write SPMD code once and have it run reasonably on all data parallel hardware platforms.

~Level of control/Writing correct code

I view the level of control in a negative light, though you could easily argue the other way in some cases. The idea of SPMD is that you intentionally give the compiler leeway to vectorize code on your behalf (I’m talking about ispc gangs, CUDA warps, and OpenCL work groups here), but you directly express what is data parallel and what is not: a big differentiating factor from pragma-based approaches, which still require the compiler to figure out what can be data parallel. This means some micro-optimizations are off the table, but it’s not as bad as you may think. I think SPMD does a reasonable job of exposing enough parallelism to the compiler that you can have confidence in what the code generation is going to look like.

Writing correct code can also go either way for me. On the one hand, I think the expressiveness of SPMD for communicating parallelism to both the machine and fellow programmers lends itself to writing correct vector code. On the other hand, the tools required/available for debugging are all over the place, some of which need special hardware to work at all (CUDA requires a GPU unless you pay for PGI’s special compilers, which adds to the difficulty). One nice argument in favor of ispc on the debugging front is that you’re just generating CPU code, which can be handed directly to a debugger like any other C++.

So when should I use non-standard tools?

My criteria for choosing these kinds of tools are as follows:

The target hardware has no (good) alternatives (GPUs)

When I can (for free) generate good vectorized x86 with C++ as the kernel language

I think this option has great potential, but is a bit too fragmented right now for me to generally vectorize C++.

4. Libraries providing SIMD enabled types

This is the territory that I’m driving toward, which maybe you’ve picked up by now. C++’s type system is very rich, so it’s a great breeding ground for creating abstractions that let us express intent in code without the need to disrupt our existing tool chain. Sounds promising, right?

It’s nice when someone else’s library takes care of the low-level details of a problem for you. However, I want to distinguish libraries that vectorize a problem you are interested in solving (e.g. Eigen or Intel MKL) from libraries that provide just a SIMD abstraction, which can then be used to vectorize a problem.

A great example of the latter is boost.simd, a library that provides an abstraction for using SIMD registers. With a library like this, you can very easily write natural-looking C++ expressions that use the C++ type system to concretely guarantee good vectorized code generation. In the end you write code with some of the benefits of SPMD code, but you do so in plain C++ and without disrupting your build system.

I’ll mention that the SIMD library I am helping write (tsimd) is in the same vein as boost.simd, but makes some different design decisions. The differences are not huge, but they are enough for us to keep working on it. There are also some minor social influences at play, beyond purely technical comparisons. However, I’ll save that discussion for future posts in this series.

Trade-offs:

+Everything

Kind of extreme to say all of the aspects of the solution are positives? Maybe! I’ll explain some of the rationale, though.

Readability is a positive because the approach seeks to make vector registers usable with natural syntax. This is achieved by defining a vector register type (e.g. vfloat, standing for “vector float” or “varying float”) and overloading all the operators for that type (e.g. operator+). This brings you almost all the way to the same syntax as SPMD programming constructs (not perfect, though), with the implementation of each operation mapping directly to one or more intrinsic functions. Combine this with the power of C++ inlining and modern C++ compiler optimizers, and you get great performance.
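As a minimal sketch of the idea (hypothetical type and operators, not the actual boost.simd or tsimd API), using SSE:

```cpp
#include <immintrin.h>

// Minimal sketch of a SIMD-register wrapper type (names are hypothetical).
struct vfloat4 {
  __m128 v;
  vfloat4(__m128 x) : v(x) {}                      // implicit to/from the intrinsic type
  explicit vfloat4(float x) : v(_mm_set1_ps(x)) {} // broadcast a scalar to all 4 lanes
};

// Each operator maps directly onto a single intrinsic.
inline vfloat4 operator+(vfloat4 a, vfloat4 b) { return _mm_add_ps(a.v, b.v); }
inline vfloat4 operator*(vfloat4 a, vfloat4 b) { return _mm_mul_ps(a.v, b.v); }
```

With this in place, `vfloat4 r = a * b + vfloat4(1.0f);` reads like scalar math but lowers to straight SSE instructions.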

Writing correct code inherits the argument that natural-looking code fosters correctness, but with C++11 (and beyond) we also get the ability to check correct use of SIMD types using things like static_assert(). This makes it straightforward to give programmers an exact error message about how a type was misused, which helps ease fixing errors. Furthermore, because it’s all just C++, standard debugging tools work well.

Level of control is also not sacrificed, because the abstraction is over SIMD registers directly, meaning that the implementation of every operation (i.e. operators, library functions, etc.) is tightly controlled rather than being opaque inside the compiler.

Portability is also not sacrificed, because it’s all plain C++. One design decision I think is important for SIMD type libraries is for the abstraction to work for 1-wide SIMD registers. This gives you the ability to instantiate vectorized code in a scalar fashion, letting you compile for architectures that may not have SIMD instructions available.
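A hedged sketch of what that design can look like (hypothetical, not the tsimd implementation): parameterize the pack on its width and let W = 1 degenerate to plain scalar code.

```cpp
// Width-parameterized pack; the W == 1 instantiation is just a scalar,
// so the same kernels compile for targets with no SIMD at all. (Hypothetical sketch.)
template <int W>
struct pack { float v[W]; };

template <int W>
pack<W> operator+(const pack<W>& a, const pack<W>& b) {
  pack<W> r;
  for (int i = 0; i < W; ++i) r.v[i] = a.v[i] + b.v[i];  // maps to SIMD (or scalar) per target
  return r;
}

using vfloat1 = pack<1>;  // "1-wide SIMD": plain scalar code
using vfloat8 = pack<8>;
```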

5. But what about the parallel STL in C++17?

That’s a good question, and I think the answer lies in understanding what the C++17 parallel STL provides. For those who are not aware, C++17 provides new overloads to standard algorithms with an execution policy: sequenced, parallel, and parallel_unsequenced (cppreference). These policies allow your standard library to make choices about how to execute the algorithm function you are calling.

What I’m trying to explore in these posts is how you would write vectorized code in C++ directly, meaning that the options discussed in this post would all potentially be implementation details of the parallel algorithms. The parallel_unsequenced (i.e. vectorized) execution policy still requires your standard library implementation to vectorize the algorithm, otherwise it is no different! So I’m not going to talk much about the parallel STL, because I view it as orthogonal to the topic at hand: something like tsimd should coexist, not compete, with the parallel STL.

Conclusion

This post got a bit long because there are a lot of options out there for writing data parallel code for SIMD hardware. I’ve only scratched the surface of each, but I think you should now have a better idea of why something like tsimd is being built and the gap it’s trying to fill. As for the other solutions mentioned in this post, I think they should continue to be developed, because they all have existing users who depend on them…a good thing!

Future posts are going to dive into tsimd and look at what it offers, some of its design decisions, and its progress over time. I hope you find it interesting enough to come back for more!

I’ll be going on vacation soon where I’ll largely “unplug” around the holidays, so it may be an extra week or two before I get to the next post…but it will come eventually.

I have briefly looked at Vc, though that was well after tsimd was started. boost.simd and the simd wrappers inside the Embree ray tracing kernels were much more of an influence on tsimd. That said, it appears that tsimd and Vc are very closely related: both try to implement the concept of operating on simd registers without being closely tied to a particular CPU ISA. Thus I have no specific objections, as I am not an expert in Vc, and I’m VERY happy to see such a library proposed for the C++ standard.

Since I am not an expert in Vc or the proposal, please let me know where any of the following is incorrect…here’s what I believe differentiates tsimd from Vc and the proposal:

– Implicit conversions to intrinsic types:

One intentional design decision in tsimd is to allow simd types from the library to implicitly convert to/from the underlying intrinsic types. This allows programmers with existing intrinsics kernels to incrementally switch to the tsimd simd types, line by line. For an example, check out part 3 of the series. I am not sure this is a critical component for the standard proposal, however, as it is a feature meant to help with porting and to let very advanced users drop down to intrinsics in some niche case: I don’t think that is needed for the standard library.

– Fallback for wider-than-native simd types:

If an algorithm requires a particular width, tsimd guarantees that if the width is not natively supported by the ISA being compiled for, then arithmetic functions/operators/etc. will fall back to multiple instances of a smaller width (which may be native for an older ISA). I’m not sure if Vc does this too…it could be a wash if it does.
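A hedged sketch of the fallback idea (hypothetical names, not the tsimd API): a 16-wide pack stored as two 8-wide halves, with each operation applied half by half.

```cpp
// 8-wide pack standing in for a natively supported width.
struct pack8  { float v[8]; };
// 16-wide pack built from two 8-wide halves.
struct pack16 { pack8 lo, hi; };

inline pack8 add(const pack8& a, const pack8& b) {
  pack8 r;
  for (int i = 0; i < 8; ++i) r.v[i] = a.v[i] + b.v[i];  // stand-in for a native 8-wide add
  return r;
}
inline pack16 add(const pack16& a, const pack16& b) {
  return {add(a.lo, b.lo), add(a.hi, b.hi)};  // one 16-wide op as two 8-wide ops
}
```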

– No ABI involvement:

This could be viewed as a drawback, or as a non-issue, with tsimd. As far as I can tell, Vc encodes binary compatibility for the underlying CPU ISAs into the type. This prevents kernels that use simd types in their interfaces from accidentally being incompatible when each kernel is compiled for a different ISA (even though both may individually be compatible with a CPU at runtime). The intention with tsimd is that you shouldn’t use tsimd types at interface boundaries that may cross ISA boundaries (i.e. don’t try to interface vbool16 and vbool8 between AVX2 and AVX512). Given that tsimd is being developed primarily alongside OSPRay, we simply don’t have this problem…but I won’t ignore the fact that it *could* be an issue for other folks. I think you can live without it, but that’s just my opinion!

– Better mask compatibility:

I don’t have a feel for how Vc handles this one, but I encountered it with boost.simd: the mask types for vfloat and vint were _different_, making kernels that mix integer and floating-point simd operations a pain to write with masks. In tsimd, masks only match the size of the elements (bool32 vs. bool64), but this is something I’d actually like to fix so there is only a single vbool type. That starts blending _spmd_ concepts into a _simd_ library, which muddies the waters a bit. You can probably build a thin spmd layer on top of a strictly simd library, but tsimd does a blend. Not a HUGE deal, in my opinion, but it could be to other people who want absolute control.

There may be more differences that pop up as I learn more, but that’s what I currently see at the moment.