Outsmarting the Compiler

Suppose we have two very similar structs which we need to partially populate “ahead of time” and store somewhere.
Then, a bit later, we need to very quickly finish populating the structs.
Here are some example structs:
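A minimal sketch of what such structs could look like, assuming each one has a field `a` that is filled in ahead of time, a block of padding, and a field `c` that is filled in on the fast path. The field names, the `uint32_t` types, the padding value of 16, and the `demo` helper are illustrative assumptions; only `PADDING1`, `PADDING2`, `c`, and `writeFields` are names the text itself uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed padding sizes; changing them moves `c` inside each struct.
constexpr std::size_t PADDING1 = 16;
constexpr std::size_t PADDING2 = 16;

struct A {
    std::uint32_t a;          // populated ahead of time
    char padding[PADDING1];   // stands in for fields the fast path never touches
    std::uint32_t c;          // populated on the fast path
};

struct B {
    std::uint32_t a;
    char padding[PADDING2];
    std::uint32_t c;
};

// Type-safe fast-path write: works for any struct that has a field `c`.
template <typename T>
inline void writeFields(T* s, std::uint32_t c) {
    s->c = c;
}

// Hypothetical usage: the same call works on either struct.
inline void demo() {
    A a{};
    B b{};
    writeFields(&a, 7u);
    writeFields(&b, 9u);
    assert(a.c == 7u && b.c == 9u);
}
```

Because `writeFields` names the field rather than a byte offset, it keeps working if the layout of either struct changes.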

Unfortunately, we don’t statically know what struct we are going to have to operate on; we only get this information at runtime.
We just have a blob of memory and a tag which indicates which of the two variants of the struct is sitting in the blob of memory:

enum class Variant { eA, eB };
struct Wrapper {
    Variant v;
    char payload[];
};

So, our fast path write function will need to take a wrapper struct, switch on the tag, then call the appropriate version of writeFields:
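A sketch of how that dispatch could look, under some assumptions: the struct fields are illustrative, the flexible `payload[]` is replaced by a fixed 256-byte buffer so the example stays within standard C++, and `setUp`, `fastWrite`, and `demo` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

constexpr std::size_t PADDING1 = 16;  // assumed values
constexpr std::size_t PADDING2 = 16;

struct A { std::uint32_t a; char padding[PADDING1]; std::uint32_t c; };
struct B { std::uint32_t a; char padding[PADDING2]; std::uint32_t c; };

enum class Variant { eA, eB };

struct Wrapper {
    Variant v;
    alignas(8) char payload[256];  // assumption: fixed buffer instead of payload[]
};

template <typename T>
inline void writeFields(T* s, std::uint32_t c) { s->c = c; }

// Slow path: construct the struct inside the blob and fill in what is known early.
template <typename T>
void setUp(Wrapper& w, Variant v, std::uint32_t a) {
    static_assert(sizeof(T) <= sizeof(Wrapper::payload), "payload too small");
    w.v = v;
    T* s = new (w.payload) T{};
    s->a = a;
}

// Fast path: read the tag, cast the blob to the matching type, finish populating.
// (A stricter version would pass the cast result through std::launder.)
void fastWrite(Wrapper* w, std::uint32_t c) {
    switch (w->v) {
    case Variant::eA:
        writeFields(reinterpret_cast<A*>(w->payload), c);
        break;
    case Variant::eB:
        writeFields(reinterpret_cast<B*>(w->payload), c);
        break;
    }
}

// Hypothetical usage.
inline void demo() {
    Wrapper w;
    setUp<B>(w, Variant::eB, 1u);
    fastWrite(&w, 42u);
    assert(reinterpret_cast<B*>(w.payload)->c == 42u);
}
```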

If PADDING1 == PADDING2, then, regardless of the value of the tag (which struct we are populating), we will need to write to the same offsets.
The cast and the templated function call will all compile out.
Take a look (clang-4.0 --std=c++1z -O3):

Before we move on, take a moment to appreciate what your compiler just did for you:

It allowed you to write a type safe writeFields method. If the layout of the struct changes for some reason, this part of the code will not begin to misbehave.

It removed the cost of the abstraction when it could figure out how to.

Unfortunately, if PADDING1 != PADDING2, we will need to write the value of c in a different location in struct A and struct B.
In this case, it looks like we will need to read the tag out of the Wrapper*, then branch to the appropriate writeFields method.
We are good programmers: we know that branches might be expensive, so we really want to avoid any branching.

We can skip the branch by storing the offset in our wrapper struct and precomputing the offset when the wrapper is set up.
Introduce a new wrapper type (and abandon all type safety):

struct WrapperWithOffset {
    Variant v;
    size_t offset;
    char payload[];
};

Next, we can write a new function which will operate on structs of type A or type B, but, instead of writing to c directly, it computes a pointer to c using the offset we’ve stored in the wrapper, then writes to that pointer.
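A sketch of that offset-based write, under the same assumptions as before: illustrative fields, a fixed 256-byte buffer standing in for `payload[]`, and hypothetical helper names (`setUp`, `fastWrite`, `readC`, `demo`). The padding values 16 and 173 are the ones used later in the text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>

constexpr std::size_t PADDING1 = 16;
constexpr std::size_t PADDING2 = 173;

struct A { std::uint32_t a; char padding[PADDING1]; std::uint32_t c; };
struct B { std::uint32_t a; char padding[PADDING2]; std::uint32_t c; };

enum class Variant { eA, eB };

struct WrapperWithOffset {
    Variant v;
    std::size_t offset;            // byte offset of `c` within payload
    alignas(8) char payload[256];  // assumption: fixed buffer instead of payload[]
};

// Slow path: besides the early fields, record where `c` lives for this variant.
template <typename T>
void setUp(WrapperWithOffset& w, Variant v, std::uint32_t a) {
    static_assert(sizeof(T) <= sizeof(WrapperWithOffset::payload), "payload too small");
    w.v = v;
    w.offset = offsetof(T, c);
    T* s = new (w.payload) T{};
    s->a = a;
}

// Fast path: no tag check, just a store at the precomputed offset.
inline void fastWrite(WrapperWithOffset* w, std::uint32_t c) {
    std::memcpy(w->payload + w->offset, &c, sizeof c);
}

// Helper for checking the result without casting the blob.
inline std::uint32_t readC(const WrapperWithOffset& w) {
    std::uint32_t c;
    std::memcpy(&c, w.payload + w.offset, sizeof c);
    return c;
}

// Hypothetical usage.
inline void demo() {
    WrapperWithOffset w;
    setUp<B>(w, Variant::eB, 1u);
    fastWrite(&w, 42u);
    assert(readC(w) == 42u);
}
```

The trade-off is that `fastWrite` no longer knows which struct it is writing into: if `offset` is ever out of sync with the tag, nothing in the type system will catch it.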

This code is still very slightly longer than the unsafe code written previously, but it's really not bad at all.

The compiler has succeeded in avoiding a branch using a rather clever cmp and setne instruction pair.
Essentially, clang figured out that it could compute the offset of c using the tag we’ve placed in the Wrapper’s Variant field.
In this case, I’ve allowed the enum values to default to \(0\) and \(1\) (hence the cmp dword ptr [rdi], 0, which checks whether the first thing in the function’s first argument is equal to \(0\)).
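In source terms, a cmp/setne pair amounts to turning the tag into 0 or 1 and scaling that by the distance between the two offsets. The sketch below mirrors that idea rather than clang's exact output; the struct fields, the padding values 16 and 24, and the name `offsetOfC` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t PADDING1 = 16;  // assumed values
constexpr std::size_t PADDING2 = 24;

struct A { std::uint32_t a; char padding[PADDING1]; std::uint32_t c; };
struct B { std::uint32_t a; char padding[PADDING2]; std::uint32_t c; };

enum class Variant { eA, eB };  // defaults to eA == 0, eB == 1

// Branch-free offset of `c`: start from A's offset and add the difference
// only when the tag is not eA.
inline std::size_t offsetOfC(Variant v) {
    constexpr std::size_t offA = offsetof(A, c);
    constexpr std::size_t offB = offsetof(B, c);
    // The comparison yields 0 or 1, which is what setne materializes in a register.
    const std::size_t isNotA = static_cast<std::size_t>(v != Variant::eA);
    return offA + isNotA * (offB - offA);
}

// Hypothetical usage.
inline void demo() {
    assert(offsetOfC(Variant::eA) == offsetof(A, c));
    assert(offsetOfC(Variant::eB) == offsetof(B, c));
}
```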

PADDING1 = 16 and PADDING2 = 173

Interesting.
This branch felt almost detectable in some micro-benchmarks, but I would need more testing before I’m willing to declare that it is harmful.
At the moment I’m not convinced that it hurts much.

Conclusion

No conclusion.
None of my benchmarks have managed to detect any convincing cost for this branch (even when variants are randomly chosen inside of a loop in an attempt to confuse the branch predictor), so none of this actually matters (probably).
The only interesting fact my benchmarks showed is that clang 4.0 looked very very slightly faster than gcc 6.3, possibly because of the vector instructions clang is generating, but also possibly because benchmarking is hard and I’m not benchmarking on isolated cores.
Here’s some code: gist.