Yet this is not fully framerate-independent: results differ slightly between 30 fps and 60 fps, and more importantly, framerate spikes cause all sorts of weird artifacts, leading developers to clamp delta_time as a workaround, which is not ideal.

The exponentiation method

Here is one way to fix it: assume that the code works correctly at 60 fps. This means that each frame, velocity is effectively multiplied by 1 - D / 60.
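In code, the fix can look like this (a sketch: D is the damping constant from the text, and 60 is the reference framerate the code was tuned for):

#include <cmath>

/* Raise the per-frame damping factor to the number of 60 fps frames
 * that fit in delta_time, making the damping framerate-independent. */
void Damp(float &velocity, float D, float delta_time)
{
    velocity *= std::pow(1.f - D / 60.f, delta_time * 60.f);
}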

This article describes a design choice in the C++ ABI of the Visual Studio compiler that I believe should be considered a bug. I propose a trivial workaround at the end.

TL;DR — if the topmost polymorphic class in a hierarchy has members with alignment requirement N where N > sizeof(void *), the Visual Studio compiler may add up to N bytes of useless padding to your objects.

Update: be sure to read the explanation by Jan Gray, who designed the relevant part of the MS C++ ABI some 22 years ago, in the comments section below.

My colleague Benlitz first hit the problem when trying to squeeze memory out of some of our game’s most often instantiated classes. I think it is best illustrated with the following minimal example:
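Something like this (a reconstruction: the member names are illustrative, the structure is as described below):

#include <cstdio>

struct Foo
{
    virtual void Hello() {}
    float f;
};

struct Bar : Foo
{
    double d;
};

int main()
{
    printf("sizeof(Foo): %d\n", (int)sizeof(Foo)); /* 8 everywhere */
    printf("sizeof(Bar): %d\n", (int)sizeof(Bar)); /* 16 with GCC, 24 with Visual Studio */
    return 0;
}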

There is no trick. This is by design. The Visual Studio compiler is literally stealing 8 bytes from us!

What the fuck is happening?

This is the memory layout of Foo on all observed platforms:
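bytes 0-3    vfptr
bytes 4-7    float f
total: 8 bytes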

The vfptr field is a special pointer to the vtable. The vtable is probably the most widespread compiler-specific way to implement virtual methods. Since all the platforms studied here are 32-bit, this pointer requires 4 bytes. A float requires 4 bytes, too. The total size of the class is therefore 8 bytes.

This is the memory layout of Bar on e.g. Linux using GCC:
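bytes 0-3    vfptr
bytes 4-7    float f
bytes 8-15   double d
total: 16 bytes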

The double type has an alignment requirement of 8 bytes, which makes it fit perfectly at byte offset 8.

And finally, this is the memory layout of Bar on Win32 using Visual Studio 2010:
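bytes 0-3    vfptr
bytes 4-7    padding
bytes 8-11   float f
bytes 12-15  padding
bytes 16-23  double d
total: 24 bytes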

This is madness! The requirement for the class to be 8-byte aligned causes the first element of the class to be 8-byte aligned, too! I demand a rational explanation for this design choice.

The problem is that the compiler decides to add the vtable pointer after it has aligned the class data, resulting in excessive realignment.

Compilers affected

The Visual Studio compilers for Win32, x64 and Xbox 360 all appear to create spurious padding in classes.

Though this article focuses on 32-bit platforms for the sake of simplicity, 64-bit Windows is affected, too.

The problem becomes even worse with larger alignment requirements, for instance with SSE3 or AltiVec types that require 16-byte storage alignment such as _FP128:
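For instance, here is a hypothetical Baz using SSE’s __m128 as a stand-in for such a type; the figures in the comment follow from the same layout rule:

#include <xmmintrin.h>

struct Foo
{
    virtual void Hello() {}
    float f;
};

struct Baz : Foo
{
    __m128 m; /* 16-byte alignment requirement */
};

/* GCC-style layout: vfptr (4) + f (4) + padding (8) + m (16) = 32 bytes.
 * Visual Studio layout: vfptr (4) + padding (12) + f (4) + padding (12)
 * + m (16) = 48 bytes, i.e. 16 wasted bytes. */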

Ever got a link error for a library that was referenced nowhere in your Visual Studio project or even in the final link.exe command line? Here's a hint: check the contents of static libraries, too. They may be pulling unexpected dependencies behind your back!
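For instance, on Windows the default-library directives that a static library will quietly inject into the link can be inspected with dumpbin (mylib.lib being a placeholder for the suspect library):

dumpbin /directives mylib.lib | findstr /i defaultlib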

If the static library is part of your solution, here is another hint: check that the [Configuration Properties] >> [C/C++] >> [Code Generation] >> [Runtime Library] configuration values match across projects.

Recently I needed a method for retrieving the binary logarithm of a 10-bit number (for the curious, it was for the purpose of converting between 32-bit and 16-bit floating point numbers).

Computing the binary logarithm is equivalent to knowing the position of the highest order set bit. For instance, log2(0x1) is 0 and log2(0x100) is 8.

One well-known method for fast binary logarithm is presented at Bit Twiddling Hacks. It is a two-step method where first all lower bits are set to 1 and then a De Bruijn-like sequence is used to perform a table lookup:
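Here is a sketch of that method, restricted to our 10-bit inputs (so the usual v |= v >> 16 round is unnecessary); the multiplier is the 0x07C4ACDDU constant used throughout this article:

#include <stdint.h>

static const int MultiplyDeBruijnBitPosition[32] =
{
    0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
    8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
};

static int log2_10bit(uint32_t v)
{
    /* Set all bits below the highest set bit... */
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    /* ...then multiply by a De Bruijn-like sequence and look up the result. */
    return MultiplyDeBruijnBitPosition[(uint32_t)(v * 0x07C4ACDDU) >> 27];
}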

Optimising more?

Could we do better? Now the last line is v |= v >> 8; and it is only useful to propagate the 9th and 10th bits to positions 1 and 2. What happens if we omit that line? Let’s see:

For most values of v, the expected value is obtained.

For values of v with a highest order bit at 9th position, 0x1fe could be obtained instead of 0x1ff.

For values of v with a highest order bit at 10th position, one of 0x3fc, 0x3fd or 0x3fe could be obtained instead of 0x3ff.

The list of possible output values would therefore be 0x1, 0x3, 0x7, 0xf, 0x1f, 0x3f, 0x7f, 0xff, 0x1fe, 0x1ff, 0x3fc, 0x3fd, 0x3fe, 0x3ff. What happens to these values when multiplying them by the De Bruijn sequence? Let's see:

v        (v * 0x07C4ACDDU) >> 27
0x1       0
0x3       2
0x7       6
0xf      14
0x1f     30
0x3f     29
0x7f     27
0xff     23
0x1fe    15
0x1ff    16
0x3fc    30
0x3fd    31
0x3fe     0
0x3ff     1

Damn! Two pairs of values collide: 0x1 and 0x3fe both yield 0, while 0x1f and 0x3fc both yield 30. It looks like we cannot omit the last line after all.

Beyond De Bruijn

Let’s give the problem another try. De Bruijn sequences are usually built using either nontrivial algorithms or brute force. Maybe we could find another sequence that has no collision? Or a sequence that is not a De Bruijn sequence but that works for our problem?
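Here is a sketch of such a brute-force search. Drop the v |= v >> 8 round, so that the 14 values listed above are the only ones that can reach the lookup, and scan all 32-bit multipliers for one that sends them to 14 distinct 4-bit indices, allowing a 16-entry table and a >> 28 shift:

#include <stdint.h>
#include <stdio.h>

/* The 14 values that can reach the lookup after three smearing rounds,
 * and the logarithm each of them must map to. */
static uint32_t const in[14] =
{
    0x1, 0x3, 0x7, 0xf, 0x1f, 0x3f, 0x7f, 0xff,
    0x1fe, 0x1ff, 0x3fc, 0x3fd, 0x3fe, 0x3ff
};
static int const out[14] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 8, 9, 9, 9, 9 };

int main(void)
{
    for (uint64_t m = 0; m <= 0xffffffffu; m++)
    {
        uint32_t seen = 0;
        int i;
        for (i = 0; i < 14; i++)
        {
            int idx = (uint32_t)(in[i] * m) >> 28;
            if (seen & (1u << idx))
                break; /* collision: try the next multiplier */
            seen |= 1u << idx;
        }
        if (i == 14)
        {
            /* Found one: print the multiplier and the 16-entry table. */
            int table[16] = { 0 };
            for (i = 0; i < 14; i++)
                table[(uint32_t)(in[i] * m) >> 28] = out[i];
            printf("multiplier: 0x%08x\n", (uint32_t)m);
            for (i = 0; i < 16; i++)
                printf("%d, ", table[i]);
            printf("\n");
            return 0;
        }
    }
    return 1;
}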

Down to 8 instructions instead of 12. And the lookup table is now half the size!

Conclusion

Multiply-and-shift techniques similar to the De Bruijn sequence algorithm may exist for a larger set of problems, and brute-forcing the search is a totally valid approach when the multiplier is only 32 bits.

Note: this is not an optimisation in itself. It is just one more tool you should have in your toolbox when looking for optimisations. It may be useful.

This is the trick:
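In code form (a sketch, assuming finite inputs; NaNs would defeat it):

#include <cmath>

/* min() and max() rewritten with fabs: when x > y, |x - y| == x - y,
 * so the formulas reduce to y and x respectively, and vice versa. */
static inline float branchless_min(float x, float y)
{
    return (x + y - std::fabs(x - y)) * 0.5f;
}

static inline float branchless_max(float x, float y)
{
    return (x + y + std::fabs(x - y)) * 0.5f;
}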

You can check for yourself that it is always true: when x > y, |x - y| is the same as x - y, etc.

What is it good for? There is often an implicit comparison in min or max. It might be interesting to replace it with a call to the branchless fabs.

Example usage

Consider the following code:

float a, b, c, d;
/* ... */
return (a > b) && (c > d);

That kind of code is often used, e.g. in collision checks, where a lot of such tests are done. This code does two comparisons. On some architectures, this means two branches. Not always something you want.

The test condition is equivalent to:

(a - b > 0) && (c - d > 0)

Now, when are two given numbers both positive? Exactly when the smaller of the two is positive:
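Using the branchless min from above, the two comparisons collapse into one (a sketch):

/* (a > b) && (c > d) is equivalent to min(a - b, c - d) > 0. */
static inline bool both_greater(float a, float b, float c, float d)
{
    return branchless_min(a - b, c - d) > 0.0f;
}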

The problem with implicit promotion

But now we have a problem. A subtle problem. Consider the following code:

double x = fabs(-5.0);

What does this do? Well, it depends. It depends on whether <cmath> was included or not. Because if <cmath> wasn’t included, then that code will implicitly convert -5.0 to a real and call our custom function instead of the one provided by the math library! With no compile-time warning!

This is confusing. It should not happen. But it is a well-known problem and there are several obvious workarounds:

1. What most professional C++ programmers will tell you: use namespaces

2. Mark the real(int) constructor explicit

The problem with 1. is that I am not a professional C++ programmer. I am a C programmer who uses C++. I use preprocessor macros and printf and memalign and goto. Try and stop me!

The problem with 2. is that I can no longer write real f = 15; I would need real f(15) or real f = real(15) instead. This is not acceptable: I want real to behave exactly like float and the other built-in types, to the fullest extent of what the language allows.

Another solution

Fortunately, the C++ standard has a solution for us: “Implicit conversions will be performed [...] if the parameter type contains no template-parameters that participate in template argument deduction” (ISO/IEC 14882:1998, section 14.8.1.4). You cannot have implicit conversion and template argument deduction at the same time.

It means we just have to make fabs a template function! Which means making real a template class, too.
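A sketch of the idea: give real a template parameter (a meaningless one here, just to trigger deduction) and make fabs a function template:

#include <cstdio>

template<int N = 0>
class Real
{
public:
    Real(int x) : value(x) {}    /* deliberately not explicit */
    Real(double x) : value(x) {}
    double value;
};

/* The parameter participates in deduction, so no implicit conversion
 * from double to Real<N> can happen when calling fabs(-5.0). */
template<int N>
Real<N> fabs(Real<N> const &x)
{
    return Real<N>(x.value < 0 ? -x.value : x.value);
}

typedef Real<> real;

int main()
{
    real f = 15;      /* implicit construction still works */
    real g = fabs(f); /* calls our template fabs */
    printf("%g %g\n", f.value, g.value);
    return 0;
}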

I am always interested in having the compiler do more things for me, without sacrificing code clarity or performance for the convenience. Today a colleague linked me to Pope Kim's Compile-Time Hash String Generation article, which is a perfect example of the things I like: hidden syntactic sugar that does useful things.

Inline hash function

The goal: for a given hash function, write something like HASH_STRING("funny_bone") in the code, and have the compiler directly replace it with the result, 0xf1c6fd7f.

The solution: inline the function and hope that the compiler will be clever enough.
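In modern C++ the hope can even be made a guarantee; here is a sketch using constexpr and a stand-in FNV-1a hash (the article's actual hash function, the one that yields 0xf1c6fd7f, may differ):

#include <cstdint>

/* Recursive constexpr FNV-1a; evaluated at compile time in any
 * constant context, and folded by optimisers elsewhere. */
constexpr uint32_t hash_string(char const *s, uint32_t h = 2166136261u)
{
    return *s ? hash_string(s + 1, (h ^ (uint32_t)(uint8_t)*s) * 16777619u) : h;
}

#define HASH_STRING(s) hash_string(s)

/* Forcing a constant context guarantees compile-time evaluation. */
static constexpr uint32_t funny_bone = HASH_STRING("funny_bone");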

Consider the Taylor series of sin around 0: sin(a) = a - a^3/3! + a^5/5! - a^7/7! + … The right-hand side converges very quickly to the actual value of sin(a). This allows a computer to compute the sine of a number with arbitrary precision.

And when I say it’s magic, it’s because it is! Some functions, called the entire functions, can be computed everywhere using one single formula! Other functions may require a different formula for different intervals; they are the analytic functions, a superset of the entire functions. In general, Taylor series are an extremely powerful tool to compute the value of a given function with very high accuracy, because for several common functions such as sin, tan or exp the terms of the series are easy to compute and, when implemented on a computer, can even be stored in a table at compile time.

Approximating sin with Taylor series

This is how one would approximate sin using 7 terms of its Taylor series on the [-π/2, π/2] interval. The more terms, the better the precision, but we’ll stop at 7 for now:
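A sketch of what that looks like (the article's code presumably had this shape, since it later refers to coefficients a0 to a6):

static double sin_taylor(double x)
{
    /* 7 terms of the Taylor series of sin around 0, in Horner form. */
    static double const a0 = +1.0;
    static double const a1 = -1.0 / 6;           /* -1/3! */
    static double const a2 = +1.0 / 120;         /* +1/5! */
    static double const a3 = -1.0 / 5040;        /* -1/7! */
    static double const a4 = +1.0 / 362880;      /* +1/9! */
    static double const a5 = -1.0 / 39916800;    /* -1/11! */
    static double const a6 = +1.0 / 6227020800.; /* +1/13! */
    double x2 = x * x;
    return x * (a0 + x2 * (a1 + x2 * (a2 + x2 * (a3
             + x2 * (a4 + x2 * (a5 + x2 * a6))))));
}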

If you are approximating a function over an interval using its Taylor series then either you or the person who taught you is a fucking idiot because a Taylor series approximates a function near a fucking point, not over a fucking interval, and if you don’t understand why it’s important then please read on because that shit is gonna blow your mind.

Error measurement

Let’s have a look at how much error our approximation introduces. The formula for the absolute error is simple: E(x) = |sin(x) - P(x)|, where P is our polynomial.

And this is what it looks like over our interval:

You can see that the error skyrockets near the edges of the [-π/2, π/2] interval.

“Well the usual way to fix this is to split the interval in two or more parts, and use a different Taylor series for each interval.”

Oh, really? Well, let’s see the error on [-π/4, π/4] instead:

I see no real difference! The error is indeed smaller, but again, it becomes extremely large at the edges of the interval. And before you start suggesting reducing the interval even more, here is the error on [-π/8, π/8]:

I hope this makes it clear that:

the further from the centre of the interval, the larger the error

the error distribution is very unbalanced

the maximum error on [-π/2, π/2] is about 6.63e-10

And now I am going to show you why that maximum error value is pathetic.
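The improved code keeps exactly the same Horner structure; only the seven constants a0 to a6 change. Since their values come out of the minimax computation discussed below, this sketch takes them as a parameter rather than reproducing them:

/* Same evaluation as before; only the coefficients differ. */
static double sin_poly(double x, double const a[7])
{
    double x2 = x * x;
    return x * (a[0] + x2 * (a[1] + x2 * (a[2] + x2 * (a[3]
             + x2 * (a[4] + x2 * (a[5] + x2 * a[6]))))));
}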

It doesn’t look very different, right? Right. The values a0 to a6 are slightly different, but the rest of the code is strictly the same.

Yet what a difference it makes! Look at this error curve:

That new function makes it obvious that:

the error distribution is better spread over the interval

the maximum error on [-π/2, π/2] is about 4.96e-14

Check that last figure again. The new maximum error isn’t 10% better, or maybe twice as good. It is more than ten thousand times smaller!!

The minimax polynomial

The above coefficients describe a minimax polynomial: that is, the polynomial that minimises a given error when approximating a given function. I will not go into the mathematical details, but just remember this: if the function is sufficiently well-behaved (as sin, tan, exp etc. are), then the minimax polynomial can be found.

The problem? It’s hard to find. The most popular algorithm for finding it is the Remez exchange algorithm, and few people really seem to understand how it works (or there would be far fewer Taylor series around). I am not going to explain it right now. Usually you need professional math tools such as Maple or Mathematica if you want to compute a minimax polynomial. The Boost library is a notable exception, though.

But you saw the results, so stop using Taylor series. Spending some time finding the minimax polynomial is definitely worth it. This is why I am working on a Remez framework that I will make public and free for everyone to use, modify and do what the fuck they want with. In the meantime, if you have functions to numerically approximate, or Taylor-based implementations that you would like to improve, let me know in the comments! They will make great use cases for me.

There are two major ways of using lookup tables (LUTs) in C/C++ code:

build them at runtime,

embed them in the code.

One major advantage of runtime initialisation is the choice between static initialisation (at program startup) and lazy initialisation (on demand, to save memory). Also, the generating code can be complex, or use information that is only available at runtime.

In the case of an embedded table, the generation cost is only at compile time, which can be very useful. Also, the compiler may take advantage of its early knowledge of the table contents to optimise code. However, quite often the content of embedded tables is abstruse and hardly useful to someone viewing the code. Usually this is due to the use of an external program for generation, sometimes in a completely different language. But the generation can also often be done using the C/C++ preprocessor.

Practical example

Consider the bit interleaving routine at Sean Eron Anderson's Bit Twiddling Hacks page (which, by the way, I recommend you read and bookmark). It uses the following LUT (shortened for brevity):

static const unsigned short MortonTable256[256] =
{
    0x0000, 0x0001, 0x0004, 0x0005, 0x0010, 0x0011, 0x0014, 0x0015,
    0x0040, 0x0041, 0x0044, 0x0045, 0x0050, 0x0051, 0x0054, 0x0055,
    0x0100, 0x0101, 0x0104, 0x0105, 0x0110, 0x0111, 0x0114, 0x0115,
    /* ... 32 lines in total ... */
    0x5540, 0x5541, 0x5544, 0x5545, 0x5550, 0x5551, 0x5554, 0x5555
};

The MortonTable256 table has, as its name suggests, 256 elements. It was pregenerated by some external piece of code which probably looked like this:
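A plausible reconstruction (bit n of i goes to bit 2n of the output):

#include <stdio.h>

int main(void)
{
    for (int i = 0; i < 256; i++)
    {
        unsigned short v = 0;
        for (int b = 0; b < 8; b++)
            v |= ((i >> b) & 1) << (2 * b);
        printf("0x%04x,%c", v, (i % 8 == 7) ? '\n' : ' ');
    }
    return 0;
}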

The problem with that external piece of code is that it is external. You cannot write it in this form and have it fill the table at compile time.

If you only keep the generated output, the information on how the table was created is lost. That makes it impractical to build another table that has, for instance, all values shifted one bit left. Even if such a table were created using a modified version of the above code, switching between the two tables would be a hassle unless both versions were kept behind preprocessor tests.

Preprocessor iterator

Here is one way to get the best of both worlds. First, declare the following iterator macros. They can be declared somewhere in a global .h, maybe with more descriptive names:
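A sketch of what such macros can look like (the names are illustrative):

#define S4(i)   S1(i), S1((i) + 1), S1((i) + 2), S1((i) + 3)
#define S16(i)  S4(i), S4((i) + 4), S4((i) + 8), S4((i) + 12)
#define S64(i)  S16(i), S16((i) + 16), S16((i) + 32), S16((i) + 48)

With these, MortonTable256 can be generated at compile time, for instance like this:

static const unsigned short MortonTable256[256] =
{
    /* Spread the low 7 bits of i over 14 bits. */
#define S1(i) (((i) & 1) | (((i) & 2) << 1) | (((i) & 4) << 2) | (((i) & 8) << 3) \
             | (((i) & 16) << 4) | (((i) & 32) << 5) | (((i) & 64) << 6))
    S64(0), S64(64),
#undef S1
    /* Second half: the same values with the spread 8th bit (0x4000) set. */
#define S1(i) (0x4000 | ((i) & 1) | (((i) & 2) << 1) | (((i) & 4) << 2) | (((i) & 8) << 3) \
             | (((i) & 16) << 4) | (((i) & 32) << 5) | (((i) & 64) << 6))
    S64(0), S64(64)
#undef S1
};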

In this case the macro code is slightly bigger and was slightly rewritten, but is no more complicated than the original code. Note also the elegant reuse of previous values in the second half of the table.

This trick is certainly not new, but since I have found practical uses for it, I thought you may find it useful, too.

During the past month, I have found myself having to explain the contents of this article to six different people, either at work or over the Internet. Though there are a lot of articles on the subject, almost everyone still seems to get it wrong. I was still polishing this article when I had the opportunity to explain it a seventh time.

And two days ago a coworker told me the source code of a certain framework disagreed with me… The kind of framework that probably has three NDAs preventing me from even thinking about it.

Well that framework got it wrong, too. So now I’m mad at the entire world for no rational reason other than the ever-recurring realisation of the amount of wrong out there, and this article is but a catharsis to deal with my uncontrollable rage.

A simple example

Imagine a particle with position Pos and velocity Vel affected by acceleration Accel. Let’s say for the moment that the acceleration is constant. This is the case when only gravity is present.

A typical game engine loop will update position with regards to a timestep (often the duration of a frame) using the following method, known as Euler integration:
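A sketch of that update (Vec3 and its operators stand in for whatever vector type the engine provides):

struct Vec3
{
    float x, y, z;
    Vec3 operator +(Vec3 v) const { return Vec3{ x + v.x, y + v.y, z + v.z }; }
    Vec3 operator *(float s) const { return Vec3{ x * s, y * s, z * s }; }
    Vec3 &operator +=(Vec3 v) { return *this = *this + v; }
};

struct Particle
{
    Vec3 Pos, Vel, Accel;

    /* Euler integration: advance position using the current velocity,
     * then advance velocity using the acceleration. */
    void Update(float dt)
    {
        Pos += Vel * dt;
        Vel += Accel * dt;
    }
};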

The motion is described by two differential equations, dPos/dt = Vel and dVel/dt = Accel. Putting them into Euler integration gives us the above code.

Measuring accuracy

Typical particle trajectories would look a bit like these three runs of the above simulation, all with the same initial values:

once with maximum accuracy,

once at 60 frames per second,

once at 30 frames per second.

You can notice the slight inaccuracy in the trajectories.

You may think…

“Oh, it could be worse; it’s just the expected inaccuracy with different framerate values.”

Well, no.

Just… no.

If you are updating positions this way and you do not have a really good reason for doing so then either you or the person who taught you is a fucking idiot and should not have been allowed to write so-called physics code in the first place and I most certainly hope to humbly bestow enlightenment upon you in the form of a massive cluebat and don’t you dare stop reading this sentence before I’m finished.

Why this is wrong

When doing kinematics, computing position from acceleration is an integration process. First you integrate acceleration with respect to time to get velocity, then you integrate velocity to get position.

The integral of a function can be seen as the area below its curve. So, how do you properly get the integral of our velocity between t and t + dt, i.e. the area below the velocity curve on that interval? Since acceleration is constant, velocity is an affine function of time, and that area is a trapezoid: Vel * dt + Accel * dt * dt / 2. That last term is exactly what the Euler update above drops.

And now for the correct code

This is what the update method should look like. It’s called Velocity Verlet integration (not strictly the same as Verlet integration, but with a similar error order) and it always gives the perfect, exact position of the particle in the case of constant acceleration, even with the nastiest framerate you can think of. Even at two frames per second.
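Reusing the Vec3 and Particle types from the Euler sketch above, a sketch of that update:

/* Velocity Verlet: the position update now includes the
 * Accel * dt * dt / 2 term, i.e. the exact area under the velocity
 * curve when acceleration is constant. */
void Update(float dt)
{
    Pos += Vel * dt + Accel * (0.5f * dt * dt);
    Vel += Accel * dt;
}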

Glenn Fiedler also explains in this article why idiots use Euler, and provides a nice introduction to RK4 together with source code.

Florian Boesch wrote a thorough coverage of various integration methods for the specific application of gravitation (one of the rare cases where Euler actually seems to perform better).

In practice, Verlet will still only give you an approximation of your particle’s position. But it will almost always be a much better approximation than Euler. If you need even more accuracy, look at the fourth-order Runge-Kutta (RK4) method. Your physics will suck a lot less, I guarantee it.

Acknowledgements

I would like to thank everyone cited in this article, explicitly or implicitly, as well as the commenters below who spotted mistakes and provided corrections or improvements.