
Since I recently had to figure this out, and every other example on the internet is wrong, here’s how to erase from a vector using “swap and pop” without invalidating iterators or causing undefined behaviour.

Since maintaining build configurations in Visual Studio is such a pain, I only have two. Unfortunately one of them had slowly become useless.

Day to day there are two configurations I need:

One for debugging, with maximum checks and optimisations off.

One for testing performance, with full optimisation and minimal checks.

I can live with the debug build having half the frame rate of the performance build. Beyond that the game becomes unplayable. It’s no good having all those checks if I can’t actually run the thing.

So I did something slightly unusual, and profiled the debug build. It turns out that having no optimisations was not the main problem; rather it was the debug iterators in the standard library.

This wasn’t entirely surprising. I used to have my own container classes for that reason, among others. But now I use the standard library all the time, because it is incredibly useful and too much work to replicate. It feels wrong to write C++ without it. But why are the debug iterators so slow? They appear to lock a mutex on every operation, which seems unnecessary. Anyway, I can’t do anything about that, short of switching development environment. And I don’t want to turn the checks off, because they’ve caught quite a few bugs already.

I found the worst cases and replaced iterators with pointers. I was using a vector as a dynamic vertex buffer. I wrapped it in a class and put in my own checks which don’t murder performance.
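As a rough illustration of that kind of wrapper (not the actual class from the game), here is a minimal sketch. The `Vertex` layout, the `VertexBuffer` name and the fixed-capacity policy are all assumptions made for the example. It hands out raw pointers and does one cheap assert per append instead of a check on every iterator operation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vertex layout; the real one is not shown in the post.
struct Vertex {
    float x, y;
    float u, v;
};

// Thin wrapper over std::vector that exposes raw pointers rather than
// iterators, with its own inexpensive assert-based overflow check.
class VertexBuffer {
public:
    // Reserve everything up front so pointers stay valid while appending.
    explicit VertexBuffer(std::size_t capacity) { verts_.reserve(capacity); }

    // Append 'count' zero-initialised vertices and return a pointer to the
    // first new one. Asserts instead of silently reallocating.
    Vertex* append(std::size_t count) {
        assert(verts_.size() + count <= verts_.capacity() && "vertex buffer overflow");
        const std::size_t first = verts_.size();
        verts_.resize(first + count);  // within capacity, so no reallocation
        return verts_.data() + first;
    }

    const Vertex* data() const { return verts_.data(); }
    std::size_t size() const { return verts_.size(); }
    void clear() { verts_.clear(); }

private:
    std::vector<Vertex> verts_;
};
```

The caller writes vertices through the returned pointer, so the hot loop never touches a checked iterator, and the assert still catches the overflow case in a debug build.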

This got it back up to within a factor of two of the performance build. Hopefully it will stay there for a while.

Up to now, I had no way to delete objects from the physics engine. They could be deactivated, but they were still there. This was fine for dead enemies and other objects that existed in limited numbers. But I wanted to have lots of particle-like objects that exist for a short time, and each one couldn’t stay around forever.

Why was there no deletion? Because it’s not that easy. A physics body is allocated from a pool. A list stores all the allocated pointers. Then there is a std::vector of pointers to all active bodies, and a pointer to the body in its current cell, which may be awake, asleep or deactivated. The user code gets a pointer to the body too, and that’s what it will give back to be deleted.

So if something wants to delete a body, I get a pointer to the body to be removed, but it needs to be removed from a number of containers, which erase by iterator. And I can’t get any of those iterators without a linear search through every container. This looks like it might be slow. What I want is constant time deletion.

Right then. I can narrow down the containers that might have pointers to the body by waking it up. Then only the active cells, active list and the main list will have the pointer. There’s a point in the update when the active cells are empty and get rebuilt from the active list. That’s where I’ll put the deletion. Until then, the body is marked as pending delete.

I already have a loop through the active bodies. I’ll hijack that and check whether any bodies need to be deleted. If they do, I swap them to the end of the vector, pop_back and return the pointer to the pool. Still constant time. That just leaves the main list. I can’t delete from that without iterating through it. But it is never used until the physics world is shut down, and I don’t care so much about performance at that point.

So I change that list to a hash set of pointers. Now I can delete from it in constant time.
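The deletion pass described above can be sketched roughly as follows. `Body`, `pendingDelete` and `removePending` are hypothetical names for the example, and returning the removed pointers stands in for handing them back to the pool. The loop uses an index rather than an iterator, so nothing is invalidated by `pop_back`.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical body type; only the flag matters for this sketch.
struct Body {
    bool pendingDelete = false;
};

// Remove every body marked pendingDelete from 'active' and from 'all'.
// Returns the removed pointers so the caller can return them to its pool.
std::vector<Body*> removePending(std::vector<Body*>& active,
                                 std::unordered_set<Body*>& all) {
    std::vector<Body*> removed;
    std::size_t i = 0;
    while (i < active.size()) {
        Body* body = active[i];
        if (body->pendingDelete) {
            active[i] = active.back();  // overwrite with the last element
            active.pop_back();          // O(1); element order is not preserved
            all.erase(body);            // average O(1) for a hash set
            removed.push_back(body);
            // Do not advance i: the element moved into slot i is unchecked.
        } else {
            ++i;
        }
    }
    return removed;
}
```

The detail that is easy to get wrong is the index: after a swap the slot holds a different body, so it has to be tested again before moving on.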

I’m not complaining that it isn’t random enough (though it often isn’t). And rand() is convenient, so maybe if you are writing a 10 line guess-the-number program it will be the quickest way to get there. But beyond that, I would never use it.

The problem is that rand() is lying. It says, ‘Hey, call this function and you get a random number, no strings attached!’. And that’s just not the case. Instead, rand() generates pseudorandom numbers that depend on some hidden global state, which can be modified by any part of the program at any time.

Suppose one system is happily generating random numbers and another system comes along and calls srand() with some inappropriate parameter. Suddenly the numbers are not random anymore. Or suppose you are relying on getting a predetermined sequence from a fixed seed. Again, a call to srand() from outside will break it.

Or maybe you are happy with the behaviour of rand(), then you compile the program on another platform and it’s not random enough anymore because they used a different algorithm.

As always with program state, the best way is to make it explicit. A random number generator has state, so make it a class. There are many types of generator and they have different properties, so make the algorithm explicit too. If it’s a Mersenne Twister call it that.

Then you know who has access to it, you know it is properly initialised, and you know it isn’t going to change from one platform to another. In short, it does what it says. Which rand() doesn’t.
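A minimal sketch of what that might look like, wrapping `std::mt19937` from the standard library. The class name and the `nextFloat` helper are assumptions for the example. The raw output sequence of `std::mt19937` for a given seed is fixed by the C++ standard, but the standard distribution classes are not, which is why the float conversion here is done by hand.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// The name states the algorithm, and the state lives in the object,
// so only code holding a reference to it can change the sequence.
class MersenneTwister {
public:
    // No default constructor: the generator is always explicitly seeded.
    explicit MersenneTwister(std::uint32_t seed) : engine_(seed) {}

    // Next raw 32-bit value.
    std::uint32_t next() { return static_cast<std::uint32_t>(engine_()); }

    // Float in [0, 1), built from the top 24 bits so it converts exactly.
    float nextFloat() { return (next() >> 8) * (1.0f / 16777216.0f); }

private:
    std::mt19937 engine_;
};
```

Two systems that each own a `MersenneTwister` cannot disturb one another, and a fixed seed gives the same sequence on every conforming platform.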

Almost every game needs to do frustum culling. There are many more objects in the game world than are visible at any one time, and the renderer should only be concerned with the ones that can be seen. There are many ways to do it. But the performance of such code is quite counter-intuitive. I decided to investigate.

The test case

I needed something extreme, so I chose a forest of 4 million trees, randomly positioned. In the end I wanted about 20000 visible, inside a 90 degree view cone. This is about how many separate objects a game could possibly render. The culling is done in 2D, but this is a roughly equivalent case to a landscape in 3D. I didn’t use any special optimisations for the culling routine; it’s standard floating point code returning whether the region is outside, partially inside or fully inside the frustum.
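For reference, a three-way test of that kind might look like the sketch below. It is not the routine used for the measurements: the types, the names and the choice of a circle as the tested region are all assumptions. The view cone is modelled as the intersection of two half-planes meeting at the eye, and the test is conservative, so a circle just behind the apex can be reported as partial when it is really outside.

```cpp
#include <cassert>
#include <cmath>

enum class Cull { Outside, Partial, Inside };

struct Vec2 { float x, y; };

// A 2D view cone: apex at 'eye', bounded by two edges whose unit normals
// point into the visible region.
struct Frustum2D {
    Vec2 eye;
    Vec2 normal[2];
};

// Build a cone looking along 'angle' (radians) with the given half-angle.
// Assumes halfAngle is below 90 degrees so the cone is convex.
Frustum2D makeFrustum(Vec2 eye, float angle, float halfAngle) {
    const float left = angle + halfAngle;
    const float right = angle - halfAngle;
    Frustum2D f;
    f.eye = eye;
    f.normal[0] = Vec2{ std::sin(left), -std::cos(left) };   // left edge, rotated inwards
    f.normal[1] = Vec2{ -std::sin(right), std::cos(right) }; // right edge, rotated inwards
    return f;
}

// Classify a circle against the cone using signed distances to both edges.
Cull classify(const Frustum2D& f, Vec2 centre, float radius) {
    Cull result = Cull::Inside;
    for (const Vec2& n : f.normal) {
        const float dist = n.x * (centre.x - f.eye.x) + n.y * (centre.y - f.eye.y);
        if (dist < -radius) return Cull::Outside;    // entirely behind this edge
        if (dist < radius) result = Cull::Partial;   // straddles this edge
    }
    return result;
}
```

The fully-inside result matters for hierarchical schemes: once a region is fully inside, everything it contains can be accepted without further tests.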

Brute force

Brute force culling of 4 million objects took 50 ms. A full frame at 60 Hz is about 16 ms, so this is clearly unacceptable. But it does show how fast the CPU can do these tests: roughly 12.5 ns each. Culling tests on the 20000 objects that were visible would only have taken 0.25 ms, if we hadn’t had to deal with all the rest.

Quad tree

I would say the standard structure for culling is a quad tree. I made one, optimised the depth, and the fastest I could make it go was 0.7 ms. That’s a huge improvement on brute force, but it only did around 8000 tests. Almost the entire time is overhead from cache misses and function calls.

Grid

Next I made a simple grid. Again, I optimised the cell size and made it as fast as I could. The result was 0.8 ms, only slightly slower than the quad tree. This time it did almost 40000 tests, but the overhead was much less, which accounts for the almost negligible difference compared to the tree.

Spatial indexing

Imagine a quad tree without any nodes except at the bottom level. The nodes are stored in the order you would see them if iterating through the tree depth first. It’s processed recursively, but without any pointers to follow it’s much more cache-friendly than a tree. This was the fastest method I found, at 0.5 ms.
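One common way to get a depth-first ordering of the bottom-level cells is a Morton (Z-order) index, which interleaves the bits of the cell coordinates. The post does not say that this is the exact scheme used, so treat the sketch below as an assumption; `spreadBits` and `mortonIndex` are names made up for the example, and coordinates are limited to 16 bits each.

```cpp
#include <cassert>
#include <cstdint>

// Spread the low 16 bits of v so that there is a zero bit between each.
std::uint32_t spreadBits(std::uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Index of a grid cell in Z-order: x bits in even positions, y bits in odd.
std::uint32_t mortonIndex(std::uint32_t cellX, std::uint32_t cellY) {
    return spreadBits(cellX) | (spreadBits(cellY) << 1);
}
```

With this layout, every aligned square of 2^k by 2^k cells occupies one contiguous run of 4^k indices. A recursive cull can therefore treat a "node" as nothing more than a start index and a length, splitting the run into four quarters to descend, with no pointers to chase.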

Conclusions

However you do it, culling shouldn’t take very long. If the total object count is in the tens of thousands, consider brute force.

A tree structure gives almost no benefit compared to a grid.

Combining hierarchical culling with good cache behaviour is the fastest approach.

Since I’ve been doing a lot of physics development and hacking the game to put in test cases, I thought I would write a little physics game. This would also provide a way to test and sanity check the whole framework.

(Oh, it failed the sanity check. But that’s all part of the process).

Pong is a very simple game, and it doesn’t need all the systems I put in place for the main game. But I wanted to use them anyway. In particular, I wanted to do Pong with real physics, simulating the bats and ball rather than following a minimal set of rules.

First problem: the bats shouldn’t move when the ball hits them. For this I had to add constraints, which didn’t exist in the physics engine before. However, I had contacts, and joints are mathematically similar. So I made two joints per bat, which could slide on the vertical axis but were fixed on the horizontal. Two, because the bats also should not rotate.

This has the side effect that hitting the top or bottom of a bat will move it, but I like that. I could have given the bats infinite mass, but I wanted them to be simulated.

There’s a little more to the game than the physics, but not much. I have three actor types:

Pong game

Bat

Ball

The Pong game actor creates the other actors, starts the game, and keeps score. The bat actor responds to input and moves the bat up and down. The ball actor waits for collisions and reports a score to the game.

Then there’s a bit of code to render the physics objects and the score, and that’s it. The game is entirely event driven because the physics runs in the background.

As the next (maybe last) step in developing the physics engine, I wanted to add rotation to the objects. It should have been pretty easy, but ended up as a big overhaul in which I changed almost everything.

So what’s needed for rotation, on top of what I already had?

The rigid bodies get some extra state variables for angular velocity and rotation.

The integration has to update these variables.

Contacts need positions in order to work out the torque.

The solver matrix needs to include moment-of-inertia factors.
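The first three items can be sketched in 2D as below. This is an illustration only, with made-up names (`RigidBody`, `applyImpulse`, `integrate`) and a plain explicit position update; it does not show the solver or the engine's real data layout.

```cpp
#include <cassert>
#include <cmath>

struct Vec2 { float x, y; };

// Hypothetical body state: the linear part plus the extra rotational part.
struct RigidBody {
    Vec2 position{0.0f, 0.0f};
    Vec2 velocity{0.0f, 0.0f};
    float rotation = 0.0f;         // radians
    float angularVelocity = 0.0f;  // radians per second
    float invMass = 1.0f;
    float invInertia = 1.0f;       // inverse moment of inertia
};

// 2D cross product: the z component of a x b.
float cross(Vec2 a, Vec2 b) { return a.x * b.y - a.y * b.x; }

// Apply an impulse at a world-space contact point. The offset from the
// centre of mass is what turns part of the impulse into spin.
void applyImpulse(RigidBody& body, Vec2 impulse, Vec2 contactPoint) {
    const Vec2 r{contactPoint.x - body.position.x, contactPoint.y - body.position.y};
    body.velocity.x += impulse.x * body.invMass;
    body.velocity.y += impulse.y * body.invMass;
    body.angularVelocity += cross(r, impulse) * body.invInertia;
}

// Advance position and rotation by one time step.
void integrate(RigidBody& body, float dt) {
    body.position.x += body.velocity.x * dt;
    body.position.y += body.velocity.y * dt;
    body.rotation += body.angularVelocity * dt;
}
```

An impulse whose line of action passes through the centre of mass gives a zero cross product and so no spin, which is why contacts without positions were enough before rotation was added.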

Not actually that much. The biggest task was adding rotation and position calculations to the collision functions. But then when I put it all in, the stability had gone straight to hell.

It actually wasn’t too bad for circles, or even circles with boxes. It was boxes against boxes that really had problems. I realised then that a single contact can’t keep a box stable, if it’s allowed to rotate. I changed it to two contacts along the contact edges. Much better, but still jittery.

The trouble with box collisions is that even tiny changes from frame to frame can dramatically alter the set of contacts. This reduces the effectiveness of contact caching, and that affects the convergence of the solver, and that means more jitter. A tidy stack is not too bad, but a big pile never quite settles.

I never solved this perfectly, but I did get it good enough with various techniques:

The collision functions have to be spot on. There’s no room for error.

Solving separately for positions and velocities prevents the intersection penalty feeding back into the next frame.

I treat low speed collisions as perfectly inelastic.

I don’t wake up objects for inelastic collisions. Instead I leave them asleep with infinite mass. This stops a jittery object waking up the whole stack.

I think there are still improvements to be had, but I have to stop somewhere. Maybe that’s it unless I get some new ideas.

The floating point registers are 128 bits wide and can process four 32-bit floats at once, but I’m running the simulation in serial, one float at a time. Can this be improved?

I can’t rely on the compiler to do anything about it. It just isn’t a simple enough case for the compiler to detect. And the potential is there for a 4x speed-up. Nothing less will do! So I have to do it by hand.

The strategy is to process four cells in parallel. And I would like the code to resemble the non-parallel version, too, to make switching back and forth for testing and debugging easy. So rewriting the whole thing using intrinsics is out. Instead, replacing every float with a ‘float4’ would turn a cell into a block of four cells, and then there would be a quarter as many of them.

I needed a float4 type, which doesn’t exist in C++. So I made one. It’s just a class with a single __m128 member. It has no named member functions (float doesn’t have any), just constructors, casts and overloaded operators. The compiler seems to optimise this pretty well if everything is inlined.
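A cut-down sketch of such a type, using the SSE intrinsics from `<xmmintrin.h>` (so x86 only). It is not the class from the game: the exact set of operators, the explicit `__m128` cast and the free `store` helper are choices made for this example, and only addition, subtraction and multiplication are shown.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics; x86 only

class float4 {
public:
    float4() : v_(_mm_setzero_ps()) {}
    float4(float s) : v_(_mm_set1_ps(s)) {}  // broadcast; implicit so scalars mix in
    float4(float a, float b, float c, float d) : v_(_mm_setr_ps(a, b, c, d)) {}
    explicit float4(__m128 v) : v_(v) {}
    explicit operator __m128() const { return v_; }

    // Lane-wise arithmetic, written as friends so float4 has no named members.
    friend float4 operator+(float4 a, float4 b) { return float4(_mm_add_ps(a.v_, b.v_)); }
    friend float4 operator-(float4 a, float4 b) { return float4(_mm_sub_ps(a.v_, b.v_)); }
    friend float4 operator*(float4 a, float4 b) { return float4(_mm_mul_ps(a.v_, b.v_)); }

private:
    __m128 v_;
};

// Copy the four lanes out to memory, lane 0 first (no alignment required).
inline void store(float4 v, float* out) {
    _mm_storeu_ps(out, static_cast<__m128>(v));
}
```

Because a scalar converts implicitly to a broadcast `float4`, an expression such as `a * 2.0f + b` reads the same whether `a` and `b` are `float` or `float4`, which is what makes it practical to switch the simulation between the two with a typedef.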

There’s just one problem. To calculate a flow, I need a cell and its neighbour. But the neighbouring cell isn’t a separate object, it’s either offset by one in the same block, or it’s the first cell in the next block. To get a block of four neighbouring cells I use shuffling and masking to shift a block to the left, and then when applying the flow I shift it the other way. This shifting doesn’t add much overall to the cost. I can make it work with or without SIMD like this: