Data-oriented design (DOD) means looking at your game’s state as data being transformed into other data. In object-oriented programming (OOP), you view the world as objects, each holding its own state and interacting with the world around it. A small example:

Suppose this is our game world. We have four balls bouncing around a room. When a ball hits a wall, it bounces off in a different direction.

In OOP, we say each ball has a position, a direction and a speed. Each ball gets an Update function that updates the position and a Render function that renders the ball’s sprite at its current position. This is what the data looks like:

We have a Ball class with some members and we have a BallManager class that holds a list of Ball instances. To update and render the balls, the BallManager calls each ball’s Update and Render functions.
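As a minimal sketch, that OOP layout could look like the code below. The Vec2 type, the public members and the empty Render body are all illustrative stand-ins, not the article's actual code:

```cpp
#include <cstddef>
#include <vector>

// Minimal stand-in for a 2D vector type.
struct Vec2 { float x, y; };

class Ball
{
public:
    Vec2  position;
    Vec2  direction;
    float speed;

    // Move along the direction, scaled by speed and elapsed time.
    void Update(float dt)
    {
        position.x += direction.x * speed * dt;
        position.y += direction.y * speed * dt;
    }

    // Draw the ball's sprite at its current position (rendering omitted).
    void Render() {}
};

class BallManager
{
public:
    std::vector<Ball> balls;

    void Update(float dt)
    {
        for (std::size_t i = 0; i < balls.size(); ++i) balls[i].Update(dt);
    }

    void Render()
    {
        for (std::size_t i = 0; i < balls.size(); ++i) balls[i].Render();
    }
};
```

Each object owns its state, and the manager simply loops over the objects.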

The advantages are not obvious. It just looks like I moved some stuff around. How is that going to help?

A good way of trying to wrap your brain around data-oriented programming is to think of it as a factory: data comes in, an operation is performed and data comes out.

The Problem with Object-Oriented Programming

You might not realize there is a problem, but it has been slowly creeping up on the industry. Computers, consoles and even phones nowadays have multiple cores, yet our programs are traditionally written for single-core machines. Yes, we can do multithreading. Yes, we have job systems. But our data is not inherently multithreadable: each object is treated as its own entity, which makes it difficult to stream.

The inherent problem is that while our processors have been steadily increasing in speed (at about 60% per year), memory hasn’t kept up, increasing speed by only 10% per year. The gap is widening, and it is a problem. Keeping your data in cache (very fast but small memory near the processor) is becoming increasingly important. If you can keep all your data as near to the processor as possible, you can squeeze some significant performance from it.

As a game programmer in training, I learned about the difference between an Array of Structures (AoS) and a Structure of Arrays (SoA). I never paid much attention to it, because it seemed like such a hassle. But that was because I wasn’t forced to think exclusively in AoS. When I did an internship intake for a large gaming company, they asked me to use data-oriented design. I found that there wasn’t much available on the subject: a handful of talks and a couple of blog posts, but no practical examples and no source code.

So I decided to write my own. :)

The Test Program

The program we’re going to discuss is a voxel renderer. Voxels are essentially three-dimensional pixels. You can build and shape worlds with them, but they are traditionally very expensive to render. That is because graphics cards aren’t optimized for square blocks; they’re optimized for triangles. On top of that, we’re going to throw a raytracer into the mix. Each voxel casts a ray from a target position to itself. If the ray hits a block other than the voxel itself, the voxel is not visible and is culled.

There are two versions, one using object-oriented programming (OOP) and one using data-oriented design (DOD). I have set some restrictions for myself for this example:

No macro-optimizations – The OOP and the DOD version both use the same algorithm.

No micro-optimizations – The DOD version does not use SIMD functions, even though it totally could.

No multi-threading – Mostly because I didn’t want to multithread the OOP version.

Controls:

Left and Right – Rotate camera

Spacebar – Switch between OOP and DOD

G – Generate new voxels

Q – Decrease amount of voxels

W – Increase amount of voxels

R – Turn raycasting on or off

D – Turn debug lines on or off

Format:
The example was built with Microsoft Visual Studio 2008. A VS2008 solution is provided. It should be forward-compatible with VS2010. The code will not compile out of the box on Macintosh or Linux machines, because of Microsoft-specific features (Windows window, Microsoft extension for font rendering).

This example is provided without any license. That means that anyone can download, modify, redistribute and sell it or any part of it. This example is provided “as-is”, without any guarantee of bug fixes or future support.

The Basics

We’ll need a window with an OpenGL context. This is handled by Framework.h and Framework.cpp. It’s not that interesting overall so I won’t focus too much on it. Suffice to say it creates a window, hands control to the game code and runs its Update and Render functions. It also does some global key handling.

Common.h contains includes and global defines. Included with this example is part of my Toolbox: TBSettings.h, TBMath.h and TBVec3.h. You are free to use these headers in your own projects, if you wish. There are no license requirements attached to them.

The two interesting classes are OOPVoxels and DODVoxels. They are both children of the Game class.

We create voxels with a very simple algorithm. It starts at position (0, 0, 0) in the world and for every voxel, it takes a step forward, backward, left, right, up or down. The color is a random color between (0, 0, 0) and (1, 1, 1). The result is a colorful, but rough, “snake” of voxels.
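The random walk described above could be sketched like this. GenerateVoxelPositions is a hypothetical stand-in (the real code lives in the generator functions of both versions), and it only produces positions; the colors are drawn with rand() in the same way:

```cpp
#include <cstdlib>
#include <vector>

struct Vec3 { float x, y, z; };

// Random-walk voxel generation: start at the origin and take one
// axis-aligned step per voxel, in one of six directions.
std::vector<Vec3> GenerateVoxelPositions(int count)
{
    std::vector<Vec3> positions;
    Vec3 cursor = { 0.0f, 0.0f, 0.0f };
    for (int i = 0; i < count; ++i)
    {
        positions.push_back(cursor);

        // Pick an axis (x, y or z) and a sign, i.e. one of six directions.
        int   axis = std::rand() % 3;
        float step = (std::rand() % 2 == 0) ? 1.0f : -1.0f;
        if      (axis == 0) cursor.x += step;
        else if (axis == 1) cursor.y += step;
        else                cursor.z += step;
    }
    return positions;
}
```

Every voxel is exactly one unit away from the previous one, which produces the “snake”.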

The Object-Oriented Approach

I’ll skip the part where I generate the voxels. It’s not really relevant for this example. If you want to look at it you can find it in OOPVoxels::GenerateVoxels.

We will need some kind of container for voxel data. We are going to store the voxels in a VBO (vertex buffer object). Every voxel consists of 6 quads, so each voxel will need to add its sides to the vertex buffer. The size of the voxels doesn’t change, so we store the position offsets for each vertex once, in the game class.

And now for the slightly harder part. In the Render function, we want to lock the vertex and color buffer and add voxels that aren’t clipped to it. So we add an AddToRenderList function to the Voxel class. It takes a destination vertex buffer, a destination color buffer and the offsets for each vertex of the voxel.

Every voxel needs to check its clipping state before it can add itself to the buffer.
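A sketch of that idea is below. The real function is Voxel::AddToRenderList; the names, the Vec3 stand-in and the use of std::vector here are illustrative, and the real code writes 6 quads (24 vertices) per voxel rather than the generic offset list shown:

```cpp
#include <vector>

struct Vec3 { float x, y, z; };

struct Voxel
{
    Vec3 position;
    Vec3 color;
    bool clipped;

    // Append this voxel's vertices (position plus the shared offsets) and
    // its color to the destination buffers, unless the voxel is clipped.
    void AddToRenderList(std::vector<Vec3>& dstVertices,
                         std::vector<Vec3>& dstColors,
                         const Vec3* offsets, int offsetCount) const
    {
        if (clipped) return; // culled voxels add nothing

        for (int i = 0; i < offsetCount; ++i)
        {
            Vec3 v = { position.x + offsets[i].x,
                       position.y + offsets[i].y,
                       position.z + offsets[i].z };
            dstVertices.push_back(v);
            dstColors.push_back(color); // every vertex shares the voxel color
        }
    }
};
```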

Now, let’s focus on the raytracing portion. When raytracing is turned on, every voxel generates a ray that starts at the target position and points towards the voxel. This ray then stores the voxel instance that is closest to the target position. So, each ray checks each voxel to see which is the closest. Luckily one of my teachers had a very fast ray-AABB collision check.
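For reference, a standard “slab” ray-vs-AABB test looks like the sketch below. This is the common textbook form, not necessarily the check my teacher provided; it also relies on IEEE float division producing infinities for axis-parallel rays:

```cpp
#include <algorithm>

// Returns true and the entry distance tHit if origin + tHit * dir hits the
// axis-aligned box [boxMin, boxMax]. Classic slab method: intersect the ray
// with each pair of parallel planes and keep the overlapping interval.
bool RayAABB(const float origin[3], const float dir[3],
             const float boxMin[3], const float boxMax[3], float& tHit)
{
    float tMin = 0.0f;
    float tMax = 1e30f;
    for (int axis = 0; axis < 3; ++axis)
    {
        float invD = 1.0f / dir[axis];
        float t0 = (boxMin[axis] - origin[axis]) * invD;
        float t1 = (boxMax[axis] - origin[axis]) * invD;
        if (invD < 0.0f) std::swap(t0, t1);
        tMin = std::max(tMin, t0);
        tMax = std::min(tMax, t1);
        if (tMax < tMin) return false; // slab intervals don't overlap: miss
    }
    tHit = tMin;
    return true;
}
```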

But still, every ray needs to check every voxel. This results in horrible performance!

The obvious optimization is to update the algorithm so we don’t need to check so many voxels. But this is supposed to be an example in data-oriented design. ;)

The Data-Oriented Approach

Instead of thinking in terms of voxel objects, ask: what is the minimum amount of data we need per voxel? Well, they’re all the same size, so all we need is a position and a color. What we have now is a container for all our voxels.
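That container could be sketched as two parallel arrays, one position and one color per voxel. The names and the allocation helpers are illustrative, not the article's actual code:

```cpp
#include <cstddef>

struct Vec3 { float x, y, z; };

// The minimum per-voxel data: a position and a color, stored as two
// parallel arrays rather than one object per voxel.
struct VoxelData
{
    Vec3* position;
    Vec3* color;
};

VoxelData AllocateVoxels(std::size_t count)
{
    VoxelData data;
    data.position = new Vec3[count];
    data.color    = new Vec3[count];
    return data;
}

void FreeVoxels(VoxelData& data)
{
    delete[] data.position;
    delete[] data.color;
    data.position = 0;
    data.color = 0;
}
```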

What next? Well, our voxel data needs to be transformed into triangles on the screen. But what data do the VBOs need to put something on the screen? A position and a color. So we need a container for the vertex data.

struct VertexData
{
    Vec3* vertex;
    Vec3* color;
};

As the final step, we need to transform voxel data into vertex data. We’re going to use a function that acts like a factory. Here is the diagram:

The internal structure of the function is much the same as the OOP version. However, I have split the saving of position and color data to improve caching. You can find it in DODVoxels::GenerateFaces.
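A minimal sketch of that factory function is below. GenerateVertices is a hypothetical stand-in for DODVoxels::GenerateFaces; note the split position and color passes mentioned above, so each loop streams through one array at a time:

```cpp
#include <cstddef>

struct Vec3 { float x, y, z; };

// Transform voxel data (positions + colors) into vertex data, expanding
// each voxel by a fixed set of vertex offsets.
void GenerateVertices(const Vec3* voxelPos, const Vec3* voxelColor,
                      std::size_t voxelCount,
                      const Vec3* offsets, std::size_t offsetCount,
                      Vec3* dstVertex, Vec3* dstColor)
{
    // First pass: positions only.
    for (std::size_t i = 0; i < voxelCount; ++i)
    {
        for (std::size_t j = 0; j < offsetCount; ++j)
        {
            Vec3& v = dstVertex[i * offsetCount + j];
            v.x = voxelPos[i].x + offsets[j].x;
            v.y = voxelPos[i].y + offsets[j].y;
            v.z = voxelPos[i].z + offsets[j].z;
        }
    }

    // Second pass: colors only (every vertex of a voxel shares its color).
    for (std::size_t i = 0; i < voxelCount; ++i)
    {
        for (std::size_t j = 0; j < offsetCount; ++j)
        {
            dstColor[i * offsetCount + j] = voxelColor[i];
        }
    }
}
```

Data comes in (voxel arrays), an operation is performed, and data comes out (vertex arrays): the factory picture from earlier.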

So, what needs to change when we want to add raycasting? Obviously we’ll need a structure to hold our ray data.
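In the same parallel-array style, the ray container might be sketched like this. The field names are my guesses for illustration, not the actual struct in the source:

```cpp
struct Vec3 { float x, y, z; };

// Per-ray data, stored as parallel arrays: where the ray starts, where it
// points, and which voxel (by index) it found closest to the target.
struct RayData
{
    Vec3* origin;       // ray start: the target position
    Vec3* direction;    // direction towards the voxel
    int*  closestVoxel; // index of the nearest voxel hit, or -1 for none
};
```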

This is, in my opinion, the biggest advantage of DOD over OOP. It’s clear where your data is and in what way it needs to be converted. The DODVoxels::SolveRayCast function is the biggest change from the OOP version. It has become massive. But the algorithm is the same.

But, what if we wanted to multi-thread? For instance, what if we wanted to split up the solving of raycasts into multiple jobs? With DOD it becomes extremely simple. We extend the SolveRayCast function to not only take a count of data it needs to chew, but also an offset into the array. Because we don’t read and write to the same array, we can split it up without race conditions.
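A sketch of that extension is below. SolveRayCastRange and the doubling “work” are placeholders for the real per-ray computation; the point is only the offset/count split, which lets disjoint slices run on separate threads without races:

```cpp
#include <cstddef>

// Process rays [offset, offset + count): reads one array, writes a
// disjoint slice of another, so slices never touch the same elements.
void SolveRayCastRange(const float* input, float* output,
                       std::size_t offset, std::size_t count)
{
    for (std::size_t i = offset; i < offset + count; ++i)
    {
        output[i] = input[i] * 2.0f; // placeholder for the per-ray work
    }
}

// Divide n rays over jobCount jobs; each call could become a job on its
// own thread, since the slices are disjoint.
void SolveAllRays(const float* input, float* output,
                  std::size_t n, std::size_t jobCount)
{
    std::size_t perJob = n / jobCount;
    for (std::size_t job = 0; job < jobCount; ++job)
    {
        std::size_t offset = job * perJob;
        std::size_t count  = (job == jobCount - 1) ? (n - offset) : perJob;
        SolveRayCastRange(input, output, offset, count);
    }
}
```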

A crazy cool benefit, wouldn’t you say? :)

Results

I made this example with an idea: I’m going to take an algorithm that abuses the CPU’s cache and apply data-oriented design to it to make it faster. The results are… disappointing. Here is the graph for just dumping voxels on screen:

Mind the gap between 5,000 and 10,000 voxels. As you can see, the DOD version is consistently faster than the OOP version, but even at 90,000 voxels it is only 1 fps faster.

But then we get the graph for the version with raycasting:

What happened? The OOP version is consistently faster than the DOD version! To be honest, I haven’t a clue why. The DOD version is laid out better in memory, and better caching should mean better performance. But I don’t see any gain.

Conclusion

I really like data-oriented design. Forcing yourself to think in terms of data instead of objects means you’re taking multithreading into account from the ground up. It can, however, be extremely tricky to think in terms of data, but I’m positive that will change with experience.

The reason I wrote this tutorial was that there wasn’t any proper literature on the subject. I hope my endeavors have saved you some time. :)

7 Responses to “Tutorial: A practical example of Data-Oriented Design”

Thanks for the article. This topic needs more concrete examples and less ‘head in the clouds’ discussion.

I could be wrong, but I think I know why your DOD is not performing better than the OO version.

In your OO voxel class, the data is allocated on the stack (no pointers / new/delete). I believe this will make it contiguous in memory.

But in your DOD example, you appear to be making structs that contain pointers to floats. I haven’t looked at the code, but I’m guessing this means you are allocating on the heap, which may be fragmented. Also, you are probably dereferencing those pointers all over the place, which is hardly cache friendly.

Why not declare your structs to contain plain old floats, then create a big array VoxelData voxels[MAXSIZE] on the stack? Or even split the VoxelData structure into multiple arrays to process the position/color/etc. separately.

I may be wrong here; these are just my initial thoughts. Thanks for the article, and please keep toying with it to find the performance barrier!!

Instead of your DOD structs containing arrays of each element of the data, your array should be of the struct itself, which just contains straight floats.

Because you are allocating all the float* x, for example, then all the float* y, then float* z. But your loops need that data to be contiguous. The memory should look like xyzxyzxyz, not xxxxxxxxxyyyyyyyzzzzzzzz, because otherwise your loops *are* jumping all around memory.

You’re sending member variables to member functions of the same class. What benefit does this have over using the member variables directly in the function? Is it that the rest of the object doesn’t need to be loaded into the cache?

As far as I understand DOD, your DODVoxels::GenerateVoxels should have as many loops as arrays, i.e. six: dst_pos_x, dst_pos_y, dst_pos_z, dst_color_r, dst_color_g, dst_color_b. The same goes for GenerateRayCast and the other functions.

Apart from the dynamic allocation, which adds unnecessary overhead to the DOD version, DOD is actually much more than a shift from AoS to SoA. I cannot emphasize it enough: DOD is about thinking of the data flow. As an example, if you happen to use x, y and z together as a vector, you do want to store them together in a struct; the same goes for colors. Why? Let’s put this code as an example:

What happens if there are a lot of colors? You first load one of the color components into the cache (which one is implementation-defined), then another, and then the other one. And since they are separated, all three will incur a cache miss (instead of one). Next iteration, the r, g and b components that should already be in the cache have more than likely been evicted, because you are loading more data from memory into the cache in the same iteration. Actually, you shouldn’t be surprised if splitting the computation into different for loops actually increased performance, because that way you get the maximum cache-hit rate. And remember to profile everything!