
Data-Oriented Design (Or Why You Might Be Shooting Yourself in The Foot With OOP)

Picture this: Toward the end of the development cycle, your game crawls, but you don’t see any obvious hotspots in the profiler. The culprit? Random memory access patterns and constant cache misses. In an attempt to improve performance, you try to parallelize parts of the code, but it takes heroic efforts, and, in the end, you barely get much of a speed-up due to all the synchronization you had to add. To top it off, the code is so complex that fixing bugs creates more problems, and the thought of adding new features is discarded right away. Sound familiar?

That scenario pretty accurately describes almost every game I’ve been involved with for the last 10 years. The reasons aren’t the programming languages we’re using, nor the development tools, nor even a lack of discipline. In my experience, it’s object-oriented programming (OOP) and the culture that surrounds it that is in large part to blame for those problems. OOP could be hindering your project rather than helping it!

It’s All About Data

OOP is so ingrained in the current game development culture that it’s hard to think beyond objects when thinking about a game. After all, we’ve been creating classes representing vehicles, players, and state machines for many years. What are the alternatives? Procedural programming? Functional languages? Exotic programming languages?

Data-oriented design is a different way to approach program design that addresses all these problems. Procedural programming focuses on procedure calls as its main element, and OOP deals primarily with objects. Notice that the main focus of both approaches is code: plain procedures (or functions) in one case, and grouped code associated with some internal state in the other. Data-oriented design shifts the perspective of programming from objects to the data itself: The type of the data, how it is laid out in memory, and how it will be read and processed in the game.

Programming, by definition, is about transforming data: It’s the act of creating a sequence of machine instructions describing how to process the input data and create some specific output data. A game is nothing more than a program that works at interactive rates, so wouldn’t it make sense for us to concentrate primarily on that data instead of on the code that manipulates it?

I’d like to clear up potential confusion and stress that data-oriented design does not imply that something is data-driven. A data-driven game is usually a game that exposes a large amount of functionality outside of code and lets the data determine the behavior of the game. That is an orthogonal concept to data-oriented design, and can be used with any type of programming approach.

Ideal Data

If we look at a program from the data point of view, what does the ideal data look like? It depends on the data and how it’s used. In general, the ideal data is in a format that we can use with the least amount of effort. In the best case, the format will be the same one we expect as output, so processing is limited to just copying that data. Very often, our ideal data layout will be large blocks of contiguous, homogeneous data that we can process sequentially. In any case, the goal is to minimize the number of transformations, and whenever possible, you should bake your data into this ideal format offline, during your asset-building process.

Because data-oriented design puts data first and foremost, we can architect our whole program around the ideal data format. We won’t always be able to make it exactly ideal (the same way that code is hardly ever by-the-book OOP), but it’s the primary goal to keep in mind. Once we achieve that, most of the problems I mentioned at the beginning of the column tend to melt away (more about that in the next section).

Figure 1a. Call sequence with an object-oriented approach

When we think about objects, we immediately think of trees: inheritance trees, containment trees, or message-passing trees. Our data is naturally arranged that way. As a result, when we perform an operation on an object, it will usually result in that object in turn accessing other objects further down in the tree. Iterating over a set of objects performing the same operation generates cascading, totally different operations at each object (see Figure 1a).

Figure 1b. Call sequence with a data-oriented approach

To achieve the best possible data layout, it’s helpful to break down each object into its different components, and group components of the same type together in memory, regardless of what object they came from. This organization results in large blocks of homogeneous data, which allow us to process the data sequentially (see Figure 1b). A key reason why data-oriented design is so powerful is that it works very well on large groups of objects. OOP, by definition, works on a single object. Step back for a minute and think of the last game you worked on: How many places in the code did you have only one of something? One enemy? One vehicle? One pathfinding node? One bullet? One particle? Never! Where there’s one, there are many. OOP ignores that and deals with each object in isolation. Instead, we can make things easy for us and for the hardware and organize our data to deal with the common case of having many items of the same type.
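The contrast between the two layouts can be sketched in a few lines of C++. This is a minimal, hypothetical example (the particle fields and function names are illustrative, not from the article): the same particle data stored object-by-object versus field-by-field, with one small transform that streams through exactly the data it needs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Object-oriented layout: each particle is a self-contained object, so
// updating one field still pulls the whole object through the cache.
struct ParticleObject {
    float posX, posY, posZ;
    float velX, velY, velZ;
    float lifetime;
};

// Data-oriented layout: each field lives in its own contiguous block, so a
// pass that needs only positions and velocities touches only those bytes.
struct ParticleBlocks {
    std::vector<float> posX, posY, posZ;
    std::vector<float> velX, velY, velZ;
    std::vector<float> lifetime;
};

// One small transform applied to the whole block sequentially.
void integrate(ParticleBlocks& p, float dt) {
    const std::size_t n = p.posX.size();
    for (std::size_t i = 0; i < n; ++i) {
        p.posX[i] += p.velX[i] * dt;
        p.posY[i] += p.velY[i] * dt;
        p.posZ[i] += p.velZ[i] * dt;
    }
}
```

Note that `integrate` never mentions an individual particle: the block of many instances is the unit of work, which is exactly the "where there’s one, there are many" point above.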

Does this sound like a strange approach? Guess what? You’re probably already doing this in some parts of your code: the particle system! Data-oriented design is turning our whole codebase into a gigantic particle system. Perhaps a more familiar name for this approach among game programmers would have been particle-driven programming.

Advantages of Data-Oriented Design

Thinking about data first and architecting the program based on that brings along lots of advantages.

Parallelization.

These days, there’s no way around the fact that we need to deal with multiple cores. Anyone who has tried taking some OOP code and parallelizing it can attest to how difficult, error-prone, and possibly inefficient that is. Often you end up adding lots of synchronization primitives to prevent concurrent access to data from multiple threads, and many of the threads end up idling for quite a while waiting for other threads to complete. As a result, the performance improvement can be quite underwhelming.

When we apply data-oriented design, parallelization becomes a lot simpler: We have the input data, a small function to process it, and some output data. We can easily take something like that and split it among multiple threads with minimal synchronization between them. We can even take it further and run that code on processors with local memory (like the SPUs on the Cell processor) without having to do anything differently.
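A sketch of what that looks like in practice, using standard C++ threads (the function names are illustrative): the transform is a pure input-block-to-output-block function, so each thread gets an independent slice and the only synchronization is the final join.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// The small transform: input data in, output data out, no shared state.
void scaleRange(const float* in, float* out, std::size_t count, float factor) {
    for (std::size_t i = 0; i < count; ++i)
        out[i] = in[i] * factor;
}

// Split the block among threads; each slice is disjoint, so no locks needed.
void scaleParallel(const std::vector<float>& in, std::vector<float>& out,
                   float factor, unsigned threadCount) {
    if (threadCount == 0) threadCount = 1;
    out.resize(in.size());
    const std::size_t chunk = (in.size() + threadCount - 1) / threadCount;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t) {
        const std::size_t begin = t * chunk;
        if (begin >= in.size()) break;
        const std::size_t count = std::min(chunk, in.size() - begin);
        workers.emplace_back(scaleRange, in.data() + begin,
                             out.data() + begin, count, factor);
    }
    for (auto& w : workers) w.join();  // the only synchronization point
}
```

Because `scaleRange` only sees a pointer, a count, and its output range, the same function could in principle be handed to a job system or uploaded to a processor with local memory without changing its body.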

Cache utilization.

In addition to using multiple cores, one of the keys to achieving great performance on modern hardware, with its deep instruction pipelines and slow memory systems with multiple levels of caches, is cache-friendly memory access. Data-oriented design results in very efficient use of the instruction cache because the same code is executed over and over. And if we lay out the data in large, contiguous blocks, we can process it sequentially, getting nearly perfect data cache usage and great performance.

Possible optimizations.

When we think of objects or functions, we tend to get stuck optimizing at the function or even the algorithm level: reordering some function calls, changing the sort method, or even rewriting some C code in assembly.

That kind of optimization is certainly beneficial, but by thinking about the data first we can step further back and make larger, more important optimizations. Remember that all a game does is transform some data (assets, inputs, state) into some other data (graphics commands, new game states). By keeping in mind that flow of data, we can make higher-level, more intelligent decisions based on how the data is transformed, and how it is used. That kind of optimization can be extremely difficult and time-consuming to implement with more traditional OOP methods.

Modularity.

So far, all the advantages of data-oriented design have been based around performance: cache utilization, optimizations, and parallelization. There is no doubt that, as game programmers, we consider performance an extremely important goal. There is often a conflict between techniques that improve performance and techniques that help readability and ease of development. For example, rewriting some code in assembly language can result in a performance boost, but usually makes the code harder to read and maintain.

Fortunately, data-oriented design is beneficial to both performance and ease of development. When you write code specifically to transform data, you end up with small functions with very few dependencies on other parts of the code. The codebase ends up being very “flat,” with lots of leaf functions without many dependencies. This level of modularity and lack of dependencies makes understanding, replacing, and updating the code much easier.

Testing.

The last major advantage of data-oriented design is ease of testing. As we saw in the June and August Inner Product columns, writing unit tests to check object interactions is not trivial. You need to set up mocks and test things indirectly. Frankly, it’s a bit of a pain. On the other hand, when dealing directly with data, it couldn’t be easier to write unit tests: Create some input data, call the transform function, and check that the output data is what we expect. There’s nothing else to it. This is actually a huge advantage and makes code extremely easy to test, whether you’re doing test-driven development or just writing unit tests after the code.
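Here is what such a data-in, data-out test can look like. The damage-application transform is a made-up example, not from the article, but it shows the pattern: the entire “test fixture” is just data, with no mocks or indirection.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Transform: apply incoming damage to a block of health values, clamping at 0.
void applyDamage(std::vector<int>& health, const std::vector<int>& damage) {
    for (std::size_t i = 0; i < health.size(); ++i) {
        health[i] -= damage[i];
        if (health[i] < 0) health[i] = 0;
    }
}

// The whole unit test: create input data, call the transform, check outputs.
void testApplyDamage() {
    std::vector<int> health = {100, 30, 5};
    std::vector<int> damage = {25, 30, 10};
    applyDamage(health, damage);
    assert(health[0] == 75);
    assert(health[1] == 0);
    assert(health[2] == 0);  // clamped, never negative
}
```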

Drawbacks of Data-Oriented Design

Data-oriented design is not the silver bullet to all the problems in game development. It helps tremendously in writing high-performance code and in making programs more readable and easier to maintain, but it does come with a few drawbacks of its own.

The main problem with data-oriented design is that it’s different from what most programmers are used to or learned in school. It requires turning our mental model of the program ninety degrees and changing how we think about it. It takes some practice before it becomes second nature.

Also, because it’s a different approach, it can be challenging to interface with existing code written in a more OOP or procedural style. It’s hard to apply it to a single function in isolation, but as long as you can apply data-oriented design to a whole subsystem, you should be able to reap a lot of the benefits.

Applying Data-Oriented Design

Enough of the theory and overview. How do you actually get started with data-oriented design? To start with, just pick a specific area in your code: navigation, animations, collisions, or something else. Later on, when most of your game engine is centered around the data, you can worry about data flow all the way from the start of a frame until the end.

The next step is to clearly identify the data inputs required by the system, and what kind of data it needs to generate. It’s OK to think about it in OOP terms for now, just to help us identify the data. For example, in an animation system, some of the input data is skeletons, base poses, animation data, and current state. The result is not “the code plays animations,” but the data generated by the animations that are currently playing. In this case, our outputs would be a new set of poses and an updated state.

It’s important to take a step further and classify the input data based on how it is used. Is it read-only, read-write, or write-only? That classification will help guide design decisions about where to store it, and when to process it depending on dependencies with other parts of the program.
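One way to make that classification concrete is to encode it in the transform’s signature, so the compiler enforces it. This is a minimal sketch with hypothetical animation types (`Track`, `Pose`, `advanceAnimations` are illustrative names, not an actual engine API): read-only inputs arrive as pointers-to-const, while read-write state and write-only output are mutable.

```cpp
#include <cassert>
#include <cstddef>

struct Track { float rotationPerSecond; };  // animation data (read-only)
struct Pose  { float rotation; };           // generated output (write-only)

// The signature documents and enforces the read-only / read-write /
// write-only classification decided during the design pass.
void advanceAnimations(const Track* tracks,  // read-only input
                       float* times,         // read-write state
                       Pose* outPoses,       // write-only output
                       std::size_t count, float dt) {
    for (std::size_t i = 0; i < count; ++i) {
        times[i] += dt;
        outPoses[i].rotation = tracks[i].rotationPerSecond * times[i];
    }
}
```

Knowing that `tracks` is never written also tells us it can be shared freely across threads or baked offline, while `times` has to live somewhere a single writer owns it.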

At this point, stop thinking of the data required for a single operation, and think in terms of applying it to dozens or hundreds of entries. We no longer have one skeleton, one base pose, and one current state; instead, we have a block of each of those types, with many instances in each block.

Think very carefully about how the data is used during the transformation from input to output. You might realize that you need to scan a particular field in a structure to perform one pass on the data, and then use the results to do another pass. In that case, it might make more sense to split that field into a separate block of memory that can be processed independently, allowing for better cache utilization and potential parallelization. Or maybe you need to vectorize some part of the code, which requires fetching data from different locations to put it in the same vector register. In that case, that data can be stored contiguously, so vector operations can be applied directly, without any extra transformations.
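The “split a scanned field into its own block” idea can be sketched like this (the entity fields are hypothetical, chosen only to show the hot/cold split): the per-frame pass scans one small contiguous array instead of dragging every entity’s name and position through the cache.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Cold data: large, touched rarely (spawning, rendering setup, debugging).
struct EntityInfo {
    char  name[64];
    float position[3];
    int   modelId;
};

// Hot field split out into its own block; same index as the cold array.
struct EntityTable {
    std::vector<float>      health;  // scanned every frame
    std::vector<EntityInfo> info;    // parallel array, touched rarely
};

// The scanning pass reads only the hot block: 4 bytes per entity instead
// of the full EntityInfo struct.
std::size_t countAlive(const EntityTable& table) {
    std::size_t alive = 0;
    for (float h : table.health)
        if (h > 0.0f) ++alive;
    return alive;
}
```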

Now you should have a very good understanding of your data, and writing the code to transform it will be much simpler; it’s like writing code by filling in the blanks. You’ll even be pleasantly surprised to find that the code is much simpler and smaller than you first expected, especially compared to what the equivalent OOP code would have been.

If you think back on most of the topics we’ve covered in this column over the last year, you’ll see that they were all leading toward this type of design. Now is the time to be careful about how the data is aligned (Dec 2008 and Jan 2009), to bake data directly into an input format that you can use efficiently (Oct and Nov 2008), or to use non-pointer references between data blocks so they can be easily relocated (Sept 2009).

Is There Room For OOP?

Does this mean that OOP is useless and you should never apply it in your programs? I’m not quite ready to say that. Thinking in terms of objects is not detrimental when there is only one of each object (a graphics device, a log manager, etc.), although in that case you might as well write it with simpler C-style functions and file-level static data. Even in that situation, it’s still important that those objects are designed around transforming data.
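For what that C-style alternative might look like, here is a minimal sketch of a log “manager” (the names `logOpen`/`logWrite`/`logClose` are illustrative): plain functions over file-level static state, with no class and no singleton machinery.

```cpp
#include <cassert>
#include <cstdio>

namespace {  // file-level static data; no object, no singleton boilerplate
FILE* g_logFile = nullptr;
}

void logOpen(const char* path) {
    g_logFile = std::fopen(path, "w");
}

void logWrite(const char* text) {
    if (g_logFile)
        std::fprintf(g_logFile, "%s\n", text);
}

void logClose() {
    if (g_logFile) {
        std::fclose(g_logFile);
        g_logFile = nullptr;
    }
}
```

The interface is exactly as capable as a `LogManager` class with one instance, but there is less code, and the data it transforms (text in, file bytes out) stays front and center.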

Another situation where I still find myself using OOP is GUI systems. Maybe it’s because you’re working with a system that is already designed in an object-oriented way, or maybe it’s because performance and complexity are not crucial factors with GUI code. In any case, I much prefer GUI APIs that are light on inheritance and use containment as much as possible (Cocoa and CocoaTouch are good examples of this). It’s very possible that a data-oriented GUI system could be written for games that would be a pleasure to work with, but I haven’t seen one yet.

Finally, there’s nothing stopping you from still having a mental picture of objects if that’s the way you like to think about the game. It’s just that the enemy entity won’t be all in the same physical location in memory. Instead, it will be split up into smaller subcomponents, each one forming part of a larger data table of similar components.

Data-oriented design is a bit of a departure from traditional programming approaches, but by always thinking about the data and how it needs to be transformed, you’ll be able to reap huge benefits both in terms of performance and ease of development.

Thanks to Mike Acton and Jim Tilander for challenging my ideas over the years and for their feedback on this article.

This article was originally printed in the September 2009 issue of Game Developer.

Actually, it probably is on the non-3G iPhones, because they only have 32KB of L1 cache and no L2 cache at all! But frankly, I wrote this article based entirely on my experience with current game consoles, with their high CPU speeds, slow main memory, and deep memory hierarchies.

Leo

Interesting read. I was wondering what your thoughts are on Immediate Mode GUIs as described here ( http://www.mollyrocket.com/forums/viewtopic.php?t=134 ). Personally, I feel it’s a lot more closer to data-oriented design than the traditional object-oriented GUI libraries I’ve used. Input is the current game state and output is the GUI elements describing that state to the user.

Highly optimized software has never had much of anything to do with OOP; it has always had a lot to do with understanding how the hardware executes code and accesses data. It isn’t like Michael Abrash wrote about OOP in the 90s. No, he wrote about hardware and how to use it optimally.

Today’s high-tech games require writing interactive software that is optimized for multiple hardware devices. In a single frame you must get input, get the state of your game objects, decide what to do, and run the data through multiple CPU/SPU hardware, and run the data through Graphics/Audio hardware.

You can certainly construct system designs where you start up tons of tasks that are farmed off to independent threads/processors/cores and then collect all the results for presentation. You can certainly construct this code to be cache efficient if you have patience and use good profiling and cache optimization tools. But engineering this is costly and complicated for software that is supposed to look real or hyper-real. And it is even more costly if you have to fight with horrible copy-in, copy-out encapsulation done in the name of OOP (or sometimes in the name of exception safety, which is even more irrelevant in optimized games).

Sounds very very similar to component/property object systems…
Awesome post!

Victor Ceitelis

Hi,
Thank you for this brilliant post.
For my better understanding I have a question: can we say that this very interesting approach is somewhat similar to shader processing, or the CUDA/OpenCL approach (of course not speaking about the cache optimisation), i.e. applying a simple program to multiple entities?
thanks,
Victor

Robert Lewis

Hi Noel.

Well done for pointing out the importance of optimizing data flows in game software. Clearly getting your data flows right is essential for good performance for all the reasons you mention.

I’m not quite ready to give up on objects, though. C++ became an important language because it made it easier to manage the complexity of large C code bases.

Good software design isn’t easy, and OOP hasn’t changed that. Typical OOP tutorials don’t help when they give silly examples like Dog.Bark() or Car.Drive(). Classes need to encapsulate meaningful abstractions in your software to be useful.

I would propose making data flows explicit from the start, and introducing classes that work with the data flows rather than against them. If your “by-the-book” OOP doesn’t allow for this, then I’d say you’re reading the wrong book.

Maybe you could point out specific OOP practices that obscure the underlying data flows. Then we could discuss whether OOP is at fault or just some bad design practices. (And I’d be the first to agree that there are a LOT of bad design practices out there.)

@Leo, I love immediate-mode GUI (and all the GUI in Flower Garden is done in immediate mode). In general, I love immediate everything. But that’s orthogonal to data-oriented design. You can have it immediate or retained either way.

mattz

“Now it’s the time to be careful about how the data is aligned (Dec 2008 and Jan 2009), to bake data directly into an input format that you can use efficiently (Oct and Nov 2008), or to use non-pointer references between data blocks so they can be easily relocated (Sept 2009).”

Can you post links to these articles? They’re a little bit hard to find.

@mattz, Yes, don’t worry. I’ll be putting up the other Inner Product articles starting next week. I wanted to start with this one since this is the one that I knew there was a lot of interest in, but I’ll be adding all the others too.

I don’t want to sound naive here, but after reading the article, it sounds like you are talking more about procedural programming with a strong focus on memory management and optimization. I guess it’s just that I didn’t grasp the difference, but it definitely doesn’t sound to me like a radical new paradigm. Could you give more details on the differences between some traditional imperative code written in C and the equivalent data-oriented code?

It’s not something radical and new. People have been doing this in one way or another for a while, especially in the area of DSP. It’s just an approach that is becoming more and more important with modern hardware.

It’s different from procedural/imperative code in that the focus is on the data and how it’s transformed. It’s all about well-defined inputs and outputs and the transformations in between. Procedural programming says nothing about that, and you’ll often have many function calls chained together. This is about the data flow instead.

I’ll definitely be following up this article with some other ones including examples. That should help clarify things.

Your description of Data Oriented Design feels similar to both Functional Programming as practiced by the likes of the Haskellers in academia, as well as the APL/J/K crews in the finance world.

I’m intrigued to see if there is potential for cross-pollination between these areas of practice…

frevd

hello noel – nice post, but i’m having trouble spotting the differences between these two paradigms. what you essentially do is manipulate data chunks in a pipeline, like sculpting blocks of wood (input) using some method (transformation) into something else (output), only that you have a highly specialized function for doing so to all the blocks, i.e. processing a whole set of “objects” at once, rather than calling their respective member method to do the same transformation for each object in a loop. of course this potentially allows for a heavy speed-up and parallelization, but it still remains the same operation – all you have to do to switch paradigms is to not call the member method but to come up with a highly specialized transform function (one of a gazillion) that executes the very same code from the member method in a loop for a set of objects – modern compilers or JITs should do just that automatically when they encounter a member method being called for a set of objects.
Also, you are losing the abilities that oop tries to provide for distinguishing between object instances (the private implementation aspect of an object, publishing issues, and abstraction through inheritance), but of course in many scenarios this is not really needed. However, if at some point in the future of a program you suddenly have to care about differences in each block of data, you would have to reinvent the wheel. That’s why oop is sometimes a good choice as well.
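For illustration, the contrast frevd describes might be sketched like this (hypothetical types and names; a minimal sketch, not anyone’s production code):

```cpp
#include <cstddef>

// OOP style: each object owns its state and updates itself,
// typically via a loop of member-function calls.
struct Particle {
    float x, vx;
    void update(float dt) { x += vx * dt; }
};

// Data-oriented style: one transform function applied to whole
// arrays of data, with positions and velocities stored in
// separate contiguous buffers.
void update_positions(float* xs, const float* vxs, std::size_t n, float dt) {
    for (std::size_t i = 0; i < n; ++i)
        xs[i] += vxs[i] * dt;
}
```

The operation is indeed the same in both cases; the difference is that the second form makes the memory layout and the batch nature of the work explicit, which is what enables the cache-friendly access and easy parallelization the article argues for.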

Devdas Bhagat

OO is good when you want to describe a type. OO is also good at encapsulation.
What OO is not good at is handling collections (another example of this is ORM).

However, most programmers familiar with UNIX should be familiar with the concepts you mentioned. The shell uses exactly the same design pattern, as a way to pass data between pipelined processes.

Martyr2

I have to agree with frevd. I understand where you are going with the paradigm of processing data flows with a transformation focus, but data can be very volatile, and the slightest need to change a flow can have you digging around in several functions which, albeit simple, may be numerous (frevd labeled it a gazillion, which is probably pretty accurate). OOP provides a bit of a shield from this, I think.

Personally I think you could do much of this in OOP by simply thinking in more abstract terms. A class that focuses on a single bullet and its trajectory? No, a class that focuses on the data flow that will take the bullet from point A to point B given a series of data transformations perhaps.

Definitely food for thought, and it gives OOP another twist… at least when it comes to some of the classes one might design.

Michael Peters

In Perl it’s common to combine these two approaches (data-oriented and object-oriented programming) using something called “inside-out objects” (http://www.perlfoundation.org/perl5/index.cgi?inside_out_object). This lets you have all the data together contiguously in memory but have it ordinarily controlled through OOP. But if you need to work on that data by itself (in parallel, etc.), then it’s accessible as a group. If not, then you can work on it at the object layer.

Thanks for the great comments and feedback. I’m planning on writing a few more posts on the subject, so I’ll definitely make sure to cover most of the points being raised.

It seems that one of the main points being raised is why can’t you do data-oriented design with objects. My response is that technically you can, but the resulting architecture is anything but OOP. Objects end up being very small and not representing a whole concept, but only a fraction of one. And that small fraction doesn’t have several member functions, but just one (or a few) transform functions. At that point, it doesn’t matter if it’s a class, a struct, or a dumb byte buffer, really.

And yes, it means that when you think about full logical concepts, their implementation is going to be spread out across multiple files. That’s one area where having unit tests helps a huge amount.

As for losing the private data and abstractions… it isn’t much of a loss (for me, personally, that is). I’ve been moving away from it more and more as I do more TDD, and objects are so small that it doesn’t matter as much. They also tend to be more stateless, and you don’t have cascading function calls, so there’s nothing to hide in private member functions.

More stuff better thought out next week for sure.

William Bowers

Great post Noel. I do have two comments to make though.

Parallelization – I thought this argument was pretty weak. I myself haven’t really done much multi-threaded or multi-core programming so I can’t really say which type of programming is better for it, but this section didn’t really have much of anything in the way of proof of why data-oriented design is better. Like this:

“When we apply data-oriented design, parallelization becomes a lot simpler: We have the input data, a small function to process it, and some output data.”

You’re assuming that in data-oriented design every function is small and does only one thing and that in OOP every function is large and does numerous things. Not true. This is actually the same as OOP, except the function call is on the other side:

output = myfunc(input)
vs.
output = input.myfunc()

There need not be any real difference except for the fact that in data-oriented design, the input can be one chunk of contiguous memory, and in OOP you have no control over that. Maybe that’s what you meant.

Unit testing – Unit testing in OOP is *not* hard. Well, it can be hard, but it’s not OOP’s fault. If anyone finds it difficult to test object-oriented code, it’s probably because their code is bloated, tightly coupled, and/or their functions do too much and have too many dependencies. But those issues are not inherently OOP issues. You can have all of those things in data-oriented design, or any other kind of programming for that matter. Maybe OOP encourages the programmer to make these kinds of mistakes, and maybe data-oriented programming encourages them not to. And maybe that’s what you meant.

Once again, great article. I’ve headed back to C lately (from doing a lot of front-end development and web design), which lends itself nicely to data-oriented design.

William, I’m going based on my experience with OOD and general best practices of OOD. Yes, it’s possible to use classes in a way that is very data-friendly, but that’s not standard or recommended OOD. Usually member functions change internal state, and objects are often interconnected, so a function call chains to other function calls in other objects.

So in a way you’re right: If you design your objects so they don’t do that, it’s a great initial first step. Then, if as a second step, you break up those objects into concepts or slices of the original object and process those new objects separately, then we’re almost at the same point.

So yes, you can use classes, member functions, and private variables. Yes, you can put the update function as a static member function. I just wouldn’t call that traditional OOD.

William Bowers

Agreed. Good point.

Michele Mauro

The ’70s called, and they want their assembler coding back. Am I getting too old?

Just go and pick up good old “Algorithms + Data Structures = Programs” by Wirth, and you’ll find the other half of what you’re saying here.

The bottom line is: if you’re programming with constrained resources (as in game programming, where the CPU is never enough, or on embedded devices, like mobile phones), OOP becomes a luxury, and you have to shift back to (or in your case, rediscover) Structured Programming.
If you have a server based application, it is cheaper to throw hardware at the problem to manage the complexity of the system with OOP and abstraction layering, than drown in the sea of data. It’s a matter of the right tool for the right job.

OOP has its place. This style of programming has its place. Other styles of programming (Functional, Logical, and other esoteric things) have their places. End of the story.

Michele

Dave

My thoughts exactly. Sounds like FORTRAN common blocks. You can’t get much faster than parallel arrays of data.

The maintenance can be a pain, with constant array sizes (dynamic memory is evil in deterministic systems), and determining what entries are valid, but once you get used to it, it’s just like the good old days on a 390 mainframe.

philip andrew

I think it would be easier to write Pac-Man in data oriented design than OOP, but the code would be messier.

Matt

Hi Noel,

This is a nice approach for designing a game’s main subsystems. As you point out, it simplifies multithreading and cache utilization.

I’d just like to reiterate sentiments in Robert Lewis’s post (#7) above. It’s not immediately obvious to me why data-oriented design would be in conflict with OOP. Objects can (and often should) be completely abstract, and so can just as easily represent a data stream, an algorithm or a stream manager rather than an entire game entity.

You say (post #19) that the objects would end up being very small. I would say that this is good OOP – each object should have a single responsibility. Are there any OOP principles in particular that must be violated in order to implement a data-oriented design?

dan

Noel, an interesting read, but I think you’re missing just how critical object-oriented design is to many projects. The essence of OOD is to provide encapsulation so that you can isolate each object as a black box that performs operations on itself.

Why is that so important? This process modularises the code so that lots of people can work on it at once, and it can be built up gradually along the way. It helps encourage code reuse and simplifies the design for anyone coming to the project fresh. When looking at a carefully written OOP program, you can start at the outer layer, then gradually dig deeper into each layer of abstraction to see specific implementations. Any low-level implementation is hidden behind layers, so you don’t need to look at complexities you’re not interested in. I entirely disagree with your point on modularity; a well-written OOP program is much easier to understand than a completely data-oriented one.

In addition, as has been mentioned in these comments, a major headache with the data approach is passing in thousands of arguments into each method, which have to be kept up to date and restrict the flexibility to change the data (double to float for example), without changing every single argument in which it is used.

I am currently working on a GPU project that requires the correct arrangement of data for performance reasons (as you’ve listed). However, I have successfully been combining the improved performance of the data model with the better design of the object model. In my opinion this is the way to go, object-oriented design came about for a reason, just because parallel computing is much more prominent doesn’t necessarily mean we should throw away all of the evolution of OOD, even for data intensive or performance critical applications. We need the next step of evolution that ties these two areas together.

Erik

Reading through the comments, I see that many here have problems understanding that OOD and data-oriented design are both about a way of thinking, not about syntax.

Some seem to think that the first one is functional or procedural while the last one is OOP. The first might just as well be OOP. Just look at, e.g., the CoreFoundation API from Apple: it is an OOP library in C. You can decide whether something is OOP or not based on a single function call. In the first example, input could be an opaque pointer, which means you can’t do stuff with it without using library-provided functions (more OOP-like). If input is instead a simple data type which is publicly known and which you can easily manipulate directly, then it might be part of a more data-oriented or procedure-oriented system.

Functional programming has nothing to do with data-oriented design, by the way, as some seem to think. In functional programming you don’t have state. Data-oriented design is all about state.

I’d also like to mention that data-oriented design is also used in non-game software like the Visualization Toolkit (VTK) from Kitware. It is fully object-oriented at a higher level, but internally, nodes, cells, points with attributes, etc. are not objects but separate chunks of primitive data as described here. E.g. a point with attributes is not stored as one unit: the vertices for all points are stored in one array, and each property is stored as an array. E.g. the color of a point is stored in an array with the colors of every point.
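As a rough sketch of the layout Erik describes (hypothetical names, not actual VTK code): each attribute lives in its own contiguous array, and a point is just a row index across them.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Structure-of-arrays point storage: each attribute in its own
// contiguous array, indexed by point id.
struct PointCloud {
    std::vector<float>         xs, ys, zs;  // vertex coordinates
    std::vector<std::uint32_t> colors;      // one packed RGBA color per point

    std::size_t add_point(float x, float y, float z, std::uint32_t rgba) {
        xs.push_back(x); ys.push_back(y); zs.push_back(z);
        colors.push_back(rgba);
        return xs.size() - 1;  // the point id is just the row index
    }
};

// A transform that touches only the attribute it needs: recoloring
// streams through the color array without pulling coordinates into cache.
void tint_all(PointCloud& pc, std::uint32_t rgba) {
    for (std::uint32_t& c : pc.colors) c = rgba;
}
```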

On the contrary, in DOD (and in any model that attempts to tackle the issue of wide parallelization and stream processing), the logic must be devoid of any state. Otherwise, you can’t go wide. If running the logic changes the internal state of the logic itself, subsequent runs of that logic with identical inputs will produce different outputs, and you can’t go wide.

DOD is not about state; it’s about data flow, having clear inputs and outputs, and having stateless transformations. And the metadata you have about your data flow is what allows you to programmatically determine when jobs can start, what jobs they must wait on, and how wide a single job can go.

Erik

stingoh, I probably wasn’t very clear about what I meant. How the functions are performed is not influenced by state. However you typically operate on arrays of data which you change in place. Data is mutable. If data is mutable you have state IMHO. In functional programming you don’t traverse over say an array of data and mutate it. You get as input immutable data and produce immutable data as output. Thus you can’t have loop constructs.

Wouldn’t it make sense to think of this as a pipeline? Each stage modifies a buffer of data which is read by the next stage, which uses that data to modify the next buffer, which is read by the stage after that.
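The pipeline framing in the comment above can be sketched as stages that each read one buffer and produce the next (hypothetical stages, assuming simple integer data):

```cpp
#include <vector>

// Stage 1: scale every element, producing a new output buffer.
std::vector<int> scale(const std::vector<int>& in, int k) {
    std::vector<int> out;
    out.reserve(in.size());
    for (int v : in) out.push_back(v * k);
    return out;
}

// Stage 2: offset every element, reading stage 1's output.
std::vector<int> offset(const std::vector<int>& in, int d) {
    std::vector<int> out;
    out.reserve(in.size());
    for (int v : in) out.push_back(v + d);
    return out;
}
```

Chaining them as `offset(scale(input, 2), 1)` gives exactly the buffer-to-buffer flow described: each stage has a well-defined input and output, so stages can run on different cores as soon as their input buffer is ready.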

Daniel

Of course one other very large benefit is reduced compilation times.

Even though a compiler compiling a module may not need to know about the private members of a class, it is still parsing and processing all that extra class definition. Add in other non-builtin non-reference aggregates, and you really start to suffer.

Workarounds in OO land? Use pimpl (a terrible hack), or pure base classes and incur the virtual overhead, with possible problems porting to non-homogeneous cores like the SPU, amongst others.

To me, the discussion over using a processor with heterogeneous cores is a very compelling argument for data-oriented design.

Jonathan

What I don’t understand is how to keep references to the implicit objects that exists within the arrays. One object may need to reference another. Indices become invalid if an entry is removed from the array and the data is moved.

The solutions I see are to have a constant size for the arrays, never move objects, and have a flag on the active objects; or to let the objects reference each other by some unique identifier and search for it in the array.

Fixed arrays would require testing the flag when processing the data and there will be a fixed limit for how many objects can be created. Searching for objects seems inefficient as well.
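One common answer to this problem is to hand out handles that pair an index with a generation counter, so stale references can be detected after a removal. This is a deliberately simplified sketch with hypothetical names, and it omits the free-list slot reuse a real implementation would want:

```cpp
#include <cstdint>
#include <vector>

struct Handle {
    std::uint32_t index;
    std::uint32_t generation;
};

template <typename T>
class Table {
    std::vector<T>             data_;
    std::vector<std::uint32_t> generations_;  // bumped when a slot is freed
public:
    Handle add(const T& value) {
        data_.push_back(value);
        generations_.push_back(0);
        return {static_cast<std::uint32_t>(data_.size() - 1), 0};
    }
    bool valid(Handle h) const {
        return h.index < data_.size() && generations_[h.index] == h.generation;
    }
    T* get(Handle h) { return valid(h) ? &data_[h.index] : nullptr; }
    void remove(Handle h) {
        if (valid(h)) ++generations_[h.index];  // invalidates outstanding handles
    }
};
```

Compared with the flag-testing and searching options listed above, the generation check is a constant-time test, at the cost of one extra array and one level of indirection when dereferencing.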

Xananax Prozaxx

I find this extremely interesting but I am having trouble envisioning it. A short example in code would help tremendously. I dug on google and found a few lines here and there, but nothing above five lines. I am not asking for a full source code, but, say, an example of a bullet flying, an actor moving, and maybe, for example, a simple hit test tossed in there just to grok how the whole thing works?

wener

I’ll try to start with learning and using a Functional Programming Language.

Mr Hallows

I agree. Code snippets always help to make the concepts discussed concrete.

The LoseThos operating system offers a different approach to parallelism you might be interested in. You’re right about SSE/MMX. If you’re curious, look at LoseThos’ flight simulator. It’s multicored master-slave, not SMP. All 8 cores dynamically update more or less of the image based on load, not using a GPU. (This flight sim is /LT/Demo/GameStarters/EagleDive.CPZ.)

Mike

I know this is an older article but it was exactly what I was looking for to push my current project forward.

I had always based my programming around, make the objects, define their members, methods, etc.

And I always hit walls. Every time. I would end up realizing I need to rewrite everything because of one forgotten feature or flaw somewhere, or scrap the design because it’s bad. Either way, rewriting the entire system three times is my limit. Time to reinvent myself.

After I read this article I immediately wrote down a DATA BREAKDOWN SHEET for my game. The main idea is that I designed the system around the game data that will be saved and loaded. That encompasses critical systems such as Audio and Video settings, and includes things such as Player data, Item data, and World data: all the bits of persistent data a game requires.

It makes a lot more sense to me now.

So the new process is:

Step 1) Object-Oriented-Design to define objects that will be in the game. This is what you’re taught, and it works flawlessly.
Step 2) Data-Oriented-Design to lay down the foundation of the game and to begin implementation.

One last bit:

If you’ve ever read code from big game makers such as id or Apogee, you would clearly see that their systems are always data-oriented designs. They’re never about objects, but about the centralized usage of moving data.

And I am so “d’oh” right now because I have been looking at code by those guys since I knew what code was and it just barely came to me because of this article.

You are really right – I’ve thought the same for many years (I’ve been programming for about 20 years), but these days I’m into functional programming and I’m fully delighted with it – especially the simplicity of Lisp. Clojure is nicely clean and pure. The only problem is that these languages are still not suitable for writing realtime games. Semantics always matters more than syntax. I’m waiting for the day when syntax is as simple as Lisp’s and we can concentrate on data manipulation instead of declaring many classes.

io

I know this is from a while ago now, but this sounds very much like aspect oriented programming. ( This wiki link isn’t the best overview, but it feels similar. http://en.wikipedia.org/wiki/Aspect-oriented_software_development#Concepts_and_terminology )

I tend to agree with you that the focus on objects as physical objects is bad. Some programs I’ve seen try to pack all… aspects.. of an object into one class, and it just becomes a mess. Separating out physics into a physics class, rendering into a rendering class, behavior into a “mvc”ish controller, and so on… tends to be a much more robust way to go, and it enables the kind of “locality of code and data” you’re talking about.

I also like to think of relationships between objects ( ex. physical attachment ) as a class — like a resource that gets created and destroyed along with the relationship. The relationship knows about the object(s) but not the other way around. Often this “class” can just be a c-like struct, but the concept is still OOP, just not based on concrete everyday concepts of a real world object….

Wm Leler

I’m really happy to find this article (even if it is a few years old). I used to teach OOP but eventually realized that it was just one tool in a big toolbox. Unfortunately, the CS community seems to look at everything as an object (like the saying “when all you have is a hammer, everything looks like a nail”). As a result, I’ve seen just as many projects hindered by obsessive use of OOP as helped. This is especially true where there is significant amounts of asynchronous behavior.

Anthony Green

Could you clarify which school of OOP design you’re making reference to?

Hi Noel, I am new to data-oriented design, and I am currently developing a physics engine as a hobby project. I am really interested in data-oriented design, but I am not sure how to apply it to my physics engine. I am using Box2D as a reference, and when I look at the source code, Erin Catto uses a linked list as the container for rigid bodies. The reason he uses a linked list over an array is stated here: http://www.box2d.org/forum/viewtopic.php?f=3&t=5389 . When the physics engine does broad-phase collision detection, it returns a list of collider pairs that are most likely located randomly in memory. Finally, as stated by Erin Catto, when islands are created, the set of bodies is likely not to be contiguous in memory. Using a handle over a pointer adds an extra level of indirection, which results in more cache misses. So is it really beneficial to store rigid bodies or their components as arrays? Thank you

Kavukamari

I’m wondering if this is sort of like: you still have your references to “””objects”””, but rather than containing their own data, they are simply a struct with a list of pointers to their “owned” pieces of the data arrays. I.e. enemy 400 has a list “Quaternion* pos, Vector3* vel, char* name, int* hp”, and those pointers point to entry (e.g.) 18250 of the Positions[], Velocities[], Names[], HPs[] arrays, so that the data can be easily processed while still being “associated” with a unique “thing”?
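A common variant of the layout described above stores a single row index per entity instead of one pointer per field, so the arrays remain free to grow or relocate (hypothetical names, a minimal sketch):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Parallel attribute arrays; entity i owns row i of each.
struct World {
    std::vector<float>       vel_x, vel_y;
    std::vector<int>         hp;
    std::vector<std::string> names;
};

// An "entity" is then just an id into the tables, not an object
// that owns its own data.
struct Entity {
    std::size_t row;
};

int entity_hp(const World& w, Entity e) { return w.hp[e.row]; }
```

Bulk transforms still stream straight through each array; the per-entity view only exists when some piece of code actually needs to treat a particular row as one “thing”.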