Introduction

The free lunch is over, and all software engineers need to understand how to write parallel code if they want to continue seeing performance improvements with future CPUs. Similarly, we have to think parallel if we want to take advantage of massively parallel computation resources like GPUs or FPGAs.

To fully exploit these hardware resources, people started designing so-called compute languages. For example, the CUDA language allows you to write massively parallel programs to run on NVIDIA GPUs, and the OpenCL language allows you to write massively parallel programs that can efficiently target CPUs, GPUs, and FPGAs. Graphics APIs like OpenGL and DirectX also allow you to write massively parallel geometry and pixel processing programs that run on GPUs, and on CPUs too with software renderers like WARP.

In recent years, OpenGL and DirectX have embraced so-called “compute shaders”, which allow general-purpose compute programs to run alongside the traditional geometry and pixel processing programs. Compute shaders brought significant innovation to renderer design, and are now an integral part of advanced rendering engines.

So Why Do We Care?

Let’s take a step back and think. Everything we’re talking about here is strictly for the purpose of optimization. Compute programs don’t let you implement logic that wasn’t possible to implement previously. At best, compute languages let you implement the same logic more efficiently, at the likely cost of more development time. Of course, performance is still very important: if we didn’t care about it, we wouldn’t be using GPUs at all. If you’re a gamer or a game developer, you probably know how important high-performance GPUs are for running 3D graphics at enjoyable frame rates.

The reason I’m emphasizing the importance of optimization is that writing efficient programs is highly dependent on your understanding of the relevant computer hardware. We’ve entered the domain of embedded systems, meaning that we are designing software systems with the intention of running them on a specific kind of hardware. As a software engineer, you are solving a problem (“I want to make such-and-such video game”), and you must solve this problem with a hybrid hardware-software solution. Hardware details are more than just a good conversation starter at software development conferences… Understanding the hardware is your job, and it’s the reason employers will pay top dollar to borrow your brain. It’s the argument of kings: This is embedded systems. We care.

Why Read This Article

The goal of this article is to gain an intuitive understanding of how compute works. Instead of memorizing arbitrary OpenGL/OpenCL/DirectX/CUDA terminology, we’ll put ourselves in the shoes of somebody who wants to write programs that fully exploit hardware resources, and we’ll see how this leads to the design of the compute languages we know and love today.

By understanding the principles behind the API, you can mostly spare yourself from having to memorize any API details at all. You can make your design on paper independently from whether you’re using OpenGL or DirectX or whatever, then you can translate it to whatever API-specific syntax when the time comes to implement your design.

Furthermore, if you do try to understand compute through only (eg) OpenGL’s documentation, you might find it very difficult. The specifications of compute languages are vague by nature, since they give generous flexibility to be implemented on widely different hardware. This vagueness makes it hard to understand these specifications… Unless you already have a lot of background knowledge about the underlying hardware, in which case you probably barely need to read the specification in the first place. To deal with this vagueness, this article tries to teach this required background knowledge that specifications otherwise omit.

Finally, it’s important to realize that contemporary compute APIs like OpenGL and DirectX are evolving at a fast pace. In 5 years, compute won’t look the same. In 10 years, compute capabilities will have evolved significantly in flexibility and performance. In 20 years, what we’re doing now will look primitive and uncivilized. If you’re reading this now, you’re probably going to see this evolution first-hand. Maybe you’ll be a user, and maybe you’ll be an implementer too. Either way, you have the potential to shape the future, so let’s think about how things should be rather than how things are.

Scaling with Hardware

Previously, developers could write non-parallel (scalar) code, and see their programs run significantly faster with every new generation of hardware. Unfortunately, as CPU designers run up against the physical limits of the universe, it has become more cost-efficient to, for example, create a CPU with two cores at the same (or lower) frequency, rather than trying to create a CPU with a single core at double the frequency.

There is a clear trend of hardware designers replicating hardware resources rather than trying to speed up existing ones. As before, we would like to write programs that improve in performance over time with new generations of hardware. To do this, we have to write programs whose performance improves as hardware resources become more and more replicated in the future.

Replicated hardware resources allow us to improve the performance of our programs, but only in a specific kind of way, and seeing this requires a shift in mindset. Consider, for example, using a binary search tree or a hash table to optimize a linear search. Such data structure improvements to search time are certainly important, but we also have to consider if and how these data structures would speed up the program if the user ran it with additional hardware resources.

In contrast to single-threaded optimizations, scaling with additional hardware requires distributing the data among available hardware resources. For example, with replicated comparison circuits, we can accelerate comparisons while searching by doing more than one comparison in parallel. With replicated arithmetic hardware, we can compute the hash code for more than one search query simultaneously. With replicated hardware for memory access, we can accelerate the access of the data structure we’re searching in. What these optimizations have in common is that they distribute the data of the input domain among hardware resources, an approach formally known as data-parallelism. Identifying ways to run your code in a data-parallel way is the key to writing code that scales with hardware.

The Red Herring: System-on-a-Thread

In the earlier days of game engine programming, developers were thinking about scaling with hardware by associating software systems to cores. For example, a game programmer might have decided to exploit a dual core CPU by running their graphics system on one core and their audio system on the second core, which is done by creating one operating system thread for each of these systems. This so-called “system-on-a-thread” solution has some serious scalability problems. If the audio thread finishes processing one frame of audio before the graphics thread finishes processing one frame of video, then the audio thread has to become idle while waiting for the graphics thread to catch up. In that period of time, only one out of two CPU cores is active, which means that computation power is left on the table.

Idling cores isn’t the only problem with system-on-a-thread. What happens if you want to add a physics system to the game engine, and the physics system also wants to run on its own thread? In this case, we have 3 system threads but only 2 CPU cores: we now have over-subscription, meaning more threads than cores. This is undesirable, since the CPU cores have become contended resources between the threads.

One core can only run one thread at a time, so the operating system will need to share each core’s running time among the threads assigned to it. When the operating system switches a core from running one thread to running another thread, an expensive context-switch happens. In the worst case, your program might spend more time context-switching between threads than doing useful work.

There is an important lesson to learn from this: Additional parallelism in software can only be useful if it can be efficiently allocated to hardware resources. When it comes down to it, software systems can’t use more circuits than what physically exists on the chip. Only hardware designers have the power to add additional circuits, and as a software designer you have to work with what you’re given.

The problems with system-on-a-thread underline the importance of data-parallelism. If we identify general ways of splitting up the data we’re working on among hardware resources, then we can efficiently split up work between available hardware, independently of how much hardware we really have.

Designing Parallel Programs

To design data-parallel programs, we need to understand what can and can’t be made parallel. As implied by the term “data-parallelism”, the goal is to divide the work into pieces that can be executed independently on the data that needs to be processed, which allows us to execute these pieces of work in parallel. By designing work that can run independently from other work, we guarantee that the execution of one piece of work won’t affect the results of another piece of work. This is important, since work that executes in parallel will likely not execute in a consistent order.

The Problem with Threads

Let’s take a moment to appreciate the problems that arise when parallel code doesn’t run in a consistent order.

Consider a program made up of two threads that share a variable “a”: Thread 1 sets “a” to 1 and then prints it, while Thread 2 sets “a” to 0. The order in which the two threads execute their code is unpredictable, since it depends on the inner workings of the C compiler, the operating system and the hardware.
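The original listing isn’t reproduced here, but a minimal sketch consistent with the traces below could look like this (the shared variable is atomic only to keep the sketch well-defined in C++; the scheduling race between the two threads remains):

```cpp
#include <cstdio>
#include <thread>
#include <atomic>

static std::atomic<int> a{0}; // shared between the two threads

void thread1() {
    a = 1;
    std::printf("a == %d\n", a.load()); // may print 0 or 1
}

void thread2() {
    a = 0;
}

// Launches both threads, waits for them, and returns the final value of a.
int run_once() {
    std::thread t1(thread1);
    std::thread t2(thread2);
    t1.join();
    t2.join();
    return a.load();
}
```

Depending on which thread the scheduler runs first, the program finishes with “a” set to either 0 or 1.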

In one run of the program, the execution might happen in the following order:

a = 0;
a = 1;
printf("a == %d\n", a); // -> prints "a == 1"

In this case, Thread 2 ran its first line of code first, then Thread 1 ran to completion. As a result, the program finished with “a” set to 1.

In another run of the program, the execution might happen in this different order:

a = 1;
a = 0;
printf("a == %d\n", a); // -> prints "a == 0"

In this case, Thread 1 ran its first line of code, then Thread 2 ran its first line of code, then Thread 1 ran its second line of code. As a result, the program finished with “a” set to 0.

What this short example shows is that multi-threaded programs can produce different results based on the order in which the code executes on different threads. This unpredictability is bad news for programmers, since it makes the code much harder to understand. To quote Edward A. Lee’s “The Problem with Threads”:

a folk definition of insanity is to do the same thing over and over again and expect the results to be different. By this definition, we in fact require the programmers of multi-threaded systems be insane. Were they sane, they could not understand their programs.

When the results of a program change based on the order in which its threads execute, we call that a “race condition”: the program’s results depend on which thread wins the “race” to reach a certain point of execution first. If we want to resolve a race condition, we can identify synchronization points in the code, which can be implemented using operating system synchronization objects like mutexes, semaphores, condition variables, critical sections, interlocked (atomic) operations, and more.

The synchronization objects I just described are very interesting to study and extremely useful, but they also all have relatively high execution costs. Instead, we should aim to avoid needing synchronization at all, and this can be achieved if we design algorithms that avoid race conditions entirely. To do this, we first need to know how to formally identify race conditions.

Identifying Race Conditions

As we saw in the previous section, running two functions in parallel may result in race conditions, which can lead to programs with indeterminate results. There are two main culprits behind race conditions:

Shared memory

Side-effects

Shared memory was a problem in the previous example, since the race condition we observed was partly caused by the fact that the memory for the variable “a” was shared between the two threads. If the code was re-designed to make the variable “a” only exist in one thread (as opposed to being shared between the two threads), then it would be impossible for the variable to change due to the execution of a different thread, since those other threads simply don’t have access to it.

Along with shared memory, the second cause of the race condition in the earlier example was that each thread had side-effects: each thread wrote to the shared variable. To see why this matters, consider the following program, in which the threads only read the shared variable “a”:

/* Thread 1 */ | /* Thread 2 */
b = a;         | c = a;
b += 1;        | c -= 1;

In this program, both threads have access to the variable “a” through shared memory. However, unlike the program with a race condition that we saw earlier, the two threads only read the value of “a”, and never write to it. Since the shared memory is now read-only, there is no race condition. Regardless of the order in which the lines of code of the two threads above are executed, the final values of “a”, “b” and “c” are always the same.

Based on these observations, we can make a powerful statement: Threads may share memory without race conditions and without synchronization if the memory they share is read-only.

The fact that memory can be safely shared between threads if it is read-only is very interesting from the perspective of a programming language designer. For example, the Haskell programming language is designed so every variable is read-only after being written for the first time. Since this prevents functions from having side-effects, memory can be shared arbitrarily, and Haskell programs can in principle be parallelized automatically to a large degree. Haskell has other problems that make it arguably hard to use in real-life code, but it’s interesting and eye-opening to study for programmers of all walks of life.

Despite the elegance of languages like Haskell, our day-to-day life using programming languages like C requires us to both read and write shared memory. If we want to avoid race conditions in our programs, then we should ensure that these reads and writes happen in a consistent order. To do this, we must identify and handle the data hazards that arise when shared memory is both read and written. For example, if one thread writes to shared memory and another thread then reads it, that subsequent read should see the value that was just written: the reader must wait for the previous writer to finish writing. If the reader doesn’t wait for the writer to finish, the reader might read the previous value of that memory (as opposed to the new one it should have read). Even worse, the reader might read some data that is a Frankenstein-like half-and-half between the previous value and the new value, also known as a “torn read”.

Since data hazards are concerned with the ordering of reads and writes, we can exhaustively analyze every possible scenario:

Read-After-Read (RAR)

Read-After-Write (RAW)

Write-After-Read (WAR)

Write-After-Write (WAW)

The case of Read-After-Read is the one scenario where there is no data hazard. As described previously, there are no race conditions when shared memory is only read and never written.

In the case of Read-After-Write, it’s important for the reader to wait for the writer to finish, otherwise the reader might not see the updated value, or might get a torn read, as was explained earlier.

In the case of Write-After-Read, it’s important for the writer to wait for previous readers to finish, since otherwise the readers might see the new value instead of the old value they were supposed to see. If the writer doesn’t wait for readers to finish, it’s also possible for the writer to cause a torn read by overwriting a value while a reader is only halfway done reading it.

Finally, in the case of Write-After-Write, it’s important that the final value of the data reflects the second write instead of the first. If this ordering isn’t enforced, subsequent reads will see the old value (from the first write) instead of the new value (from the second write). Therefore, a writer should wait for previous writers to finish before doing its own write.

Now that we’ve identified all possible data hazards that can cause race conditions, we can more formally identify race conditions in multi-threaded code. Awareness of these data hazards will allow us to design algorithms that avoid them when possible. Furthermore, when we can’t avoid a data hazard, we’ll know that we need to use some kind of synchronization to make sure reads and writes happen in the right order.

Designing Data-Parallel Algorithms

With the knowledge of data hazards, we can now think about designing algorithms that avoid them. This section reviews a variety of methods useful for designing data-parallel algorithms.

Independent Uniform Work

Consider first the simplest case: a loop in which every iteration performs the same amount of work, independently of the other iterations — for example, applying some function to every element of an array. Running such code in parallel is fairly easy. Since no memory is shared between the different iterations of the loop, no synchronization is necessary, and we can simply distribute the “n” iterations equally among the hardware’s available resources for ideal hardware usage. We can parallelize such a loop among the CPU’s cores trivially with an API like OpenMP.
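For example, here’s a sketch that squares every element of an array (the loop body is illustrative, since the original listing isn’t shown):

```cpp
#include <vector>

// Independent uniform work: each iteration touches only data[i], so no
// memory is shared between iterations and no synchronization is needed.
void square_all(std::vector<float>& data) {
    const int n = static_cast<int>(data.size());
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        data[i] = data[i] * data[i];
    }
}
```

Without OpenMP enabled, the pragma is simply ignored and the loop runs serially, producing the same result.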

By the way, OpenMP is a C/C++ compiler extension that lets you parallelize code by annotating it with pragmas. OpenMP is built into most C++ compilers these days, including GCC, Clang, and Visual C++. You can enable it by simply passing a compiler flag on the command line, or by setting a project property in Visual Studio, so you should feel free to experiment with it.

Non-Independent Uniform Work

Let’s consider a slightly more challenging example: an algorithm that “blurs” an array. As a reminder, a basic box blur (e.g. on an image) is implemented by computing the average of a pixel and its neighbors. In this case, we’ll take the average of a float and its 2 neighbors in the 1D array (the float before and the float after).

If we wanted to blur a single float in the data array, we might do something as follows (ignoring bounds checking):

data[i] = (data[i - 1] + data[i] + data[i + 1]) / 3.0f;

The code above cannot simply be used as the body of a loop over the whole array, because the iterations are not independent from each other. The order in which the iterations are executed will change the output of the algorithm, since the “data[i]” written in one iteration will be read by its neighbors, changing the result of their computations. If we write to data[i] only after the reads from data[] have been done for all iterations, then the answer will be correct, since the writes will not interfere with the reads. We can see this problem as a Write-After-Read data hazard, since we need to wait for all the reads from data[] to complete before we do any writes to it.

The Write-After-Read data hazard happens in this example because we’re trying to read and write the same memory locations in a single step. To remove the hazard, we can use an additional memory allocation: by writing the results to a separate output array, the input array becomes read-only, and each output[i] is written by exactly one iteration. With no data hazards left, the code can easily be distributed among available hardware resources in parallel.
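A sketch of this two-array version (the OpenMP pragma is illustrative, and the edges are simply copied here so the example is self-contained):

```cpp
#include <vector>

// Blur into a separate output array: the input is read-only and each
// iteration writes only output[i], so there is no Write-After-Read hazard.
// Assumes input and output have the same size and do not overlap.
void blur(const std::vector<float>& input, std::vector<float>& output) {
    const int n = static_cast<int>(input.size());
    output[0] = input[0];         // edges handled as a special case
    output[n - 1] = input[n - 1];
    #pragma omp parallel for
    for (int i = 1; i < n - 1; i++) {
        output[i] = (input[i - 1] + input[i] + input[i + 1]) / 3.0f;
    }
}
```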

Note 1: It is assumed that the “input” and “output” arrays are not overlapping in memory. For example, if both pointers pointed to the same memory, we would have the exact same problem as before.

Note 2: For readability, the bounds-checking for the beginning and end of the array is omitted. If necessary, this could be fixed by handling the first and last iteration as a special case, or by making the data[] array bigger by adding one element before the beginning and one after the end.

Ping-ponging

To take the blur example one step further, let’s consider how we can make a “blurrier” result by blurring the results of the blur. In other words, we want to run the blur algorithm multiple times on the array in a loop. We can easily implement this using the blur function defined previously, by alternating which buffer is used as input and which buffer is used as output.
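A sketch of the ping-pong loop (blur_once here is the single-pass blur from the previous section, reproduced so the example is self-contained):

```cpp
#include <vector>
#include <utility>

// Single-pass blur from the previous section (edges copied as-is).
void blur_once(const std::vector<float>& in, std::vector<float>& out) {
    const int n = static_cast<int>(in.size());
    out[0] = in[0];
    out[n - 1] = in[n - 1];
    for (int i = 1; i < n - 1; i++)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}

// Blur num_passes times by ping-ponging between two buffers.
std::vector<float> blur_many(std::vector<float> data, int num_passes) {
    std::vector<float> other(data.size());
    for (int pass = 0; pass < num_passes; pass++) {
        blur_once(data, other);  // read from one buffer, write to the other
        std::swap(data, other);  // swap buffer roles for the next pass
    }
    return data;
}
```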

This code alternates between reading and writing between two buffers: after each pass, the buffer being read from and the buffer being written to are swapped. This is known as “ping-ponging”, because the data bounces back and forth between the two buffers like a ping-pong ball bounces between players at a ping-pong table. 🙂

Reduction

In the next example, we’ll be looking at how to implement a reduction. A reduction is an operation that turns a set of values into a single value, for example computing the minimum element of an array. Computing the minimum of an array sequentially without any multi-threading is pretty darn easy: keep a running minimum, and compare every element against it.
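For reference, a minimal sequential version might look like this:

```cpp
#include <vector>

// Sequential reduction: a single running minimum is read and possibly
// written by every iteration of the loop. Assumes data is non-empty.
float array_min(const std::vector<float>& data) {
    float curr_minimum = data[0];
    for (std::size_t i = 1; i < data.size(); i++) {
        if (data[i] < curr_minimum)
            curr_minimum = data[i];
    }
    return curr_minimum;
}
```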

In this sequential algorithm, “curr_minimum” is potentially read and written by every iteration of the loop, meaning there exists a data hazard between each iteration. This data hazard prevents us from safely running the loop in parallel. In order to find the minimum in parallel, we can instead implement the algorithm in multiple passes. The big idea is to split the array into many smaller arrays, then find the minimum of each of these smaller arrays in parallel. Once the minimums of the smaller arrays are computed, we can find the minimum of these minimums. This subdivision of the work can be applied repeatedly in order to implement a highly parallel minimum algorithm.
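A sketch of this multi-pass approach, where each pass takes the minimum of pairs from the previous pass (the OpenMP pragma marks the inner loop, whose iterations are independent):

```cpp
#include <vector>
#include <algorithm>
#include <utility>

// Multi-pass reduction: each pass halves the array by taking pairwise
// minimums, until a single value remains. Assumes data is non-empty.
float array_min_parallel(std::vector<float> data) {
    while (data.size() > 1) {
        std::vector<float> next((data.size() + 1) / 2);
        #pragma omp parallel for
        for (int i = 0; i < static_cast<int>(next.size()); i++) {
            std::size_t a = 2 * static_cast<std::size_t>(i);
            std::size_t b = a + 1;
            // Odd leftover element at the end is carried through unchanged.
            next[i] = (b < data.size()) ? std::min(data[a], data[b]) : data[a];
        }
        data = std::move(next);
    }
    return data[0];
}
```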

With this scheme, the computation shrinks by half with each pass, since each pass computes the minimum of pairs of inputs from the previous pass. The number of passes required can be lowered if each iteration of the inner loop computes the minimum of more than 2 values, so you can make a tradeoff between the number of passes and the amount of serial work per pass. This is something you need to tweak based on the hardware you’re working with and the size of your input.

Conclusion/To Be Continued

In this post, the motivation for designing data-parallel algorithms was established, based on the idea that we need to write software that can scale in performance as more hardware (cores, ALUs, etc) is added to processors. From there, we explored the concept of data hazards, showing how they are an obstacle to the implementation of data-parallel algorithms. Finally, a few common ways to implement data-parallel algorithms have been introduced.

In the next post, I’ll talk about how hardware (CPUs, GPUs, etc) can be designed to enable the rapid execution of data-parallel algorithms. Once we understand how data-parallel programs work from the conceptual side and from the hardware side, we’ll be able to talk about designing programming languages that allow efficient implementation of data-parallel algorithms, based on principles that can be applied to a wide variety of hardware.

Task Decomposition

In order to effectively use a multi-core CPU, code must be decomposed into tasks, which a scheduler can then distribute among CPU cores. Tasks can have dependencies on other tasks, meaning that the tasks execute in a partial order. This partial order can be represented as a graph, where each task has predecessors and successors. A task can only execute once all of its predecessors have completed.

To implement this, we can explicitly create a graph of tasks, with edges between tasks to identify predecessor/successor relationships. The problem with a general task graph system is that it’s not productive to work with. Maybe it’s okay for some purposes to have a GUI interface to build a task graph, but that’s too tedious for everyday code. Imagine if you had to explicitly create nodes and edges when implementing a single-threaded function call graph; it would be hopelessly tedious! So, how can a productive task graph interface be built?

Languages like C/C++ were originally designed for function decomposition: you split your code into functions, and call functions from other functions.
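For example (the subsystem names here are illustrative):

```cpp
// Function decomposition: the frame update is one function that calls
// smaller functions for each subsystem.
static int work_done = 0;

void update_physics()  { work_done++; /* step the simulation here */ }
void update_graphics() { work_done++; /* render the frame here */ }

void update_frame() {
    update_physics();
    update_graphics();
}
```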

This is the natural way to write C/C++ code, and this is how single-threaded game engines are written. In the task-parallel world, we want to enhance function decomposition by adding task decomposition. In other words, we structure the code as tasks that spawn other tasks.

There are benefits to matching task decomposition to function decomposition: the parallel code keeps the familiar shape of a call graph, with tasks spawning other tasks the way functions call other functions. One important difference remains: a task may not necessarily complete until its completion is explicitly waited on.

Allowing tasks to run in parallel relaxes the strict ordering of function calls, but this relaxation must be constrained to enforce a partial order on execution. The required partial order comes in 3 main forms, each more general than the last:

Task Path: 1 predecessor, 1 successor.

Task Tree: N predecessors, 1 successor.

Task Directed Acyclic Graph (DAG): N predecessors, N successors.

The Task Path can be implemented as an ordinary sequence of function calls. This may also be integrated in the task system as a sequence of continuation tasks, where each task defines its own continuation.

The Task Tree is implemented as a task that spawns a group of child tasks, then waits for them to complete. This is good for recursive divide-and-conquer algorithms, allowing tasks to be created dynamically based on the distribution of work.
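A small sketch of the task-tree pattern, using std::async as a stand-in for a real task system like TBB (the threshold and the sum operation are illustrative):

```cpp
#include <future>
#include <numeric>
#include <vector>
#include <cstddef>

// Divide-and-conquer sum as a task tree: a parent task spawns a child task
// for the left half of its range, computes the right half itself, then
// waits for the child to complete before combining the results.
long long tree_sum(const std::vector<int>& v, std::size_t begin, std::size_t end) {
    if (end - begin <= 1024) { // small ranges: do the work directly
        return std::accumulate(v.begin() + static_cast<std::ptrdiff_t>(begin),
                               v.begin() + static_cast<std::ptrdiff_t>(end), 0LL);
    }
    std::size_t mid = begin + (end - begin) / 2;
    auto left = std::async(std::launch::async,
                           [&v, begin, mid] { return tree_sum(v, begin, mid); });
    long long right = tree_sum(v, mid, end); // run one child inline
    return left.get() + right;               // wait for the spawned child
}
```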

The Task DAG is implemented by spawning tasks only when all their predecessors have completed. This can be implemented by storing a counter of incomplete predecessors in the successor task. When a predecessor task completes, it decrements the counter in each successor. The predecessor that decrements the counter to zero then spawns the successor.
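A sketch of the predecessor-counter idea, using plain std::atomic counters rather than TBB’s built-in reference counts; for simplicity, the “spawn” here is a direct call instead of handing the successor to a worker thread:

```cpp
#include <atomic>
#include <functional>
#include <vector>

// Each task stores a count of incomplete predecessors. Completing a task
// decrements the counters of its successors; whoever decrements a counter
// to zero spawns that successor.
struct Task {
    std::function<void()> run;
    std::atomic<int> remaining_predecessors{0};
    std::vector<Task*> successors;
};

void complete(Task& task) {
    task.run();
    for (Task* s : task.successors) {
        // fetch_sub returns the previous value, so a return of 1 means
        // this was the last incomplete predecessor.
        if (s->remaining_predecessors.fetch_sub(1) == 1)
            complete(*s); // a real scheduler would enqueue this on a worker
    }
}
```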

These three forms of task relationships can be implemented using Intel TBB’s task system.

The Task DAG is implemented by reusing the reference count built into TBB tasks as the number of incomplete predecessors. The predecessor tasks must decrement and check the reference count and possibly spawn each successor, as explained previously.

It seems like we have all the building blocks to build task graphs with ordinary TBB tasks, but the API is a bit too low-level. It may be possible to build a simpler interface on top of TBB that would also simplify common use cases. I would like this API to look as close as possible to an ordinary call graph, for the reasons stated at the start of this writeup. Designing this API is a topic for another post.

PS: TBB is designed for non-blocking tasks, meaning it is not suitable for performing I/O or waiting for asynchronous operating system events. With Intel TBB’s task scheduler, a blocking task results in an idle core in the best case, and a deadlock in the worst case. To handle asynchronous I/O and events, I would recommend using the Windows Threadpool task system, since its integration in the operating system helps it make intelligent scheduling decisions by knowing about the state of I/O operations and events. TBB tasks can be executed asynchronously in response to OS events using enqueued tasks.

Handling High-DPI

In this article I’ll explain how to handle “High-DPI” in games, specifically referring to the relevant Windows APIs, some Apple APIs, and the SDL library. I’ll also explain what it means to write a “DPI aware” app in layperson’s terms, which might be helpful regardless of whether you’re using SDL or not. Finally, I’ll show how you can use DPI scale in typical graphics applications, and briefly look at some advanced DXGI features for compositing scaled layers of rendering.

Background

With the increase of resolution of computer monitors, we’re seeing more and more so-called “high dpi” or “retina” displays. What this means is that the monitor has a high number of “dots per inch”, where “dot” effectively means pixel. For example, the screen of a Microsoft Surface Pro 4 is probably around the same size as an older laptop you’ve owned (12.3 inches), yet it has a resolution of 2736 by 1824, which is probably much more than that older laptop.

In other words, the number of pixels has increased, while the dimensions of the monitor have stayed the same. What does this mean for graphics development? Well, it means that your visuals have shrunk considerably. You might be assuming 96 pixels per inch, which is Windows’ default, meaning that 640 pixels are about 6.7 inches wide. On the other hand, with 267 pixels per inch (on Surface Pro 4), your 640 pixels are about 2.4 inches wide, and the text in your UI has probably become small enough to cause eyestrain just from reading it.

So, how do we address this issue? The answer is so-called “DPI awareness”. However, to understand what that means, we have to realize that the word “DPI” has become unfortunately overloaded, leading to much confusion.

High DPI: The Truths and the Lies

Let me give it to you straight: When people say “DPI”, they might mean one of two things:

The number of pixels that physically exist per inch on your monitor.

The display zoom set in the Windows control panel.

When it comes to software development, you really only have to care about the second meaning, which is mostly unrelated to the density of pixels on your specific monitor. You can tweak this display zoom by opening the control panel’s “Display Settings”, and playing with the slider shown below:

The purpose of this slider is to zoom in all apps on your computer, to make it easier to read text and click on buttons by countering the shrinking effect described earlier in this article. Now here’s the confusing part: This display zoom is a percentage, not a number of dots per inch. Again, this zoom factor isn’t about the number of pixels on your screen, it’s simply a display zoom as a percentage, and nothing more.

Reading and Interpreting the Display Zoom

Despite “DPI” being just a zoom factor, if you ask Windows for the DPI using eg. GetDpiForWindow(), you’ll get a number in dots per inch in return. You can get the display zoom that corresponds to this number by dividing it by “96”, which is the default DPI in Windows. For example, if I set the display zoom in control panel to 200%, then GetDpiForWindow() will return 192. If I divide 192 by 96, that gives me “2.0”, which is the display zoom (corresponding to 200%). As a result, I know that I should scale my UI by 2x.
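In code, the conversion is a one-liner. This sketch assumes the DPI value comes from an API like GetDpiForWindow():

```cpp
// Convert a DPI value (e.g. the return value of GetDpiForWindow() on
// Windows) into a display-zoom scale factor, where 96 DPI is the 100%
// baseline.
float dpi_to_scale(int dpi) {
    return static_cast<float>(dpi) / 96.0f;
}
```

For example, a 200% display zoom reports 192 DPI, which converts to a scale factor of 2.0.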

Your App’s “DPI Awareness”

Every app in Windows has a DPI awareness level. This was done for backwards compatibility for apps that were written before DPI was considered. If an app is not DPI aware, that means Windows will automatically scale up the app, which it does by effectively lying to your app about its size in pixels on the display. That means you might ask for a window that is 1000 pixels wide, but get one that is actually 2000 pixels wide. Your app will happily render at 1000 pixels wide, then Windows will automatically scale up your rendering to 2000 pixels.

On the other hand, if your app is DPI-aware, Windows will not lie to it about its size in pixels. If Windows says your app is 1000 pixels wide, then it’s 1000 pixels wide, period. It’s business as usual, with the only added detail that you should be trying to scale up your rendering to match the display zoom that the user requested in the control panel.

To be fully DPI aware, you should also act in response to DPI changes at runtime, which might happen if the user tweaks the slider in control panel while your program is running, or if the user drags your window from one monitor to another monitor that has a different display zoom setting.

Handling DPI in SDL

There are two main things you need to do to handle DPI in SDL. First, you should mark your app as “DPI aware”, and second, you should read the current display zoom and interpret it in your renderer.

Setting DPI Awareness

There are two main ways to set DPI awareness: through the application manifest or Info.plist (on Windows and Apple platforms, respectively), or by setting DPI awareness programmatically.

Setting the DPI awareness through a Visual Studio project is extremely easy. Open your project properties by right-clicking on your project in your solution and picking “properties”, then go to the menu shown below and set the “DPI Awareness” flag.

If you want to set the DPI awareness programmatically, I have some bad news for you. It’s currently possible to set DPI awareness programmatically on Windows, but not on Apple platforms. If that’s not a deal breaker for you, you can set DPI awareness using SetProcessDpiAwareness(), which you can call by including the header <ShellScalingAPI.h> and linking “Shcore.lib” to your project. Note that this call must be the first window-management-related call in your program, so you should probably call it at the very top of your main. The following code snippet might be helpful. You can wrap it in a function in a separate file if you want to contain the symbol pollution caused by Windows.h.
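Something along these lines might do the trick (Windows-only; SetProcessDpiAwareness and the PROCESS_PER_MONITOR_DPI_AWARE flag come from ShellScalingAPI.h):

```cpp
#include <ShellScalingAPI.h> // SetProcessDpiAwareness
#pragma comment(lib, "Shcore.lib")

// Call this before any other window-management-related code runs.
void EnableDpiAwareness()
{
    // Opt out of the OS's automatic scaling, per monitor.
    SetProcessDpiAwareness(PROCESS_PER_MONITOR_DPI_AWARE);
}
```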

Furthermore, if you’re using SDL, you should pass the SDL_WINDOW_ALLOW_HIGHDPI flag to SDL_CreateWindow, which lets you use SDL’s APIs for managing DPI. (See documentation for SDL_CreateWindow).

It’s kind of unfortunate that DPI awareness can’t be set programmatically on Apple devices, since it makes it harder to write cross-platform apps with SDL, requiring instead that custom manifests be built per platform. You should probably create proper per-platform manifests if you’re shipping a production app, but this extra work hurts productivity for small cross-platform projects. If somebody from Apple is reading this, it would be sweet if you could do something about it. 😉

Reading the Display Zoom

SDL has a function to read the current display DPI (which again, is the display zoom and nothing more.) This function is SDL_GetDisplayDPI. It reports a horizontal, vertical, and diagonal DPI. Windows guarantees the horizontal and vertical DPI are always the same, so you can basically ignore everything but the horizontal DPI.

There’s still something missing though. Again, we don’t actually want the DPI itself, we want the display zoom as a percentage (or a ratio). To do this, we have to get the DPI using SDL, then divide it by the default DPI. Unfortunately, SDL has no function to get the default DPI, and the default DPI is actually different depending on your OS. Windows’ default DPI is 96, and Apple’s default DPI is 72. For this reason, I wrap SDL_GetDisplayDPI in my own function that also returns the default DPI. Also, SDL_GetDisplayDPI can fail if DPI information is not available, so I handle that by returning the default DPI. This is implemented as follows:
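A sketch of such a wrapper might look like this (the function name is mine, and here I fold the division in and return the zoom ratio directly, falling back to 1.0 on failure):

```cpp
#include <SDL.h>

// Returns the display zoom as a ratio (1.0 = 100%), folding in the
// per-OS default DPI discussed above.
float GetDisplayZoom(int displayIndex)
{
#ifdef __APPLE__
    const float kDefaultDpi = 72.0f; // Apple's default DPI
#else
    const float kDefaultDpi = 96.0f; // Windows' default DPI
#endif

    float hdpi;
    if (SDL_GetDisplayDPI(displayIndex, NULL, &hdpi, NULL) != 0)
    {
        // DPI information unavailable: assume the default (no zoom).
        return 1.0f;
    }
    return hdpi / kDefaultDpi;
}
```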

Display Zoom for Window Creation

If you ask DPI-aware SDL to create a window sized 640 by 480, you’ll still get a tiny window, as I explained at the start of the article. Maybe that’s what you want, since it’s doing exactly what you asked. On the other hand, I personally like having an interface where I can ask for a 640 by 480 window, and it gives me a window that has the same size in inches that a 640 by 480 window would have on a screen with the default DPI.

This is mainly just convenient for me, since I’ve built up certain expectations about how big a 640 by 480 or a 1280 by 720 window should be on my screen. If you want to do this, you can get the display zoom of the monitor your window is on, then scale up your window’s size based on the monitor’s display zoom. If you assume the window will be created on the primary monitor, you might use something like the following to create your window:
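A sketch along those lines (assuming display 0 is the primary monitor and 96 is the default DPI, as on Windows):

```cpp
#include <SDL.h>

// Fetch the primary monitor's DPI; keep the default if the call fails.
float hdpi = 96.0f;
SDL_GetDisplayDPI(0, NULL, &hdpi, NULL);
float zoom = hdpi / 96.0f;

// Ask for a window that *looks* like 640x480 at the default DPI.
SDL_Window* window = SDL_CreateWindow(
    "My App",
    SDL_WINDOWPOS_CENTERED, SDL_WINDOWPOS_CENTERED,
    (int)(640 * zoom), (int)(480 * zoom),
    SDL_WINDOW_OPENGL | SDL_WINDOW_ALLOW_HIGHDPI);
```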

By the way, allow me to repeat that the DPI might change at runtime if the user plays with the display zoom in the control panel, or drags the window to another monitor that has a different display zoom setting. For this reason, you might want to poll the display DPI every frame and resize your window in response. This is kind of difficult to do properly though, since resizing the window might make it suddenly pop onto another monitor and cause some kind of infinite loop, who knows.

The “proper” way to handle DPI changes at runtime is by responding to the Windows event WM_DPICHANGED, which gives you both the new DPI and a suggested new position and size for the window. Last I checked, SDL doesn’t handle WM_DPICHANGED, which is unfortunate, and would be a good thing to patch in.

Using Display Zoom in a Renderer

After marking your app as DPI aware, you need to scale up your rendering according to the display zoom. This section explains how to apply this display zoom to typical graphics scenarios.

Display Zoom for UI

If you have some kind of UI in your app, you probably want to scale that up based on the display zoom, since the user will have trouble reading any text otherwise. You might also want to create multiple versions of the artwork in your UI at different scales, to avoid the artifacts that come from automatic scaling algorithms. The goal here is to give the UI a sharp and crisp look at any display zoom, since blurry UIs tend to look pretty bad.

If your UI is designed to have a “pixely” look, like for example ImGui, then scaling will generally look bad unless you round to an integer scale, so keep that in mind as you decide how to interpret the display zoom.

On the topic of ImGui, it sorta has DPI awareness built-in through the “io.DisplayFramebufferScale” value. This value isn’t actually interpreted at all by ImGui, it’s basically just a way for you to pass your desired DPI scale from your window management code to your rendering code. ImGui itself (and its mouse handling code) assumes it lives in a world where scaling doesn’t exist, so even if you scale up the UI in your rendering code, you still need to “unscale” mouse coordinates before passing them to ImGui. Therefore, your code to pass mouse coordinates to ImGui might look as follows:
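For example, something like this sketch (displayZoom is the zoom ratio from earlier; only the left and right mouse buttons are shown):

```cpp
#include <SDL.h>
#include "imgui.h"

// Feed the mouse to ImGui in *unscaled* coordinates.
// displayZoom is the display zoom ratio (e.g. 2.0 for 200%).
void PassMouseToImGui(float displayZoom)
{
    int mouseX, mouseY;
    Uint32 buttons = SDL_GetMouseState(&mouseX, &mouseY);

    ImGuiIO& io = ImGui::GetIO();
    // ImGui assumes a world where scaling doesn't exist,
    // so undo the display zoom before handing over the position.
    io.MousePos = ImVec2(mouseX / displayZoom, mouseY / displayZoom);
    io.MouseDown[0] = (buttons & SDL_BUTTON(SDL_BUTTON_LEFT)) != 0;
    io.MouseDown[1] = (buttons & SDL_BUTTON(SDL_BUTTON_RIGHT)) != 0;
}
```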

Display Zoom for Scene Rendering

For the best look possible, you can ignore that the monitor has a high DPI and just render your scene at the same pixel resolution as the window it’s being displayed in. This only works if your app is DPI aware, since otherwise the OS will be automatically scaling up your app and you’ll see scaling artifacts anyways. If rendering at a 1-to-1 resolution is too expensive, as is likely with any heavy rendering work on a 4K display, you might want to render your scene at a low resolution and do the scaling yourself. This might improve your frame rate, or save battery life. You might do this framebuffer scaling using glBlitFramebuffer, as shown in a side-tip in my article on Reversed-Z in OpenGL.

While it can be important to lower the number of pixels rendered on high DPI displays for performance reasons, recall that’s not what the “display zoom” slider is about. The display zoom slider is supposed to scale the visuals in the program to make them easier to see. If your program responds to display zoom changes by increasing the size of its window, that’s enough to cause your scene to occupy a greater number of square inches on the user’s screen, so mission accomplished. This is independent from whether or not you decide to do your rendering at a lower resolution than the number of pixels on the screen. If you want to scale your rendering without resizing the window, then you might want a camera zoom feature in your renderer that is independent of the display zoom slider. You might think of ways to incorporate the display zoom into your 3D camera’s logic, but that’s a topic of user interface design, and should probably be handled on a case-by-case basis.

Composing Scaled Layers in Hardware

Direct3D 11.2 allows you to compose layers at different resolutions in hardware, as documented here: Swap chain scaling and overlays (MSDN). For example, you can render a high resolution UI and a low resolution scene in separate layers, then rely on the hardware to compose these two layers together rather than doing it yourself. This might be a more efficient way to handle DPI scaling in a rendering engine, although it relies on hardware features.

I’m not sure if there’s a way to use hardware overlays in OpenGL, at least on desktop. wgl has functions for overlay layers, but they’re old and busted and have been deprecated for a long time. This might be another good reason to use WGL_NV_DX_interop2 to do your OpenGL rendering within a DXGI swap chain on Windows, since that allows you to use modern Windows presentation features without having to rewrite your OpenGL code. Although that extension is marked “NV”, it is now supported by all desktop GPU vendors.

Unfortunately, it’s not clear how to use WGL_NV_DX_interop2 properly from its documentation. I have some quick tests here if you’re interested: OpenGL-on-DXGI (GitHub). If you’re part of Microsoft or Khronos or one of the IHVs, your help might be needed here. 🙂

Conclusion

Hopefully now you know more about writing DPI aware apps, especially using Windows and/or SDL. Having a sharp resolution on high DPI screens tends to look really good, so I hope this article helps get you there. Thank you for reading!

Reversed-Z in OpenGL

Reversed-Z is a pretty good way to get more precision out of your depth buffer. Very useful for games with long view distances, like for example Just Cause 2 (as shown in a link above).

Reversed-Z visualized: Near objects have a higher Z value than far objects.

So, how to use it in OpenGL? Here’s a no-nonsense step-by-step guide.

Step One: Set Clip Space Conventions

Reversed-Z is designed for clip-space Z values in the range [0,1], not [-1,+1]. OpenGL’s default convention is [-1,+1], but you can override that using glClipControl:

glClipControl(GL_LOWER_LEFT, GL_ZERO_TO_ONE);

I recommend sticking this line of code at the start of your program, and never changing the clip conventions after that. Give up entirely on the [-1,+1] convention; it’s hands-down worse than the [0,1] convention when it comes to precision. It’s a good decision even if you’re not using Reversed-Z. Anyways, when you sample from a depth texture in OpenGL, you already get a value between 0 and 1, so switching to the [0,1] convention for clip coordinates makes everything more consistent.

glClipControl is an OpenGL 4.5 feature. If you don’t have OpenGL 4.5, it might still be available as an extension (see: hardware supporting GL_ARB_clip_control). Therefore, you could use something like the following code snippet:
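For example (this sketch assumes the glad loader’s version/extension flags; any loader that reports GL_ARB_clip_control works the same way):

```cpp
#include <glad/glad.h>
#include <cstdio>
#include <cstdlib>

void SetClipConventions()
{
    if (GLAD_GL_VERSION_4_5 || GLAD_GL_ARB_clip_control)
    {
        // Switch clip-space Z to the [0,1] convention.
        glClipControl(GL_LOWER_LEFT, GL_ZERO_TO_ONE);
    }
    else
    {
        fprintf(stderr, "glClipControl() is required for reversed-Z\n");
        exit(1);
    }
}
```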

Side tip: Copying a Framebuffer to a Window

If you didn’t know, you don’t need to create a depth buffer when you create your window. You can render your scene (and depth buffer) into an offscreen FBO, then copy to your window only at the end. For example:
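A sketch of the end-of-frame copy (assuming the scene was rendered into fbo at fboWidth × fboHeight):

```cpp
// Copy (and scale, if the sizes differ) the offscreen color buffer
// to the window's default framebuffer (framebuffer 0).
glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);
glBlitFramebuffer(
    0, 0, fboWidth, fboHeight,       // source rectangle
    0, 0, windowWidth, windowHeight, // destination rectangle
    GL_COLOR_BUFFER_BIT, GL_LINEAR);
glBindFramebuffer(GL_FRAMEBUFFER, 0);
```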

With this approach, you can also be rendering your scene at a lower resolution than your window, since the call to glBlitFramebuffer will scale up your rendering. You might find this useful if you’re on a 4K display, but still want to render at a lower resolution for improved frame rate or battery power savings. This also saves you the trouble of creating a window in SRGB mode, which takes away further complexity from your window management code.

You can also use glBlitFramebuffer to convert a multi-sampled framebuffer into a single-sampled framebuffer, but you’re not allowed to do scaling simultaneously with resolving multi-samples in a call to glBlitFramebuffer, so you might have to call glBlitFramebuffer twice: Once to resolve multisamples, and once again to copy and scale up to your window. At this point, it might become interesting to do all this through a single fragment shader instead. You can simultaneously resolve and scale if you use the GL_EXT_framebuffer_multisample_blit_scaled extension (see: hardware supporting GL_EXT_framebuffer_multisample_blit_scaled).

Anyways, back to the whole Reversed-Z thing…
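Step Two: Create a Floating Point Depth Buffer

Reversed-Z pairs with a 32-bit floating point depth buffer to get the precision benefits (see the summary at the end). A sketch of creating one for an offscreen FBO, assuming a texture-based depth attachment:

```cpp
// 32-bit floating point depth texture
GLuint depthTex;
glGenTextures(1, &depthTex);
glBindTexture(GL_TEXTURE_2D, depthTex);
glTexStorage2D(GL_TEXTURE_2D, 1, GL_DEPTH_COMPONENT32F, width, height);
glBindTexture(GL_TEXTURE_2D, 0);

// attach it to the FBO you render your scene into
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,
                       GL_TEXTURE_2D, depthTex, 0);
glBindFramebuffer(GL_FRAMEBUFFER, 0);
```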

Step Three: Clear your Depth Buffer to Zero

With a normal depth test, you might have been clearing the depth buffer to 1, since that was the “far” value. When you call glClear(GL_DEPTH_BUFFER_BIT), it clears the depth buffer using the last set value of glClearDepth, which is 1 by default. On the other hand, with Reversed-Z, the “far” value is now 0, so you have to use the proper clear depth value:
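In code, that means setting the clear depth before clearing:

```cpp
glClearDepth(0.0); // with reversed-Z, 0.0 is the "far" value
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
```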

Step Four: Flip your Depth Comparison to GREATER

As implied by its name, using Reversed-Z means that far depth values are represented by smaller numbers. That means you need to switch your glDepthFunc from GL_LESS to GL_GREATER, as shown in the code below. In this example code, I also reset the comparison and depth state back to OpenGL defaults, so the state doesn’t leak into other code that might not be doing Reversed-Z, or that might not be using depth testing.
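A sketch of that set/reset pattern:

```cpp
// reversed-Z depth testing for this pass
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_GREATER);

// ... draw the scene ...

// reset to OpenGL defaults so the state doesn't leak into other code
glDepthFunc(GL_LESS);
glDisable(GL_DEPTH_TEST);
```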

Step Five: Update your Projection Matrix

Reversed-Z requires a slightly different projection matrix than what (for example) gluPerspective creates. What follows is some code you can use to create this new projection matrix. It uses the “right-handed” coordinate system convention, meaning that the Z axis points out of your computer screen. Alternatively, that means the objects in front of your camera have a negative Z value in view space. This is the OpenGL convention.
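A sketch of this matrix using glm (an infinite-far-plane, reversed-Z, right-handed perspective projection; note the column-major constructor):

```cpp
#include <cmath>
#include <glm/glm.hpp>

// Perspective projection with reversed-Z and an infinite far plane,
// right-handed convention (-Z points into the scene in view space).
glm::mat4 MakeInfReversedZProjRH(float fovY_radians, float aspectWbyH, float zNear)
{
    float f = 1.0f / std::tan(fovY_radians / 2.0f);
    return glm::mat4(
        f / aspectWbyH, 0.0f,  0.0f,  0.0f,
        0.0f,           f,     0.0f,  0.0f,
        0.0f,           0.0f,  0.0f, -1.0f,
        0.0f,           0.0f,  zNear, 0.0f);
}
```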

Note that this matrix doesn’t only reverse the Z, it also sets the far plane to infinity, which works well for extremely large view distances. You can find the derivation of this matrix in the following article: http://dev.theomader.com/depth-precision/

Note that I’m referring to the OpenGL Mathematics library “glm”, which by OpenGL convention uses column-major order for the arguments of its constructor. That means the first 4 inputs to the constructor are actually the first column of the matrix. Might be important if you’re translating to a different matrix library.

That’s it?

That’s it!

In Summary

Set your depth clip conventions to [0,1] using glClipControl.

Create and use a floating point depth buffer.

Clear your depth buffer to 0 instead of the default of 1.

Use GL_GREATER instead of GL_LESS for your depth test.

Use a projection matrix that flips the depth.

What about DirectX?

There’s almost no difference in implementing this in DirectX. First, you don’t need glClipControl because the [0,1] convention is already the default in DirectX. Second, if you’re following DirectX conventions and using a “left-handed” convention for the view, then just turn the “-1” into “+1” in the projection matrix. You also need to do this in OpenGL if you’re using left-handed conventions, meaning that the camera’s Z axis goes into your screen, or equivalently that you consider objects to be in front of the camera if their Z value in view space is positive.

Lately I’ve been writing lots of OpenGL programs for course projects and for-fun rendering side-projects. I’ve started to notice an overall design that works pretty well for these kinds of projects, and I’m going to document it here.

This design matches closely the idea of “Model-view-viewmodel” aka. “MVVM”. If you’re unfamiliar with this term, be aware that the words “model” and “view” here aren’t talking about 3D models or cameras or transformation matrices. MVVM is an object-oriented architectural pattern, similar to MVC (“Model view controller”). MVVM maps to the design of a renderer as follows:

Model: The scene which the renderer should render.

View: The renderer that traverses the scene and displays it.

View model: The data the renderer uses internally to render the scene.

Another important inspiration of the design in this article is the relational database. While I don’t use an actual relational database like SQL in my projects, I think the principles of relational databases are useful to reason formally about designing the model in the MVVM, so I’ll talk a bit about that in this article too.

In my projects, the three parts of MVVM are split into two C++ classes: “Scene”, and “Renderer”. In short, the Scene stores the data to render, and the Renderer reads the Scene to render it. In this article I’ll first talk about how I design the Scene, and then I’ll talk about how I design the Renderer.

Scene

The “Scene” class is very bare-bones: it contains no member functions whatsoever. Treat this class like a database: it has a bunch of tables, the tables contain data that can refer to other tables in the database, you can read/write the objects in the tables, and you can insert or erase elements from tables.

In a typical renderer, you might have a table of all the materials, a table of all the meshes, a table of all the instances of meshes, a table of all the cameras, and so on. The specifics of these tables depend on your project; if you’re doing a skeletal animation project, for example, you would also have tables of skeletons, instances of skeletons, and the like.

I’ve found that designing the schema of the database is a big part of the requirements gathering phase of a project. Once you nail the design of the schema, the code you have to write to manage your data becomes more obvious (“I just need to hook up all this information”). Later modifications to your code can be made by figuring out how your schema needs to change, which is easier since the relationships between your objects are laid bare together in one place.

In practice, I implement the tables of the database as C++ containers. This allows me to use containers with the access and update patterns I need for the project I’m working on. For example, I might want the objects in a table to be contiguous in memory, to insert/delete objects in arbitrary order, or to efficiently look up an object by its name.

The design of the Scene class involves many concepts from relational database design. Let’s take some time to consider how primary keys, foreign keys, and indexing apply to the design of the Scene.

Primary Keys and Foreign Keys

Tables usually have a “primary key”, which identifies each object in the table uniquely and can also be used to efficiently look up the object. When an object in one table wants to refer to an object in another table, it can do so by keeping a copy of the referred object’s primary key, a concept known as a “foreign key”.

An example of this situation is a mesh instance keeping a reference to the mesh associated to it, and keeping a reference to the transform used to place the instance in the world. Each mesh instance would then have a foreign key to the mesh table, and a foreign key to the transform table.

Primary keys and foreign keys can be as simple as integer IDs, though some care is required if you want to avoid foreign keys becoming invalid when modifications are done on the table they refer to. For example, if you decide to use a std::vector as the container for a table, erasing an element from the vector might cause objects to get shifted around, which will invalidate foreign keys referring to them.

In order to support insertion and deletion, I use a container I call “packed_freelist”, based on bitsquid’s ID lookup table. This container gives each object an ID that won’t change as elements are inserted and deleted, and allows efficient lookup of objects from their ID. The packed_freelist also automatically packs all objects in it to be contiguous in memory, which is desirable for efficient traversal.

Another way to implement foreign keys is to update the foreign keys whenever the table they point to is modified, which might be automated with a more sophisticated scene database system that is aware of foreign key relationships. On the simpler side of the spectrum, if you know that a table will never change after initialization, or if you only support insertion through push_back(), you can be sure that your foreign keys will never be invalidated, with zero extra effort. You might also be able to use a C++ container that supports inserting/erasing elements without invalidating pointers, like std::set, then use pointers to those elements to refer to other tables.

There exists a lot of formalism around primary keys and foreign keys in the world of relational databases, which is worth exploring further if you’re interested in the limits of this concept.

Indexing

It’s nice to have primary keys and all, but sometimes you also want to look up objects based on other attributes. For example, you might want to look up all meshes with the name “crate”. In the worst case, you can iterate through all the meshes looking for them, which is not a big deal if you do this search infrequently. However, if you do this search often, you might want to accelerate it.

In database-speak, accelerating lookups is done by building an index on a table. You can store the table itself in a way that allows efficient lookups, like sorting all meshes by their name, which would allow binary searching. Storing the table itself in a sorted order is known as building a clustered index. Naturally, you can only have one clustered index per table, since you can’t keep a table sorted by two unrelated things at the same time. By default, your tables should support efficient lookup from their primary key.

If you want many indexes on a table, you can store an acceleration structure outside the table. For example, I could store a multimap alongside the table of meshes that lets me look up mesh IDs by name in logarithmic time. The trouble is that every index needs to be kept in sync with the table, meaning that every additional non-clustered index adds further performance cost to insertions and deletions.

The Code

To give you an idea of what this looks like, I’ll show you some examples of Scene setups in my projects, with varying amounts of adherence to the principles I’ve explained so far in this article.

Exhibit A: “DoF”

I did a project recently where I wanted to implement real-time depth-of-field using summed area tables. The code for scene.h is at the bottom of this section of this blog post.

If you look at the code, you’ll see I was lazy and basically just hooked everything together using packed_freelists. I also defined two utility functions for adding stuff to the Scene, namely a function to add a mesh to the scene from an obj file, and a function to add a new instance of a mesh. The functions return the primary keys of the objects they added, which lets me make further modifications to the object after adding it, like setting the transform of a mesh instance after adding it.

You can identify the foreign keys by the uint32_t “ID” variables, such as MeshID and TransformID. Beyond maybe overusing packed_freelist, the relationships between the tables are mostly obvious to navigate. You can also see I was lazy and didn’t implement removing elements, nor write a destructor that cleans up the OpenGL objects.

Not deleting OpenGL objects is something I do often out of laziness, and it barely matters because everything gets garbage collected when you destroy the OpenGL context anyways, so I would actually recommend that approach for dealing with any OpenGL resources that stay around for the lifetime of your whole program. If you want to be a stickler for correctness, I’ve had success using shared_ptr, as follows:
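A sketch of the idea (the custom deleter runs glDeleteBuffers when the last reference drops; MakeSharedBuffer is an illustrative name):

```cpp
#include <memory>

// Tie an OpenGL buffer's lifetime to a shared_ptr.
std::shared_ptr<GLuint> MakeSharedBuffer()
{
    GLuint* pBuffer = new GLuint(0);
    glGenBuffers(1, pBuffer);
    return std::shared_ptr<GLuint>(pBuffer, [](GLuint* p) {
        glDeleteBuffers(1, p); // runs when the last shared_ptr dies
        delete p;
    });
}
```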

A small extension to this is to wrap the shared_ptr in a small class that makes get() return a GLuint, and adds a new member function “get_address_of()” that returns a GLuint*. You can then use get() for your normal OpenGL calls, and get_address_of() for functions like glGenBuffers and glDeleteBuffers. This is very similar to how ComPtr is used in D3D programs (with a very similar interface: Get() and GetAddressOf()).

This shared_ptr thing is not the most efficient approach, but it’s straightforward standard C++, and it saves you the trouble of defining a bunch of RAII wrapper classes for OpenGL objects. I find such RAII wrapper classes are usually leaky abstractions on top of OpenGL anyways, and they also force me to figure out how somebody else’s wrapper classes work instead of being able to reuse my existing OpenGL knowledge to write plain reusable standard OpenGL code.

I should point out that since my Scene class is implemented only in terms of plain OpenGL objects, it’s easy for anybody to write an OpenGL renderer that reads the contents of the tables and renders them in whatever way they see fit. One thing you might notice is that I also supply a ready-built Vertex Array Object with each mesh, which means that rendering a mesh is as simple as binding the VAO and making a glDrawElements call. Note that you don’t have to bind any vertex buffers or the index buffer, since OpenGL’s Vertex Array Object automatically keeps a reference to the vertex buffers and index buffer that were attached to it when I loaded the model (it’s true!). For a better explanation of how vertex array objects work, check out my Vertex Specification Catastrophe slides.

The pinch is that OpenGL’s Vertex Array Object defines a mapping between vertex buffers and vertex shader attributes. I don’t use glGetAttribLocation to get the attribute locations, since I want to keep the mesh decoupled from the shaders used to render it. Instead, I use a convention for the attribute locations, established through the “preamble.glsl” file described in my article on GLSL Shader Live-Reloading, which lets me share an automatic convention for attribute locations between my C++ code and all my GLSL shaders.

As an isolated example of how to load a mesh (using tinyobjloader), I’ll leave you with some example code. I hope this helps you understand how to use the Vertex Array Object API, which is tragically poorly designed.
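The tinyobjloader parsing itself is elided here; what follows is a sketch of just the VAO setup, which is the part with the non-obvious binding rules (assume positions and indices were already filled from the .obj):

```cpp
#include <vector>

// positions/indices would come from tinyobjloader in the real code
extern std::vector<float> positions;      // xyz triples
extern std::vector<unsigned int> indices; // triangle indices

GLuint positionVBO, indexEBO, vao;

glGenBuffers(1, &positionVBO);
glBindBuffer(GL_ARRAY_BUFFER, positionVBO);
glBufferData(GL_ARRAY_BUFFER, positions.size() * sizeof(float),
             positions.data(), GL_STATIC_DRAW);
glBindBuffer(GL_ARRAY_BUFFER, 0);

glGenBuffers(1, &indexEBO);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, indexEBO);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, indices.size() * sizeof(unsigned int),
             indices.data(), GL_STATIC_DRAW);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0);

glGenVertexArrays(1, &vao);
glBindVertexArray(vao);

// glVertexAttribPointer captures the *currently bound* GL_ARRAY_BUFFER
// into the VAO; location 0 = "position" by project-wide convention.
glBindBuffer(GL_ARRAY_BUFFER, positionVBO);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, NULL);
glBindBuffer(GL_ARRAY_BUFFER, 0); // safe: the attribute kept its reference
glEnableVertexAttribArray(0);

// the GL_ELEMENT_ARRAY_BUFFER binding is stored *inside* the VAO,
// so bind it while the VAO is bound (and don't unbind it before the VAO!)
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, indexEBO);

glBindVertexArray(0);
```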

Exhibit B: “fictional-doodle”

This was a course project I did with my friend Leon Senft. We wanted to load a 3D mesh and be able to animate it in two modes: skeletal animation, and ragdoll physics. This project is way bigger than the DoF project, and it was also a much earlier project before I had the proper separation of MVVM in mind. Still, from a coding perspective, the project went very smoothly. Leon and I designed the schema together, and once we figured out how we were going to interface between the animation and the ragdoll physics systems, we were able to easily work on the project in parallel.

To decide how the two animation systems should communicate, we sketched a feedback loop flowchart that explained the order in which the two systems read and wrote their data at every frame, which mapped in a straightforward way to our code since the flowchart’s reads and writes corresponded to reads and writes of our scene database’s data. Below is the flowchart copied verbatim from the OneNote document we scratched it up in.

As you can see in the code at the bottom of this section, this schema has a lot more tables. You can also see some non-clustered indexes used to look up textures by filename, using unordered_map. We used std::vector for most of the tables, mainly because I hadn’t developed the packed_freelist class yet, and because random insertion/deletion wasn’t a requirement we had anyways.

Again, this project worked with OpenGL in terms of plain OpenGL GLuints, and we didn’t go through the trouble of arbitrarily abstracting all that stuff. This project also used shader live-reloading, but it was through a method less polished than my newer ShaderSet design. Leon was working on a Mac, so we had to deal with OpenGL 4.1, which meant we had to use glGetUniformLocation because GL 4.1 doesn’t allow shader-specified uniform locations. This made live-reloading and shader input location conventions a lot more tedious than with newer GL versions.

Speaking of Mac support, it’s really painful that Mac’s GL doesn’t support GL_ARB_debug_output (standard in GL 4.3), which gives you an automatic callback whenever a GL error happens. Instead, you have to call glGetError() after every API call, which is really tedious. To fix this, we wrote our own OpenGL header that wraps each OpenGL call we used inside a class that defines operator() and automatically calls glGetError() after every call. You can see our implementation here. It’s a bit disgusting, but it worked out well for us. I think it could be improved by wrapping the function pointers at the place where SDL_GL_GetProcAddress is called, which would make the interface more transparent.

Renderer

The whole point of the Scene was to store the data of the scene you want to render, without having to specify exactly how it should be rendered. This gives you the freedom of being able to implement different renderers for the same scene data, which is useful for when you’re making a bunch of different OpenGL projects. You can easily implement a new rendering technique without even having to touch the scene loading code, which is nice.

My implementation of Renderer usually has three member functions: Init(), Resize(width, height), and Render(Scene*). Init() does the one-time setup of renderer data, Resize() deletes and recreates all textures that need to be sized according to the screen, and Render() traverses the Scene and renders the data inside it.

I usually begin the program by calling Init() followed by an initial Resize(), which creates all the screen-sized textures for the first time. The code inside Resize() deletes and recreates all textures when the window gets resized, and since deleting a null texture is a no-op, it’s safe to call even on that first invocation right after Init(). After that, Render() is just called every frame to render the scene. Before calling Render(), I might also call an Update() method that does any necessary simulation work, like if I want to run some game logic. You can also be lazy and write that simulation update code in Render(), whatever makes sense.

Your renderer probably has to maintain some data that is used to implement whatever specific rendering algorithm you have in mind. This data is what constitutes the viewmodel. I usually don’t separate the viewmodel into its own class, but I guess you could if you somehow find that useful. The “view” itself is basically the Render() function, which uses the model and viewmodel to produce a frame of video.

Below is an example of part of the Resize() function inside the “DoF” project’s renderer (which you can see here: https://github.com/nlguillemot/dof/blob/master/viewer/renderer.cpp). I’m deleting and recreating a color and depth texture that need to be sized according to the window, and I also recreate the FBO used to render with them. As you can see, I’m still sticking to my guns about writing plain standard OpenGL code.
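A sketch of what that looks like (the member names here are illustrative, not the project’s exact code):

```cpp
void Renderer::Resize(int width, int height)
{
    // deleting name 0 is a no-op, so this is safe on the first call too
    glDeleteTextures(1, &mColorTex);
    glGenTextures(1, &mColorTex);
    glBindTexture(GL_TEXTURE_2D, mColorTex);
    glTexStorage2D(GL_TEXTURE_2D, 1, GL_SRGB8_ALPHA8, width, height);
    glBindTexture(GL_TEXTURE_2D, 0);

    glDeleteTextures(1, &mDepthTex);
    glGenTextures(1, &mDepthTex);
    glBindTexture(GL_TEXTURE_2D, mDepthTex);
    glTexStorage2D(GL_TEXTURE_2D, 1, GL_DEPTH_COMPONENT32F, width, height);
    glBindTexture(GL_TEXTURE_2D, 0);

    glDeleteFramebuffers(1, &mSceneFBO);
    glGenFramebuffers(1, &mSceneFBO);
    glBindFramebuffer(GL_FRAMEBUFFER, mSceneFBO);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, mColorTex, 0);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,
                           GL_TEXTURE_2D, mDepthTex, 0);
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
}
```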

The code I write inside the Scene loading and inside Init() and Resize() tries to be careful about binding. I always set my bindings back to zero after I’m done working with an object, since I don’t want to leak any state. This is especially important for stuff like Vertex Array Objects, which interacts in highly non-obvious ways with the buffer bindings (see the mesh loading example code earlier in this article.)

When writing code in the Render() function, keeping your state clean becomes more difficult, since you’re probably going to set a bunch of OpenGL pipeline state to get the effects you want. One approach to this problem is to glGet the old state, set your state, do your stuff, then set the old state back. This is maybe a best-effort approach for OpenGL middleware, but I really hate having to type all the glGet stuff. For this reason, I generally write code that assumes the pipeline is in its default state, and I make sure to re-set the pipeline to its default state at the end of my work. This leaves me with no surprises. Notice that this is yet another place where I try to express everything in terms of plain standard OpenGL code, since that makes it easy to see which pipeline state is set where, and to verify just by reading that all the state in a block of code is set and reset properly.

As an example of Render(), consider the following bit of code that renders all the models in the scene:
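The original listing isn’t reproduced here, but the shape of such a pass is roughly the following sketch. The data types are hypothetical, and the GL entry points are replaced by stand-ins that record pipeline state, so the set/reset pairing is visible and checkable.

```cpp
#include <cassert>

// Stand-ins for the GL entry points that record the current pipeline state,
// so this sketch is self-contained. In the real renderer these come from
// the GL loader.
typedef unsigned int GLuint;
enum { GL_DEPTH_TEST = 1, GL_CULL_FACE = 2, GL_FRAMEBUFFER = 3,
       GL_TRIANGLES = 4, GL_UNSIGNED_INT = 5 };
static GLuint gProgram = 0, gVAO = 0, gFBO = 0;
static bool gDepthTest = false, gCullFace = false;
static void glUseProgram(GLuint p) { gProgram = p; }
static void glBindVertexArray(GLuint v) { gVAO = v; }
static void glBindFramebuffer(GLuint, GLuint f) { gFBO = f; }
static void glEnable(GLuint cap) { (cap == GL_DEPTH_TEST ? gDepthTest : gCullFace) = true; }
static void glDisable(GLuint cap) { (cap == GL_DEPTH_TEST ? gDepthTest : gCullFace) = false; }
static void glViewport(int, int, int, int) {}
static void glDrawElements(GLuint, int, GLuint, const void*) {}

// Hypothetical per-draw data prepared by the Scene/viewmodel.
struct DrawItem { GLuint vao; int indexCount; };

static void RenderModels(GLuint sceneFBO, GLuint program,
                         const DrawItem* items, int numItems,
                         int width, int height) {
    // Set all the state this pass needs, assuming the pipeline starts at defaults.
    glBindFramebuffer(GL_FRAMEBUFFER, sceneFBO);
    glViewport(0, 0, width, height); // convention: set viewport right after the FBO
    glEnable(GL_DEPTH_TEST);
    glEnable(GL_CULL_FACE);
    glUseProgram(program);

    for (int i = 0; i < numItems; i++) {
        glBindVertexArray(items[i].vao);
        glDrawElements(GL_TRIANGLES, items[i].indexCount, GL_UNSIGNED_INT, nullptr);
    }

    // Reset everything back to the default state, in one obvious place.
    glBindVertexArray(0);
    glUseProgram(0);
    glDisable(GL_CULL_FACE);
    glDisable(GL_DEPTH_TEST);
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
}
```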

As you can see, I set the state at the start of this rendering pass, then reset it back to its default at the end. Since it’s just one block of code, I can easily double-check that there’s an associated reset for every set. There are also some shortcuts available; for example, “glBindTextures(0, 32, NULL);” unbinds all textures.

There are still some parts of OpenGL state which I don’t bother to reset. First, I don’t reset the viewport, since the viewport’s default value is mostly arbitrary anyways (the size of the window). Instead, I adopt the convention that one should always set the viewport immediately after setting a framebuffer. I’ve seen too many bugs caused by people forgetting to set the viewport before rendering, so I’d rather make it a coding convention not to assume what the viewport is. There is also no default viewport in D3D12, so this convention makes sense moving forward. Another example is the clear color and clear depth, which I always set to the desired values before calling glClear (again, you have to do this in newer APIs too, including GL 4.5’s DSA).

Another example of state I don’t bother to reset is the uniforms I set on my programs. OpenGL does remember what uniforms have been set on a program, but I choose to have the convention that one should not rely on this caching. Instead, you should by convention set every uniform a program has between calling glUseProgram() and glDraw*(). D3D12 also forces you to re-upload all your root parameters (uniforms) every time you switch root signatures (the program’s interface for bindings), so this convention, too, makes sense moving forward.

There also exist more automatic solutions to preventing binding leakage. If you’re able to define your rendering code as a series of nested loops (see below), then you can be sure that you’re setting all necessary state before each further nested loop, and you can reset the state all together in one place. Another way to design these nested loops is using sort-based draw call bucketing, which you can use to automatically iterate over all the draw calls in your program in an order that respects which state changes are expensive and which ones aren’t. Since that design also loops over draw calls in order of state changes, you can again have some guarantees about the lifetime of your state changes, so you can reset any state at a place that produces the least waste. This was approximately the design we used in fictional-doodle’s renderer, which you can find here: https://github.com/nlguillemot/fictional-doodle/blob/master/renderer.cpp#L86

Conclusions and Future Work

The details of all this stuff are not super important, and the amount of engineering effort you should invest depends on the scale of your projects. Since I’m usually just implementing graphics techniques in isolation, I don’t have to worry too much about highly polished interfaces. I’ve enjoyed the benefits of being able to reuse Scene code, and I’ve been able to quickly write new Renderers by copying and pasting another one for some base code. My projects that separate the Scene and Renderer have so far been enjoyable, straightforward, and productive to work on, so I consider this design successful.

One place where this starts to crack is a project that implements many rendering techniques at once, like one that compares a bunch of different techniques side by side. You can probably get away with stuffing everything into the Scene and Renderer, but there’s a certain threshold past which you can no longer keep the whole project in your mind at the same time, and it becomes important to limit the set of data that a given function can access in order to better understand the code. You might be able to get away with grouping the model or viewmodel data into different structs; whatever helps keep you sane.

Another problem which I haven’t addressed much is multi-threading. OpenGL projects generally don’t worry much about multi-threading, since the OpenGL API abstracts almost all that stuff away from you. This abstraction is partly what makes OpenGL a good API for prototyping rendering techniques, but in the future I’d like to apply this design to newer APIs like Vulkan and Direct3D 12. I suspect that I’ll have to identify which data varies “per-frame” (in order to have multiple frames in flight), while using some encapsulation to prevent one frame from reading the data that belongs to another frame, or to avoid global state in general. I also suspect that the design of Render() will have to change to allow multi-threaded evaluation (to write commands faster using CPU parallelism); I’m currently looking at TBB flow graph for a solution to that.

Appendix A: Transformation Hierarchies

It’s commonly desired to have a hierarchy of transforms to render a scene, sometimes called a “scene graph”. In practice, you shouldn’t think too hard about the whole scene graph thing, since that mindset has a lot of problems. If you do want to implement a tree of transforms, I would recommend keeping the links between transforms separate from the transforms themselves. For example, you might have a table for all the transforms, a table for mesh instances, and a table for meshes. Each mesh instance then keeps a foreign key to a transform and mesh that represent it. To support relative transformations, I suggest creating a new table that stores the edges of the transformation tree, meaning each object in the table stores the foreign keys of two transforms: The parent and the child.

The goal of this design is to compute the absolute transform of each object at each frame, based on the relative transforms (in the transforms table) and the edges between transforms (in the transform edge table). To do this, make a copy of the local transforms array, then iterate over the transform edges in partial order, sorted by the depth in the tree of each edge’s child. By iterating over the edges in this order to propagate transforms, you’ll always update the parent of a node before its children, meaning the transforms will propagate properly from the top to the bottom of the tree. You can then use these absolute transforms as model matrices to render your objects.
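A minimal sketch of this propagation, with a toy translation-only “transform” standing in for a full matrix type, and hypothetical table layouts:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy translation-only "transform" standing in for a full matrix type.
struct Transform { float x, y, z; };

// Composes a child's local transform with its parent's absolute transform.
static Transform Combine(const Transform& parent, const Transform& child) {
    return { parent.x + child.x, parent.y + child.y, parent.z + child.z };
}

// One edge of the transformation tree: foreign keys into the transforms
// table, plus the child's depth in the tree for ordering the propagation.
struct TransformEdge { int parent; int child; int childDepth; };

// Computes absolute transforms by propagating along the edges, visiting
// edges in order of the child's depth so a parent is always updated
// before its children.
static std::vector<Transform> PropagateTransforms(
    const std::vector<Transform>& locals, std::vector<TransformEdge> edges) {
    std::sort(edges.begin(), edges.end(),
              [](const TransformEdge& a, const TransformEdge& b) {
                  return a.childDepth < b.childDepth;
              });

    std::vector<Transform> absolute = locals; // copy of the local transforms
    for (const TransformEdge& e : edges)
        absolute[e.child] = Combine(absolute[e.parent], absolute[e.child]);
    return absolute;
}
```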

By keeping the edges between the transforms in their own table, the cost of the propagation is proportional to the number of edges, rather than the number of nodes. This is nice if you have fewer edges than nodes, which is likely if your tree is mostly flat. What I mean is that scene tree designs can easily end up with one root node with a bazillion children, which is just useless, and is avoided in this design. This also keeps the data structure for your scene nice and flat, without a recursive scene node data structure (which is what many scene graph designs end up with).

Then again, I didn’t apply this in the examples earlier in this article, and everything still turned out ok. Pick the design that makes sense for your workload.

Appendix B: User Interface

You probably want some amount of GUI to work with the renderer. Hands down, the easiest solution is to use Dear ImGui. You can easily integrate it into your OpenGL project by copy-pasting the code from one of their example programs. ImGui is used in the dof program mentioned earlier in this article, so you can consult how I used it there too if you’re interested.

Working with cameras is also important for 3D programming of course. Programming a camera is kinda tedious, so I have some ready-made libraries for that too. You can use flythrough_camera to have a camera that lets you fly around the 3D world, and you can use arcball_camera to have a camera that lets you rotate around the scene. Both of these libraries are a single C file you can just add to your project, and you use them by calling an immediate-mode style update function once per frame. These libraries tend to integrate naturally.

Appendix C: Loading OpenGL Functions

One of my pet peeves with typical OpenGL programs is that people use horribly over-engineered libraries like GLEW in order to dynamically link OpenGL functions. You can do something way simpler if you only need to use Core GL and the few extra nonstandard functions that are actually useful. What I did was write a script that reads the contents of Khronos’ official glcorearb.h header and produces my own opengl.h and opengl.cpp files. The header declares extern function pointers for all OpenGL entry points, and opengl.cpp uses SDL’s SDL_GL_GetProcAddress function to load all the functions.

To use it, you just have to add those two files to your project, call OpenGL_Init() once after creating your OpenGL context, and you’re good to go. No need to link an external library (like GLEW), and no need to deal with all of GLEW’s cruft. You can find the files here: opengl.h and opengl.cpp. You can easily modify them by hand if you want to adapt them to GLFW. Since these files are so simple, they’re much more “hackable”, allowing you to do things like wrapping calls with glGetError checks (as was done in the “fictional-doodle” project, as I explained earlier).

As a bonus, this header also defines GLDrawElementsIndirectCommand and GLDrawArraysIndirectCommand, which are not in the standard GL headers but exist only in the GL documentation for indirect rendering. You can use these structs to pass around draw arguments (perhaps storing them once up front in your Scene) or use them in glDrawIndirect calls.
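For reference, here’s a sketch of those structs following the layouts documented in the GL reference pages for glDrawElementsIndirect and glDrawArraysIndirect (the field names in the actual header may differ slightly):

```cpp
#include <cassert>
#include <cstdint>

// Argument layout for glDrawElementsIndirect, per the GL reference pages.
struct GLDrawElementsIndirectCommand {
    uint32_t count;         // number of indices
    uint32_t instanceCount; // number of instances
    uint32_t firstIndex;    // offset into the index buffer, in indices
    int32_t  baseVertex;    // constant added to each index
    uint32_t baseInstance;  // first instance ID
};

// Argument layout for glDrawArraysIndirect, per the GL reference pages.
struct GLDrawArraysIndirectCommand {
    uint32_t count;         // number of vertices
    uint32_t instanceCount; // number of instances
    uint32_t first;         // first vertex
    uint32_t baseInstance;  // first instance ID
};
```

Since every field is a 4-byte integer, arrays of these structs can be written directly into a GL_DRAW_INDIRECT_BUFFER with no padding concerns.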

TBB Flow Graph Nodes

I wanted to design a multi-threaded DX12 rendering pipeline, and saw that Intel’s TBB Flow Graph might make things a lot easier for me. I felt that the documentation for the different flow graph nodes was poorly organized, which made it difficult to understand what all the different flow graph nodes do at a high level. For this reason, I wrote this blog post as a summary that puts it all in one place.

This blog post isn’t a full tutorial on TBB Flow Graph. You should read the Intel-provided documentation first. This blog post is mostly a summary in my own words of the official documentation.

If there are mistakes in my interpretation of the official documentation, or if there are important details missing, please tell me about it so I can correct it. I’ve left a few “unanswered question” bullet points in this article for information that seems lacking from the official documentation. If you have a good answer to these questions, please let me know.

Without further ado, let’s begin the summary.

Functional Nodes

The functional nodes are source_node, function_node, continue_node, and multifunction_node. These nodes have the purpose of doing computations on their inputs (if it has any) and passing outputs to their successors.

The source_node has no predecessors. It repeatedly evaluates its body and outputs the results. The source_node is always serial, so its body will never be called concurrently. If no successor accepts a message, the message is put in a buffer, and it will be re-sent if a successor is added or a successor calls try_get().

The source_node keeps outputting items until its body returns false.

Unanswered question: can it be reactivated once the body returns false?

The function_node receives an input, applies the body’s function to it, and broadcasts its output to its successors.

The function_node’s maximum concurrency can be specified in its constructor. Setting the concurrency to “tbb::flow::serial” means the body will never be executed concurrently, while setting it to “tbb::flow::unlimited” puts no limit on the body’s concurrency. The concurrency can also be set to a specific number.

If the function_node’s concurrency is already maximized and it receives yet another input, the behavior depends on its graph_buffer_policy template argument. If the buffer policy is “queueing”, then extra inputs will be put on an internal unbounded input buffer until they can be processed. Alternatively, if the policy is “rejecting”, then the function_node will reject the extra inputs, and will instead pull from its predecessors when it becomes ready to handle more inputs. The internal buffer’s allocator can be customized.

The continue_node waits for its predecessors to finish before starting its successors. It counts how many times try_put() was called on it, and when the counter reaches a target value its body is executed and its result is broadcast to its successors. By default, the target value is the number of predecessors.

Despite having N predecessors, the continue_node receives no inputs from them. Emphasis: the continue_node is not a join_node. It receives no input from its predecessors. However, the continue_node still has the ability to broadcast its output to successors.


continue_msg is an empty struct you can use as a placeholder type, like when you just want to kick off a successor. Its use is not limited to continue_node, though one might be inclined to think so, which is why I put a note of it here.

The multifunction_node is similar to the function_node, with the main difference that it has multiple output ports. The multifunction_node’s body can try_put() to many of its output ports in one invocation. The outputs are represented as a tuple of output ports, which allows you to hook up output ports to successors, and allows you to decide which ports to send outputs to from within the body.

Note: “multifunction_node” used to be called “multioutput_function_node”.

Buffering Nodes

Buffering nodes allow you to pass data from one node to another using a specific buffering scheme.

The buffer_node contains internal storage that stores the messages it receives. The allocator for the buffer can be customized.

Buffered messages are forwarded to the successors in an arbitrary order. Each message is sent to only a single successor, and this successor is chosen by trying to send the message to each successor in the order they were registered, until a successor accepts the message.

Successors can reserve a message from the buffer, which gives the successor ownership of that message without removing it from the buffer. The successor can then either release ownership (leaving the message in the buffer for another successor to grab), or consume the message (removing it from the buffer.) While a message is reserved, other messages can still be added or removed from the buffer.

The priority_queue_node is similar to the buffer_node. The difference is that the messages are forwarded in order of largest priority first, as defined by a Compare template argument (defaulting to std::less).

Unlike the buffer_node, if a successor reserves a message from the priority queue, no other reservations can be made (since this would violate the priority queue’s order.) However, messages can still be pushed to the priority queue while it is reserved.

Unanswered question: What happens if you reserve a priority queue, then push a message that makes its way to the front of the queue?

The sequencer_node is similar to the priority_queue_node, except that messages are sorted in sequence order. Sequence numbers are of type size_t, and the sequence number of a given message is determined by a user-specified function.

The overwrite_node contains a buffer big enough for only a single message. When the value is set, the new value is broadcast to all successors, but (unlike other buffer nodes) the value doesn’t get removed from the buffer when it gets sent to successors. Successors can still call try_get() and receive the same value. The node’s buffer is considered invalid until it gets set for the first time.

The write_once_node is similar to the overwrite_node. However, the single value in the buffer cannot be overwritten. If you try to put while a value is already in the buffer, the put is simply ignored. If you want to change the value in the buffer, you have to explicitly clear() it first, then you can put a new value.

Split/Join Nodes

Split/join nodes have the sole purpose of combining or separating streams. They either convert separate messages into tuples of messages, or the reverse.

The join_node receives multiple inputs, packs them into a tuple, then forwards that tuple to its successors. There are a few different policies for the join’s buffering, explained below. The buffering policy is set with a template argument.

Queueing Join

With a “queueing” join policy, the join node is given an unbounded FIFO queue for each input port. When each input port’s FIFO has at least one item in it, the inputs are removed from the queue, packed into a tuple, and broadcasted to its successors.

Reserving Join

With a “reserving” join policy, the need for a FIFO is removed by using a “pull” approach rather than a “push” approach (as done in the queueing join.) When a predecessor puts to an input, the reserving join node marks that input as “potentially available”. When all inputs have been marked potentially available, the reserving join will try to reserve each input to see if they are still actually available. If not all inputs could be successfully reserved, all inputs are released.

If all inputs were successfully reserved, then they are packed into a tuple and broadcasted to successors. If at least one successor accepts the tuple, the input reservations are consumed. Otherwise, the inputs get released.

Since reservation is based on a pull approach, something weird happens: By design, the first push to a reserving join input port will always fail. Instead, this causes the port to switch to pull mode, which will then try_reserve the input from the predecessor. If all inputs can be reserved, they are try_consumed. For these reasons, you typically want the inputs to a reserving join to be buffer nodes (which support pulling).

Key Matching (and Tag Matching) Join

With a “key_matching” join policy, a user-provided hash function is applied to the messages to get their associated key, and the key/value pair is stored in a hash table. When a given key exists in a key/value pair in all input ports simultaneously, the key/value pairs are removed from the hash table, and their values are packed into a tuple which is broadcasted to the successors. If no successors accept the tuple, it is saved and will be forwarded on a subsequent try_get().

Unanswered question: What buffering scheme is used for the stored tuples? A buffer_node, or a queue_node, or what?

The “tag_matching” policy is a specialization of key_matching. The definition is simple:

typedef key_matching<tag_value> tag_matching;

The “tag_value” type is a simple typedef for uint64_t, so this just means you have to use a hash function that returns a uint64_t. That’s all.

The “indexer_node” receives inputs from multiple input ports, and broadcasts them to all successors. Unlike join nodes, the indexer node does not wait for all input ports to have a value. Instead, a message is broadcasted whenever any input port receives one. The successors receive a tagged union (aka variant) that says which of the indexer’s input ports the value came from, and allows you to retrieve the value of the message.

Note: “indexer_node” was previously called “or_node”.

Other Nodes

The TBB designers put some miscellaneous node types into this category.

The “limiter_node” is similar to a broadcast_node, but it can only output a limited number of messages. Outputted messages only count toward this threshold if the message is accepted by at least one successor. Once the maximum number of messages has been outputted, the limiter no longer accepts inputs.

The counter can be decremented by calling try_put() on the limiter_node’s “decrement” member object. The decrementer is an object which receives continue_msg as an input. The decrementer then makes its parent limiter reduce its output count by 1, and attempts to get a message from a predecessor then forward it to its successors. If the forwarding is successful, its output count will be increased by 1, bringing it back to its original value. If the forwarding is not successful, then the output count will have been reduced by 1.

The “async_node” is designed to communicate with systems outside TBB. The body of the async node receives a “gateway_type” as input, which you can pass to an external system. The gateway interface lets your external system try_put() to the async node’s successors. Also, the gateway interface lets you reserve and release. Reserving/releasing tells the flow graph that async work has started and ended (respectively). This lets the flow graph know when a wait_for_all() on this node should finish.

The maximum concurrency and the buffering policy of this node can be configured.

The “composite_node” allows you to wrap multiple nodes into a single node with defined inputs and outputs. This is done by calling set_external_ports(), which lets you specify the types of the inputs and outputs of your composite node.

The “streaming_node” is pretty special. It’s designed for streaming the submission of kernels and data to devices through queues, and is used to integrate heterogeneous systems with streaming hardware (like GPUs) into TBB flow graphs. Using it requires some configuration in order to properly pass data and kernels to the device. Intel provides such a prepackaged configuration in “opencl_node”; see this overview.

Things Objects Are and Are Not For

There’s no shortage of people saying object-orientation doesn’t work, but it’s not always easy to understand why. Complaints are often summarized in broad statements like “it produces ugly code” and “the performance is bad”. If these statements are true, why bother with object-orientation at all? In this post, I try to answer this question by drawing a line between good uses and bad uses of object-orientation.

Things Objects are For

Applying object-oriented design has many advantages in terms of software development process improvements. Here’s a few that come to mind.

Objects are for Communication

Humans understand the world in terms of objects. Whether you’re an engineer or a marketing specialist or a QA tester, the relationships between entities are easy to express in terms of objects. This is especially the case in simulations, like most video games. For example, saying that an “Actor” has a “Weapon” that is a “Bazooka” that shoots “Bullets” that are “RPG missiles” is easy for anybody to understand, whether or not they have a technical background.

Objects are for Specification

Since it’s possible to easily convey concepts in an object-oriented way, it’s easy to formally specify the expected behavior of systems this way. Don’t worry about modeling every last detail of your design; you just need to create a high-level specification that engineers, managers, your customers, and your QA team can understand.

The actual implementation doesn’t need to match your specification in terms of code, it just has to have the same behavior at the end of the day. Somebody should be able to look at the specification and the product, then confirm if that feature exists or doesn’t. This allows your team and customer to be aligned on the requirements.

Engineers should know roughly what code corresponds to which part of the specification, even if the code itself isn’t programmed in an object-oriented way.

Objects are for Classification

By splitting objects into classes and packages, we’re able to create meaningful subdivisions of the work involved in a software engineering project.

As a project manager, you can organize the engineering tasks required to finish the product in groups of related classes, and you can assign them to different people so they can work in parallel. Organizing tasks means you can ship faster.

Alternatively, you can measure which classifications of objects have the highest rates of defects, which allows you to make intelligent business decisions about where to allocate expensive QA efforts to maximally improve product quality. This means you ship a better product for cheaper.

Objects are for Relations

By understanding the relationships between objects, we can understand dependencies within the system. This makes it possible to organize the order in which tasks need to be done, and the order in which code can be tested.

This aspect can be taken to the logical conclusion of organizing your data using a relational database schema, using a high-level language like DDL. By having well-specified relationships, it becomes possible to automatically maintain them using a database management system. I see this as an industry-proven formalization of data-oriented programming.

Objects are for Isolation

Object-oriented design is fundamentally about software components that communicate with each other through messages. By isolating different components, it becomes possible to create high-level systems without having to worry about the details. Humans can only focus on a limited amount of information at a time, so properly isolating components makes it easier to be productive. Of course, it’s also possible to over-isolate things, which leads to complications of its own.

Isolation might be for political reasons, in the sense that interfaces separate the work of different teams and companies. Isolation might also be a result of the real world constraints, like when different distributed computer systems are physically separated and communicate through network messages.

Things Objects are Not For

Given all the good things about object-oriented design, it’s tempting to go overboard and apply it even to problems where it’s clearly not useful. Here are some examples of when applying object-orientation will lower the quality of your product.

Objects are Not For Implementation

After creating a design using object-orientation, you should rarely write code that corresponds one-to-one to your design using object-oriented language features. As a software engineer, you need to translate human requirements into computer requirements, since things that make sense for humans don’t necessarily make sense for computers.

You don’t have to hand-code everything in assembly or avoid the Java language entirely. Focus on writing code that solves the computing requirements of your customers. For example, what are the inputs, the outputs, the transformations that need to be done? What data needs to be moved where?

Objects are Not For Indirection

Object-oriented languages usually allow you to use inheritance in order to create specialized functionality for different objects. At the end of the day, this boils down to syntactic sugar for the creation and use of virtual function pointer tables, which is one way of implementing an indirection on functions.

Implementing virtual function pointer tables by hand (like you have to in C) is actually pretty tedious, so automation by the language is convenient. However, you have to consider if a virtual function pointer table is really the appropriate solution to implement the desired indirection. Many design patterns are workarounds for Java not having function pointers, so don’t confuse these workarounds as universal solutions.

You can often get away with implementing indirection in a much simpler way. For example, you could store the type of an object in an enum and switch behavior based on it. Although such alternatives are not always more flexible, consider the technical cost of increasing flexibility: The more flexible your classes become, the harder it becomes to understand what the code really does, which increases complexity. Overly flexible systems tend to become so abstract that nothing really “does” anything, and by focusing on writing code that solves your customers’ computing requirements, you can avoid these over-generalizations.
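As a toy sketch of the enum-and-switch alternative (the Shape types here are hypothetical):

```cpp
#include <cassert>
#include <vector>

// Instead of a virtual Area() on an abstract Shape base class, store the
// type in an enum and switch on it. All shapes live in one flat array.
enum class ShapeType { Circle, Square };

struct Shape {
    ShapeType type;
    float size; // radius for circles, side length for squares
};

static float Area(const Shape& s) {
    switch (s.type) {
    case ShapeType::Circle: return 3.14159265f * s.size * s.size;
    case ShapeType::Square: return s.size * s.size;
    }
    return 0.0f;
}

static float TotalArea(const std::vector<Shape>& shapes) {
    float total = 0.0f;
    for (const Shape& s : shapes)
        total += Area(s);
    return total;
}
```

With this layout, adding a new operation over all shapes is one new function with one switch, and the dispatch is visible right there in the code instead of hidden behind a vtable.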

Even if you are using an object-oriented programming language, try to see through the abstraction its object system provides in order to write simpler and more efficient code.

Objects are Not For Allocation

By Java’s conventions, each object in your system is dynamically allocated and garbage collected. This likely doesn’t represent the allocation pattern your software system really has, since you will likely operate on collections of objects that benefit from having their allocations grouped together to facilitate memory management and improve cache coherency.

By better understanding the allocation patterns of your software system, you’ll probably find key points where allocations are made, like system initialization or user session startup. With that understanding, you’ll be able to write code with simpler resource ownership relationships, make better choices of data structures and memory layouts, and consequently get better performance.

Even in a garbage collected language, grouping objects in large allocations means less work for the garbage collector, so applying this principle is likely a win no matter what kind of memory management system your language uses.
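A minimal sketch of this grouping (the ParticleSystem here is hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A particle with plain-old-data fields.
struct Particle { float x, y, z, vx, vy, vz; };

// Instead of allocating each Particle individually (object-per-allocation
// style), the whole collection is one contiguous allocation, created at a
// known point (e.g. system init) and freed all at once.
class ParticleSystem {
public:
    explicit ParticleSystem(std::size_t count) : mParticles(count) {}

    std::size_t Count() const { return mParticles.size(); }
    Particle& At(std::size_t i) { return mParticles[i]; }

    void Step(float dt) {
        // One contiguous traversal: simple ownership, cache-friendly access.
        for (Particle& p : mParticles) {
            p.x += p.vx * dt;
            p.y += p.vy * dt;
            p.z += p.vz * dt;
        }
    }

private:
    std::vector<Particle> mParticles; // one allocation for all particles
};
```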

Objects are Not For Parallelization

You rarely have “just one” of something, so working at the granularity of collections of objects rather than single objects can have many advantages in terms of parallel processing.

For example, it’s possible to leverage SIMD instruction sets to perform many fine-grained computations in parallel, but this requires that your objects are arranged contiguously in memory, and it usually also requires a struct-of-arrays memory layout rather than array-of-structs. If you want to take advantage of SIMD, you need to stop thinking about objects and start thinking about organizing data. Clever compilers may automatically apply SIMD optimizations to your loops, but they don’t have the power to dramatically change the layout of your data, so the compiler needs your help in this respect.
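A minimal sketch of the two layouts (the particle data here is hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structs: each x is interleaved with unrelated fields, which
// blocks contiguous SIMD loads of all the x components at once.
struct ParticleAoS { float x, y, z, mass; };

// Struct-of-arrays: each component is contiguous, so a vectorizing
// compiler can process many particles per SIMD instruction.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};

static void Integrate(ParticlesSoA& p, const std::vector<float>& vx, float dt) {
    // A simple contiguous loop over one component: trivially auto-vectorizable.
    for (std::size_t i = 0; i < p.x.size(); i++)
        p.x[i] += vx[i] * dt;
}
```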

When doing large batches of parallel work, memory transfers become an important bottleneck in terms of the overall performance of your code. In this respect, it’s useful to consider the cost of streaming your data in and out of memory as you traverse the dataset and dispatch parallel computations on it. If your bottleneck lies in memory transfers but you have extra computation power to spare, you can further improve the performance of your algorithm by building acceleration structures or compressing your stream of data. Either way, the hard problems to solve here are related to data and algorithms, not related to organizing objects.

In Conclusion

Object-orientation is useful for design work, but that doesn’t mean your implementation also needs to be object-oriented. This could be summarized as the difference between object-oriented design and object-oriented programming.

Object-orientation is a natural method of communication between humans. However, when it comes to programming, focus on writing code that solves your requirements in terms of a computer system rather than an abstract object system.