What is the cost of changing state?

Programmers are supposed to have a fairly good idea of the cost of certain operations: for example, the cost of a CPU instruction, the cost of an L1, L2, or L3 cache miss, or the cost of a load-hit-store (LHS).

When it comes to graphics, I realize I have little to no idea what these costs are. I have in mind that if we order state changes from cheapest to most expensive, they go something like:

1. Shader uniform change.
2. Active vertex buffer change.
3. Active texture unit change.
4. Active shader program change.
5. Active frame buffer change.

But that is a very rough rule of thumb; it might not even be correct, and I have no idea what the orders of magnitude are. If we try to put units on it (nanoseconds, clock cycles, or instruction counts), how much are we talking about?
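To make the ordering concrete: this kind of cost ranking is exactly what render-queue sorting exploits. If draws are sorted so that the most expensive state (frame buffer, then shader, then texture, then vertex buffer) changes least often, the cheap state is free to vary per draw. A minimal sketch in Python (the draw-call fields and cost ordering here are illustrative, not measured):

```python
from collections import namedtuple

# A toy draw call: each field is a piece of bound state.
Draw = namedtuple("Draw", "fbo shader texture vbo uniforms")

def sort_key(draw):
    # Most expensive state first: draws with equal prefixes end up
    # adjacent, so the expensive bindings change as rarely as possible.
    return (draw.fbo, draw.shader, draw.texture, draw.vbo)

def count_state_changes(draws):
    """Count how many times each piece of state changes across the list."""
    changes = {"fbo": 0, "shader": 0, "texture": 0, "vbo": 0}
    prev = None
    for d in draws:
        if prev is not None:
            for field in changes:
                if getattr(d, field) != getattr(prev, field):
                    changes[field] += 1
        prev = d
    return changes

draws = [
    Draw(fbo=0, shader="lit", texture="brick", vbo=1, uniforms=...),
    Draw(fbo=0, shader="unlit", texture="sky", vbo=2, uniforms=...),
    Draw(fbo=0, shader="lit", texture="brick", vbo=1, uniforms=...),
    Draw(fbo=0, shader="unlit", texture="sky", vbo=2, uniforms=...),
]

print(count_state_changes(draws))                        # unsorted: 3 changes each
print(count_state_changes(sorted(draws, key=sort_key)))  # sorted: 1 change each
```

The sort doesn't reduce the number of draws, only how often the expensive bindings flip between them.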

Answers

20

The most data I've seen on the relative expense of various state changes is from Cass Everitt and John McDonald's talk on reducing OpenGL API overhead from January 2014. Their talk included this slide (at 31:55):

The talk doesn't give any more info on how they measured this (or even whether they're measuring CPU or GPU cost, or both!). But at least it dovetails with the conventional wisdom: render target and shader program changes are the most expensive, uniform updates the least, with vertex buffers and texture changes somewhere in the middle. The rest of their talk also has a lot of interesting wisdom about reducing state-change overhead.

Nathan Reed

Posted 2015-08-05T03:23:24.500


I am selecting this answer since it gives orders of magnitude, which is the closest so far to what I asked, even though the mentioned source doesn't give much explanation. – Julien Guertault – 2015-08-09T23:42:35.030

18

The actual cost of any particular state change varies with so many factors that a general answer is nigh impossible.

First, every state change can potentially have both a CPU-side cost and a GPU-side cost. The CPU cost may, depending on your driver and graphics API, be paid entirely on the main thread or partially on a background thread.

Second, the GPU cost may depend on the amount of work in flight. Modern GPUs are very pipelined and love to get lots of work in flight at once, and the biggest slowdown you can get is from stalling the pipeline so that everything that's currently in flight must retire before the state changes. What can cause a pipeline stall? Well, it depends on your GPU!

The thing you actually need to know to understand the performance here is: what do the driver and GPU need to do to process your state change? This of course depends on your GPU, and on details that hardware vendors often don't share publicly. However, there are some general principles.

GPUs are generally split into a frontend and a backend. The frontend handles the stream of commands generated by the driver, while the backend does all the real work. As I said before, the backend loves to have lots of work in flight, but it needs somewhere to store information about that work (perhaps filled in by the frontend). If you kick off enough small batches and use up all the silicon dedicated to tracking that work, the frontend will have to stall even if there is plenty of unused horsepower sitting around. So a first principle: the more state changes (and small draws), the more likely you are to starve the GPU backend.
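One common mitigation for this is to merge adjacent tiny draws that share all their state into a single larger (for example, instanced) draw before submission, so the frontend sees fewer, bigger commands. A hypothetical sketch, with draws represented as `(state, instance_count)` pairs purely for illustration:

```python
def merge_batches(draws):
    """Merge consecutive draws that share state into one larger draw.

    Each draw is a (state, instance_count) pair; adjacent draws with
    identical state are collapsed, so the GPU frontend tracks fewer
    in-flight batches.
    """
    merged = []
    for state, count in draws:
        if merged and merged[-1][0] == state:
            # Same state as the previous batch: grow it instead of
            # emitting a new command.
            merged[-1] = (state, merged[-1][1] + count)
        else:
            merged.append((state, count))
    return merged

draws = [("rock", 1)] * 500 + [("tree", 1)] * 300
print(merge_batches(draws))  # [('rock', 500), ('tree', 300)]
```

Eight hundred one-instance draws become two commands; the per-batch tracking cost drops accordingly.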

While a draw is actually being processed, you're basically just running shader programs, which perform memory accesses to fetch your uniforms, vertex buffer data, and textures, as well as the control structures that tell the shader units where your vertex buffers and textures live. The GPU has caches in front of those memory accesses too. So whenever you throw new uniforms or new texture/buffer bindings at the GPU, it will likely suffer a cache miss the first time it reads them. Another principle: most state changes will cause a GPU cache miss. (This matters most when you manage constant buffers yourself: if you keep constant buffers the same between draws, they are more likely to stay in the GPU's cache.)
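The same principle motivates redundant-bind filtering on the application side: a binding that is already in place costs nothing to "set" again if you simply skip the call. A minimal sketch, assuming a hypothetical `backend_bind` callback standing in for the real API call:

```python
class StateCache:
    """Skip state changes that would rebind what is already bound.

    Filtering redundant binds avoids both the CPU-side driver work and
    the GPU-side cache misses a genuinely new binding would cause.
    """
    def __init__(self, backend_bind):
        self._bound = {}              # slot -> currently bound resource
        self._backend_bind = backend_bind

    def bind(self, slot, resource):
        if self._bound.get(slot) == resource:
            return False              # redundant: no API call issued
        self._backend_bind(slot, resource)
        self._bound[slot] = resource
        return True                   # real state change

calls = []
cache = StateCache(lambda slot, res: calls.append((slot, res)))
cache.bind(0, "brick")   # issued
cache.bind(0, "brick")   # filtered out
cache.bind(0, "sky")     # issued
print(calls)             # [(0, 'brick'), (0, 'sky')]
```

Many drivers do some of this filtering themselves, but doing it in the application avoids even the call overhead.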

A big part of the cost for state changes for shader resources is the CPU side. Whenever you set a new constant buffer, the driver is most likely copying the contents of that constant buffer into a command stream for the GPU. If you set a single uniform, the driver is very likely turning that into a big constant buffer behind your back, so it has to go look up the offset for that uniform in the constant buffer, copy the value in, then mark the constant buffer as dirty so it can get copied into the command stream before the next draw call. If you bind a new texture or vertex buffer, the driver is probably copying a control structure for that resource around. Also, if you're using a discrete GPU on a multitasking OS, the driver needs to track every resource you use and when you start using it so that the kernel's GPU memory manager can guarantee that the memory for that resource is resident in the GPU's VRAM when the draw happens. Principle: state changes make the driver shuffle memory around to generate a minimal command stream for the GPU.
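That uniform-to-constant-buffer path can be sketched as a toy model. This is not any real driver's implementation, just an illustration of the lookup/copy/dirty-flag dance described above:

```python
import struct

class UniformBlock:
    """Toy model of how a driver might stage individual uniform updates.

    Each uniform write only touches a CPU-side shadow copy and marks the
    block dirty; the whole block is copied into the command stream once,
    at draw time, and only if something actually changed.
    """
    def __init__(self, layout):
        # layout: uniform name -> byte offset (floats only, for brevity)
        self.layout = layout
        self.shadow = bytearray(4 * len(layout))
        self.dirty = False

    def set_uniform(self, name, value):
        offset = self.layout[name]                    # offset lookup
        struct.pack_into("<f", self.shadow, offset, value)
        self.dirty = True                             # mark block dirty

    def flush(self, command_stream):
        if self.dirty:                                # copy once per draw
            command_stream.append(bytes(self.shadow))
            self.dirty = False

stream = []
block = UniformBlock({"u_time": 0, "u_scale": 4})
block.set_uniform("u_time", 1.5)
block.set_uniform("u_scale", 2.0)
block.flush(stream)   # one copy covers both updates
block.flush(stream)   # block is clean: nothing appended
print(len(stream))    # 1
```

Note that two `set_uniform` calls produce a single copy into the command stream, which is why batching uniform updates between draws is cheaper than it might look.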

When you change the current shader, you're probably causing a GPU cache miss (they have an instruction cache too!). In principle, the CPU work should be limited to putting a new command in the command stream saying "use the shader." In reality, though, there's a whole mess of shader compilation to deal with. GPU drivers very often lazily compile shaders, even if you've created the shader ahead of time. More relevant to this topic, though, some states are not supported natively by the GPU hardware and are instead compiled into the shader program. One popular example is vertex formats: these may be compiled into the vertex shader instead of being separate state on the chip. So if you use vertex formats that you haven't used with a particular vertex shader before, you may now be paying a bunch of CPU cost to patch the shader and copy the shader program up to the GPU. Additionally, the driver and shader compiler may conspire to do all sorts of things to optimize the execution of the shader program. This might mean optimizing the memory layout of your uniforms and resource control structures so that they are nicely packed into adjacent memory or shader registers. So when you change shaders, it may cause the driver to look at everything you have already bound to the pipeline and repack it into an entirely different format for the new shader, and then copy that into the command stream. Principle: changing shaders can cause a lot of CPU memory shuffling.
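The practical consequence of that lazy, state-dependent compilation is the shader-variant cache pattern: the first draw with a new shader/vertex-format pair can hitch while the driver patches and recompiles, and later draws reuse the result. A sketch under those assumptions (the `compile_fn` callback is hypothetical):

```python
class ShaderVariantCache:
    """Cache compiled shader variants keyed by (shader, vertex format).

    Models why the first use of a new shader/format pair is expensive
    (the driver must patch and recompile) while later uses are cheap.
    """
    def __init__(self, compile_fn):
        self._compile = compile_fn
        self._cache = {}

    def get(self, shader, vertex_format):
        key = (shader, vertex_format)
        if key not in self._cache:               # first use: compile
            self._cache[key] = self._compile(shader, vertex_format)
        return self._cache[key]                  # later uses: lookup

compiles = []
cache = ShaderVariantCache(
    lambda s, f: compiles.append((s, f)) or f"{s}+{f}")
cache.get("lit", "pos_uv")      # compiles
cache.get("lit", "pos_uv")      # cached, no compile
cache.get("lit", "pos_uv_n")    # new vertex format: compiles again
print(compiles)                 # [('lit', 'pos_uv'), ('lit', 'pos_uv_n')]
```

This is also why engines often "pre-warm" shaders by drawing each shader/format combination once during loading, moving the hitch off the critical path.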

Frame buffer changes are probably the most implementation-dependent, but are generally pretty expensive on the GPU. Your GPU may not be able to handle multiple draw calls to different render targets at the same time, so it may need to stall the pipeline between those two draw calls. It may need to flush caches so that the render target can be read later. It may need to resolve work that it has postponed during the drawing. (It is very common to be accumulating a separate data structure along with depth buffers, MSAA render targets, and more. This may need to be finalized when you switch away from that render target. If you are on a GPU that is tile-based, like many mobile GPUs, a fairly large amount of actual shading work might need to be flushed when you switch away from a frame buffer.) Principle: changing render targets is expensive on the GPU.

I'm sure that's all very confusing, and unfortunately it's hard to get very specific because the details are often not public, but I hope it's a half-decent overview of some of the things that actually happen when you call a state-changing function in your favorite graphics API.