When you're playing your favorite game on Dolphin with a powerful computer, things should run fairly well. The game is running full speed, there are no graphical glitches, and you can use your favorite controller if you want. Yet, every time you go to a new area, or load a new effect, there's a very slight but noticeable "stutter." You turn off the framelimiter to check and your computer can run the game at well over full speed. What's going on?

The slowdown when loading new areas, effects, models, and more is commonly referred to as "Shader Compilation Stuttering," by users and developers alike. This problem has been a part of Dolphin since the very beginning, but has only recently become more of a focus.

When games barely ran at all, a little stutter here and there wasn't a big deal. Though emulation has improved to near perfection in many titles, the stuttering has remained the same over the years. Since the release of Dolphin 4.0, users have actually complained about shader compilation stuttering at an increasing rate. While some of this may be due to increased GPU requirements from integer math, the bigger cause is that the stuttering sticks out more now that there are fewer serious issues otherwise.

There was some frustration and even antipathy from the developers toward shader compilation stuttering. It was something that was deemed unfixable and was garnering a lot of ill will and frustration within the community. Ironically, we hated the stuttering as much as anyone else, but the sheer insanity of the task was enough to keep most developers away. Despite this, some still privately held onto a glimmer of hope. It started out as a theory that had a chance of working. A theory that would take hundreds, if not thousands, of person-hours just to see if it was possible.

That hope is what fueled an arduous journey against seemingly impossible odds. A journey that would take multiple GPU engineers across two years. All in an effort to emulate the full range of the GameCube/Wii's proto-programmable pipeline without falling victim to this pesky stuttering.

Modern GPUs are incredibly flexible, but this flexibility comes at a cost - they are insanely complicated. To unlock this power, developers use shaders - programs that the GPU runs just like a CPU runs an application - to program the GPU to perform effects and complex rendering techniques. Devs write code in a shader language from an API (such as OpenGL) and a shader compiler in the video driver translates that code into binaries that your PC's GPU can run. This compiling takes processing power and time to complete, so modern PC games usually get around this by compiling shaders during periods in which framerate doesn't matter, such as load times. Due to the number of different PC GPUs out there, it's impossible for PC games to pre-compile their shaders for a specific GPU, and the only way to get shaders to run on specific PC hardware is for the video drivers to compile them at some point in the game.

Flipper, the GameCube GPU, is the largest chip on the motherboard. Image Credit: Anandtech

Consoles are very different. When you know the precise hardware you are going to run the game on, and you know that the hardware will never change, you can pre-compile GPU programs and just include them on the disc, giving your game faster load times and more consistent performance. This is especially important on older consoles, which may not have enough memory to hold shaders, or may not be capable of storing them at all. Flipper, the GameCube GPU, falls into the latter group.

While it has some fixed-function parts, Flipper features a programmable TEV (Texture EnVironment) unit that can be configured to perform a huge variety of effects and rendering techniques - much the same way that pixel shaders do. In fact, the TEV unit has very similar capabilities to the DirectX 8 pixel shaders of the Xbox! It was so flexible and powerful that Flipper was reused as the Wii GPU (redubbed Hollywood) with few modifications. Unfortunately for us though, the TEV unit is designed for the game to configure and run TEV configurations immediately when an effect is needed. There is no preloading of the TEV configurations whatsoever, since the TEV unit doesn't have the memory for that.

That instantaneous loading is the source of all our problems. Dolphin has to translate each Flipper/Hollywood configuration that a game uses into a specialized shader that current GPUs can run, and shaders have to be compiled, which takes time. But the TEV unit doesn't have the ability to store configurations, so GC/Wii games must configure it to render an effect the instant it is needed, without any delay or notice. To deal with this disparity, Dolphin's only option is to delay the CPU thread while the GPU thread and the video driver perform the compilation - essentially pausing the emulated GC/Wii. Usually the compilation will take place in under a frame and users will be none the wiser, but when it takes longer than a frame, the game will visibly stop until the compilation is complete. This is shader compilation stuttering. Typically a stutter only lasts a couple of frames, but on really demanding scenes with multiple compiling shaders, stutters of over a second are possible.
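To make the stall concrete, here is a minimal Python sketch of the arithmetic behind it. It is not Dolphin's code: `FRAME_BUDGET_MS`, `stalled_frames`, and the compile times are assumptions chosen for illustration, and it simplifies by treating all compilation in a frame as one blocking wait.

```python
import math

FRAME_BUDGET_MS = 1000 / 60  # one emulated frame at 60 FPS (about 16.7 ms)

def stalled_frames(compile_times_ms):
    """Extra frames the emulated console is paused while every shader needed
    this frame is compiled synchronously (hypothetical timings)."""
    total_ms = sum(compile_times_ms)
    # Work that fits inside the current frame is invisible; the rest is a stall.
    return max(0, math.ceil(total_ms / FRAME_BUDGET_MS) - 1)

print(stalled_frames([10]))             # fits inside one frame
print(stalled_frames([40]))             # a short hitch of a couple of frames
print(stalled_frames([400, 500, 305]))  # several shaders at once: over a second
```

The first call models the common case where nobody notices; the last models a demanding scene where several new configurations land on the same frame.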

Until a shader cache has built up, Metroid Prime 3 is quite painful.

As the first emulator to emulate a system with a highly programmable GPU at full speed, Dolphin has had to go it alone at tackling this problem. We implemented shader caching so if any configuration occurred a second time it would not stutter, but it would take hours of playing a game to build a reliable cache for it, and a GPU change, GPU driver update, or even going to a new Dolphin version would invalidate the cache and start the stuttering all over again. For years, it seemed like there was nothing more we could do about shader compilation stuttering, and many wondered if it would ever be solved...
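The fragility of the cache follows from what a compiled shader depends on. The sketch below is an illustration only: the `CacheKey` fields are inferred from the invalidation causes listed above, and the real on-disk layout is not described here.

```python
from typing import NamedTuple

class CacheKey(NamedTuple):
    config_uid: str       # the emulated GPU configuration
    gpu: str              # host GPU model
    driver_version: str   # host driver version
    dolphin_version: str  # emulator build that generated the shader

cache = {}

def get_shader(key):
    """Return (binary, was_cached). A miss stands in for a real driver compile."""
    if key in cache:
        return cache[key], True
    binary = f"binary-for-{key.config_uid}"  # placeholder for driver output
    cache[key] = binary
    return binary, False

old = CacheKey("water-effect", "example-gpu", "1.0", "5.0")
get_shader(old)                              # first encounter: compile, stutter
print(get_shader(old)[1])                    # second encounter: cache hit
new = old._replace(driver_version="1.1")     # same effect after a driver update
print(get_shader(new)[1])                    # miss again: the stutter is back
```

Changing any one of the host-side fields produces a key the cache has never seen, which is why hours of play could be undone by a single update.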

Of all of Dolphin's remaining issues, shader compilation stuttering is the most complained about. Whether it be on the issue trackers, forums, social media, or IRC, this problem comes up all the time. Over the years, the reaction has shifted. At first, this stuttering was ignored as a non-issue. What did it matter if there was a slight stutter here and there if games barely ran at all in the first place? Things shifted in January of 2015, when this stuttering was formally accepted as a bug on Dolphin's issue tracker, and awareness spread.

Over the past few years, we've had users ask many questions about shader stuttering, demand action, declare the emulator useless, and some even cuss developers out over the lack of attention to shader compilation stuttering. The truth is that we hated the stuttering as much as anyone else, and we had thought about the problem for many years. Tons of solutions had been pondered, some even attempted. It just didn't seem possible to fix without serious side-effects.

Dolphin is pretty fast at generating the shaders it needs, but compiling them is a problem. But, if we could somehow generate and compile shaders for every single configuration, that would solve the problem, right? Unfortunately, this is simply not possible.

There are roughly 5.64 × 10^511 potential configurations of the TEV unit alone, and we'd have to make a unique shader for each and every configuration. Vertex shaders are also used to emulate the semi-programmable Hardware Transform and Lighting unit, and this raises the number of combinations even higher.
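A quick back-of-the-envelope calculation shows the scale. Only the configuration count comes from the figure above; the compile rate is an assumption, deliberately far more generous than any real driver.

```python
# 5.64 x 10^511 is far beyond float range, so use an exact Python int.
TEV_CONFIGS = 564 * 10**509

# Assumptions for illustration only (not measurements):
SHADERS_PER_SECOND = 10**9            # a wildly optimistic compile rate
SECONDS_PER_YEAR = 365 * 24 * 3600

years = TEV_CONFIGS // (SHADERS_PER_SECOND * SECONDS_PER_YEAR)
magnitude = len(str(years)) - 1       # order of magnitude of the result
print(f"roughly 10^{magnitude} years")  # prints "roughly 10^495 years"
```

Even at a billion shaders per second, the TEV unit alone would take on the order of 10^495 years to enumerate.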

Even if we were able to compile them, these shaders would only be usable on the version of Dolphin they were generated on. Upgrading to a new build would require a new set of shaders. Other necessary occasions like upgrading your graphics card or upgrading your graphics drivers would also necessitate a recompile. And all of this relies on the driver having a low-level cache, which not all drivers do.

If we could just generate and compile shaders during loading screens and whatnot, there wouldn't be any stuttering when it mattered. Unfortunately, predicting what the game wants to do simply isn't feasible to a degree that would solve this problem. Having Dolphin try to "see ahead," either by fast-forwarding or by predicting inputs, costs far too much in performance and implementation complexity for the few situations where it could possibly help.

Blind prediction doesn't work either - a game can choose to run whatever configurations it wants without any warning, and past configurations don't tell us anything about future configurations. The only way to know what shaders a game would need would be to go through a game and find out every configuration it could possibly want.

Dolphin uses a "Unique ID" object, or "UID," to represent a configuration of the emulated GPU, and these UIDs are then turned into shader code and handed to the video driver for compilation. Because UIDs exist before compilation and have not been tailored to any specific PC GPU, they are compatible with any computer and could theoretically be shared. Users refer to this as "sharing shaders," and in theory, if users shared UID files, they could compile shaders ahead of time and not encounter stuttering. The Vulkan video backend already has this feature, as it was needed to avoid shader caching issues on certain drivers.
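The split between the portable UID and the host-specific binary can be sketched as follows. Everything here is hypothetical: `PixelUid` and its two fields, `generate_source`, and `compile_for_host` are illustrative stand-ins, and a real UID captures far more state.

```python
import json
import zlib
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class PixelUid:
    # Hypothetical fields chosen for the example.
    num_stages: int
    alpha_test: bool

def generate_source(uid):
    """Host-independent step: turn a UID into shader source text."""
    lines = [f"// stage {i}" for i in range(uid.num_stages)]
    if uid.alpha_test:
        lines.append("// alpha test")
    return "\n".join(lines)

def compile_for_host(source, host_gpu):
    """Stand-in for the driver's compiler: the result is tied to one host GPU."""
    return f"[{host_gpu}] {zlib.crc32(source.encode()):08x}"

# One user exports the UIDs they encountered...
shared = json.dumps([asdict(PixelUid(2, True)), asdict(PixelUid(1, False))])

# ...and another imports them and compiles ahead of time on their own hardware.
uids = [PixelUid(**fields) for fields in json.loads(shared)]
precompiled = {uid: compile_for_host(generate_source(uid), "my-gpu") for uid in uids}
print(len(precompiled))
```

Only the UID list travels between machines; the binaries are always rebuilt locally, which is what makes the UIDs shareable in the first place.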

So why hasn't extending this solution been pursued?

Dolphin is still improving. If a graphics fix is merged, all of those UIDs may have to be thrown out.

Not all games will be serviced. While popular games may get near complete UID collections, people playing hidden gems probably won't get any help.

In testing, there is very little UID overlap between games. The Legend of Zelda: The Wind Waker and The Legend of Zelda: Twilight Princess do share a small portion (15%) of configurations, but they are both running on the same base engine. Most games will have far less in common with each other, so sharing popular games will definitely not benefit lesser known games.

Users may miss various UIDs. There are a near limitless number of configurations. Even 100%ing a game isn't a guarantee that you've hit every configuration.

Developers pondered this idea for a while, but building the infrastructure for sharing UIDs and finding a good way to distribute them proved to create more disagreements than solutions. While this could possibly be used to improve an already working solution, it is not a working solution on its own.

Popularized by a fork, asynchronous shader compilation is a creative solution to the shader compilation dilemma. Tino looked at the problem more like how some modern games handle the same issue of having to dynamically compile new shaders - when you spawn into a new area, sometimes new objects will just "pop" in as they are loaded. He wondered if he could achieve something similar in an emulator and began rewriting how shaders were handled in his fork.

The asynchronous shader compiling concept changes how Dolphin behaves when there isn't a cached shader for an encountered Flipper/Hollywood configuration. Instead of pausing the game and waiting for a shader to compile, it simply skips rendering the object. This means that there is no pause or stutter, but some objects may be missing from view until the shader is ready.
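The skip-instead-of-wait behaviour can be sketched in a few lines. This is not the fork's implementation: `AsyncShaderCache` and its methods are hypothetical, and the background worker is modelled as an explicit `finish_pending` call rather than a real thread.

```python
class AsyncShaderCache:
    def __init__(self):
        self.ready = {}       # uid -> compiled shader
        self.pending = set()  # uids queued for background compilation

    def draw(self, uid, obj, framebuffer):
        """Draw obj if its shader is ready; otherwise queue a compile and skip."""
        if uid in self.ready:
            framebuffer.append(obj)
            return True
        self.pending.add(uid)  # a worker thread would pick this up
        return False           # object is missing this frame, but nothing stalls

    def finish_pending(self):
        """Stand-in for the background worker completing its queue."""
        for uid in self.pending:
            self.ready[uid] = f"shader-{uid}"
        self.pending.clear()

shaders = AsyncShaderCache()
frame1, frame2 = [], []
shaders.draw("uid-tree", "tree", frame1)  # not compiled yet: skipped
shaders.finish_pending()                  # compile completes between frames
shaders.draw("uid-tree", "tree", frame2)  # now it "pops in"
print(frame1, frame2)
```

Note what the sketch implies for anything the game draws only once: if that single draw lands on a frame where the shader is not ready, the object never appears.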

This works well for some games. Depending on how the game's engine culls objects when drawing the world, objects that fall outside the field of view of the camera, or only cover a few pixels on-screen may still be rendered. In this case, skipping rendering of these objects is hardly noticeable. However, depending on the game, it can result in the "pop in" described earlier.

Ignoring shader compilation can result in pop in and broken graphics. But, it does remain smooth!

One of the things users wondered was why Dolphin didn't at least implement Tino's asynchronous shaders as an option to fight shader compilation stuttering. In the end, it came down to the fact that the people who could have implemented it, along with other core developers, were against it as a solution. They saw it as nothing more than a hack that would cause a lot of false positives on the issue tracker and cause bigger issues down the road. Those worries proved somewhat valid once you realize that some games need objects to be rendered on the exact frame they expect. In this case, the Mii heads are only rendered once to the Embedded Framebuffer. If the EFB copy is missing because of Async Shader Compilation, the Mii heads will not show up for the remainder of the game or until they're regenerated.

Wait a minute, those aren't gum drops...

Despite its flaws, users of Tino's fork swear by asynchronous shader compilation. For everything wrong with asynchronous shaders, they do solve the problem of shader compilation stuttering at all costs. The downsides were too steep for it to be merged into Dolphin master, but this solution definitely put a spotlight on how big a problem shader compilation was. Tino's work on asynchronous shader compilation really let us know how much users cared about this problem, and further motivated the team to come up with a more complete solution.

Write an Interpreter for the GameCube/Wii Rendering Pipeline within Shaders and Run it on the Host Graphics Card

Sometimes, one of the best ways to solve an impossible problem is to change your perspective. No matter what we tried, there was no way to compile specialized shaders as fast as games could change configurations.

But what if we don't have to rely on specialized shaders? The crazy idea was born to emulate the rendering pipeline itself with an interpreter that runs directly on the GPU as a set of monstrous flexible shaders. If we compile these massive shaders on game start, whenever a game configures Flipper/Hollywood to render something, these "uber shaders" would configure themselves and render it without needing any new shaders. Theoretically, this would solve shader compilation stuttering by avoiding compilation altogether.
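The difference between a specialized shader and an interpreter-style shader can be shown with a toy model. The sketch below is an analogy in Python, not shader code and not how the TEV unit actually computes colors: a "configuration" is just a hypothetical list of `(operator, constant)` stages, and `exec` stands in for the driver's compiler.

```python
def specialize(config):
    """Generate and compile code for one exact configuration (the slow path)."""
    body = "c"
    for op, k in config:
        body = f"({body} {op} {k})"
    namespace = {}
    # exec stands in for handing generated source to the video driver.
    exec(f"def shader(c):\n    return {body}", namespace)
    return namespace["shader"]

def uber_shader(config, c):
    """One routine, compiled once, that reads the configuration as data."""
    for op, k in config:
        if op == "+":
            c = c + k
        elif op == "-":
            c = c - k
        elif op == "*":
            c = c * k
    return c

config = [("*", 2), ("+", 3)]
print(specialize(config)(5), uber_shader(config, 5))  # same result either way
```

The specialized path needs a fresh compile for every new `config`, while `uber_shader` handles any of them immediately at the cost of branching on every stage for every input. That trade of compile time for run-time work is the whole idea.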

This idea is all kinds of crazy, but it was also the first idea that had the potential to actually solve this impossible problem. The difficulty with this solution instead came from the absurd amount of work and expertise required to even get to the point of trying it. To put it into perspective, even among all the developers that work on Dolphin, only two or three people at most have the necessary knowledge on not only the GameCube/Wii hardware, but also modern GPUs, APIs, and the drivers to write, debug, and optimize the shaders. Not to mention running an interpreter as huge shaders is not exactly easy on the GPU, and many were afraid that all that work might not even run full speed on current video cards.

Hundreds, if not thousands, of hours of mind-numbing, repetitive, yet difficult work were needed with no guarantee of any payoff.

It was only first attempted in 2015, when phire became so frustrated with the shader compilation stuttering on his brand new computer that he actually made a proposal and designed the framework for an ubershader. While he was well aware of the difficulties, he seemed intent on proving that Ubershaders were the solution to this age old problem. phire went in all alone in an attempt to teach Dolphin how to render all over again.

This is not a graphics filter.

Wow, there are a few things wrong here...

Because of its simplicity, SM64 was one of the first games to render something in Ubershaders.

After grinding at the feature for more than a month, he managed to get the pixel Ubershaders to the point where some games started to look like their fast-shader counterparts. The surprising part wasn't that it worked, but that the prototype Ubershaders actually ran full speed. phire himself recollected that his initial reaction was, "Holy shit, it's actually running at full speed," and further admitted that GPUs shouldn't really be able to run these at playable speeds, but they do. Despite all the odds stacked against them, the prototypes proved that Ubershaders could be our solution to shader compilation stuttering. And thus, the grind began to improve the accuracy of Ubershaders, fix the many bugs, and implement the missing features.

Early on, Ubershaders would make games look like distorted realities never seen before.

It didn't take long for things to improve.

Before we knew it, Wind Waker was rendering with only minor errors.

phire was able to get Wind Waker looking perfect fairly fast. Unfortunately, other games that used more features still would need much more work.

The effort to even get Ubershaders this far left phire completely exhausted with the project. On top of that, phire had to put in a ton of work cleaning up other projects for the Dolphin 5.0 release. The delays proved costly, as he lost his fire to continue working on ubershaders thanks to burnout and increasing worries about driver and API limitations regarding the solution. Despite being around 90% complete, the last 90% still remained to be done, including some key features.

Finishing the vertex Ubershaders

Infrastructure/linking pixel and vertex Ubershaders

Solving OpenGL and (after a rebase) Vulkan performance issues

Cleanups, bug fixes, and making rendering identical to the specialized shaders

GUI Options

Optional - Hybrid Mode for integrated/weaker GPUs

To see them on the cusp of working was painful. But there weren't any developers capable of working on it with the will to take on such a massive project. Even those who would have considered working on it weren't ready to take on the cleanups, bug fixes, and infrastructure work. For well over a year, Ubershaders sat and bitrotted on the backburner within an ever-growing list of features that were never finished, and hope began to fade once again...

Shader compilation stuttering is one of the most complained about bugs in Dolphin, so after Ubershaders development stopped, people didn't forget about it. The pull request, though long abandoned, still saw comments, got linked around the forums and even posted on the bug tracker in various forms.

Ubershaders was the first real hope to eliminate shader compilation stuttering, and it was still brought up on a monthly basis. If anything, the progress on it only inflamed the community's desire for a solution. After much pleading, begging, and much, much blackmail (er, honest coercion), Stenzek reluctantly took over the mantle of Ubershaders.

Even before Stenzek began working on Ubershaders, the team made some decisions toward maintainability of the graphics backends. One of those decisions that was met with a mixed, if not negative, reaction was the removal of the D3D12 backend. Unlike D3D9, we didn't go through a deprecation process; we removed it once it was obvious no one was going to maintain it.

This was a fortuitous decision, however, as the removal of that backend aided with the rebase and revival of Ubershaders when Stenzek was ready to give it his best shot. As he was the architect of Dolphin's Vulkan backend, he was already more than willing to go through the extra work to set up Ubershaders to work on Vulkan.

When the pixel and vertex Ubershaders were finally hooked up together and ready for a run, testers immediately took them to some of the worst case scenario titles. Considering that none of the previous solutions really worked for a game like Metroid Prime 3, it was first on the docket.

Metroid Prime 3 was one of the only games where shader stuttering actually caused its compatibility rating to be dropped from playable. Until now!

The initial Ubershaders test was a massive success, with stuttering completely eliminated in D3D, and only a few strange stutters early on in runs within OpenGL and Vulkan. Continued work on Ubershaders has made things better in each backend, with a few exceptions that we'll note later. But, just running games on Ubershaders wasn't the end-game; pure Ubershaders are a massive performance drain on the host graphics card. While each game's requirements will vary, your graphics card will greatly affect how high of a resolution you can run. At 1x Internal Resolution (480p) most dedicated GPUs should be able to get the job done, with higher end cards still able to push 1080p or higher even exclusively using Ubershaders. Unfortunately, many of our users don't have the necessary hardware to run Ubershaders at the resolution they'd prefer, which would put them in the unfortunate position of choosing between resolution and smoothness.

Intel Integrated GPUs can barely run Dolphin's specialized shaders at higher resolutions and stand no chance against Ubershaders.

A very large portion of Dolphin's users are running onboard graphics. In our testing, onboard solutions at best could get roughly 50% speed with Ubershaders at 1x IR in a typical 3D game! Developers felt like ignoring a huge group of Dolphin's users would be a mistake and make Ubershaders a limited victory at best. Thus work continued on an even more robust solution that would cure these performance ailments once and for all.

Hybrid Mode Ubershaders is a marriage of Ubershaders and Asynchronous Shader Generation into a beautiful solution that takes the best parts of both with none of their flaws. Because Hybrid Mode greatly reduces the performance cost of Ubershaders, we expect it to be the most commonly used Ubershader mode.

Under Hybrid Mode, whenever a new pipeline configuration appears, Dolphin will use the already compiled Ubershaders to immediately render the effect without stuttering while still compiling the specialized shader in the background. Once the specialized shader is done, Dolphin will then hand the objects rendering through the Ubershader over to these newly generated specialized shaders.
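The hand-off described above can be sketched as a small selection routine. This is an illustration under stated assumptions, not Dolphin's implementation: `HybridRenderer`, `pick_shader`, and `worker_finished` are hypothetical names, and the background compile is modelled as an explicit callback instead of a real worker thread.

```python
class HybridRenderer:
    def __init__(self):
        self.specialized = {}  # uid -> compiled specialized shader
        self.pending = set()   # uids compiling in the background

    def pick_shader(self, uid):
        """Choose what to draw with this frame; never blocks and never skips."""
        if uid in self.specialized:
            return self.specialized[uid]
        self.pending.add(uid)  # a background worker would compile this
        return "ubershader"    # compiled at startup, handles any configuration

    def worker_finished(self, uid):
        """Called when the background compile for uid completes."""
        self.pending.discard(uid)
        self.specialized[uid] = f"specialized-{uid}"

renderer = HybridRenderer()
print(renderer.pick_shader("uid-lava"))  # first frames: slow but correct
renderer.worker_finished("uid-lava")
print(renderer.pick_shader("uid-lava"))  # afterwards: fast specialized shader
```

Compared with the asynchronous approach, the only change is what happens on a miss: the object is drawn with the slower general shader instead of being skipped.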

Assuming that drivers and APIs behave the way we want, this is the perfect solution. Because Ubershaders are only running on a fraction of the objects on a scene and only for frames at a time, the performance hit is almost entirely negated and stuttering is completely eliminated. Unfortunately, drivers and APIs aren't perfect, limiting the effectiveness of Hybrid in some setups. That brings us to...

GPU driver teams have a tough job squeezing as much power as possible out of their products while also providing a stable experience for their users. We mean no disrespect to anyone who works on these drivers, but one of the biggest obstacles to this project has been a ridiculous number of driver and API quirks that forced workarounds and other changes in functionality.

By bringing these to light, we hope to get some attention on them. Maybe someone outside of the project can come up with a workaround for us or at least monitor it in case a future driver/API update fixes the issue listed.

Drivers can do things in ways we don't expect and cannot control. When we have to generate a new pipeline for different blend or depth states, some drivers aren't smart enough to share the shaders between pipelines. This will cause a minor stutter the first time a new blending mode is used. Most of the time the variants a game will use are generated within the first few minutes of play, but it can still be frustrating when you're seeking perfection.

Thankfully, some drivers are smart enough to share shaders between pipelines, such as the Mesa driver, and there appears to be no additional stuttering. All of the other available drivers appear to suffer from some form of stuttering during variant generation. While we can't do anything about this currently, we're hopeful that as Vulkan drivers mature, they'll take on Mesa's more favorable behavior.

Some users have reported that on OpenGL and Vulkan (particularly in Hybrid Mode) there is some very slight stuttering when shaders are compiled. While we're not sure exactly what's wrong, because this does not happen in D3D, we're fairly certain that it's a quirk in NVIDIA's driver rather than a fault in how Dolphin handles things. Based on our testing, this appears to be separate from variant generation.

NVIDIA's Compiled Shaders on OpenGL and Vulkan are Much Slower than D3D

This one is particularly frustrating as there is no great way for us to debug this. We're feeding the same shaders to the host GPU on OpenGL, Vulkan and D3D, yet, D3D ends up with shaders that are much faster than the other two backends. This means that on a GTX 760, you may only get 1x internal resolution in a particular game on OpenGL or Vulkan, but on D3D, be able to comfortably get double or even triple before seeing slowdown.

Since NVIDIA does not allow us to disassemble shaders despite every other desktop GPU vendor having open shader disassembly, we have no way to debug this or figure out why the compiled code is so much more efficient on D3D. Think about how ridiculous this is: we want to make Dolphin run better on NVIDIA and they don't provide the tools to let us even attempt it. It's a baffling decision that we hope is rectified in the future. Without the shader disassembly tools provided by other vendors, fixing various bugs would have been much more difficult.

The sad thing is, the tools we need do exist - if you're a big enough game studio.

Edit: NVIDIA informed us that they only provide shader disassembly tools for Direct3D 12 (under NDA), and they are not available for other APIs regardless of NDA. Hopefully tools for other APIs will be available in the future.

During the writing of this article, our wishes were answered! AMD's Vulkan driver now supports a shader cache! This greatly improves what was a dire situation with Ubershaders, as it meant we'd have to recompile the Ubershaders every single run. It also improves variant stuttering as mentioned above.

As with any exciting feature, macOS users were probably waiting for the inevitable "but on macOS..." and here it is. The outdated, inefficient OpenGL 4.1 drivers on macOS simply aren't up to the task of handling Ubershaders to any useful degree. Hybrid Mode will reduce stuttering, but Exclusive is too slow to be useful. As another downside, macOS still doesn't support a shader cache under any drivers.

With all of the driver issues above, it isn't a surprise that some graphics cards work better on some backends and settings. We've outlined general recommended settings based on various video cards. Depending on your preference or particular graphics card, you may wish to deviate from the recommendations. Do note that changing certain settings while a game is running, such as per-pixel lighting and anti-aliasing level, will require different ubershaders to be compiled and may cause a sizable pause while this is done. Also remember that Ubershaders require more GPU power, so those same settings will also require a beefier graphics card.

Intel on Windows

Use D3D for Hybrid mode. Exclusive Mode does work, but Intel iGPUs are currently not fast enough to run it at full speed even at 1x native.

The driver generates variants with OpenGL, which means stuttering.

The Vulkan driver only supports Skylake+, and is too buggy to be worth using currently.

Intel on Linux

Use Vulkan for Hybrid mode. Exclusive Mode will work, but it will not be full speed.

The Anv driver is fantastic and should see the full benefits of Ubershaders.

The i965 Intel video driver doesn't share OpenGL shaders between threads, which means the render thread will always recompile the shader and stutter. Exclusive Mode, while slow, does work correctly but Hybrid Mode will stutter.

AMD on Windows

Use D3D for Hybrid mode.

Use D3D or Vulkan for Exclusive Mode.

The AMD OpenGL driver is just slow in general.

AMD on Linux

Use Vulkan for Exclusive or Hybrid modes.

radv behaves similarly to anv and works quite well.

NVIDIA on Windows

Use D3D or OpenGL for Hybrid mode.

Use D3D, OpenGL, or Vulkan for Exclusive mode. D3D's Ubershaders tend to be more efficient than OpenGL's or Vulkan's, resulting in higher performance on weaker GPUs.

NVIDIA on Linux

Use OpenGL for Hybrid mode.

Use OpenGL or Vulkan for Exclusive mode. Performance may vary, and which backend is faster depends on the game. Note that Vulkan will stutter while generating pipeline variants, which may cause one or two very minor stutters early on in a play session.

NVIDIA on Android

Use OpenGL for Hybrid mode.

Use OpenGL or Vulkan for Exclusive mode. Exclusive can actually get full speed in very basic games on the NVIDIA Shield TV.

PowerVR on Android

Not recommended. While running Ubershaders there is graphical corruption from shader compilation errors, but it will correct itself in Hybrid Mode as the specialized shaders compile. Too slow to be useful on current hardware.

Adreno on Android

Not recommended. Hybrid Mode will crash and Exclusive Mode shows severe graphical corruption. Too slow to be useful on current hardware as well.

It feels strange to be talking about the ubershader project in past tense now. It is completed, it is merged, and you can use it right now in the latest development builds. While there may be some growing pains here and there, we finally have our solution to shader compilation stuttering. Ubershaders are going to get better over the years as graphics cards get stronger and Exclusive Mode can get more widespread use. Hybrid Mode should also get better as Vulkan drivers mature and other driver quirks are hopefully addressed. And, of course, we're going to continue working on our side to make sure emulation continues to improve.

While shader compilation stuttering is effectively solved, Dolphin still requires the host computer to be fast enough to emulate the game. Additionally there are some flaws in the JIT that can cause stuttering. Currently, Dolphin's JIT with branching support really struggles on games that use a JIT (such as N64 VC games), causing a stutter that feels like shader compilation stuttering but actually is not. While we were hoping to solve that issue, we'll likely be including an option to disable branching support if this cannot be rectified, so users have the power to turn it off for problematic titles.

Because of this giant article, we're not going to be doing a Progress Report this month. We will have July's many amazing changes combo'd into the August Progress Report. It's going to be a big one, so look forward to it! Until then, enjoy.