Brain dumps and other ramblings

On GC in Games (response to Jeff and Casey)

So it turns out youtube comments suck. I’ll write my response to Jeff and Casey’s latest podcast in blog form instead of continuing the discussion there. View it here: http://www.youtube.com/watch?v=tK50z_gUpZI

Now, first let me say that I agree with 99% of the sentiment of this podcast. I think writing high performance games in Java or C# is kinda crazy, and the current trend of writing apps in HTML5 and JavaScript and then running it on top of some browser-like environment is positively bonkers. The proliferation of abstraction layers and general “cruft” is a huge pet peeve of mine – I don’t understand why it takes 30 seconds to launch a glorified text editor (like most IDEs – Eclipse, Visual Studio, etc.), when it took a fraction of a second twenty years ago on hardware that was thousands of times slower.

That said, I do think their arguments against GC aren’t quite fair (and note that GC has nothing to do with JITs or VMs). They pile roughly the right things into the “cons” column, but completely ignore the “pros” column, and as a result act baffled that anyone would ever think GC is appropriate for any reason.

Before I get into it, I should probably link to my previous post on GC where I spend a large chunk of time lamenting how poorly designed C# and Java are w.r.t. GC in particular. Read it here.

To summarize: no mainstream language does this “right”. What you want is a language that’s memory safe, but without relegating every single allocation to a garbage collected heap. 95% of your memory allocations should either be a pure stack allocation, or anchored to the stack (RAII helps), or be tied uniquely to some owning object and die immediately when the parent dies. Furthermore, the language should highly discourage allocations in general – it should be value-oriented like C so that there’s just plain less garbage to deal with in the first place. Rust is a good example of a language of this kind.
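A minimal Rust sketch of that style (the names here are made up for illustration): the vertex buffer is uniquely owned by its mesh, so it dies deterministically with its parent, with no tracing collector involved.

```rust
// Hypothetical sketch: ownership-tied allocation in Rust.
// `Mesh` uniquely owns its vertex buffer; when a `Mesh` is dropped,
// the buffer is freed immediately -- no garbage collector needed.
struct Mesh {
    vertices: Vec<[f32; 3]>, // heap allocation uniquely owned by this Mesh
}

fn bounding_radius(mesh: &Mesh) -> f32 {
    // Borrowed access: no allocation, no garbage produced.
    mesh.vertices
        .iter()
        .map(|v| (v[0] * v[0] + v[1] * v[1] + v[2] * v[2]).sqrt())
        .fold(0.0, f32::max)
}

fn main() {
    let mesh = Mesh {
        vertices: vec![[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]],
    };
    println!("{}", bounding_radius(&mesh));
    // `mesh` and its vertex buffer are freed right here, at end of scope.
}
```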

You’ll note that most of Jeff and Casey’s ranting is not actually about the GC itself, but about promiscuous allocation behavior, and I fully agree with that, but I think it’s a mistake to conflate the two. GC doesn’t imply that you should heap allocate at the drop of a hat, or that you shouldn’t think about who “owns” what memory.

Here’s the point: Garbage collection is about memory safety. It’s not about convenience, really. Nobody serious argues that GC means you don’t have to worry about resource usage. If you have type safety, array bounds checks, null safety, and garbage collection, you can eliminate memory corruption. That’s why people accept all the downsides of GC even in languages where it comes with much higher than necessary penalties (e.g. Java, C#, Ruby, Lua, Python, and so on… pretty much all mainstream languages).
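To illustrate the bounds-check part of that list, here is a tiny Rust sketch (Rust standing in for any memory-safe language): an out-of-range access surfaces locally, at the call site, instead of silently scribbling over whatever lives next to the allocation.

```rust
fn main() {
    let frame_buffer = vec![0u8; 4];

    // In-range access works as usual.
    assert_eq!(frame_buffer.get(2), Some(&0u8));

    // An out-of-range access is caught by the bounds check and shows up
    // locally (here as None; `frame_buffer[1000]` would panic) instead of
    // corrupting a neighboring allocation and crashing forty minutes later.
    assert_eq!(frame_buffer.get(1000), None);

    println!("no scribbles");
}
```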

A couple of weeks ago I spent several days tracking down a heap corruption in a very popular third party game engine. I haven’t tracked down who’s responsible for the bug (though I have access to their repository history), or exactly how long it’s been there, but from the kind of bug it was I wouldn’t be surprised if it’s been there for many years, and therefore in hundreds (or even thousands?) of shipped games. It just started happening after several years for no real reason (maybe the link order changed just enough, or the order of heap allocations changed just enough, to make it actually show up as a crash).

The main thing to say about this bug (I won’t detail it here because it’s not my code) is that it was caused by three different pieces of code interacting badly, but none of the pieces was necessarily doing anything stupid. I can easily see very smart and professional programmers writing these three pieces of code at different times, and going through a few iterations perhaps, and all of a sudden there’s a perfect storm, and a latent memory corruption is born.

I mention this because it raises a few important points:

Memory corruption is not always caught before you ship. Any argument that memory corruption under manual memory management isn’t so bad because it’s at least transparent and debuggable, unlike the opaque GC, falls flat on its face for this reason. Yes, you have all the code, and it’s not very complicated, but how does that help you if you never even see the bug before you ship? Memory corruption bugs are frequently difficult to even repro. They might happen once every thousand hours due to some rare race condition, or some extremely rare sequence of heap events. You could in principle debug it (though it often takes considerable effort and time), if you knew it was there, but sometimes you just don’t.

Memory corruption is often very hard to debug. Often this goes hand in hand with the previous point. Something scribbles to some memory, and forty minutes later enough errors have cascaded from this to cause a visible crash. It’s extremely hard to trace back in time to figure out the root cause of these things. This is another ding against the “the GC is so opaque” argument. Opacity isn’t just about whether or not you have access to the code – it’s also about how easy it is to fix even if you do. The extreme difficulty of tracking down some of the more subtle memory corruption bugs means that the theoretical transparency you get from owning all the code really doesn’t mean much. With a GC at least most problems are simple to understand – yes, you may have to “fix” it by tuning some parameters, or even pre-allocating/reusing memory to avoid the GC altogether (because you can’t break open the GC itself), but this is far less effort and complexity than a lot of heap corruption bugs.

Smart people fuck up too. In the comments there were a number of arguments that essentially took the form “real programmers can deal with manual memory management”*. Well, this is an engine developed by some of the best developers in the industry, and it’s used for many thousands of games, including many AAA games. Furthermore, there was absolutely nothing “stupid” going on here. It was all code that looked completely sane and sensible, but due to some very subtle interactions caused a scribble. Also, it’s not hard to go through the release notes for RAD-developed middleware and find fixes for memory corruption bugs – so clearly even RAD engineers (of whom I have a very high opinion) occasionally fuck up here.

With memory safety, most of these bugs simply disappear. The majority of them really don’t happen at all anymore – and the rest turn into a different kind of bug, which is much easier to track down: a space leak (a dangling pointer in a memory safe language just means you’ll end up using more memory than you expected, which can be tracked down in minutes using rudimentary heap analysis tools).
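In Rust terms, a hypothetical sketch of that “dangling pointer becomes a space leak” scenario: a forgotten handle keeps the object alive instead of corrupting it, and a refcount query points straight at the culprit.

```rust
use std::rc::Rc;

// Hypothetical sketch: in a memory-safe language a "dangling pointer"
// becomes a space leak -- the object stays alive, nothing gets scribbled.
struct Texture {
    bytes: Vec<u8>,
}

fn main() {
    let tex = Rc::new(Texture { bytes: vec![0u8; 1024] });

    // Some forgotten system keeps a clone of the handle around...
    let forgotten = Rc::clone(&tex);

    // Dropping our reference doesn't free the memory -- but nothing
    // corrupts the heap either; the program state stays intact.
    drop(tex);

    // Rudimentary heap analysis: the live refcount identifies the leak.
    assert_eq!(Rc::strong_count(&forgotten), 1);
    assert_eq!(forgotten.bytes.len(), 1024);
}
```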

In other words: memory safety eliminates a whole host of bugs, and improves debuggability of some other bugs. Even when a GC causes additional issues (which they do – there’s a real cost to GC for sure) they at least do so before you ship, unlike the bugs caused by not having memory safety. This is a very important distinction!

Yes, you should be careful, and I’m certainly not advocating Java or C# here, but when you do consider the tradeoffs you should at least be honest about the downsides of not having memory safety. There is real value in eliminating these issues up front.

In current languages I would probably almost always come down on the side of not paying the cost of GC for high-performance applications. E.g. I’ll generally argue against any kind of interpretation or VM-based scripting altogether (DSLs that compile to native is a different issue), especially if they require a GC. However, I don’t think you need to overstate your case when making the tradeoff.

If I could pay a small fixed cost of, let’s say 0.5ms per frame, but be guaranteed that I’m not going to have to worry about any memory corruption ever again I’d totally take that tradeoff. We’re not there yet, but we really aren’t that far off either – the problem isn’t intrinsic to GC. Plenty of high performance games, even 60Hz ones, have shipped with GC’d scripting languages, and while they don’t usually collect the whole heap, a lot of them do manage to keep the GC overhead around that level of cost. So maybe in the future, instead of paying 0.5ms to GC a small heap that’s completely ruined by a shitty language that generates too much garbage, we could instead GC the whole heap and end up with similar levels of complexity by just creating less garbage in the first place (using a non-shitty language).

*Side note: I really hate arguments of the form “real programmers can deal with X” used to dismiss the problem by basically implying that anyone who has a problem just isn’t very good. It’s incredibly insulting and lazy, and no discussion was ever improved by saying it. In my opinion hubris, or extrapolating too far from your own experience, is a far more common sign of incompetence or inexperience than admitting that something is hard.


18 thoughts on “On GC in Games (response to Jeff and Casey)”

I feel as if you’re heavily resting your argument on extremely rare bugs that occur once the game is shipped. As in, in an ideal scenario I would imagine that crashing (hopefully with some log/dump or into a debugger) would be the best case – which is not possible if you’re in a managed (or GC + memory safe) environment. And so if you’re testing and you hit the code path that triggers this rare bug… in a GC + memory safe environment you would just silently continue execution when the program gets into a weird state, right? Sure, memory resources will not be affected, but won’t I/O or other internal state data still be left in a weird state?

These rare bugs don’t only occur when the game has shipped, they consume a huge amount of resources to track down during development (usually towards the end). Nothing worse than a hard-to-find memory scribble with a few weeks to go on the clock. I do think it’s compelling that all the downsides of a GC happen *before* you ship, though, whereas these scribbles are essentially an unknown quantity that you can never rule out. Plenty of people see access violations on their PCs when running shipped games. This isn’t a hypothetical.

The only reason something would keep memory around is because that code thinks it still needs it. That bug would exist either way – in a memory unsafe language it could lead to scribbles (if you write through the pointer), and hard-to-debug secondary effects (such as heap corruption causing crashes elsewhere); in a memory safe language you’ll simply keep stuff around for longer, which is orders of magnitude easier to track down (e.g. “Hey, why is this code still running? Oh yeah, the shutdown code didn’t run” – much easier than heap corruption triggering 30 minutes later).

Basically most (memory) bugs disappear completely (e.g. race conditions, or other transient conditions, causing missed frees, or double frees, or accessing dangling pointers – the fact that a memory safe language will just keep the objects around until nothing needs them anymore simply fixes the bug for you by ruling out the need to carefully free stuff manually, it’ll happen when it’s safe automatically). The rare few bugs that don’t go away become much easier to track down (they don’t cause heap corruption, just code that keeps running, or data that’s kept around for too long – both are easy to track down when your program state hasn’t been trashed).

Indeed (and hello again!) – these very rare scribbles are incredibly expensive and stressful (speaking from recent experience) to track down when they only manifest in release candidate builds, weeks from shipping.

Can you eliminate garbage collection from that list, though? Isn’t it possible to have a language that has type safety, array bounds checks, null safety, and maybe cooperation with heap guard blocks to detect scribbles during debug mode?

I still agree with Jeff and Casey that a lot of the time, manual memory management isn’t “hard” enough to throw your hands up in the air and pay the GC price whole-hog.

Merely using guard pages in debug isn’t enough to eliminate the issue. That can help catch many of them during testing, but to eliminate (in the strict sense of that word) memory corruption you have to rule it out by construction, not just increase the probability that you happen to catch the error before you ship. Part of the problem is that these things are kind of transient by nature (the deterministic cases are easy to find), so just finding the easy ones by adding some extra runtime checks in debug doesn’t really bring you all that much closer to a solution.

So the only way to truly eliminate the problem is to make sure that you can’t use memory after it’s been freed (plus get rid of pointer arithmetic, unsafe casts, etc.). This means you must enforce that somehow. For many simple cases you can statically track ownership and determine when something is no longer needed (stack allocation is the simplest case; unique-ownership pointers are a slightly more flexible scenario – see Rust). These systems aren’t expressive enough to capture every conceivable kind of memory ownership pattern though (in particular, “shared” ownership), so for full generality you’re still going to need a dynamic storage reclamation system of some kind (tracing GC, or a region system, or something else).

Ideally you’d have multiple approaches so you only reach for the fully generic system (e.g. tracing GC) in very few cases.
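That layering can be sketched in Rust (a hypothetical illustration, not a prescription): static tracking handles the stack and unique-ownership cases, and a dynamic scheme is only reached for when ownership is genuinely shared.

```rust
use std::rc::Rc;

fn main() {
    // Stack allocation: lifetime tracked statically, freed at end of scope.
    let on_stack = [0u8; 64];

    // Unique ownership (Rust's Box, roughly C++'s unique_ptr): exactly one
    // owner at a time, freed deterministically when that owner dies.
    let unique = Box::new(vec![1, 2, 3]);
    let moved = unique; // ownership transfers; `unique` is statically dead now

    // Shared ownership: no single static owner, so a dynamic reclamation
    // scheme (here reference counting; a tracing GC is the fully general
    // option) decides when the memory actually goes away.
    let shared_a = Rc::new(String::from("level data"));
    let shared_b = Rc::clone(&shared_a);
    assert_eq!(Rc::strong_count(&shared_a), 2);

    let _ = (on_stack, moved, shared_b);
}
```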

C++ doesn’t actually solve it fully, IMO, because it introduces memory leaks as a potential side effect of the “solution” (due to cyclic data). True, that doesn’t compromise memory safety, but still isn’t awesome. To really solve it with ref counting you’d still need a tracing backup collector to get rid of cycles.
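The cycle problem the reply mentions can be shown in miniature with Rust’s refcounted pointers (a sketch with made-up names): a parent/child pair whose back-edge is weak never forms a leaking cycle, which is the standard refcounting workaround short of a tracing backup collector.

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

// Hypothetical sketch: reference counting alone leaks cycles, so the
// back-edge (child -> parent) is made Weak to break the cycle.
struct Node {
    parent: RefCell<Weak<Node>>, // weak: does not keep the parent alive
    children: RefCell<Vec<Rc<Node>>>,
}

fn main() {
    let parent = Rc::new(Node {
        parent: RefCell::new(Weak::new()),
        children: RefCell::new(vec![]),
    });
    let child = Rc::new(Node {
        parent: RefCell::new(Rc::downgrade(&parent)),
        children: RefCell::new(vec![]),
    });
    parent.children.borrow_mut().push(Rc::clone(&child));

    // Only the strong edge (parent -> child) counts toward child's refcount.
    assert_eq!(Rc::strong_count(&child), 2);
    // The weak back-edge doesn't pin the parent, so no cycle, no leak.
    assert_eq!(Rc::strong_count(&parent), 1);
}
```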

I very much approve! However, it’s too slow to run all the time, so you can only really use it once you’ve had a crash and are trying to track it down. It’s a drastic improvement though because even if the repro rate is once every five days, you’re at least guaranteed to catch the offender if it does happen while you run it.

Unless of course the memory scribble is caused by a rare race condition, in which case the timing differences from running this can make it go away (but then that’s true for just about any tool you use to help you track these things down). Annoyingly, multithreading errors in C++ have a way of eventually showing up as a memory corruption…

Maybe all devs should have special PCs that can run at full speed and track every allocation. Like having special hardware that tracks things on a per-word basis. You could do something like it by allocating full pages, but some scribbles would go away if you do that, so ideally you could run it with the exact same allocator as the final shipping product, with just some hooks to set some use-bits using a hardware accelerated memory read/write trap.

I’m betting that GC works with games. I’ve written a modern 3D graphics engine in Lisp and the results are quite good so far. It may never make it to a customer but so far, it’s a wonderful environment to develop in and I haven’t had any real issues with speed.

Not to disrespect your work so far, but I doubt it has the features of some of the ‘big boys’. Look at the new U4 engine as an example, I doubt that could be coded in Lisp.
My reasoning is that while you could achieve results similar to some of today’s AAA titles, that would be the peak of the engine’s ability. The games that are out at the moment are mainly being built on top of engines designed to push ‘last generation’ hardware to 100%, not current gear. Sure, there are some cases where that isn’t true, but for the most part it is. As such any engine would be ‘a generation behind’ in terms of what it can deliver.

I’m not sure what your point is. I assume by U4 you mean Unity 4? I.e. an engine largely based on (garbage collected) C#?

I agree that Lisp is probably not ideal for (all of) a game engine, but you could probably replace the C# parts of Unity with Lisp without too many issues. That said, I’m not sold on that kind of pervasive scripting anyway.

I watched the video and found the opinions of those highly experienced, well educated speakers to be painful on multiple occasions.

“I spend literally 0% of my effort on memory management” – really? That’s because you’ve spent the last 30 years programming C. I’ve written tens of thousands of lines of code in JASS (a fully interpreted JITless scripting language for the warcraft 3 engine). I’ve written and re-written simple and complex physics systems a hundred times. I can verbalize an entire multi-instancable script for gravitation physics line by line without a reference because I’ve just done it so many times. But that language is shitty, and avoiding boilerplate code that doesn’t self-document is one of the huge advantages you get by using Java. Just because I can make the code doesn’t make it maintainable or good.

Java doesn’t have to say malloc(q + r + s) // q is for the iterator, r is for the interface, and s is for the facade

Java just says q = new iterator, r = new interface, s = new facade. It’s not about lazy programming, it’s about self-documentation. I’m not trying to say it’s impossible to write bad Java code, or self-documenting C++ code, I’m just saying that Java provides tools and standards to encourage developers to do that easily, significantly more so than C++.

Every time you use the word lazy to describe non-c-programmers, you just make yourself sound like more of a brogrammer. A stupid, narcissistic, yelling brogrammer. You program games – that’s hardly even computer science (I know, because I program games). You’re an engineer!

You want to see how shitty C++ is and how easily you can run face first into a brick wall? Consider the fraction of game programmers who use C++ compared to the fraction of machine learning analysts who use C++.

Lastly, of course video games should be implemented in high performance code. Games by nature are distinguished by their performance and aesthetics, and need to take full advantage of their environment. Of course Ouya is shitty for using Java. But don’t make an hour-long podcast spreading your malformed opinion about a branch of computer science that doesn’t concern you and never will. You know what’s great about C++ for game programmers? It compiles to better machine code than you can write yourself, if you’re on a highly optimized instruction set like x86.

But don’t kid yourself into thinking you write the highest performance code. You’re already 4-5 levels of abstraction above baseline, and the guys working at ARM and Selex building serious performance technologies in machine languages you’ve never heard of, or better yet, HDL, are laughing at you. If you really think you’re in a position to demand from developers a certain level of IO accessibility, write your game to be a standalone IC, you’ll get much better performance that way. (sarcasm)

One problem I’ve always had with GC is that it can give developers the *illusion* that resource management is not a largely manual process.

For example, removing geometry from your scene graph may require associated GPU data to be freed. This is scene removal logic, not destruction logic, and when using GC the resource is unlikely to ever be freed from memory until shutdown anyway, because people have a tendency to ignore the need for weak references and may, for example, reference the geometry in some other part of the scene. It’s very difficult to get a whole team of people to use GC properly.

A team using GC can have a tendency to start being lax about such issues, and we can end up exchanging what might have formerly been deterministic bugs for logical leaks which are very difficult to track down.

I suppose it depends on the type of work you’re doing. In the cases I’ve dealt with, I’d rather face the occasional segfault, e.g., than a system that thoroughly neglects to have a clear, logical designation of resource owners/managers and associated logic by having its developers incorrectly think that GC will solve this all for them. Of course, it’s possible that I may have simply faced worst-case scenarios where the entire team wasn’t really competent with GC, but it usually only takes a weak link or two to cause significant problems over the course of a year or two.

For example, let’s say that we’re working with a system which uses GC for all elements in a scene graph. The scene graph consists of fairly expensive geometric elements along with lighting elements which store a list of geometry to exclude (elements in this list are not lit by that particular light).

In such cases, it is very easy to find, when working with a rather large team of “average” developers, that they store such a list using strong references to the geometry being excluded. Now geometry removal from the scene no longer frees its memory; the excluded elements linger around as a result of the exclusion list stored by the lights.

In fairness, without GC in this same scenario, we would probably end up with a segfault at some point when the user attempts to access the exclusion list stored in some light. The mistake is the same either way: the light system failed to account for geometry removal from the scene. Yet in this case, a segfault/access violation could be preferable to a silent GC issue that only manifests itself in terms of a logical leak that no one may even notice for years.
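The exclusion-list scenario above can be sketched in Rust with made-up names (weak references here standing in for a GC language’s weak handles): because the light is not an owner, removing the geometry from the scene actually lets it die, and the stale entry surfaces locally instead of as a silent leak or a segfault.

```rust
use std::rc::{Rc, Weak};

// Hypothetical sketch of the exclusion-list scenario: the light stores
// weak handles, so it never becomes an accidental shared owner.
struct Geometry {
    name: String,
}

struct Light {
    excluded: Vec<Weak<Geometry>>, // weak: the light does not own geometry
}

fn main() {
    let mut scene: Vec<Rc<Geometry>> =
        vec![Rc::new(Geometry { name: "rock".into() })];
    let light = Light {
        excluded: vec![Rc::downgrade(&scene[0])],
    };

    // Remove the geometry from the scene graph: the last strong ref dies,
    // and the expensive resource is actually freed.
    scene.clear();

    // The light's stale entry upgrades to None -- a visible, local symptom
    // instead of a silent leak (strong refs) or a scribble (raw pointers).
    assert!(light.excluded[0].upgrade().is_none());
    let _ = |g: &Geometry| g.name.len(); // silence dead-field lint in sketch
}
```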

Forget cyclic references. I’ve found that most “average” developers do a good job of avoiding cyclical references but a poor job of understanding that every single reference they store is effectively turning their component into a shared resource owner.

This is the general problem I’ve found with GC. It can lead to a lot of silent resource leaks and bugs which may never be discovered in a timely fashion. And in systems where having a bunch of leaks isn’t such a big deal, that may be perfectly fine. But in systems that quickly accumulate expensive resources that need to be freed even for adequate performance and stability, “average” programmers can become more neglectful when working with GC about paying attention to resource ownership and lifetimes. The biggest problem I see is that too many think of GC as some kind of silver bullet.