Thursday, May 21, 2015

After several man months of tool making, instrumenting and compiling our own custom Mono DLL, and crawling through 5k-30k node heap allocation graphs in gephi, our first Unity title (Dungeon Boss for iOS/Android) is now no longer leaking significant amounts of Mono heap memory. Last year, our uptime on 512MB iOS devices was 15-20 minutes, now it's hours.

It can be very easy to construct complex systems in C# which have degenerate (continually increasing) memory behavior over time, even though everything else seems fine. We label such systems as "leaking", although they don't actually leak memory in the C/C++ sense. All it takes is a single accidental strong reference somewhere to mistakenly "leak" huge amounts of objects over time. It can be a daunting task (even for the original authors) to discover how to fix a large system written in a garbage collected language so it doesn't leak.

Here's a brain dump of what we learned during the painful process of figuring this out:

- Monitor your Unity app's memory usage as early as possible during development. Mobile devices have some pretty harsh memory restrictions (see here to get an idea for iOS), so be sure to test on real devices early and often.

Be prepared for some serious pain if you only develop and test in the Unity editor for months on end.

- On iOS your app will receive low memory warnings when the system comes under memory pressure. (Note that iOS can be very chatty about issuing these warnings.) It can be helpful to log these warnings to your game's server (along with the amount of used client memory), to help do post-mortem analysis of why your app is randomly dying in the field.

- Our (unofficial) low end iOS devices are iPhone 4/4s and iPad Mini 1st gen (512MB devices). If our total allocated memory (according to XCode's Memory Monitor) exceeds approx. 200MB for sustained periods of time it'll eventually be ruthlessly terminated by the kernel. Ideally, don't use more than 150-170MB on these devices.

- In Unity, the Mono (C#) heap is managed by the Boehm garbage collector. This is basically a C/C++-style heap with a garbage collector bolted on top of it.

Allocating memory is not cheap in this system. The version of Mono that Unity uses is pretty dated, so if you've been using the Microsoft .NET runtime for C# development then consider yourself horribly spoiled.

Treat the C# heap like a very precious resource, and study what C/C++ programmers do to avoid constantly allocating/freeing blocks such as using custom object pools. Avoid using the heap by preferring C# struct's vs classes, avoid boxing, use StringBuilder when messing with strings, etc.

- In complex systems written in C# the careful use of weak references (or containers of weak references) can be extremely useful to avoid creating endless chains of strong object references. We had to switch several systems from strong to weak references in key places to make them stable, and discovering which systems to change can be very tricky.

- Determine up front the exact lifetime of your objects, and exactly when objects should no longer be referenced in your system. Don't just assume the garbage collector will automagically take care of things for you.

- The Boehm collector's OS memory footprint only stabilizes or increases over time, as far as we can tell. This means you should be very careful about allocating large temporary buffers or objects on the Mono heap. Doing so could unnecessarily bump up your Mono's memory footprint, which will decrease the amount of memory "headroom" your app will have iOS. Basically, once Mono grabs OS memory it greedily holds onto it until the end of time, and this memory can't be used for other things such as textures, the Unity C/C++ heap, etc.

- Be very careful about using Unity's WWW class to download large archives or bundles, because this class may store the downloaded data in the mono heap. This is actually a serious problem for us, because we download compressed Unity asset bundles during the user's first game session and this class was causing our app's mono memory footprint to be increased by 30-40MB. This seriously reduced our app's memory headroom during the user's first session (which in a free to play game is pretty critical to get right).

- The Boehm collector grows its OS memory allocation so it has enough internal heap headroom to avoid collecting too frequently. You must factor this headroom into account when budgeting your C# memory, i.e. if your budget calls for 25MB of C# memory then the actual amount of memory consumed at the OS level will be significantly larger (approximately 40-50MB in our experience).

- It's possible to force the Boehm collector used by Unity to never allocate more than a set amount of OS memory (see here) by calling GC_set_max_heap_size() very early during app initialization. Note that if you do this and your C# heap leaks your app will eventually just abort once the heap is full.

It may be possible to call this API over time to carefully bump up your app's Mono heap size as needed, but we haven't tried this yet.

- If your app leaks, and you can't figure out how to fix all the leaks, then an alternative solution that may be temporarily acceptable is to relentlessly free up as much memory as possible by optimizing assets, switching from PVRTC 4bpp to 2bbp, lowering sound and music bitrates, etc. This will give your app the memory headroom it needs to run for a reasonable period of time before the OS kills it.

If the user can play 20 levels per hour, and you leak 1MB per level, then you'll need to find 20MB of memory somewhere to run one hour, etc. It can be far simpler to optimize some textures then track down memory leaks in large C# codebases.

- Design your code to avoid creating tons of temporary objects that trigger frequent collections. One of our menu dialogs was accidently triggering a collection every 2-4 frames on iOS, which was crushing our performance.

- We used the Daikon Forge UI library. This library has several very serious memory leaks. We'll try to submit these fixes back to the author, but I think the product is now more or less dead (so email me if you would like the fixes).

- Add some debug statistics to your game, along with the usual FPS display, and make sure this stuff works on your target devices:

Total Mono heap used and reserved (You can retrieve this from Unity's Profiler class. Note this class returns all 0's in non-dev builds.)

Total number of GC's so far, number of frames since last GC, or average # of frames and seconds between GC's (you can infer when a GC occurs by monitoring the Mono heap's used size every Update() - when it decreases since the last Update() you can assume a GC has occured sometime recently)

While monitoring our app's memory consumption on iOS, we had to observe and make sense of statistics from XCode's Memory Monitor, from Instruments, from the Mach kernel API's, and from Unity. It can be very difficult to make sense of all this crap.

At the end of the day, we trusted Unity's statistics the most because we understood exactly how these statistics were being computed.

- Instrument your game's key classes to track the # of live objects present in the system at any one time, and display this information somewhere easily visible to developers when running on device. Increment a global static counter in your object's constructor, and decrement in your C# destructor (this method is automatically called when your object's memory is actually reclaimed by the GC).

- On iOS, don't be shy about using PVRTC 2bpp textures. They look surprisingly good vs. 4bpp, and this format can save you a large amount of memory. We wound up using 2bpp on all textures except for effects and UI sprites.

- The built-in Unity memory profiler works pretty well on iOS over USB. It's not that useful for tracking down narly C# scripting leaks, but it can be invaluable for tracking down asset problems.

- Remember that leaks in C# code can propagate downstream and cause apparent leaks on the Unity C/C++ heap or asset leaks.

- It can be helpful to mentally model the mono heap as a complex directed graph, where the nodes are individual allocations and the edges are strong references. Anything directly or indirectly referenced from a static root (either global, or on the stack, etc.) won't be collected. In a large system with many leaks, try not to waste time fixing references to leaf nodes in this graph. Attack the problem as high up near the roots as you possibly can.

On the other hand, if you are under a large amount of time pressure to get a fix in right now, it can be easier to just fix the worst leaks (in terms of # of bytes leaked per level or whatever) by selectively null'ing out key references to leafier parts of the graph you know shouldn't be growing between levels. We wrote custom tools to help us determine the worst offenders to spend time on, sorted by which function the allocation occurred in. Fixing these leaks can buy you enough time to properly fix the problem.

I haven't migrated LZHAM alpha (used by Planetside 2) yet. Still deciding if I should just archive it somewhere instead because it's a dead branch (LZHAM now lives here, and the much faster compressing development branch is here).

Friday, May 15, 2015

John Brooks (CEO of Blue Shift Inc.) and I were discussing LZHAM's current decompression performance vs. Google's Brotli codec and we came up with these ideas:

- Basic block optimization:
The current decompressor's inner loop is a general purpose coroutine based design, so a single implementation can handle both streaming and non-streaming scenarios. This is hurting perf. because the coroutine structure consists of a huge switch statement, which causes the compiler to dump locals to registers (and read them back in) a lot.

I'm going to add an alternative non-coroutine version of the inner loop that won't support streaming to optimize the very common memory to memory scenario. (LZHAM already has a few non-streaming optimizations in there, but it still uses a huge switch statement that breaks up the inner loop into tons of basic blocks.)

- LZHAM doesn't have any SIMD optimizations in its Huffman routines. I've been hesitant to use SIMD code anywhere in LZHAM because it complicates testing, but some of the Huffman routines should be easy to optimize with SIMD code.

- Finally, LZHAM's small block performance suffers vs. LZMA, Brotli, or zlib because it must compute Huffman tables on the fly at a fairly high frequency near the beginning of streams. There's a simple solution to this, which is to use precomputed Huffman table codelengths at the start of streams, then switch to dynamically updated Huffman tables once it makes sense to do so.

I already have the code to do this stuff (from the original LZHAM prototype), but it would require breaking v1.0 format compatibility. (And I'm not going to break format compatibility - if/when I do the new thing will have a different name.)

Wednesday, May 13, 2015

A snapshot of our current memory footprint on iOS, using Unity's built-in memory profiling tool (thanks to Sean Cooper):

Mono 59

Unity 53

GfxDriver 34.4

Textures 29.4

Animations 23.8

Meshes 16.8

FMOD 13

Profiler 12.8

Audio 11.1

Mono heap's allocator is a greedy SOB, and in practice only around 40-50% of its allocated memory contains persistent C# objects. We are going to try tweaking or modifying it soon to be less greedy, because we need the memory.

Also, the Unity heap is relatively huge now so we're going to poke around in there and see what's going in.

Sunday, May 10, 2015

We've been investing a bunch of time into creating a set of Mono (C#) memory leak and performance analysis tools, which we've been using to help us ship our first title (Dungeon Boss). Here's a high-level description of how the tools work and what we can currently do with them:

First, we have a custom instrumented version of mono.dll that captures and records a full transaction log of all mono and lower level libgc (Boehm collector) heap activity. It records to the log almost all internal malloc's/free's, mono allocs, and which mono allocs are collected during each GC. A full backtrace is stored with each log record.

We also record full RAM dumps at each GC, as well as the location of all static roots, thread stacks, etc. (The full RAM dumps may seem excessive, but they are extremely compressible with LZ4 and are key to understanding the relationships between allocations.)

We've also instrumented our game to record custom events to the log file: At menu, level start/end, encounter start, start of Update(), etc.

A typical workflow is to run the game in a Windows standalone build using our instrumented version of mono.dll, which generates a large binary transaction log. We then post process this log using a C# tool named "HeapBoss", which spews out a huge .JSON file and a number of binary heap RAM snapshots. We then explore and continue processing all this data using an interactive C++ command line tool named "HeapMan".

Here's a list of things we can currently do once we have a heap transaction log and GC RAM snapshots:

- Log exploration and statistical analysis:

Dump the log command index of all GC's, the ID's of all custom events, dump all allocs or GC frees of a specific type, etc.

- Blame list construction: We can replay the transaction log up to a specific GC, then recreate the state of the mono heap at that particular GC. We can then construct a "blame list" of those C# functions which are responsible for each allocation. We use a manually created leaf function name exclusion list (consisting of about 50 regex patterns) to exclude the deepest functions from each backtrace which are too low level (or internal) to be interesting or useful for blame list construction.

This method is useful for getting a picture of the top consumers of Mono heap memory at a specific spot in the game. We output this list to a .CSV file.

- Growth over time ("leak") analysis: We replay the transaction log up to spot A, create a heap state object, then continue playing up to spot B and create another heap state object. We then construct a blame list for each state object, diff them, then record the results to a .CSV file.

This allows us to see which functions have grown or shrunk their allocations over time. We've found many leaks this way. (Of course, they aren't leaks in the C/C++ sense. In C#, it's easy to construct systems which have degenerate memory behavior over time.)

- Various queries: Find all heap objects of type X. Find all objects on the heap which potentially have a pointer to a specific object (or address). Examine an object's memory at a specific GC and determine which objects it may potentially point to. Find the allocation that includes address X, if any. Find the vtable at address X, etc.

Note our tools don't know where the pointers are really located in a type, so when we examine the raw memory of an object instance it's possible to mistake some random bits for a valid object pointer. We do our best to exclude pointers which aren't aligned, or don't point to the beginning of another allocated object, etc. In practice this approach is surprisingly reliable.

- Root analysis to help find unintended leaks: Given an object, recursively find all objects which reference it, then find all the objects which refer to those objects, etc. until you discover all the roots which directly or indirectly refer to the specific objects. Output the results as a .DOT file and import into gephi for visualization and deeper analysis. (graphviz is another option, but it doesn't scale to large graphs as well as gephi.)

- Heap transaction videos: Create a PNG image visualizing the active heap allocations and synchronize this video with the game. Helps to better understand how the title is utilizing/abusing heap memory at different spots during gameplay.

What we'll be doing next:

- "Churn" analysis: Create a histogram of the blame function for each GC free, sort by the # of frees or # of bytes freed to identify those functions which are creating the most garbage over time.

- Automatically identify the parts of the heap graph that always grow over time. Right now, we do this manually by first doing a growth analysis, then finding which types are responsible for the growth, then finding the instance(s) which grow over time.

Friday, May 8, 2015

We need the ability to take full heap snapshots, visualize allocation graphs, do shortest path analysis from roots to individual objects, etc. If you don't provide tools like this then it can be extremely painful to ship reliable large scale software that uses garbage collection.

About Me

Back in the day I worked for several years at Digital Illusions on things like the first shipping deferred shaded game ("Shrek" - 2001), software renderers, and game AI. Then, after working for Microsoft at Ensemble Studios for 5 years as engine lead on Halo Wars, I took a year off to create "crunch", an advanced DXTc texture compression library. I then worked 5 years at Valve, where I contributed to Portal 2, Dota 2, CS:GO, and the Linux versions of Valve's Source1 games. I was one of the original developers on the Steam Linux team, where I worked with a (somewhat enigmatic) multi-billionare on proving that OpenGL could still hold its own vs. Direct3D. I also started the vogl (Valve's OpenGL debugger) project from scratch, which I worked on for over a year. In my spare time I work on various open source lossless and texture compression projects: crunch, LZHAM, miniz, jpeg-compressor, and picojpeg.