Sunday, May 10, 2015

We've been investing a bunch of time into creating a set of Mono (C#) memory leak and performance analysis tools, which we've been using to help us ship our first title (Dungeon Boss). Here's a high-level description of how the tools work and what we can currently do with them:

First, we have a custom instrumented version of mono.dll that captures and records a full transaction log of all mono and lower level libgc (Boehm collector) heap activity. It records to the log almost all internal malloc's/free's, mono allocs, and which mono allocs are collected during each GC. A full backtrace is stored with each log record.

We also record full RAM dumps at each GC, as well as the location of all static roots, thread stacks, etc. (The full RAM dumps may seem excessive, but they are extremely compressible with LZ4 and are key to understanding the relationships between allocations.)

We've also instrumented our game to record custom events to the log file: At menu, level start/end, encounter start, start of Update(), etc.

A typical workflow is to run the game in a Windows standalone build using our instrumented version of mono.dll, which generates a large binary transaction log. We then post process this log using a C# tool named "HeapBoss", which spews out a huge .JSON file and a number of binary heap RAM snapshots. We then explore and continue processing all this data using an interactive C++ command line tool named "HeapMan".

Here's a list of things we can currently do once we have a heap transaction log and GC RAM snapshots:

- Log exploration and statistical analysis:

Dump the log command index of all GC's, the ID's of all custom events, dump all allocs or GC frees of a specific type, etc.

- Blame list construction: We can replay the transaction log up to a specific GC, then recreate the state of the mono heap at that particular GC. We can then construct a "blame list" of those C# functions which are responsible for each allocation. We use a manually created leaf function name exclusion list (consisting of about 50 regex patterns) to exclude the deepest functions from each backtrace which are too low level (or internal) to be interesting or useful for blame list construction.

This method is useful for getting a picture of the top consumers of Mono heap memory at a specific spot in the game. We output this list to a .CSV file.

- Growth over time ("leak") analysis: We replay the transaction log up to spot A, create a heap state object, then continue playing up to spot B and create another heap state object. We then construct a blame list for each state object, diff them, then record the results to a .CSV file.

This allows us to see which functions have grown or shrunk their allocations over time. We've found many leaks this way. (Of course, they aren't leaks in the C/C++ sense. In C#, it's easy to construct systems which have degenerate memory behavior over time.)

- Various queries: Find all heap objects of type X. Find all objects on the heap which potentially have a pointer to a specific object (or address). Examine an object's memory at a specific GC and determine which objects it may potentially point to. Find the allocation that includes address X, if any. Find the vtable at address X, etc.

Note our tools don't know where the pointers are really located in a type, so when we examine the raw memory of an object instance it's possible to mistake some random bits for a valid object pointer. We do our best to exclude pointers which aren't aligned, or don't point to the beginning of another allocated object, etc. In practice this approach is surprisingly reliable.

- Root analysis to help find unintended leaks: Given an object, recursively find all objects which reference it, then find all the objects which refer to those objects, etc. until you discover all the roots which directly or indirectly refer to the specific objects. Output the results as a .DOT file and import into gephi for visualization and deeper analysis. (graphviz is another option, but it doesn't scale to large graphs as well as gephi.)

- Heap transaction videos: Create a PNG image visualizing the active heap allocations and synchronize this video with the game. Helps to better understand how the title is utilizing/abusing heap memory at different spots during gameplay.

What we'll be doing next:

- "Churn" analysis: Create a histogram of the blame function for each GC free, sort by the # of frees or # of bytes freed to identify those functions which are creating the most garbage over time.

- Automatically identify the parts of the heap graph that always grow over time. Right now, we do this manually by first doing a growth analysis, then finding which types are responsible for the growth, then finding the instance(s) which grow over time.

About Me

Back in the day I worked for several years at Digital Illusions on things like the first shipping deferred shaded game ("Shrek" - 2001), software renderers, and game AI. Then, after working for Microsoft at Ensemble Studios for 5 years as engine lead on Halo Wars, I took a year off to create "crunch", an advanced DXTc texture compression library. I then worked 5 years at Valve, where I contributed to Portal 2, Dota 2, CS:GO, and the Linux versions of Valve's Source1 games. I was one of the original developers on the Steam Linux team, where I worked with a (somewhat enigmatic) multi-billionare on proving that OpenGL could still hold its own vs. Direct3D. I also started the vogl (Valve's OpenGL debugger) project from scratch, which I worked on for over a year. In my spare time I work on various open source lossless and texture compression projects: crunch, LZHAM, miniz, jpeg-compressor, and picojpeg.