A Tale of Two Leaks: An Incremental Leak

Posted on February 17, 2015

In part 1, we saw a memory leak that was in some sense static: The memory use was unintentional/undesired, but it was only allocated once, and didn’t get worse as the process continued. On the other hand, the second leak was a true leak in the classic sense: the memory use appeared to be unbounded.

Adam, the lead sustaining engineer at edX, and Ed, our Devops lead, were able to identify that the leak happened during our bulk-grading operations. During grading, we loop through every student in a single course and then loop through every gradable XBlock to identify whether we’ve already scored that XBlock; if not, we score the student on that block. Then we aggregate all of those grades based on the course’s grading policy.

Adam was able to narrow down the problem by creating a test case that graded a single student repeatedly. That test showed the same unbounded memory growth we observed in the overall process. Using objgraph, he was able to identify that each time the student was graded, a constant number of CombinedSystems were created and not released. This was suspicious, as those objects were meant to be ephemeral, existing only to combine the attributes of DescriptorSystem and ModuleSystem into a single object.
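The counting technique can be sketched with nothing but the standard library. This is not Adam's actual test: it uses the gc module as a stand-in for objgraph's type counts, and the CombinedSystem class, the leaked list, and grade_student_once are all hypothetical, invented only to simulate a grading pass that fails to release its object.

```python
import gc


class CombinedSystem(object):
    """Hypothetical stand-in for the real class; it only pairs two runtimes."""

    def __init__(self, module_system, descriptor_system):
        self.module_system = module_system
        self.descriptor_system = descriptor_system


def count_instances(type_name):
    """Count live, gc-tracked objects whose class has the given name."""
    gc.collect()  # clear unreachable cycles first so only live objects remain
    return sum(1 for obj in gc.get_objects() if type(obj).__name__ == type_name)


leaked = []  # simulates a stray reference that outlives each grading pass


def grade_student_once():
    """Hypothetical grading pass that accidentally keeps its CombinedSystem."""
    leaked.append(CombinedSystem(object(), object()))


passes = 5
before = count_instances("CombinedSystem")
for _ in range(passes):
    grade_student_once()
after = count_instances("CombinedSystem")

# A constant, non-zero growth per pass is the signature of an incremental leak.
growth_per_pass = (after - before) / passes
print("CombinedSystems retained per pass:", growth_per_pass)
```

If the count stays flat across passes, the objects are being released; if it climbs by the same amount every time, something is holding on to them.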

Adam was also able to dump the process’s memory with meliae after it had leaked, so we were able to dig further into the particulars of the CombinedSystems that were still held in memory.

Investigating the Memory Dump

Once we had a dump, I was able to begin investigating with memsee. My first attempt was to use the path command that pointed to the errant pointer in part 1. However, all of those attempts timed out before they found any path from the root to a CombinedSystem.

The first dict is just __dict__ from the LmsModuleSystem. Given that the CombinedSystems are just supposed to be pointing to a DescriptorSystem and ModuleSystem, it’s suspicious that a CombinedSystem is being held in memory in turn by an LmsModuleSystem.

It looks like we have a chain of LmsModuleSystem -> CombinedSystem -> LmsModuleSystem. (A side note about Python memory management: CombinedSystem appears as a direct parent of LmsModuleSystem, with no intervening dict, because CombinedSystem defines its attributes using __slots__. This is a good strategy for ephemeral objects, as it saves you from allocating an additional dictionary per instance.)
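To make the __slots__ side note concrete, here is a minimal sketch comparing two hypothetical classes (WithDict and WithSlots are illustrative names, not edx-platform code). An instance of a class that declares __slots__ has no __dict__, which is why no dict shows up between the object and what it references in a memory dump.

```python
class WithDict(object):
    """Ordinary class: attributes live in a per-instance __dict__."""

    def __init__(self, module_system, descriptor_system):
        self.module_system = module_system
        self.descriptor_system = descriptor_system


class WithSlots(object):
    """Slotted class: attributes are stored directly on the instance."""

    __slots__ = ("module_system", "descriptor_system")

    def __init__(self, module_system, descriptor_system):
        self.module_system = module_system
        self.descriptor_system = descriptor_system


plain = WithDict("module", "descriptor")
slotted = WithSlots("module", "descriptor")

plain_has_dict = hasattr(plain, "__dict__")
slotted_has_dict = hasattr(slotted, "__dict__")

# Without a __dict__, a slotted instance also rejects undeclared attributes.
try:
    slotted.extra = 1
    rejected_extra = False
except AttributeError:
    rejected_extra = True
```

The trade-off is flexibility: slotted instances cannot grow new attributes, which is acceptable for a small, short-lived wrapper object.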

Designing a Fix

Looking at the relevant edx-platform code, there’s only one place where CombinedSystems are constructed:

This seems like a good candidate for something that would be storing a CombinedSystem. In fact, looking at module_render.py, which is the primary entry point in the LMS for working with XBlocks, we see that we’re passing descriptor.runtime in to that argument:

In the end, we’re building up a chain of LmsModuleSystems and CombinedSystems, and never releasing them.

The initial fix was to extract the actual DescriptorSystem from the CombinedSystem and pass that to the LmsModuleSystem. That ensures that the reference to the previous LmsModuleSystem is released when xmodule_runtime is re-bound. The code to make that change is in this PR.
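The shape of the leak and of the fix can be illustrated with a small, self-contained model. Everything here is an assumption made for illustration: the classes are simplified stand-ins rather than the real edx-platform ones, the _runtime attribute name and the bind_leaky, bind_fixed, and chain_length helpers are hypothetical, and the actual change is the one in the PR.

```python
class DescriptorSystem(object):
    """Hypothetical stand-in for the descriptor runtime."""


class ModuleSystem(object):
    """Hypothetical stand-in for LmsModuleSystem: keeps the runtime it is given."""

    def __init__(self, descriptor_runtime):
        self.descriptor_runtime = descriptor_runtime


class CombinedSystem(object):
    """Ephemeral pairing of a module runtime and a descriptor runtime."""

    __slots__ = ("module_system", "descriptor_system")

    def __init__(self, module_system, descriptor_system):
        self.module_system = module_system
        self.descriptor_system = descriptor_system


class Descriptor(object):
    def __init__(self):
        self._runtime = DescriptorSystem()  # assumed name for the raw runtime
        self.xmodule_runtime = None

    @property
    def runtime(self):
        # Builds a fresh CombinedSystem that points at the current module runtime.
        return CombinedSystem(self.xmodule_runtime, self._runtime)


def bind_leaky(descriptor):
    # The new ModuleSystem holds a CombinedSystem, which holds the old ModuleSystem.
    descriptor.xmodule_runtime = ModuleSystem(descriptor.runtime)


def bind_fixed(descriptor):
    # Pass only the DescriptorSystem, so the old ModuleSystem is released on re-bind.
    descriptor.xmodule_runtime = ModuleSystem(descriptor._runtime)


def chain_length(descriptor):
    """Count the ModuleSystems reachable from the descriptor's current runtime."""
    length = 0
    system = descriptor.xmodule_runtime
    while isinstance(system, ModuleSystem):
        length += 1
        held = system.descriptor_runtime
        system = held.module_system if isinstance(held, CombinedSystem) else None
    return length


leaky, fixed = Descriptor(), Descriptor()
for _ in range(5):
    bind_leaky(leaky)
    bind_fixed(fixed)

leaky_length = chain_length(leaky)
fixed_length = chain_length(fixed)
```

In this model, each leaky re-bind adds one more link to the chain, while the fixed version never keeps more than the current ModuleSystem alive.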

A more robust fix would be for ModuleSystem to expect a pointer to a descriptor, rather than the descriptor runtime, so that it can extract the DescriptorSystem from the CombinedSystem itself.

My name is Calen Pennington. I'm a software developer and lead architect at edX,
father to a two-year-old, part-time Haskell hacker, board/card/video-gamer. This blog will primarily
focus on the first of those.