Plugging Memory Leaks in Confluence

Recently the Confluence team became aware that our public-facing instance, http://confluence.atlassian.com, was slowly leaking memory. This blog entry is the story of how we found and fixed the leaks.

What makes us think there’s a leak?

The initial reason we thought there might be a memory leak was that there were OutOfMemoryErrors occurring semi-regularly. That doesn’t necessarily imply the app was leaking memory. It could just mean that periodically it received so much load from users that it needed more memory than it had.

The second clue was this graph of Confluence’s memory usage. The red line shows the amount of memory used after a complete garbage collection is performed.

There were three steps to creating the above graph.

1. Turn on garbage collection logging using the following JVM options:
-XX:+PrintGCTimeStamps -verbose:gc -Xloggc:gc.log

2. Extract from gc.log the timestamp and the heap usage remaining after each full garbage collection, as a two-column table.

3. Place the data in a Confluence chart macro: {chart:type=xyLine|dataOrientation=vertical|width=1200|height=400|colors=gray,red} … {chart}
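The log-extraction step can be sketched roughly as follows, assuming the classic HotSpot log format produced by the options above (lines such as `1234.567: [Full GC 524288K->98304K(1048576K), 1.23 secs]`); the class name, method names and exact log format here are assumptions, not the script we actually used:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Turns a -Xloggc log into "seconds,usedAfterFullGcKb" CSV rows for charting. */
public class GcLogParser {

    // Matches lines like "1234.567: [Full GC 524288K->98304K(1048576K), 1.23 secs]"
    static final Pattern FULL_GC =
            Pattern.compile("([\\d.]+):\\s*\\[Full GC.*?(\\d+)K->(\\d+)K\\((\\d+)K\\)");

    /** Returns "timestamp,heapUsedAfterGcKb" for a full-GC line, or null otherwise. */
    static String toCsvRow(String line) {
        Matcher m = FULL_GC.matcher(line);
        return m.find() ? m.group(1) + "," + m.group(3) : null;
    }

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(
                new FileReader(args.length > 0 ? args[0] : "gc.log"));
        System.out.println("seconds,usedAfterFullGcKb");
        for (String line = in.readLine(); line != null; line = in.readLine()) {
            String row = toCsvRow(line);
            if (row != null) {
                System.out.println(row);
            }
        }
        in.close();
    }
}
```

The resulting CSV can be pasted into a table inside the chart macro; charting only the post-full-GC figures is what makes the red line meaningful, since heap usage between collections says little about leaks.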

The chart shows that on average, the memory increases steadily over time, suggesting a slow leak. There was still the possibility that Confluence was legitimately caching something, but it was definitely worth looking into.

The Easy Evidence: OutOfMemoryError

The first attempt at finding the culprit was to search through the stack traces that had accompanied OutOfMemoryErrors to see if any obvious candidates reared their heads. A few candidates came up, so we decided to test them in isolation.

Profiling locally with YourKit

We retrieved backups of the Confluence database and home directory and set up a local approximation of confluence.atlassian.com. We ran this Confluence instance with the YourKit agent attached and the YourKit profiler connected. This allowed us to see just how much memory these memory-leak candidates used, and whether they actually leaked or just used a lot.

They all turned out to be features that simply used large amounts of memory and pushed usage over the edge, rather than true leaks. In some instances this memory usage was inappropriate, e.g. when viewing a mail archive, all the mail is temporarily loaded into memory (CONF-9576).

Reducing the XML Space Export Footprint

The most common cause of confluence.atlassian.com running out of memory is the Space Export. Exporting a Confluence space, particularly to XML, is a highly resource-intensive process. Most users aren't even permitted to do it.

We discovered that much of the slowness and memory usage of exporting the Confluence documentation space was due to the quantity of mails, trackback links and referral links included in the export. None of these objects was really required in that space, so we moved or removed them and the export sped up dramatically. We also added an improvement request to make exporting these aspects of a space optional.

JMeter’s Access Log Sampler

Testing the likely candidates took quite some time, much of it spent clicking around the UI, including some runs of JMeter with standard usage scenarios. We expected the memory leak to show its head when we created a heap snapshot in YourKit and examined it. When it didn't, the obvious next step was to throw in more load and to make that load more varied. Enter the Access Log Sampler.

We fed the JMeter Access Log Sampler with one day’s worth of Apache Access Logs from the confluence.atlassian.com server, and set it to spend a night attacking our test copy of confluence.atlassian.com. We set the YourKit profiler to take snapshots every hour and whenever the memory was low.

The Overpopulated Cache

JMeter worked like a charm and at last YourKit reported massive memory usage. Looking through the latest snapshot we found that the TinyMceServlet, which is part of our Rich Text Editor, was using around 50 MB of memory in a cache.

This cache was only meant to hold a few different items, so something was very wrong: far too many items were being added to it. The cache was a HashMap, so we immediately went to the hashCode() and equals() methods of the object used as the key in this cache.

The equals and hashCode methods were actually generated by IntelliJ IDEA's wonderful generate equals() and hashCode() feature. Usually it's excellent, but in this case it clearly went wrong.

The equals method compares the content of the plugins array, using (!Arrays.equals(plugins, that.plugins)). However, the hashCode method uses the array object's hashCode method inherited from Object. This means two TinyMceConfigurations can be equal, since their plugins arrays have equal content, and yet have different hash codes, because the arrays are not the same object.
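A minimal sketch of the problem, reducing TinyMceConfiguration to just its plugins field (the real class has more fields, and this is a reconstruction rather than the actual generated code):

```java
import java.util.Arrays;

// Sketch of the broken cache key: equals() is content-based but
// hashCode() is identity-based, violating the equals/hashCode contract.
public class TinyMceConfiguration {
    private final String[] plugins;

    public TinyMceConfiguration(String[] plugins) {
        this.plugins = plugins;
    }

    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        TinyMceConfiguration that = (TinyMceConfiguration) o;
        // Correct: compares the *content* of the two arrays.
        return Arrays.equals(plugins, that.plugins);
    }

    public int hashCode() {
        // BUG: String[] inherits identity-based hashCode() from Object, so two
        // configurations with equal plugin lists almost always hash differently.
        return plugins != null ? plugins.hashCode() : 0;
    }
}
```

In a HashMap, lookup first selects a bucket by hashCode and only then calls equals, so an equal key with a different hash code is simply never found, and a fresh entry is inserted instead.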

The cache is meant to see very few different configurations passed to the servlet, so that nearly every result is found in the cache and the cache stays very small. Unfortunately, because the hashCode method of the configuration key is broken, most requests to the servlet generate a new (nearly identical) entry in the cache. Since this occurs every time a new user edits a page with the rich text editor, the leak is quite serious.

The fix is simple, but not as simple as using the Arrays.hashCode method, partly because that method was introduced in a later JDK than our baseline of 1.4, and partly because it would be a patch over an underlying problem. Some readers will have looked at the code and immediately griped that final arrays are misleading: the configuration object is entirely immutable except for the content of that array, which in reality never changes.

The fix we used was to change the array of plugin names to an unmodifiable Set. The equals and hashCode methods of that set both use the content of the set, so it can be used safely in the equals and hashCode methods of TinyMceConfiguration.
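Under the same simplification as before, the fixed key class looks roughly like this (generics are shown for clarity; the JDK 1.4-era original would have used raw types):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Sketch of the fix: store the plugin names in an unmodifiable Set, whose
// equals() and hashCode() are both content-based and mutually consistent.
public class TinyMceConfiguration {
    private final Set<String> plugins;

    public TinyMceConfiguration(String[] pluginNames) {
        this.plugins = Collections.unmodifiableSet(
                new HashSet<String>(Arrays.asList(pluginNames)));
    }

    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        return plugins.equals(((TinyMceConfiguration) o).plugins);
    }

    public int hashCode() {
        // Content-based, so it is now consistent with equals().
        return plugins.hashCode();
    }
}
```

With equal configurations now hashing to the same bucket, repeated requests hit the existing cache entry instead of piling up near-duplicates, and the unmodifiable wrapper also documents the immutability the final array only hinted at.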