For the past 9 months or so, there’s been an irritating, on-going memory leak with the public Krugle.org site.

Unfortunately catching the problem on the live site would mean letting the live site die a horrible death and then paw through the smoldering ruins. Which is something we weren’t willing to do.

So instead we set up a system that replayed the public site access logs, against an exact clone of the public site. And the problem didn’t show up. Nice.

After taking a look at the fidelity of the replay, we realized that some requests were being ignored, specifically HTTP POST requests – the access logs don’t contain the body of the POST, so it’s not possible to actually replay the exact request. These types of requests were primarily logins, and handling user activity logging.

So we took the next step of simulating these during the replay, by using fake data. But activity logging couldn’t handle the load of trying to replay a day’s worth of activity in an hour, so we disabled that. Which was a mistake.

Because eventually, after more comparison on live systems, the finger of blame kept pointing back at user activity logging.

Which meant we finally did a long overdue code review of this part of the system, and came across an interesting bit of code in the constructor for a class we use to send out HTTP requests:
configureLogger(Logger.getLogger("httpclient"));

We use Log4J to handle logging, and configureLogger is a method that configures logging. The configureLogger method had this interesting line of code:
logger.addAppender(new ConsoleAppender(new PatternLayout("%-5p %d{ISO8601} [%t] - %m%n")));

As soon as I saw this, I realized the problem. All logging systems have a singleton “root” logger and a logging hierarchy. The call to addAppender will effectively hand off this new appender object to the logging system, and it will never be deallocated for as long as the system runs.

Previously this wasn’t much of a problem, when we used a single HTTP sender object. But this wouldn’t handle the load, so I’d changed things to use a thread pool, which in turn dynamically allocated an HTTP sender for each request. And suddenly we wound up with a gnarly memory leak.