Fixing Memory Leak on Production Node.js Application

Jun 3, 2016

The last few days at work were rough.
My team was intensely focused on
getting our production environment stable.
We had been having some serious issues
in production, and one of the most
serious was a nasty memory leak.

The graph above displays the memory usage of
12 Node applications in our production
environment. The y-axis shows memory usage
as a percentage and the x-axis covers a span
of 7 days. This narrow sawtooth
pattern in memory usage is extremely bad.
As you can see, all 12 servers
are accumulating memory rapidly.
As a remedy for the memory leak, we had to
periodically restart our servers. This was less
than ideal, but because we had so many other
high priority items over the last few weeks, we
just had to bite the bullet and live with it.
However, the frequency at which we had to
restart the servers kept increasing.
We went from restarting them every few days,
to every day, to a point where we
were restarting our servers every 4 to 5 hours.
Thank god we have a globally distributed team
(Croatia and Argentina); it could have been a
lot tougher without team members in other
timezones. I can't
emphasize this point enough, and I will take
another opportunity to praise having a globally
distributed software development team in a
future post.

After some false starts, we fixed the
memory leak. As you can see below, the overall
memory usage stays flat after 13:30, when our
memory leak fix was applied.

We came up with several different strategies to
fix the memory issue, but the working solution
came down to taking two memory dumps at
different times and comparing their contents.
I used gcore on a production server to gather a
memory dump soon after a server restart, when
memory usage was around 30%. About 3 hours
later, when the server's memory usage started
to hover around 60%, I took another memory
dump. I used mdb on a local VM running
Solaris 11 to analyze the two memory dumps.
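
Roughly, the workflow looked like the sketch below. The exact mdb
invocations and the ::findjsobjects output format from the V8 debugger
module can vary between versions, and the file names here are made up,
but the idea is simply to save the object listing from each core file
and diff the per-constructor counts.

```js
// Rough sketch of the dump-and-diff workflow (file names are hypothetical).
//
// On the production box:
//   gcore <node-pid>          # core dump at ~30% memory usage
//   gcore <node-pid>          # again ~3 hours later at ~60%
//
// On the Solaris 11 VM, for each core file:
//   mdb node core.<pid>
//   > ::load v8.so
//   > ::findjsobjects ! cat > dump-30.txt   (and dump-60.txt)
//
// Each findjsobjects line is assumed to look roughly like:
//   <addr> <#objects> <#props> <Constructor>: <prop>, <prop>, ...
'use strict';
const fs = require('fs');

function countsByConstructor(file) {
  const counts = {};
  for (const line of fs.readFileSync(file, 'utf8').split('\n')) {
    const m = line.match(/^\s*\S+\s+(\d+)\s+\d+\s+([^:]+):/);
    if (!m) continue;
    const ctor = m[2].trim();
    counts[ctor] = (counts[ctor] || 0) + Number(m[1]);
  }
  return counts;
}

const before = countsByConstructor('dump-30.txt');
const after = countsByConstructor('dump-60.txt');

// List the constructors whose instance counts grew the most between dumps.
Object.keys(after)
  .map((ctor) => ({ ctor, from: before[ctor] || 0, to: after[ctor] }))
  .sort((a, b) => (b.to - b.from) - (a.to - a.from))
  .slice(0, 10)
  .forEach((row) => console.log(`${row.ctor}: ${row.from} -> ${row.to}`));
```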

Here is the memory dump of the production server
at around 30% memory usage.

The most notable cause of the memory increase is
TraceSegment: name, transaction, ....
You can see that the TraceSegment
object count increased roughly tenfold, going
from 19917 to 209896. You can also see some
other objects, such as Timer and Array,
increasing. This is because they are properties
of the TraceSegment object.
Contrast this with the genre-tags object,
Object: id, genreIds, which stayed at exactly
8775. That object is a JSON file parsed into
memory at startup of the app, so it does not
increase. This suggests to me that the memory
dumps are indeed accurate, and that Node keeps
memory steady for correctly managed objects.
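
The contrast between those two kinds of objects boils down to the two
patterns sketched below (the names are made up for illustration): data
parsed once at startup keeps a constant object count across dumps, while
objects created per request and kept reachable forever keep growing.

```js
'use strict';
const fs = require('fs');

// Stable pattern: a JSON file parsed once at startup. Its objects are
// allocated a single time, so their count looks the same in every dump.
const genreTags = JSON.parse(fs.readFileSync('./genre-tags.json', 'utf8'));

// Leaky pattern: objects created on every request and kept reachable,
// e.g. pushed onto a module-level array that is never cleared. Their
// count in the dump keeps climbing between snapshots.
const segments = [];
function onRequest(req) {
  segments.push({ name: req.url, startedAt: Date.now() });
}
```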

It turns out TraceSegment objects are created
by NewRelic. The fact that
in previous attempts I wasn't able to
convincingly reproduce the severity of the
memory leak in the development environment
also tells me that the leak was triggered by
something happening in the production environment.

We get some real benefits from NewRelic monitoring, but disabling NewRelic
in production was a no-brainer at this point.
We had to do it. We all acknowledged that
NewRelic is not the root cause of the memory leak;
there is still something in our code that is
behaving badly with NewRelic. But we had
to put out the damn fire on production
first. So we did it. We disabled NewRelic
in production, the memory usage stopped
climbing, and it remains very steady for now.
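
If I remember the agent config correctly, turning it off didn't even
require touching application code; something along these lines in
newrelic.js (or the NEW_RELIC_ENABLED environment variable, depending
on your agent version) does it:

```js
// newrelic.js -- the agent's config file at the root of the app.
// Setting agent_enabled to false leaves require('newrelic') harmless
// but stops the agent from instrumenting and collecting trace segments.
'use strict';
exports.config = {
  app_name: ['my-production-app'],                  // hypothetical name
  license_key: process.env.NEW_RELIC_LICENSE_KEY,   // key stays in the env
  agent_enabled: false                              // agent off in production
};
```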