At some point in the life of every Rails developer you are bound to hit a memory leak. It may be a tiny amount of constant memory growth, or a spurt of growth that hits your job queue when certain jobs run.

Sadly, most Ruby devs out there simply employ monit, inspeqtor or unicorn worker killers. This allows you to move along and do more important work, sweeping the problem snugly under the carpet.

Unfortunately, this approach leads to quite a few bad side effects. Besides performance pains, instability and larger memory requirements, it also leaves the community with a general lack of confidence in Ruby. Monitoring and restarting processes is a very important tool in your arsenal, but at best it is a stopgap and a safeguard. It is not a solution.

We have some amazing tools out there for attacking and resolving memory leaks, especially the easy kind: managed memory leaks.

Are we leaking?

The first and most important step for dealing with memory problems is graphing memory over time. At Discourse we use a combo of Graphite, statsd and Grafana to graph application metrics.

A while back I packaged up a Docker image for this work which is pretty close to what we are running today. If rolling your own is not your thing, you could look at New Relic, Datadog or any other cloud-based metric provider. The first metric you need to track is RSS for your key Ruby processes. At Discourse we look at max RSS for Unicorn, our web server, and Sidekiq, our job queue.
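For illustration, here is a minimal sketch of feeding that metric to statsd. This is not Discourse's actual collector; it assumes Linux /proc, the statsd-ruby gem and pgrep, and the metric name is made up:

require 'statsd'

statsd = Statsd.new('localhost', 8125)

# VmRSS is reported in kB in /proc/<pid>/status
def rss_kb(pid)
  File.read("/proc/#{pid}/status")[/VmRSS:\s+(\d+)/, 1].to_i
end

loop do
  # find the unicorn workers and report the worst offender as a gauge
  pids = `pgrep -f 'unicorn worker'`.split.map(&:to_i)
  statsd.gauge('unicorn.max_rss_kb', pids.map { |pid| rss_kb(pid) }.max.to_i)
  sleep 60
end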

Discourse is deployed on multiple machines in multiple Docker containers. We use a custom-built Docker container to watch all the other Docker containers. This container is launched with access to the Docker socket so it can interrogate Docker about Docker. It uses docker exec to get all sorts of information about the processes running inside the containers.

Note: Discourse uses the unicorn master process to launch multiple workers and job queues; it is impossible to achieve the same setup (which shares memory among forks) in a one-container-per-process world.

With this information at hand we can easily graph RSS for any container on any machine and look at trends.

Long term graphs are critical to all memory leak analysis. They allow us to see when an issue started. They allow us to see the rate of memory growth and shape of memory growth. Is it erratic? Is it correlated to a job running?

When dealing with memory leaks in C extensions, having this information is critical. Isolating C extension memory leaks often involves valgrind and custom-compiled versions of Ruby that support debugging with valgrind. It is tremendously hard work that we only want to take on as a last resort. It is much simpler to observe that a trend started right after upgrading EventMachine to version 1.0.5.

Managed memory leaks

Unlike unmanaged memory leaks, tackling managed leaks is very straightforward. The new tooling in Ruby 2.1 and up makes debugging these leaks a breeze.

Prior to Ruby 2.1, the best we could do was crawl our object space, grab a snapshot, wait a bit, grab a second snapshot and compare. I have a basic implementation of this shipped with Discourse in MemoryDiagnostics; it is rather tricky to get right. You have to fork your process when gathering the snapshots so you do not interfere with it, and the information you can glean is fairly basic: we can tell certain objects leaked, but cannot tell where they were allocated.
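The technique looked roughly like this (a sketch only; the real MemoryDiagnostics implementation forks before walking the heap, which this sketch skips):

# Census the heap by class, wait for the leak to accumulate, census again, diff.
def census
  counts = Hash.new(0)
  ObjectSpace.each_object(Object) { |obj| counts[obj.class] += 1 }
  counts
end

before = census
sleep 300
after = census

after.each do |klass, count|
  delta = count - before[klass]
  puts "#{klass}: +#{delta}" if delta > 0
end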

If we were lucky, we would have leaked a number or string that is revealing, and that would be enough to nail it down.

Additionally, we have GC.stat, which can tell us how many live objects we have, and so on.

The information was very limited. We could tell quite clearly that we had a leak, but finding the reason could be very difficult.

Note: a very interesting metric to graph is GC.stat[:heap_live_slots]; with that information at hand we can easily tell if we have a managed object leak.
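A quick check along those lines (a minimal sketch, not Discourse's monitoring code):

# After a full GC, a heap_live_slots count that keeps climbing over time means
# managed objects are being retained; a stable number rules out a managed leak.
GC.start(full_mark: true, immediate_sweep: true)
puts GC.stat[:heap_live_slots]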

Managed heap dumping

Ruby 2.1 introduced heap dumping; if you also enable allocation tracing, you have some tremendously interesting information.

The process for collecting a useful heap dump is quite simple:

First, turn on allocation tracing:

require 'objspace'
ObjectSpace.trace_object_allocations_start

This will slow down your process significantly and cause it to consume more memory. However, it is key to collecting good information and can be turned off later. For my analysis I will usually enable it directly after boot.

When debugging my latest round of memory issues with Sidekiq at Discourse, I deployed an extra Docker image to a spare server. This allowed me extreme freedom without impacting our SLA.

Next, play the waiting game

After memory has clearly leaked (you can look at GC.stat or simply watch RSS to determine this has happened), grab a heap dump.
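One way to produce the dump (shown in-process here; you can inject the same thing into a live process with rbtrace, which is what the comments below refer to; the /tmp/ruby-heap.dump path is just a convention):

require 'objspace'

# Force a GC first so the dump contains only live objects, then stream every
# heap object, with its allocation file/line/generation, as line-delimited JSON.
GC.start
io = File.open("/tmp/ruby-heap.dump", "w")
ObjectSpace.dump_all(output: io)
io.close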

We expect a large number of objects to be retained after boot and sporadically when requiring new dependencies. However, we do not expect a consistent number of objects to be allocated and never cleaned up. So let's zoom into a particular generation.
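A small script along these lines does the job (a sketch; generation 36 is an arbitrary example, not a number from the Discourse dump):

require 'json'

# Each line of the dump is one JSON document describing one heap object.
data = File.readlines("/tmp/ruby-heap.dump").map { |line| JSON.parse(line) }

# How many objects does each generation retain?
data.group_by { |row| row["generation"] }
    .sort_by { |generation, _| generation.to_i }
    .each { |generation, rows| puts "generation #{generation} objects #{rows.count}" }

# Zoom into one suspicious generation, grouped by allocation site.
data.select { |row| row["generation"] == 36 }
    .group_by { |row| "#{row['file']}:#{row['line']}" }
    .sort_by { |_, rows| -rows.count }
    .each { |location, rows| puts "#{location} * #{rows.count}" }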

It turns out we were sending in a hash from the email message builder that included ActiveRecord objects. That hash was later used as a key for the cache, and the cache allowed up to 2000 entries. Considering that each entry could involve a large number of AR objects, memory leakage was very high.

To mitigate, I changed the keying strategy, shrunk the cache and completely bypassed it for complex localizations.
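In shape, the keying change looked something like this (an illustration only, not the actual Discourse commit; the names and the cap are made up):

# Key the cache on primitive values so it never retains ActiveRecord objects,
# and keep it much smaller than the old 2000-entry limit.
LOCALIZATION_CACHE_MAX = 500

def localization_cache_key(locale, message_type, user)
  "#{locale}/#{message_type}/#{user.id}"
end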

At the top of our list we can see that our JavaScript engine, therubyracer, is leaking lots of objects. In particular, we can see that the weak references it uses to maintain Ruby-to-JavaScript mappings are being kept around for way too long.

To maintain performance at Discourse we keep a JavaScript engine context around for turning Markdown into HTML. This engine is rather expensive to boot up, so we keep it in memory, plugging in new variables as we bake posts.

Since our code is fairly isolated, a repro is trivial. First, let's see how many objects we are leaking using the memory_profiler gem.
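A hedged sketch of that measurement (the context setup is illustrative; the real repro exercises the Discourse Markdown pipeline):

require 'memory_profiler'
require 'v8' # therubyracer

context = V8::Context.new

# Objects still retained after the block and a GC are the leak candidates.
report = MemoryProfiler.report do
  100.times { context.eval("1 + 1") }
end

report.pretty_print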

This WeakValueMap object keeps growing forever, and nothing seems to be clearing values from it. The intention of the WeakRef usage was to ensure these objects can be released when no one holds references to them. The trouble is that the reference to the wrapper itself is now kept around for the entire lifetime of the JavaScript context.
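The failure mode in miniature (a standalone illustration of the pattern, not therubyracer's actual code): the WeakRef lets its target be collected, but the collection holding the WeakRef wrappers themselves never shrinks.

require 'weakref'

map = {}
100_000.times { |i| map[i] = WeakRef.new(Object.new) }

GC.start
# The Object targets can now be collected, yet the map still holds
# 100,000 WeakRef wrappers, so the map itself grows without bound.
puts map.size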

Comments

Steve Gooberman Hill
about 3 years ago

Very good and interesting article. Ruby has come a long way in some of these areas. There are also possibly some underlying issues that only show up when compiling Ruby on alternate (and possibly not brilliantly supported) platforms. I personally have seen memory corruption issues appearing in floats on some ARM platforms with Ruby 1.8 (Floats get filled with 0xFFs).

Thxs for sharing this

Steve

about 3 years ago

Thanks for this, quite helpful! I tried your heap dump method; however, the result is odd. I keep getting exactly the same lines over and over again using your analyze code. I can run my app, visit a few pages, dump, analyze, and the generations and object counts are always exactly the same. What am I doing wrong?

That looks good; keep running it for a bit. How does it look after, say, 30 minutes / 1 hour?

Then have a look at a particular generation to see what kind of objects were retained.

about 3 years ago

Okay, well, I did try that a few times and it keeps giving me the same object counts and generations, which is puzzling me. It doesn't even matter which process I run rbtrace on, or that the heap dump files have very different file sizes. But I will try for longer.

about 3 years ago

Okay, was running the app for an hour, opening a page that usually causes unlimited memory growth. Got the exact same output. Seems wrong to me.

about 3 years ago

LOL! Found the mistake, got the file path wrong.

Rails.root + "/tmp/ruby-heap.dump"
=> #<Pathname:/tmp/ruby-heap.dump>

Haha, it analyzed the file I generated much earlier at /tmp/ruby-heap.dump instead of the one in my Rails app's folder.

So far the analysis points to the rails router (journey) for a potential leak…

Most retained objects are in this line of code: Cat.new(val.first, val.last) in the Journey parser. Wonder where Cat is defined and what it does.

Your article inspired me to create a list of Ruby gems that have memory leaks - https://github.com/ASoftCo/leaky-gems. There are more gems with C extensions that do not release memory. Check this list out!

Ryan Mulligan
almost 2 years ago

Thanks for the post: it got me interested in statsd, which seems great. How do you populate statsd with the Unicorn worker memory information?

Sam… Great forensic work, blowing the lid off leaks… Except it leaves a shallow pit: why does it take 300,000 processes to launch an empty app… memory leakage plugged by stay-alive processes the coder has no idea exist… can these randomly spawn (as do updates and Google features) from an overactive OS?

…Memory growth over time is a sure sign of lost code control… I'm looking to deploy a one-million-open-session client request "discovery" database and certainly won't be using Ruby / Rails… Containers are great, but rkt is superior to mainstream Docker… Do you recommend any database in an IB RDMA cluster?

The Elixir model of a second process watching each process (graceful flow control) is similar at a more atomic level… Again, is leakage managed by GC (which can't stand its ground) or by trimming parameters, as in your case? The tools to pinpoint and kill such processes, or at least actively adjust (update) them, as per hot reloading… For websockets specifically, this allows deployments to a machine that might have millions of active connections without immediately bombarding the server with millions of reconnection attempts, losing any in-progress messages. That's why WhatsApp is built on Erlang, by the way. Hot reloading has been used to deploy updates to flight computers while planes were in the air.

WHAT ABOUT WATCHING THE OS KERNEL?

"We use a custom built Docker container to watch all the other Docker containers. This container is launched with access to the Docker socket so it can interrogate Docker about Docker. It uses docker exec to get all sorts of information about the processes running inside the container."

I am not sure what a "one container per process" world is? That does not exist…

" it is impossible to achieve the same setup (which shares memory among forks) in a one container per process world."

A high-client-connect (one million) server cluster needs to be a FreeBSD or CentOS install using PF_RING… If you come across an out-of-the-box framework, please let me know before my cluster kills us… Mine is internally switched IB RDMA, OFED standard, which pretty much knocks out httpLite offerings… OFED rewrites how data is handled (kernel bypass), thus I can't see any point in using outdated DB structures (and complexity causing bloat)…