PaaS Under the Hood, Episode 6: How to Optimize the Memory Usage of Your Apps

At dotCloud, the entire engineering team have daily rotations in customer support. When reporting for support duty, a question that I get often is “how can I optimize my memory usage?”. And since our pricing model is calculated based on memory (RAM) consumption, it comes to no surprise that many customers ask us “why does my app use so much RAM?”. Or that they are looking for answers as to how they can reduce the memory footprint of their app, which is quasi-synonymous to the question “how can I reduce my bill?”

There is no “TL,DR” on this matter. Memory usage is the product of several complex factors. Reducing the memory footprint requires a lot of technical knowledge and understanding. But after reading this article, you should have some understanding of the techniques used for memory optimization.

And, the good news is that this information is applicable for apps running on dotCloud, but also to virtually all apps running on any hosting platform.

Know Your Memory

Before talking about reducing memory usage, let’s make sure we have a common understanding of what memory is used for. For simplicity, we will talk about 3 types of memory: program memory, data memory, and kernel memory.

Application developers generally think of “program memory” which is the type of memory used for computation. If you have a Computer Science background, that’s what we call “heap” and “stack”. When you generate a thumbnail from a high-resolution photograph, you may need to load the original into memory before executing the resize algorithm. When you send a request to an external API which sends back a large result in e.g. JSON or XML, the result will have to be parsed. Many parsers require loading the whole result in memory before actually processing it. When your app is a web app or web service (a common use case for many of our readers), the web server running the app would need a bit of memory to store the state of each client connection, parse its HTTP headers, maybe buffer some output. (In the latter case, there will also be some per-connection kernel buffers involved in the process.)

The second type of memory is “data memory”, or “cache”. Data memory or “cache” corresponds to data which has been read from disk or will be written to disk shortly. There is a huge performance difference between memory and disk. RAM is typically 100x to 1000x faster than disk. Putting all your data into memory will get you blazing fast performance. However, there is also a significant cost difference between memory and disk. Your operating system and/or your database will make use of the amount of memory given to the application. However, you will want to make sure that the most-often requested data stays in memory for faster access. ”Data memory” includes not only the data that you read and write to disk but also your program’s dynamically-loaded libraries.

And finally, the third type is “kernel memory”. That’s a broad category for all the memory needed by your system to fulfill its duties such as special-purpose buffers of device drivers, low-level system data structures and others. It is not easy to dissect this memory without good knowledge of the inner working of your system kernel. It is a challenge to make adjustments and tuning for this memory. Fortunately, except in rare cases, this memory usage remains low in proportion to the other types of memory.

Measuring Memory Usage

Different kinds of memory need to be measured with different tools.

Resident Set Size

Measuring program memory is fairly straightforward. If you are using any kind of UNIX system, it will be shown by the classic tools top or ps, under the column labelled RSS, for Resident Set Size. That’s the amount of physical memory used by each process.

Tech Tip: To see the biggest memory hogs on your system, run top, then hit shift-M. It will sort the processes by memory usage instead of CPU usage.

Note: the Resident Set Size does not include the following:

Things which have been swapped out – If your program needs more memory than is physically available, it will eventually swap. However, its Resident Set Size will not increase, but it will still be putting a lot of pressure on the system. The swapping activity will probably kill your I/O performance.

Kernel network buffers – If your connected clients are sending data faster than you can read, they will eventually get throttled. However, the data will accumulate in kernel buffers. Conversely, if you send data faster than the clients can read (or faster than the network permits), you will also be throttled (because writing to the sockets will be blocked). And the data will also accumulate in kernel buffers.

Buffer cache – If you copy a 1 GB file by reading it in blocks of 1 KB and writing those blocks one after the other, there could be up to 2 GB sitting in buffer cache memory as a result! However, this memory usage will not show up under your process usage tab.

Kernel data structures – If your code walks through a complete file system to compute the total size of the files it contains, while it will not read the size themselves, it will exercise the VFS (“Virtual File System”) layer, and cache a lot of file metadata into memory.

System-Wide Buffer Cache

We just talked about the buffer cache. When a process reads a file from disk, the system will “pull” the required data blocks from the underlying device, and the process will actually read from memory. The data blocks will remain in memory – just in case they are needed again later. This will dramatically speed up the execution. Actually it will speed up the loading time of your programs. If the system had to actually read everything from disk each time it’s needed, even simple commands like ls, or running trivial shell scripts, would be 100x or 1000x slower. Running programs written in Python, Ruby, Java, etc. would take up extra seconds just to bring up the interpreter or virtual machine and load required dependencies. But thanks to the buffer cache, the code and data will most likely be in RAM, accessible in nanoseconds instead of dozens or hundreds of milliseconds if the data had to be loaded from disk.

However, the buffer cache is located at a very low level that is between the block devices and the virtual filesystem layer. This means that when a data block is loaded from disk to memory, the kernel has no notion of its ownership. The kernel is not aware about the file to which it belongs, or which process caused it to be loaded. It just knows that it is block number X, on the block device Y.

This means that the buffer cache as a whole can only be seen by the kernel as one big chunk of memory. It’s impossible to tell which processes are responsible for the loading of specific data.

Impossible? “Impossible” does not belong to the vocabulary of a Linux kernel developer. So read on!

Memory Control Group

“Control groups” are a very powerful feature of recent Linux kernels. We already discussed them in great length in this blog post. Control groups can measure the resources (CPU, memory, I/O…) used by processes or set of processes. They can also ensure fair sharing of those resources, and set limits on them. But what is most important to memory optimization is fine-grained accounting.

If you want to know the amount of memory used by a process, we already mentioned that you can look up its RSS usage. If you want to know the amount of memory used by a group of processes, simply adding the combined RSS usage by the group of processes would be erroneous. Why? Because in some cases, the memory usage may be counted twice or even multiple times, thanks to COW (Copy-On-Write). This will require a detailed explanation. So to keep things simple and relevant, we will examine what happens within servers using a common variant of the “pre-fork model”.

One way to handle concurrent HTTP requests is to use one process per request. Setting up a new process can take time, especially when using a big framework. There are many libraries and modules to load, and perhaps even some templates to compile in memory. So one common optimization technique is to “pre-fork” processes so that they are ready to accept requests. When a request comes in, it can be handled at once, instead of waiting for everything to get initialized. And this can be optimized even further. For example, instead of creating 10 “blank” processes, and having to repeat the initialization steps 10 times, it is possible to do the initialization in a master process once and then create 10 identical processes by forking the master process. Also, virtually all modern UNIX systems use copy-on-write, meaning that they will not do a copy of the memory of the process. The system will copy a given page only if the parent or child process tries to write it.

Going back to our beancounting problem, if a process has 100 MB of RSS memory, when you fork the process, the newly created process will also have 100 MB of RSS memory. So the amount of RSS memory used by the original process and the forked one will be a total of 200 MB, even if the actual memory use on the system increased by less than 1 MB. The conclusion is that just taking the sum of RSS numbers may be wrong, especially in cases where the systems make heavy use of pre-forking. This is the case for PHP, Python and Ruby servers.

The memory control group comes to the rescue! It lets you put processes in groups, and then you can observe the total (and accurate) memory usage for the whole group. It does not care about the size of the processes. It takes into account each page of physical memory that is actually used and for which purpose. This gets even better. It also tracks page ins (the fact of loading something from a disk) caused by processes inside a control group. It means that the buffer cache memory, which was initially one big indistinct pool, can now carefully accounted for. The kernel tracks who was responsible for loading every little bit of data. Of course, things can become complicated when some of the memory is used by processes located in different process groups. In a future blog post, we will provide more details about memory accounting done by control groups. The memory cgroup gives us information that helps us determine the optimal amount of memory to allocate to each app.

For now, control groups give us an accurate way to measure the memory usage of an app or technically, a group of processes. On the dotCloud platform, an advanced metrics system (based on collectd) reports on the memory usage (and other metrics) about your apps on regular intervals which we then display on the dotCloud dashboard. We will discuss the metrics system in detail in a future blog post. For now, if you want to know more about metering control groups with collectd, you can refer to our documentation on the collectd plugin for LXC. This special purpose code uses liblxcstats and is available to the open source community.

Page-In & Page-Out

Measuring memory usage is a good starting point, but it is not enough. Consider the following scenario. Your database holds 10 GB of data and has 1 GB of RAM allocated to it. What is the performance? That really depends on the size of the active set. The active set is is loosely defined by being the set of data which is frequently read or updated by your application.

For instance, if 99% of the requests always hit a given subset of the data, 100 MB in size. Then the active set is equivalent to 100 MB. The application will perform well because it is accessing the active set of data in RAM. The occasional request hitting data outside of the active set may trigger disk activity, but overall performance will be fine. Of course, there can be rare cases where the application performance is downgraded because the active set is evicted from memory. For instance, a nightly cron job that is updating the whole dataset will cause the application to run much slower than expected.

What happens when the active set does not fit into memory? The result is quite simple. Many requests would hit the disk. And it’s worth reminding that disk access is easily 100-1000x slower than memory access. The actual performance hit will depend on many factors. In most cases, if the active set exceeds available memory by just a few percentage points, you can expect 2X to 10X slower response times.

Determining the size of the active set is very hard because it can change depending on external factors (e.g. number of visitors or amount of traffic). Actually, a good way to measure the active set size is to vary the amount of RAM available for the application, and to observe corresponding change in disk activity. Start with an unlimited (or very large) amount of RAM, then gradually reduce the RAM until you witness significant disk activity – now that’s your active set size!

If your hosting platform supports it, you can vary the amount of RAM. The Xen Hypervisor, for instance, lets you change dynamically the amount of RAM without the need to reboot. It’s also the case with cgroups. Additionally, recent kernels have detailed per-cgroup page-in/page-out counters, making it easier to track the disk load incurred by each application.

Reducing Memory Usage

Now that we learned how to measure memory and the impact of our future optimizations, let’s talk about the ways we can reduce the memory footprint – and ultimately our hosting bill.

Optimize Code

This is not always easy or even possible, but it’s always worth trying. Here are some suggestions:

Do you process large amounts of data (say, more than 1 MB) in one big chunk, can it could be split into smaller chunks without losing efficiency?

Similarly, do you fetch large result sets from a database? If you are doing filtering and consolidation in your app, consider refining your queries to remove (or reduce) the extra processing (or at least, the amount of data involved). This (generally!) does not apply to SQL cursors, where you will iterate over a large number of rows, but one at a time (as long as you don’t retain all the rows in memory in your application).

Do you have multiple long-running tasks executed in parallel? Could they be executed sequentially instead? For instance, if you are transcoding videos from a format to another, make sure that you do not run more than one transcoding process per CPU core. Your processes would not finish quicker, but they would eat up more memory.

And of course, there is memory profiling. But then again, this is not so straight -forward. New Relic does an awesome job to monitor performance, but there is no easy-to-use memory usage profiling tool. There are different tools for different programming languages such as Guppy for Python or Memprof for Ruby.

Optimize Code To Be Faster

You may be asking the question, “What’s the connection with memory usage?” Well, if your service is based on workers, the connection between memory usage and memory optimization will become obvious as you read on! Your memory usage is proportional to the number of workers. In an HTTP server, the number of workers must be proportional to your traffic and to the amount of time that it takes to generate a page. For example, if you have 25 requests per second, and each page takes 200 ms to generate, then you would need at least 5 workers. Actually, you would need more than that, because if you only have just the right number of workers, requests will be queued and latency will increase.

(If you are interested into this topic, take a look at “Erlang’s Law” which was used initially to plan capacity for telephone networks. But the law is applicable for sizing workers to handle web requests!) Therefore, if you can optimize your code to run faster, you can afford to put fewer workers – and reduce the memory usage.

Virtually all PHP servers are based on worker processes. Python apps running on uWSGI, mod_python, or mod_wsgi are based on worker processes as well. The same goes for Passenger (if you are doing Ruby/Rails).

Node.js is not based on worker processes. A single process will handle many clients. Some servers like Gunicorn or Apache support multiple models and those different models have drastically different memory footprints.

Be Careful With External Calls

This one is also relevant if your service is based on worker processes. When a worker process makes external calls (e.g. requests to APIs like Twitter, Facebook, etc., but at the same time make requests into your database, redis, or other), it will sleep while waiting for the response. In other words, while you are waiting for external requests, the worker process is (kind of) wasting resources. It is still using some RAM, but it can’t process another request simultaneously. This is generally not an issue for rapid requests from Redis or the local database, but other external calls can make a big difference.

Many of our users have experienced this at one time or another. Their app would be performing perfectly, with a low memory footprint. And then all of a sudden, the latency increases, they see timeouts, error pages, and they need to scale up their app quickly to bring their site back on. And deployment logs can testify that there has been no change to the app itself! The truth is that the Facebook API call that used to take 50 ms is now taking 800ms because there is something wrong with Facebook. Their app now needs 16x the capacity to handle the same traffic.

This scenario can be particularly painful if you use an auto-scaling system. The auto-scaling system identifies the latency, and will automatically add more resources to reduce the latency. The customer ends up paying more (much more) to host their application in an attempt to reduce the time to access a slow resource outside their application.

Scary, isn’t it?

There are multiple solutions to address this problem. If you don’t need immediate results from the external service, you can defer the calls to an asynchronous worker process. Those processes will execute the calls, and store the results in e.g. a local Redis, which you can query at a later time.

Now, if you do need those results immediately, you can switch to an asynchronous system like Node.js, or Python+gevent, etc. You do not have to rewrite your entire application. You can refactor just the parts of the application or services that talk to outside services through an API and leave the parts that do not interface with outside services alone.

If you cannot rewrite the code, consider at least isolating it. For example, if your app shows on every page a little widget which requires a call to Facebook/Twitter/whatever, see if you can move the code generating that widget to a separate service, and load the widget through a separate request. If the third party API slows down (or is taken down), the widget will take longer to load or it may not display at all. However, the rest of the pages will load properly and not affect the performance of your entire application.

Shard To Reduce Active Set Size

Sometimes, you will optimize your code as much as possible, and still be stuck because your active set is simply too large to fit in memory. In most cases, you can always add more memory. However, adding more memory to servers can be too costly, often more costly for custom servers rather than for commodity servers. Or it can be inconvenient or disruptive if you decide to switch to a different infrastructure provider. Under those circumstances, you may have to modify your existing deployment, operational, and support processes to fit in with the new provider. Or it can be impossible, because you’ve reached the maximum capacity for the amount of RAM you can put on a machine.

The next step is to shard your data, which means to spread your data across multiple servers. There are many different ways to shard. NoSQL databases generally have support for sharding already. And with some luck, you will only need to make minimal changes in your code. If your database does not support sharding, implementing sharding requires a bit of work. Or you can perform some very crude, but efficient “sharding”. Using an online retailer as an example, their app may index product reviews and prices and their dataset could be large. They can use separate databases for different lines of products, or separate databases for different countries.

Realistically, if you do reach that point, switching to faster storage options such as striped SSDs or Fusion IO may be the best path than sharding the data. Those alternatives would still be slower than RAM, but considerably faster than your average disk array and faster than your average AWS EC2 storage, even with EBS PIOPS.

Last Words…

When investigating memory issues and optimizing the memory footprint for your apps, make sure that you have metrics. People with a strong sysadmin background will tend to look only at system metrics such as memory, CPU, I/O, HTTP latency, and others. However, people with strong software development backgrounds will look at application metrics such as number of “whatever-your-app-does” – impressions, carts, shares, only. Our advice would be to examine both application and system level metrics.

Infrastructure as a Service (IaaS) and Platform-as-a-Service (PaaS) providers should be able to provide system metrics as part of their offering. Our advice would be to ask your provider to make sure that they provide metrics that you are comfortable with. Do you understand what metrics are collected and how often it is collected?

Even if you run your own servers in-house, it is helpful to set up your own metrics system such as munin or collectd.

In a future blog post, we will explain dotCloud’s metrics system, and how it copes with millions of data points generated every minute on the platform.

About Jérôme PetazzoniJérôme is a senior engineer at dotCloud, where he rotates between Ops, Support and Evangelist duties and has earned the nickname of “master Yoda”. In a previous life he built and operated large scale Xen hosting back when EC2 was just the name of a plane, supervized the deployment of fiber interconnects through the French subway, built a specialized GIS to visualize fiber infrastructure, specialized in commando deployments of large-scale computer systems in bandwidth-constrained environments such as conference centers, and various other feats of technical wizardry. He cares for the servers powering dotCloud, helps our users feel at home on the platform, and documents the many ways to use dotCloud in articles, tutorials and sample applications. He’s also an avid dotCloud power user who has deployed just about anything on dotCloud – look for one of his many custom services on our Github repository.Connect with Jérôme on Twitter! @jpetazzo