Reddit is melting our server: here’s what we did (Nginx, Apache, Django and MySQL)

I occasionally help out a popular clothing retailer, kuhl.com, when they need after-hours support. Last night at dinner I got an urgent text from the lead developer, who is based in Europe, saying the website was melting after being mentioned in a highly popular Reddit post. It seems one of the Boston bombing suspects was spotted wearing a hoodie from Kuhl.

There were three problems: first, there weren’t enough database connections available; second, of the m1.xlarge’s 15GB of RAM, only 4GB was being used; and third, disk I/O and CPU usage were close to nil. We addressed all of these, and each fix helped, but ultimately we could only handle the load by serving a static copy of the popular page. Here’s the process I took to get there and how each step impacted performance.

Here’s an overview of the application architecture: Nginx sits in front as a reverse proxy and passes requests to Apache, which runs the Django app under mod_wsgi, with MySQL as the database, all on a single server.

First, I needed to increase the maximum number of MySQL connections. We were at 50, and there were 150 Apache processes running. You can do this without restarting the server using this command:

SET GLOBAL max_connections = 300;

Through some trial and error I bumped this up to 600, only to realize that the server refuses connections once it reaches 500, so there was no point going above 500. Remember that changing the setting this way won’t persist after restarting MySQL; you need to add it to your my.cnf file for it to stick across restarts.
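To persist the limit, a stanza like this in my.cnf does the trick (the path varies by distro; the 500 here matches the value we eventually settled on):

```ini
# /etc/mysql/my.cnf -- illustrative path; makes the limit survive restarts
[mysqld]
max_connections = 500
```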

I then realized that while Nginx was doing fine, Apache and mod_wsgi were not keeping up with the site’s traffic. I upped Apache’s MaxClients (the prefork MPM directive that caps the number of child processes) from 150 to 200 and restarted Apache. I watched how many children were running with this command:

ps ax | grep apache | wc -l

This pipeline does three things: lists all processes, filters out those that don’t have “apache” somewhere in their name, and counts the result. Because the grep command itself contains “apache”, it can show up in the process list and inflate the count by one. So, for example, you may see 151 even though you only have 150 Apache processes.
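A common trick to keep grep out of its own results is to wrap the first character of the pattern in brackets. The grep process’s command line then contains “[a]pache”, which the pattern (matching the literal string “apache”) does not match:

```shell
# Count Apache processes without counting the grep itself:
# grep's own command line shows "[a]pache", which the regex doesn't match.
ps ax | grep '[a]pache' | wc -l
```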

I watched the server as the number of processes quickly maxed out at 200. I raised the limit to 250 and they quickly maxed out again. When I went to 300, I got an error because ServerLimit was set to 256: ServerLimit caps MaxClients, so you need a ServerLimit 300 line before MaxClients 300 in the config file.
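For reference, the relevant prefork MPM section looks roughly like this; the directive names are Apache’s prefork MPM ones, and the other values here are illustrative, not what we actually ran:

```apache
<IfModule mpm_prefork_module>
    StartServers          10
    MinSpareServers       10
    MaxSpareServers       20
    ServerLimit          300   # caps MaxClients; must be raised first
    MaxClients           300
    MaxRequestsPerChild 1000
</IfModule>
```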

Once again the server maxed out, so I bumped it up to 500 and at the same time increased the MySQL connection limit to 500. This turned out to be a problem: once the maximum number of Apache children was reached, I could no longer connect to MySQL as root, so I bumped the Apache limit down to 490 to leave some headroom.

Throughout this process I was watching RAM utilization. With 490 connections, usage was nearing 12GB, which was a good sign. I like to use htop to monitor RAM, CPU and server load.

Many people were still not able to visit the site, though. At this point the predominant symptom was a loading indicator that never resolved into a page, which told me Nginx was accepting connections that Apache couldn’t fulfill. I reviewed the Nginx config and noticed it was set to allow 4 workers with 200 connections each, or 800 connections total, and to wait 90 seconds before timing out. I quickly dropped proxy_connect_timeout, proxy_send_timeout and proxy_read_timeout to 20 seconds: I wanted the connections that would never get served to die more quickly. I still think 20 is too high, but I decided it wasn’t going to make a huge difference in the long run.
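In the proxy config, that change looks something like this (the backend address is illustrative):

```nginx
location / {
    proxy_pass            http://127.0.0.1:8080;  # illustrative Apache backend
    # Fail fast instead of holding doomed connections open for 90 seconds:
    proxy_connect_timeout 20s;
    proxy_send_timeout    20s;
    proxy_read_timeout    20s;
}
```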

I also changed the maximum connections. In Nginx, worker_connections is set per worker, so you take the total number of connections you want and divide it by the number of worker processes. In this case there were 4 workers and we wanted a little under 500 connections, so I set worker_connections to 122.
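In nginx.conf that’s just:

```nginx
worker_processes  4;

events {
    # 4 workers x 122 connections ~= 488 total, just under the 490
    # connection ceiling on the Apache/MySQL side.
    worker_connections  122;
}
```

One caveat: worker_connections counts every connection a worker holds, including connections to the proxied backend, so the effective client capacity can be lower than workers × worker_connections.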

At this point, things were going well. I could tell because CPU load was up, memory utilization was nearly maxed out and disk I/O was clipping along at 120MB/s for reads. That means we were legitimately maxing out the server. And yet we still were not meeting the demand! In case you’re surprised that I’m glad all server resources were maxed out, here’s my thinking: if there is a spike, I want to use as much of the server as I possibly can to meet the site’s needs. If the server is not maxed out, it’s misconfigured. The trick, however, is to limit your load so that you don’t melt the server by letting it consume more RAM than it has, saturate the disk, and so on.

It is my opinion that the Django app was not coded to handle this kind of load, but then again, preparing for spikes like this is one of the hardest challenges a developer faces. One of my biggest complaints about Django is that the ORM defaults to issuing many small queries rather than joins. You can fix this with the select_related() queryset method, which follows foreign keys in a single joined query. There’s also an opportunity to make better use of caching, either in the Django app, in Nginx, or both. Nikos, the new head developer for the project, is actually working on rewriting these and other key areas of the app. It also happens that Tom, the project’s full-time sysadmin, is in the process of splitting the app from a single-server architecture into a more scalable multi-server architecture.

Unfortunately the timing was bad: none of those changes were ready, and you obviously don’t make changes like that during the panic of a server meltdown.

So what ultimately got us past the bump? I shut down all the services for 10 seconds to let the pending connection attempts die, brought everything back up, and quickly grabbed a static snapshot of the page being linked from Reddit. I saved it to disk on the server and had Nginx serve it as a static file, without touching the application server or the database.
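The mechanics are simple: save the rendered page to disk (with curl, for example) and point an exact-match Nginx location at the file. The URL and paths here are made up for illustration, not the actual ones we used:

```nginx
# First, snapshot the hot page, e.g.:
#   curl -s http://localhost/mens-hoodie/ -o /var/www/cache/hoodie.html
# Then serve it straight from disk, bypassing Apache, Django and MySQL:
location = /mens-hoodie/ {
    root      /var/www/cache;
    try_files /hoodie.html =404;
}
```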

The effect was dramatic: after the restart, pages were served in well under a second and everyone was getting their requests met, not just on the now-static hoodie page but throughout the site. There were at most about 50 Apache children running, RAM usage was under 1GB and server load was near zero.

My recommendation is to make better use of caching. A technique I’ve used successfully with Nginx was adapted from this post about tweaking WordPress to handle extreme load. Essentially it tells Nginx to cache pages for a couple of seconds. That means when you change your site, visitors see the change almost right away, but if you’re ever in the situation Kuhl was in, with an extreme rush of traffic, your site is essentially served from a static cache. This can be tricky on e-commerce sites, though, so it may not be a plug-’n-play solution.
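A minimal micro-caching sketch, assuming Nginx is proxying to Apache on port 8080 (the zone name, paths and port are illustrative):

```nginx
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=microcache:10m max_size=100m;

server {
    listen 80;
    location / {
        proxy_cache           microcache;
        proxy_cache_valid     200 2s;   # cache good responses for 2 seconds
        proxy_cache_use_stale updating error timeout;
        proxy_pass            http://127.0.0.1:8080;
    }
}
```

With a 2-second TTL, normal visitors see updates almost immediately, but during a spike nearly every request is answered from the cache. On an e-commerce site you’d want to bypass the cache for carts, checkout and logged-in sessions, for example with proxy_cache_bypass keyed on the session cookie.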

In my years working as the Ubuntu.com webmaster, I learned that one of the best things you can do to handle extreme load is to serve a site that is as static as possible. Since most sites these days are generated dynamically by a CMS, that may mean putting a proxy server like Squid or Nginx in front of the site. Anything you can do to reduce the amount of code required to serve a page will help you when you get a spike.

All of these changes were made in the heat of the moment. From the time I started working on the problem to the time it was resolved took about 90 minutes, an eternity in cases like this. In hindsight, I would have gone to the static cache first. However, I feel each of the steps above has merit, and I wanted to share them with you. If you can suggest other ways to improve performance, I’d love to hear them. And if you like this article, hit the +1 or like button below.