Dude, Where's My Performance?

DZone's Guide to

Dude, Where's My Performance?

In the last few months I’ve been leading the charge to re-architect and implement this system into a series of isolated web applications. When I did, I noticed there were a lot of spikes that didn’t coincide with latency within the applications themselves.

When I first joined Liquid my task was clear: replace the rusting, crumbling, monolithic web application with modular, scalable services. In the last few months I’ve been leading the charge to re-architect and implement this system into a series of isolated web applications, which, a few weeks ago, went live. With the dust settling over this migration, the process decided, the tests written, the bugs few, I finally cleared my head and began to look at the infrastructure and application metrics. When I did, I noticed something quite off about the traffic patterns in those applications that see little traffic: there were a lot of spikes that didn’t coincide with latency within the applications themselves.

Let’s take a step back for a moment and talk about the architecture and implementation of these services. The smaller services (those with few events) are all set up similarly, with a couple of redundant nodes behind a load balancer, each connecting to a database with a single replica.

In each service all nodes run on Amazon Linux, using Apache for the web server, which in turn uses WSGI to orchestrate the creation of the Python interpreter per process (or per thread), which is responsible for loading the Django application for the service. We have NewRelic monitoring the application, SumoLogic collecting and aggregating log-level data from the application and Apache, and CloudWatch monitoring across the nodes and the load balancer. What was so strange about this situation was that the NewRelic logs showed a very stable, very predictable response time from the service of between 40ms and 80ms, while I would see regular spikes (every other minute) in the load balancer graphs in CloudWatch between 600ms and 700ms.

These spikes weren’t at all in line with the response time from the application itself, so where was the slowdown? There were a number of possibilities:

The nodes were set up on bad hardware in AWS.

I doubted this, as I saw this issue across multiple low-bandwidth services

The nodes were set up on hardware that was being utilized by neighbors (so-called noisy-neighbor syndrome).

Once again, this seemed unlikely.

Apache was seeing severe delays in communication with the Django application.

I was skeptical of this, given that I was using default WSGI settings and Apache’s default MPM prefork module.

Networking between the nodes and the load balancer was messed up.

These hosts sit in multiple availability zones, so if it were an issue with networking I’d have expected to see it effect one availability zone more than another.

The load balancer was faulty.

The load balancer was limited in resources due to low amounts of traffic.

This last idea was easily the most likely: I’ve seen this behavior before in services that are scaling up, where the load balancer will have a delay in available resources, causing latencies to rise where the service latency remains stable. I tested this by throwing junk traffic at the service and saw a drastic decrease in average latency that was disproportionate to the decrease in service time. The service time decreased due to caching on the DB end by about 30ms on average, whereas the average load balancer latency decreased by about 300ms on average. Unfortunately, what I did not see was any increased latency when I began sending junk traffic to the load balancer, which was my expectation if the load balancer required additional resources.

At this point Amazon support connected me to their Elastic Load Balancer (ELB) team. We tested the networking speed between the nodes and the load balancer and found them to be very reasonable (<5ms) and the ELB team noted that the load balancer for the low bandwidth services was on different hardware than that of the high bandwidth service. They offered to switch it over to the alternate hardware, but asked that I first investigate the time taken by Apache vs. the time taken by our application, which I had been remiss in doing given its low likelihood. Astonishingly, the Apache time agreed exactly with the high load balancer latency.

I was shocked: what the heck was going on that Apache would wait hundreds of milliseconds extra when communicating with the Django application? I began a series of experiments to identify the root cause of this delay. I played with turning off lazy initialization of the Python interpreter, to no avail. I tried cranking up and down the Apache clients with MPM prefork and MPM worker modules, which are both officially supported according to the WSGI documentation; this somewhat alleviated the problem. I began changing the WSGI daemon process and found an obvious correlation: the less MPM was responsible for, i.e., thread and process allocation, the less I saw this issue. By minimizing the number of MPM clients and pre-allocating processes and threads with WSGI daemon I saw a drastic improvement.

Working backwards from here I figured out what was really happening. Every time Apache would spin up a new process or thread it would initialize a Python interpreter (regardless of the lazy initialization specified by WSGI or the WSGI interpreter scope) and load the Django application, and by pre-allocating this in as many cases as possible it minimized the amount of new threads that would need to be spun up by Apache. The result in the low-bandwidth services was immediately evident.

Looking at one minute intervals shortly before and after the change.

Now, aggregating to 15 minute intervals after the change, the pattern becomes even easier to see.

At hour intervals after the change, we can see the change is very stable.

The CPU utilization became more regular (since Apache isn’t spinning down and up threads or processes all the time) and the average latency decreased significantly. Satisfied I now had an understanding of the problem I turned my eye towards the high-bandwidth service. Could the much higher amount of traffic be masking the same problem for this service? The architecture of this service is a bit different: instead of a relational database backend it has a NoSQL Redis database backing.

The amount of traffic this service sees is also two orders of magnitude larger than its low-bandwidth counterparts. I began tuning it in much the same way as the smaller services, doing some math to determine the resources I should allocate to the WSGI daemon process up front. Once I had this working on one node (judging from the Apache response times) I then applied the change across all of the nodes and saw a drastic decrease in CPU utilization. While the average latency certainly looked lower on average it’s difficult to draw immediate conclusions, given how fast the average service response was to begin with.

Looking at the one minute interval shortly before and after the change it’s not entirely clear if we’ve made a latency improvement.

At the hour interval the pattern is clear: we’ve decreased average latency significantly.

The conclusion that I’ve drawn from this is that WSGI and Apache MPM were not made to work together. Allowing Apache to manage processes and threads at all is extremely problematic for WSGI. Further, not only does using a global interpreter seem like a terrible idea for the performance of the application, but it also does not work (at least in this context) to alleviate the symptoms that I was seeing. I believe that WSGI daemon and Apache MPM (all modules) need to be tightly coupled, where the WSGI daemon process actually controls MPM, or MPM respects the limits set by WSGI as a minimum and uses these threads and processes first for incoming requests to the associated vhost. This kind of tuning isn’t covered anywhere in the WSGI documentation, although they warn that “If the Apache MPM settings are not set appropriately for the type of application being hosted, then worse performance can be seen rather than better performance.” Seriously, that’s as much as it says. In the meantime I will continue to allocate processes and threads to the extent the system can utilize under load using the WSGI daemon for all of my production Django applications.