Sunday, March 15, 2009

As well as powering a few cool tools, Linkscape is a data platform. Performance (and its measurement) isn't just important for reducing user latency or cutting costs. It's something we hope is part of our core competency, something that adds significant value to our startup. And the shortest path to performance is measurement.

This post is inspired by (and at times borrowed from) an email I sent to some friends for a consulting gig I did recently. But it rings so true, and I come back to it so often, that I thought I would share it. Alex and Nick, I hope you don't mind me sharing some of the work we've done on your very neat, very fun Facebook game.

Let me motivate the need for performance monitoring with a couple of case studies taken from our infrastructure:

This dashboard (above) illustrates 28 hours of load on our API cluster. I can immediately see service issues on the first server (the red segment of the first graph). These are correlated with a spike in CPU and some strange request patterns on the second server (the layered, multi-colored bar on the graph below). The degraded service lasted for a few hours; that duration turned out to be a configuration issue in our monitoring framework, which should have guaranteed downtimes of no more than 4 minutes, and which I have since fixed.

Even after fixing the monitoring configuration, I still needed to investigate the underlying problem: I could see the CPU spike and the request pattern were related. Ultimately I solved the issue within two weeks. Without this kind of measurement I would not even have known we had a problem, let alone have had the data to solve it.

The second case comes from our back-end, batch-mode processing. We had thought we'd tuned the system about as well as we could: at times we were pulling data through at a very respectable pace, roughly 10MB/sec per node. But we had also observed occasional unresponsiveness on nodes, with a corresponding slowness in processing. We left the system alone for a while, thinking, "if it ain't broke, don't fix it." But recently we've been tuning performance for cost reasons, so we came back to this system.

Once we instrumented our machines with performance monitoring (illustrated above), we saw that the anecdotes were actually part of a worrying trend, marked by the red circles: our periods of 10MB/sec throughput were punctuated by periods of extremely high load. The graphs above show load averages of 10 or more on 4-core nodes, along with one process spiking up to hundreds of megabytes and nearly exhausting system memory. This high system load dramatically reduced our processing throughput.
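A useful rule of thumb here: a sustained load average above the number of cores means runnable work is queuing up. The check is trivial to express; here's a minimal sketch in Python (the `is_overloaded` helper and its threshold are illustrative, not our actual monitoring code):

```python
import os

def is_overloaded(load1, ncpus, factor=1.0):
    """Flag a node whose 1-minute load average exceeds its core count."""
    return load1 > ncpus * factor

# Check the local machine (Unix only).
load1, load5, load15 = os.getloadavg()
print(is_overloaded(load1, os.cpu_count() or 1))

# The pathological case from the graphs: load of 10+ on a 4-core node.
print(is_overloaded(10.0, 4))  # → True
```

Tools like collectd track exactly this metric over time, which is what turns a one-off observation into the trend visible in the graphs.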

It turned out that the load was caused by a single rogue program which consumed all available system memory due to buffered I/O. Usually we have a few I/O pipelines and give each many megabytes for buffering. However, this program had many dozens of pipelines, altogether consuming nearly a gigabyte of memory. This led to significant paging and, finally, thrashing on disk.
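The arithmetic behind the failure is simple but easy to overlook when buffer sizes are set per pipeline rather than per process. A back-of-the-envelope sketch (the pipeline counts below are illustrative, chosen to match the rough totals in the text):

```python
def buffer_footprint_mb(pipelines, mb_per_pipeline):
    """Total memory dedicated to I/O buffers, in megabytes."""
    return pipelines * mb_per_pipeline

# A few pipelines with generous buffers: harmless.
print(buffer_footprint_mb(4, 40))   # → 160

# Many dozens of pipelines with the same per-pipeline buffers:
# nearly a gigabyte, enough to push a node into paging and thrashing.
print(buffer_footprint_mb(24, 40))  # → 960

# The fix: shrink per-pipeline buffers to 1-2MB.
print(buffer_footprint_mb(24, 2))   # → 48
```

The lesson we took from it: size buffers against the total the process will hold, not against what one pipeline would like.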

Once we reduced the size of the buffers (from roughly 40-100MB per pipeline to just 1-2MB per pipeline) we saw dramatic improvements in performance: a nearly 60% boost! And the nodes became far more responsive, with no more load averages of 10+. The graphs above show load average maxing out at 4 and plenty of memory available. The data suggest that we might even be able to nearly double our performance on the same hardware by increasing parallelism and running another pipeline on each node.

All of this work is powered by simple monitoring and measurement techniques. Sometimes this has led to significant, but necessary, engineering work. But sometimes it's led to a single afternoon's effort yielding a 60% performance boost, with an opportunity to nearly double performance on top of that.

We're using a few tools:

collectd measures system-health dimensions (CPU, memory usage, disk usage, etc.) and sends those measurements to a central server for logging.
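To give a flavor of how little configuration this takes, a minimal collectd.conf along these lines might look roughly like the following (the hostname and port are placeholders, and plugin availability varies by build; check the collectd documentation for your version):

```
# Load the basic system-health plugins.
LoadPlugin cpu
LoadPlugin memory
LoadPlugin disk
LoadPlugin load

# Ship measurements to a central server for logging.
LoadPlugin network
<Plugin network>
  Server "monitor.example.com" "25826"
</Plugin>
```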

Monit watches processes and system resources, bringing things back up if they crash and sending emails if things go wrong.
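For comparison, a Monit stanza for a service like our API server might look roughly like this sketch (the process name, pidfile, init script paths, thresholds, and address are all placeholders):

```
check process api-server with pidfile /var/run/api-server.pid
  start program = "/etc/init.d/api-server start"
  stop program  = "/etc/init.d/api-server stop"
  # Monit restarts the process if it dies; also watch for resource trouble.
  if cpu > 80% for 5 cycles then alert
  if totalmem > 500 MB for 5 cycles then restart
  alert ops@example.com
```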

These tools work together in an open, plug-in-powered way. I could swap out individual components and move to other tools, such as Nagios (which I've used for other projects) or Cacti (which I have not used).

Whether you're an on-the-ground operations engineer watching system health to fix issues before they turn into downtime, or you're managing large-scale engineering and looking to cut costs and squeeze out more page or API hits, these tools and techniques point you in the right direction and give you hard data to justify your efforts after the fact. We've had many high-ROI efforts initiated and justified by this kind of measurement.

Disclaimer

To quote a fellow Wisconsinite: All opinions expressed on this, our personal blog, are well-reasoned and insightful. Needless to say, they are not those of our employers. (...Whose opinions may also be well-reasoned and insightful; but, well, you know how it is.)