Pushing Performance with Tuning

As regular readers of the blog will remember, a while ago Aaron wrote about troubleshooting theory. That post inspired this follow-up on the theory behind performance tuning. We all hit bottlenecks in our systems, and the more we grow, the more of them we find.

Before we start, I want to clarify the difference between troubleshooting and tuning. The most obvious point is that when you are troubleshooting something, it's already broken. Tuning is when something works, just not well enough for your use case. Efficient troubleshooting is about using the quickest process of elimination to repair something that is obviously broken, whereas performance tuning is about following the entire execution path of a process to find the slowest step, then making precise changes that expose the next slowest step. As many of us learned in school, a chemical reaction is only as fast as its slowest step, and that is what performance tuning is all about.

Step 0: Actually have a problem

It seems like common sense that you don't troubleshoot a problem you don't have, but in performance work it's very common for people to arbitrarily say "I bet I could make this faster". There are many reasons this is usually a bad idea. In fact, some people go so far as to say that premature optimization is the root of all (programming) evil. I think that might be going a bit far, but the fact remains: when you optimize before you really know that you need to, you are putting work into fixing problems you only think you might have. As Adam Jacob famously said, "You don't have a scaling problem, unless you do, but even when you think you have one you probably don't."

You might be scaling for a billion queries a second, but I'm really sure you aren't. Even if you were, you wouldn't be solving for "I need to have a billion queries"; you would be solving for "At peak time load with standard customer usage this SELECT JOIN statement uses 60% of the execution time of the database", or maybe just "HAProxy nodes run out of resources and start swapping at 8 billion concurrent sessions".

Step 1: Identify a real issue that affects your users (or you)

Just like I said above: don't fix what doesn't exist. No matter how big the urge, I'm confident your company has real problems that you would get much larger gains from fixing. Are page loads slow? Do you have things that time out? Does a certain component of your environment take an unhealthy amount of time to come up? Do reports take an excessively long time to run?

You can't solve every problem to perfection, but performance tuning is about making things better. You also need to keep total execution time (the time per run multiplied by how often it runs) in mind more than the individual execution time of something. It's usually better to find something that happens hundreds to millions of times a day that you can shave seconds off than to try to save five minutes on that quarterly report. Of course, anything customer facing that runs slowly is a high priority. There are many people who say every page on your site should load in a second or less.
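To make the "total time, not individual time" point concrete, here is a minimal sketch that ranks work by run time multiplied by frequency. The task names and numbers are invented for illustration; in practice you would pull them from your own metrics.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    runs_per_day: float      # how often it executes
    seconds_per_run: float   # how long one execution takes

    @property
    def seconds_per_day(self) -> float:
        # Total daily cost is frequency multiplied by per-run time.
        return self.runs_per_day * self.seconds_per_run


def rank_by_total_time(tasks):
    """Sort tasks so the biggest total daily cost comes first."""
    return sorted(tasks, key=lambda t: t.seconds_per_day, reverse=True)


# Hypothetical numbers, purely for illustration.
tasks = [
    Task("quarterly report", runs_per_day=1 / 90, seconds_per_run=1800),
    Task("product page query", runs_per_day=500_000, seconds_per_run=0.4),
    Task("login check", runs_per_day=50_000, seconds_per_run=0.05),
]

for task in rank_by_total_time(tasks):
    print(f"{task.name:20s} {task.seconds_per_day:12.1f} s/day")
```

With these made-up figures, the half-hour quarterly report costs about 20 seconds a day on average, while the 0.4 second page query costs around 200,000 seconds a day, so the "fast" query is the better target.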

If this is your first time tuning for performance, don't be too afraid of grabbing low hanging fruit, but remember eventually you are going to have to tackle that one difficult query.

Step 2: Find the contributing factors

This could also be called "find the actual problem". The important thing to remember is that the issue you found is the symptom, not the illness. Don't guess at this stuff either. This is actually the most important part of the whole procedure! Unless you have some super low hanging fruit, it's very unlikely that dev or staging are going to help you here unless you have a way of reliably replicating traffic to reproduce the issue.

This is where having metrics and monitoring in place can really help you. Wait, did I forget to mention this? OK, well, this is the new step zero then. Metrics systems like Datadog, VividCortex, NewRelic, Anemometer, Raingauge for MySQL, etc. can all help you get the performance data you need, from the top to the bottom of your application stack, but only if you have them logging metrics when the issue is happening. You can just jump into a system and start tracing from top to bottom, but it's a lot harder to spot all the contributing factors live.

There are a couple of "check this first" items that are always a good bet:

Check your disk I/O; unless you are using SSDs everywhere, the spinning disk is the slowest part of any system.
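As a rough sketch of what "check your disk I/O" can look like on a Linux host, the snippet below estimates how busy each device was between two samples of /proc/diskstats. It assumes the standard layout where the 13th whitespace-separated column is the cumulative milliseconds spent doing I/O; the sample lines are made up so the example runs anywhere, and tools like iostat will give you the same figure with far less effort.

```python
import time


def parse_diskstats(text):
    """Map device name -> cumulative milliseconds spent doing I/O."""
    busy_ms = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or unexpected lines
        # fields[2] is the device name; fields[12] is assumed to be the
        # cumulative "time spent doing I/Os" counter in milliseconds.
        busy_ms[fields[2]] = int(fields[12])
    return busy_ms


def utilization(before, after, interval_seconds):
    """Percent of the interval each device was busy between two samples."""
    interval_ms = interval_seconds * 1000.0
    return {
        dev: 100.0 * (after[dev] - before[dev]) / interval_ms
        for dev in after
        if dev in before
    }


def sample_live(interval_seconds=1.0, path="/proc/diskstats"):
    """Take two real samples on a Linux host (not called in this sketch)."""
    with open(path) as f:
        before = parse_diskstats(f.read())
    time.sleep(interval_seconds)
    with open(path) as f:
        after = parse_diskstats(f.read())
    return utilization(before, after, interval_seconds)


# Invented samples, one second apart, for a single device.
BEFORE = "   8       0 sda 1000 10 80000 5000 2000 20 160000 9000 0 4000 14000"
AFTER = "   8       0 sda 1100 10 88000 5600 2300 20 184000 9900 0 4900 15500"

busy = utilization(parse_diskstats(BEFORE), parse_diskstats(AFTER), 1.0)
for dev, pct in busy.items():
    print(f"{dev}: {pct:.0f}% busy")
```

A device that sits near 100% busy during your slow period is a strong hint about where to dig next.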

In general, my rule of thumb is to always try to follow it down another level. If you can trace a problem down to what hardware resource is being taxed, then you can work your way up from there.

You may end up reaching the point where you are running dtrace or strace on a process to try to find out exactly which system call is hung up. If you are exceptionally unlucky, you may even have found a bug in some vendor software you use, but make sure to keep following the chain down.
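Before reaching for dtrace or strace, a coarse first pass at "following it down a level" is simply timing each stage of a code path and seeing which one dominates. The sketch below is one way to do that; the stage names and the sleeps standing in for real work are placeholders, not anything from a real application.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(stage):
    """Accumulate wall-clock time spent inside the named stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[stage] = timings.get(stage, 0.0) + elapsed


def handle_request():
    # Stand-ins for real work; the sleeps are made-up placeholders.
    with timed("parse request"):
        time.sleep(0.001)
    with timed("database query"):
        time.sleep(0.05)
    with timed("render template"):
        time.sleep(0.003)


for _ in range(3):
    handle_request()

slowest = max(timings, key=timings.get)
for stage, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{stage:16s} {seconds * 1000:8.1f} ms")
print("slowest stage:", slowest)
```

Once one stage stands out, repeat the same exercise inside that stage, and keep going until you reach the resource that is actually being taxed.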

Step 3: Tune (or just rub some cache on it)

Ideally if you've followed my rambling advice above every problem can be boiled down to:

A hardware resource usage issue,

Where you can adjust how said application utilizes the hardware,

Which on reload frees up said initial resource usage.

This is so much more complicated in real life though, and I can hardly begin to advise exactly where to start, but I can give you some advice on how to approach the problem.

First and foremost, never change more than one thing at once. No, seriously. You might be tempted to reconfigure half of your NGINX or MySQL memory settings, but if you do that you will never know which setting fixed what, and when performance flips and tanks you won't know which one did the damage either.
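Here is a small sketch of the one-change-at-a-time discipline: each candidate setting is applied alone to a fresh copy of the baseline and measured against it. The setting names and the toy benchmark are hypothetical; a real benchmark would replay representative load against the service.

```python
def try_changes_one_at_a_time(baseline, candidates, benchmark):
    """Benchmark each candidate setting alone against the baseline.

    baseline:   dict of current settings
    candidates: dict of setting name -> proposed new value
    benchmark:  callable taking a settings dict, returning seconds
                (lower is better)
    """
    baseline_time = benchmark(baseline)
    results = {}
    for name, value in candidates.items():
        trial = dict(baseline)   # fresh copy so changes never stack up
        trial[name] = value      # exactly one difference from baseline
        results[name] = benchmark(trial) - baseline_time
    return baseline_time, results


# Toy benchmark with invented behaviour, only to make the sketch runnable.
def fake_benchmark(settings):
    seconds = 10.0
    if settings["buffer_mb"] >= 512:
        seconds -= 3.0   # pretend the bigger buffer helps
    if settings["workers"] > 8:
        seconds += 2.0   # pretend too many workers hurts
    return seconds


baseline = {"buffer_mb": 128, "workers": 4}
candidates = {"buffer_mb": 512, "workers": 16}

base, deltas = try_changes_one_at_a_time(baseline, candidates, fake_benchmark)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}s versus baseline of {base:.1f}s")
```

Had both settings been changed together, the toy benchmark would have shown a net gain of one second and hidden the fact that one of the two changes was making things worse.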

You may be inclined, or averse, to rely on caching to fix your issue, but you need to remember that caching at different layers is neither right nor wrong, good nor evil; it's just another tool in your tool belt. There are many places where you can't cut latency or execution time, or where you are dealing with data that doesn't change nearly as rapidly as it is requested, and caching fits well there. Just beware that when it is applied without knowledge of how often the data updates, it will cause far more problems than it fixes.
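To illustrate why the update rate of the data matters, here is a minimal time-based cache sketch. The class, the lookup function, and the "price" data are all hypothetical, and the clock is faked so the example runs instantly; it is not meant as a replacement for memcached or any real caching layer.

```python
import time


class TTLCache:
    """Tiny time-based cache: entries expire ttl_seconds after being stored."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}   # key -> (expires_at, value)

    def get(self, key, loader):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]            # still fresh: skip the slow call
        value = loader(key)            # missing or expired: fetch again
        self._store[key] = (now + self.ttl, value)
        return value


# Fake clock and fake backing data so the behaviour is easy to follow.
now = [0.0]
source = {"price": 10}
calls = []


def slow_lookup(key):
    calls.append(key)                  # count trips to the slow backend
    return source[key]


cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
first = cache.get("price", slow_lookup)    # miss: loads from the backend
source["price"] = 12                       # underlying data changes
stale = cache.get("price", slow_lookup)    # hit: returns the old value
now[0] = 61.0                              # move past the expiry time
fresh = cache.get("price", slow_lookup)    # expired: loads the new value
print(first, stale, fresh, len(calls))
```

The middle read is the trade-off in miniature: it saved a backend call, but it served a value that was already out of date. Whether that is acceptable depends entirely on how often the data changes relative to the expiry time you pick.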

Sometimes certain resources, like networking hardware or external resources (API calls), are much harder to fix unless you can increase bandwidth, decrease calls, or just rely on higher-level caching.

Step 4: Monitor over time

Fixing a single run of anything means little in the scope of your application's lifespan. You need to watch as large a sample as possible when making changes. After reloading or reconfiguring an application, the performance will flex simply because all the caches in the system are being reloaded. You will want to watch for a while before you decide to move on to the next issue.
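One way to avoid being fooled by the post-reload wobble is to drop a warm-up window before comparing numbers. The sketch below compares the median and 95th percentile of response times before and after a change; the sample values and the choice of a three-sample warm-up are invented for illustration, and real comparisons should use far more data.

```python
import math
import statistics


def summarize(samples, warmup=0):
    """Median and nearest-rank 95th percentile, skipping warm-up samples."""
    steady = sorted(samples[warmup:])
    if len(steady) < 2:
        raise ValueError("need at least two samples after warm-up")
    p95_index = math.ceil(0.95 * len(steady)) - 1
    return statistics.median(steady), steady[p95_index]


# Made-up response times in milliseconds.
before_ms = [200, 210, 195, 205, 220, 198, 202, 215, 207, 199]
# After the change: the first few requests hit cold caches.
after_ms = [900, 650, 400, 150, 160, 155, 148, 162, 158, 151]

print("before:           ", summarize(before_ms))
print("after (all):      ", summarize(after_ms))
print("after (warmed up):", summarize(after_ms, warmup=3))
```

With the cold-cache requests included, the tail of the "after" run looks much worse than the baseline; with them excluded, both the median and the tail show an improvement. Neither view is the whole story, which is why watching for a while matters.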

Step 5: Try again

So your change didn't fix anything, or maybe it even made performance worse. That happens. Before you decide to roll your change back, make sure to stop and look at why you made it in the first place. Was it some random advice you found on the internet? An educated guess? Maybe those should get rolled back.

If you are sure the change you made was based on actual resource usage, then you may have just moved the goalposts down to the next bottleneck. It might be time to take a look at what's changed in the whole process, starting the trace over. I've managed to fix an underperforming query only to have memcached suddenly start choking up.

Conclusion

Sounds like a lot of work? Well, yeah, it is. Performance tuning tends to be the 20% on the 80/20 scale of work. If you want to get the biggest bang for your buck, then make sure to aim for as much low hanging fruit as possible. Really target the little things that have big impact, like things that are accessed or touched a lot.

In the end remember above all else that almost no one will have your problems exactly, and it's rare that when you read about someone else's problem it will apply directly to your use case. It's always good to stand on the shoulders of giants and try to apply lessons you find elsewhere, but always take the time to trace the issues down and make sure that problem really is the one you are having.