Linux server performance investigation toolkit

Following on from a talk I gave at PHPNW last year, here’s a summarised version of the tools and approaches we’ve used to track down performance issues. I’m not prescribing a specific order in which these tools are to be used and this is not a silver bullet. Instead, I’d like to think of this as a guide that will inspire some non-obvious approaches to figuring out mysterious performance issues.

Resource monitoring

Being able to see graphed data such as load average, io wait, mysql throughput, apache processes (to name a few) on a heavily used server is indispensable when it really matters. Although there are plenty out there, we use munin to draw pretty graphs and it gives us a very high level overview of our systems. As an example, we recently identified a tripling in volume of MySQL queries but the Apache traffic had remained comparatively consistent. Our first instinct was to increase the MySQL query cache but that can lead to more issues. Of course, hindsight is always perfect but we can now understand that the root cause was lack of appropriate application-level caching.

iostat

iostat (part of the systat package on RedHat descendants) allows you to see disk load per partition. Disk-triggered latency is a pernicious problem and needs to be identified and stamped out as soon as possible. Let’s imagine you’ve partitioned your server fairly loosely into:

Websites

DB Tables

Temporary

System

By running iostat -x 1 10 you get a broad idea for what’s causing the disk load on the server. Lots of reads per second on websites? Maybe you need to install APC or instigate some caching of file resources. Lots of reads on the system partition? Perhaps you’re spending a lot of time spinning up too many Apache processes because they’re being culled too aggressively.

atop

atop is like a supercharged version of top. It has at-a-glance, colour-coded indicators to tell you if you’re under heavy, CPU, IO or memory load. We primarily use it to identify which process is demanding most disk resources. Being able to identify the source of disk activity down to user and process id can help trap culprits in the act.

mysqladmin

I sometimes run mysqladmin -ri 10 extended-status | grep tmp and getting an idea of the weighting between temp tables and temp tables on disk. If you aren’t already aware, it’s very useful to know when a temporary table is likely to go to disk. As I’m sure you’ve gathered by now, writing stuff to disk is a bad idea and we should avoid it at all costs. Micro-optimisations on queries (to avoid “Using Tempoary” and in turn, going to disk) can yield serious benefits on a busy application where disk usage is frowned upon.

block_dump

On RedHat systems, echoing 1 to /proc/sys/vm/block_dump and the dmesg buffer will start filling as every single read or write operation will be documented. Needless to say, on busy systems that’s a lot of information. To make matters worse, everything is sort of documented twice – once as it goes into the page buffer and again as it gets flushed to disk. It also documents all of this at the block level. It’s very noisy, but it does contain some gems. For example, if block_dump leads you to believe that static web resources are being written to that’s probably the “atime” property of the file being updated. Mount the partition with noatime and hey presto – you’ve just reduced IO.

Percona Server

The guys at Percona can never be praised enough. They work tirelessly to make “the world’s most popular open source database” even better. If it’s not currently on your production servers, then you should consider it for your next deployment. The slow query log has had an overhaul and it can tell you all sorts of interesting information like if the query missed the query cache (because not all queries can be cached) or if MySQL pushed the table to disk before sorting. With some new slow query analysis tools, Percona Server can be used for local static analysis – a query which goes to disk will always go to disk, regardless of load.

Global query log

As with block_dump, this is a very noisy profiling solution but can help you understand an application’s behaviour in the wild. While MySQL is running, you can turn on the global query log simply by issuing “set global log=1” in the MySQL console. Every single statement is then going to be pushed to a log which you can then go through with a fine tooth comb, EXPLAINing them or pushing them through Percona’s extended slow query log analysis. Don’t forget to turn it off!

I’m constantly discovering tools and new ways of working, so I would hope to update and revisit this topic in the coming months. If you’d like to know more about our consultancy services and what we can offer, feel free to get in touch.