Monitoring MySQL IO Latency with performance_schema

They rightly point out that trying to correlate IO statistics from tools such as iostat to a databases workload, especially when that database is on a shared resource, is fraught with complications – and further to that, not the full picture anyway, even on a dedicated system. Brendan points out that to get the full picture, you really need to be instrumenting your profiling at the read/write layer of the application – before getting to per filesystem / disk statistics.

Brendan has followed up on his posts with a third on how to track these filesystem calls from a running MySQL process with DTrace, which again I recommend. In this post Brendan says the following, linking to my blog on graphing statement execution in 5.6 (thanks Brendan):

If you are on MySQL 5.6 and later, it looks like you may also be able to
get this information from the wait/io events from the new performance schema
additions (I havenâ??t tried).

This is true.

However I’d like to show how filesystem latency can be monitored from the current MySQL 5.5 GA version, with performance_schema, on all platforms.

When event timing is enabled within performance_schema, every single wait event is a measure of latency, whether that be to grab a synchronization point, or perform a file IO operation, or even later in 5.6 to take a table lock, perform table IO or network IO etc. – up to executing an entire statement or stage (including all locks, file IO, synch points etc. below them).

From the file IO perspective, all of these are instrumented around the layer that Brendan points to – interactions with the filesystem, not the disks, from within MySQL. You can see exactly which file IO events are instrumented within the performance_schema.setup_instruments table:

Each instrument of the wait/io/file type tracks a certain form of IO that is performed by MySQL, such as wait/io/file/innodb/innodb_data_file tracking datafile IO for InnoDB, and wait/io/file/sql/binlog tracking IO to binary logs at the SQL layer.

We do this in much the same way that Brendan does with DTrace, though instead we do this by wrapping the original functions, such as os_file_read(), os_file_write() etc., with functions that do the timing around them.

These in turn then call the original, now renamed functions, such as os_file_read_func() and os_file_write_func() (meaning that if you want to use Brendan’s solution with 5.5+, you should probably switch it to use these functions directly). Here’s InnoDB’s pfs_os_file_read_func() as an inline example:

Now we have the background out of the way, we can take a look at how the data is exposed.

I’m going to start with the raw stream of data that is exposed within the wait event tables, in the events_waits_history_long table specifically, as this gives the 10,000 most recent events within performance_schema (by default), giving a pretty up to date snapshot of what is happening “now”.

As with some of my previous examples, I’m going to use a helper view to do this. The following gets the latest IO events, also showing the originating thread, operation type, size and latency of each request:

In the version that I am using here (5.5.12) there is a bug in some cases of InnoDB IO (such as writing to the InnoDB log files) that double accounts for IO, so you have to do some manual work to exclude some rows (the second events of the same size but lower times) to find out just how much time was spent in this version. That annoys me. Luckily, we have since tracked down the problem there, and it’s been fixed and should be released in 5.5.13.

That results in 35845.0394 + 8649.3640 + 716.1718 + 11.2195 + 834.7232 = 46056.5179 microseconds, or roughly 0.046 seconds, out of the 0.78 seconds reported to the client. Most of the time at the filesystem layer was in doing an fsync() on the log files (which forced the data to hit the platters), rather than write().

It’s also worth noting what is happening within the msandbox@localhost:92 thread at the end of the result. That is actually the file IO that was required to return the result of the query that I was executing. There will be a few events after that statement completed to clean them up etc.

The first 4 events against latest_file_io.frm are to get the structure of the view I was selecting against, and the rest are a result of the JOIN against INFORMATION_SCHEMA.PROCESSLIST, which in this case is forcing a temporary table to be created for the processlist result. In 5.6, there is no longer any need for this JOIN, as all the columns are available within the threads table, so the IO could be reduced to just the view definition then.

This section also helps to point out one of the deficiencies with using Brendan’s script (at least without extending it) to really see when filesystem interactions are the causes of overall latency – a large portion of the time spent there was in creating the temporary table files, not just writing to or reading from them. Likewise with the view and opening/closing.

My work with the PERFORMANCE_SCHEMA has been limited to experimenting such as
described by Tracking IO with performance_schema however a recent discussion
with Mark Leith highlighted that with the performance_schema it will be possible
to determine the exact size of disk based temporary tables created by the MySQL
kernel for example.

Sorry for the delay Ronald, but I hope this shows a good example of exactly what kind of IO we track against temporary tables as well. If not, here’s a better one:

This may be a bit of a hit and miss approach, as this table only stores 10,000 events by default, and when you’re playing with temporary tables sometimes the number of file IO events could easily exceed this. Temporary tables are also very transient, MySQL doesn’t hold the file handlers open nor keep much metadata around once they are removed, so we don’t aggregate their data within the file summary tables at this time.

We do however summarize the events for regular table files, and the other files held open by the instance such as binary logs and log files etc. as shown above.

This is my test instance, so there is no great amount of IO going on. It shows in this example that MyISAM data file IO is the top consumer, followed by MyISAM index IO, InnoDB data file IO, view parsing, and frm file parsing respectively.

These are the events summarized by thread, so that you can trace which threads – either foreground or background – are likely to be consuming the most IO. This is just from a latency perspective however, and can not easily be correlated back to the file summary tables in a meaningful way. This means you can not easily correlate the amount of IO done with the time taken per thread (unless you are polling the current / history tables, if they are enabled).

However, it can provide insight in to who is suffering the most from filesystem latency:

Finally, we end with all of the events summarized by the instance of the event. With performance_schema, if there are multiple instances of a certain event type, such as two InnoDB log files, or multiple InnoDB data files, or multiple instances of a specific mutex in memory etc. – each of those are also aggregated individually within this table.

In the above example, there is a single InnoDB datafile (ibdata1 – 4680873728), and two InnoDB log files (ib_logfile0 and ib_logfile1).

We have a bit of a disconnect here however within performance_schema – it is not possible at the moment to join the events_waits_summary_by_instance table to the file_summary_by_instance table. file_summary_by_instance uses the file_name column as the key, whilst events_waits_summary_by_instance only exposes the object_instance_begin column. We hope to rectify this within the 5.6 release, by adding the object_instance_begin column to the file_summary_by_instance table (and possibly aggregating more right inline with the other file summary tables).

This would then allow us to drill in by file to both the total amount of IO done, and the total/min/max/avg etc. latency involved in doing so, per file that it accessed from the database. I’m putting together a worklog for this, and will post it to the comments once it is public.

I’ve also now created a ps_helper page dedicated to collecting various views and procedures that help with looking in to performance_schema data. I’ve still got a number of things that I’m playing with that I hope to blog about and publish soon as well. 🙂