Use something like mk-heartbeat, which works in the same way as Seconds_Behind_Master (trying to show you the actual time difference between the master and slave) but is a little more robust in complex replication chains, and in other situations where Seconds_Behind_Master falls down (such as when the IO thread is lagging).

You can also monitor replication lag by comparing where the two slave threads are reading and writing, based on the binary log and relay log positions relative to the master. This lets you dig a little deeper into where the delays may be happening in the chain, rather than relying on a single time-behind metric.

Here’s a somewhat simplified (and not entirely accurate; the IO thread reads via the master) representation of how an event moves through replication, and of the points that can be monitored with current instrumentation (including time-delayed replication within 5.6):

Now, I say “delay” above, but that’s not entirely accurate: most of those metrics actually record the file, and the position within that file, that each stage has reached. They allow you to measure the number of bytes of difference at each stage, not the latency of each stage.

Actually generating this data consistently is problematic, however. Because it requires two statements to be executed on two entirely different systems, you need to run them as close together in time as possible: NTP should be installed on each system, and the statements kicked off on a precise clock-based schedule on each server. Even then, they can really only be used as an indicator, not a source of truth.

The expressions above are slightly simplified as well. Logs roll over, so you also need to know the size that logs can get to, using either max_binlog_size or max_relay_log_size (if non-zero) as appropriate, and calculate the difference depending on the number of logs between the two positions. Not entirely simple.
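To make the rollover arithmetic concrete, here is a rough sketch in Python (the file names and the 1GB size are hypothetical) of how the byte difference between two log coordinates might be estimated:

```python
def log_distance(file_a, pos_a, file_b, pos_b, max_log_size):
    """Approximate bytes between two (log file, position) coordinates.

    Assumes sequentially numbered logs (e.g. mysql-bin.000007) that each
    roll over at roughly max_log_size bytes; real logs can exceed the
    limit slightly (events are never split across files), so treat the
    result as an estimate, not an exact byte count.
    """
    def seq(name):
        # mysql-bin.000007 -> 7
        return int(name.rsplit(".", 1)[1])

    full_logs_between = seq(file_b) - seq(file_a)
    return full_logs_between * max_log_size + (pos_b - pos_a)

# e.g. the slave has executed to mysql-bin.000007:120 while the master
# has written to mysql-bin.000009:500, with max_binlog_size = 1GB:
lag_bytes = log_distance("mysql-bin.000007", 120,
                         "mysql-bin.000009", 500, 1073741824)
```

Within one file the estimate is exact; across files it is only as accurate as the assumption that every rolled log reached max_log_size.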

But even with these metrics in hand, they don’t really tell us much: the size of an event within the replication stream can differ depending on whether statement-based or row-based binary logging is enabled. A “small” event could be a statement that takes a very long time to run, or a single row-based event that executes in no time at all. YMMV.

So we can use these to profile the system, and graphing them over time probably wouldn’t hurt, but relying on them as an indicator that you should start looking at a slave because of performance issues gets tough, unless you rely on something that tells you that you are already behind, which is reactive rather than proactive.

What we really need is an indicator that tells us we should be looking into performance problems before the slave even starts to lag at all.

And so we come back to the title of this blog. The idea is simple: we need a “Replication Load Average” that can tell us how busy the slave really is. If it’s not executing 100% of the time, then the chances are it’s not really falling behind. If it’s starting to be busy 90–95% of the time, then the chances are it could start to lag pretty soon. It’s not a new idea either; I’ve seen a couple of prominent people wishing for such a thing:

Peter Zaitsev: Idea: MySQL Should add “Replication Load Average” metric which would tell what time replication thread was busy processing events compared to being idle. This would help a lot to see when you’re close to the limit.

Jeremy Cole: I agree however that getting saturation information is important and is something we will potentially implement in Twitter MySQL. It would be useful to get essentially a %busy/%idle stat for the SQL thread.

I’m happy to say, gentlemen, this is already available within 5.5, and has been since it was released.

When the SQL thread has executed all events from the relay log, it enters MYSQL_BIN_LOG::wait_for_update_relay_log(), which then waits for a condition called update_cond to be raised. This is exactly where the slave goes “idle”. And luckily, within Performance Schema in 5.5, we can track conditions. Conditions are timed within Performance Schema for the time a thread spends waiting for the condition to be raised by another thread; in the above case, that is the total time that the SQL thread was idle.
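To illustrate the mechanism (this is just a Python analogy, not MySQL code), timing a condition wait from the outside and accumulating it into a counter looks something like this:

```python
import threading
import time

class TimedCondition:
    """Wrap a condition variable and accumulate the time spent waiting
    on it, loosely mimicking how Performance Schema times waits on
    update_cond (like SUM_TIMER_WAIT, but in seconds, not picoseconds)."""

    def __init__(self):
        self._cond = threading.Condition()
        self.sum_timer_wait = 0.0  # total accumulated idle time

    def wait(self, timeout=None):
        start = time.monotonic()
        with self._cond:
            self._cond.wait(timeout)
        self.sum_timer_wait += time.monotonic() - start

    def notify(self):
        with self._cond:
            self._cond.notify()

# A "SQL thread" that has drained its relay log sits idle in wait();
# each wait adds to sum_timer_wait, the thread's accumulated idle time.
update_cond = TimedCondition()
update_cond.wait(timeout=0.1)  # no events arrive, so the wait is pure idle time
```

The key point is that the instrumented counter only grows while the thread is blocked on the condition, which is exactly the “idle” time we want.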

The condition waited for within wait_for_update_relay_log() maps to the wait/synch/cond/sql/MYSQL_RELAY_LOG::update_cond event within Performance Schema. Here’s a snapshot from a 5.5 server that is in the middle of a replication tree (it’s both a slave and a master, so it has both relay logs and binary logs enabled; wait/synch/cond/sql/MYSQL_BIN_LOG::update_cond is the similar condition used on a master to signal the thread serving a slave’s IO thread that new events have been written to the binary log):

OK, so we can get to the raw data. Great. But how do we turn that into a load average? The key is to read the values at a set interval, compute the delta of the SUM_TIMER_WAIT statistic, and compare that delta, which is the idle time within the interval, to the overall interval.
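In Python-flavoured pseudocode (a sketch of the calculation only, assuming both readings have already been converted to the same unit as the interval), the core of it is simply:

```python
def busy_percent(prev_idle, curr_idle, interval):
    """Busy % of the SQL thread over one polling interval.

    prev_idle / curr_idle are two successive SUM_TIMER_WAIT readings for
    the update_cond event (Performance Schema reports picoseconds, so
    convert first); interval is the polling period in the same unit.
    """
    idle_delta = curr_idle - prev_idle
    # clamp out clock skew / truncation noise at either end of the range
    idle_delta = min(max(idle_delta, 0.0), interval)
    return 100.0 * (1.0 - idle_delta / interval)

# Polled every 5 seconds; the thread spent 1.5s of the interval idle:
busy_percent(20.0, 21.5, 5.0)  # -> 70.0
```

If the idle counter did not move at all during the interval, the thread was executing the whole time and the result is 100% busy.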

As a quick proof of concept, I’ll use an event to do this, calling a procedure to gather the data, compute the deltas, and generate some extra statistics. Before boring you with the details of the implementation, let me first give a snapshot of the end result that we are striving for:

This gives what both Jeremy and Peter were asking for: a single “busy” counter, with load-average-like statistics over 3 different intervals, snapshotted every 5 seconds in this example. This lets us see the peaks and troughs in the load of the slave on the shorter interval, while the larger averaged intervals flatten its overall load to give you a better baseline of how busy the slave really is over time.

At a high level, this gathers the wait data for the wait/synch/cond/sql/MYSQL_RELAY_LOG::update_cond event from the events_waits_summary_by_thread_by_event_name table for only the slave SQL thread, which is exposed as thread/sql/slave_sql, then grabs the last snapshot of data for comparison. It then looks at the current and previous wait info to determine how to calculate the actual busy percentage based on how the event data has changed over the period, computes the deltas and busy percentage for that last interval appropriately, and inserts them into a table.
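For reference, the lookup described above can be sketched as a query along these lines (a sketch only; the join and column names follow my reading of the 5.5 Performance Schema tables, so verify them against your server):

```python
# Sketch: fetch the accumulated idle time of the slave SQL thread.
# Table and column names are taken from the 5.5 Performance Schema docs.
IDLE_TIME_QUERY = """
SELECT ewt.sum_timer_wait
  FROM performance_schema.events_waits_summary_by_thread_by_event_name ewt
  JOIN performance_schema.threads t USING (thread_id)
 WHERE t.name = 'thread/sql/slave_sql'
   AND ewt.event_name = 'wait/synch/cond/sql/MYSQL_RELAY_LOG::update_cond'
"""
```

Polling this on a fixed schedule yields the successive SUM_TIMER_WAIT readings that the delta calculation needs.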

Finally, it computes the moving averages for the 3 sets of defined intervals and updates the just-inserted row with that data (so the moving average also takes the current snapshot into account), then purges anything older than two hours, to keep a rolling two-hour window of 5-second snapshots.
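The moving-average-plus-purge logic can be sketched like so (Python rather than a stored procedure; the window lengths are just the 1/5/15-minute style defaults assumed for illustration):

```python
from collections import deque

class BusyLoadAverage:
    """Keep a rolling window of busy-% snapshots and report moving
    averages over three window lengths (counted in snapshots)."""

    def __init__(self, windows=(12, 60, 180), max_len=1440):
        # with 5-second snapshots: 1, 5 and 15 minute averages,
        # and max_len=1440 retains a rolling two-hour window
        self.windows = windows
        self.snapshots = deque(maxlen=max_len)  # old rows "purged" automatically

    def add(self, busy_pct):
        self.snapshots.append(busy_pct)
        recent = list(self.snapshots)
        # each average includes the snapshot just added
        return tuple(sum(recent[-w:]) / len(recent[-w:]) for w in self.windows)
```

Here the deque’s maxlen plays the role of the two-hour purge: appending a new snapshot silently discards the oldest one once the window is full.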

It also disables binary logging (because these are local statistics that should not be propagated to downstream systems), checks whether Performance Schema is enabled, and logs appropriately to a logging table if not.

Within 5.5, the time delta can only be recorded at a 1-second granularity, which loses quite a bit of precision compared to the picosecond-precision wait information; within 5.6 we can use microsecond precision with SYSDATE(6) and get even more accurate reporting. Note that SYSDATE() is used instead of NOW() because, within procedures, NOW() returns the timestamp of when the procedure started, not when the function is called.

The first 15 minutes’ worth of aggregate data should be taken with a pinch of salt; wait for the data set to “warm up”.

This doesn’t necessarily negate the need for the other statistics described above, which can all be used to trace where latency might be. Seconds_Behind_Master, or a heartbeating implementation, can still give information on how far behind you are getting once your slave really does start to lag, as can something like Jeremy’s velocity metric (linked via his name above), which gives an idea of the general direction of a slave once it has started lagging.

With all that said though, I think this is the *key* metric that anybody should be tracking first and foremost. It is far better to fix problems or spread load before you start lagging.

The joys of digging through the different events in Performance Schema to see what they actually record! There’s gold in there still to be mined, but I think this one is a genuine 14 oz nugget.


14 comments

Nice. Didn’t take long to drum up a quick script to set this up. Thanks for adding the detail to make it that trivial. Now I need to collect the data, put it into my monitoring system, and then look at the graphs of different types of servers and see how close I am to the dreaded 100%…

Great. Thank you for showing how to compute it with Performance Schema. I think this is what it needs. There is a huge amount of information stored in it; however, it is hard for many people to extract a simple meaning from the data stored.

Indeed – I think the average Oracle DBA, who is familiar with the likes of the Oracle Wait Interface, will get it straight away, but for the average MySQL DBA there’s a learning curve for this new Performance Schema paradigm that will take some getting used to.

This is really my aim with ps_helper, to show the simple use cases that people need when trying to diagnose various issues in clear views etc. so that they can slowly get used to using the data.

The one thing I think we’re lacking at the moment is a clear description of every single event that is tracked – I’ve started my own notes, but I think we need a concerted effort from more people than just me to make this happen.

We’ve also discussed some other things that P_S in 5.6 will bring, like some basic resource utilisation statistics, which could be interesting on a per-user or per-host basis. That’s slightly off topic for this post, but I agree with Peter’s comment that the current documentation for P_S is rather limited and “overly technical”. I guess it was built for the developers, but of course there’s a lot of useful information in there which a DBA might like, once suitably parsed or filtered. So perhaps providing a more “DBA oriented”, higher-level overview of what’s in P_S and what you can do with it would be really good. Once people start to see the use cases and what you can get out of P_S, and once 5.6 becomes GA, I’m sure this information will get a lot more attention than it’s had up until now.

I finally tried this on a server that seems to be pathologically falling behind. I notice that when the server is falling behind, with no other queries running, the “current_wait” value that is calculated and stored in slave_sql_load_average is almost always “wait/synch/mutex/mysys/KEY_CACHE::cache_lock”. What’s the best way for me to track down what that is? Sounds like a MyISAM key cache lock to me – is that right? If so, is there any way to figure out what to do about it?

One of the best ways to figure out why it may be waiting for this mutex is to check whether there really are other threads holding it for some reason – you can do that with the events_waits_current table (just poll that frequently as well).

Of course, my default answer would always be – do you really need to use MyISAM, and is InnoDB not an option?

This mutex is locked pretty excessively for most kinds of operations against the key cache; the only real course of action, if you must use MyISAM, is to look into multiple key caches.

So far, I haven’t seen any other threads holding KEY_CACHE::cache_lock while I’ve been polling events_waits_current by hand – there’s always just the one thread.

Most of the tables involved don’t *need* to use MyISAM, but we’ve already converted the ones that were straightforward over to InnoDB. I seem to have had some mild success by monitoring TABLE_STATISTICS and creating separate key caches for the tables that are being written to most.

While I’m sure that you may have realised this already it suddenly occurred to me that there are various metrics that DBAs care about which can be collected this way, not just the SQL thread “load”.

Things like:

“SQL thread busy %” (mentioned by you in this post),
“IO thread busy %” (writing to the relay log?),
“% Busy writing to binlogs” (usually a contention point at some point),
“% Busy writing to disk (innodb)”,
“% Busy writing to iblog files”,
“% Busy reading from disk (innodb)” (it would be good to split read/write “load”),
“% Busy checkpointing” (is it possible to catch this as a metric?),
…
and various others.

Most of the logic you’ve written could probably be modified reasonably easily to work on a group of metrics, and so shouldn’t be much more costly to calculate. But the real advantage of metrics like this is that it should be much easier for the DBA to figure out where MySQL is busy than by looking inside P_S, which is too technical and low level. The statistic mentioned by Chuck may well fall into this category.

Having a special mysqld_metrics (M_M) database which holds such tables and the stored procedures to collect the information would be very nice, and as you say this information can be collected already in MySQL 5.5 with MySQL 5.6 providing further data.

I do not have the time to investigate in detail the data that P_S can provide in this respect at the moment, but those of you who’ve built P_S probably know what can be added. Then, for many of us, when the server is busy, doing something as simple as SELECT * FROM M_M would probably give a very quick summary of what’s going on and pretty much pinpoint where/why the server may be busy, without even having to look at PROCESSLIST output or P_S itself, which is hard to use for a quick overview of server bottlenecks.

Just a thought. I’m not sure how realistic it would be to do this but it strikes me that this would be an interesting project.