Checking throughput with async MySQL replication

Replication throughput is the measure of just how fast the slaves can apply replicated writes (at least by my definition). In MySQL async replication this is important to know because the single-threaded apply nature of async replication can be a write-performance bottleneck. In a production system, we can tell how fast the slave is currently applying writes, and we might have historical data showing the most throughput ever seen, but that doesn’t give us a solid way of determining where we stand right NOW().

An old consulting trick to answer this question is to simply stop replicating on your slave for a minute (usually just the SQL_THREAD), restart it, and watch how long it takes to catch up. We can also watch the slave's apply rate during this interval to get a sense of just how many writes per second it can do, and compare that with the normal rate (during peak hours, for example). This can be a handy way of quickly assessing how close you are to your maximum theoretical throughput.
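As a rough sketch of that trick (the 60-second pause, the connection defaults, and the polling loop are my assumptions, not commands from the article):

```shell
#!/bin/sh
# Sketch: measure async slave apply throughput by pausing the SQL thread
# and timing the catch-up. Assumes a local mysql client with credentials
# already configured.

# Stop only the SQL thread so the IO thread keeps fetching binlogs
mysql -e "STOP SLAVE SQL_THREAD;"

sleep 60   # let a backlog accumulate (interval is an assumption)

mysql -e "START SLAVE SQL_THREAD;"

# Poll Seconds_Behind_Master until the slave catches up again
start=$(date +%s)
while true; do
  lag=$(mysql -e "SHOW SLAVE STATUS\G" | awk '/Seconds_Behind_Master/ {print $2}')
  [ "$lag" = "0" ] && break
  sleep 1
done
echo "catch-up took $(( $(date +%s) - start )) seconds"
```

Dividing the backlog (in transactions or binlog bytes) by that catch-up time gives the apply rate.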

But what about with PXC and Galera? This is easy on async because the master doesn’t care, but to be able to do this on PXC we need a way to intentionally lag a node without hanging or causing flow control on the rest of the cluster. And as it turns out, as of version 5.5.33, there’s a pretty easy way.

Measuring an average apply rate on PXC

First we need to pick a node that is not taking reads or writes (or shift some traffic away from one that is). We’re assuming reads and writes are happening on the rest of the cluster normally, and probably also that the node we chose has pretty similar hardware to every other node. Once we have this, we can use myq_status to see replication coming into the node and being applied:
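The setup on the chosen node looks roughly like this (a hedged sketch, not the article's exact commands: wsrep_desync=ON tells Galera not to trigger cluster flow control as this node's receive queue grows, and FLUSH TABLES WITH READ LOCK is one way to stall the applier; the 60-second pause is an assumption):

```shell
# Sketch: intentionally lag one PXC node without causing flow control.

# Let this node fall behind without the cluster reacting
mysql -e "SET GLOBAL wsrep_desync=ON;"

# Record the starting seqno so we can compute the delta later
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_last_committed';"

# Hold the global read lock in a single session so the applier stalls,
# then release it and let the node drain its receive queue
mysql -e "FLUSH TABLES WITH READ LOCK; DO SLEEP(60); UNLOCK TABLES;"
```

Once the lock is released, watching the receive queue drain (with myq_status, for example) shows the node applying at full speed.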

So, it took this node 50 seconds to catch up again. Right at 20:20:05 when the queue zeroed out, I checked wsrep_last_committed again:

ip-10-142-147-72 mysql> show global status like 'wsrep_last_committed';
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_last_committed | 1332551 |
+----------------------+---------+
1 row in set (0.00 sec)

Be sure to turn wsrep_desync back off once we are done and caught up! Note that you can turn off wsrep_desync before the node catches up, but that puts the node into the JOINED state, which applies limited flow control to help the node catch up. We want our sample to be unbiased by flow control (at least from this node).
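Switching desync back off and confirming the node's state might look like this (a sketch under the same assumptions as above):

```shell
# Sketch: return the node to normal once the receive queue has drained
mysql -e "SET GLOBAL wsrep_desync=OFF;"

# The node should report Synced again (not Donor/Desynced or Joined)
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"
```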

Conclusion

So in 50 seconds, the node was able to apply 667183 transactions (the difference between the two wsrep_last_committed values), which comes out to 13.3k tps apply capacity (sustained for nearly a minute, at least). This tells us we’re around 38% capacity for write throughput. Is that a perfect number? Maybe not, but it at least gives you a rough idea.
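As a quick sanity check on the arithmetic (only the transaction delta and the interval come from the measurement above):

```shell
# Apply-rate arithmetic from the measurement above
txns=667183   # delta between the two wsrep_last_committed readings
secs=50       # observed catch-up time
tps=$(( txns / secs ))
echo "${tps} tps"   # ~13343 tps, i.e. roughly 13.3k apply capacity
```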

However, the point is that thanks to wsrep_desync we can measure this safely within a synchronous replication environment that may normally not allow this type of operation.

About Jay Janssen

Jay joined Percona in 2011 after 7 years at Yahoo working in a variety of fields including High Availability architectures, MySQL training, tool building, global server load balancing, multi-datacenter environments, operationalization, and monitoring. He holds a B.S. in Computer Science from Rochester Institute of Technology.