Tuesday, July 9, 2019

MySQL Master Replication Crash Safety Part #4: benchmarks of high and low durability

This is a follow-up post in the MySQL Master Replication Crash Safety series. In the three previous posts, we explored the consequences of reducing durability on masters (including setting sync_binlog to a value different from 1). But so far, I have only quickly presented why a DBA would run MySQL with such a configuration. In this post, I present actual benchmark results. I also present a fundamental difference between on-premises servers and cloud virtual machines, as my tests are done in Google Cloud Platform (GCP). But before going further, let's summarise the previous posts.

Update 2019-07-22: the 5th-a post of the series is out: MySQL Master Replication Crash Safety Part #5a: making things faster without reducing durability - using better hardware.

In Part #2, we extended our understanding of the impact of running a master with sync_binlog != 1 by considering lagging slaves using file+position replication. TL;DR: this introduces a new inconsistency scenario after a master OS crash, and the resulting data drift is trickier to detect because replication might not break.

In Part #3, we further extended our understanding of the impact of running a master with sync_binlog != 1 by considering GTID replication. TL;DR: this introduces even more inconsistency scenarios after a master OS crash, and the resulting data drift is also tricky to detect because replication might not break.

This post has an annexe: Under the Hood. Benchmarking is a complex art and reporting results accurately is even harder. If all the details were put in a single article, it would make a very long post. The links to the annexe should satisfy readers eager for more details.

My first test runs sysbench with one thread on the same server as MySQL (co-located). I ran it with high durability (sync_binlog = 1 and trx_commit = 1) and with low durability (sync_binlog = 0 and trx_commit = 2). The results are below, in transactions per second (TPS).

Co-located, high durability: ~220 TPS

Co-located, low durability: ~6280 TPS
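For reference, here is how the two configurations can be toggled at runtime; this is a sketch, with trx_commit above being shorthand for innodb_flush_log_at_trx_commit (the exact sysbench parameters are in the annexe):

```sql
-- High durability: sync both the binary log and the InnoDB redo log
-- to disk at every transaction commit.
SET GLOBAL sync_binlog = 1;
SET GLOBAL innodb_flush_log_at_trx_commit = 1;

-- Low durability: never sync the binary log (the OS flushes it when
-- it sees fit), and write the redo log at commit but sync it only
-- about once per second.
SET GLOBAL sync_binlog = 0;
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
```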

This (220 TPS vs 6280 TPS) is a huge difference: low durability provides more than 25 times the throughput of high durability! If you need more than 200 TPS in those conditions, you might be tempted to run MySQL with reduced durability, but this comes with the consequences presented in the previous posts of the series. However, this test is somewhat unfair: a normal application needing a high transaction throughput is usually not co-located with the database server. If I run the same tests from two other VMs, one in the same zone (co-located with the master) and one in a remote zone of the same region, I get the results below:

Remote VM, same zone, high durability: ~200 TPS

Remote VM, same zone, low durability: ~3770 TPS

Remote VM, remote zone, high durability: ~200 TPS

Remote VM, remote zone, low durability: ~2190 TPS

For a high durability configuration, we get basically the same throughput in all three setups (co-located, remote VM in the same zone, and remote VM in a remote zone of the same region). This is because the cost of the network round-trip to the database is negligible compared with the cost of the two syncs at transaction commit (real numbers are in the environment section of the annexe). For a low durability configuration, as those two syncs are removed, the network round-trip becomes the limiting factor of the benchmark, so we see a big difference in transaction throughput between the three setups. And still, in the most likely architecture (the application talking to MySQL is in a remote zone but in the same region as the database server), we see a tenfold difference in throughput between the high and low durability configurations. This is very big: it is understandable that people are tempted to get this performance boost!

Low durability provides better transaction throughput!

So clearly, running with reduced durability provides a great performance improvement. This is especially visible in a cloud environment, where a sync to disk is a network round-trip to the storage system. (I am using GCP SSD persistent disks for my tests, but, for binary logs and probably also for the InnoDB redo logs, magnetic disks have similar latency behaviour because sequential writes are cached at the storage level; this is why, on my production systems, I store binary logs on a dedicated magnetic disk, which is cheaper than SSD.) In a high durability configuration in GCP, the high latency of disk syncs is the limiting factor. I would expect the performance of a high durability configuration with lower latencies, like a local SSD or a battery-backed RAID cache, to be much higher.
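Putting the binary logs on a dedicated disk is just a matter of pointing log_bin at a mount point on that disk; a minimal my.cnf sketch, where /binlogs is a hypothetical mount point for the dedicated disk:

```
# my.cnf sketch: binary logs on a dedicated (cheaper, magnetic) disk.
# /binlogs is assumed to be the mount point of that disk.
[mysqld]
log_bin = /binlogs/mysql-bin
```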

And for those of you who know about local SSDs in cloud environments, more is coming on this subject in a next post. But in short: on GCP local SSD, I see higher write latency than on persistent SSD (yes, this is weird!). On local SSD in Amazon Web Services (AWS), I see the lower write latencies that I expect. However, it is too soon for me to state that AWS is better than GCP.

The high sync latencies of cloud environments are making the problem worse!

As cloud environments have higher sync latency than physical servers, it is not completely false to say that, to get decent transaction throughput in such environments, compromising on durability is the easiest solution (there is more to say about this statement, but it is the subject of a next post). With more and more people moving to the cloud, I am expecting more and more people to run with low durability configurations, and this is one of the reasons I think this series is important. I would also like database vendors (Oracle and MariaDB) to understand the importance of supporting such configurations (cf. the discussion about Bug#92109 below).

Database vendors must support low durability configurations to allow running their products in the cloud!

It is also important that cloud providers allow running managed databases with reduced durability settings (from what I have seen, Google Cloud SQL does not support such a configuration, and I cannot say for Amazon RDS; if you know, please write a comment below).
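One way to check what a given instance (managed or not) is actually running with is to query the two durability-related variables; a quick sketch:

```sql
-- Show the two durability settings discussed in this post:
-- 1 / 1 is the fully durable configuration.
SHOW GLOBAL VARIABLES
  WHERE Variable_name IN ('sync_binlog', 'innodb_flush_log_at_trx_commit');
```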

It is relatively easy to get past the high durability transaction throughput limitation on a master without lowering durability (I am not telling how to do this yet, as it is the subject of a next post, but the title of my last post — Getting past 2400 threads with sysbench and dbdeployer for benchmarking MySQL 5.7.26 — already gives you a good hint), but we are missing an important part of the puzzle: replication.

Yes, replication! So far, the results above were from sysbench pointed at a master; what about transaction throughput on slaves? Replication is single-threaded by default, so blocking on syncs is problematic. Below are the transaction throughputs from my tests; the methodology is in the replication section of the annexe:

Replication, high durability: ~230 TPS

Replication, low durability: ~7050 TPS

So replication gives us a little more throughput than sysbench co-located on the master. This is probably because replication uses the row-based binary logging format (the default in MySQL 5.7), which skips statement parsing. But there is one hidden conclusion in those numbers:
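Both points above can be checked directly on a slave; binlog_format is the logging format, and slave_parallel_workers = 0 is what makes replication single-threaded by default (a sketch):

```sql
-- ROW is the default binary logging format in MySQL 5.7: the slave
-- applies row images instead of re-parsing SQL statements.
SHOW GLOBAL VARIABLES LIKE 'binlog_format';

-- 0 means single-threaded replication (the default), so each commit
-- sync on the slave is serialised behind the previous one.
SHOW GLOBAL VARIABLES LIKE 'slave_parallel_workers';
```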

Reducing durability is a very efficient and common way of dealing with a replication lag problem!

And this is why I am so disappointed that replication with GTID is not yet crash safe with low durability configurations, and I would like Bug#92109 to get the attention it deserves. My trick to avoid this problem right now is to disable GTID auto-positioning for the slaves on which I am using low durability by running the commands below, but this is only a workaround and things should not be this way.

> STOP SLAVE;
> CHANGE MASTER TO MASTER_AUTO_POSITION = 0;
> START SLAVE;

Hopefully, this post will remind Oracle of the importance of fixing this bug, which does not affect MariaDB as its GTID table is updated after each transaction on slaves.

Before closing this post, I would like to mention a few things that I detail in the annexe:

I opened Bug#96134 for adding START UNTIL and WAIT functions for the IO Thread.

For the replication tests, I mentioned having to work around the very annoying limitations of relay log recovery, which I reported as Bug#74321 in MySQL 5.6. It is disappointing that this is still not fixed in 5.7 and 8.0, more than 4 years after being reported.

This is all I have for now. In the next posts, I will explain how to go beyond the numbers presented in this post (and yes, it will involve using many threads).