Yoshinori Matsunobu's blog

Thursday, November 9, 2017

In MySQL ecosystems, it has been a very common practice to shard MySQL databases by application keys and to manage many small MySQL instances. The main reason for sharding was that a single server could not handle that much data. Ten years ago, typical commodity servers had only 16GB of RAM, and the typical storage configuration was RAID10 with 6~12 SATA HDDs. There was no affordable flash storage available at that time. Because such machines could handle only a few hundred random IOPS, and the buffer pool was so limited, we couldn't put much data on a single server -- at most a few hundred GBs per server. Even small to mid sized applications easily exceeded single server capacity and had to split their databases across multiple servers.

Disadvantages of Sharding

Sharding by application has been a common practice to scale MySQL beyond a single machine. But it has several disadvantages, including the following.

You have to write application logic to manage shards, and you need to manage many more MySQL instances

Atomic transactions are not supported across multiple shards

Globally consistent backups cannot be taken across multiple shards

Cross-instance joins are not supported (or are very slow)

It is hard to use secondary keys efficiently. Unless secondary keys are part of the sharding key, you need to query all shards, which is very expensive, especially if you need sorting

Atomic transactions, joins, and secondary keys are big deals. Not all applications can easily give them up, and they might not be willing to spend time implementing sharding logic, either.

Shared Nothing SQL databases are not general purpose

There are a couple of database products that offer transparent sharding inside the database. MySQL Cluster (NDB) is a MySQL engine that has existed for many years, offering a shared nothing distributed database. You can run atomic transactions and cross-shard joins across multiple instances (data nodes). MySQL Cluster supports a NoSQL interface (called NDBAPI) for querying data nodes directly, which boosts performance significantly. I'm not going to talk about the NoSQL interface, since I'd like to discuss SQL databases. Note that MySQL Cluster is a product name and it is not InnoDB. If you want to continue to use InnoDB, Vitess can be a good solution. It helps build a transparent, shared nothing database.
Transparent shared nothing databases like MySQL Cluster and Vitess solve some of the issues mentioned above, but issues like the following remain, and users may need to redesign tables and/or rewrite queries as needed.

Limited secondary key support. This is the same issue I mentioned above. In MySQL Cluster, by default, rows are distributed by primary key. If you do not explicitly specify the primary key in the WHERE clause, queries need to scan all data nodes, which may significantly limit concurrency and increase latency, depending on how many data nodes you have and other query conditions (e.g. sorting).

Queries using JOINs may need lots of network round-trips. Many improvements have been made in MySQL Cluster for handling joins, but it is still slower than InnoDB, especially for complex queries.

As a result, to utilize MySQL Cluster well, it is generally recommended to avoid secondary keys and frequent joins. There are many good use cases, especially if people are 100% sure how their databases will be used, but it cannot be recommended as a general purpose SQL database.

Bigger Small Data and its challenges

Several people at Facebook have called our MySQL services "Small Data". The combined size of our MySQL instances was pretty large, but each instance was small enough (normally up to 1TB). Ten years ago, people had to run small instances because of hardware limitations. Nowadays, commodity servers have more than 256GB of RAM and more than 10TB of flash storage. There are many small to mid sized databases that fit in 10TB. Successful databases grow beyond 10TB, so they will have to shard anyway. But how about experimental projects, and/or the many other applications that are expected to grow only to a limited size? Instead of spending engineering effort to manage shards and rewrite tables/queries, why not just put everything on a single server and keep all the advantages like atomic and consistent transactions, secondary indexes and joins -- running "Bigger Small Data"?
There are a couple of public MySQL services supporting bigger storage sizes. Amazon Aurora (AWS) and Alibaba PolarDB (Alibaba Cloud) are both based on MySQL and claim to support instance sizes of more than 60TB. It was not surprising to me that they chose bigger small data rather than shared nothing distributed databases, because they have lots of customers who want to do whatever they want. They cannot stop customers from using joins.
But being supported does not necessarily mean working as expected. To make MySQL really work with bigger small data, it needs lots of improvements beyond 8.0, especially around concurrency and long running operations, including the following.

Parallel logical dump (mysqldump)

Parallel query

Parallel binary copy

Parallel DDL

Resumable DDL

Better join algorithm

Much faster replication

Handling many more connections

Good resource control, so that some bad users don't eat all resources

At a minimum, these are needed to answer questions like "how can MySQL handle general operations if our instance grows 10x bigger?"

I'm not worried about short queries -- row look-ups by primary or secondary keys. These are where MySQL has always been great. I'm worried about long running queries. The most common long running query would be a full table scan of a single table. In general, a logical dump of a 10TB table takes (much) more than 10 times as long as scanning a 1TB table. InnoDB needs to keep a history list for consistent reads. Maintaining the history list and reading consistent snapshots from rollback segments get more expensive if rows are heavily updated while long running transactions are open. If you run daily backups with mysqldump, you might not tolerate 10x~20x longer backup times -- the backup might not finish within 24 hours. To make logical dumps shorter, parallel query support is needed, and this is not currently supported by MySQL. Physical backup (binary copy) also needs parallelism, though that can be implemented relatively easily, since physical backup tools are written outside of MySQL and are easily extended.

Running ALTER TABLE on a 10TB table is also challenging. Amazon Aurora supports instant DDL -- adding a nullable column at the end of a table can be done without rebuilding the table. But there are still many DDL operations that require copying tables. First, you will want to know when the operation will end. MySQL currently does not tell you. If it is expected to take 20 days, you might worry about what will happen if mysqld restarts before the DDL finishes. It would be great if the database remembered the DDL state periodically and could resume the operation when mysqld restarts.

Replication is another technical challenge. MySQL replication is asynchronous, and slaves often lag if the master is heavily updated. On bigger small data, update volume on the master can be 10x or even more. How can slaves handle 10x more replication traffic?
Amazon Aurora does not have the MySQL replication lag issue if you run it in a single region. Aurora keeps six storage copies in the same region, across three availability zones, so you can scale reads within the region. But scaling reads across multiple regions requires MySQL Replication, and that is challenging unless MySQL Replication becomes a lot faster. Alibaba PolarDB offers InnoDB physical replication across different datacenters, which is significantly more concurrent and faster than binlog based replication. But you are constrained to one engine (InnoDB, though it's by far the most common engine), and debugging replication issues might be harder, since it's no longer binlog based.

Reliability improvements should not be ignored. 10x larger data means the instance serves many more connections and queries. If a handful of bad queries can make the whole instance unavailable, the impact on bigger small data is much higher than on small instances. Good resource management is needed. High priority queries can be useful too, so that important low latency queries finish earlier without being affected by expensive queries.

There are lots of technical challenges to making Bigger Small Data really work. I expect Amazon Aurora will be ahead in implementing these important features to make Aurora truly support bigger data. But I'm almost sure that AWS won't release them as open source software. For everybody to get such features, somebody else will have to implement them. I hope Oracle will, but I understand that they need to compete with Amazon Aurora and will not be willing to give their features to AWS for free. A more realistic scenario might be multiple companies, including us, implementing the features, releasing them as open source, and contributing them to Oracle and/or MariaDB. I think these will be interesting technical projects for MySQL for the next couple of years.

Monday, January 26, 2015

A long time ago I wrote a blog post saying that we had started using Semi-Synchronous replication in production at Facebook. We run our own MySQL facebook-5.6 branch, which adds many replication (including Semisync) patches. Some of them were also merged into WebScaleSQL.

Recently I heard from people in the community that they were suffering from Semisync performance issues in official MySQL 5.6. I took some time to review our previous MySQL bug reports, and realized that some important bugs were either still "verified" or inefficiently fixed. The two most impactful bug reports were https://bugs.mysql.com/bug.php?id=70342 and http://bugs.mysql.com/bug.php?id=70669. We fixed both in our branch, so I hadn't paid much attention since then, but people outside Facebook are certainly affected.

In this post, I'm going to describe some effective configurations for getting better Semisync throughput on masters and slaves, by showing simple benchmark numbers. I used three machines -- a client, a master, and a semisync slave -- all running on pure flash and located very close to each other. I created 100 databases, enabled Multi-Threaded Slave, and ran 100 mysqlslap processes against the 100 databases, with 30 concurrent connections each (3000 concurrent connections in total). All queries were auto-committed inserts, and I used the InnoDB storage engine on both master and slave.

1. Set master_info_repository=TABLE
MySQL 5.6 and 5.7 have a performance bug that makes writing the FILE based master info (and relay log info) files very expensive. This is especially serious for Semisync replication, since the bug slows down the IO thread. With Semisync, a slow IO thread takes longer to send the ACK back to the master, which slows down master throughput as well. The default master_info_repository is FILE, so unless you change this parameter to TABLE, you are affected by this bug.
Here are benchmark results between FILE and TABLE.

These numbers are commits per second on the master and slave instances. The slave didn't lag in most experiments, thanks to the multi-threaded slave feature.
Please don't confuse the master_info_repository and relay_log_info_repository parameters. relay_log_info_repository has to be TABLE, otherwise the crash safe slave feature doesn't work. master_info_repository works with both FILE and TABLE, but I suggest using TABLE for performance reasons.
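For reference, a minimal my.cnf sketch of the two settings discussed here (just these two lines; everything else stays as you have it):

[mysqld]
# TABLE based repository avoids the expensive FILE based master info writes
master_info_repository = TABLE
# must be TABLE anyway, otherwise crash safe slave does not work
relay_log_info_repository = TABLE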

2. Reduce durability on master
Older 5.6 had a bug where slaves couldn't continue replication after a crashed master recovered, even with fully durable settings. Here is the closed bug report.
The bug report was closed, but the fix caused a performance regression. The master extended the LOCK_log mutex holding duration -- releasing LOCK_log only after calling fsync(). This certainly fixed the bug, but caused a regression because LOCK_log is a very hot mutex in 5.6 -- both the Binlog Dump threads and application threads need to hold it. Fortunately, reducing durability (I mean setting sync_binlog=0 and innodb_flush_log_at_trx_commit=0|2) mitigates the regression a lot. When using Semisync replication, you will almost certainly promote a slave on master failure, so durability on the master does not matter much == you can reduce durability on the master.
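A minimal my.cnf sketch of those reduced durability settings (only for masters you are prepared to always fail over from, never to recover):

[mysqld]
# do not fsync the binlog on every commit
sync_binlog = 0
# flush the InnoDB redo log roughly once per second instead of per commit
innodb_flush_log_at_trx_commit = 2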

3. Loss Less Semisync
MySQL 5.7 improved replication performance a lot. Probably the most effective improvement was that the Binlog Dump thread no longer holds the LOCK_log mutex. In addition, 5.7 introduced the "Loss-Less Semisync" feature.
If you read my previous Semisync blog post carefully, you may have noticed that 1. we backported "Loss-Less Semisync" from 5.7, and 2. we got better throughput with Loss-Less Semisync than with Normal Semisync. This was because Loss-Less Semisync actually reduced mutex contention -- LOCK_commit and LOCK_binlog_. My 5.7 benchmark result was as follows.

Reducing LOCK_log contention in the Binlog Dump thread and introducing Loss-Less Semisync were major contributors to the MySQL 5.7 performance improvements. At Facebook, we ported both into our 5.6 branch. It would be interesting to the community if these became available in other distributions, since using 5.7 in production will not happen anytime soon.

In addition to the above three configurations, there are a few more considerations for getting better Semisync throughput.

4. Semisync mysqlbinlog
At Facebook, we extended mysqlbinlog to speak the Semisync protocol, and used it as a Semisync replication reader. On replication slaves, the IO thread and SQL thread contend on internal mutexes. As I mentioned above, a slow IO thread slows down Semisync master throughput. Semisync mysqlbinlog doesn't have that slowdown, because it has no SQL thread. So using Semisync mysqlbinlog instead of a Semisync slave can improve master throughput.

This shows that Semisync mysqlbinlog improved master throughput. But this is actually not such good news -- the slave lags a lot. Fundamentally, we need to fix the mutex contention between the IO thread and SQL threads.

5. GTID
There are some open performance bugs in GTID. The 5.7 one is especially serious. If you really need high Semisync throughput, you need to benchmark carefully with GTID (and ask Oracle to fix the bugs!).

Here are the table definitions and mysqlslap commands I used for the benchmark.
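As a rough sketch of the shape of such a run (host, user, schema names, and query counts here are placeholders, not the exact commands):

for i in $(seq 1 100); do
  mysqlslap --host=master --user=root \
    --create-schema=db$i --no-drop \
    --concurrency=30 --number-of-queries=1000000 \
    --query="INSERT INTO t1 (c1) VALUES ('abc')" &
done
wait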

Tuesday, April 1, 2014

After intensive testing and hacking, we started using Semi-Synchronous MySQL Replication in Facebook production environments. Semi-Synchronous Replication itself has been available since MySQL 5.5 (GA was released 3.5 years ago!), but I'm pretty sure not many people have used it in production so far. Here is a summary of our objectives, enhancements and usage patterns. If you want to hear more in depth, please feel free to ask me at Percona Live this week.

Objective / Why Semisync?

The objective of Semi-Synchronous Replication is simple -- master failover without data loss, and without full durability.

There are a couple of fast slave promotion (master failover) solutions. My own MHA covers both fully automated and semi-automated MySQL failover. Fully automated means both failure detection and slave promotion are done automatically. Semi-automated means failure detection is not automated, but slave promotion is done with a single command. Time to detect failure is approximately 10 seconds, and the actual failover takes around 5 to 20 seconds, depending on what you do during failover (i.e. forcing power off of the crashed master will take at least a few seconds). Total downtime can be less than 30 seconds if failover works correctly. I'm using the term "Fast Failover" in this post, which includes both automated and semi-automated master failover.
In MySQL 5.6, GTID based failover is also possible. Oracle's official tool mysqlfailover automates MySQL master failover using GTID. The latest version of MHA also supports GTID.

Both mysqlfailover and MHA rely on MySQL replication, and MySQL replication is asynchronous. So there is a very serious disadvantage -- potential data loss on master failover. If you use normal MySQL replication and do automated master failover with MHA/mysqlfailover, you can fail over quickly (a few seconds with MHA), but you always risk losing recently committed data.

If you don't want to take any risk of losing data, you can't do fast master failover with normal MySQL replication. You have to follow these steps in case of master failure.

- Always use fully durable settings on the master. By fully durable I mean setting innodb_flush_log_at_trx_commit=1 and sync_binlog=1.
- On master crash, wait for a while (10~30 minutes) until the crashed master recovers. Recovery takes a long time because it involves an OS reboot, storage and filesystem recovery, and InnoDB crash recovery.
- If the crashed master recovers, you can continue service without losing any data. Since all data exists on the master, slaves can continue replication. BTW, official 5.6 had a bug causing all slaves to break in this scenario, but it was fixed in 5.6.17.
- If the crashed master doesn't recover (H/W failure etc), you need to promote one of the slaves to a new master. There is a risk of losing some data, but you don't have any other choice.

This "safer" approach has two issues.
- Longer downtime, because you have to wait for the master's recovery.
- You can't eliminate the risk of losing data. If the master is dead and never recovers, your risk of losing data is the same as with fast failover.

So, in bad cases, you suffer both longer downtime and data loss.

Semi-Synchronous Replication is helpful for preventing data loss.

If you do not care about data loss risk, there is no reason to use Semi-Synchronous replication. You can use normal MySQL replication and do fast failover with mysqlfailover or MHA. Facebook is one of the companies that does care about data loss risk with MySQL, so that's why we were very interested in Semi-Synchronous replication.

Semisync replication originated at Google in 2007. Official MySQL has supported it since 5.5, though the actual implementation algorithm is substantially different from Google's.

MySQL Cluster and Galera offer synchronous replication protocols in different ways. I do not cover them in this blog post.

Semi-Synchronous Replication currently has two different algorithms -- Normal Semisync and Loss-Less Semisync. Let me explain the differences.

Differences between Normal Semisync and Loss-Less Semisync

Loss-Less Semisync is a new Semisync feature supported in official MySQL 5.7. The original implementation was done by Zhou Zhenxing as the "Enhanced Semisync" project, and was also filed as a bug report. Oracle implemented it based on his idea and named it Loss-Less Semisync. So Enhanced Semisync and Loss-Less Semisync mean the same thing; I say Loss-Less Semisync in this post.

With normal semisync (AFTER_COMMIT), committing to InnoDB is done before waiting for the ack from the semisync slave, so the committed rows are visible to applications even though the semisync slaves may not have received the data. If the master crashes and none of the slaves received the data, the data is lost, but applications may already have seen it. This is called a phantom read, and in many cases it's problematic.

Loss-less semisync (AFTER_SYNC) avoids the problem. Loss-less semisync commits to InnoDB after getting the ack from one of the semisync slaves. So when committed data is visible to applications, one of the semisync slaves has already received it. The phantom read risk is much smaller: data is lost only if both the master and the latest semisync slave go down at the same time, which is much less likely to happen than with normal semisync.

With Loss-Less Semi-Synchronous replication, committed data should be on one of the slaves, so you can recover from the latest slave. You can always do fast failover here.

Reduced Durability

When you do fast failover, you can use reduced durability settings on the master as well as the slaves. Reduced durability means innodb_flush_log_at_trx_commit != 1 and sync_binlog != 1. With Semi-Synchronous replication, you can start failover immediately when the master goes down. When promoting a slave to the new master, you identify the latest slave (most likely one of the Semi-Synchronous slaves, though that is not guaranteed) and apply the differential logs to the new master. The master's durability does not matter here, because there is no way to access the master's data during failover. So you can safely reduce durability. Reducing durability has a lot of benefits.

- Reduced latency on (group) commit, because commits don't wait for fsync().
- Reduced IOPS, because the number of fsync() calls drops significantly: from every commit to every second. Overall disk workload can be reduced. This is especially helpful if you can't rely on a battery/flash backed write cache.
- Reduced write amplification. Write volume can be cut a lot, to less than half in some cases. This is especially important when using flash devices, because less write volume increases flash life expectancy.

Requirements for Semisync Deployment

To make Semisync work, you need at least one semisync reader (a slave with semisync enabled) within the same (or a very close) datacenter as the master. This is for latency. When semisync is enabled, the round-trip time (RTT) between the master and one of the semisync slaves is added to transaction commit latency. If none of the semisync slaves is in a close datacenter, the RTT may be tens or hundreds of milliseconds, which means a single client can commit only 10~100 times per second (e.g. with a 10ms RTT, each commit waits at least 10ms, so one client tops out at about 100 commits per second). For most environments, this will not work. You need a slave in a close datacenter.

To make fast failover work without data loss, you need to make sure Semi-Synchronous Replication is always enabled. MySQL Semisync has a few conditions under which semisync is (optionally) disabled:
- Exceeding timeout (exceeding rpl_semi_sync_master_timeout milliseconds to get ACK from all of the semisync slaves)
- No semisync slave (can be controlled via rpl_semi_sync_master_wait_no_slave)
- Executing SET GLOBAL rpl_semi_sync_master_enabled=0

If you want semisync to always be enabled, you need to make sure these scenarios won't happen: set an infinite or very long timeout, and have at least two semisync readers.
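As an illustration, a my.cnf sketch along those lines, assuming the semisync master plugin is already installed (the timeout value is just an arbitrarily large number, not a recommendation):

[mysqld]
rpl_semi_sync_master_enabled = 1
# very long ACK timeout in milliseconds, so semisync effectively never falls back to async
rpl_semi_sync_master_timeout = 100000000
# keep waiting for ACKs even if the semisync slave count drops to zero
rpl_semi_sync_master_wait_no_slave = 1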

Facebook Enhancements to Semi-Synchronous Replication

We spent a lot of time testing Semi-Synchronous replication in 2013. We found some S1 bugs, serious performance problems, and some administration issues. Our MySQL Engineering and Performance teams worked on fixing the issues, and finally our Operations team deployed Semisync in production.

Here are our major enhancements.

Backporting Loss-Less Semisync from 5.7

As described above, Loss-Less Semisync is needed to prevent data loss and phantom reads, so we backported the Loss-Less Semisync patch from official MySQL 5.7 to our Facebook MySQL 5.6 branch. It will be merged into the WebScaleSQL branch soon.

Interestingly, when we tested semisync performance, loss-less semisync gave better throughput than normal semisync, especially with a large number of clients. Normal semisync caused more mutex contention, which was alleviated with loss-less semisync. Since loss-less semisync has the better data protection mechanism, we concluded there was no reason to use normal semisync.

Semisync mysqlbinlog

Starting from MySQL 5.6, mysqlbinlog supports remote binlog backups via --raw and --read-from-remote-server. For remote binlog backups, mysqlbinlog works like a MySQL slave: it connects to a master, executes the BINLOG DUMP command, then receives binlog events via the MySQL replication protocol. This is useful when you want to take backups of the master's binary logs, since a slave's relay logs and binary logs are not identical to the master's binary logs and can't directly be used as backups of them.
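For example, a remote binlog backup can be started roughly like this (host, user, and binlog file name are placeholders; --stop-never keeps streaming new events like a slave would):

mysqlbinlog --read-from-remote-server --raw --stop-never \
  --host=master.example.com --user=repl --password \
  mysqld-bin.000001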

We extended mysqlbinlog to speak the Semisync protocol. The reason for the enhancement is that we wanted to use "semisync mysqlbinlog" as a replacement for local semisync slaves. We usually run slaves in remote datacenters, and we don't always need local slaves to serve read requests / provide redundancy. On the other hand, as described in the "Requirements for Semisync Deployment" section above, in practice at least two local semisync readers are needed to make semisync work. We didn't want to run two additional dedicated slaves per master just for semisync. So we invented semisync mysqlbinlog and use it instead of semisync slaves, as shown in the figure below.

Compared to a semisync slave, semisync mysqlbinlog has a lot of efficiency wins.

- A semisync slave has lots of CPU overhead, such as query parsing and making optimizer plans. semisync mysqlbinlog does not have such overhead.
- A semisync slave writes 2x (relay log and binary log). semisync mysqlbinlog writes the binary log only.
- On a semisync slave, the way relay logs are written is not efficient. The IO thread writes to the kernel buffer for each binlog event. A regular auto-committed transaction consists of three binlog events (query BEGIN, query body, and commit XID). When using InnoDB only, writing to the kernel buffer once per XID event is enough (though this does not cover DDL), which cuts the frequency of kernel buffer writes to less than 1/3. semisync mysqlbinlog can easily do such optimizations. We have not done it yet, but it is even possible to make mysqlbinlog send back the ACK before writing to a file, and that extension is easy.
- A slave has contention between the SQL thread and IO thread, so the IO thread itself slows down, which slows down semisync master throughput too. semisync mysqlbinlog does not have such overhead because there is no SQL thread.

With a mysqlbinlog reader, the master failover steps become a bit tricky, because mysqlbinlog is not a mysqld process and doesn't accept MySQL commands such as CHANGE MASTER. When doing master failover, it is highly likely that one of the local mysqlbinlog readers has the latest binary log events, and those events have to be applied to the new master. The new MHA version (0.56) supports this.

In this configuration, the mysqlbinlog processes need to be highly available. If all semisync mysqlbinlog processes are down, semisync either stops or suffers from long wait times.

Reducing plugin_lock mutex contention

Prior to MySQL 5.6.17, there was a performance bug where transaction commit throughput dropped significantly when there were many non-semisync slaves or binlog readers, even if there were only a few semisync readers. On typical deployments, there are two or three semisync readers and multiple non-semisync readers, so the performance drop with many non-semisync readers was annoying.
The performance drop was caused by the "plugin_lock" MySQL internal mutex on the master. For those who don't know, semisync is a plugin in MySQL, and it's not installed by default. The plugin_lock mutex was needed by semisync binlog dump threads only, but the mutex was actually held by all binlog dump threads. We looked into the problem further.
First we tried replacing the plugin_lock mutex with a read/write lock. It did not help much; Linux profiling tools showed that plugin_lock still caused contention. During profiling, we learned that most/all glibc rw-locks have an internal lock (a mutex-like thing) on which threads can stall. The pattern is: get the lock, get exclusive access to the cache line to modify the data, release the lock. This was relatively expensive for plugin_lock, since it doesn't do any expensive I/O inside the critical section.

So switching plugin_lock to a read/write lock was actually a bad idea. Instead, we needed to avoid the plugin related locking below as much as possible. There are four major plugin related lock operations in MySQL.
- plugin_lock
- plugin_lock_list
- plugin_unlock
- plugin_unlock_list

We also noticed that the Delegate classes use read/write locks, and they caused very hot contention (especially Binlog_transmit_delegate::lock). The read/write lock protects a list, so switching to a lock-free list was probably possible. BTW, we noticed that performance_schema did not collect mutex statistics on the locks in the Delegate classes (bug#70577).

The real problem was that all of the above locks were held not only by semisync binlog readers, but also by non-semisync binlog readers.

Based on the above factors, we concluded that removing all of the plugin locks was not easy, so we decided to optimize by taking these locks for semisync binlog readers only, and not for non-semisync binlog readers. Below is a benchmark result.

The x-axis is the number of non-semisync binlog readers, and the y-axis is concurrent INSERT throughput from 100 clients. The number of semisync binlog readers was always 1 to 3. Detailed benchmark conditions are described in the bug report.
Happily, our patches were finally merged into 5.6.17 and 5.7, so everybody can get the benefits easily.

With all of the enhancements, we could get pretty good benchmark results with semisync.

This is a mysqlslap insert benchmark on the master, with one semisync slave/mysqlbinlog running. The x-axis is the number of clients, and the y-axis is the number of inserts on the master. "Enhanced" means loss-less semisync.
"Normal slave" is a traditional (non-semisync) slave. "Enhanced mysqlbinlog" is our semisync usage pattern. As you can see, loss-less semisync beats normal semisync thanks to the reduced internal mutex contention, and semisync mysqlbinlog beats the semisync slave because of its much lower overhead. This shows that loss-less semisync scales pretty well.

Conclusion and Future Plans

After several performance improvements, Semi-Synchronous replication became good enough for us. From a performance point of view, I expect single-threaded application performance to be the next low-hanging fruit. On our benchmarks, we got around ~2500 transaction commits per second with semisync (0.4ms per commit). Without semisync, it was easy to get ~10000 transaction commits per second (0.1ms per commit). Of course semisync adds RTT overhead, but on a local datacenter network, the RTT is much lower than 0.3ms. I think there is some other semisync overhead here, so I will revisit this issue and work with the Oracle Replication team and outside experts.

Monday, March 31, 2014

I released MHA version 0.56 today. Downloads are available here. MHA 0.56 includes the features below.

Supporting MySQL 5.6 GTID. If GTID and auto position are enabled, MHA automatically does failover with GTID SQL syntax, not traditional relay log based failover. You don't need any explicit configuration within MHA to use GTID based failover.

Supporting MySQL 5.6 Multi-Threaded slave

Supporting MySQL 5.6 binlog checksum

MHA supports a new section, [binlogN]. In a binlog section, you can define mysqlbinlog streaming servers. When MHA does GTID based failover, MHA checks the binlog servers, and if the binlog servers are ahead of other slaves, MHA applies the differential binlog events from the binlog server to the new master before recovery. When MHA does non-GTID based (traditional) failover, MHA ignores binlog servers. More details can be found in the documentation.
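For illustration, an MHA application config sketch with a binlog server section (hostnames and the workdir are placeholders; see the documentation for the authoritative parameter list):

[server default]
manager_workdir=/var/log/masterha/app1

[server1]
hostname=db-master1

[server2]
hostname=db-slave1

[binlog1]
hostname=binlog-reader1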

Supporting custom mysql and mysqlbinlog location

Adding ping_type=INSERT for checking connectivity to the master. This is useful if the master does not accept any writes (i.e. due to disk errors)

Many people from Facebook are speaking at Percona Live this year. Please take a look at the interview from Percona to see what we are going to talk about.

I assume many of my blog subscribers have already heard about WebScaleSQL, which was announced this morning. The MySQL Conference in April is the biggest MySQL conference in the world, so it's perfect timing to release something and collaborate with experts. I hope to meet many people there.

Thursday, March 13, 2014

There is a relatively new, platform dependent flushing function called sync_file_range(). Some databases (not MySQL) use sync_file_range() internally.
Recently I investigated stall issues caused by buffered writes and sync_file_range(). I learned a lot during the investigation, but I don't think these behaviors are well known to the public, so here I summarize my understanding.

Understanding differences between sync_file_range() and fsync()/fdatasync()

sync_file_range() has some important behavior differences from fsync().

sync_file_range() has flags to flush to disk asynchronously, while fsync() always flushes to disk synchronously.
sync_file_range(SYNC_FILE_RANGE_WRITE) does async writes (async sync_file_range()); sync_file_range(SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER) does sync writes (sync sync_file_range()). With async sync_file_range(), you can *usually* call sync_file_range() very quickly and let Linux flush the pages to disk later. As I describe later, async sync_file_range() is actually not always asynchronous, and is sometimes blocked on writeback. It is also important to note that I/O errors are not reported back when using async sync_file_range().
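A minimal C sketch of the two call patterns (error handling omitted):

#define _GNU_SOURCE
#include <fcntl.h>

/* async: ask the kernel to start writeback of [off, off+len) and return immediately */
static int flush_async(int fd, off_t off, off_t len) {
    return sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
}

/* sync: start writeback and wait until those pages are written to disk
   (this still does not flush file metadata, unlike fsync()) */
static int flush_sync(int fd, off_t off, off_t len) {
    return sync_file_range(fd, off, len,
                           SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
}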

sync_file_range() lets you specify the file range (starting offset and size) to flush to disk, while fsync() always flushes all dirty pages of the file. Ranges are rounded to the page size. For example, sync_file_range(fd, 100, 300) will flush from offset 0 to 4096 (flushing page #1), not just offsets 100 to 300, because the minimum I/O unit is a page.

sync_file_range(SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER) does not wait for metadata flushing. fsync() waits until both data and metadata are flushed. fdatasync() skips flushing metadata if the file size does not change (fsync() also skips flushing metadata in that case, depending on the filesystem). sync_file_range() does not wait for metadata flushing even when the file size changes. So if a file is not being overwritten (i.e. it is being appended), sync_file_range() does not guarantee the file can be recovered after a crash, while fsync()/fdatasync() do.

sync_file_range() behavior highly depends on kernel version and filesystem.

xfs does neighbor page flushing in addition to the specified ranges. For example, sync_file_range(fd, 8192, 16384) does not only trigger flushing of pages #3 to #4, but may also flush many more dirty pages (i.e. up to page #16). This works very well for HDDs because the I/O unit size becomes bigger. In general, synchronously writing 1MB 1,000 times is much faster than writing 4KB 256,000 times. ext3 and ext4 don't do neighbor page flushing.

sync_file_range() is generally faster than fsync() because it can control dirty page ranges and skips waiting for metadata flushing. But sync_file_range() can't be used for guaranteeing durability, especially when file size changes.

The practical use of sync_file_range() is where you don't need full durability but you want to control (reduce) dirty pages. For example, Facebook's HBase uses sync_file_range() for compactions and HLog writes. HBase does not need full durability (fsync()) per write, because HBase relies on HDFS, and HDFS can recover from its replicas. Compactions write huge volumes of data, so periodically calling sync_file_range() makes sense to avoid burst writes. Calling sync_file_range() for 1MB 1,000 times periodically gives a more stable workload than flushing 1GB at one time. RocksDB also uses sync_file_range().
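A sketch of that periodic flushing pattern in C (the chunk size and the lack of error handling are for brevity; buf holds the data to append):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define CHUNK (1 << 20)  /* flush in 1MB units */

/* append len bytes via buffered write(), starting writeback every 1MB,
   instead of letting 1GB of dirty pages flush to disk in one burst */
static void append_with_periodic_flush(int fd, const char *buf, size_t len, off_t file_off) {
    size_t done = 0;
    while (done < len) {
        size_t n = (len - done > CHUNK) ? CHUNK : len - done;
        write(fd, buf + done, n);  /* buffered write, dirties the page cache */
        sync_file_range(fd, file_off + done, n, SYNC_FILE_RANGE_WRITE);  /* async writeback of this chunk */
        done += n;
    }
}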

Async sync_file_range is not always asynchronous

Sometimes you might want to flush pages/files earlier than the kernel threads (bdflush) would, in order to avoid burst writes. fsync() and sync sync_file_range() (sync_file_range(SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER)) can be used for that purpose, but both take a long time (~10ms) on HDD if the RAID write cache is disabled. You probably don't want to execute them from a user-facing thread.
How about using async sync_file_range() (sync_file_range(SYNC_FILE_RANGE_WRITE)) from a user-facing thread? It's supposed to not wait for I/O, so latency should be minimal. But I don't recommend using sync_file_range() from a user-facing thread like that. It is actually not always asynchronous, and there are many cases where it blocks waiting for disk I/O.
I'll show a couple of examples where async sync_file_range() takes a long time. In the following examples, I assume stable page writes are already disabled.

Stall Example 1: Small range sync_file_range()

In example 1, with the stable page write fix, write() won't wait for dirty pages to be written to disk (writeback). But sync_file_range() actually waits for writeback.
When stable page writes are disabled, it is possible for a page to be under writeback and marked dirty at the same time. Below is an example scenario.
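A sketch of that kind of sequence in C (the offsets and sizes are made up; the writes are deliberately smaller than one page):

write(fd, buf, 1000);                                 /* page #1 becomes dirty */
sync_file_range(fd, 0, 4096, SYNC_FILE_RANGE_WRITE);  /* writeback of page #1 starts */
write(fd, buf, 1000);                                 /* page #1 is dirty again while still under writeback */
sync_file_range(fd, 0, 4096, SYNC_FILE_RANGE_WRITE);  /* blocks until the first writeback completes */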

In this case, the second sync_file_range(SYNC_FILE_RANGE_WRITE) is blocked until the flush to disk triggered by the first sync_file_range() is done, which may take tens of milliseconds.
Here is an example stack trace when sync_file_range() is blocked.

Stall example 2: Bulk sync_file_range()

What happens if we call write() multiple times and then call sync_file_range(SYNC_FILE_RANGE_WRITE) for multiple pages at once? In the example below, write() is called 21 times, then the flush is triggered by sync_file_range().
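Roughly like this in C (the 1000-byte appends are made-up numbers; 21 x 1000 bytes spans pages #1 through #6):

/* round 1: 21 small appends dirty pages #1-#6, then start writeback */
for (int i = 0; i < 21; i++)
    write(fd, buf, 1000);
sync_file_range(fd, 0, 21000, SYNC_FILE_RANGE_WRITE);

/* round 2: more appends; page #6 was only partially filled above, so it gets
   dirty again while its writeback may still be in progress, and this call can block */
for (int i = 0; i < 21; i++)
    write(fd, buf, 1000);
sync_file_range(fd, 21000, 21000, SYNC_FILE_RANGE_WRITE);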

Unfortunately, sync_file_range() may take time in this case too.
It works as follows on xfs. Since xfs does neighbor page flushing via sync_file_range(), it is possible for a page to be both under writeback and marked dirty at the same time.

Note that if the write volume (and overall disk busy rate) is low enough relative to disk speed, page #6 should be flushed to disk before the second sync_file_range() starts. In that case it shouldn't wait for anything.

Stall example 3: Aligned page writes

The main reason async sync_file_range() was blocked in the examples above is that write() was not aligned to the page size. What if we do fully aligned page writes (writing in 4KB multiples)?
With aligned page writes, async sync_file_range() does not wait as shown in Examples 1 and 2, and gives much better throughput. But even with aligned page writes, async sync_file_range() sometimes waits on disk I/O.
sync_file_range() submits page write I/O requests to the disk. If there are many outstanding read/write requests in the disk queue, new I/O requests are blocked until a free slot becomes available in the queue. This blocks sync_file_range() too.
The queue size is managed via /sys/block/sdX/queue/nr_requests. You may increase it to a larger value.

echo 1024 > /sys/block/sda/queue/nr_requests

This mitigates stalls in sync_file_range() on busy disks, but it won't solve the problem entirely. If you submit many more write I/O requests, read requests take longer to serve (writes starving reads), which very negatively affects user-facing query latency.

Solution for the stalls

Make sure to use a Linux kernel that supports disabling stable page writes; otherwise write() itself can be blocked. My previous post covers this topic. sync_file_range(SYNC_FILE_RANGE_WRITE) is supposed to be asynchronous, but it actually blocks on writeback in many patterns, so I don't recommend calling sync_file_range() from a user-facing thread if you really care about latency. Calling sync_file_range() from a background (non-user-facing) thread is the better solution here.
Buffered writes and sync_file_range() are important for some databases like HBase and RocksDB. For HBase/Hadoop, using JBOD is one of the well known best practices. HLog writes are buffered, and not flushed to disk per write (put operation). There are some HBase/Hadoop distributions that support sync_file_range() to reduce outstanding dirty pages. From the operating system's point of view, HLog files are appended, and the file size is not small (64MB by default). This means all HLog writes go to a single disk in a JBOD configuration, so that disk tends to be overloaded. An overloaded disk takes longer to flush dirty pages (via sync_file_range or bdflush), which may block further sync_file_range() calls. To get better latency, it is important to use a Linux kernel that can disable stable page writes, and to call sync_file_range() from background threads (not from user-facing threads).

About Me

I am a database engineer at Facebook.
Before joining Facebook, I was a principal database and infrastructure architect at DeNA. My primary responsibility at DeNA was to make our database infrastructure more reliable, faster and more scalable. Before joining DeNA, I worked at MySQL/Sun/Oracle as a lead MySQL consultant in APAC for four years.
You can contact me on Yoshinori.Matsunobu_at_gmail.com (replace _at_ with @).