Monday, December 16, 2013

At Facebook, we have upgraded most of MySQL database tiers to 5.6, except very few tiers that have a special requirement -- very fast single threaded replication speed.

As Oli mentioned, single threaded performance is worse in 5.6. The regression is actually not visible in most cases. For remote clients, the performance regression is almost negligible because network latency is longer than 5.1->5.6 overhead. If clients are running locally but MySQL server is disk i/o bound, the overhead is negligible too because disk i/o latency is much longer than 5.1->5.6 overhead.

But the regression is obvious when clients run locally and queries are CPU bound. The most well known local client program for MySQL is SQL Thread (Replication Thread). Yes, 5.6 has a slower replication performance problem, if SQL thread is CPU bound.

If all of the following conditions are met, SQL thread in 5.6 is significantly slower than 5.1/5.5.

Almost all working sets fit in memory = Almost all INSERT/UPDATE/DELETE statements are done without reading disks

Very heavily inserted/updated and committed (7000~15000 commits/s)

5.6 Multi-threaded slave can not be used (due to application side restrictions etc)

You want to use crash safe slave feature and optionally GTID

Here is a micro-benchmark result. Please refer a bug report for details (how to repeat etc).

Fig 1. Single thread replication speed (commits per second)

On this graph, 5.1, 5.5, and 5.6 FILE are not crash safe. fb5.1 and 5.6 TABLE are crash safe. 5.6p has some experimental patches. Crash safe means even if slave mysqld/OS are crashed, you can recover the slave and continue replication without restoring MySQL databases. To make crash safe slave work, you have to use InnoDB storage engine only, and in 5.6 you need to set relay_log_info_repository=TABLE and relay_log_recovery=1. Durability (sync_binlog = 1 and innodb_flush_log_at_trx_commit = 1) is NOT required.

There are a few observations from the benchmark.

17% drop from 5.1.69 to 5.5.33. I did not profile 5.5 much, but I suspect the regression was mostly caused by Metadata Locking feature introduced from 5.5.

From 5.5.33 to 5.6.15(FILE), 20% drop if log-slave-updates was enabled, 18% drop if log-slave-updates was disabled. In 5.6, FILE based replication has an inefficiency bug that it writes to OS filesystem cache too often (bug#70342). 5.6.15p(FILE) applied the fix for the bug, but was still 15% lower than 5.5.33 if log-slave-updates was enabled, 2.5% lower if log-slave-updates was disabled. This shows writing to binlog (from single threaded SQL thread) in 5.6 takes more time than 5.1/5.5.

Performance drop for 5.6.15 TABLE was serious. Compared to fb5.1 (crash safe), performance drop was 51.2% (14520/s->6103/s). slave_relay_log_info table is opened/fetched/updated/closed per every transaction commit. By keeping the table opened and skipping fetching row (updating only), performance improved to 7528/s (with log-slave-updates) and 9993/s (without log-slave-updates), but these were still much lower than fb5.1 (12505/s and 17056/s).

How can we fix and/or avoid the problem?
From MySQL development point of view, there are some room for performance optimizations.

Writing to slave_relay_log_info table should be optimized more. Currently the table is opened/fetched/updated/closed via Storage Engine API for each commit. It will be more efficient by updating the table via Embedded InnoDB API, which is currently used from InnoDB memcached plugin. slave_relay_log_info is a system table, and is updated by SQL thread (and MTS worker threads) only, so some tricky optimizations can be done here.

Writing to binlog should be faster in 5.6. 5.6 added many features such as group commit, so it is not surprising to get lower single threaded performance without any optimization.

5.1 to 5.5 performance drop was likely caused by Metadata Locking feature. I have not profiled in depth yet, but I think some functions (i.e. MDL_context::acquire_lock()) can be refactored to get better performance.

From DBA point of view, there are couple of workarounds. First of all, the regression won't happen in most cases. Most application data should be much larger than RAM, and SQL thread is doing many I/O, so CPU overhead is negligible. Even if working sets fit in memory, many applications don't need over 7000 commits per second on single instance. If you don't need high commit speed, you won't hit the problem either.
But some applications may really hit the problem -- a kind of queuing application is a typical example. At Facebook we are continuing to use 5.1 for such applications.
If you want to use 5.6 for such applications, you may disable crash safe slave, because usually database size is very small. If database size is small enough (i.e. < 200GB), you can easily restore from other running master/slaves without taking hours. If you don't need crash safe slave feature, you can use FILE based (default) replication, which is still slower than 5.1/5.5 (see "5.6.15 FILE" on the above graph) but is much better than TABLE based.
Also, Consider not using log-slave-updates. log-slave-udpates is needed when you use GTID. If you want master failover solution without GTID, MHA is a good enough solution and it does not require log-slave-updates.

Thursday, October 10, 2013

At MySQL Connect 2013, I talked about how we used MySQL 5.6 at Facebook, and explained some of new features we added to our Facebook MySQL 5.6 source tree. In this post, I'm going to talk about how we made full table scan faster in InnoDB.

Faster full table scan in InnoDB

In general, almost all queries from applications are using indexes, and reading very few rows (0..1 on primary key lookups and 0..hundreds on range scans). But sometimes
we run full table scans. Typical full table scan examples are logical backups (mysqldump) and online schema changes (SELECT ... INTO OUTFILE).

We take logical backups by mysqldump at Facebook. As you know
MySQL offers both physical and logical backup commands/utilities.
Logical backup has some advantages against physical backup. For example:

Logical backup size is much smaller. 3x-10x size difference is not uncommon.

Relatively easier to parse backups. On physical backup, if we can't restore a database by some serious reasons such as checksum failure, it is very difficult to dig into InnoDB internal data structures and fix corruptions. We trust logical backups rather than physical backups

Major drawbacks of the logical backup are that full backup and full
restore are much slower than physical backup/restore.

The slowness of full logical backup often causes problems. It may
take very long time if database size grows a lot and tables are fragmented.
At Facebook, we suffered from mysqldump performance problem that we couldn't finish full logical backup within reasonable amount of time on some HDD and
Flashcache based servers. We knew that InnoDB wasn't efficient at
full table scan because InnoDB did not actually do sequential reads,
but did random reads mostly. This was a known issue for long years. As database storage capacity has been growing, the slow full table scan issue has been getting serious
to us.
So we decided to enhance InnoDB to do faster sequential reads.
Finally our Database Engineering team implemented "Logical Readahead"
feature in InnoDB. With logical readahead, our full table scan speed
improved 9~10 times than before under usual production workloads. Under
heavy production workloads, full
table scan speed became 15~20 times faster.

Issues of full table scan on large, fragmented tables

When doing full table scan, InnoDB scans pages and rows by primary key order. This applies to all InnoDB tables, including fragmented tables. If primary key pages (pages where primary keys and rows are stored) are not fragmented, full table scan is pretty fast because reading order is close to physical storage order. This is similar to reading files by OS command (dd/cat/etc) like below.

You may find that even on commodity HDD servers, you can read more than 100MB/s multiplied by "number of drives" from disks. Over 1GB/s is not uncommon.

Unfortunately, in many cases primary key pages are fragmented. For example, if you need to manage user_id and object_id mappings, primary key will be (user_id, object_id). Insert ordering does not match with user_id ordering, so new inserts/updates very often cause page splits. New split pages will be allocated at far from current pages. This means pages get fragmented.

If primary key pages are fragmented, full table scan becomes very slow. Fig 1 illustrates the problem. After InnoDB reads leaf page #3, it has to read page #5230, and after that it has to read page #4. Page #5230 is far from page #3 and #4, so disk read ordering becomes almost random, not sequential. It is very well known that random reads on HDD is much slower than sequential reads. One very effective approach to improve random reads is using SSD, but per-GB cost on SSD is still more expensive than HDD so using SSD is not always possible.

Fig 1. Full table scan is not actually doing sequential reads

Does Linear Read Ahead really help?

InnoDB supports prefetching feature called "Linear Read Ahead". With linear read ahead, InnoDB reads an extent (64 consecutive pages: 1MB if not compressed) at one time if N pages are accessed sequentially (N can be configured by innodb_read_ahead_threshold parameter, default is 56). But actually this does not help so much. One extent (64 pages) is very small range. For most large fragmented tables, next page does not exist in the same extent. See above fig 1 for example. After reading page#3, InnoDB needs to read page#5230. Page#3 does not exist in the same extent as #5230, so linear read ahead won't help here. This is pretty common case for large fragmented tables. That's why Linear read ahead does not help much to improve full table scan performance.

Optimization approach 1: Physical Read Ahead

As described above, the problem of slow full table scan was because InnoDB did mostly random reads. To make it faster, making InnoDB do sequential reads was needed. The first approach I came up with was creating an UDF (Use Defined Function) to read ibd file (InnoDB data file) sequentially. After executing the UDF, pages in the ibd file should be within InnoDB buffer pool, so no random read happens when doing full table scan. Here is an example usage.

buf_warmup() is a udf that reads entire ibd file of database "db1", table "large_table". It takes time to read the entire ibd file from disk, but reads are sequential so much faster than random reads. When I tested, I could get ~5x overall faster time than normal linear read ahead.
This proved that sequentially reading ibd files helped to improve throughput, but there were a couple of disadvantages:

If table size is bigger than InnoDB buffer pool size, this approach does not work

Reading entire ibd file means that not only primary key pages, but also secondary index pages and unused pages have to be read and cached into InnoDB buffer pool, even though only primary key pages are needed for full table scan. This is very inefficient if you have many secondary indexes.

Applications have to be changed to call UDF.

This looked "good enough" solution, but our database engineering team came up with a better solution called "Logical Read Ahead", so we didn't choose UDF approach.

Reading rows by primary key order (same as usual full table scan, but buffer pool hit rate should be very high)

This is illustrated at fig 2 below.

Fig 2: Logical Read Ahead

Logical Read Ahead (LRA) solved issues of Physical Read Ahead. LRA enables InnoDB to read only primary key pages (not reading seconday index pages), and to prefetch configurable number of pages (not entire table) at one time. And LRA does not require any SQL syntax changes.
We added two new session variables to make LRA work. One is "innodb_lra_size" to control how many MBs to prefetch leaf pages. The other is "innodb_lra_sleep" session variable to control how many milliseconds to sleep per prefetch. We tested around 512MB ~ 4096MB prefetch size and 50 milliseconds sleep, and so far we haven't encountered any serious (such as crash/stall/inconsistency) problem. These session variables should be set only when doing full table scan. In our case, mysqldump and some utility scripts (i.e. online schema change tool) turn logical read ahead on.

Submitting multiple async I/O requests at once

Another performance issue we noticed was that i/o unit size in InnoDB was only one page, even if doing readahead. 16KB i/o unit size is too small for sequential reads,
and much less efficient than larger I/O unit size.
In 5.6, InnoDB uses Linux Native I/O (aio) by default. If submitting multiple consecutive 16KB read requests at once, Linux internally can merge requests and reads can be done efficiently. But unfortunately InnoDB submitted only one page i/o request at once. I filed a bug report #68659. As written in the bug report, on modern HDD RAID 1+0 environment, I could get more than 1000MB/s disk reads by submitting 64 consecutive pages requests at once, while I got only 160MB/s disk reads by submitting one page request.
To make Logical Read Ahead work better on our environments, we fixed this issue. On our MySQL, InnoDB submits many more page i/o requests before calling io_submit().

Source code

Conclusion

InnoDB was not efficient for full table scan, so we fixed it. We did two enhancements, one was implementing logical read ahead, the other was submitting multiple async read i/o requests at once. We have seen 8 to 18 times performance improvements on our production tables, and this has been very helpful to reduce our backup time, schema change time, etc. I hope this feature will be supported in InnoDB by Oracle officially, or at least by major MySQL fork products.

I understand almost all applications don't need 20000~30000 connections per second per instance, but latency increase should be considered. It would be better to make performance_schema parameter dynamic, and set 0 as default.

At HA session I'll speak about lessons learned at DeNA and Facebook.. I have used MHA at both companies, in both fully automated and not fully automated manner. We are also heavily testing GTID in 5.6 so probably I can share some practices as well. I'll speak about this at MySQL and Cloud Database Solutions Day on April 26, too.

At data reduction session, as session title suggests I'll talk about many data reduction techniques I have done so far and/or are planning to do at Facebook. Data reduction is very important for us. SSD / Pure Flash are still expensive, and "big data" costs a lot for such as backups, network transfers, etc.

About Me

I am a database engineer at Facebook.
Before joining Facebook, I was a principal database and infrastructure architect at DeNA. My primary responsibility at DeNA was to make our database infrastructure more reliable, faster and more scalable. Before joining DeNA, I worked at MySQL/Sun/Oracle as a lead MySQL consultant in APAC for four years.
You can contact me on Yoshinori.Matsunobu_at_gmail.com (replace _at_ with @).