Jeremy Cole on MySQL Replication

Jeremy Cole recently posted very nice post about MySQL Replication Performance. There are however few points I should comment on.

Jeremy Speaks about MyISAM and Innodb in the same terms, in fact they are quite different for Replication. As you probably know MyISAM uses buffered IO so if master crashes you’re out of sync anyway – whatever sync_binlog option was set to. Well it is still a bit better with sync_binlog enabled as slave could have consistent data after it catches up but master may have some of its tables out of sync with binary log and it is not easy to find which.

I’m surprised to read enabling binary log drops update performance. I guess Jeremy speaks about enabling binary log and setting sync_binlog option which is good for safe replication. But it is setting which is off by default so you should not see such effect just by enabling log_bin.

Innodb tables also have to flush logs on transaction commit unless you disable it so you should not be getting 1500 update transactions per second from the single thread without battery backed up cache. If you do it is quite possible Operation system is faking fsync() for you. In certain cases it only does so for sequential writes this is why sync_binlog changes things. Two logs are synced now which requires seeks which make certain disks to flush their cache.

Watch out for MySQL 5.0 – if you enabled Binary log with MySQL 5.0 you loose group commit which can dramatically increase commit rate for multiple user load. See this post by Vadim for benchmark results.

Solution with Battery Backed up RAID is great one and is not that expensive these days. Other two probably just create conditions for Operation System to fake fsync. You need to get data to the disk and you can’t physically get data to the disk surface at 1500 operarions per second. Jeremy is speaking about seeks but it is only part of the latency. Disk rotation is another big contibutor. Even if we stay on the same track we’ll be only able to do single disk write per rotation, which is about 250 for 15000 RPM drive. If you’re getting more without battery backed up cache something is likely faking it.

I also should mention this covers only Master overhead of Replication. Slave is where you can expect more problems, especially on high end systems. Slave serializes all operations to single stream so group commit can’t work. Furthermore even if you run with innodb_flush_logs_at_trx_commit=2 you still do not solve all problems. You still will practically have only One CPU to handle replication load. Furthermore if you have multiple disks they will not be used effectively as replication SQL thread will typically submit IO requests to the disk one by one so only one drive will do the work and other will be idle. You mostly really affected by this if you have a lot of writes to large database size.
This is one more reason why you might want to go with scale out and keep single server a box with few CPUs and few hard drives while running MySQL.

Related

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master’s Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

12 Comments

Sorry for being unclear on a few points. You’re right abuot sync_binlog. I am mainly discussing setting it to sync_binlog=1, but I neglected to mention that in the post. I’ll update it!

To give some context, my current research in this area is with regard to using replication for failover/disaster recovery, where sync_binlog=1 is basically a requirement.

Regarding transaction rates, in a single-threaded environment, you’re right, but when multiple threads are writing, the actual disk writes may be batched together, so you can get higher write throughput.

Additionally, those numbers are the averages, and depending on workload it’s possible to get much better than average throughput.

For a busy server writing small binary log events constantly or near constantly, to an ext3 filesystem set with data=journal, I think that’s nearly the optimal workload for the disk, without a write cache.

Yes with multiple threads there is group commit to allow to commit several transactions with single write… which however does not work with MySQL 5.0 If you’re using sync_binlog with MyISAM I guess there is no group commit ether.

I understnd the number are aproximations. It does not really matter if you’re saying 1000 or 5000 update transactions from single thread. My main point is – at such high rates unless you have battery backed up cache it is not 100% secure is someone is tricking you as drive can’t write that fast.

It however could be reasonably secure – ie it will survive MySQL crash and even OS cache as the data is in Disk drive cache. In certain cases it even may survive power failures as there is some energy spared in condensators. I would not rely on it however.

Speaking about data=journal, I’m quite curios – did you have a chance to compare it to data=writeback ?

Hello,
I was at the MySQL conference in May and attended Jeremy’s presentation on replication. My understanding was that all “writes” to the binary log were protected by a mutex–so you may have 100s of threads updating different tables/rows, but if sync_binlog = 1 (which is a requirement for true ACID integrity)–doesn’t this mean that in the end all writes (to the binary log) are “effectively” single-threaded?
Another question along this line is with regards to innodb_flush_log_at_trx_commit = 1? Is it also protected by a mutex and is it somehow “coordinated” with the write to the binary log?
Peter, it would be great if you could provide some high-level explanation of how InnoDB updates are propagated between the in-memory-buffer–>innodb-log-files–>innodb “tablespace” (.ibd) file–>and finally the binary-log.

Yes you’re right. Binary log is protected by mutex which makes it very expensive to do sync_binlog=1 unless you have battery backed up cache and if your OS does not lie you about fsyncs. It is simple code which does not even have “group commit” allowing to flush several events with single write to disk. Yes this means log writes are going to be serialized. It is not exactly single threaded as flushing the log is probably only portion of what thread does but it can be significant.

Innodb also has log mutex which protects log writes. In MySQL 4.1 and MySQL 5.0 without binary logging it however can do group commit, meaning it can commit several transactions which are waiting to be commited at the same time with single log write. This is currently broken in 5.0 if you enable binary log.

In general data migration in Innodb happens as following (which is traditional way for most of database management systems) – once transaction is commited the data is recorded in Innodb log files before COMMIT returns. The data pages are however modified only in the buffer pool. Dirty pages are flushed from the buffer pool in background. For Innodb flushes happen when log files are almost full and Innodb needs to overwrite some old log records (remember log files in Innodb are circular) or if there are too many dirty buffers in the buffer pool (see innodb_max_dirty_pages_pct variable).

Hello Peter,
Thanks so much for the prompt response.
If I understand your response correctly regarding the “flushing” of the InnoDB buffer pool, do you mean that data in the buffer pool (memory) is written directly to the tablespace files (.ibds)? So the sequence is something like the following:
START TRANSACTION (implicit or not)
UPDATE . . .. (modify some rows in the buffer pool (in memory)
COMMIT–>write (a delta?) to innodb-log-file AND
write statements to binary (replication) log AND THEN
. . . . do whatever is necessary because of multi-versioning in buffer pool (and mark these rows as dirty)

Then at some later time, as part of a background process, write the “dirty” pages in the buffer pool to the .ibd files
I’m sure I’m leaving some things out, but is that essentially the sequence? I had assumed that the innodb-log-files were “flushed” at some point to the tablespace files, just because in case of a crash before all buffered dirty pages were flushed, then InnoDB has to use its log files to synchronize the tablespace files because obviously anything in memory would have been lost, but it seems like you’re suggesting that the log files are not used to synchronize the tablespace files EXCEPT in the case of a crash?

I guess the problem was with you can’t embed html tags in the comment, sorry–i’ll try again

Also, as you mentioned the issue of group commit being broken (see http://www.mysqlperformanceblog.com/2006/05/19/group-commit-and-xa), I read that entry with interest, and what I took from it was that you should disable xa-support in my.cnf if you
1. don’t need it and
2. have binary logging (replication) enabled
your benchmarks suggested that by doing that (even if group commit is “broken”) you can double your write throughput

However, i stumbled upon this in the official mySQL 5.0 documentation:
“Even with sync_binlog set to 1, there is still the chance of an inconsistency between the table content and binary log content in case of a crash. For example, if you are using InnoDB tables and the MySQL server processes a COMMIT statement, it writes the whole transaction to the binary log and then commits this transaction into InnoDB. If the server crashes between those two operations, the transaction is rolled back by InnoDB at restart but still exists in the binary log. This problem can be solved with the –innodb-safe-binlog option, which adds consistency between the content of InnoDB tables and the binary log. (Note: –innodb-safe-binlog is unneeded as of MySQL 5.0; it was made obsolete by the introduction of XA transaction support.)”

I’m not sure what to make of that. XA support is traditionally for 2-phase commits between two different databases (or other transactional resources), i wouldn’t think that the binary log ITSELF is XA-compliant:)

I guess what I’m really asking is, if you have enabled binary logging (replication), sync_binlog = 1, and you’re using MySQL 5, and it’s the only transactional resource you’re using, is it safe to DISABLE innodb_xa_support? i.e. will the potential extra “dirty” statement remain in the replication log? I guess I don’t understand how enabling xa-support would resolve this, and as your benchmarks indicate, it potentially makes a HUGE difference (group commit or not).
Sorry for the length of this post, but I think you can see it’s a liitle bit confusing:)

1) Your understanding about flushing is aproximately correct. The data is modified directly in the buffer pool pages, which becomes “dirty” if there are some unflushed changes and later these pages will need to be flushed in full. This actually happens via one more “physical” log file which is called “double write buffer” to handle problem of partial page writes. The log file is structured differently it stores records which apply to individual page modification operation perfomed rather than full pages.

Unless database is staying idle database on the disk is pretty much always out of sync. Only in case of shutdown they are synchronozed (this is why it can take a while). They are also synchronized by crash recovery.

2) Right in MySQL binary log file has to be XA source as well… otherwise it can run out of sync. “Traditional” databases would not normally have many storage engines each with their own logs. In MySQL 5.0 you can disable innodb_xa_support and you can get binary log possibly out of sync with innodb transactional log. However it will not make group commit to work. Group commit only works in MySQL 5.0 if binary log is disabled at all.

Take a look at these benchmarks: http://www.mysqlperformanceblog.com/2006/05/19/group-commit-and-xa/ – disabling XA saves you from extra fsync which allows to double results but results still do not improve with many concurrent threads. This would be sight of good working group commit.

Peter,
I will be inheriting 25+ db servers (M->S replication) with over 800GB of data, 1000+ of writes per minute, all in MyISAM tables using version 4.1. Side note: A project will be underway to move to InnoDB. However in the mean time, I need to ensure we won’t loss data or at least keep it to minimum in the event of a crash. And if I understand you correctly, the best I can do is:
1. use battery backed up cache
2. set sync_binlog = 1

I’m planning to run with log-bin and sync_binlog=1. I’m also planning to do a full database backup (FLUSH TABLES WITH READ LOCK followed by a cp) once per day, along with a FLUSH LOGS. In the event of a crash, shouldn’t I be able to restore all data (both MyISAM and all committed InnoDB transactions) by just restoring the last full database backup and then applying the binary logs with mysqlbinlog? If so, why do I need the InnoDB logs, ib_logfile0 and ib_logfile1? Is there a way to disable them so they are not even written?

[…] around this are related to battery-backed disk cache, which you can read a bit more about in Jeremy Cole’s post on MySQL replication. You can also see some handy benchmarks that compare MySQL with and without […]