Tuesday, January 8, 2019

Care when changing MASTER_DELAY with CHANGE MASTER TO (delayed replication)

A few days ago, I stepped into a trap ! This made me lose time for fixing things (and even more for writing this post...). In the hope that you will avoid my mistake, I am sharing this war story. I also obviously opened a bug, more about this below.

TL&DR: be careful when using CHANGE MASTER TO MASTER_DELAY = N: it might wipe your relay logs !
As written in the TL&DR, running CHANGE MASTER TO MASTER_DELAY = N wipes the relay logs except if... This is a documented feature. This is what we can read in the manual:

Prior to MySQL 5.7.4, CHANGE MASTER TO deletes all relay log files and starts a new one, unless you specify RELAY_LOG_FILE or RELAY_LOG_POS.

In MySQL 5.7.4 and later, relay logs are preserved if at least one of the slave SQL thread and the slave I/O thread is running; if both threads are stopped, all relay log files are deleted unless at least one of RELAY_LOG_FILE or RELAY_LOG_POS is specified.

It is not because it is written in the manual that it makes this right. If the behaviour of a software, documented or not, is setting traps for users, this is a bug ! This is what happened to me a few days ago:

I had a crash of a MySQL intermediate master which corrupted the binary logs (I do not give details about the corruption, you can read one of my last post or come to my FOSDEM talk for details).

All the slaves of this intermediate master (let's call it X) had downloaded up to the same position in the binary logs, so I regrouped all the up-to-date slaves under a single slave (let's call it Y), and invoked dark magic and voodoo (mysqlbinlog) to re-slave Y to the master or X.

I also had delayed slaves replicating from X, but I could not easily re-slave them to Y because this replication chain is not using GTIDs. As these were my only delayed slaves, I did not want to process all their relay log yet, so I waited until they processed most of their relay logs (I should have been more patient...) and I set their MASTER_DELAY to 0. This was a mistake...

I ran STOP SLAVE;
CHANGE MASTER TO MASTER_DELAY = 0;
START SLAVE; this was a mistake...

This was a mistake because I did not think that this would wipe my relay logs. Also, to make things worse, I had purged the relay logs from X as they were corrupted (I kept a copy, but it was not easily usable). I also stopped the IO Thread on the delayed slaves because I did not want them to reconnect to their master where I purged binary logs. This created the perfect situation to make my delayed slave close to unrecoverable (I could have recovered them with dark magic and voodo, but re-cloning them was easier).

From what I gathered reading the documentation (cited above), I would have had the expected result if the IO Thread was still running (only if using MySQL 5.7.4+, so does not apply to 5.6), or if I would have specified RELAY_LOG_FILE and RELAY_LOG_POS. I reported Bug#93843: Setting MASTER_DELAY to 0 on a stopped delayed slave wipes the relay-log. Please subscribe to it and click the affect me link if you think this should be fixed.

Some could argue that my bug is actually a feature request, but as I said above and IMHO, a trap in a software, even if documented, should be considered a bug. Feel free to try changing my mind by commenting below and/or sharing a use case for changing MASTER_DELAY alone needs purging the relay logs.

Update 2019-01-10: I wrote "this trap was not removed" in paragraph above. This might be a little hard as this was not identified as a trap in Bug#87249. Even if Bug#93843 was closed as a duplicated of Bug#87249, the trap/bug has been acknowledge in Bug#93843.