How to lag a slave behind to avoid a disaster

MySQL replication is useful and easy to set up, and it is used for very different purposes. For example:

splitting reads and writes

running data mining or reporting processes on a slave

disaster recovery

It is important to mention that a replication slave is not a backup by itself. A mistake on the master, such as a DROP DATABASE or an UPDATE without a WHERE clause, will be replicated almost instantly to all slave servers, so just having a slave is not going to be helpful here. How can we avoid that kind of mistake? By having a slave server that lags behind.

pt-slave-delay

pt-slave-delay is another tool that comes with Percona Toolkit, and it is very easy to use. It works by stopping and starting the SQL thread in order to achieve the lag that you have specified on the command line. Let's see an example:
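An invocation along these lines keeps a slave a fixed amount of time behind its master (the hostname and user here are placeholders, not from the original post; adjust them for your own slave):

```shell
# Keep the slave roughly 10 minutes behind the master, checking every
# 15 seconds whether the SQL thread should be stopped or restarted.
# --host/--user are example connection options; --ask-pass prompts
# for the password instead of putting it on the command line.
pt-slave-delay --delay=10m --interval=15s \
    --host=slave1.example.com --user=root --ask-pass
```

The tool runs in the foreground and prints each time it stops or starts the slave's SQL thread; stop it with Ctrl-C (by default it restarts the SQL thread on exit).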

With --delay=10m I'm asking the tool to keep the slave 10 minutes behind. The rest of the options just specify how to connect to the slave. Piece of cake! If you run the tool to lag a slave, for example, 2 hours behind, you would have enough time to stop replication and skip an offending query executed by mistake on the master.
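The recovery sketched above could look roughly like this on the delayed slave, assuming you noticed the mistake before the delay window expired and that the bad statement is the next event in the relay log (a simplified sketch, not a full recovery procedure):

```shell
# Make sure the SQL thread is stopped so the bad statement
# (e.g. the accidental DROP DATABASE) is not applied.
mysql -e "STOP SLAVE SQL_THREAD;"

# Skip exactly one replication event -- here assumed to be the
# offending statement. Verify with SHOW SLAVE STATUS and the relay
# logs before skipping anything in a real incident.
mysql -e "SET GLOBAL sql_slave_skip_counter = 1;"

# Resume replication; the slave continues past the mistake.
mysql -e "START SLAVE SQL_THREAD;"
```

Note that if pt-slave-delay is still running it manages the SQL thread itself, so you would stop the tool first before taking manual control of replication.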

Conclusion

We can improve the security and availability of our infrastructure by using a delayed replication slave. It is useful not only for disaster recovery but also for testing how our application deals with lagged replication.

About Miguel Angel Nieto

Miguel joined Percona in October 2011. He has worked as a System Administrator for a Free Software consultancy and in the support department of the biggest hosting company in Spain. His current focus is improving MySQL and helping the Free Software community grow. Miguel's roles inside Percona are Senior Support Engineer and Manager of the EMEA Support Team.

Comments

Sadly, you have to know how long you want a delay for. An hour? 4 days? I’ve had developers who dropped tables and databases who came to me a week later (and in one case a month later!) and asked if there was any way to retrieve the missing data.

And there was, due to backups. I always make logical backups (exports with mysqldump) and physical backups, so that if I just need to restore one database or table, I can. I tried using pt-slave-delay and did all the work to set it up, and all the work to ensure that monitoring checked that replication wasn’t completely broken, just on a 4-hour delay. It’s a difficult problem to solve. What if replication breaks entirely? How are you monitoring that?

In a previous job this tool was absolutely essential (under its previous mk- guise). We had a developer who was fond of dropping databases and tables in production without checking with sysadmin first, and unfortunately someone we couldn’t block access to for various reasons. We ran a 24 hour lagged slave as an essential way of staying sane.

This is one of my favorite pt/mk tools, probably not unreasonable to say data poisoning, (particularly admin error) is one of the more frequent causes of bad/lost data. I’ve seen it first hand in the form of a senior linux admin mistakenly working in the wrong environment, ‘poof’ half a table disappeared from the master and 3 real-time slaves in a production environment. Quite a headache to recover from backups and the binlog, a delayed slave makes life a lot easier. I’ve found a delay of 4 hours usually covers the most serious cases where either the admin realizes the mistake immediately, and/or it’s readily apparent from the behavior of the environment.

However, as noted above, a slave (even delayed) is not a silver bullet, if it’s past the delay time, you’re out of luck, conversely the longer you set the delay, the more you’ll need to catch up, along with the fact that the delay can be unpredictable (single-threaded execution and I/O lag). Goes without saying multiple logical/physical backups are a must.

Sheeri, are there any issues pt-heartbeat doesn’t solve regarding the monitoring issue?

Mike – in fact, that’s how we dealt with the monitoring issue. But pt-heartbeat doesn’t stay up, iirc whenever we restarted mysql we had to make sure pt-heartbeat was started. It’s another thing to worry about. But yeah, it worked, and when pt-heartbeat died, we got paged, so at least there are no SILENT failures.

Frankly, I’ve found that one time having a painful recovery leads to it happening a LOT less frequently. Making it easy to recover doesn’t train people nearly as effectively.

Is MySQL’s ‘START SLAVE UNTIL master_log_file, master_log_pos’ reliable enough to sync up two slaves with their master?

I have a setup now with a master and 2 slaves off of it; let’s call them slave1 and slave2. I want to swing slave2 behind slave1 to make it chain replication: master => slave1 => slave2. The only option I see is to stop the slave thread on slave2, wait a few seconds to make sure the master log position on slave1 is ahead of slave2, then stop slave1, take the master positions from it, and run START SLAVE UNTIL with those positions on slave2 to bring the two in sync, so I can then make the chain replication work without breaking replication.