Wednesday, August 13, 2014

SBR vs RBR when using On Duplicate Key Update for High Availability

The other day, I was discussing high availability (HA) with other people. The discussion was going in the direction of, as the application was only using INSERT ... ON DUPLICATE KEY UPDATE (IODKU) and traffic could be replayed, promoting a slave in case of the failure of the master was easy: replaying traffic would make& all the slaves converge back to a common state. It looks as this could work but it might not be as simple. Before going into the details, let's take a step back and look into HA in general.

tl;dr: be careful when using row-based replication when replicating IODKU as slaves will stop replicating if a row is present on the master and missing on slave (or vice-versa).

High Availability by Promoting a Slave as the new Master

Let's suppose we are working with this replication topology composed of 1 master and 3 slaves:

In the general case, when we want to promote one of B, C or D as the new master after the failure of A, the main problem is that those 3 slaves might be inconsistent:

B might be 1 transaction late from A,

C might be 2 transactions late from A,

and D might be 3 transactions late from A.

The solution to this problem is to level all slaves before promoting one of them as the new master. Once all slaves are levelled, promoting a new master is easy:

choose any slave as the new master,

note the binary log position of the chosen slave (SHOW MASTER STATUS),

place the other nodes as slaves of the new master at the position noted in the previous step.

Many solutions exist to level all the slaves, some of them are:

MHA: using the relay logs of the most up-to-date slave (B in our example) to bring the other nodes at its level,

GTIDs: using the binary logs of the most up-to-date slave to bring the other nodes at its level,

Binlog Tailer with Semi-Sync: using mysqlbinlog to keep a backup of the master's binary logs and using those to bring the slaves at the level of the failed master (copying the missing binlogs to the slaves and running them locally),

Binlog Server: keeping a copy of the master's binary logs on a proxy and replicating through this proxy (the slaves converge automatically in case of a failure of the master).

IODKUs (or REPLACEs) for High Availability

The other situation I am presenting in the introduction, and that is very application-dependent, is:

skip the leveling of the slaves,

and let the application make sure that all slaves converge back to a consistent state.

Using IODKUs (or REPLACEs) and replaying the last transactions (if this can be done by our application), all the slaves would become consistent:

I think you can still use IODKUs (or REPLACEs) as an HA strategy. If you use SBR, everything looks fine. But be careful not to later switch to RBR without using one of these solutions in case of the failure of the master:

Clean all slaves of transactions that will be replayed (level down) before promoting a master.

Replay the transactions on all slaves (level up) before promoting the a new master.

After promoting a slave as the new master, replay transactions in SBR before switching back to RBR.

Is any of you using IODKUs (or REPLACEs) as an HA strategy ? If yes, I would be interested to ear your use-case and learn from your experience. Feel free to leave a comment below.

Wishlist

We saw above that IODKUs (and REPLACEs) statements are lost in RBR, but should this be the case ? Maybe keeping the information that a Write_rows or an Update_rows event was generated by a IODKU (or a REPLACE) would be useful. With that, the slaves could execute the events as a true IODKUs (or REPLACEs). It is probably suitable that, by default, an IODKU (or a REPLACE) event breaks replication so a dba can investigate the source of the problem. Then, he could decide to skip or execute the event if suited (a new GLOBAL variable, similar to sql_slave_skip_counter, would be needed: sql_slave_row_event_force).

And in the more general case, maybe a RBR recovery mode would be useful. In this mode:

Write_rows would be run as UPDATE when the row is present,

Update_rows would be run as INSERT when the row is missing,

Delete_rows would not fail if the row is absent.

What do you think about those 2 proposed features (IODKU / REPLACE in RBR events and RBR recovery mode), would you like them to be integrated in MySQL ? Feel free to leave comments about those below.