Wednesday, August 16, 2017

The danger of no Primary Key when replicating in RBR (and a partial protection with MariaDB 10.1)

TL;DR: unless you know what you are doing, you should always have a primary key on your tables when replicating in RBR (and maybe even all the time).TL;DR2: MariaDB 10.1 has an interesting way to protect against missing a primary key (innodb_force_primary_key) but it could be improved.

A few weeks ago, I was called off hours because replication delay on all the slaves from a replication chain was high and growing. It was not the first time this happened on that chain, so I thought right away that this was probably an UPDATE or DELETE of many rows on a table without a primary key. Let's see what is the problem with this and to understand that, we have to talk about binary log formats.
There are two main binary log formats: Statement Based [Replication] (SBR) and Row-Based [Replication] (RBR). There is also the Mixed format but I will not talk about it here. There are also RBR subtypes (full, minimal and noblob) and proposed RBR subtypes (Lossless and another by Facebook which I do not find a reference to, but I think to remember that it was an interesting improvement on minimal RBR).

SBR binary logs contain the statement that was run on the master and a few other things to make sure the execution of that statement will produce the same result on slaves as on the master. RBR binary logs contain the actual fields of the rows that are inserted, updated or deleted. For a longer description, you will have to read the section Replication Formats of the manual.

When applying an Update_row or a Delete_row event on a slave (this is how events are called in RBR binary logs), the SQL thread must find (fetch) the updated or deleted row and it must execute the corresponding operation. For a single statement that updates or deletes Y rows on the master, the slave (SQL thread) will have to execute Y operations. If the table has a primary key (or another index, like a unique key, that makes the fetch efficient), the SQL thread is able to run those Y operations quickly. However, if there is no such key, each of those Y operations will need to scan the table. With a table of size N, this will be a O(Y * N) process. On a big table (when N is large) and even if Y is relatively small, this might take a very long time.

To illustrate that, I deployed a simple replication chain using MySQL 5.7 with one master and one slave and I used full RBR as the binary log format for the master. I created two tables, one without and one with a primary key, and I filled both tables with 50,000 rows.

When running the below DELETE statement on the master, both execute quickly. The one on the table with a primary key did not generate any noticeable replication delay on the slave. However, the same statement on the table without a primary key generated noticeable replication delay on the slave.

Below is the graph showing the replication delay on the slave by observing the Seconds_Behind_Master field of SHOW SLAVE STATUS output. For executing 4970 Delete_row events, even if the corresponding statement took less than a second to run on the master, it took about 90 seconds on the slave. Not having a primary key greatly penalized replication speed in this case.

To look in more details at what is happening on the slave, I used the excellent ps-top utility developed by my colleague Simon Mudd. This tool is a top-like visualization of performance schema. If we look at the view table_io_latency of ps-top while the slave is running the DELETE without a primary key, we have the following:

So, if you are planning to perform large UPDATE or DELETE operations on a master that is using RBR, not having a primary key on your tables will cause big problems in replication. I guess that from now on, you will try to always have a primary key on your tables. But you might forget, or the next person joining your team might not know that it is important to always have a primary key. This brings us to the protection implemented in MariaDB 10.1.

With MariaDB 10.1, you can enable innodb_force_primary_key which reduces the probability of having this kind of problem. I am writing that the problem is reduced and not completely solved because there are some cases that are not covered by this solution (I will write another post about this).

Before finishing this post, let's mention a few things:

This drawback of RBR is not new, it has already been discussed many times...