An algorithm to find and resolve data differences between MySQL tables

I’ve been designing an algorithm to resolve data differences between MySQL
tables, specifically so I can ‘patch’ a replication replica that has gotten
slightly out of sync without completely re-initializing it. I intend to create
a tool that can identify which rows are different and bring them into sync. I
would like your thoughts on this.

Background and requirements

I see this as the next step in my recent series of posts on MySQL tools and
techniques to keep replication running reliably and smoothly. Sometimes replicas
“drift” a little bit, even when there don’t seem to be any issues with
replication (this is one reason I submitted a bug report to add checksums on
binlog events).
Once a table differs on the replica, it gets more and more different from the
master, possibly causing other tables to differ too.

I need a tool that, given a table known to differ on master and replica(s), will
efficiently compare the tables and resolve the differences. Finding tables
that differ is easy with MySQL Table
Checksum,
but I am not sure the best way to find which rows differ.

Here are my requirements. The algorithm needs to be:

Designed for statement-based replication, which means no temp tables, no expensive queries that will propagate to the replica, and so forth.

Efficient in terms of network load and server load, both when finding and when resolving differences. No huge tables or un-indexed data, no high-impact INSERT.. SELECT locking, etc.

Efficient on the client-side where the tool is executed.

Must work well on “very large” tables.

Some things I assume:

Tables must have primary keys. Without primary keys, it’s hard or a waste of time at best, and a disaster at worst.

It is not a good idea to do this unless the fraction of rows that differ is very small. If much of the table is different, then mysqldump is a better idea.

Other tools I’ve found

I’ve found a number of tools that are either not complete or don’t quite
address the need, but reading the source code has been very productive.
There’s Giuseppe Maxia’s work in remote MySQL table
comparison. I
based the MySQL Table Checksum tool on some of this work. Read the comments on
that link, and you’ll see some follow-up from Fabien Coelho, who wrote
pg_comparator. The
documentation for this tool is an excellent read, as it goes into great detail
on the algorithm used.

There are also a few projects that don’t do what I’m looking for.
datadiff does a two-way
in-server comparison of two tables with OUTER JOIN, a fine technique but
inherently limited to two tables on one server, and not practical for
extremely large tables.
coldiff is
a more specialized variant of that tool. mysqldiff diffs the structure of two tables,
which I mention for completeness though it is not the problem I’m trying to
solve.

The Maxia/Coelho bottom-up algorithm

Without restating everything these smart people have written, here’s a high-
level overview of the algorithm as presented by Maxia and Coelho:

Compute a “folding factor” based on the number of rows in the table and/or user parameters.

Build successive levels of checksum tables bottom-up, starting at a row-level granularity and decreasing granularity by the “folding factor” with each level, until the final table has a single row.

Each row in the first table contains key column(s), a checksum of the key column(s), and a checksum of the whole row.

Each row in an intermediate-level summary table contains checksums for a group of rows in the next more granular level of summary data.

Groups are defined by taking checksums from the previous level modulo the folding factor.

Beginning at the most aggregated level, walk the “tree” looking for the differences, honing in eventually to the offending rows.

The “folding factor” is really a “branching factor” for the tree of summary
tables. If the factor is 128, each level in an intermediate summary table will
contain the groupwise checksum of about 128 rows in the next most granular
level summary table.

This algorithm has many strengths. For example, it uses a logarithmic search
to find rows that differ. It makes no assumptions about key distributions; the
modulo operation on the checksum should randomize the distribution of which
rows need to be fixed. It’s also very generic, which means it works pretty
much the same on all tables. There’s no need to think about the “best way to
do it” on a given table.

I am concerned about a few things, though. There’s a lot of data in all these
summary tables. The first summary table contains as many rows as the table to
analyze. If I were to calculate and store these rows for a table with lots of
relatively narrow rows, I might be better off just copying the whole table
from one server to the other. Also, creating these tables is not replication-
friendly; the queries that run on the master will run on the replica too. This
might not be a problem for everyone, but it would not be acceptable for my
purposes.

The second part of the algorithm, walking the “tree” of summary tables to find
rows that differ, doesn’t use any indexes in the implementations I’ve seen.
Suppose I have a table with 128 million rows I want to analyze on two servers,
using a branching factor of 128 (the default). The first checksum table has
128 million rows; the second has 1 million, and so on. Repeated scans on these
tables will be inefficient, and given the randomization caused by the summary
checksums, will cause lots of random I/O. Indexes could be added on the
checksum modulo branching factor, but that’s another column, plus an index
– this makes the table even bigger.

The checksum/modulo approach has another weakness. It defeats any
optimizations I might be able to make based on knowledge of where in the table
the rows differ. If the differences are grouped at the end of the table, for
example in an append-only table that just missed a few inserts on the replica,
the algorithm will distribute the “pointers” to these corrupt rows randomly
through the summary tables, even though the rows really live near each other.
Likewise, if my table contains client data and only one client is bad, the
same situation will happen. This is a major issue, especially in some large
tables I work with where we do things a client or account at a time. These and
other spatial and temporal locality scenarios are realistic, because lots of
real data is unevenly distributed. The checksum/modulo approach isn’t optimal
for this.

Finally, the bottom-up approach doesn’t allow for early optimization or
working in-memory. It builds the entire tree, then does the search. There’s no
chance to “prune” the tree or try to keep a small working set. The flip side
of this is actually a strength: assuming that the whole tree needs to be
built, bottom-up is optimal. But most of my data isn’t like that. If much of
the table is corrupt, I’m going to do a mysqldump instead, so I want to
optimize for cases where I’ll be able to prune the tree.

One solution: a top-down approach

Given that I won’t even be looking at a table unless the global checksum has
already found it differs, I am considering the following top-down approach, or
some variation thereof:

Generate groupwise checksums for the whole table in a top-level grouping (more on that later).

If more than a certain fraction of the groups differ, quit. Too much of the table is different.

Otherwise descend depth-first into each group that has differences.

I think this algorithm, with some tuning, will address most of my concerns
above. In particular, it will allow a smart DBA to specify how the grouping
and recursion should happen. The choice of grouping is actually the most
complicated part.

I’d do this client-side, not server-side. I’d generate the checksums server-
side, but then fetch them back to the client code and keep them in memory.
Given a good grouping, this shouldn’t require much network traffic or memory
client-side, and will avoid locks, eliminate scratch tables, and keep the
queries from replicating.

In the best case, all other things being equal, it will require the server to
read about as many rows as the bottom-up approach, but it will exploit
locality – a client at a time, a day at a time, and so on. This is a huge
help, in my opinion; reducing random I/O is a high priority for me.

Given all this, I think top-down is better if there are not many changes to
resolve, or if they’re grouped tightly together.

Some of the weaknesses I see are complexity, a proliferation of recursion and
grouping strategies, perhaps more network traffic, and susceptibility to edge
cases. Whereas the bottom-up approach has identical best and worst cases for
different distributions of corrupt rows (assuming the number of corrupt rows
is constant), the top-down approach suffers if there’s no locality to exploit.
I’m a bit worried about edge cases causing this to happen more than I think it
ought to.

Finally, and this could be either a strength or weakness, this approach lets
every level of the recursion have a different branching factor, which might be
appropriate or not – the DBA needs to decide.

Smart grouping and recursion

I think the hardest part is choosing appropriate ways to group and “drill
down” into the table. Here are some possible strategies:

Date groupings. We have a lot of data in InnoDB tables with day-first or week-first primary keys, which as you know creates a day-first or week-first clustered index. The first checksum I’d run on these tables would be grouped by day.

Numeric groupings. Tables whose primary key is an auto-incremented number would probably be best grouped by division, for example, floor(id/5000) to group about 5000 neighboring rows together at a time.

Character groupings. If the primary key is a character string, I might group on the first few letters of the string.

Drill-down. Take for example one of our tables that is primary-keyed on IDs, which are auto-incremented numbers, and client account numbers. The best way to do the table I’m thinking of is by account number, then numerically within that on ID. For the day-first table, I’d group by day, then account number, and then by ID.

Exploit append-only tables. If a table is append-only, then corruption is likely in the most recent data, and I might try to examine only that part of the table. If there are updates and deletes to existing rows, this approach might not work.

Use defaults if the DBA doesn’t specify anything. If there’s a multi-column primary key, recurse one column at a time. If a single-column key, look for another key whose cardinality is less, and recurse from that to the primary key instead.

I think the DBA will have to choose the best strategy on a table-by-table
basis, because I can’t think of a good automatic way to do it. Even analyzing
the index structures on the table, and then trying to decide which are good
choices, is too risky to do automatically. For example, SHOW INDEX will show
estimated index cardinalities, but they’re based on random dives into the
index tree and can be off by an order of magnitude or more.

How to resolve the differences

Again assuming that this reconciliation is taking place between a master and
replica server, it’s important to fix the rows without causing more trouble
while the fixing happens. For example, I don’t want to do something that’ll
propagate to another replica that’s okay, and thereby mess it up, too.

Fixing the rows on the master, and letting the fixes propagate to the replica
via the normal means, might actually be a good idea. If a row doesn’t exist or
is different on the replica, REPLACE or INSERT .. ON DUPLICATE KEY UPDATE
should fix the row on the replica without altering it on the master. If the row
exists on the replica but not the master, DELETE on the master should delete
it on the replica.

Peripheral benefits of this approach are that I don’t need to set up an
account with write privileges on the replica. Also, if more than one replica has
troubles with the same rows, this should fix them all at the same time.

Issues I need to research are whether the different number of rows affected on
the replica will cause trouble, and if this can be solved with a temporary
slave-skip-errors setting. The manual may document this, but I can’t
find it.

Next steps

I’m looking forward to your feedback, and then I plan to build a tool that’ll
implement whatever algorithm emerges from that discussion. At this point,
assuming the above algorithm is as good as we can come up with together, I’m
planning to actually implement both top-down and bottom-up approaches in the
tool, so the DBA can decide what to use. The tool will, like the rest of the
scripts in the MySQL
Toolkit, be
command-line friendly (there are lots of proprietary “visual tools” to compare
and sync tables, but they don’t interest me – plus, why would I ever trust
customer data to something I can’t see source code for?). I also understand
that not everyone has the same narrowly-defined use case of re-syncing a
slave, so of course I’ll make the tool more generic.

For my own use, ideally I’ll be making sure the tool is rock-solid, then
defining rules for tables that frequently drift, and running a cron job to
automatically find which tables are different and fix them. If the MySQL
Table
Checksum
tool finds a table is out of sync and I don’t have a rule for it, it’ll just
notify me and not try to fix it.

Summary

In this article I proposed some ideas for a top-down, in-client, replication-
centric way to compare a table known to differ on a master and replica, find the
rows that differ, and resolve them. I’m thinking about building a tool to
implement this algorithm, and would like your feedback on efficient ways to do
this.

I'm Baron Schwartz, the founder and CEO of VividCortex. I am the author of High
Performance MySQL and lots of open-source software for performance analysis, monitoring, and system administration.
I contribute to various database communities such as Oracle, PostgreSQL, Redis and MongoDB. More about me.