Wednesday, December 16, 2009

In the August 2004 issue of SysAdmin there is an article about one of my favorite subjects: remote table comparison applied to MySQL databases.

The source code is actually working code (not as documented as I would have liked, though, but space was not unlimited) with which you can compare two tables in two remote databases and see whether they differ at all; if they do, there are methods to find out which rows are different, without copying the data from one host to the other.
How this works is explained in detail in the article, so I won't repeat it here. However, since the article deals mostly with database issues, I would like to spend some words on the supporting Perl code, which does the magic of creating the not-so-obvious SQL queries that carry out the task. I won't examine the most complex script, the one that finds the detailed differences between tables, because I would need to introduce too much background knowledge to explain it in full. Instead, I would like to present a smaller function that tells you whether two tables differ at all, so that you can take further action.
This is a slightly modified version of the source code published in the magazine for the same task. I am thinking about making a CPAN module out of the whole thing, but not right now.

Let's start with the algorithm used.

In short, if you want to compare two large records, you can make a signature of the whole record by joining its fields together and applying a CRC function, such as MD5 or SHA1.
Comparing a whole table is trickier, because in standard SQL, without stored procedures and cursors, obtaining the CRC of a range of records is far from trivial. One possibility, though, is to make a CRC for each record and then sum them up, to get a result that can be considered the table's signature.
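To make the idea concrete, here is a small sketch of the same principle. The article's real code is Perl generating SQL; this is plain Python, and the sample rows and the "/" separator are made up for illustration:

```python
import hashlib

# Hypothetical table contents: each record is a tuple of field values.
rows = [
    (1, "alpha", 10),
    (2, "beta", 20),
    (3, "gamma", 30),
]

def record_signature(row):
    # Join the fields into one string and take its MD5, much as the
    # SQL version does with MD5 applied to the concatenated fields.
    joined = "/".join(str(field) for field in row)
    return hashlib.md5(joined.encode()).hexdigest()

# Sum the per-record signatures (read as integers) to obtain a single
# number that can be considered the table's signature.
table_signature = sum(int(record_signature(row), 16) for row in rows)
```

Two tables with identical contents produce the same signature, and changing any field in any record changes it, which is all we need for a "different or not" verdict.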

The SQL part is complicated by the fact that MySQL can't handle arithmetic operations on numbers as large as an MD5 signature. To get around this problem, I split the MD5 string into four chunks, using the SUBSTRING function, and convert each one from base 16 to base 10 using CONV. The result is a set of ordinary numbers that can be passed to SUM. The MD5 value is calculated once per record and assigned to a MySQL user variable (@CRC). It is the signature of a string composed of all the fields in the record, with some adjustments to keep NULL values from spoiling the equation.
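The chunk-and-convert trick can be mimicked in a few lines of Python. This is only a sketch: the eight-character chunk width follows from splitting a 32-character MD5 string into four parts, but the input values are invented:

```python
import hashlib

def md5_chunk_sums(values):
    # Split each value's 32-character MD5 into four 8-character chunks,
    # convert each chunk from base 16 to base 10 (what CONV(..., 16, 10)
    # does in SQL), and keep a running SUM per chunk position.
    sums = [0, 0, 0, 0]
    for value in values:
        digest = hashlib.md5(value.encode()).hexdigest()  # plays the role of @CRC
        for i in range(4):
            chunk = digest[i * 8:(i + 1) * 8]   # SUBSTRING(@CRC, i*8 + 1, 8)
            sums[i] += int(chunk, 16)           # CONV(chunk, 16, 10)
    return sums
```

Each chunk converts to a number below 2^32, so the running sums stay comfortably within MySQL's numeric range even for tables with millions of records.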
Perl's biggest involvement in all this, apart from building the query, which is a bitch to create manually, is the get_fields function, which reads the table structure and creates the list of fields to be passed to MD5. If a field is nullable, a call to the COALESCE function is used instead of its bare name.
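A minimal sketch of that idea follows. The column descriptions, the "/" separator, and the "#NULL#" placeholder are assumptions for illustration; the real get_fields is Perl code reading the actual table structure from the server:

```python
def concat_expression(columns):
    # columns: list of (name, nullable) pairs, as get_fields would derive
    # them from the table structure.  Nullable columns are wrapped in
    # COALESCE so that a single NULL doesn't nullify the whole
    # concatenation (and thus the whole record signature).
    parts = [
        f"COALESCE({name}, '#NULL#')" if nullable else name
        for name, nullable in columns
    ]
    return "MD5(CONCAT_WS('/', " + ", ".join(parts) + "))"
```

For example, `concat_expression([("id", False), ("name", True)])` yields `MD5(CONCAT_WS('/', id, COALESCE(name, '#NULL#')))`, ready to be embedded in the comparison query.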

With these two functions ready, we can actually run a test on two tables in two different hosts.
To be sure that the function works as advertised, this script actually creates the tables in both hosts. The first time the script runs, the newly created tables have the same contents, and the result will be "no differences". If you run it a second time, one record is altered, just barely, and the result will be different.

The first number is a simple record count. The second one is a string made by concatenating the four sums calculated on the MD5 chunks.
When we run the script a second time, column "j" in record nr. 50 is increased by 1, which produces the following result:

LOCAL : 100 208246712599204490057913196197913536230317654094
REMOTE: 100 211184068521202404576502196746869468230923726643
there are differences
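The final verdict is then just a comparison of those two values. Sketched in Python with the numbers above:

```python
# Each side is a (record count, concatenated chunk sums) pair, as
# printed in the LOCAL and REMOTE lines of the sample output.
local  = ("100", "208246712599204490057913196197913536230317654094")
remote = ("100", "211184068521202404576502196746869468230923726643")

verdict = "no differences" if local == remote else "there are differences"
print(verdict)  # → there are differences
```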

For the third run, we modify the source code, changing "set j = j+1" to "set j = j-1". This puts the altered record back to its original value, and the script once again reports that there are no differences.

Notice that what we actually get from the get_table_CRC function is a few dozen bytes, even if the tables contain a million records. The database server has some number crunching to do, but you don't need to send gigabytes of data across the network.
This technique can save you hours of searching if you need to know if two remote tables have differences. More granularity is provided by the other functions described in the article.

The existing literature mostly addressed how to find the differences between tables, not how to resolve them once found. I needed a tool that would not only find the differences efficiently, but would then resolve them. I first began thinking about how to improve the technique further in my article http://tinyurl.com/mysql-data-diff-algorithm, where I discussed a number of problems with the Maxia/Coelho "bottom-up" algorithm. After writing that article, I began writing this tool. I wanted to implement their algorithm with some improvements, so that I was sure I understood it completely. I discovered it is not what I thought it was, and it is considerably more complex than it appeared to me at first. Fabien Coelho was kind enough to answer some questions over email.