Data sometimes seem to have a life and will of their own, and they refuse to behave as we wish.
Then you need a firm hand to tame the wild data and turn them into quiet, obedient pets.

Monday, November 13, 2006

MySQL testing techniques: comparing tables

This is the first tutorial in a series dedicated to testing techniques. Soon all this material will find a home in a more appropriate place. In the meantime, enjoy the read!

When testing large tables, it is useful to check whether two tables have the same contents. For example, if you want to compare performance between two tables using different storage engines, or tables created on different filesystems, you must be sure that both tables hold the same content. Having loaded both from the same source is no guarantee that the contents are the same: a mistake, or different SQL modes used during the load, may result in substantial differences.

General concepts

So, you need to compare two, possibly very large, tables. There are several methods available. One is to run a query with a LEFT OUTER JOIN. However, this method is likely to take a very long time, or even exhaust your system resources, if your tables are really large. The method that I have been advocating for a long time is to compute a global CRC on both tables and then compare the results.

And, I hear you asking, how do you get a global table CRC? There is no predefined SQL feature for this task. Recent MyISAM tables have a built-in CRC, but you can't get it from a SELECT statement, and besides, if you need to compare the contents of such a table with one using a different engine, you are out of luck. So we need something more general, which can be applied to any table.

The first step towards a global CRC is to get a list of the columns that we can then pass to a CRC function such as SHA1 or MD5. This list is a string made of the names of the columns, which we pass to a CONCAT_WS function. However, if you know how SQL functions work, you will know that any NULL value in the list will nullify the whole expression. Therefore, we need to make sure that every nullable column is properly handled by a COALESCE function. The result of this operation, which we delegate to a stored function, is a safe list of columns.

The second step towards a global table CRC is to calculate a CRC for each record. We use the above list of columns to create a SELECT statement returning a SHA1 for each record in the table. But what to do with it? There is no aggregate SQL function available for SHA or MD5. Thus, we need to process the result and calculate our CRC manually. As noted in a previous post, we can do that in two ways: using cursors, or using a blackhole table. Some benchmarks show that the blackhole table is much faster than the cursor, so that is what we use. We start with an empty CRC. For each row, we compute a CRC of the whole record concatenated with the existing CRC.
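These two preparation steps can be sketched like this. The example is hypothetical, using a City table (as in the MySQL world sample database) whose column names are assumptions, with a '#NULL#' placeholder standing in for NULL values:

```sql
-- Per-row CRC for a hypothetical City table
-- (ID, Name, CountryCode, District, Population).
-- Every column that may be NULL is wrapped in COALESCE,
-- so a single NULL cannot nullify the whole CONCAT_WS expression.
SELECT ID,
       SHA1(
         CONCAT_WS('/',
           ID,
           COALESCE(Name,        '#NULL#'),
           COALESCE(CountryCode, '#NULL#'),
           COALESCE(District,    '#NULL#'),
           COALESCE(Population,  '#NULL#')
         )
       ) AS row_crc
FROM City;
```

In practice, the stored function mentioned above could build this COALESCE list automatically from the table's metadata, so that the same routine works on any table.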
Since we are using a SELECT statement, we need to get rid of the output, because we are only interested in the calculation stored in the user variable. For this purpose, a blackhole table is very well suited. At the end of the SELECT + INSERT operation, we have in hand two variables, one holding the row count and one holding the global CRC for the table. Repeating this process for the second table, we can then compare two simple values and determine at a glance whether we are dealing with comparable data sets.
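Putting it together, the accumulation might look like this (a sketch with assumed table and column names, not the exact routine):

```sql
-- The blackhole table swallows the rows, so the INSERT ... SELECT
-- produces no stored output: we only care about the side effects
-- on the user variables.
CREATE TABLE IF NOT EXISTS crc_sink
    (cnt INT, crc CHAR(40)) ENGINE = BLACKHOLE;

SET @cnt = 0, @crc = '';

INSERT INTO crc_sink
SELECT @cnt := @cnt + 1,
       -- chain each row's CRC into the running global CRC
       @crc := SHA1(CONCAT(@crc,
                 SHA1(CONCAT_WS('/', ID,
                        COALESCE(Name,        '#NULL#'),
                        COALESCE(CountryCode, '#NULL#'),
                        COALESCE(District,    '#NULL#'),
                        COALESCE(Population,  '#NULL#')))))
FROM City
ORDER BY ID;        -- a deterministic row order is essential

SELECT @cnt AS row_count, @crc AS table_crc;
```

Running the same statements against the second table and comparing the two `table_crc` values tells you at a glance whether the contents match.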

First of all, we create another table with the same structure as City, but using a different engine, and we store the data in a bizarre order, to see whether our routine is robust enough. In fact, our routine calculates the CRC after sorting the data by primary key, so there won't be any surprises.
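Such a twin table could be created like this (a hypothetical example; the InnoDB engine and the City2 name are arbitrary choices):

```sql
-- Same structure as City, different engine, rows inserted
-- in random physical order.
CREATE TABLE City2 LIKE City;
ALTER TABLE City2 ENGINE = InnoDB;
INSERT INTO City2 SELECT * FROM City ORDER BY RAND();
```

Because the CRC routine reads the rows ordered by the primary key, the shuffled insertion order has no effect on the result.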

5 comments:

So... the obvious weakness of this is that it can't tell you what or where the differences are. I've seen other examples of queries that can, but they all compare each value in every row, which can obviously be very slow and resource-intensive compared to this method.

Suppose I needed to compare two very large tables that are identical except for maybe 1-100 rows of data (some changed, some new, some deleted). I was wondering whether it would be quicker to use the technique you've shown here on the whole tables, then, if differences are found, split the tables in half and run it again on just the first half; if changes are found there, split it again and repeat, otherwise move on to the second half. Keep splitting the data into smaller sets to identify (maybe to within 1? 10? 100? rows?) where the changes actually are.

This way you can identify huge chunks of data that are exactly the same and don't need to be analyzed further, and small sections that contain changes, on which the other methods (compare each value) can be run far more efficiently.

I'm fairly new to SQL, so this sort of thing is a bit beyond my current capabilities, but I think this would be a really nice approach (in theory) to doing data comparisons.

The purpose of this technique is just to find out whether there are any differences. This is useful when you are comparing tables produced by replication, clustering, data copies, and so on.

However, if you need to find out exactly where the differences are, I wrote an article on this subject a few years ago, where I tried the approach that you describe and found it quite inefficient. There is a different algorithm you can use: take the CRC of small chunks of the table, grouped by a convenient column value, and then apply a binary search to the differing chunks only, until you find the single rows that you are after.
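As a sketch, the per-chunk CRC could be computed with a GROUP BY on the chosen column (hypothetical names again; note that GROUP_CONCAT is limited by group_concat_max_len, which may need to be raised for large chunks):

```sql
-- One CRC per chunk, keyed by CountryCode. Run the same query
-- on both tables and compare: only the chunks whose CRCs differ
-- need a row-by-row inspection.
SELECT CountryCode,
       COUNT(*) AS cnt,
       SHA1(GROUP_CONCAT(
            SHA1(CONCAT_WS('/', ID, Name, District, Population))
            ORDER BY ID)) AS chunk_crc
FROM City
GROUP BY CountryCode;
```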