Databases

Hashing Tables to Ensure Consistency in Postgres, Redshift and MySQL

Maintaining cache consistency

One of the main things Periscope does to speed up queries is maintain a cache of customer data. The cache is optimized to serve certain kinds of queries, and benefits from economies of scale.

Maintaining this cache leads us to a critical question: How do I know if the cache is still valid? Put another way: How do I know if the data in a table in database A matches the data in a table in database B?

Enter hashing, a general technique to detect if two datasets are the same. We can use a hash to validate that our cache is fresh without needing to understand any application-specific logic.

What we need now is a query that returns the exact same hash given the same table structure and data on all databases.

The algorithm

For each row:

Take the MD5 of each column. Use a space for NULL values.

Concatenate those results, and MD5 this result.

Split into 4 8-character hex strings.

Convert into 32-bit integers and sum.
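The four steps above can be sketched in Python (a minimal illustration of the algorithm, not the production implementation — note that matching a real database also requires rendering each value exactly as the SQL casts do):

```python
import hashlib

def md5_hex(s: str) -> str:
    """Hex digest of the MD5 of a string, like SQL's md5()."""
    return hashlib.md5(s.encode()).hexdigest()

def hash_row(columns):
    """Steps 1-2: MD5 each column (a space for NULL), concatenate, MD5 again."""
    concatenated = "".join(
        md5_hex(c) if c is not None else " " for c in columns
    )
    return md5_hex(concatenated)

def table_hash(rows):
    """Steps 3-4: split each row hash into four 8-character hex chunks,
    convert each chunk to an integer, and sum per position across rows."""
    sums = [0, 0, 0, 0]
    for row in rows:
        h = hash_row(row)
        for i in range(4):
            sums[i] += int(h[i * 8:(i + 1) * 8], 16)
    return sums
```

Because addition is commutative, the final four sums are independent of row order — two tables with the same rows in any order produce the same result.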

We choose MD5 as our hash function because it’s fast and supported on all databases.

We break the row hashes into integer-sized chunks and sum each chunk position across rows, reducing the whole table to just four bigints, in order to save memory and transfer time. Ideally we'd hash all the individual row hashes together, but this isn't possible on all databases.

Finally, note that we must convert the columns into the same format before encoding in step (1) to ensure cross-database consistency.

In Postgres

Taking the MD5 of a column looks like this:

md5("column name"::text)

Some extra massaging may be required for more complex types. Examples of integers, text columns, and datetime columns are below.

Now we’ll layer on spaces for NULL values:

coalesce(md5("column name"::text), ' ')

Concatenating and hashing those results is a simple matter:

select md5(
  coalesce(md5("column name"::text), ' ') ||
  coalesce(md5("second column name"::text), ' ')
) as "hash"
from "my_schema"."my_table"

We then wrap this all in a subquery so we can split the result into 4 8-character hex strings, which are each converted into 32-bit integers and summed.

As we add that in, we get the final query:

select
  sum(('x' || substring(hash, 1, 8))::bit(32)::bigint),
  sum(('x' || substring(hash, 9, 8))::bit(32)::bigint),
  sum(('x' || substring(hash, 17, 8))::bit(32)::bigint),
  sum(('x' || substring(hash, 25, 8))::bit(32)::bigint)
from (
  select md5(
    coalesce(md5("integer column"::text), ' ') ||
    coalesce(md5(floor(
      extract(epoch from "datetime column")
    )::text), ' ') ||
    coalesce(md5("string column"::text), ' ') ||
    coalesce(md5("boolean column"::integer::text), ' ')
  ) as "hash"
  from "my_schema"."my_table"
) as t;

Note the ‘x’ prepended to the hash strings, which tells Postgres to interpret them as hex strings when casting to a number.
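Worked through in Python with a hypothetical digest, the chunk-to-integer step is just hex parsing — `int(chunk, 16)` yields the unsigned 32-bit value, which should match the `bit(32)`-to-`bigint` cast (the 32 bits fit in a 64-bit bigint without sign extension):

```python
h = "0123456789abcdef0123456789abcdef"  # an example 32-character MD5 digest
chunks = [h[i:i + 8] for i in range(0, 32, 8)]
values = [int(c, 16) for c in chunks]   # like ('x' || chunk)::bit(32)::bigint
```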

In Redshift

Redshift supports the handy strtol function, making our hash-string-to-integer conversion a bit easier. Otherwise the full query is the same:

select
  sum(trunc(strtol(substring(hash, 1, 8), 16))),
  sum(trunc(strtol(substring(hash, 9, 8), 16))),
  sum(trunc(strtol(substring(hash, 17, 8), 16))),
  sum(trunc(strtol(substring(hash, 25, 8), 16)))
from (
  select md5(
    coalesce(md5("integer column"::text), ' ') ||
    coalesce(md5(floor(
      extract(epoch from "datetime column")
    )::text), ' ') ||
    coalesce(md5("string column"::text), ' ') ||
    coalesce(md5("boolean column"::integer::text), ' ')
  ) as "hash"
  from "my_schema"."my_table"
) as t;

In MySQL

MySQL sports a few changes from the Postgres and Redshift variants:

First, the syntax for casting many of the columns to helpful strings is different, e.g. for datetimes: