Hash distributing rows is a wonderful trick that I often apply. It forms one of the foundations for most scale-out architectures. It is therefore natural to ask which hash functions are most efficient, so we may chose intelligently between them.

In this blog post, I will benchmark the build in function in SQL Server. I will focus on answering two questions:

How fast is the hash function?

How well does the hash function spread data over a 32-bit integer space

I know there is also the question about how cryptographically safe the function is, but this is not a necessary property for scale-out purposes – hence, that aspect is out of scope for this blog.

For the picky people (my dear soul mates): I will be using 1K = 1024 in the following, so 64K = 65536.

Test data

Because we sometimes find ourselves hashing dimension keys, I will use a very simply table with a nice key range as my test input. Such a key range could for example have been generated by an IDENTITY(1,1) column or a SEQUENCER. Here is the test script that generates the input:

CREATE TABLE CustKey (SK INT NOT NULL)

INSERT INTO CustKey WITH (TABLOCK) (SK)SELECT n FROM dbo.fn_nums(10000000)

Speed of the Hash function

SQL Server exposes a series of hash functions that can be used to generate a hash based on one or more columns.

The most basic functions are CHECKSUMand BINARY_CHECKSUM. These two functions each take a column as input and outputs a 32-bit integer.

Inside SQL Server, you will also find the HASHBYTESfunction. This little gem can generate hashes using MD2, MD4, MD5, SHA and SHA1 algorithms. The problem for the purpose of our test is that these function spit out BINARY types, either 128 bits (for SHA) or 160 bits (for MD). However, we can quickly map that value into the integer space by doing a modulo (in SQL: %) with MaxInt (2**31 – 1). Interetingly, Modulo work directly on values of BINARY in SQL Server.

Before I move on the results, I would like share a little trick. When you need to get the raw CPU speed of the hash, you may be tempted to do something like this:

SELECT BINARY_CHECKSUM(SK) AS hFROM CustKey

But this is wrong. Why? Because you are both measuring the time it takes to hash as well as the time it takes to transfer the rows to the client. This is not very useful, on my laptop it takes over two minutes to run the above query.

Instead, you should try to measure the raw time the server needs to calculate the data. One way to achieve this (and this trick can be used on most queries you benchmark) is to wrap the SELECT statement in a MAX function. Like this:

Notice something else interesting, I am forcing parallelism down to 1. I want to get as close to the raw cost of the CPU as possible – so I don’t want to clutter the query with any overhead of parallelism. The query now runs in a little less than 3 seconds on my laptop.

I addition to the above trick, we also need to quantify the time SQL Server spends on just accessing the table. We can get a good approximation of this by measuring the SELECT statement without any hash function. On my laptop, I measured this to be around 3200 ms. In the results below, I have subtracted this runtime so we are only measuring the cost of the hashing.

With all this said, let us have a look at the results:

Some comments on the data:

Notice that MD2 is very inefficient. I have not studied this algorithm in great detail and would appreciate a specialist commenting here.

Apart from the MD2 anomaly, the HASHBYTES functions all performan pretty much the same. Using around 1 micro sec per hash value calculation

BINARY_CHECKSUM and CHECKSUM are MUCH faster, by about an order of magnitude

Doing a simple modulo that spreads out the data over the integer space is about the same speed as CHECKSUM and BINARY_CHECKSUM

It is not the cast to NVARCHAR that uses CPU time (HASHBYTES requires NVARCHAR input)

However, the checksum function do take longer to run when you first cast to NVARCHAR. Costing around 100 ns per INT value

Of course, the HASHBYTES functions also have other characteristics than speed which may be desired (cryptographic utility for example). but if you you want is speed, it looks like CHECKSUM and BINARY_CHECKSUM are faster.

But, let us see how good the functions are at spreading the data.

Spread of the Hash Function

Another desirable characteristic, apart from speed, of a hash function is how well it spreads values over the target bit space. This is useful not only for cryptographic purposes, but also to “bucket” data values into equal sized portions, for example in scale-out MPP architectures.

Often, you will not know in advance how many buckets you eventually want to subdivide the integer space into. You may start out with 32K buckets and divide up the integer interval in 128K (32-bit space divided by 128K = 32K) even sized buckets, each with 128K values in them:

Bucket0: [MinInt…MinInt + 128K[

Bucket1: [MinInt + 128K…MinInt + 128K*2[

Bucket2: [MinInt + 128K*2…MinInt + 128K*3[

… etc..

Bucket 32K: [MinInt + 128K*32K…MaxInt]

Perhaps you will later want a further subdivision into 64K buckets, each with 64K hash values.

Wise from the runtimes I measured before, I used only 1M rows with values [1…1M] for this test. However, this is plenty to show some good results, so don’t worry.

To avoid running chi-square testing (See later) of integer value spread all over the 32-bit space, I calculated the hash of each key and I then bucketed them into 64K bucket ranges (each with 64K values, neat isn’t it?).

Interestingly, this hypothesis makes a prediction (has to, if not we cannot falsify it): namely that in each bucket filled by hashing the 1M rows, we should expect to measure 1M / 64K = 15 members.

There is also an important twist here: We have to measure the content of each bucket, even when it is zero. This is why I use a left join in the query above, we have to measure all the results we know, not just count the hash buckets that actually get filled (you can see how not bucketing the data would have made the test result very large).

We can now test the hypothesis against the measured result. If we see correlation, then it gives us confidence in the correctness of the hypothesis, and lets us conclude that our hash function is a good one. It also gives us the ability to compare different hash functions with each other. We can test for correlation between the predicted result and the actual result with the chi-square test. Interestingly, the tools we need for this test is even exposed in Excel as the CHISQ series of functions.

Without further ado (and insults), here are the results:

(source data available on request)

Some observations about the data:

Most importantly: the CHECKSUM and BINARY_CHECKSUM functions are very poor hash functions. Even with 1M input and 64K large buckets, they don’t even fill the hash space. The skew is massive on non-skewed input data.

We also see that casting the INT to NVARCHAR improves the behavior of both BINARY_CHECKSUM and CHECKSUM, but not enough to make them good hash functions.

SHA and SHA1 seem to give us the best spread of the data, but at 1 microsecond per hash, it seems rather expensive.

Notice that the MD series do not give a particularly nice spread. I suspect this is related to the way I divide the data with MaxInt to get from the 120bit value to a 4-byte integer

We see that only the modulo does a distribution that gives us nearly 100% confidence (not fully though, because 1M does not divide 64K, which is reflected in the 824 value of Chi-squared). This result is expected, and my function here is engineered to “game the test”. The modulo cannot generally be used like this to spread the value all over the 32 bit integer space.

For fun, I put in a test where I use random values all over the integer space. Even with a confidence interval of 10%, we cannot say it distributes evenly. This means that SHA functions distribute better than random on this input dataset. Your mileage will of course vary for random values

On the note of the CHECKUM series, Books Online does (under?) state:

If one of the values in the expression list changes, the checksum of the list also generally changes. However, there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed, unless your application can tolerate occasionally missing a change. Consider using HashBytes instead. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.

Summary

In this blog I have explored properties of hash function built into SQL Server. The results indicate some general guidance that could be followed:

For best spread over the hash space, use SHA or SHA1 algorithms with HASHBYTES function

Be VERY careful about using CHECKSUM and BINARY_CHECKSUM as these functions do not spread a hashed integer value nicely over the full space. Depending on what you use these functions for, they may not be suitable

It takes about 1 CPU microsecond to calculate a single call to HASHBYTES, except when hashing with MD2

Using HASHBYTES to spread out values over the 32-bit integer space seems rather expensive, since even the best cases uses tens of thousands of CPU instructions.

This indicates that it may be possible to implement a better hash function if you do not care about cryptographic properties of the hash

I have one comment on CHECKSUM and that is CHECKSUM is different from the other hashing functions as the other preset the domain to the full extent of the number of bytes used for the result. MD5 and others pre-initialize with a set of data and the apply the user input on top of that. CHECKSUM doesn’t, as the function only relies on the input. See my blog post here. http://weblogs.sqlteam.com/peterl/archive/2010/08/19/checksum-weakness-explained.aspx

You would most probably get different bucket results for CHECKSUM of you use 1 million values occupying at least 31 bits (32 bits if using negative values).

I’m going to have to do some research on this and feed back to you ; my original thoughts have always defaulted to purely never combining hashes, but some of the reading I’ve been doing has indicated otherwise, and my own tests with just FNV-1 and FNV-1a has shown positive results.

I believe that combining cryptographic hashes is non-productive, but this is a non-cryptographic function. I will report from my tests

Hello Thomas, My name is Diego. I was reading your articke about HASHBYTES function but
i got the following error ‘HashBytes’ is not a recognized function name. i have copied the same example that MSDN page http://msdn.microsoft.com/en-us/library/ms174415.aspx, do you have any clue why is not working ? Thanks in advance for your advise.

I’d be very interested in seeing a similar analysis (or perhaps just an extension to your table) for the various hash functions in regards to hash clashes for distinct values. When it comes to hash distribution, you’ve nailed it in your conclusion. However, if using hashes to detect changes – as mentioned in your BOL quote – besides speed, the risk of clashes is equally important, whereas the distribution isn’t important.

I’d assume the page checksum is performed using a high-performance low-clash-risk hash function.

I have not actually looked at the source code for the page checksums, but I would assume that the goal is different than mine (detect changes) and that the hash is tailored for that. Clearly the CHECKSUM and BINARY_CHECKSUM do not have avalance properties, which would be desirable for page checksum. With regards to hash clashes – I need to think about how to set up a good test for that property. Any ideas?

I am thinking about implementing a CLR version of MurmurHash3 to see how many CPU cycles I can get it down to.