CHECKSUM options

These are both vulnerable to collisions but they’re nice and quick. What’s the difference between these two? CHECKSUM is collation aware but BINARY_CHECKSUM is not. So strings which evaluate as equal – for example, lower case ‘a’ and uppercase ‘A’ – produce equal hashes when using CHECKSUM but different hashes when using BINARY_CHECKSUM.

Either of these are good for sanity tests (in test frameworks) and for hash indexes. But I wouldn’t count on such an index to enforce uniqueness, the frequency of hash collisions is too high.

HASHBYTES options

HASHBYTES is another good option. You’ve got all kinds of hash algorithms with this one function. You can specify any of the following hash algorithms, MD2, MD4, MD5, SHA, SHA1, or SHA2.

Even with the limitation that you can only hash strings up to 8000 bytes, I like HASHBYTES a lot. I don’t use it for security, but I use it for unique hash indexes for database rows. I’ve described how to be careful when dealing with unicode strings in Careful Hashing

But what about hash collisions? Lordy, lordy, this again? I’m firmly in the camp that the chance of collisions is so small that the risk is negligible. I mean, the birthday paradox barely takes effect here. Just let me do some math, what if you inserted a unique row to this table once a second for forty years, you’d still be way more likely to win the lottery than to encounter a collision. MD5 and SHA1 isn’t exactly cutting edge cryptography, but it’s still good enough for what I need it to do (At least I’d choose it over the Hill Cipher).

Hashing Outside SQL Server

You’re not storing your files in the database right? That means that if you want to generate a checksum or hash for files, you’re probably going to want to do that processing outside SQL Server. I’ve bookmarked a C# solutions from stackoverflow How do I do a SHA1 File Checksum in C#? It’s a good start.

Something that’s probably relevant to how you hash is what you’re going to use the hash for.

If you’re using the hash for security purposes, then it’s important to you that collisions are very unlikely and the hashing function is hard to reverse. However if you’re just storing a precalculated hash for verifying the integrity of something, you might care less about those properties.

The Fowler–Noll–Vo hash function is my favorite, as it offers an acceptable rate of collisions (lower than CHECKSUM), but does not sacrifice computational performance of the hash result. There is also a handy SQL Server CLR for the FNV hash algorithm which I typically use throughout many databases that exhibit a hashing requirement.