Hash Functions in .NET – Right Tool for the Right Job

In this post I will talk about common mistakes developers make when when using hash functions. Any hash function is required to meet the following two requirements.

It must be easy to calculate for any possible message.

It must return a fixed-size bit string irrespective of the length of the message.

Broadly speaking there are two types of hash functions: non-cryptographic hash functions and cryptographic hash functions. A cryptographic hash function is expected to also meet the two additional requirements below:

It must be practically impossible to find the message given the hash (pre-image).

It must be practically impossible to find two different messages with the same hash (hash collisions).

Developers often confuse between these two kinds of hash functions which leads to following problems.

Using non-cryptographic hashes in place of cryptographic hashes has serious security consequences.

Using cryptographic hashes in place of non-cryptographic hashes has severe performance problems (either memory or speed).

The .Net BCL provides both cryptographic and non-cryptographic hash functions. Cryptographic hash functions such as MD5, SHA-1, SHA512 etc. that are available in the System.Security.Cryptography namespace are primary designed for security purposes (like digital signatures). On the other hand, non-cryptographic functions like Object.GetHashCode() is not meant for security considerations but can be used for performance reasons (like fast string comparisons and hash-tables).

The rest of the blog post will demonstrate the fact that Object.GetHashCode is not a cryptographic hash and how it (and other low bit count hashes) can be defeated.

Birthday attacks are used to defeat the “hash-collision” requirement for crypto-hashes with small bit lengths. It is based on the theory of the Birthday Paradox, commonly seen in Probability Theory textbooks. The “paradox’ comes from the fact that in a group of 41 randomly selected people, there is more than a 90% chance that at least two people share the same birthday. Also, In a group of 57 people, there is more than a 99% chance that at least two people share the same birthday.

The fact can be easily extended to hash functions too: For a hash algorithm with bit-size N with uniform distribution, a set of 2^(N/2) random hashes (hashes of random data) has a very good probability that at least two hashes in the set are the same.

In our case, the GetHashCode returns an Int32 which makes N = 32. So, effectively it takes just 2^(32/2) = 65536 hashes of random data to find a hash collision (which is BAD and NOT SECURE). However it still takes 2^32 = 4294967296 iterations or random trials for a successful pre-image attack.

Here’s a C# program which find collisions on the String.GetHashCode() function. Here’s how it works:

The program randomly generates 100,000 strings of fixed length (in the order of 65,536 but improves our chances)

Hash of each of these strings is calculated by calling String.GetHashCode() and is stored along with the string in a list.

The list is then searched for duplicate hashes (collisions) and strings with colliding hashes are reported via the command line (String –> Hash format).

If duplicates are not found, we try again (we are bound to find one or more collisions very soon).

Here’s a snapshot of a sample run (results would vary between runs as the program uses random numbers).

As a conclusion, I’d like to emphasize the fact that when dealing with real-world problems, developers need to carefully assess the requirements of security and performance and appropriately use crypto and non-crypto hashing functions. After all, security (or performance) of an application is only as strong (or as fast) as its weakest (or slowest) link.