String.hashCode() is plenty unique

The article (rightly) points out that Java’s humble String.hashCode() method, which maps arbitrary-length String objects to 32-bit int values, has collisions. The article also (wrongly) makes this sound surprising, and claims that the String.hashCode() algorithm is bad on that basis. In the author’s own words:

No matter what hashing strategy is used, collisions are enevitable (sic) however some hashes are worse than others. You can expect String to be fairly poor.

That’s pretty strongly worded!

The author has demonstrated that String.hashCode() has collisions. However, being a 32-bit String hash function, String.hashCode() will by its nature have many collisions, so the existence of collisions alone doesn’t make it bad.

The author also demonstrated that there are many odd, short String values that generate String.hashCode() collisions. (The article gives many examples, such as !~ and "_.) But because String.hashCode()’s hash space is small and it runs so fast, collisions are easy to find, so being able to find them doesn’t make String.hashCode() bad either. (And hopefully no one expected it to be a cryptographic hash function to begin with!)
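To make the short-string collisions concrete, here is a minimal sketch that re-implements the polynomial formula documented in the String.hashCode() Javadoc and checks the !~ and "_ pair mentioned above. The class name ShortCollisions and the helper polynomialHash are illustrative names of my own, not anything from the original article.

```java
final class ShortCollisions {
    // Mirrors the formula documented for String.hashCode():
    // s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], using int arithmetic
    // (so it wraps around on overflow, like the real method).
    static int polynomialHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // '!' is 33 and '~' is 126: 33 * 31 + 126 = 1149
        System.out.println("!~".hashCode());
        // '"' is 34 and '_' is 95: 34 * 31 + 95 = 1149
        System.out.println("\"_".hashCode());
        // The hand-rolled version agrees with the built-in one.
        System.out.println(polynomialHash("!~") == "!~".hashCode());
    }
}
```

With only two characters and a multiplier of 31, bumping the first character by one and dropping the second by 31 always lands on the same value, which is why short collisions like these are so easy to find.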

So none of these arguments that String.hashCode() is a “poor” hash function are convincing. Additionally, I have always found (anecdotally) that String.hashCode() manages collisions quite well for Real World Data.

So what does a “poor” hash function look like? And what does a “good” hash function look like? And where does String.hashCode() fall on that spectrum?

A Bit of Theory

A 32-bit hash can only take 2^32 = 4,294,967,296 unique values. Because a String can have any number of characters in it, there are obviously more possible Strings than this. Therefore, collisions must exist because of the pigeonhole principle.

But what is the likelihood of a collision?

First, assume an “ideal” hash function is being analyzed. An ideal hash function distributes its outputs uniformly and independently across the hash space. In other words, for all possible inputs, an ideal hash function’s outputs should have no pattern, even if the inputs do.

The famous (and counter-intuitive!) birthday problem states that with 365 possible “hash values,” hashing only 23 distinct inputs is enough for a 50% chance of a collision, even for an ideal hash function. With 2^32 possible hash values, roughly 77,164 distinct inputs must be hashed before there is a 50% chance of a collision, per the usual birthday-problem approximation.
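As a rough check on these figures, the sketch below computes the birthday bound two ways: an exact loop over the no-collision probability, and the common square-root approximation sqrt(2 · m · ln 2). It assumes an ideal hash with uniform, independent outputs. The class and method names are illustrative, and the square-root formula is one standard approximation, not necessarily the exact one used for the numbers above.

```java
final class BirthdayBound {
    // Smallest number of draws from `buckets` equally likely values for which
    // the probability of at least one repeat reaches `target`.
    static long drawsForCollisionProbability(double buckets, double target) {
        double logNoCollision = 0.0; // log of P(no collision) for n draws
        long n = 1;
        while (1.0 - Math.exp(logNoCollision) < target) {
            // Going from n to n + 1 draws multiplies P(no collision) by (1 - n/buckets).
            logNoCollision += Math.log1p(-n / buckets);
            n++;
        }
        return n;
    }

    // Common closed-form approximation for the 50% point.
    static double sqrtApproximation(double buckets) {
        return Math.sqrt(2.0 * buckets * Math.log(2.0));
    }

    public static void main(String[] args) {
        double hashSpace = 4294967296.0; // 2^32
        System.out.println(drawsForCollisionProbability(365, 0.5));
        System.out.println(drawsForCollisionProbability(hashSpace, 0.5));
        System.out.println(sqrtApproximation(hashSpace));
    }
}
```

For 365 buckets the exact loop returns 23, and for 2^32 buckets both calculations land within a couple of the 77,164 quoted above.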

On short inputs in English, 466,544 total hashes resulted in 356 collisions. An “ideal” hash function would generate an expected 25.33 collisions in the same circumstances. Therefore, String.hashCode() generates roughly 14.05x more collisions than an ideal hash function would for these inputs:

356.0 / 25.33 ≈ 14.05

An aggregate collision rate of roughly 8 per 10,000 still isn’t bad in an absolute sense.
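For readers who want to try this on their own data, here is a minimal sketch of how such a collision count could be taken. It assumes a collision is counted as the number of unique inputs minus the number of distinct hash values, which may differ from how the figures above were tallied. CollisionCounter and countCollisions are hypothetical names, and the sample list is made up for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CollisionCounter {
    // Assumed definition: collisions = (unique inputs) - (distinct hash values).
    // Duplicate inputs are ignored so that a repeated String is not
    // mistaken for a collision.
    static int countCollisions(Iterable<String> inputs) {
        Set<String> unique = new HashSet<>();
        Set<Integer> hashes = new HashSet<>();
        for (String s : inputs) {
            if (unique.add(s)) {
                hashes.add(s.hashCode());
            }
        }
        return unique.size() - hashes.size();
    }

    public static void main(String[] args) {
        // "!~" collides with "\"_", and "Aa" collides with "BB";
        // the repeated "hello" is a duplicate input, not a collision.
        List<String> sample = List.of("!~", "\"_", "Aa", "BB", "hello", "hello");
        System.out.println(countCollisions(sample)); // prints 2
    }
}
```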

On longer inputs in English, 111,385 total hashes resulted in 1 collision. An ideal hash function would generate an expected 1.44 collisions over this data. String.hashCode()’s performance is on par with an ideal hash function in this case:

1 / 1.44 ≈ 0.694

Less than 1 collision per 100,000 hashes is excellent in an absolute sense as well.
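The “ideal” expectations quoted above can be reproduced, at least approximately, with the standard formula for throwing n items into m equally likely buckets: expected collisions = n − m · (1 − (1 − 1/m)^n). The sketch below is my own reconstruction under that assumption, with illustrative names, and uses log1p/expm1 only to keep the floating-point arithmetic stable.

```java
final class ExpectedCollisions {
    static final double HASH_SPACE = 4294967296.0; // 2^32 possible int values

    // Expected collisions for n unique inputs under an ideal hash:
    // n minus the expected number of distinct hash values produced,
    // i.e. n - m * (1 - (1 - 1/m)^n).
    static double expected(long n, double m) {
        double logAllMiss = n * Math.log1p(-1.0 / m);          // log of (1 - 1/m)^n
        double expectedDistinct = -m * Math.expm1(logAllMiss); // m * (1 - (1 - 1/m)^n)
        return n - expectedDistinct;
    }

    public static void main(String[] args) {
        double shortInputs = expected(466_544, HASH_SPACE);
        double longInputs = expected(111_385, HASH_SPACE);
        System.out.println(shortInputs);          // compare with the 25.33 above
        System.out.println(356.0 / shortInputs);  // observed vs. ideal, short inputs
        System.out.println(longInputs);           // compare with the 1.44 above
        System.out.println(1.0 / longInputs);     // observed vs. ideal, long inputs
    }
}
```

The results should agree with the 25.33 and 1.44 figures to within rounding, which in turn gives ratios close to the 14.05 and 0.694 shown above.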

A Bit of Interpretation

Obviously, String.hashCode() isn’t unique, but it can’t be. For short values it generates roughly 14x the collisions of a theoretical ideal, yet the absolute collision rate stays low. For long values, it performs in line with an ideal theoretical solution.

Several people on Reddit and Hacker News have pointed out that this approach is not a rigorous statistical analysis. That’s true! And I don’t mean to present it as such. (The word “significant” did creep into one draft; I’ve eliminated it.) For an example of what a much more thorough analysis of hash function performance might look like, I encourage readers to check out the links in the Further Reading section.

However, hopefully this at least demonstrates that String.hashCode() is plenty unique for its intended purpose, which is spreading String values out across a hash table, and that its collision performance is at least “OK.”

A Bit of Further Reading

If you found this interesting, I highly recommend you check out this answer on Stack Overflow. It goes into wonderful depth on hash function collision and wall-clock performance.

EDIT: Upon reflection, I realized that what caused me to put this (simple) analysis together was the subjective claim that String.hashCode() performed “poorly.” In my experience, it works quite well for its intended purpose, and I wanted to show that. My first draft used hand-waving (at least) as much as the original article when trying to make that case. I hope introducing a more theoretical framework in this draft provides a more convincing argument that String.hashCode() at least isn’t “poor.”