Why do I think String.hashCode() is poor

In response to an article I wrote HERE, a recent article HERE pointed out a number of things which could have been clearer in my claim that String.hashCode() is poor. What does "poor" mean to me and does it even matter?

What is the purpose of a hashcode?

A hashcode attempts to produce a number for each input in a somewhat random and unique way. If the input already contains a significant amount of randomness, the hashing strategy largely doesn’t matter. Even Integer.hashCode(), which simply returns the int value, works well for random (and even sequential) numbers.
Long strings, for example, are likely to contain a significant amount of randomness, so the behaviour of a hash for long strings isn’t a good test of the hash. For that use case, it likely doesn’t matter unless someone has planned a deliberate attack, in which case you have to do more than change the hashing strategy.

How might we test whether a hash code is good or not?

One approach is to score it compared to similar hashcode functions for an assumed use case.
In this case, I have used words from the dictionary, selecting words of lengths 1 to 16 and estimated their collision rate in a HashMap with default settings.

HashMap manipulates the hashcode in two ways.

It agitates the hashcode to ensure the higher bits are used.

It masks the hashcode with capacity - 1, where the capacity is the number of buckets and a power of 2. A mask is used as it’s faster than division.
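As a minimal sketch of those two steps, the following mirrors what java.util.HashMap does in Java 8 and later: it XORs the upper 16 bits into the lower 16, then masks with capacity - 1. The class and method names (BucketIndex, spread, indexFor) are my own for illustration and are not part of the JDK.

```java
// Sketch of how HashMap (Java 8+) turns a hashCode into a bucket index.
// BucketIndex, spread and indexFor are illustrative names, not JDK API.
final class BucketIndex {

    // "Agitate": fold the upper 16 bits into the lower 16 so they
    // still influence the index when the table is small.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Mask with capacity - 1; this only works when capacity is a power of two.
    static int indexFor(int hashCode, int capacity) {
        return spread(hashCode) & (capacity - 1);
    }

    public static void main(String[] args) {
        int capacity = 16; // HashMap's default initial capacity
        for (String key : new String[] {"a", "ab", "hello"}) {
            System.out.println(key + " -> bucket " + indexFor(key.hashCode(), capacity));
        }
    }
}
```

Without the spread step, two hashcodes that differ only in their upper bits would always land in the same bucket of a small table.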

String.hashCode() calculates the hashcode by iterating over the characters in a String, multiplying the running hashcode by 31 and adding the next character each time.
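To make the multiplier easy to vary, here is a small sketch of the same polynomial hash with the multiplier passed in as a parameter. PolyHash and its hash method are hypothetical names; with a multiplier of 31 the result should match String.hashCode(), as that formula is part of the documented contract.

```java
// Polynomial string hash with a configurable multiplier.
// PolyHash is an illustrative helper, not a JDK class.
final class PolyHash {

    static int hash(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            // int overflow wraps around, exactly as in String.hashCode()
            h = multiplier * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String word = "hello";
        System.out.println(hash(word, 31) == word.hashCode()); // same formula as String.hashCode()
        // "Aa" and "BB" are a well-known colliding pair for multiplier 31
        System.out.println(hash("Aa", 31) == hash("BB", 31));
        System.out.println(hash("Aa", 109) == hash("BB", 109));
    }
}
```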

Is 31 a good number to use?

Let’s test how 31 compares to other moderately sized prime numbers. In the following test I look at the percentage of collisions a HashMap is likely to have for words of length 1 to 16. I take into account that HashMap agitates the hashcode and then masks it. The size to mask is based on the default size the HashMap would be for that many keys. I give extra weight to the worst outcome.
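A simplified version of such a test might look like the sketch below. It is not the exact scoring used for the results in this article: it assumes the words are distinct, counts a collision whenever a word lands in a bucket that is already occupied, and leaves out the extra weighting for the worst outcome. CollisionRate, tableSizeFor and the tiny word list are all stand-ins for illustration; a real run would load a dictionary.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Estimates the fraction of keys that would share a bucket in a default HashMap.
// All names here are illustrative; the word list is a stand-in for a dictionary.
final class CollisionRate {

    static int hash(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = multiplier * h + s.charAt(i);
        }
        return h;
    }

    // Capacity a default HashMap (initial capacity 16, load factor 0.75)
    // ends up with after inserting the given number of keys one by one.
    static int tableSizeFor(int keys) {
        int capacity = 16;
        while (keys > capacity * 0.75 && capacity < (1 << 30)) {
            capacity <<= 1;
        }
        return capacity;
    }

    // Assumes the words are distinct.
    static double collisionRate(List<String> words, int multiplier) {
        if (words.isEmpty()) {
            return 0.0;
        }
        int capacity = tableSizeFor(words.size());
        BitSet used = new BitSet(capacity);
        int collisions = 0;
        for (String word : words) {
            int h = hash(word, multiplier);
            int index = (h ^ (h >>> 16)) & (capacity - 1); // agitate, then mask
            if (used.get(index)) {
                collisions++;
            } else {
                used.set(index);
            }
        }
        return (double) collisions / words.size();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "an", "and", "hash", "code", "string", "collision");
        for (int multiplier : new int[] {2, 29, 31, 109}) {
            System.out.printf("multiplier %d: %.1f%% collisions%n",
                    multiplier, 100 * collisionRate(words, multiplier));
        }
    }
}
```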

Not too surprisingly, 2 is the worst prime number to use for this test and has the highest score for collisions. Given that there are 26 letters in the alphabet, let’s assume we could have ignored prime numbers less than 26. Let’s also assume that numbers over 256 (the number of possible values for a byte) could have been discounted. Primes over 256 don’t produce scores outside the range of those we get with primes from 26 to 256. How does each of our prime numbers rank?

For 2 character strings

with prime = 31, there are 5952 collisions
with prime = 109, there are 0 collisions

And for 3 character strings

with prime = 31, there are 755968 collisions
with prime = 109, there are 0 collisions

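For this use case, 109 is much better because the ASCII characters I am using range from 32 to 126 inclusive, which is 95 values, fewer than 109. Each character therefore behaves like a distinct digit in base 109, so no two of these short strings can produce the same hashcode as long as the int doesn’t overflow. Using 109 has no collisions for 4 characters either, but the exhaustive test takes too many resources to run this way.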
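An exhaustive check along these lines can be sketched as follows. It walks every string of a given length over the characters 32 to 126 and counts a collision whenever a raw hashcode has been seen before. That counting rule is my assumption for the sketch, so the totals it reports for 31 need not match the figures quoted above, which may have been counted differently. ExhaustiveCollisions is an illustrative name.

```java
import java.util.HashSet;
import java.util.Set;

// Counts raw hashcode collisions over all strings of a fixed length
// drawn from the printable ASCII range 32..126 (95 characters).
final class ExhaustiveCollisions {

    static final char FIRST = 32;
    static final char LAST = 126;

    static long collisions(int length, int multiplier) {
        return visit(0, length, multiplier, new HashSet<>());
    }

    // Extends the running hash one character at a time, so no String
    // objects need to be built.
    private static long visit(int h, int remaining, int multiplier, Set<Integer> seen) {
        if (remaining == 0) {
            return seen.add(h) ? 0 : 1; // add() returns false if h was already present
        }
        long collisions = 0;
        for (char c = FIRST; c <= LAST; c++) {
            collisions += visit(multiplier * h + c, remaining - 1, multiplier, seen);
        }
        return collisions;
    }

    public static void main(String[] args) {
        for (int length = 2; length <= 3; length++) {
            for (int multiplier : new int[] {31, 109}) {
                System.out.println("length " + length + ", multiplier " + multiplier
                        + ": " + collisions(length, multiplier) + " collisions");
            }
        }
    }
}
```

Holding every hashcode in a HashSet is what makes length 4 expensive: there are over 81 million 4-character strings in this range, so a bitmap or a sort-based approach would be a more practical choice there.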

But because String.hashCode()’s hash space is small and it runs so fast,

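There are plenty of int values to allow every 0 to 4 letter ASCII String to have a unique hashcode: with 95 printable characters there are about 82 million such Strings, against roughly 4.3 billion int values.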

I have always found (anecdotally) that String.hashCode() manages collisions quite well for Real World Data.

I agree that in general it does its job quite well, and HashMap degrading to a tree instead of a list for collisions (in Java 8) mitigates this significantly.

However, suppose that using 109 instead of 31 is 5% better. Imagine how much processing power has been wasted on millions of devices over the decades for want of changing one number.

So what does a “poor” hash function look like? And what does a “good” hash function look like? And where does String.hashCode() fall on that spectrum?

I would say that in terms of prime factors that could have reasonably been chosen, it’s on the "poor" end of the scale.

Conclusion

In conclusion, I feel that String.hashCode() is poor, as most prime numbers between 26 and 256 would have been a better choice than 31. The nearest prime, 29, is likely to be better, and 109 might be even better.
A more extensive study of a range of use cases would be needed to settle on one number that is better in most cases.

I would favour making the hashing strategy for HashMap configurable (like Comparator for TreeMap).
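java.util.HashMap offers no such hook today, but the idea can be approximated by wrapping keys. The sketch below is purely hypothetical: StrategyMapDemo, its nested Key class and hash109 are names I made up for illustration, and the strategy is supplied as a ToIntFunction in the same way a Comparator is supplied to a TreeMap.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical workaround: a key wrapper whose hashCode comes from a
// pluggable strategy. None of these names exist in the JDK.
final class StrategyMapDemo {

    static final class Key<K> {
        private final K key;
        private final int hash;

        Key(K key, ToIntFunction<? super K> strategy) {
            this.key = key;
            this.hash = strategy.applyAsInt(key); // computed once, like String caches its hash
        }

        @Override
        public int hashCode() {
            return hash;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Key && key.equals(((Key<?>) o).key);
        }
    }

    // Same polynomial as String.hashCode(), but with 109 as the multiplier.
    static int hash109(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 109 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        ToIntFunction<String> strategy = StrategyMapDemo::hash109;
        Map<Key<String>, Integer> map = new HashMap<>();
        map.put(new Key<>("Aa", strategy), 1);
        map.put(new Key<>("BB", strategy), 2);
        System.out.println(map.get(new Key<>("BB", strategy)));
    }
}
```

The wrapper costs an extra object per key, which is one reason a strategy built into the map itself would be preferable.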