Problem

Clojure’s hashing strategy for numbers, sequences/vectors, sets, and maps mimics Java’s. In Clojure, however, it is far more common than in Java to use longs, vectors, sets, maps, and compound objects composed of them (e.g., a map from vectors of longs to sets) as keys in other hash maps. Java’s hash strategy appears not to be well tuned for this kind of usage. Clojure’s hashing for longs, vectors, sets, and maps each has weaknesses, and these can multiply together to create a crippling number of collisions.
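To make the multiplying effect concrete, here is a small sketch that uses plain Java collections as stand-ins for Clojure vectors and sets. It assumes only the documented java.util hashing rules (a Long that fits in 32 non-negative bits hashes to itself, a List combines elements with a multiplier of 31, a Set sums its element hashes). The class and helper names are made up for illustration, and the sketch is not Clojure's actual implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class JavaStyleHashCollisions {
    // Hypothetical helper: a List<Long> standing in for a Clojure vector of longs.
    static List<Long> vec(long... xs) {
        Long[] boxed = new Long[xs.length];
        for (int i = 0; i < xs.length; i++) {
            boxed[i] = xs[i];
        }
        return Arrays.asList(boxed);
    }

    // Hypothetical helper: a HashSet standing in for a Clojure set.
    @SafeVarargs
    static <T> Set<T> setOf(T... xs) {
        return new HashSet<>(Arrays.asList(xs));
    }

    public static void main(String[] args) {
        // List.hashCode starts at 1 and computes h = 31 * h + elementHash,
        // so small [row col] pairs land in a narrow range and collide easily.
        System.out.println(vec(0, 31).hashCode()); // 992
        System.out.println(vec(1, 0).hashCode());  // 992

        // Set.hashCode is the plain sum of element hashes, so different sets
        // whose element hashes have the same total collide as well.
        System.out.println(setOf(vec(0, 1), vec(1, 0)).hashCode()); // 1954
        System.out.println(setOf(vec(0, 0), vec(1, 1)).hashCode()); // 1954
    }
}
```

The two sets differ in every element yet share a hash, which is the kind of collision that compounds when sets of small vectors are used as keys.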

For example, Paul Butcher wrote a simple Clojure program that produces a set of solutions to a chess problem. Each solution in the set was itself a set of vectors of the form [piece-keyword [row-int col-int]]. The hash function in Clojure 1.5.1 hashed about 20 million different solutions to about 20 thousand different hash values, for an average of about 1000 solutions per unique hash value. This forces PersistentHashSet and PersistentHashMap into long linear searches when testing set/map membership or adding new elements/keys. There is nothing intentionally pathological about these values – they simply happened to expose this behavior in a dramatic way. Others have run into similarly bad performance with no obvious explanation, and some of those cases are likely due to this same root cause.

Change the hash of integers that fit within a long to the return value of longHashMunge (see the Longs section of the doc for more details).

Change the current multiplier of 31 used for vectors, sequences, and queues to a different constant such as -1640531527 or 524287 (see Vectors section).

For sets, instead of adding together the hash values of the elements, add together the return value of a function xorShift32 called on the hash value of each element (see Sets section).

For maps and records, instead of adding together hash(key) ^ hash(val) for each key,val pair, add together hash(key) ^ xorShift32(hash(val)) (see Maps section).
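The shape of the vector, set, and map changes can be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed implementation: the body of xorShift32 below is a generic Marsaglia-style xorshift step chosen as a stand-in, since this page does not give the definition used in Mark's document, and longHashMunge is left out entirely for the same reason. The class and method names other than xorShift32 are hypothetical.

```java
final class ProposedHashSketch {
    static final int OLD_MULTIPLIER = 31;
    static final int NEW_MULTIPLIER = 524287; // one of the two constants suggested above

    // Ordered (vector/sequence/queue) hash over already-computed element hashes.
    static int orderedHash(int multiplier, int... elementHashes) {
        int h = 1;
        for (int e : elementHashes) {
            h = multiplier * h + e; // int overflow wraps, as in Java's List.hashCode
        }
        return h;
    }

    // ASSUMPTION: a generic xorshift step, standing in for the real xorShift32.
    static int xorShift32(int x) {
        x ^= x << 13;
        x ^= x >>> 17;
        x ^= x << 5;
        return x;
    }

    // Set hash: sum of xorShift32(elementHash) instead of the raw element hashes.
    static int setHash(int... elementHashes) {
        int h = 0;
        for (int e : elementHashes) {
            h += xorShift32(e);
        }
        return h;
    }

    // Map hash: sum of keyHash ^ xorShift32(valHash); each entry is {keyHash, valHash}.
    static int mapHash(int[][] entries) {
        int h = 0;
        for (int[] kv : entries) {
            h += kv[0] ^ xorShift32(kv[1]);
        }
        return h;
    }

    public static void main(String[] args) {
        // [0 31] and [1 0] collide with multiplier 31 but not with 524287.
        System.out.println(orderedHash(OLD_MULTIPLIER, 0, 31) == orderedHash(OLD_MULTIPLIER, 1, 0)); // true
        System.out.println(orderedHash(NEW_MULTIPLIER, 0, 31) == orderedHash(NEW_MULTIPLIER, 1, 0)); // false

        // Element hashes 962 + 992 and 961 + 993 have equal plain sums (1954),
        // but the sums differ once each hash is passed through xorShift32.
        System.out.println(setHash(962, 992) == setHash(961, 993)); // false

        // With hash(key) ^ hash(val), an entry whose key and value hash alike
        // contributes 0; mixing the value hash first avoids that.
        System.out.println((1 ^ 1) + " vs " + mapHash(new int[][] {{1, 1}}));
    }
}
```

Whatever mixing function is finally chosen, the point of the sketch is only that sums of mixed hashes no longer cancel the way sums of small raw hashes do.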

Below is a link to a modified version of Paul Butcher's N-queens solver, with extra code for printing stats with several different hash functions. The README has instructions for retrieving and installing locally a version of Clojure modified with one of Mark's proposed alternate hash functions:

Here is a summary of results for some program elapsed times and how spread out the hash values are:

| Problem | Using Clojure 1.5.1 hash | Using Mark Engelberg's 2013-10-30 proposed hash |
|---|---|---|
| Paul Butcher's N-queens problem with 6x6 board | Elapsed time: ~7 min; 180,568 solutions hash to 3,013 distinct hash values; average of 59.3 solutions per hash value (max 2,492) | Elapsed time: ~12 sec; 180,568 solutions hash to 180,563 distinct hash values; average of 1.0 solutions per hash value (max 2) |
| Paul Butcher's N-queens problem with 6x9 board | Elapsed time: > 8 hours (did not wait for it to finish); 20,136,752 solutions hash to 17,936 distinct hash values; average of 1,122.7 solutions per hash value (max 81,610) | Elapsed time: 11.8 min; 20,136,752 solutions hash to 20,089,488 distinct hash values; average of 1.0 solutions per hash value (max 3) |
| Compile Clojure 1.6.0-alpha1 source with "ant jar", so no tests run | Elapsed time: avg 51 sec (min 50, max 53) | Elapsed time: avg 50.6 sec (min 50, max 51) |
| Compile Clojure 1.6.0-alpha1 source with "ant", which includes running tests, but with generative test duration reduced to 1.0 sec | Elapsed time: avg 76.6 sec (min 75, max 80); 120,353 unique values hash to 113,405 distinct hash values; average of 1.06 values per hash value | Elapsed time: avg 76.0 sec (min 75, max 78); 119,811 unique values hash to 114,329 distinct hash values; average of 1.05 values per hash value |

Tradeoffs and alternatives

These are discussed throughout Mark's document. A few are called out below.

Nearly all of the proposals involve additional operations on ints or longs. This is expected to require little additional elapsed time in most hash computations, given that the most significant cost in hashing collections is usually traversing the parts of the data structure, especially if that involves cache misses. Measurements should be summarized or linked to here in the future.

Murmur3 is widely used, but does not lend itself well to incremental updates of hash calculations of collections.
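One way to read the incremental-update concern: a hash built by adding per-element contributions can be kept current as elements come and go, without revisiting the rest of the collection. The sketch below shows only that property of an additive set hash. The class is hypothetical, it takes element hashes directly rather than elements, and it makes no claim about how Clojure or Murmur3 are actually implemented.

```java
final class IncrementalSetHash {
    private int hash = 0;

    // Caller is assumed to add only elements not already in the set.
    void add(int elementHash) {
        hash += elementHash;
    }

    // Caller is assumed to remove only elements currently in the set.
    void remove(int elementHash) {
        hash -= elementHash;
    }

    int hash() {
        return hash;
    }

    // Reference computation that walks every element hash.
    static int fromScratch(int... elementHashes) {
        int h = 0;
        for (int e : elementHashes) {
            h += e;
        }
        return h;
    }

    public static void main(String[] args) {
        IncrementalSetHash s = new IncrementalSetHash();
        s.add(992);
        s.add(961);
        s.add(-7);
        // Insertion order does not matter, and the update is O(1) per element.
        System.out.println(s.hash() == fromScratch(961, -7, 992)); // true
        s.remove(961);
        System.out.println(s.hash() == fromScratch(-7, 992)); // true
    }
}
```

A hash that mixes its input block by block in a fixed order and then applies a finalization step does not offer the same cheap add/remove update, which is presumably the tradeoff being pointed at.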