Which hashing algorithm is best for uniqueness and speed? Example (good) uses include hash dictionaries.

I know there are things like SHA-256 and such, but these algorithms are designed to be secure, which usually means they are slower than algorithms that are less unique. I want a hash algorithm designed to be fast, yet remain fairly unique to avoid collisions.

Do collisions actually happen?

Yes. I started writing my test program to see whether hash collisions actually happen, and are not just a theoretical construct. They do indeed happen:

FNV-1 collisions
creamwove collides with quists

FNV-1a collisions
costarring collides with liquid
declinate collides with macallums
altarage collides with zinke
altarages collides with zinkes

Murmur2 collisions
cataract collides with periti
roquette collides with skivie
shawl collides with stormbound
dowlases collides with tramontane
cricketings collides with twanger
longans collides with whigs

DJB2 collisions
hetairas collides with mentioner
heliotropes collides with neurospora
depravement collides with serafins
stylist collides with subgenera
joyful collides with synaphea
redescribed collides with urites
dram collides with vivency

DJB2a collisions
haggadot collides with loathsomenesses
adorablenesses collides with rentability
playwright collides with snush
playwrighting collides with snushing
treponematoses collides with waterbeds

CRC32 collisions
codding collides with gnu
exhibiters collides with schlager

SuperFastHash collisions
dahabiah collides with drapability
encharm collides with enclave
grahams collides with gramary
...snip 79 collisions...
night collides with vigil
nights collides with vigils
finks collides with vinic
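These pairs are easy to check for yourself. Below is a minimal sketch of 32-bit FNV-1a (my own illustration, not the original test program), verifying one of the FNV-1a collisions listed above:

```python
# Sketch: verifying one of the 32-bit FNV-1a collisions listed above.
# The constants are the standard 32-bit FNV offset basis and FNV prime.
FNV_OFFSET_BASIS = 2166136261
FNV_PRIME = 16777619

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR the byte in first, then multiply by the prime."""
    h = FNV_OFFSET_BASIS
    for b in data:
        h ^= b
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # keep it a 32-bit value
    return h

# Two different words, identical 32-bit hashes:
assert fnv1a_32(b"costarring") == fnv1a_32(b"liquid")
```

(FNV-1 differs only in doing the multiply before the XOR, which is why the two variants produce different collision sets.)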

Randomnessification

The other subjective measure is how randomly distributed the hashes are. Mapping the resulting hash tables shows how evenly the data is distributed. All the hash functions show good distribution when mapping the table linearly.
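As a rough numeric stand-in for those bitmap maps, here is a small sketch that measures distribution quantitatively instead of visually: hash a batch of keys into buckets and compare the fullest bucket against the uniform ideal. The djb2 implementation and the synthetic "wordN" keys are my own choices for illustration, not the original test setup:

```python
import collections

# Sketch: a crude numeric stand-in for the bitmap visualisation described
# above. djb2 and the synthetic "wordN" keys are illustrative assumptions.
def djb2(data: bytes) -> int:
    h = 5381
    for b in data:
        h = ((h * 33) + b) & 0xFFFFFFFF
    return h

words = [f"word{i}".encode() for i in range(10_000)]
num_buckets = 1024
counts = collections.Counter(djb2(w) % num_buckets for w in words)

ideal = len(words) / num_buckets   # ~9.8 keys per bucket if perfectly uniform
worst = max(counts.values())       # fullest bucket; the closer to ideal, the better
```

A bitmap like the one described in the comments below is just this same occupancy data rendered spatially.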

A timely post by Raymond Chen reiterates the fact that "random" GUIDs are not meant to be used for their randomness. They, or a subset of them, are unsuitable as a hash key:

Even the Version 4 GUID algorithm is not guaranteed to be unpredictable, because the algorithm does not specify the quality of the random number generator. The Wikipedia article for GUID contains primary research which suggests that future and previous GUIDs can be predicted based on knowledge of the random number generator state, since the generator is not cryptographically strong.

Randomness is not the same as collision avoidance, which is why it would be a mistake to try to invent your own "hashing" algorithm by taking some subset of a "random" GUID.

Note: Again, I put "random GUID" in quotes, because it's the "random" variant of GUIDs. A more accurate description would be Type 4 UUID. But nobody knows what type 4, or types 1, 3 and 5 are. So it's just easier to call them "random" GUIDs.

@Earlz Development tool is Delphi. I assume you mean the images, though. For the "linear" map I created a square bitmap of size n×n, where n = Ceil(sqrt(hashTable.Capacity)). Rather than simply black for an occupied list entry and white for an empty one, I used an HSLtoRGB function, where the hue ranged from 0 (red) to 300 (magenta). White is still an "empty list cell". For the Hilbert map I had to hunt Wikipedia for the algorithm that turns an index into an (x,y) coordinate.
– Ian Boyd, Apr 23 '12 at 18:15


I've removed a number of comments that were along the lines of "+1 great answer!" Please don't post comments that don't ask for clarification or add information to the answer; if you feel it's a great answer, upvote it ;)
– Yannis Rizos♦, Apr 28 '12 at 20:34


It would be really interesting to see how SHA compares: not because it's a good candidate for a hash-table algorithm here, but because it would be interesting to see how any cryptographic hash stacks up against these made-for-speed algorithms.
– Michael, May 25 '12 at 15:09

If you just want a good hash function and cannot wait, djb2 is one of the best string hash functions I know. It has excellent distribution and speed on many different sets of keys and table sizes.
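For reference, here is a minimal sketch of djb2 as it is usually quoted (seed 5381, multiply by 33, add the byte), masked to 32 bits to mimic C unsigned overflow:

```python
# Minimal sketch of djb2 as usually quoted: hash = hash * 33 + c,
# starting from the magic seed 5381, masked to 32 bits here.
def djb2(data: bytes) -> int:
    h = 5381
    for b in data:
        h = ((h * 33) + b) & 0xFFFFFFFF
    return h
```

The original C formulation often writes the multiply as `(h << 5) + h`, which is the same `h * 33`.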

Actually djb2 is zero-sensitive, as most such simple hash functions are, so you can easily break such hashes. It has a bad bias, too many collisions, and a bad distribution; it breaks on most smhasher quality tests: see github.com/rurban/smhasher/blob/master/doc/bernstein. His cdb database uses it, but I wouldn't use it with public access.
– rurban, Aug 20 '14 at 6:03

If you want to create a hash map from an unchanging dictionary, you might want to consider perfect hashing (https://en.wikipedia.org/wiki/Perfect_hash_function): during construction of the hash function and hash table, you can guarantee, for a given dataset, that there will be no collisions.

It's pretty obvious, but worth pointing out that in order to guarantee no collisions, the keys would have to be the same size as the values, unless there are constraints on the values the algorithm can capitalize on.
– devios, Apr 4 '13 at 20:34

I improved gperf and provide a nice front end to most perfect-hash generators at github.com/rurban/Perfect-Hash. It's not yet finished, but already better than the existing tools.
– rurban, Aug 20 '14 at 6:05

All the CityHash functions are tuned for 64-bit processors. That said, they will run (except for the new ones that use SSE4.2) in 32-bit code. They won't be very fast though. You may want to use Murmur or something else in 32-bit code.

In fact, their speed can be a problem sometimes. In particular, a common technique for storing a password-derived token is to run a standard fast hash algorithm 10,000 times (storing the hash of the hash of the hash of the hash of the ... password).
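A minimal sketch of that key-stretching idea, using SHA-256 via Python's hashlib; as the comments below point out, purpose-built KDFs (bcrypt, PBKDF2, scrypt) are the right tool in practice, so treat this purely as an illustration of the technique:

```python
import hashlib

# Sketch of the "hash the hash N times" key-stretching idea described above.
# Purpose-built KDFs (bcrypt, PBKDF2, scrypt) are the right tool in practice;
# this only illustrates why a SLOW hash is desirable for password storage.
def stretched(password: bytes, salt: bytes, rounds: int = 10_000) -> bytes:
    digest = salt + password
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest
```

Python's standard library also ships `hashlib.pbkdf2_hmac`, which does this kind of salted iteration properly and should be preferred over hand-rolling the loop.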

It's relatively fast, sure, for a cryptographic hashing algorithm. But the OP just wants to store values in a hashtable, and I don't think a cryptographic hash function is really appropriate for that.
– Dean Harding, Feb 19 '11 at 1:10


The question brought up (tangentially, it now appears) the subject of the cryptographic hash functions. That's the bit I am responding to.
– yfeldblum, Feb 22 '11 at 13:14


Just to put people off the idea of "In particular, a common technique for storing a password-derived token is to run a standard fast hash algorithm 10,000 times" -- while common, that's just plain stupid. There are algorithms designed for these scenarios, e.g., bcrypt. Use the right tools.
– TC1, Oct 14 '13 at 13:19

Cryptographic hashes are designed to have high throughput, but that often means they have high setup, teardown, .rodata and/or state costs. When you want an algorithm for a hash table, you usually have very short keys, and lots of them, but do not need the additional guarantees of a cryptographic hash. I use a tweaked Jenkins one-at-a-time myself.
– mirabilos, Dec 6 '13 at 13:57
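The textbook (untweaked) one-at-a-time hash is short enough to sketch here, masked to 32 bits to mimic C unsigned arithmetic:

```python
# Sketch of Jenkins' one-at-a-time hash, textbook version (no tweaks),
# masked to 32 bits to mimic C unsigned overflow.
def one_at_a_time(data: bytes) -> int:
    h = 0
    for b in data:
        h = (h + b) & 0xFFFFFFFF
        h = (h + (h << 10)) & 0xFFFFFFFF
        h ^= h >> 6
    # Final avalanche: mix the last bytes through the whole word.
    h = (h + (h << 3)) & 0xFFFFFFFF
    h ^= h >> 11
    h = (h + (h << 15)) & 0xFFFFFFFF
    return h
```

The per-byte shift/XOR steps plus the three finishing steps are what give it its avalanche behaviour despite processing only one byte at a time.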

Non-cryptographic hash functions like Murmur3, CityHash and Spooky are pretty close together. Note that CityHash may be faster on CPUs with the SSE 4.2 CRC instruction, which my CPU does not have. SpookyHash was in my case always slightly ahead of CityHash.

MD5 seems to be a good tradeoff when using cryptographic hash functions, although SHA-256 may be a safer choice given the collision vulnerabilities of MD5 and SHA-1.

The complexity of all the algorithms is linear, which is really not surprising since they work blockwise. (I wanted to see whether the reading method makes a difference, so you can just compare the rightmost values.)

SHA-256 was slower than SHA-512.

I did not investigate the randomness of the hash functions. But here is a good comparison of the hash functions that are missing from Ian Boyd's answer; it points out that CityHash has some problems in corner cases.

I wouldn't use the exact same one used here, as it's still relatively easy to produce collisions with it. It's definitely not terrible, but there are much better ones out there. And if there's no significant reason to be compatible with Java, it should not be chosen.
– Joachim Sauer, Apr 23 '12 at 12:51

If you still choose this way of hashing for some reason, you could at least use a better prime like 92821 as the multiplier. That reduces collisions considerably. stackoverflow.com/a/2816747/21499
– hstoerr, Jul 1 '14 at 6:30
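A sketch of that style of hashing: a Java-like polynomial string hash with the multiplier exposed as a parameter, so 31 (Java's choice) can be swapped for a larger prime such as 92821:

```python
# Sketch: Java-style polynomial string hash with a configurable multiplier,
# masked to 32 bits. Java's String.hashCode uses 31; the comment above
# suggests a larger prime such as 92821 to reduce collisions.
def poly_hash(s: str, multiplier: int = 92821) -> int:
    h = 0
    for ch in s:
        h = (h * multiplier + ord(ch)) & 0xFFFFFFFF
    return h
```

With multiplier 31 (and Java's signed 32-bit overflow in place of the mask) this matches the familiar `h = 31*h + c` scheme the comments are discussing.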

First of all, why do you need to implement your own hashing? For most tasks you should get good results with data structures from a standard library, assuming there's an implementation available (unless you're just doing this for your own education).