Cuckoo hashing is a not-so-new method of implementing hash tables. It has some very nice properties, provided the load factor is not too high. Since objects are stored by reference, the load factor of the array is not directly a memory issue. As such it's well suited to Java.

In particular it has worst-case O(1) lookups, as opposed to O(n) for chaining and most other hash table implementations. Anyway, I have been asked to make my updated versions available (I have already made previous versions available). This has been used in production code for some time now, so it should be pretty stable.

It's also been faster than a HashMap in all my test cases.
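For anyone who hasn't seen the technique, here is a minimal two-hash sketch of the core idea (my illustration, not the code being released here): every key has a fixed set of candidate slots, lookups probe only those slots, and inserts evict whatever is in the way until everything settles or a rehash is needed.

    // A minimal two-hash cuckoo table, just to illustrate the idea.
    // Lookups probe at most two slots, which is where the worst-case
    // O(1) get/contains comes from.
    final class TinyCuckoo<K, V> {
        private Object[] keys, vals;
        private int mask; // table length is a power of two

        TinyCuckoo(int capacity) {
            int n = 8;
            while (n < capacity * 2) n <<= 1; // keep the load factor below 0.5
            keys = new Object[n];
            vals = new Object[n];
            mask = n - 1;
        }

        private int h1(Object k) { int h = k.hashCode(); return (h ^ (h >>> 16)) & mask; }
        private int h2(Object k) { int h = k.hashCode() * 0x9E3779B9; return (h ^ (h >>> 13)) & mask; }

        @SuppressWarnings("unchecked")
        public V get(K k) {
            int i = h1(k);
            if (k.equals(keys[i])) return (V) vals[i];
            int j = h2(k);
            if (k.equals(keys[j])) return (V) vals[j];
            return null; // only two probes, ever
        }

        public void put(K k, V v) {
            if (k.equals(keys[h1(k)])) { vals[h1(k)] = v; return; }
            if (k.equals(keys[h2(k)])) { vals[h2(k)] = v; return; }
            Object ck = k, cv = v;
            int slot = h1(k);
            for (int kicks = 0; kicks < keys.length; kicks++) {
                Object ek = keys[slot], ev = vals[slot];
                keys[slot] = ck; vals[slot] = cv;
                if (ek == null) return;   // found a free slot, done
                ck = ek; cv = ev;         // evicted entry moves to its other home
                slot = (slot == h1(ek)) ? h2(ek) : h1(ek);
            }
            // Eviction cycle: a real implementation grows/rehashes here and retries.
            throw new IllegalStateException("rehash needed");
        }
    }

The released version uses three hash codes rather than two (discussed further down), which is what lets the load factor go well above what a two-hash table can sustain.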

However, comments are for wimps and communists. Warning: I also originally broke the collections contract: all methods that return a collection were not backed by the original collection; they were copies. This is no longer true. They now conform completely to the Map contract.

Note that for these classes you can choose between BSD, Apache, or LGPL-with-classpath-exception licensing.

Comments and suggestions are always welcome. Note I lurk in #lwjgl on IRC a lot too.

Update: OK, so I found out that some of the hashing code was pretty brain-dead. This is now fixed. Also added is a JUnit test for almost everything in cuckoo.*. Not that I found any bugs other than poor performance from bad hashing.

I use delt0r's cuckoo hashing code in my A* path-finding code and found that it is the fastest map implementation AND produces no garbage, unlike the other map implementations, which allocate a HashMap.Entry for every object they store.

No benchmarks are reasonable, but some are interesting. Seriously, I had problems with my initial tests because of quirky benchmarks.

I am not really surprised by its rank in the benchmarks, however. First, most of my objects cache 3 hash fields for the objects stored in the map, which makes it quite a bit faster (the CuckooHashFields interface). And second: usage patterns really, really matter. For example, if you end up having to rehash a lot, then I would expect it to degrade to the performance of the default hash map implementation. I would also expect a good performance gain from an oversized initial capacity.
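I don't know the exact shape of the CuckooHashFields interface, so the names below are hypothetical, but the idea would be something like this: the stored object computes its hash values once and caches them in fields, and the map just reads them back instead of recomputing (and allocating) anything.

    // Hypothetical sketch; the method names are my guesses, not the real API.
    interface CuckooHashFields {
        int cuckooHash1();
        int cuckooHash2();
        int cuckooHash3();
    }

    // e.g. an A* node that pays the hashing cost once, at construction.
    final class PathNode implements CuckooHashFields {
        final int x, y;
        private final int h1, h2, h3; // cached, never recomputed

        PathNode(int x, int y) {
            this.x = x; this.y = y;
            int h = x * 31 + y;
            this.h1 = h ^ (h >>> 16);
            this.h2 = (h * 0x9E3779B9) ^ (h >>> 13);
            this.h3 = (h * 0x85EBCA6B) ^ (h >>> 7);
        }

        public int cuckooHash1() { return h1; }
        public int cuckooHash2() { return h2; }
        public int cuckooHash3() { return h3; }
    }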

In my applications you get a lot of adds/puts and removes, but mostly contains calls. The size stays approximately the same throughout the map's lifetime in my app.

As always... YMMV.

I should also point out that I use the java.util.* implementations until I see a performance problem.

Edit: P.S. My hashing is not very optimized. There is perhaps a good performance gain there too.

I have no special talents. I am only passionately curious.--Albert Einstein

I've read about cuckoo hashing before and it's indeed a nice algorithm. How well (or badly) does it handle the degenerate case of all keys having the same hash code (as returned from the hashCode() method)? Does it have a maximum of number_of_hashes * bin_size entries in that case? Expanding the array size and rehashing would end up with the same maximum number of possible entries, right? Or have I overlooked something?

Cuckoo hashing will fail with many identical hash codes, as I have implemented it. However, this is generally a pretty easy situation to avoid*. It could also be avoided internally, but it is not a usage case I need to consider.

For the security-aware, this does have implications when you don't control the type of objects inserted into the map: someone could mount a DoS attack by sending objects that all have the same hash code.
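To make the degenerate case concrete, a key like the following (my example) defeats any table that derives all of its probe positions from hashCode(), since every key collides with every other key in every slot:

    // Pathological key: every instance reports the same hash code, so all
    // derived cuckoo hashes collide and inserts can never settle.
    final class EvilKey {
        private final int id;
        EvilKey(int id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof EvilKey && ((EvilKey) o).id == id;
        }
        @Override public int hashCode() { return 42; } // constant on purpose
    }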

I have no special talents. I am only passionately curious.--Albert Einstein

Re-did the charts with the server VM and way more iterations. I wish we had a good way of generically comparing hash maps in a few different ways, but I guess people will just have to experiment in their actual apps.

Bumping so that folks realize that there has been an important bug fix. In particular, performance is now restored to what it should be. I have also added an int-keyed map. It's quite a bit faster than using Integers.

I have no special talents. I am only passionately curious.--Albert Einstein

Perhaps the last big fix? I have now changed the implementation to return properly backed views of the collections. This results in much faster iteration, and conforms to the Map interface contract. Many of the Collections.* methods use these view methods, and this results in expected behavior and better performance.

I have no special talents. I am only passionately curious.--Albert Einstein

I'm messing around with cuckoo hashing, and I think I might have found an issue with your implementation. Does the following correctly describe what happens?

Quote:

    put key1 in a, b, or c; key1 goes in a
    put key2 in a, x, or z; key2 goes in x
    remove key1
    put key2 in a, x, or z; key2 goes in a
    key2 is in the table twice!
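If it helps, here is the scenario as a runnable check (a sketch; HashMap stands in for the map under test so it compiles, and whether the duplicate actually appears depends on where key1 and key2 hash, so a real test would loop this over many keys):

    import java.util.HashMap;
    import java.util.Map;

    public class RemoveReinsertTest {
        public static void main(String[] args) {
            Map<String, Integer> map = new HashMap<>(); // swap in the cuckoo map
            map.put("key1", 1);   // key1 lands in one of its slots (a)
            map.put("key2", 2);   // key2 lands in x
            map.remove("key1");   // frees a
            map.put("key2", 3);   // a buggy insert stores key2 again, in a
            if (map.size() != 1 || map.get("key2") != 3)
                throw new AssertionError("key2 is in the table twice");
        }
    }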

Edit:

It seems you increment size in some places without checking size >= threshold. Is this OK?

I think the capacity specified in the constructor should be used as "initialCapacity * (1 / loadFactor)". This avoids rehashing when the user knows how many items they want to insert (especially for small capacities).
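Something along these lines in the constructor would do it (a sketch of the suggestion, not the actual code):

    // Size the table so that `initialCapacity` inserts fit without a rehash.
    static int tableSizeFor(int initialCapacity, float loadFactor) {
        int needed = (int) Math.ceil(initialCapacity / loadFactor);
        int size = 8;                     // minimum table size (assumed)
        while (size < needed) size <<= 1; // next power of two
        return size;
    }

For example, tableSizeFor(100, 0.75f) returns 256, which holds 100 entries with a threshold of 192 and therefore no rehash.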

I've had better luck using multiple bins instead of single-bin implementations. The sweet spot seems to be either 2 or 4 bins per masked hash. My goodness metric was density rather than performance. YMMV.

There are advantages and disadvantages to using identityHashCode. Advantage: it is (or at least used to be) the raw address (maybe salted?) of the object, so objects that were allocated together tend to sit near each other in the hash table. This may or may not match a given use case. Disadvantage: it has to cross the native boundary and perform memory-management fix-up, which can end up requiring more memory. This happens for objects that just fit within the allocator's bucket, where the extra storage (the original memory address) pushes the size into a larger chunk. Additionally, it doesn't have the notion of equivalent objects returning the same hash value. Finally, as noted, it tends to require more cycles.

Rehashing the initial hash returned by the object is another use-case-dependent decision.

identityHashCode: rehashing the initial value in effect negates any reason to use the call at all. So the first hash used should be something like hash0 = (hash >>> 4) | (hash << 28), if used.

hashCode: Notably, the integer wrapper types return their represented value as the hash. In the general case this is of little use. However, it seems rather common that sequences are used, and in that case not rehashing the initial value is reasonable. The map implementations shipped with Java do bit-mix the result of hashCode(); this matters more there than for cuckoo, since only a single hash value is used and it must be "good" for the general case. The additional hash code(s) cover this case for cuckoo.
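For reference, this is the sort of bit mixing being discussed: the rotation from the post above for identityHashCode, and a supplemental mix for general hashCode() results (the mix below is the one JDK 6/7's java.util.HashMap used):

    // Right-rotate by 4: object addresses are aligned, so the low bits of an
    // identity hash carry little information.
    static int identityHash0(Object o) {
        int h = System.identityHashCode(o);
        return (h >>> 4) | (h << 28);
    }

    // Supplemental mix applied on top of hashCode(); this particular spread
    // is what JDK 6/7's java.util.HashMap used to defend against poor hashes.
    static int mix(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }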

CommanderKeith, you must *not* use System.identityHashCode(), because objects can and should override hashCode() to be consistent with .equals(); Integer is the obvious example. Also, (new Integer(1)).hashCode() == 1, which creates pathological cases very easily. So you really must hash (bit-mix) whatever hashCode() returns.

Yes, it's a little slower. But it should be correct in all cases. Getting to the wrong place faster is not really faster.

Nate: I will double-check. This should be caught in the unit tests, and I did have it correct at one point. Perhaps doing the "lazy hashing" borked it. Shows the importance of code review.

There are lots of different ways to implement this. I found that the cost of hashing is small compared to cache effects and Object.equals()/Object.hashCode() for the general case, so I went with 3 hash codes so that load factors can go higher than 0.9. The default is 0.75; setting it to 0.9 means that puts which trigger a rehash will be slower, but once resized, iteration performance will be faster. In my tests on 64-bit and 32-bit JVMs on Linux, having a single table was a little faster, which I put down to cache effects. But "fast" on a JVM is a moving target with its own optimizations.

In practice I have a custom CHashMap with custom hash fields. This is much faster than any of the above. But fitting the right map to the job has always been part of optimization.

I have no special talents. I am only passionately curious.--Albert Einstein

Nate was right. I added something to the unit tests and fixed the code. This is getting silly. A good reason to use java.util.* unless you really need something faster.

This also illustrates why unit tests are not a panacea for good code. They only test for what you think of.

I should note that for my IntHashMap the first hash is highly biased, so that linear usage of ints results in fast access. This hurts the load factor only very slightly, since the other 2 hash codes are close to random.
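My reading of that, as a sketch (this is an assumption about the scheme, not the actual code): keep the first hash close to the identity so consecutive int keys land in consecutive slots, and only mix the two backup hashes:

    // Assumed shape of the biased scheme described above.
    static int hash1(int key, int mask) {
        return key & mask;                // identity: sequential keys, sequential slots
    }
    static int hash2(int key, int mask) {
        int h = key * 0x9E3779B9;         // Fibonacci-style mix, near-random
        return (h ^ (h >>> 16)) & mask;
    }
    static int hash3(int key, int mask) {
        int h = key * 0x85EBCA6B;         // a second, independent-ish mix
        return (h ^ (h >>> 13)) & mask;
    }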

I have no special talents. I am only passionately curious.--Albert Einstein

It's true that I'm ignoring "black box" situations such as placing mixed types in the map. I'm making the assumption that "this is not a general-purpose map that can be used as a black box". For instance:

    Byte b1 = new Byte((byte) 1);
    Integer i1 = new Integer(1);

Both b1 and i1 return the same hash while equals returns false, which will cause the map to explode if you don't use identityHashCode. The same is true for any pair of objects that return identical hash values but are not equal.

However, this isn't much of a limitation for an embedded, special-purpose map across a wide class of problems.

This also illustrates why unit tests are not a panacea for good code. They only test for what you think of.

I like to write test cases which cover anything I can think of, and then have one or two that use a seeded random number generator to create a couple of thousand reproducible test cases that might cover the things I didn't think of. For example:
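A minimal sketch of that style of test, using java.util.HashMap as the oracle for whatever map is under test:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Random;

    // Drive the map under test and a java.util.HashMap with the same random
    // operations and compare as we go. The fixed seed makes failures reproducible.
    public class RandomizedMapTest {
        public static void main(String[] args) {
            Map<Integer, Integer> underTest = new HashMap<>(); // swap in the cuckoo map
            Map<Integer, Integer> oracle = new HashMap<>();
            Random rnd = new Random(42);                       // fixed seed
            for (int i = 0; i < 2000; i++) {
                int key = rnd.nextInt(100);                    // small range forces reuse
                switch (rnd.nextInt(3)) {
                    case 0: underTest.put(key, i); oracle.put(key, i); break;
                    case 1: underTest.remove(key); oracle.remove(key); break;
                    default:
                        if (underTest.containsKey(key) != oracle.containsKey(key))
                            throw new AssertionError("contains diverged at op " + i);
                }
                if (underTest.size() != oracle.size())
                    throw new AssertionError("size diverged at op " + i);
            }
        }
    }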
