Hash Tables With Open Addressing

August 24, 2012

One of the classic data structures of computer science is the hash table, which provides constant-time access to key/value data in the average case. The idea is to store a key/value pair in a location that is computed based on the value of the key, which works fine when all the hash values are unique; the problem arises when two keys hash to the same value. The hash tables of the Standard Prelude use a method called chaining to resolve collisions; today’s exercise uses a different method called open addressing.

All hash tables work by computing an address at which to store an item based on its key. Sometimes two keys hash to the same value, which causes a collision. Chaining, as used in the Standard Prelude, resolves collisions by storing multiple items at the same location, forming a list of items. Open addressing instead computes a second address, or if necessary a third, or fourth, or …, continuing until it finds an empty spot. It is necessary that the computation of secondary addresses eventually visits every possible storage location; a simple approach, which we will adopt, is called linear probing, in which storage locations are accessed in increasing order until an empty location is found. It is possible, of course, for the hash table to become completely filled, in which case an error is reported when a new item cannot be inserted.

The tricky part of hashing with open addressing is deletions, because it’s not possible to simply delete an item because some other item may rely on the the fact that its storage location was filled when the item was inserted. The solution is to have three types of items in a storage location: nil, which indicates that the storage location has never been used; deleted, which indicates that the storage location is currently empty but has been used in the past, and in use, for those storage locations that are currently occupied.

Your task is to write functions that maintain a hash table with open addressing. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

I’m embarrassed at the bug found by madscifi. Instead of fixing the bug directly, I offer an improved version of hash tables with open addressing, due to Knuth (AoCP3, Algo 6.4L and 6.4R). The primary difference between Knuth’s version and the version described above is that during deletion Knuth moves items in the table instead of marking them deleted, so that there is no longer a “deleted” indicator. The code below doesn’t quite follow Knuth; Knuth always leaves one element of the table empty, as a sentinel, at the cost of keeping track of the number of items in the table separately, but we use the entire table, counting items during searching to check for overflow. An empty hash table is a vector of null lists; when an item is inserted into the hash table, the appropriate null list is replaced by a key/value cons pair:

(define (make-hash len) (make-vector len (list)))

Each of the lookup, insert and delete functions tests four conditions. If the search wraps around the table back to the starting point, it overflows. If the search finds a null list in the hash table, the sought item does not exist in the table. If the current key equals the sought key, then we perform the appropriate action and terminate. Otherwise, the hash table has a key/value pair at the current location, but it isn’t the one we seek, so the search continues. Note that we worked front to back in the original code, but Knuth works the other way, from back to front.

The lookup function combines the first two conditions, returning #f indicating that the search key is not present in the table if the search wraps around the table or reaches a null item. If the key is found in the third condition, it is returned, otherwise the next item is checked:

Insertion combines the conditions differently. The first condition reports an overflow and quits. The second and third conditions are combined, and the new key/value pair is inserted in the table, either in a previously-empty location or associating a new value with an existing key. The function returns the modified hash table:

Deletion is tricky. As in lookup, the first two conditions are combined; if the key is not in the table, the table is returned unchanged. In the third condition, when the key is found, it is deleted, and some item farther down the chain replaces it. The replacement process is recursive, with moved items being replaced by some item farther down the chain; eventually, the chain is reformed so that it has no empty items. Loop1 loops over the items in the chain; loop2 finds an item to replace: