Further Programming in Java

Chapter 5: Hash Tables

Linear Probing Hash Tables

The linear probing hash table is a fairly simple structure where data items are stored directly inside the hash element array. It gets its name by the way it handles the unique problems that exist when storing the data items in the table directly.

Adding Data

Data can be added to the table by calculating the hash code from the key of the data that is being stored, and then placing the data in the corresponding hash element.

However, there are many more possible search keys than there are hash elements. It is inevitable that some keys will resolve to the same hash code. This is known as a hash collision, and it must be handled.

Why doesn't it make sense to have a hash element for each possible piece of data?

Hash collisions will always occur, but they can be avoided. This is the reason why a good hash function is so important: by spreading the data evenly across the entire table, on the whole there is a smaller chance of collisions occuring.

Resolving Collisions

If faced with a collision situation, the linear probing table will look onto to subsequent hash elements until the first free space is found.

This traversal is known as probing the table; and as it goes by one element at a time, it is linear probing.

There are other kinds of probing; for example quadratic probing is where the traversal skips one element, then two, then four, etc. until a free space is found.

Consider the situation mentioned above where data 'F' has the same hash code as data 'D'. In order to resolve the collision, the add algorithm will need to probe the table in order to find the first free space (after 'C').

Consider the situation mentioned above where data 'F' has the same hash code as data 'D'. In order to resolve the collision, the add algorithm will need to probe the table in order to find the first free space (after 'C').

If the probe loops back, and finally reaches the same element that it started at, it means that the hash table is full, and can no longer hold any more data. The addition operation will fail.

Algorithm

We can summarise the addition algorithm as follows:

Calculate a hash code from the key

Access that hash element

If the hash element is empty, add straight away

If not, probe through subsequent elements (looping back if necessary), trying to find a free place

Retrieving Data

In order to retrieve data from the hash table, we need a key to search for. From this key, we can calculate the hash code. This tells us where in the data array we need to start searching.

Because of the collision resolution of the add operation, the target data might reside at a location other than the element referred to by the hash code.

Therefore, it is necessary to probe the hash table until an empty hash element is found, and for an exact match between each data item and the given key. (The probing stops at an empty element, since it signals the end of where potential data might have been stored.)

Consider a situation where 'G' maps to the same hash code as 'B', and a search is undertaken. The retrieval algorithm will start looking at data items starting at that hash code, and continue comparing each hash item's contents for a match with 'G', until either the blank element is found, or (if the array is full) the probing loops back and ends up where the traversal started.

...otherwise, keep probing for the next item, looping back to zero if necessary

return null;}

if nothing has been returned by now, data is not present

Deleting Data

The basic deletion operation is merely a retrieval followed by data removal (clearing the hash element, once the target has been found.)

Unfortunately, this has a negative side-effect on the way the retrieval works. Since the data retrieval operation relies on blank hash elements as the signal to stop probing, there is the possibility that a deletion operation will render some data items unfindable. Consider where a search for 'R' (which has the same hash code as 'A') is attempted, after 'D' has been deleted:

The data 'R' will never be found, as the probing had terminated too early; this is due to the hash element that stored 'D' (and kept the probing going) being deleted.

The solution to this problem is to define two different kinds of blank hash elements:

a purely empty element, which has never stored data; and

an "empty but deleted" element, which stored data that has since been deleted.

These can be used to differentiate the situations in how a clear hash element came to exist; something that will be necessary to make the hash retrieval work again.

When a data item is deleted, it is not completely cleared, but instead has the "empty but deleted" mark. The retrieval function must then be modified so that it will terminate probing only on a purely empty element, and continue probing if an "empty but deleted" element is encountered.

An add operation can store data in the "empty but deleted" element.

As the deleted flag is only necessary to continue searching, adding data to one of these elements makes it work like just another normal element again (as far as the probing algorithm is concerned.)

Table Pollution

The 'deleted flag' approach is a relatively convenient way to resolve the problem of bridging together clusters of data for allowing data retrieval, however it has the side effect of causing those retrievals to take longer.

The reason for this is because over time, as the data structure is being manipulated, the number of deleted flags increase; as data is added and then deleted, what was once a purely empty element, becomes an empty but deleted element. Recall that for the 'deleted flag' technique to work, the retrieval can only terminate on a purely empty hash element.

These deleted flags may bridge together otherwise unrelated data (in terms of the search algorithm), and will gradually deteriorate the structure's efficiency. In the worst case, the hash table may prove no better than a basic linear search.

The only solution to the table pollution issue is to create a brand new hash table, add the existing data into the new table, and discard the old one.

Simply deleting the deleted flags won't work; they are an integral part of the probing process. It's just sheer coincidence that in the above diagram, none of the data items happened to move.

In a linear probing hash table, the place where data is actually stored depends a lot on what has happened in the table; past data collisions, earlier deletions and re-additions, etc.

Algorithm

A summary of the deletion algorithm is below. Note that it is very similar to that of the retrieve -- this is because the data must be found before it can be deleted.

Calculate the hash code for the given target key

Access the hash element

If the hash element is empty, the deletion has immediately failed.

Otherwise, check for a match between the target and data key

If there is a match, delete the hash element (and set the flag)

If there is no match, probe the table until either:

An match is found between the target key and the data's key; the data can be deleted, and the deleted flag set.