Hash Table Data Structure

In many applications we work with large data structures in which we need to search for, insert, modify or delete elements. These structures can be vectors, matrices, lists and so on. In the best case, if the data is kept sorted, an element can be found in O(log n) time using binary search. However, there are data structures that do not need sorting in order to find an element, which can save us precious time. Hash tables are data structures that have this property. Imagine we have four string elements, as follows:

B = (“abc”, “painter”, “abacus”, “fuss”)

We build a vector Index to indicate the order in which we should place the words in a sorted vector. The alphabetical order for these words is “abacus”, “abc”, “fuss”, “painter”, so the Index vector would look like:

Index = (2, 4, 1, 3)

signifying that the first word in the array should be placed in the second position of the sorted array, and so on. We obtain the sorted vector B' by placing each word at the position that its Index entry names:
B'[Index(i)] = B[i], for i = 1, 2, 3, 4.

This procedure is called indexing. With comparison-based sorting, the Index vector cannot be constructed in less than O(N log N) time, but this only needs to be done once. After this, searches can be made really quickly. If we add or delete items along the way, we will spend some time maintaining the index, but in practice this is much less than the time that would be lost if searching took longer.
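The indexing procedure can be sketched in Python as follows. This is only an illustration: the names `order`, `index` and `sorted_B` are made up for the example, and because Python lists are 0-based, 1 is added to each rank so that the result matches the 1-based positions used in the text.

```python
# Words from the example in the text.
B = ["abc", "painter", "abacus", "fuss"]

# order[k] is the 0-based position in B of the k-th smallest word.
order = sorted(range(len(B)), key=lambda i: B[i])

# index[i] is the 1-based position of B[i] in the sorted vector.
index = [0] * len(B)
for rank, i in enumerate(order, start=1):
    index[i] = rank

# Place each word at the position that its index entry names.
sorted_B = [None] * len(B)
for i, word in enumerate(B):
    sorted_B[index[i] - 1] = word

print(index)     # [2, 4, 1, 3]
print(sorted_B)  # ['abacus', 'abc', 'fuss', 'painter']
```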

In some situations, unfortunately, we cannot do any indexing on a data structure. Consider the case of a program that plays chess. The number of possible positions for the pieces on the chessboard is too high. In such cases we use the data structure known as a hash table.

Suppose we want to keep track of which numbers smaller than 1000 appear in a list, using a hash table H with 1000 Boolean elements. Initially, all elements of H are set to False (or 0). If the number 400, for example, was found in the list, we would only have to set the value of H(400) to True (or 1). The next time we search for this number we only need to examine the element H(400), and because it is True, we know that the number was found. If we delete an item from the hash table, all we need to do is set the corresponding element to False.
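A minimal sketch of this Boolean table in Python, assuming the values are integers from 0 to 999. The function names `insert`, `contains` and `delete` are chosen for the illustration only.

```python
SIZE = 1000          # the table covers the values 0..999
H = [False] * SIZE   # initially no value has been found


def insert(x):
    """Record that the value x was found."""
    H[x] = True


def contains(x):
    """Return True if x was recorded earlier and not deleted since."""
    return H[x]


def delete(x):
    """Forget the value x."""
    H[x] = False


insert(400)
print(contains(400))  # True
delete(400)
print(contains(400))  # False
```

Each operation touches exactly one element of the array, which is why no sorting or scanning is needed.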

Now suppose that instead of 1000 numbers we have to represent up to 2 billion elements. We surely cannot use a regular array in this situation. The solution is to use a mod operation to group together elements that share a common property, namely the same remainder. If we can store up to 100,000 elements in our array, we choose M = 100,000 and insert each element at position f(value) % M, where f is called a hashing function. Since several elements can be mapped to the same position, our hash table will end up as an array of lists.
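The array of lists can be sketched as below. The class name `ChainedHashTable` and its methods are hypothetical, and the identity function is assumed for f, which is only reasonable for integer keys.

```python
class ChainedHashTable:
    """Hash table stored as an array of M lists, one list per position."""

    def __init__(self, m=100_000, f=lambda value: value):
        self.m = m
        self.f = f  # hashing function; the identity default is an assumption
        self.slots = [[] for _ in range(m)]

    def _position(self, value):
        return self.f(value) % self.m

    def insert(self, value):
        chain = self.slots[self._position(value)]
        if value not in chain:
            chain.append(value)

    def contains(self, value):
        return value in self.slots[self._position(value)]

    def delete(self, value):
        chain = self.slots[self._position(value)]
        if value in chain:
            chain.remove(value)


table = ChainedHashTable()
table.insert(1_999_999_999)
table.insert(1_899_999_999)  # same remainder modulo 100,000, so same list
print(table.contains(1_999_999_999))  # True
print(table.slots[99_999])            # [1999999999, 1899999999]
```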

Hash Tables: The division method

One of the most common approaches when working with a hash table is to insert each element x at the position given by x % M, where M is the number of table entries. The challenging aspect of this method is choosing the right value for M, so that the number of collisions stays small for any input. In addition, the larger M is, the fewer keys will be assigned to the same index. Suppose, for example, that our array has only 5 entries and we need to store the keys 0, 10, 7, 9, 3, 4.

At index 0 we would store 0, 10: 0 % 5 = 0, 10 % 5 = 0

At index 1 we would not store anything

At index 2 we would store 7: 7 % 5 = 2

At index 3 we would store 3: 3 % 5 = 3

At index 4 we would store 9, 4: 9 % 5 = 4, 4 % 5 = 4.
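The placement above can be reproduced with a few lines of Python. The variable names are illustrative; each table entry is a list so that colliding keys can share an index.

```python
M = 5
keys = [0, 10, 7, 9, 3, 4]

# One list per table entry.
table = [[] for _ in range(M)]
for x in keys:
    table[x % M].append(x)

for i, entry in enumerate(table):
    print(i, entry)
# 0 [0, 10]
# 1 []
# 2 [7]
# 3 [3]
# 4 [9, 4]
```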

A good choice for M is a prime number that is not close to any power of 2. For example, instead of a table with M = 10,000 entries we can use one with 9973 entries. This tends to decrease the number of items stored at the same index and, as a consequence, searching in the hash table becomes faster.

Hash Tables: The multiplication method

The multiplication method stores each item at an index given by the hash function h(x) = [M * {A * x}].

Here A is a positive number, 0 < A < 1, [y] means the integer part of y, and {A * x} means the fractional part of A * x, i.e. A * x - [A * x]. For example, if we choose M = 1234 and A = 0.3, for x = 1997 we would obtain h(x) = [1234 * {599.1}] = [1234 * 0.1] = [123.4] = 123. Note that the function h produces numbers between 0 and M-1, just like the mod function used in the division method.
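The formula can be checked with a short Python sketch. The function name `mult_hash` is hypothetical, and the computation uses ordinary floating-point arithmetic, so the `min` call is an added safeguard against rounding rather than part of the method itself.

```python
import math


def mult_hash(x, m, a):
    """Return [m * {a * x}] for a non-negative key x and 0 < a < 1."""
    frac = (a * x) % 1.0  # fractional part of a * x
    # min() guards against floating-point rounding pushing the result up to m.
    return min(m - 1, math.floor(m * frac))


print(mult_hash(1997, 1234, 0.3))  # 123
```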

In this case, the value of M is not critical, as opposed to the division method. We can therefore choose it according to our needs in terms of the number of elements that we need to store.

Hash Tables: Hashing – points to remember

For the division method, the value of M (the size of the table) is important and greatly influences the efficiency of the hashing algorithm we choose to implement. Good performance is generally obtained with prime numbers that are not close to a power of 2. For example, if we choose M = 100, we would just take the last two digits of each element, so that all elements ending in the same two digits are inserted at the same position (e.g. 123, 23 and 223 would all go to position 23). This would result in an increased number of collisions and hence an inefficient implementation of our hashing algorithm.
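A quick check of this example in Python. The prime 97 is an arbitrary choice made for the comparison, and `positions` is a hypothetical helper.

```python
def positions(keys, m):
    """Return the table position x % m for every key."""
    return [x % m for x in keys]


keys = [123, 23, 223]
print(positions(keys, 100))  # [23, 23, 23]
print(positions(keys, 97))   # [26, 23, 29]
```

With M = 100 all three keys collide, while with M = 97 each key gets its own position.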

A good real-world example that would require the use of a hash table is the implementation of a routing table for an ISP. Just imagine that there can be up to millions of routers handled by such a company. When a packet of information needs to be routed to a certain IP address, searching for the optimal route can be done efficiently if a hash table is used to store all these addresses.

Hash Tables: Dealing with collisions – open addressing

In the examples we have seen so far our hash table was basically an array of lists. But what happens if we really do not want to store multiple elements at the same index? In this case we can use the technique known as open addressing. When we want to add an element and it collides with another item already stored in the hash table, this technique uses probing, which means scanning for an alternate location in which to place the current element.

There are multiple types of probing:

Linear probing: We simply search linearly for the next available position in the array, usually in steps of 1. If, for example, our hash function indicates an element should be stored at position 3, which already holds a value, we start looking for the next available location, examining 4, 5, 6 and so on, until the first free position has been found.

Quadratic probing: The difference from linear probing is in the interval between probes, which grows quadratically. Take the same example as above. Instead of examining every single position from 4 onwards, we examine position 4 first and then the positions at an offset of i^2 + 2 from it, where i = 1, 2, 3, ... is the probe number: 4 + 1^2 + 2 = 7, then 4 + 2^2 + 2 = 10, then 4 + 3^2 + 2 = 15, and so on, wrapping around the end of the table if necessary.

Double hashing: The interval between probes is computed by an additional hash function, usually different from the one we used initially.
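The three probing strategies can be sketched as generators of candidate positions. Everything here is an assumption made for illustration: the function names, the table size of 11, the hash function x % 11, and the second hash function 1 + (x % 10) used for double hashing. In each generator `start` is the first position examined.

```python
from itertools import islice


def linear_probes(start, m):
    """Yield start, start + 1, start + 2, ... wrapped around a table of size m."""
    for i in range(m):
        yield (start + i) % m


def quadratic_probes(start, m, c=2):
    """Yield start, then start + i^2 + c for i = 1, 2, ... (mod m), as in the text."""
    yield start % m
    for i in range(1, m):
        yield (start + i * i + c) % m


def double_hash_probes(start, step, m):
    """Yield start, start + step, start + 2 * step, ... (mod m)."""
    for i in range(m):
        yield (start + i * step) % m


def insert(table, key, probes):
    """Store key in the first free slot along the probe sequence."""
    for pos in probes:
        if table[pos] is None:
            table[pos] = key
            return pos
    raise RuntimeError("no free slot found along the probe sequence")


M = 11
table = [None] * M
for key in (3, 14, 25):  # all three keys hash to position 3
    insert(table, key, linear_probes(key % M, M))
print(table[3:6])  # [3, 14, 25]

# The quadratic example from the text, in a table of 20 slots.
print(list(islice(quadratic_probes(4, 20), 4)))  # [4, 7, 10, 15]

# Double hashing for key 14: start 14 % 11 = 3, step 1 + 14 % 10 = 5.
print(list(islice(double_hash_probes(3, 5, M), 3)))  # [3, 8, 2]
```

Choosing a prime table size for double hashing means every step between 1 and M - 1 eventually visits all slots.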

Hash Tables: Universal hashing and bucket hashing

To improve the performance of our hashing algorithm and reduce the number of collisions, we can prepare more than one hash function and pick one of them at random when the table is created. The chosen function must then be used for every insertion and search in that table, since otherwise a stored element could not be found again; a new function is drawn only when the table is rebuilt. This technique is known as universal hashing and such a collection of hashing functions is called a family of functions. For example, we can pick the functions (x + 1, 2x + 3, 3x + 4 and 4x + 2). Because the function is chosen at random, no fixed set of keys collides badly for every choice, which on average gives a more uniform distribution across the table and hence a smaller number of collisions.
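A minimal sketch of a table that draws its hash function from the family listed in the text. The class name `RandomizedHashTable` is hypothetical, and the four functions serve only as an illustration of a family. The sketch picks one function when the table is created and reuses it for every operation, because a lookup has to compute the same index that the earlier insertion used.

```python
import random

# The family of functions given in the text.
FAMILY = [
    lambda x: x + 1,
    lambda x: 2 * x + 3,
    lambda x: 3 * x + 4,
    lambda x: 4 * x + 2,
]


class RandomizedHashTable:
    def __init__(self, m, rng=random):
        self.m = m
        self.f = rng.choice(FAMILY)  # drawn once, then kept for this table
        self.slots = [[] for _ in range(m)]

    def _index(self, x):
        return self.f(x) % self.m

    def insert(self, x):
        chain = self.slots[self._index(x)]
        if x not in chain:
            chain.append(x)

    def contains(self, x):
        return x in self.slots[self._index(x)]


t = RandomizedHashTable(97)
t.insert(10)
t.insert(107)
print(t.contains(10), t.contains(107), t.contains(5))  # True True False
```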

In order to improve the performance of the hash table even more, we could also make use of the concept known as bucket hashing. The main idea is to divide the M slots available in the hash table into B buckets. Thus every bucket would consist of M / B slots. The hash function would then assign values only to the first slot of every bucket. If this is unavailable, then we linearly search for the next free slot within the bucket. If, however, the bucket is full, we insert the element into an overflow bucket at the end of the hash table. This bucket has infinite capacity and is shared by all the other buckets. It is ideal to have as few elements as possible deposited in the overflow bucket.
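Bucket hashing can be sketched as follows, assuming integer keys, no deletions, and a hash function key % B that selects the bucket. The class name `BucketHashTable` is hypothetical, and an ordinary Python list stands in for the overflow bucket of unlimited capacity.

```python
class BucketHashTable:
    def __init__(self, m, b):
        if m % b != 0:
            raise ValueError("m must be a multiple of b")
        self.b = b
        self.bucket_size = m // b  # every bucket has M / B slots
        self.slots = [None] * m
        self.overflow = []         # shared overflow bucket

    def _bucket_start(self, key):
        """Index of the first slot of the bucket that key hashes to."""
        return (key % self.b) * self.bucket_size

    def insert(self, key):
        start = self._bucket_start(key)
        # Linearly search for the next free slot within the bucket.
        for pos in range(start, start + self.bucket_size):
            if self.slots[pos] is None:
                self.slots[pos] = key
                return
        self.overflow.append(key)  # bucket is full

    def contains(self, key):
        start = self._bucket_start(key)
        for pos in range(start, start + self.bucket_size):
            if self.slots[pos] is None:
                return False       # a free slot means the bucket never overflowed
            if self.slots[pos] == key:
                return True
        return key in self.overflow


t = BucketHashTable(m=10, b=5)   # 5 buckets of 2 slots each
for key in (0, 5, 10):           # all three keys hash to bucket 0
    t.insert(key)
print(t.slots[0:2], t.overflow)  # [0, 5] [10]
```

Searching the overflow list is linear, which is why it pays to keep it as small as possible.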
