Lookup Tables

A table is used to store elements.
An element may be inserted in the table,
searched for (looked up) in the table,
and sometimes deleted from the table.
The idea of a table is very general and important in information
processing systems.
For example, a university data base might contain tables of
departments, staff, courses, lectures, students, exam results.
A compiler might use tables of
keywords, variables, types, constants, procedures and functions.
A table is a mutable data structure,
it has state,
it changes with time.
Its
properties
are therefore dynamic and are
expressed by pre and postconditions (assertions):

When the table has just been cleared,
lookup returns false for any element.
When an element is added to the table,
lookup returns true for that element
and the contents are otherwise unaffected.
If a new element is added to the table and then deleted
there is no net effect on the table.
Two further situations need to be considered.
If an element is in the table and an attempt is made
to insert it again, the result could either be an error (4a)
or no change (4b).
If an element is not in the table and an attempt is made to delete it,
the result could either be an error (5a) or
no change (5b):

Sometimes the element type
consists of a key
plus some other attributes or data.
The key might be a number or a name (string);
it is something unique to identify an element.
In this case the lookup operation usually returns the associated
attributes rather than just true or false.
If there is no extra data we call the element itself the key.

It is important for a program to be efficient:
to use little space and
to run quickly.
How these aims can be achieved depends on the properties of
the data and particularly the key,
on the operations on the table
and on the frequency and pattern of use of the different operations.

We say that the key type is sparse if the key values in use
form a very small fraction of the number of possible values in the type.
We call it a dense type if most of the possible values come into play.
These are only loose definitions and each case must be judged on its merits.
For example,
if keys are alpha-numeric strings they are necessarily sparse because
there are infinitely many possible strings.
Even if the string length is limited but reasonably large there
are a very great many strings.
On the other hand,
the days of the week form a dense enumerated type.
The sparseness of (some) key-types
has a large influence on the method of choice for implementing a table.
Indeed many of the data structures described in this book
have been devised precisely to deal with sparse types efficiently.

A table where the element is all key
can also be thought of as a set.
If the element type is a dense index type then
the table can be implemented as a bit mask.
Lookup is done by the membership test, in,
addition and deletion by set union and difference.
There is a limit of the range of element values in
some programming languages the provide a built-in `set' data-type.
A large, dense set can be implemented by an array of boolean
so this can also be used to implement the corresponding table.
Conversely, a sparse set can be implemented by the methods
for tables described below.

A table to hold a limited number of elements with a sparse key
can be implemented using an array and a counter.
Initially the table is empty and the counter is set to zero.
When a new element is inserted
the counter is incremented and the new element stored at that index.
The elements are in no particular order other than that of time
of insertion.
Lookup is by the linear-search algorithm
which steps sequentially through the array.

If there are n elements in the table, the search takes
O(1) time in the best case and
O(n) time in the average and worst cases.
Deletion first requires a search to find the element to delete;
after that it takes O(1) time to actually remove it:
it is over-written with the last element in the array and
the counter is decremented.
On the other hand, insertion is fast and takes a constant amount of time,
provided that the element is known to be absent in which case
a search is unnecessary.

Notes.

A person does not
use linear search to find a name in a telephone directory,
except perhaps for Dustin Hoffman's character in `The Rain Man'.
Rather the directory is opened at a likely place,
and the key compared with names on that page.
If the key is alphabetically before the names on the page,
the search is continued in the front part of the directory
otherwise the back part.
The fact that the directory is alphabetically ordered
or sorted is essential to
this process.
The binary-search algorithm makes use of an ordering of
the key type in a similar way.
The table is implemented by a sorted array.
Two "pointers" are used to delimit the search area,
initially the occupied part of array.
The middle element is compared with the key and either the
lower pointer moved up or the upper pointer moved down as appropriate.

Change the data in the HTML FORM below, click `go', and experiment.
The region of the array from `lo' to `hi-1' is
[bracketted] and a[mid] is starred `*' in the trace:

array:
search key:
output:
trace:

Correctness

The algorithm is short but there are some subtle points to observe.
Firstly, if the key is present it lies between `lo' inclusive and `hi'
exclusive - this is the loop invariant.
The invariant is true at the start of the first iteration because
lo=1 and hi=n+1.
The loop body ensures that the invariant remains true
at the start of each subsequent iteration.
This is so because of the choice of test in the central if statement.
If the mid element is less than or equal to the key and lo is moved up to mid
the invariant is still true.
The converse is that the mid element is greater than the key
and when hi is moved down the invariant is still true.
Note the connection between the hi-1 in the invariant and the `>=' test.
When hi=lo+1 the search has narrowed to just one element,
A[lo..hi-1] = A[lo..lo] = A[lo],
which can be compared to the search key for equality.

Termination

The algorithm terminates because the two integer-valued pointers are
moving closer together and the exit test must eventually succeed.
If hi>=lo+1 then hi>lo+2 and so lo<mid<hi.
Therefore regardless of whether `lo' is moved up to mid or hi is moved
down to mid, lo and hi move closer together at each iteration
and closer to satisfying the termination test.

Testing

It is very easy to get the details of binary search wrong
when coding it so it is a good idea to test it on at
least the following cases:

an empty array!

an array of one element, (a) equal to the key or (b) less than it
or (c) greater than it

several elements with the key less than the 1st or

the key greater than the last element

the key equal to the 1st, a middle, or the last element

the key not in the array and falling between various positions

Complexity

Finally, the algorithm takes O(log(n)) time
as each iteration halves the search area.
Note that it is not worth testing for equality within the loop
and exiting.
To do so would add an extra test and time to every iteration.
The average number of iterations saved would be limited by

1 × 1/2 + 2 × 1/4 + 3 × 1/8 + ...

which is a convergent series;
it converges to two.
So the average number of iterations saved would be bounded by two
but the (averge and worst-case) loss in time
due to the extra test in each iteration
executed would be unbounded [Dijkstra].

Binary-search takes O(log(n)) time in all cases which is much
faster than linear search.
If the table does not change once created,
ie. there is no insertion or deletion,
it can be sorted once in O(n*log(n)) time;
see the chapter on sorting.
Insertion and deletion must keep the array sorted
and they take O(n) time at worst and on average which is relatively slow.
This is because insertion and deletion
to the left of position n require elements to be moved up or down
to make space or to fill the hole.
Using a sorted array may have an extra benefit if the elements are
also processed sequentially, for example to print alphabetical lists.

For an apparently crazy idea hashing works remarkably well.
A hash function maps key values onto the index
type of an array which implements the hash table.
Elements are accessed directly using the hash index
as an array index.
The surprising part is that the best hash functions are
essentially pseudo-random functions.
The ideal hash function maps the actual key values uniformly onto
the index type;
unfortunately the ideal is not always achieved.

For many purposes, a "reasonable" hash function on strings
is to take the binary value of the 1st character, multiply by three,
add the binary value of the 2nd character, multiply by three, and so on,
finally adding the binary value of the last character.
The hash value is this total modulo the size of the hash table.

The multiplication by three in the hash function ensures that
every character contributes to the least significant bit of the
hash index.
Changing any one character in the string by one
will give a different hash index
if the size of the hash table is even,
and it is often a power of two.
Multiplication by an even number would cause the effect of the leading
characters of a long string to be lost (exercise: why?).

It is impossible to design a good hash function for all circumstances.
In general it is necessary to analyse the statistical properties of the key
type and design a hash function to suit.
For example,
if a hash table is used to manage a table of identifiers in a compiler,
(some) programmers are fond of names like
x1, x2, x3, y1, ..., z1, ...
some of which are liable to hash to the same index.
Such an occurrence is called a collision.
Collisions are in fact inevitable because the key type is infinite
and cannot otherwise be squeezed into a finite index range.
The solution is to hash the key and examine
the table at that index.
If a different key is already at that position
then a collision has occurred.
A probing strategy then looks at a sequence of positions
until a blank is found.

Linear probing is the simplest probing strategy.
It looks at successive elements, modulo the table size,
until a blank is found.
The new element is inserted there.

The same probing strategy is used in the search.

Entries may cluster together in blocks as a result of collisions
as the table fills up.
Linear probing amounts to using a linear search on the collision blocks.
Its average time complexity is linear in the size of the average block.
A good hash function will minimise the size of blocks.

Quadratic probing is a more sophisticated probing strategy.

The step-size increases linearly so the
indices generated,
i, i+1, i+3, i+6, ...,
are described by a quadratic in the step number
modulo the table size.
Quadratic probing is not guaranteed to probe every location in the table -
an insert could fail while there is still an empty location
however hash tables are rarely allowed to get full - see later.
The same probing strategy is used in the associated search routine.

The virtue of quadratic probing is that the probe locations for
two near-miss keys, as opposed to two colliding keys,
are not closely related and this reduces the likelihood
of large collision blocks forming.
With linear probing on the other hand,
collision blocks following from similar indices tend to join together
slowing the search process.
For example,
if one set of keys all hash to `6' and another set hash to `7'
then the collision blocks of these sets will over-run and merge
with each other under linear-probing.
With quadratic probing on the other hand,
the probes after 6 are at positions 7, 9, 12, 16, ...
and the probes after 7 are at positions 8, 10, 13, 17, ....
This tends to give small disjoint blocks.

Simulations can give an estimate of the performance of a hash table.
The time taken by insertion and lookup
is dependent on the number of collisions.
The results given here are based on randomly generated keys
with a uniform distribution.
This is the best possible case;
typical results may therefore be somewhat worse.

It can be seen that with either probing strategy,
a hash table less than 60% full is very efficient indeed.
Average insertion and search time is essentially constant, O(1).
Quadratic probing performs better for a nearly full table.

If the number of entries in the table is liable to grow, a common strategy
is to increase the table size when the number
of entries reaches a certain percentage of the table's current capacity
or when the collision rate reaches some critical level.
Elements may have to be re-hashed when moved into the enlarged table.
The table may also be shrunk if the number of entries falls
below some limit.

Another way of dealing with collisions is known as chaining.
Here, each hash table entry contains a pointer to a list of colliding
elements.
The lists are created in a collection.
Deletion is more easily implemented under chaining.
A linear search is used within a list.
Provided the hash function is good, each list
is kept short and the operations are fast.
See the chapter on lists for the necessary techniques.

The hash function randomises the order in which elements are stored
in the hash table.
This makes it difficult to process the elements
sequentially and may rule against the use of a hash table
in some applications.

Exercises

A table is implemented by an array of 1000 elements;
initially there are 600 members.
What are the approximate relative efficiencies of using

linear search and an unsorted array

binary search and a sorted array

hashing

under the following circumstances

1000 insertions, 1000 deletions and 1000 lookups (interleaved)

100 insertions, 100 deletions and 2800 lookups (interleaved)

Implement a delete operator for a hash table
under linear-probing.
Think carefully about what happens if x and y collide,
and the sequence

insert(x,T); insert(y,T); delete(x,T); lookup(y,T)

is performed.
One possible technique is to mark entries such as x as `deleted'.
Search terminates at an empty array slot but continues over a deleted slot.
Performance may deteriorate if there are many deletions.
Think of other techniques.

Long:
Implement a small data base of telephone numbers.
The key type is a name, assumed unique.
The associated data is a telephone number of up to 10 digits.

The data base commands are:

I name number -- insert a new entry

D name -- delete an entry

P name -- print the number associated with name

(i) Implement a suitable table as a
class or module
(depending on your programming language)
which uses a hash-table internally and
exports the table operations.
(ii) Implement an alternative table class/module which uses
a sorted array and binary-search internally
and exports the table operations.
(iii) Implement a command processor which makes use of a
table class/module.
Ideally each part should be carried out by a different programmer.
Consider the issue of errors, such as trying to print a missing entry,
and fully specify the module interface
before starting to write code.
Test the components in isolation before putting them together
into a complete system.