1.2. Representing Tables

We have studied a couple of data structures that could be used to
implement tables relatively efficiently. What are they?

Answer:A Heap would be good for insertion and deletion,
but terrible for retrieval. In most applications, retrieval is the
principal operation: you build up a table initially with a
sequence of insertions, and then do a large number of retrievals;
deletion is usually rare. The importance of retrieval makes heaps a
poor way to implement tables.

The best choices are Binary Search Trees (especially if balanced),
or B-trees, giving O(logN) insertion, deletion and retrieval.

Can we do any better? We will look at a technique called hashing
that aims to make these operations constant time. That may
seem impossible, but hashing does indeed come very close to achieving
this goal.

To see how this might be done, let's start with a familiar data
structure and try to extend it to handle tables in general. Can anyone
think of a simple data structure that allows any of its elements to be
accessed in constant time? You are all familiar with one...

Answer:The array. We can access any position of an array
in constant time. We think of the subscript as the key, and
the value stored in the array as the data. Given the key, we
can access the data in constant time.

For example, suppose I wanted to store in a table information about
the students in this class. I could use an array of size 100, say, and
assign to each student a particular position in the array. I'd tell
this number to the student, calling it his/her student number. When
you came to ask me about your marks, or when I needed to update your
record, I'd use your student number as a subscript in the array.

This is the basic idea behind a hash table. In fact, the only flaw
in the strategy that needs to be addressed is the step in which I tell
you what your ``student number'' is. In practice, we usually do not
control the key values: the set of possible keys is given to
us as part of the problem, and we must accommodate it.

To carry on with our example, suppose that circumstances forced me
to use some part of your personal data as the key - say your social
insurance number. Because this is a number, I can still use my
original strategy. I'd use your social insurance number as an array
subscript, and store your information in the position that it indexed.

Can anyone see the problem with this?

Answer:Social insurance numbers are 9 digits long, so I
would need an array of size 10**9, even though I only have information
about 100 students.

The constraints, then, that we are working with are these:

The set of possible key values is very large. This set
might even be unbounded - imagine that your name was to be used
as the key: there are an infinite number of different names.

The set of actual key values is quite small

To get constant-time operations, we must use an array to store
the information.

The array cannot possibly be large enough to have a different position
for every possible key. And, in any case, we must be able to
accommodate keys of types (such as real numbers or strings) that are
not legitimate (in C) as array subscripts.