11.9 Finding the Index for Partially Matched Strings

The problem
considered here concerns a large number of string keys that need to
be accessed by full or partial match. Each string is unique, so the
full-match access can easily be handled by a standard hash-table
structure (e.g., java.util.HashMap). The
partial-match access needs to collect all objects that have string
keys starting with a particular substring.

Consider this hash table consisting of keys and values:

"hello" 1
"bye" 2
"hi" 3

The full match for key "hi" retrieves
3, and the partial match against strings starting
with "h" retrieves the collection
{1,3}. Using a hash-table structure for the
partial-match access is expensive because it requires that all keys
be iterated over, and then each key matching the corresponding object
needs to be collated.

Of course, I am considering here a large collection of strings.
Alternatives are not usually necessary for a few (or even a few
thousand) strings. But for large collections, performance-tuning
techniques become necessary.

To tune, look for data structures that quickly match any partial
string. The task is somewhat simpler than the most generic version of
this type of problem because you need to match only the first few
consecutive characters. This means that some sort of tree structure
is probably ideal. Of the structures available from the JDK,
TreeMap looks like it can provide exactly the
required functionality; it gives a minimal baseline and, if the
performance is adequate, there is no more tuning to do. But
TreeMap is 5 to 10 times slower than
HashMap for access and update. The target is to
obtain HashMap access speed for single-key access.

Don't get carried away searching for the perfect
data structure. Thinking laterally, you can consider other
possibilities. If you have the strings in a sorted collection, you
can apply a binary search to find the index of the string that is
greater than or less than the partial string, and then obtain all the
strings (and hence corresponding objects) in between.

More specifically, you can construct a sorted array of keys from the
hash table. Then, if you want to find all strings starting with
"h", you can run a binary search for the strings
"h" and "h\uFFFF". This gives
all the indexes of the band for all the keys that start with
"h". Note that a binary search can return the
index where the string would be even if it is not actually in the
array. (The correct solution actually goes from
"h" inclusive to "i" exclusive,
but this solution will do for strings that don't
include character \uFFFF.)

Having parallel collections can lead to all sorts of problems in
making sure both collections contain the same elements. Solutions
that involve parallel collections should hide all accesses and
updates to the parallel collections through a separate object to
ensure that all
accesses and updates are consistent.
The solution here is suitable mainly when the collections are updated
infrequently, e.g., when they are built once or periodically and read
from often. Here is a class implementing this solution:

Note that
this solution has a wider application than only string keys. Any type
of object can be used as a key as long as you can create a
methodology to compare the order of the keys. This is a reasonable
solution for several types of indexing.