
An STL-Compatible Hybrid of Linked List & Hash Map


"Linked_hash" is an STL-compatible data structure that combines the strengths of the linked-list and hash-map classes.

August 2005

William is the Chief Software Engineer for Stage Logic LLC, a small software development company. He specializes in developing CORBA-based real-time systems. He can be contacted at bill@stagelogic.com.

On a recent project, I found myself facing a perplexing design problem. I needed a data structure that provided very high random-access performance, while simultaneously providing high performance when iterating through the entire structure. Because the project was being developed in C++, I first turned to the Standard Template Library (STL). However, I couldn't find a single data structure that satisfied both requirements. Undeterred, I decided to create my own STL-compatible structure: a hybrid linked-list/hash-map class I call "linked_hash."

The Not-Quite-Sufficient STL Structures

My requirements for a data structure were clear. I needed to support a large number of random accesses to individual items in the structure, followed by an iteration through the whole structure, and I needed to support it hundreds of times per second. Performance was, therefore, a top priority, but (being lazy) I first examined the STL to see if there were any existing data structures that met my needs. In this case, the specific STL implementation I used was the GNU libstdc++ Library (http://gcc.gnu.org/libstdc++/).

The data structure that I looked at was a hash map. The hash map would provide me with constant-time random access, meaning that in the average case, the amount of time required to access a single element would be constant, regardless of how many items were stored in the data structure. However, the problem with a hash map is the nonaverage case. If the keys for two or more items end up having the same hash value, then they all end up in the same hash bucket. The random-access time for the hash map then starts to climb, as the bucket has to be searched linearly for the proper key. The solution is to use a good hashing function, and ensure that you have plenty of distinct buckets. In the case of the GNU hash_map implementation, though, iteration through the structure is proportional to the number of buckets, as well as the number of items. Consequently, a sparsely populated hash map with lots of buckets can't be efficiently iterated through. So the hash_map failed the iteration requirement and was unusable for my structure.

The next structure I considered was just a plain old map. Since it uses a tree for its underlying storage (at least in the GNU implementation), it can be iterated through linearly. It also has decent performance for random access, since the tree never gets very tall (the height is the log of the tree's size). However, in my particular situation, decent performance wasn't good enough. Even if the height of the tree was only four or five levels, that was still four or five comparisons for each of the thousands of random accesses per second. It just didn't quite meet my requirements for scalable performance.

Finally, I briefly considered a linked list or an array, but the random access for a list provides a totally unacceptable linear search performance, and an array would be too costly to expand or shrink as new items were added or deleted. Instead, I was faced with the certainty that I'd have to push aside my laziness and create a new data structure. Since I didn't want my laziness to go too far, I decided to create my data structure as a reusable, STL-compatible class instead of some tightly coupled specialty structure. That way, I know I can be lazy the next time I need this type of structure.

Implementing the Structure

To meet both a constant-time random access and linear iteration, I decided to make a data structure that is essentially a hybrid linked-list/hash-map. Acting as a hash map, it lets individual elements be accessed in constant time, using a key. In addition to storing the individual elements in a hash map, the structure supports linear-time iteration through the elements by linking them together in a linked list.

The interface to the linked_hash structure follows the interface for the GNU STL's hash_map (the hash_map isn't a part of the STL Standard, but the GNU implementation follows the SGI STL, which has become somewhat of a de facto standard). Although iteration through the structure behaves like a linked list, the class does not support list-specific methods such as push_front() or push_back(), or insertion at a specific position in the list.

To store data in the structure, I decided to make use of the existing hash_map and list classes, and provide the linked_hash class as a wrapper around these two structures. Instead of directly storing my key/item pairs in either the hash map or list, I created a _linked_hash_node structure to contain the data (see Listing One). The internal list and hash_map then store pointers to nodes; each node is simultaneously stored in both the hash_map and the list. Each node also stores iterators that point to the locations where the node is stored in the list and hash_map. This way, the linked hash can fully manipulate the node, regardless of whether it was located through the list or the hash map. For example, if the erase() method is called with a key, the correct node will be located using that key to look it up in the internal hash map. Once found, the iterator to the list that is stored inside the node will be used to also remove the node from the list. Without the iterators, erase() would have to search linearly through the list to remove the reference there, which would destroy the constant-time access performance.
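The node described above might look something like the following sketch (Listing One is not reproduced here, so the member names are illustrative, and std::unordered_map stands in for the GNU hash_map):

```cpp
#include <cassert>
#include <list>
#include <unordered_map>
#include <utility>

// Hypothetical sketch of the node: it holds the user-visible key/data pair,
// plus iterators recording where the node sits in each internal container,
// so it can be unlinked from both in constant time.
template <class Key, class Data>
struct linked_hash_node {
    typedef linked_hash_node* node_ptr;
    typedef typename std::list<node_ptr>::iterator list_iter;
    typedef typename std::unordered_map<Key, node_ptr>::iterator map_iter;

    std::pair<const Key, Data> value;  // the user-visible key/data pair
    list_iter list_pos;                // this node's position in the list
    map_iter  map_pos;                 // this node's position in the hash map

    linked_hash_node(const Key& k, const Data& d) : value(k, d) {}
};
```

Because each node carries its own positions, erase() never needs to search either container; it just hands the stored iterators back to list::erase() and the map's erase().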

Once I set up the data structures to hold the linked hash's data, I needed to create an iterator for the class. Listing Two is the iterator I developed, which is fairly straightforward. All I do is store a pointer to the node that is associated with the given iterator. Implementing the dereferencing operations for the iterator then becomes a simple task of going through the node to access the underlying data pairs that are stored therein. For the increment and decrement methods, I then simply make use of the list iterator stored in the node, and call its increment and decrement methods.
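The iterator idea can be sketched as follows (a minimal, hypothetical version: the iterator holds a node pointer, dereferencing goes through the node, and ++ follows the list iterator embedded in the node; a null node pointer represents end(), as described above; storing a pointer to the owning list so ++ can detect the end is an assumption this sketch adds):

```cpp
#include <cassert>
#include <list>

// Simplified node: just a value and this node's position in the list.
struct node {
    int value;
    std::list<node*>::iterator list_pos;
};

class lh_iterator {
    node* n_;                  // 0 represents end()
    std::list<node*>* owner_;  // the list the nodes live in (assumed member)
public:
    lh_iterator(node* n, std::list<node*>* owner) : n_(n), owner_(owner) {}
    int& operator*() const { return n_->value; }
    lh_iterator& operator++() {
        std::list<node*>::iterator next = n_->list_pos;
        ++next;                                   // step the embedded iterator
        n_ = (next == owner_->end()) ? 0 : *next; // fall off the end -> null
        return *this;
    }
    bool operator!=(const lh_iterator& o) const { return n_ != o.n_; }
};
```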

The linked_hash class itself has template parameters that match those of the hash_map class, namely a key, data type, hash function, class for testing the equality of two keys, and an allocator class. Inside the linked_hash class, I put several private typedefs that I'll use later in the class definition, such as the _linked_hash_node, the hash_map, and the list (see Example 1). You'll also see that I have two typedefs that use the Alloc::rebind<>::other class. The allocator rebind lets you take an allocator of arbitrary type (in this case, the one supplied as a template parameter to the linked_hash class) and create an allocator for a different data type, but of the same allocator type. For instance, if you provide linked_hash with a shared memory allocator for integers, the rebind gives you a shared memory allocator that (for example) allocates _linked_hash_nodes. In this case, the two rebound allocators that I typedef are for creating _linked_hash_nodes, and pointers to _linked_hash_nodes. I'll use the former for allocating nodes to store the data, and the latter as template parameters to the internal hash_map and list that store the nodes.
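The rebind idea can be illustrated with a short sketch. The article's code uses the classic Alloc::rebind<T>::other form; std::allocator_traits, shown here, is the equivalent modern spelling (the class and member names are illustrative):

```cpp
#include <list>
#include <memory>
#include <type_traits>

// Demonstration of re-targeting a user-supplied allocator at the internal
// node type and at node pointers, as described above.
template <class Key, class Data,
          class Alloc = std::allocator<std::pair<const Key, Data> > >
struct rebind_demo {
    struct node;  // the internal node type the allocators are re-targeted at

    // Rebind the user's allocator so it allocates nodes...
    typedef typename std::allocator_traits<Alloc>::template rebind_alloc<node>
        node_allocator;
    // ...and node pointers, for use by the internal containers.
    typedef typename std::allocator_traits<Alloc>::template rebind_alloc<node*>
        node_ptr_allocator;

    typedef std::list<node*, node_ptr_allocator> node_list;
};
```

If Alloc were, say, a shared-memory allocator, node_allocator and node_ptr_allocator would be shared-memory allocators too; only the allocated type changes.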

The constructors for the linked_hash class also match the hash_map class, and in most cases simply pass the constructor parameters on to the constructor for the underlying hash_map. In the case of the set of constructors that also take a range of elements, the constructors also use the linked_hash class's insert() operation to load the elements into the linked_hash as the object is being constructed. For full details on the linked_hash constructors, see the full source for the linked_hash class (available electronically; see "Resource Center," page 5).

For many of the simpler methods in the linked_hash class, I was able to directly make use of the corresponding method in the underlying hash_map class. The bucket_count() or hash_funct() methods, for instance, are concepts wholly used by the hash_map, and so it is trivial to just call the hash_map method. Similarly, for methods like size() and empty(), I can use the underlying hash_map function because each item in the linked_hash has exactly one corresponding node in the hash_map. Of course, that also holds true for the underlying list, so I could have just as easily used list::size() and list::empty(). As with the linked_hash constructors, you can see each of these method implementations in the source code for the linked_hash class.

With other methods, there was not a direct one-to-one correlation for the method implementations, but the linked_hash method implementation was still nearly trivial because of methods available in list or hash_map. For example, the swap() method was able to work by swapping both the hash_map and the list. Iterators, too, were simple. All I needed to do for the beginning iterator was to create a new linked_hash iterator from the beginning iterator of the underlying list object, and the ending iterator is just created with a NULL pointer instead of a list iterator pointer.

Now look at the insert() method in Example 2. When inserting a new data item, the first thing to do is to create a new node for holding the data. I do that using the node allocator that I created when I instantiated the class. This ensures that the memory will be allocated from the proper location (for example, using new, from shared memory, from a pool, and so on). Once I've allocated an empty node, I try to insert it into my internal hash_map. If that succeeds, I add the pointer to my list and construct a full node. If the insertion into the hash_map fails, then I know the key already existed; instead of constructing a new node, I deallocate the node pointer because I won't need it. Finally, the insert() method returns a pair with either the newly inserted data or the data that was already present for the supplied key, and a Boolean indicating success or failure.
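The insert() logic just described can be sketched compactly (Example 2 is not reproduced here, so this is a hypothetical version with int keys and data; std::unordered_map stands in for the GNU hash_map, and plain new/delete stand in for the rebound node allocator):

```cpp
#include <cassert>
#include <list>
#include <unordered_map>
#include <utility>

class lh_insert_demo {
    struct node {
        std::pair<const int, int> value;
        std::list<node*>::iterator list_pos;
        node(int k, int d) : value(k, d) {}
    };
    std::unordered_map<int, node*> map_;
    std::list<node*> list_;
public:
    // Mirrors hash_map::insert: returns the stored pair and a success flag.
    std::pair<std::pair<const int, int>*, bool> insert(int key, int data) {
        node* n = new node(key, data);             // allocate the node first
        std::pair<std::unordered_map<int, node*>::iterator, bool> r =
            map_.insert(std::make_pair(key, n));
        if (r.second) {                            // key was new: link it in
            n->list_pos = list_.insert(list_.end(), n);
            return std::make_pair(&n->value, true);
        }
        delete n;                                  // key existed: discard node
        return std::make_pair(&r.first->second->value, false);
    }
    ~lh_insert_demo() {
        for (std::list<node*>::iterator i = list_.begin(); i != list_.end(); ++i)
            delete *i;
    }
};
```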

Another interesting method to implement was erase() (Example 3). When a program goes to erase an element from the linked_hash, the object needs to remove the node references from both the internal hash_map and the internal list, then destroy the node itself. To examine exactly how this is done, look at the erase() method that takes a key and erases the element stored under that key (if it exists). To start, the method needs to see if such an element does indeed exist, so it calls find() on the internal hash_map. If the element does not exist, the method simply returns zero (the number of elements erased). If the element does exist, I first store a pointer to the node. Once I have the node stored, I then remove it from both the hash_map and the list, using the iterators that were stored in the node itself (that way, both erasures are constant time operations). With the node removed from both internal data structures, I then use my allocator to destroy the object and deallocate its memory. Finally, the method returns a 1, since one item was erased.
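The erase() steps above can be sketched as follows (again a hypothetical version, not the article's Example 3: std::unordered_map stands in for the GNU hash_map, new/delete stand in for the allocator, and insert() assumes distinct keys for brevity):

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

class lh_erase_demo {
    struct node;
    typedef std::unordered_map<int, node*> map_t;
    typedef std::list<node*> list_t;
    struct node {
        int data;
        map_t::iterator map_pos;    // stored positions make erasure O(1)
        list_t::iterator list_pos;
    };
    map_t map_;
    list_t list_;
public:
    void insert(int key, int data) {            // assumes key is not present
        node* n = new node();
        n->data = data;
        n->map_pos = map_.insert(std::make_pair(key, n)).first;
        n->list_pos = list_.insert(list_.end(), n);
    }
    // Returns the number of elements erased (0 or 1), like hash_map::erase.
    std::size_t erase(int key) {
        map_t::iterator found = map_.find(key);
        if (found == map_.end()) return 0;      // no such key
        node* n = found->second;                // remember the node
        map_.erase(n->map_pos);                 // unlink via stored iterators:
        list_.erase(n->list_pos);               // both erasures constant time
        delete n;                               // destroy and deallocate
        return 1;
    }
    std::size_t size() const { return list_.size(); }
    ~lh_erase_demo() {
        for (list_t::iterator i = list_.begin(); i != list_.end(); ++i)
            delete *i;
    }
};
```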

Finally, look at the implementation for the operator[] method (Example 4). It's only one line, but worth a brief discussion. Since the insert() method returns the existing element for a key, if there is one, you can implement the subscripting operator by attempting to insert a new element with a default value. If no element with that key exists, the new element will be inserted. If one does exist, it will be returned. Regardless, you can extract the data portion from the return value and return it.
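The same one-line trick can be demonstrated with std::map, whose insert() has the identical "return the existing element on a duplicate key" behavior that operator[] relies on (subscript() here is an illustrative stand-in for the member operator):

```cpp
#include <cassert>
#include <map>
#include <string>

// Implements subscripting purely in terms of insert(), as described above.
template <class Map>
typename Map::mapped_type& subscript(Map& m, const typename Map::key_type& k) {
    // Attempt to insert a default-constructed value. If the key already
    // exists, insert() returns the existing element instead, so .first
    // always points at the right pair either way.
    return m.insert(typename Map::value_type(k, typename Map::mapped_type()))
            .first->second;
}
```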

Using the Linked Hash

Because the linked_hash class conforms to an STL-compatible interface, you can use it just about anywhere you might use an STL map or a hash_map. For instance, say you have a bunch of devices that need to be able to receive and process events. Events are received asynchronously, but you want to process the events for all of the devices at a fixed interval. Example 5 shows how you could use a linked_hash to store all of your devices, keyed by a name string. When an event is received, it can be inserted into the queue for the appropriate device, using insert_event(). A processing thread would then be able to periodically call process() to iterate through all of the queues and process the received events.
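A sketch of that usage might look like the following (Example 5 is not reproduced here, so the names are illustrative, events are plain ints, and std::map stands in where the linked_hash would be used; because the interfaces are compatible, swapping in linked_hash would only change the typedef):

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <string>

struct Device {
    std::queue<int> events;  // events received asynchronously, pending
    int processed;
    Device() : processed(0) {}
};

typedef std::map<std::string, Device> DeviceMap;  // linked_hash in the article

// Random access: route one event to the named device's queue.
void insert_event(DeviceMap& devices, const std::string& name, int ev) {
    devices[name].events.push(ev);
}

// Iteration: drain every device's queue at the fixed processing interval.
void process(DeviceMap& devices) {
    for (DeviceMap::iterator i = devices.begin(); i != devices.end(); ++i) {
        while (!i->second.events.empty()) {
            i->second.events.pop();
            ++i->second.processed;
        }
    }
}
```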

How Does its Performance Compare?

Since my primary motivation for creating the linked_hash class was to maximize performance for the set of use cases I needed to support, it makes sense to take a look at how the performance for the linked_hash class really compares to other data structures. So, to test the performance, I wrote three test programs that test insertions, modifications, and iterations, using linked_hashes, hash_maps, and maps. For the linked_hash and hash_map, I also tested against instances with very large bucket counts set to ensure no memory reallocations would be needed.

The first test I ran was an insertion test, which inserts new items into an empty data structure. Figure 1 presents the results, showing the time necessary for insertions as the number of items inserted increases. With a large bucket count, you can see that the insertion time for the linked_hash increases linearly. Although the insertion performance is worse than a hash_map that also has a large bucket count, you can see that the linked_hash performance is usually better and much more predictable than the insertion times for a hash_map that doesn't have a large bucket count.

The next test that I ran was a test of modification times. Since the linked_hash and the hash_map should both have constant access times, I only did a comparison between the linked_hash and the map. For the test, I measured how long it took to perform a direct access of each item in the data structure, using the subscript operator. As you can see in Figure 2, the time necessary to access each item grows much more quickly for the map than for the linked_hash.

Finally, I ran a test of iterations. For this test, I tested how long it took to iterate through the entire data structure, as the number of items in the data structure increased. Figure 3 has the results, which clearly show that iteration through the linked_hash is vastly more efficient than iteration through the hash_map, especially if the hash_map has a large bucket count.

For each of the individual tests, the linked_hash was never the best performer. Overall, though, there was no other data structure that was able to consistently outperform the linked_hash. If you need to optimize for a single use case among those tested, the linked_hash is probably not your best choice. However, if you need a data structure that has predictable, decent performance over all three, the linked_hash may be a good choice. If you have a good hash function that lets you keep your keys from colliding, it is a very good choice for soft real-time systems with a collection of items that must support asynchronous random access and iteration (it would probably only be a good choice for a hard real-time system if you can guarantee no hashing collisions).

Conclusion

Although the linked_hash met my needs perfectly, there are a few areas where improvements could be made. One big area is ordering. For now, items inserted into the linked_hash are unordered, so iteration is not guaranteed to be in any particular order (in reality, the order is based on the order of insertion). If you needed the iteration order to be sorted, you could use a sorted data structure to replace the list. Or, you might find that you need to have multiple items with the same key (that is, a multimap instead of a map). Currently, only one item per key is supported, but multiple items could be supported with a hash_multimap instead of a hash_map.
