Shared Memory Hash Table

19 Jul 2012

For the last month I’ve been working on a hash table which is stored in
shared memory and can thus easily be used across applications. This is
ideal for simple caches of data that reside in multiple applications. My
specific use case was the Nepomuk Resource class, which is a
glorified cache of key value pairs and uses a hash table. A considerable
amount of effort has gone into making sure that each application’s
Resource classes are consistent with the other applications.

I always thought something this basic would have been implemented, but I
just couldn’t find a shared memory hash table which actually supported
resizing.

Basic Hash Table

Hashing is arguably one of the most important concepts of computer
science. If you aren’t aware of how it works here are some nice links -

Shared Memory Hash Table

When implementing a hash table in shared memory, one encounters a couple
of problems which normal hash tables do not have to deal with. Namely
‘named memory locations’. In the Unix world each shared memory location
has to be given a unique identifier so that it can be accessed by other
applications. Because of that we cannot allocate each Node/Bucket
independently.

Allocating a new named shared memory region for each node seems like
quite an overkill.

Structure

Since everything has to fit inside a contiguous memory location, we need
to structure the hash table a little differently.

We use two shared memory locations - HashName and HashData. This is
done cause a hash table is not fixed in size, and will need to be
reallocated. With the reallocations, a new named shared memory will have
to be created, and all existing clients would need to be informed to use
this newly allocated location which would have a different name.

Instead we use HashName as the unique identifier the client knows
about, and HashData is internally used and can be changed when the
hash table needs to grow in size.

Hash Name data

The additional integer is a micro optimization. Whenever a client needs
to use the hash table they need to make sure that they are connected to
the appropriate shared memory, and not the old version. The code does
that by checking if HashData name is the same as the same as the one
provided by HashName.

This results in a string comparison, which would take a certain number
of cycles depending on the length of the string. We use an additional
integer to indicate the if the string has changed. Integer comparisons
are a lot faster than string comparisons.

Internal Data

Most of the initial members are quite obvious, but I’m still listing
them.

Size: The number of elements in the hash table

Capacity: The total number of elements the hash table can hold

Invalid: The number of buckets that invalid (have been deleted)

Empty-Bucket: The offset of the next empty bucket which may be used
for insertion

After this comes the array of offsets referred internally as m_buckets.
This array instead of holding the pointers to the Buckets, like in a
traditional hash table, it holds offsets from the beginning of the next
array.

The next array is an array of Bucket s, which is internally referred to
as m_data. This array holds the key value pairs. A typical Bucket is
defined as -

struct Bucket {
KeyType key;
ValueType value;
int hash;
int link;
}

The link member is again not a pointer, but it contains the integer
offset to the next Bucket from the start of m_data.

Insertions

Insert operations are quite simple. The key is hashed into an integer,
which after a modulo operation is used as the index.

The corresponding index is checked in m_buckets. If m_buckets does
not already have some value over there, there is no collision and we
just allocate a new Bucket and plug in its offset.

Allocations of new buckets are done by consuming the location given by
emptyBucket, and then incrementing its value.

If m_buckets does not have an empty value, we go to the corresponding
Bucket, and follow its link until the link is empty. At that point we
allocate a new node and make the link point to this new Bucket. This
approach is almost identical to that of conventional chaining, except
that we are using offsets instead of pointers.

Deletions

Delete operations are fairly similar to insertions. The index is
procured by hashing they key, and performing a modulo operation with the
capacity. After which m_buckets is used to get the offset of the actual
Bucket. Instead of deleting the bucket, which would not be possible
cause it is stored in an array, we simply mark it as invalid.

The number of invalid buckets is then updated, and m_buckets[index] is
marked as empty. Later during the resize operation, the wasted memory
will be cleaned up.

Note: Deletions are actually a little more complex because of
collisions, but you just need to traverse the entire link chain, and
mark the corresponding Bucket as invalid

Resizing

The resize operation occurs when the loadRatio of the hash table goes
above a certain threshold. For now, I’ve set it to 0.8.

loadRatio = (invalid-buckets + size) / capacity

When the loadRatio goes above 0.8, a new shared memory location is
allocated. The metadata (size, capacity, invalid and empty-Bucket) are
initially copied to the new shared memory. After which the offset for
each bucket is recalculated ( hash % newCapacity ), and it is inserted
into the new shared memory.

The invalid buckets are ignored and they are not copied.

Once the copying is completed, the hash data key in HashName is
updated, the id is incremented. This is done so that all other
applications using the shared memory can update their internal pointers.

Iterating

Iterating over the hash table is probably the only operation that is a
lot simpler than traditional hash tables. We already have all the data
listed as an array - m_data. All we need to do is iterate over it
while discarding invalid buckets which were created by delete
operations.

Well, in theory.

In practice, safely iterating over a shared hash involves copying the
element being accessed. Mainly because another application could have
shrunk the entire memory and your index could no longer be valid.

Another possible way is to copy its contents to a QHash. I’ve
implemented a simple function for that.

Problems

Dynamically allocated Types

You cannot use any types which dynamically allocate memory as the key
or value. This rules out QString, QUrl, QVariant (kinda) and
most of the commonly used Qt Data Structures. If you need to use a
string as the key, you’ll need to explicitly set an upper limit by using
something like QVarLengthArray.

This is a huge problem, and makes using the shared hash very difficult
in practice.