GCC hash_map vs. unordered_map

Long story cut short: Do not use hash<const char*>. Such keys will be treated as integers. Use C++ strings instead (or create a custom hashing function).

It started innocently but, before noticing anything, I was writing C++ code again. I had a problem to solve that asked for something like Python's dictionaries. There is this thing called hash_map in C++, which apparently wasn't included in the standard library because the standards committee ran out of time, but it has made its way into most compilers anyway. So hash_map it is, I thought to myself.

Before discussing the real problem, let's start with a smaller issue. I started writing on my Cygwin installation with GCC 3.4.4 using <ext/hash_map>. Fine, but after moving to a current Linux distro with GCC 4.4.5 I got a nasty deprecation warning. It seems that starting with GCC 4.3 one should rather use unordered_map (which isn't part of the standard library just yet either). Since I wanted the program to compile with both compilers without warnings, I've changed the includes into this:

This way I could leave all the rest of the code unchanged, you'd think. Well, guess again. If you're a C die hard, who likes using const char* (which is passionately used even in the most popular C++ hash_map example), you're in for a surprise. unordered_map (and probably everything else from the TR1/C++0x standards) doesn't know the concept of const char* any more. By design! To them it's just a pointer like any other, without any magical features. So, while old SGI's hash_map creates a hash from the string itself, this new class just takes the memory address as hash. You most certainly didn't want that.

I was surprised to learn this and I did that by running the benchmark from Attractive Chaos. According to the diagrams posted there, GCC's unordered_map is a terrible memory hog for some unknown reason when using cstrings as keys. It was said that it might have been a compiler bug (but it turned out to be a feature instead). I really couldn't believe that, therefore I downloaded the benchmark to check it myself. I have only tested map, hash_map and unordered_map. Both tests compiled just fine after adding #include <cstring> to their source for the sake of strcmp. This benchmark tries to fill the storage with 5 million int-int or cstring-int pairs (but most of them are duplicates), and then reports the extra memory and time used for that plus the number of elements actually stored. Here are the results on a 32-bit relatively slow machine:

test

keys

elem­ents

mem­ory

CPU time

tr1_unordered_map

int

625792

10 MB

1.06

tr1_unordered_map

cstring

5000000

78 MB

1.04

sgi_hash_map

int

625792

10 MB

1.14

sgi_hash_map

cstring

625792

10 MB

2.03

sgi_map

int

625792

220 kB

4.44

sgi_map

cstring

625792

228 kB

9.85

unordered_map has utilised 8 times as much memory when using string keys but, wait a second, it has also stored 8 times as much data! Here is the culprit: duplicate keys have been stored. It became immediately clear that what was really stored were pointers and not string hashes. Unfortunately there's some extra work, that needs to be done, in order to support C-type strings. Instead of the original code of test.cc:

I've taken the hashing function from the libstdc++ mailinglist.
I have tried something more sophisticated, like FLV, but then the program was running a few times slower. There's really no need for that. We shouldn't worry about hash collisions that much and rather pick a fast function. A few collisions won't harm performance that much, it will just take a few calls to strcmp to find the correct key when required. Look at the performance comparison now. Not bad:

Comments

Paul M. Dubuc on October 27, 2011, 15:02:

Thanks for this very useful analysis. I wouldn't recommend putting the "using namespace" statements in header files even though because it promotes everything in those namespaces into the global namespace. If you are already using the hash_* containers without the namespace, I would suggest something like this:

I'm not saying that I would prefer picking Cstring over string, but nevertheless I would expect it to work with unordered map, given all the effort to provide compatibility with C. The actual use of CString would depend e.g. on the libraries I'd have to operate with.

As for the remark about writing bad code, I had a look at it and noticed that there were no checks for null pointers, which unnecessarily limits the example and leads to a crash if someone tries to use a nullptr as key. So here is an updated example: