Pinned topictr1::hash&lt;unsigned long long&gt; considered harmful!

‏2011-06-04T09:28:07Z
|Tags:

Answered question
This question has been answered.

Unanswered question
This question has not been answered yet.

I've discovered that unordered_hash/unordered_set with STCK values as a key results in pathological performance which is almost O(n).
I noticed this when testing a real-world test case. STCK values are used as unique keys on z/OS for UOW tokens for CICS/IMS/DB2 etc.
Our products use STCK as keys to queues. In the case of unordered_set/unordered_map this showed up like a sore thumb! I can
circumvnet the problem by using a good 64bit hash function (thanks Thomas Wang) but would like IBM to address this issue.

The implementation (below) is is not a hash function, it's cast which casts a 64bit integer to a 32bit integer which
just takes the low 32 bits. In the case of a STCK value, this results in a sub-optimal value (it's a shocker)! FWIW, I also tested
the boost::hash<unsigned long long> which is just as pathetic. But the boost hash tables use a prime number for the bucket size
which fixes bad hashes. The IBM version uses a linear hash table, which is great - much faster for expands/shrinks - but does not
protect against bad hash functions.

I've discovered that unordered_hash/unordered_set with STCK values as a key results in pathological performance which is almost O(n).
I noticed this when testing a real-world test case. STCK values are used as unique keys on z/OS for UOW tokens for CICS/IMS/DB2 etc.
Our products use STCK as keys to queues. In the case of unordered_set/unordered_map this showed up like a sore thumb! I can
circumvnet the problem by using a good 64bit hash function (thanks Thomas Wang) but would like IBM to address this issue.

The implementation (below) is is not a hash function, it's cast which casts a 64bit integer to a 32bit integer which
just takes the low 32 bits. In the case of a STCK value, this results in a sub-optimal value (it's a shocker)! FWIW, I also tested
the boost::hash<unsigned long long> which is just as pathetic. But the boost hash tables use a prime number for the bucket size
which fixes bad hashes. The IBM version uses a linear hash table, which is great - much faster for expands/shrinks - but does not
protect against bad hash functions.

Re: tr1::hash&lt;unsigned long long&gt; considered harmful!

‏2011-08-09T15:25:28Z

This is the accepted answer.
This is the accepted answer.

Follow up: Here's a real world use case. I wrote a program to analyze a GETMAIN/FREEMAIN GTF trace and used an unordered_map<int,GTFRec>, where the key is the storage address. First problem was a compiler error due to some code that references the _VACPP_HASH_FUNCTION_CHECK. The code is a debug routine and should be enclosed in #ifdef/#endif!"//'CEE.SCEEH(XHASHTBL)'", line 516.50: CCN5274 (S) The name lookup for "_VACPP_HASH_FUNCTION_CHECK" did not find adeclaration."//'CEE.SCEEH(XHASHTBL)'", line 516.50: CCN6226 (I) Declarations for non-dependent names are resolved in the templatedefinition."//'CEE.SCEEH(XHASHTBL)'", line 516.50: CCN6227 (I) "_VACPP_HASH_FUNCTION_CHECK" does not depend on a template argumentCCN0793(I) Compilation failed for file //'DOC.CPP(GFSDUMP)'. Object file not created.

So after I compiled my program with the _VACPP_HASH_FUNCTION_CHECK switch on I realised that the same problem exists for hash<int>, it's not a hash function it's a cast! It looks to me like these hash functions have been copied from boost, which uses primes so gets away with it! As we all know, with linear hash tables (powers of 2) we need a good hash function. Here's a good one...

Come on guys, we can do better than this! I can't be bothered to open a PMR but I'm guessing that all the hash functions are bad.

David Crayford

Follow up: Here's a real world use case. I wrote a program to analyze a GETMAIN/FREEMAIN GTF trace and used an unordered_map<int,GTFRec>, where the key is the storage address. First problem was a compiler error due to some code that references the _VACPP_HASH_FUNCTION_CHECK. The code is a debug routine and should be enclosed in #ifdef/#endif!"//'CEE.SCEEH(XHASHTBL)'", line 516.50: CCN5274 (S) The name lookup for "_VACPP_HASH_FUNCTION_CHECK" did not find adeclaration."//'CEE.SCEEH(XHASHTBL)'", line 516.50: CCN6226 (I) Declarations for non-dependent names are resolved in the templatedefinition."//'CEE.SCEEH(XHASHTBL)'", line 516.50: CCN6227 (I) "_VACPP_HASH_FUNCTION_CHECK" does not depend on a template argumentCCN0793(I) Compilation failed for file //'DOC.CPP(GFSDUMP)'. Object file not created.

So after I compiled my program with the _VACPP_HASH_FUNCTION_CHECK switch on I realised that the same problem exists for hash<int>, it's not a hash function it's a cast! It looks to me like these hash functions have been copied from boost, which uses primes so gets away with it! As we all know, with linear hash tables (powers of 2) we need a good hash function. Here's a good one...

Re: tr1::hash&lt;unsigned long long&gt; considered harmful!

Follow up: Here's a real world use case. I wrote a program to analyze a GETMAIN/FREEMAIN GTF trace and used an unordered_map<int,GTFRec>, where the key is the storage address. First problem was a compiler error due to some code that references the _VACPP_HASH_FUNCTION_CHECK. The code is a debug routine and should be enclosed in #ifdef/#endif!"//'CEE.SCEEH(XHASHTBL)'", line 516.50: CCN5274 (S) The name lookup for "_VACPP_HASH_FUNCTION_CHECK" did not find adeclaration."//'CEE.SCEEH(XHASHTBL)'", line 516.50: CCN6226 (I) Declarations for non-dependent names are resolved in the templatedefinition."//'CEE.SCEEH(XHASHTBL)'", line 516.50: CCN6227 (I) "_VACPP_HASH_FUNCTION_CHECK" does not depend on a template argumentCCN0793(I) Compilation failed for file //'DOC.CPP(GFSDUMP)'. Object file not created.

So after I compiled my program with the _VACPP_HASH_FUNCTION_CHECK switch on I realised that the same problem exists for hash<int>, it's not a hash function it's a cast! It looks to me like these hash functions have been copied from boost, which uses primes so gets away with it! As we all know, with linear hash tables (powers of 2) we need a good hash function. Here's a good one...

The unordered associative containers use specializations of the class template hash as the default hash function. The class template is specialized on all built-in types in the <functional> header and all of the specializations currently do the same thing. The implementations are basically the identity function. The standard mandates that these functions must be provided but it does not specify what exactly they should do.

At the C++ Language committee level the Library Working Group did not want to get involved in the specification of generic hash functions that are valid for each builtin types. A generic hash function is very difficult to provide since each users requirements and usage of hash tables is different. It was always the intent that the user would be required to write their own hash function and pick an appropriate hash function based on their own specific data set. The question about which generic hash function to provide for each users data set is best answered by the user since they are familiar with their own requirements.

As you have stated, several implemenations use the identity function as the default hash function. It was likely chosen as the default because it is fast from an execution point of view not from a bucket distribution or usefulness view point. You have provided an excellent example of how one would write their own hash function. In addition, the unordered associative container also provides utility functions for monitoring as well as tuning the performance of the hash table in the functions: bucket_count(), bucket_size(n) and load_factor(). If you find that the performance of the hash table is not quite what you expect the hash function is the most likely candidate for replacement and the utility functions can be used to help determine if that is the case.

The _VACPP_HASH_FUNCTION_CHECK problem appears to have been fixed already. If you could provide the specifics of your compiler version and OS I can check if a fix has been made available for your levels.

Re: tr1::hash&lt;unsigned long long&gt; considered harmful!

The unordered associative containers use specializations of the class template hash as the default hash function. The class template is specialized on all built-in types in the <functional> header and all of the specializations currently do the same thing. The implementations are basically the identity function. The standard mandates that these functions must be provided but it does not specify what exactly they should do.

At the C++ Language committee level the Library Working Group did not want to get involved in the specification of generic hash functions that are valid for each builtin types. A generic hash function is very difficult to provide since each users requirements and usage of hash tables is different. It was always the intent that the user would be required to write their own hash function and pick an appropriate hash function based on their own specific data set. The question about which generic hash function to provide for each users data set is best answered by the user since they are familiar with their own requirements.

As you have stated, several implemenations use the identity function as the default hash function. It was likely chosen as the default because it is fast from an execution point of view not from a bucket distribution or usefulness view point. You have provided an excellent example of how one would write their own hash function. In addition, the unordered associative container also provides utility functions for monitoring as well as tuning the performance of the hash table in the functions: bucket_count(), bucket_size(n) and load_factor(). If you find that the performance of the hash table is not quite what you expect the hash function is the most likely candidate for replacement and the utility functions can be used to help determine if that is the case.

The _VACPP_HASH_FUNCTION_CHECK problem appears to have been fixed already. If you could provide the specifics of your compiler version and OS I can check if a fix has been made available for your levels.

Thank you so much for your reply. I can understand that it's difficult to provide decent generic hash functions. What I want are hash functions that perform optionally on the platform I use. I've checked all the unordered associative containers and the hash functions in the popular STL implementations (z/OS, MSVC, GCC, Dinkumware, SGI) and the only one that performs poorly is z/OS. I also noticed that the only libraries that used the identity function also used a prime based scheme, and as you well know that circumvents poor hash functions.

As a customer I don't expect to have to RYO my own hash function. The standard says that average case should be O(1) and worse case O(n). I would say the current implementation is crippled and is average close to O(n) for the majority of cases.

This is a enterprise compiler and library. It's not cheap! I would expect out of the box performance to be world class! Maybe I have lofty expectations?

Re: tr1::hash&lt;unsigned long long&gt; considered harmful!

Thank you so much for your reply. I can understand that it's difficult to provide decent generic hash functions. What I want are hash functions that perform optionally on the platform I use. I've checked all the unordered associative containers and the hash functions in the popular STL implementations (z/OS, MSVC, GCC, Dinkumware, SGI) and the only one that performs poorly is z/OS. I also noticed that the only libraries that used the identity function also used a prime based scheme, and as you well know that circumvents poor hash functions.

As a customer I don't expect to have to RYO my own hash function. The standard says that average case should be O(1) and worse case O(n). I would say the current implementation is crippled and is average close to O(n) for the majority of cases.

This is a enterprise compiler and library. It's not cheap! I would expect out of the box performance to be world class! Maybe I have lofty expectations?