I am curious to know what reasoning could tip the balance towards using a self-balancing tree to store items rather than a hash table.

I see that hash tables cannot maintain insertion order, but I could always use a linked list on top to record the insertion sequence.
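The "linked list on top" idea is exactly how ordered hash maps such as Java's LinkedHashMap or Python's `collections.OrderedDict` work internally. A minimal Python sketch of the combination:

```python
# Sketch: a hash map that also remembers insertion order by keeping a
# linked list of keys alongside the table. OrderedDict does precisely
# this under the hood: hash-based lookup plus an order-preserving list.
from collections import OrderedDict

d = OrderedDict()
d["banana"] = 2
d["apple"] = 1
d["cherry"] = 3

# Lookups are hash-based (O(1) average)...
assert d["apple"] == 1
# ...while iteration replays insertion order.
assert list(d) == ["banana", "apple", "cherry"]
```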

I see that for a small number of values there is the added cost of the hash function, but I could always cache the hash together with the key for faster lookups.
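Caching the hash alongside the key is a real technique (CPython's `str` objects memoize their hash, for instance). A sketch of the idea made explicit, using a hypothetical `CachedKey` wrapper for an expensive composite key:

```python
# Sketch: store the hash with the key so repeated probes of the same
# key object never recompute it. CachedKey is an illustrative name,
# not a standard library class.
class CachedKey:
    def __init__(self, value):
        self.value = value
        self._hash = hash(value)   # computed once, stored with the key

    def __hash__(self):
        return self._hash          # reused on every probe

    def __eq__(self, other):
        return isinstance(other, CachedKey) and self.value == other.value

table = {}
k = CachedKey(("a", "long", "composite", "key"))
table[k] = 42
assert table[CachedKey(("a", "long", "composite", "key"))] == 42
```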

I understand that hash tables are more difficult to implement than the straightforward implementation of a red-black tree, but in a practical implementation wouldn't one be willing to go the extra mile for the trouble?

I see that collisions are normal with hash tables, but with open-addressing techniques like double hashing, which store the keys in the table itself, hasn't the problem been reduced enough that it no longer tips the scales towards red-black trees?
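For reference, a minimal sketch of the double-hashing scheme mentioned above: keys live in the table itself, and a second hash decides the probe step so that colliding keys spread out. Deletion and resizing are omitted to keep the sketch short.

```python
# Sketch of open addressing with double hashing. The table size is
# prime so every step size is co-prime with it, which guarantees a
# probe sequence that visits every slot.
class DoubleHashTable:
    SIZE = 11  # prime

    def __init__(self):
        self.slots = [None] * self.SIZE

    def _probe(self, key):
        h1 = hash(key) % self.SIZE
        h2 = 1 + (hash(key) % (self.SIZE - 1))  # step size, never 0
        for i in range(self.SIZE):
            yield (h1 + i * h2) % self.SIZE

    def put(self, key, value):
        for idx in self._probe(key):
            if self.slots[idx] is None or self.slots[idx][0] == key:
                self.slots[idx] = (key, value)
                return
        raise RuntimeError("table full")

    def get(self, key):
        for idx in self._probe(key):
            if self.slots[idx] is None:
                raise KeyError(key)       # valid only because we never delete
            if self.slots[idx][0] == key:
                return self.slots[idx][1]
        raise KeyError(key)

t = DoubleHashTable()
for k, v in [("a", 1), ("b", 2), ("c", 3)]:
    t.put(k, v)
assert t.get("b") == 2
```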

1. If you can't produce a hash key, how do you produce a key to determine where the node is placed in the tree? If you can produce a key for the node placement, why can't you use that for the hash key? 2. Why can't you effectively do this with a hash table + linked list? Can you provide more explanation? Remember that a linked list is essentially just a tree optimized for ordering. 3. Trees have a best case of log(N). Hashing is always constant. Collisions have the same effect on both structures. How can trees have better worst-case performance?
– Jake, Sep 12 '10 at 21:00

@Jake 1- by comparing elements. Having order on something doesn't mean you can hash something. 2- Because. 3- There is nothing to collide in trees.
– unbeli, Sep 12 '10 at 21:05

1. The item you're comparing is the key. All binary data can be hashed by some method, and all data in a computer is binary data. I suppose it could get expensive if the keys are huge, but that would likely have the same effect on the comparison function of the tree. 3. Search trees have ordered nodes. You can only order a list if the list has a set of keys in some form on which to order the elements. If two nodes have the same key, that is a collision.
– Jake, Sep 12 '10 at 21:21

1- not true. 3- not true about the keys: you don't need any keys for the search tree. If you insert an entry, which is equal to an existing entry, there is no performance impact, contrary to the hash collision. It is your homework to understand why.
– unbeli, Sep 13 '10 at 19:46

Storage allocation is another consideration. Every time a hash table exceeds its load factor, you need to allocate new storage and re-hash everything. This can be avoided if you know the size of the data ahead of time. Balanced trees, on the other hand, don't suffer from this issue at all.

Balanced binary trees have a predictable fetch time of O(log n), independent of the type of data. That can matter when you need to estimate your application's response times; hash tables may have unpredictable response times. Remember that for the small n of most common use cases, the performance difference of an in-memory lookup hardly matters: the system's bottleneck will be elsewhere, and sometimes you simply want a system that is easier to debug and analyze.

Trees are generally more memory-efficient than hash tables, and much simpler to implement without any analysis of the distribution of input keys, possible collisions, etc.

In my humble opinion, self-balancing trees work pretty well as academic topics, and I don't know of anything that qualifies as a "straight-forward implementation of a red-black tree".

In the real world, the memory wall makes them far less efficient than they are on paper.

With this in mind, hash tables are decent alternatives, especially if you don't practice them in the academic style (forget about the table-size constraint and you magically resolve the table-resize issue and almost all collision issues).

In a word: keep it simple. If that's simple for you then that's simple for your computer.

Could you explain how you can "forget about the table size constraint"? Are you just suggesting we ignore it, or do you mean something else?
– Odrade, Jul 16 '10 at 14:24

Do you count the C++ standard library implementation as an academic topic? Or that of Java and .NET containers? Just so you know, std::map is mostly implemented as a balanced binary search tree; GNU libstdc++ implements std::map as a red-black tree, IIRC.
– Fingolfin, Dec 2 '13 at 23:08

"keep it simple. If that's simple for you then that's simple for your computer." That is not very good advice, think Fibonacci Heap - highly complex and very efficient. Same goes for, for example, SSS* game tree search or Thorup path finding...
– PawelP, Nov 30 '14 at 21:35

Trees are dynamic (their space complexity is O(N)), whereas hash tables are often implemented as fixed-size arrays, meaning they are initialized with some capacity K where K > N. So even if a hashmap holds only 1 element, it might still have 100 empty slots taking up memory. Another effect of this is:

Growing an array-based hash table is costly: a resize must re-hash all N elements (O(N), though amortized O(1) per insert when the capacity doubles), whereas a tree grows by allocating a single node in O(1), plus the O(log N) time to locate the insertion point.
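The amortized claim above is easy to verify by counting element moves in a doubling table. This sketch only models the growth policy (counters instead of a real table), which is enough to show that the total re-hashing work stays under 2N across N inserts:

```python
# Sketch: count how many element moves a doubling hash table performs.
# Each individual resize re-hashes everything (O(N)), but doubling
# keeps the total moves below 2N, i.e. amortized O(1) per insert.
capacity, size, moves = 8, 0, 0
for _ in range(1000):
    if size == capacity:        # table "full": grow and re-hash
        capacity *= 2
        moves += size           # every existing element is re-inserted
    size += 1

assert size == 1000
assert moves == 8 + 16 + 32 + 64 + 128 + 256 + 512   # = 1016
assert moves < 2 * size                              # amortized O(1)
```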

Elements in a tree can be visited in sorted order (using e.g. an in-order traversal), so with trees you often get a sorted list as a free perk.
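A quick illustration of that "free perk", using a plain (unbalanced) binary search tree for brevity; a self-balancing variant orders its keys the same way:

```python
# Sketch: in-order traversal of a BST visits keys in sorted order,
# something a hash table cannot do without an explicit sort.
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def in_order(root):
    if root is None:
        return []
    return in_order(root.left) + [root.key] + in_order(root.right)

root = None
for k in [5, 1, 4, 2, 3]:
    root = insert(root, k)
assert in_order(root) == [1, 2, 3, 4, 5]
```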

Trees can have better worst-case performance than a hashmap, depending on how the hashmap is implemented (e.g. a hashmap with chaining has an O(N) worst case, whereas self-balancing trees guarantee O(log N) worst case for all operations).

Both self-balancing trees and hashmaps have a worst-case efficiency of O(log N) in the best case (assuming the hashmap handles collisions well), but hashmaps can have better average-case performance (often close to O(1)), whereas trees are a constant O(log N). This is because even though a hashmap can locate the insertion index in O(1), it has to account for hash collisions (more than one element hashing to the same array index), and so at best each bucket degrades into a self-balancing tree, as in Java's HashMap implementation: an array cell that accumulates many collisions stores its elements in a self-balancing tree.
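The O(N) worst case from collisions is easy to demonstrate with separate chaining and a deliberately pathological hash function (`bucket_of` here is an illustrative stand-in, mapping every key to bucket 0):

```python
# Sketch: with chaining, adversarial keys that all hash to the same
# bucket pile into one chain, so lookups become a linear scan -- the
# O(N) worst case the answer mentions. Java 8's HashMap counters this
# by converting long chains into red-black trees.
BUCKETS = 16
table = [[] for _ in range(BUCKETS)]

def put(key, value, bucket_of=lambda k: 0):   # pathological hash: always 0
    chain = table[bucket_of(key)]
    for i, (k, _) in enumerate(chain):
        if k == key:
            chain[i] = (key, value)           # update existing key
            return
    chain.append((key, value))                # O(len(chain)) scan first

for n in range(100):
    put(n, n * n)

assert len(table[0]) == 100                   # every key landed in one chain
assert all(len(b) == 0 for b in table[1:])
```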

This sounds a lot like an opinion. Do you have any data / reference to support your statement?
– Floris, Mar 14 '14 at 19:53

Querying a range of keys in a hash table requires knowing exactly which keys exist: you would have to repeat the hashing operation for every possible key you believe lies in the query interval. With a tree, you can simply locate the beginning of the interval and then traverse the tree in order until you reach the end of the interval.
– Cesar, Sep 8 '14 at 15:09
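The range-query point in the comment above can be sketched with a sorted list and `bisect` standing in for a balanced tree: find the start of the interval in O(log N), then walk forward to its end, just as an in-order traversal would.

```python
# Sketch: a range query over an ordered structure. A hash table would
# have to probe every candidate key in [lo, hi]; an ordered index
# locates the interval start once and then walks forward.
import bisect

keys = sorted([3, 17, 8, 42, 25, 11])        # ordered index, like a tree

def range_query(lo, hi):
    start = bisect.bisect_left(keys, lo)     # locate interval start
    end = bisect.bisect_right(keys, hi)      # locate interval end
    return keys[start:end]                   # "in-order traversal"

assert range_query(8, 25) == [8, 11, 17, 25]
```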