Thoughts on Perl and Emacs, technology and writing

Hashes And Trees Are Not Fungible

"I think most languages don’t include trees because hash tables are close enough for most purposes. You don’t usually need to traverse key/value pairs in the map in sorted order, which is about the only thing a tree buys you over a hash. And if you know that you care about sortedness, you probably know enough to go implement your own."

With my Perl background, I’ve got a lot of sympathy for that position. It’s almost, but not quite, completely wrong though.

Extensible Containers are all the same and yet different

Vectors, trees, hashes, whatever: they do have something in common. They are all capable of storing a bunch of things, and they all offer operations such as store and retrieve. That much is true. The difference lies in how long each operation takes.

You want to find a particular item in a singly linked list? That’ll take N/2 comparisons on average. Insertion at the head is constant time.
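As a minimal sketch (in Python rather than the Perl of the post, with hypothetical `push`/`find` helpers), here is why those costs fall out of the structure itself:

```python
# Minimal singly linked list as nested tuples: (value, next_node), or None for empty.
def push(head, value):
    # O(1): inserting at the head never touches the rest of the list.
    return (value, head)

def find(head, value):
    # O(N): walk cell by cell until a match; about N/2 cells on average.
    while head is not None:
        if head[0] == value:
            return head
        head = head[1]
    return None

lst = push(push(push(None, 3), 2), 1)   # the list 1 -> 2 -> 3
assert find(lst, 2)[0] == 2
assert find(lst, 9) is None
```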

Iterating over a sorted range of keys in a tree? log2 N to find the start, plus the length of the range. In a hash? N log2 N, because you have to pull out every key and sort the lot first.

When I want to implement a database, or whatever, this sort of thing matters. I could always implement it on top of a singly linked list, a hash, or a vector, but it would be terribly inefficient.
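The range-query difference can be sketched in a few lines (Python here, not the post's Perl; the sorted key list is a stand-in for a balanced tree, which would maintain that order incrementally on every insert):

```python
import bisect

# Hash-backed map: a range query must sort all N keys first, O(N log N).
prices = {"cherry": 3, "apple": 1, "banana": 2, "fig": 5, "date": 4}
in_range_hash = [k for k in sorted(prices) if "banana" <= k <= "date"]

# Tree-like map, faked with a sorted key list; a real balanced tree keeps
# this invariant on every insert instead of sorting from scratch.
sorted_keys = sorted(prices)
lo = bisect.bisect_left(sorted_keys, "banana")   # O(log N): find range start
hi = bisect.bisect_right(sorted_keys, "date")    # O(log N): find range end
in_range_tree = sorted_keys[lo:hi]               # O(k): walk the range

assert in_range_hash == in_range_tree == ["banana", "cherry", "date"]
```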

The English Effect

Paul Graham once mentioned The Blub Paradox, where a user of a particular programming language doesn’t understand the value in unfamiliar (more powerful) constructs in another language [1].

Actually, the effect is not restricted to languages; it affects all kinds of useful things. And they don’t have to be more or less powerful than other things, just different. So someone who is constrained into a scalar/array/hash view of the world won’t necessarily see the value in binary trees [2].

I think of this as The English Effect. English isn’t more or less powerful than other languages. But a native English speaker will tend to look at other languages and not see any value in learning them. They will be unaware that there are ideas that can be expressed much more elegantly in other languages.

The same ideas can still be described, albeit clumsily, just as binary trees can be implemented in assembler.

Take, for example, the German doch: a one-word contradiction of a negative assertion. I have lost count of the number of times I have wished English had something so simple, and of the working efficiency it would impart.

Oh, that idea you have, it is never going to work, the boss will never okay it, and it will take too long, and there are instances it won’t cover, and, and, and.

You simply respond with Doch! And your negative colleague has no possible comeback. You are all able to proceed with your work and do so gladly 😉

1. I tend to find that the way it is phrased is quite condescending. And how is it a paradox anyway?

2. Looking at Nathan’s blog though, he does Lisp. And in Lisp, the basic datatype is a tree. So he’s not unfamiliar with that stuff which makes his comment even more strange to me.

I am with Nathan on this one – it is not that he does not understand the differences between the various algorithms; what he says is that for most purposes they are close enough. This is about Huffman coding the language: the cases where you really need trees (or bags, sets, etc.) are so rare that they can be relegated to the libraries.

I tried to point out that Nathan probably understands trees in footnote 2.

And I kind of get that rationalization. But if I were only going to get one of hashes and trees, I’d rather have trees. Log2 N isn’t much slower for normal usage, and you get much saner range traversal.

I do understand trees. So much so that I’ve written a library for them in Common Lisp. 😉 The inspiration for them *was* to implement an in-memory database where I needed to iterate through in sorted order…but I haven’t really needed them besides that, in CL or other places. (I’m not totally happy with my implementation, but that’s another story…) It’s funny that you think of the basic datatype of Lisp as a tree, though; I’ve never heard it expressed that way before.

If trees are your fundamental map, using them requires you to implement an ordering over whatever type your keys are. Which can be an unnecessary irritation, especially if you don’t naturally think of your keys as having an ordering. Or maybe it’s not straightforward to give them an ordering. And it’s another piece of code to write and debug… Hashes (in dynamic languages) make it very easy to use whatever you like as a key; the implementation hides all those details.

Ordering constraints also make using multiple datatypes as keys tricky; doing so is very easy with a hashtable. (Granted, I think it would be unusual to have such a table.)

Depending on your area of concern, these may be trivial or they may be significant speed bumps.

@Nathan – on the fundamental type in Lisp being trees: cons cells have a car and a cdr. And you can store anything you like in either one, including other cons cells. That seems like a binary tree to me.
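To make the cons-cell point concrete (sketched in Python with 2-tuples standing in for Lisp cells; `cons`/`car`/`cdr` here are toy helpers, not a real Lisp):

```python
# A cons cell modelled as a 2-tuple (car, cdr).
def cons(car_val, cdr_val): return (car_val, cdr_val)
def car(cell): return cell[0]
def cdr(cell): return cell[1]

# Because either slot can hold another cell, cells compose into a binary tree:
#        .
#       / \
#      .   .
#     / \ / \
#    1  2 3  4
tree = cons(cons(1, 2), cons(3, 4))

def leaves(node):
    """Flatten a cons-cell tree into its leaf values, left to right."""
    if not isinstance(node, tuple):
        return [node]
    return leaves(car(node)) + leaves(cdr(node))

assert leaves(tree) == [1, 2, 3, 4]
```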

on trees requiring a less-than function: Hashes require a hashing function. If a built-in hash can provide a hashing function for anything under the covers, a built-in tree could equally provide an ordering function under the covers (if you’re really stuck for ideas you could use less(hash(key))). In some cases an ordered traversal would return seemingly random elements, just like a traversal of a hash. In others, an ordered traversal would do what you expect.
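That less(hash(key)) idea can be sketched in a few lines of Python (a toy illustration, not any real language's built-in tree; note a production version would need a tie-break for hash collisions between distinct keys):

```python
# Derive a default ordering for arbitrary keys by comparing their hashes,
# just as a built-in hash derives a bucket index from them.
def default_less(a, b):
    return hash(a) < hash(b)

keys = ["pear", "kiwi", "plum"]
by_hash = sorted(keys, key=hash)   # "ordered" traversal: arbitrary-looking but consistent
by_name = sorted(keys)             # the traversal you expect for naturally orderable keys

# Both traversals visit every key exactly once; only the order differs.
assert set(by_hash) == set(by_name) == set(keys)
```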

Aside from a slightly slower store/retrieve operation, you don’t lose anything by having your fundamental type be a tree instead of a hash. But in many cases you gain a decent key range operation.

I don’t doubt that you understand what a tree is. But I think the fact that you don’t see opportunities to use them all over the place just means your native language consists of scalars, extensible arrays and hashes, which has constrained your thinking. I think I have the same reaction to Graphs. I know what they are, and I have some ideas about the problem domains in which they are useful. But doggone it, I’m sure I could just replace them with a tree and not lose too much!

Although “yet”, followed by a reason, would work (it would still be a little clumsy), I’m afraid YET! doesn’t work standalone in the same way as Doch (if I understand Doch usage correctly – admittedly I don’t intuit German well). AND!? might work better, but then I think UND!? would be similar in German and still different from Doch, oder?

@Jared: I think the connotation of “binary tree” is that there is some value at the nodes, which is three pieces of information per node, rather than two. (Yes, you can implement this with cons cells.) Talking about the fundamental datatype of Lisp being trees is liable to receive a double-take, even if the listener has read TAoCP volume 1.

I think we are in violent (mostly) agreement. Both of us have said that if you want sorted range iteration of some kind, then you ought to use a tree instead of a hash. No argument there. I think where we disagree is that you think sorted range iteration is a common thing to do; that has not been my experience. Nor can I think of too many instances where it would have been helpful.

As a quick-and-dirty calculation, looking for HashMap in Java code via codesearch.google.com results in more than 10x as many hits as looking for TreeMap. This is an imperfect way of looking at it: the first page of results for both are from open-source Java implementations, for instance. Context is also lacking; the keys from those HashMaps might be sorted for further processing, in which case a TreeMap might have been the better choice. But 10x suggests that a fair number of other programmers think sorted range iteration is not a very common operation either.

@Nathan – yep, good point about binary trees. I should have said trees, of course. Cons cells are useful for making any type of tree. If I include unsorted trees, they’re even more useful than plain hashes, with (for example) hierarchy information included for free.

Personally, I often find that I need to iterate over a range shortly after I’ve implemented something complex in a hash, e.g. an LRU cache or whatever. Yes, you could say this shows a lack of foresight 🙂
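A small sketch of that experience (Python, with an invented toy cache; the timestamps and keys are made up for illustration): the hash does the cache job fine, right up until you want entries in a range.

```python
# Hypothetical dict-backed cache: key -> (logical_timestamp, value).
cache = {"a": (1, "x"), "b": (7, "y"), "c": (3, "z")}

# Evicting everything stamped before t=5 means scanning the whole hash, O(N);
# a tree keyed on timestamp would hand back exactly that range in O(log N + k).
expired = sorted(k for k, (t, _) in cache.items() if t < 5)
assert expired == ["a", "c"]
```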

For the search of hash usage vs tree usage, I’m going to attribute this to the English Effect. I’m sure that if you look at a typical native Lisper’s code, where trees are first-class structures and hashes are definitely second-class, you will find more utility for trees. Heck, even C++ got map before hashmap.