tags:

views:

answers:

There a some data structures around that are really cool but are unknown to most programmers. Which are they?

Everybody knows linked lists, binary trees, and hashes, but what about Skip lists, Bloom filters for example. I would like to know more data structures that are not so common, but are worth knowing because they rely on great ideas and enrich a programmer's tool box.

PS: I am also interested on techniques like Dancing links which make interesting use of the properties of a common data structure.

EDIT:
Please try to include links to pages describing the data structures in more detail. Also, try to add a couple of words on why a data structures is cool (as Jonas Kölker already pointed out). Also, try to provide one data-structure per answer. This will allow the better data structures to float to the top based on their votes alone.

@Joe: Those are problems with naive trie implementations. In practice, you usually have a much higher branching factor that a single char and you usually compress lines in the tree in order to store common sequences of chars (like syllables) in a single node.

Heap-ordered search trees: you store a bunch of (key, prio) pairs in a tree, such that it's a search tree with respect to the keys, and heap-ordered with respect to the priorities. One can show that such a tree has a unique shape (and it's not always fully packed up-and-to-the-left). With random priorities, it gives you expected O(log n) search time, IIRC.

A niche one is adjacency lists for undirected planar graphs with O(1) neighbour queries. This is not so much a data structure as a particular way to organize an existing data structure. Here's how you do it: every planar graph has a node with degree at most 6. Pick such a node, put its neighbors in its neighbor list, remove it from the graph, and recurse until the graph is empty. When given a pair (u, v), look for u in v's neighbor list and for v in u's neighbor list. Both have size at most 6, so this is O(1).

By the above algorithm, if u and v are neighbors, you won't have both u in v's list and v in u's list. If you need this, just add each node's missing neighbors to that node's neighbor list, but store how much of the neighbor list you need to look through for fast lookup.

"The Heap ordered search tree is called a treap." -- In the definition I've heard, IIRC, a treap is a heap-ordered search tree with *random* priorities. You could choose other priorities, depending on the application...

A suffix *trie* is almost but not quite the same as the much cooler suffix *tree*, which has strings and not individual letters on its edges and can be built in linear time(!). Also despite being asymptotically slower, in practice suffix arrays are often much faster than suffix trees for many tasks because of their smaller size and fewer pointer indirections. Love the O(1) planar graph lookup BTW!

@Edward Kmett: I have in fact read that paper, it was quite a breakthrough in suffix array *construction*. (Although it was already known that linear time construction was possible by going "via" a suffix tree, this was the 1st undeniably practical "direct" algorithm.) But some operations outside of construction are still asymptotically slower on a suffix array unless a LCA table is also built. That can also be done in O(n), but you lose the size and locality benefits of the pure suffix array by doing so.

Spatial Indices, in particular R-trees and KD-trees, store spatial data efficiently. They are good for geographical map coordinate data and VLSI place and route algorithms, and sometimes for nearest-neighbor search.

I think it'd be useful to know why they're cool. In general, the question "why" is the most important to ask ;)

My answer is that they give you O(log log n) dictionaries with {1..n} keys, independent of how many of the keys are in use. Just like repeated halving gives you O(log n), repeated sqrting gives you O(log log n), which is what happens in the vEB tree.

They are nice from a theoretical point of view. In practice, however, it's quite tough to get competetive performance out of them. The paper I know got them to work well up to 32 bit keys (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.2.7403) but the approach will not scale to more than maybe 34-35 bits or so and there is no implementation of that.

Kd-Trees, spatial data structure used (amongst others) in Real-Time Raytracing, has the downside that triangles that cross intersect the different spaces need to be clipped. Generally BVH's are faster because they are more lightweight.

MX-CIF Quadtrees, store bounding boxes instead of arbitrary point sets by combining a regular quadtree with a binary tree on the edges of the quads.

HAMT, hierarchical hash map with access times that generally exceed O(1) hash-maps due to the constants involved.

Inverted Index, quite well known in the search-engine circles, because it's used for fast retrieval of documents associated with different search-terms.

The source code of the book "Data Structures and Algorithm Analysis in Java/C++" seems to include implementations of Pairing-Heapshttp://users.cs.fiu.edu/~weiss/dsaa_c++3/code/http://users.cs.fiu.edu/~weiss/dsaajava2/code/

@FreshCode As @Tom Savage said, it's more useful when checking for negatives. For example, you can use it as a fast and small (in terms of memory usage) spell checker. Add all of the words to it and then try to look up words the user enters. If you get a negative it means it's misspelled. Then you can run some more expensive check to find closest matches and offer corrections.

@abhin4v: Bloom filters are often used when most requests are likely to return an answer of "no" (such as the case here), meaning that the small number of "yes" answers can be checked with a slower exact test. This still results in a big reduction in the *average* query response time. Don't know if Chrome's Safe Browsing does that, but that would be my guess.

@Zuu, Ok, I'll give some constructive criticism. You provided many data structures, of which only a small fraction could be considered "lesser known". There are no links in your post and it generally misses the entire point of the question.

It's hard to tell what people would understand by 'lesser known'. Some people barely know what a balanced tree is. And while people might know the term 'heap' they don't know it's a general data structure that can actually be used with sense in a given application.

What goes for the links, sure i could look it all up, but i was nice enough to categorize them as well as linking to an index of data structures on Wikipedia. Also note that the last part of the question was added after i posted this answer :-)

Wikipedia
A skip list is a probabilistic data structure, based on multiple parallel, sorted linked lists, with efficiency comparable to a binary search tree (order log n average time for most operations).

They can be used as an alternative to balanced trees (using probalistic balancing rather than strict enforcement of balancing). They are easy to implement and faster than say, a red-black tree. I think they should be in every good programmers toolchest.

If you want to get an in-depth introduction to skip-lists here is a link to a video of MIT's Introduction to Algorithms lecture on them.

Rope: It's a string that allows for cheap prepends, substrings, middle insertions and appends. I've really only had use for it once, but no other structure would have sufficed. Regular strings and arrays prepends were just far too expensive for what we needed to do, and reversing everthing was out of the question.

Without knowing what is was called I recently wrote something very similar to this for Java - performance has been excellent:http://code.google.com/p/mikeralib/source/browse/trunk/Mikera/src/mikera/persistent/Text.java

I think Disjoint Set is pretty nifty for cases when you need to divide a bunch of items into distinct sets and query membership. Good implementation of the Union and Find operations result in amortized costs that are effectively constant (inverse of Ackermnan's Function, if I recall my data structures class correctly).

Anyone with experience in 3D rendering should be familiar with BSP trees. Generally, it's the method by structuring a 3D scene to be manageable for rendering knowing the camera coordinates and bearing.

Binary space partitioning (BSP) is a
method for recursively subdividing a
space into convex sets by hyperplanes.
This subdivision gives rise to a
representation of the scene by means
of a tree data structure known as a
BSP tree.

In other words, it is a method of
breaking up intricately shaped
polygons into convex sets, or smaller
polygons consisting entirely of
non-reflex angles (angles smaller than
180°). For a more general description
of space partitioning, see space
partitioning.

Originally, this approach was proposed
in 3D computer graphics to increase
the rendering efficiency. Some other
applications include performing
geometrical operations with shapes
(constructive solid geometry) in CAD,
collision detection in robotics and 3D
computer games, and other computer
applications that involve handling of
complex spatial scenes.

Enhanced hashing algorithms are quite interesting. Linear hashing is neat, because it allows splitting one "bucket" in your hash table at a time, rather than rehashing the entire table. This is especially useful for distributed caches. However, with most simple splitting policies, you end up splitting all buckets in quick succession, and the load factor of the table oscillates pretty badly.

I think that spiral hashing is really neat too. Like linear hashing, one bucket at a time is split, and a little less than half of the records in the bucket are put into the same new bucket. It's very clean and fast. However, it can be inefficient if each "bucket" is hosted by a machine with similar specs. To utilize the hardware fully, you want a mix of less- and more-powerful machines.

Representing sets of items and performing very fast logical operations on those sets.

Any boolean expression, with the intention of finding all solutions for the expression

Note that many problems can be represented as a boolean expression. For instance the solution to a suduku can be expressed as a boolean expression. And building a BDD for that boolean expression will immediately yield the solution(s).

An interesting variant of the hash table is called Cuckoo Hashing. It uses multiple hash functions instead of just 1 in order to deal with hash collisions. Collisions are resolved by removing the old object from the location specified by the primary hash, and moving it to a location specified by an alternate hash function. Cuckoo Hashing allows for more efficient use of memory space because you can increase your load factor up to 91% with only 3 hash functions and still have good access time.

They're used in some of the fastest known algorithms (asymptotically) for a lot of graph-related problems, such as the Shortest Path problem. Dijkstra's algorithm runs in O(E log V) time with standard binary heaps; using Fibonacci heaps improves that to O(E + V log V), which is a huge speedup for sparse graphs. Unfortunately, though, they have a high constant factor, often making them impractical in practice.

These guys here made them run competetive in comparison to other heap kinds: http://www.cphstl.dk/Presentation/SEA2010/SEA-10.pdfThere is a related data structure called Pairing Heaps that's easier to implement and that offers pretty good practical performance. However, the theoretical analysis is partially open.

I think lock-free alternatives to standard data structures i.e lock-free queue, stack and list are much overlooked.
They are increasingly relevant as concurrency becomes a higher priority and are much more admirable goal than using Mutexes or locks to handle concurrent read/writes.

You can use a min-heap to find the minimum element in constant time, or a max-heap to find the the maximum element. But what if you wanted to do both operations? You can use a Min-Max to do both operations in constant time. It works by using min max ordering: alternating between min and max heap comparison between consecutive tree levels.

One lesser known, but pretty nifty data structure is the Fenwick Tree (also sometimes called a Binary Indexed Tree or BIT). It stores cumulative sums and supports O(log(n)) operations. Although cumulative sums might not sound very exciting, it can be adapted to solve many problems requiring a sorted/log(n) data structure.

IMO, its main selling point is the ease with which can be implemented. Very useful in solving algorithmic problems that would involve coding a red-black/avl tree otherwise.

Have a look at Finger Trees, especially if you're a fan of the previously mentioned purely functional data structures. They're a functional representation of persistent sequences supporting access to the ends in amortized constant time, and concatenation and splitting in time logarithmic in the size of the smaller piece.

Our functional 2-3 finger trees are an instance of a general design technique in- troduced by Okasaki (1998), called implicit recursive slowdown. We have already noted that these trees are an extension of his implicit deque structure, replacing pairs with 2-3 nodes to provide the flexibility required for efficient concatenation and splitting.

A Finger Tree can be parameterized with a monoid, and using different monoids will result in different behaviors for the tree. This lets Finger Trees simulate other data structures.

A proper string data structure. Almost every programmer settles for whatever native support that a language has for the structure and that's usually inefficient (especially for building strings, you need a separate class or something else).

The worst is treating a string as a character array in C and relying on the NULL byte for safety.

The other extreme is C++, where every self-respecting library comes with its own string data type, which is of course incompatible with all the other string types except `const char*`. Personally, I prefer the environments where I don't have to spend so much time converting strings from one type to another.

Zippers - derivatives of data structures that modify the structure to have a natural notion of 'cursor' -- current location. These are really useful as they guarantee indicies cannot be out of bound -- used, e.g. in the xmonad window manager to track which window has focused.

Nested sets are nice for representing trees in the relational databases and running queries on them. For instance, ActiveRecord (Ruby on Rails' default ORM) comes with a very simple nested set plugin, which makes working with trees trivial.

Zipper is yet another simple data structure which is very useful in looking at a specific node of a tree or a list (like for example text editing, xml trees etc) in a functional programming perspective.

It's pretty domain-specific, but half-edge data structure is pretty neat. It provides a way to iterate over polygon meshes (faces and edges) which is very useful in computer graphics and computational geometry.

Left Leaning Red-Black Trees. A significantly simplified implementation of red-black trees by Robert Sedgewick published in 2008 (~half the lines of code to implement). If you've ever had trouble wrapping your head around the implementation of a Red-Black tree, read about this variant.

Scapegoat trees. A classic problem with plain binary trees is that they become unbalanced (e.g. when keys are inserted in ascending order.)

Balanced binary trees (AKA AVL trees) waste a lot of time balancing after each insertion.

Red-Black trees stay balanced, but require a extra bit of storage for each node.

Scapegoat trees stay balanced like red-black trees, but don't require ANY additional storage. They do this by analyzing the tree after each insertion, and making minor adjustments. See http://en.wikipedia.org/wiki/Scapegoat_tree.

Compilers use a structure that is recursive but not like a tree. Inner scopes have a pointer to an enclosing scope so the nesting is inside-out. Verifying whether a variable is in scope is a recursive call from the inside scope to the enclosing scope.

There is a clever Data-structure out there that uses Arrays to save the Data of the Elements, but the Arrays are linked together in an Linked-List/Array.

This does have the advantage that the iteration over the elements is very fast (faster than a pure linked-list approach) and the costs for moving the Arrays with the Elements around in Memory and/or (de-)allocation are at a minimum.
(Because of this this data-structure is usefull for Simulation stuff).

"...and that an additional array is allocated and linked in to the cell list of arrays of particles. This is similar in some respects to how TBB implemented its concurrent container."(it is about ther Performance of Linked Lists vs. Arrays)

There are faster implementations around (see ActiveState's Python recipes or implementations in other languages), but I think/hope my code helps to understand these data structures.

By the way, both BK and VP trees can be used for much more than searching for similar strings. You can do similarity searches for arbitrary objects as long as you have a distance function that satisfies a few conditions (positivity, symmetry, triangle inequality).

Splash Tables are great. They're like a normal hash table, except they guarantee constant-time lookup and can handle 90% utilization without losing performance. They're a generalization of the Cuckoo Hash (also a great data structure). They do appear to be patented, but as with most pure software patents I wouldn't worry too much.

They are used extensively in Apache. Basically they are a linked list that loops around on itself in a ring. I am not sure if they are used outside of Apache and Apache modules but they fit the bill as a cool yet lesser known data structure. A bucket is a container for some arbitrary data and a bucket brigade is a collection of buckets. The idea is that you want to be able to modify and insert data at any point in the structure.

Lets say that you have a bucket brigade that contains an html document with one character per bucket. You want to convert all the < and > symbols into &lt; and &gt; entities. The bucket brigade allows you to insert some extra buckets in the brigade when you come across a < or > symbol in order to fit the extra characters required for the entity. Because the bucket brigade is in a ring you can insert backwards or forwards. This is much easier to do (in C) than using a simple buffer.

I had good luck with WPL Trees before. A tree variant that minimizes the weighted path length of the branches. Weight is determined by node access, so that frequently-accessed nodes migrate closer to the root. Not sure how they compare to splay trees, as I've never used those.

Fenwick Tree. It's a data structure to keep count of the sum of all elements in a vector, between two given subindexes i and j. The trivial solution, precalculating the sum since the begining doesn't allow to update a item (you have to do O(n) work to keep up).

Fenwick Trees allow you to update and query in O(log n), and how it works is really cool and simple. It's really well explained in Fenwick's original paper, freely available here:

Its father, the RQM tree is also very cool: It allows you to keep info about the minimum element between two indexes of the vector, and it also works in O(log n) update and query. I like to teach first the RQM and then the Fenwick Tree.

They work as a catenable deque with O(1) access to either end, O(log min(n,m)) append, and provide O(log min(n,length - n)) indexing with direct access to a monoidal prefix sum over any portion of the sequence.

Despite their long name, they provide asymptotically optimal heap operations, even in a function setting.

O(1) size, union, insert, minimum

O(log n) deleteMin

Note that union takes O(1) rather than O(log n) time unlike the more well-known heaps that are commonly covered in data structure textbooks, such as leftist heaps. And unlike Fibonacci heaps, those asymptotics are worst-case, rather than amortized, even if used persistently!

Another nice use case is for weighted random decisions. Suppose you have a list of symbols and associated probabilites, and you want to pick them at random according to these probabilities

a => 0.1
b => 0.5
c => 0.4

Then you do a running sum of all the probabilities:

(0.1, 0.6, 1.0)

This is your inversion list. You generate a random number between 0 and 1, and find the index of the next higher entry in the list. You can do that with a binary search, because it's sorted. Once you've got the index, you can look up the symbol in the original list.

If you have n symbols, you have O(n) preparation time, and then O(log(n)) acess time for each randomly chosen symbol - independently of the distribution of weights.