Introduction

About a year ago, I needed a dynamic collection type that supported quick key lookups based on a sort order. After evaluating what the .NET SortedDictionary provided, as well as looking around for something I could use, I decided to write my own sorted data collection, based on an in-memory BTree. I recently needed this again, and decided to release the code to others.

Update

When I posted the original code and article, I thought I was able to handle duplicate keys. After I started trying to use the code in my own project, I discovered that duplicate handling did not actually work.

While adding the ability to manage duplicates, I found the original interfaces were a bit hard to wrap my head around, mostly because of the names I had used. I decided to update the interface with names that better describe what they do.

I also cleaned up the code a little bit: added more commenting to explain logic, and tried to bring the collection and dictionary implementations closer together (there is a fair amount of copy-and-paste code; I did evaluate refactoring them, but decided not to tackle that right now).

Background

The 'go to' lookup data structure in .NET that I typically see used is the Dictionary<K,V> generic collection. While this is very fast (O(1) for lookup, insertion, and removal), it lacks any kind of ordering constraints. If you need to, for example, find the next 10 keys with value 52941 or greater, in an ever-changing large collection of numbers, you have no real option but to perform expensive scans of the collection (O(N*M) on average, where N is the number of items and M is the number of times you need to perform the scan).
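As a sketch of that scan-based workaround (plain LINQ over a Dictionary<int, string>; the key 52941 is just the example value from above):

```csharp
using System.Collections.Generic;
using System.Linq;

var dict = new Dictionary<int, string> { [52940] = "a", [52941] = "b", [60000] = "c" };

// No ordering to exploit: Where() touches every entry, and OrderBy() sorts
// the matches, every single time the query runs - O(N) or worse per query.
List<int> next10 = dict.Keys
    .Where(k => k >= 52941)
    .OrderBy(k => k)
    .Take(10)
    .ToList();
```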

Normally, to solve those kinds of needs, we would use a balanced tree structure: Red-Black trees (equivalent to 2-3-4 trees), or 2-3 trees. The solution I decided to try implementing is a BTree, a structure normally used for file I/O, but which, as I found, performs competitively with the balanced binary tree structures. Implementing my own solution also allowed me to add some features that I've found very useful to have.

Had the .NET SortedDictionary<K,V> generic collection implemented the original feature I needed, the "upper" and "lower" bound iteration, I probably would have never tried implementing my own solution (or, perhaps it does support this; I for one could not figure out how to tap into that feature if it does exist). I was somewhat surprised when my implementation benchmarked 20-30% faster than the .NET sorted dictionary, for comparable operations.

Using the code

There are two interfaces, two classes implementing those interfaces, and a test harness/NUnit test assembly. Note: I use Visual Studio Express for development at home, and therefore must sometimes get a little creative in order to debug problems. Much of the harness simply adds that ability. Normally, if we could attach a debugger to the unit tests themselves, much of the harness would be unnecessary. In any case, it makes a useful benchmarking tool.

The first data structure is the simple 'set': BTree<TKey>. This class has the following features available:

It implements the ICollection<TKey> interface.

It efficiently maintains the collection in sorted order as items are added or removed, based on the key comparer provided at construction - both insert and remove are O(log N).

It is efficient at providing any bounds based enumeration, in either ascending or descending order - O(log N) + O(K) for K elements beyond the first.

It is (relatively) efficient at providing a value at a specific index in the collection - O(log N).

It supports either enforcing a uniqueness constraint, or allowing duplicates.

There is a code sample below for a simple usage example, but I'll go into some of the interface methods and talk a little more about implementation details and some interesting uses.
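A minimal usage sketch first (the constructor argument and exact member signatures here are my assumptions, based on the feature list above; check the accompanying source for the real shape):

```csharp
using System.Collections.Generic;

// Assumes the article's BTree<TKey>, constructed with a key comparer.
var tree = new BTree<int>(Comparer<int>.Default);

foreach (var n in new[] { 42, 7, 19, 99 })
    tree.Add(n);                    // O(log N) insert; sorted order maintained

bool found = tree.Contains(19);     // ICollection<T> membership test
tree.Remove(7);                     // O(log N) removal
```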

One point to be aware of relating to the implementation: it uses a counted B+Tree structure. This means the leaf nodes, stored as a doubly linked list, contain all the information needed to reconstruct the list. Root and intermediate nodes store the first key and the total item count for each of their child nodes; those counts are what make the index-based operations possible in O(log N).

bool AllowDuplicates

I added this property to the original code I posted previously. The interface is read-only for both the collection and the dictionary versions of the sorted collection; however, there was no reason it had to be read-only in the BTree<T> implementations. I did add a (perhaps too arbitrary) constraint that it cannot be changed from true to false while the collection has any elements in it. It should be easy to modify the code and relax that constraint if there is a specific need; a "HasDuplicates()" computation would probably be useful then, to verify that a collection claiming to disallow duplicates actually contains none. If the constraint is relaxed, and the collection contains duplicates while AllowDuplicates is off, all operations related to duplicate values fall back to ambiguous behavior.

I wrote the BTree<T> class behavior first, and kept it simple. I did not really need this to have any ambiguity resolution, so there is none. If you add duplicates, they will enumerate in an arbitrary order.

The BTreeDictionary<TKey,TValue> implementation includes three biasing properties to resolve the ambiguity: InsertBias, LookupBias, and RemoveBias. The logic uses the signs of these property values to determine whether to bias towards the first or the last duplicate item when adding, looking up, or removing a value using a key-based operation. This allows the collection to act as a series of queues (FIFO) or stacks (LIFO) for the values associated with each key.
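As a sketch of how the biases might be combined (the property names come from the description above, but the exact sign semantics and the Add/Remove signatures are my assumptions): a positive InsertBias with a negative RemoveBias makes each key's duplicates behave as a FIFO queue.

```csharp
using System.Collections.Generic;

var dict = new BTreeDictionary<int, string>(Comparer<int>.Default);
dict.AllowDuplicates = true;

// Insert after existing duplicates, remove the oldest first: FIFO per key.
dict.InsertBias = +1;    // bias toward the last duplicate on insert
dict.RemoveBias = -1;    // bias toward the first duplicate on remove

dict.Add(5, "first");
dict.Add(5, "second");
dict.Remove(5);          // with these biases, removes "first" - queue behavior
```

Flipping RemoveBias to +1 would instead give LIFO (stack) behavior per key.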

IEnumerable<> WhereGreaterOrEqual, WhereLessOrEqualBackwards

These two methods form the cornerstone for range based searching. They efficiently allow you to take a look into the current structure in the collection, and begin enumerating forwards or backwards through the items in the collection starting with a specific pivot key value. For example, to get all tags in a sorted dictionary where the associated keys are between 0.4 and 0.6, you might use something like the following:
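A sketch of that query (assuming data is a BTreeDictionary<double, string> mapping keys to tags; WhereGreaterOrEqual is named above, while the enumerated pair shape is my assumption):

```csharp
using System;
using System.Collections.Generic;

foreach (KeyValuePair<double, string> pair in data.WhereGreaterOrEqual(0.4))
{
    if (pair.Key > 0.6)
        break;                      // keys arrive in ascending order, so stop here
    Console.WriteLine(pair.Value);  // the tag associated with this key
}
```

Finding the starting point costs O(log N); each step after that is O(1), so the whole query is O(log N + K) for K matching items.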

Because this implementation stores count information, it is (relatively) efficient at operations that take or return the index of the item of interest, rather than its key. The BTree<> class could fairly easily be adapted to provide an unsorted list with O(log N) random access, including insertion and removal at an arbitrary index (compare List<T>, which provides O(1) insert at the end, but O(N) insert at a random point). In any case, the index-based operations can be useful for sorted collections too; for example, to get the total number of items (inclusively) between 0.4 and 0.8, one could use data.FirstIndexGreaterThan(0.8) - data.LastIndexLessThan(0.4) to get an O(log N) range size.

IsReadOnly

One more feature I added to the 1.1 version of the DLL is the ability to dynamically mark the collections read-only. This will most likely be useful during enumeration. One could set IsReadOnly to true, enumerate, then set it back to false, to ensure that anything attempting to modify the collection while enumerating will throw an exception. Continuing an enumeration after modifying the collection can lead to some "fun": it will probably work 90% of the time, but will occasionally crash, skip existing elements, or return elements it has already returned. I did not add any version checking to the enumerators themselves, but that should be easily tweaked if you need that level of safety.
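One way to apply this (IsReadOnly as a settable property is described above; the rest of the shape, including the Process callback, is a sketch):

```csharp
data.IsReadOnly = true;
try
{
    foreach (var pair in data)
        Process(pair);        // any attempted Add/Remove now throws
}
finally
{
    data.IsReadOnly = false;  // restore mutability even if Process throws
}
```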

Some interface methods and class implementation in the Dictionary may appear to be missing, until you realize that the Keys property on the dictionary provides that functionality. I did not add the (usually) redundant interface and implementation members when those values were as easily obtained via the Keys property on the dictionary.

The code sample below illustrates a solution for a simple problem: suppose you needed code that could add a number, request the median, or request the Kth value at any moment. Given the BTree<K> collection, such code is easy to implement:
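A sketch of such a container follows. It assumes BTree<T> exposes an indexer (or equivalent member) returning the item at a given sorted index in O(log N), per the index-based feature described above; the class and member names here are mine.

```csharp
using System.Collections.Generic;

class RunningStats
{
    private readonly BTree<double> _tree = new BTree<double>(Comparer<double>.Default);

    public void Add(double value) => _tree.Add(value);   // O(log N) insert

    public double Kth(int k) => _tree[k];                // O(log N) index lookup

    public double Median()
    {
        int n = _tree.Count;
        return n % 2 == 1
            ? _tree[n / 2]                               // odd count: middle item
            : (_tree[n / 2 - 1] + _tree[n / 2]) / 2.0;   // even: average the two middles
    }
}
```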

While you could probably get away with using a List or SortedList, such an implementation will have severe performance problems if you continuously add new random values while also continuously querying for the Kth or median value.

The second data structure, BTreeDictionary<TKey,TValue>, provides the following features:

It implements the IDictionary<K,V> generic interface.

It efficiently maintains all keys in sorted order, having the same performance characteristics as the simpler BTree<K> type, as described above, including the ability to get the Kth key/value element.

Some operations may yield ambiguous results. I added some properties to define precisely how to resolve the ambiguity.

Extending the code sample above, if you wanted to record some tag with each number, and retrieve those tags along with the Kth element, or medians, that would be possible too, with simple modifications.
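For instance (a sketch; "AtIndex" is a hypothetical name for the Kth-element lookup the article describes, so check the source for the actual member):

```csharp
using System.Collections.Generic;

var tagged = new BTreeDictionary<double, string>(Comparer<double>.Default);
tagged.Add(3.14, "pi");
tagged.Add(2.72, "e");
tagged.Add(1.62, "phi");

// Keys stay sorted, so the middle pair (number plus its tag) is
// still an O(log N) index lookup.
KeyValuePair<double, string> median = tagged.AtIndex(tagged.Count / 2);
```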

Points of interest

As I noted above, I was surprised to find my code had better benchmarks when compared to the SortedDictionary<K,V> collection type. For 1,000,000 adds, followed by 1,000,000 removes, on random, sorted, or unsorted data, the timings I got on my machine are as follows:

Collection Type         Random Numbers   In Order Numbers   Reverse Order Numbers
Dictionary<K,V>         0.5s             0.5s               0.5s
SortedDictionary<K,V>   5.7s             2.3s               2.9s
BTreeDictionary<K,V>    4.2s             2.4s               2.2s
BTree<K>                2.3s             1.8s               2.0s

I believe much of the improvement over the SortedDictionary can be explained by the BTree strategy's greater cache locality (items are grouped, on average 50-100 per node, in a simple array) and better GC performance (only root and intermediate nodes store pointers to child nodes, so there is less memory pressure than storing a left and right child pointer with every item key value). It was interesting that, for the 32-bit integer keys I tested with, as I increased the node capacity from 10 through 200, performance peaked at a node capacity of about 120. I was expecting it to peak at a much lower number, considering that on average, an insert needs to move half the items in a node to make room for a new key. Apparently, it is faster to move an extra 20-30 items than to pay the overhead of the extra branching and scattered memory accesses that smaller nodes incur.

A commenter suggested a SortedList; it did not return after several minutes on the random test, so I did not include it. Also, realize these metrics can be beaten, amortized, by a simple sorted list if you do not need to deal with ever-changing data. These data structures are tuned for incrementally performing operations while new items are being added to, or existing items removed from, the collection.

Comments and Discussions

Got my 5. The idea to improve cache locality by chunking the nodes is a good one.
I could learn a lot from your code, especially how to use Array.BinarySearch correctly.
During my journeys with Reflector I did find BinarySearch, and did wonder why it was returning negated indices. Now I know.