Introduction

This article describes an IDictionary<TKey,TValue> implementation similar to SortedList<TKey,TValue>. It uses TKey[] and TValue[] arrays as the underlying data structure. The performance gain comes from cyclic array indexing and from pre-fetching the 32 most significant bits of TKey.

Asymptotic Time Complexity of SplitArrayIntDictionary<TValue>.Insert()

For comparison, the average time of a balanced binary tree operation (SortedDictionary<TKey,TValue>.Insert(TKey,TValue)) is O(log n), which is more efficient for large dictionaries. For small dictionaries (up to roughly 10,000 - 50,000 elements), the Array.Copy operation is so efficient that inserting a new key into SplitArrayIntDictionary is faster than into SortedDictionary (see the Excel graphs at the top of this page).

Cyclic Array.Copy Background

The .NET SortedList implements IDictionary<TKey,TValue> by keeping two arrays - TKey[] and TValue[] - synchronized. The algorithm basically shifts the arrays each time a key needs to be inserted into the middle of the array. The shift operation uses Array.Copy (which is implemented internally using memmove). memmove is much faster than an equivalent C# for loop, especially if we do not want to use auxiliary memory (an in-place copy operation).
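As a minimal sketch of this idea (illustrative names, not the article's actual code), inserting into a plain sorted array with a single Array.Copy call looks like this:

```csharp
using System;

static class ShiftDemo
{
    // Insert 'key' at its sorted position within keys[0..count), shifting the
    // tail one slot to the right with a single Array.Copy (memmove) call.
    public static int Insert(int[] keys, int count, int key)
    {
        int pos = Array.BinarySearch(keys, 0, count, key);
        if (pos < 0) pos = ~pos;                       // insertion point
        Array.Copy(keys, pos, keys, pos + 1, count - pos);
        keys[pos] = key;
        return pos;
    }
}
```

With cyclic indexing, a logical position maps to a physical slot modulo capacity, so an insert can shift whichever side is shorter (head left or tail right), roughly halving the average copy length.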

Three additional issues incur overhead on the Copy operation. First, we need to keep the TKey and TValue arrays synchronized, so each Array.Copy command is executed twice. (We could have used a single array of KeyValuePair<TKey,TValue>, but that would involve boxing of TKey and would also hurt the cache locality of TKey[].)

Third, the bigger the array, the more time the Array.Copy command requires - especially if the source and destination cause too many CPU cache misses (see the Points of Interest section at the bottom of this page).

TKey Comparisons

One of the most time-consuming operations in the algorithm is the Search loop. Most (if not all) .NET collections require either that TKey implement the IComparable interface, or that the user provide an IComparer. The approach in this example is to optimize for speed at the cost of "Best Practices". Mind you, it is still possible to use IComparer and IComparable, but that would incur the penalty of a second virtual function call.

The C# JIT compiles the Search method once and distinguishes between different TKey types with a vtable (similar to virtual method calls): the table points either to Compare(int,int) when TKey==int, or to Compare(long,long) when TKey==long. The JIT does not inline the Compare method inside the loop, since that would require separate Search compilations for int and long. This JIT decision is by design, and until it is relaxed - well - we need to look for something faster.

Abstract methods are not considered "best practice" for implementing a C# "contract", and the JIT does not inline abstract or virtual function calls either. However, abstract method calls are faster, since the JIT does not insert an if (Comparer != null) {} check before the virtual call ("this" is never null).
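A hedged sketch of this pattern (illustrative names, not the article's actual code): the comparison is an abstract method on the collection's own base class, so the search loop makes a plain virtual call with no comparer-field null check.

```csharp
using System;

// The comparison lives on the collection itself as an abstract method,
// rather than on a stored IComparer<TKey> field.
abstract class SearchBase<TKey>
{
    // Abstract call site: 'this' is never null, so no null check is emitted.
    protected abstract int Compare(TKey x, TKey y);

    // Binary search returning the first index whose key is >= 'key'.
    public int LowerBound(TKey[] keys, int count, TKey key)
    {
        int lo = 0, hi = count;
        while (lo < hi)
        {
            int mid = (lo + hi) >> 1;
            if (Compare(keys[mid], key) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }
}

// A concrete dictionary seals the comparison for its key type.
sealed class LongSearch : SearchBase<long>
{
    protected override int Compare(long x, long y) => x.CompareTo(y);
}
```

Sealing the derived class gives the JIT a chance to devirtualize the Compare call entirely.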

The second caveat is that Compare performs excess computation: it distinguishes between three conditions - "<", "==", and ">". In some cases, we only need to distinguish between two: "<=" or ">".

In order to speed up the comparisons even further, we pre-fetch the 32 most significant bits of each key and store them in a separate nodePrefetch[] int array. So we define yet another abstract method, PrefetchKeyMSB(TKey), which returns the 32 most significant bits of the key. For example, the MSB of a long is its upper 32 bits, and the MSB of a person's name is the first two letters of the last name.
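A minimal sketch of the two examples just mentioned (class names are illustrative; the method is shown as public here only so it can be exercised directly):

```csharp
using System;

// Each concrete dictionary supplies the 32 most significant bits of its key,
// which the collection caches in a side int[] so that most comparisons never
// touch TKey itself.
abstract class PrefetchBase<TKey>
{
    public abstract int PrefetchKeyMSB(TKey key);
}

sealed class LongPrefetch : PrefetchBase<long>
{
    // MSB of a long key: its upper 32 bits.
    public override int PrefetchKeyMSB(long key) => (int)(key >> 32);
}

sealed class NamePrefetch : PrefetchBase<string>
{
    // MSB of a last name: its first two letters, packed into an int.
    public override int PrefetchKeyMSB(string lastName) =>
        (lastName.Length > 0 ? lastName[0] << 16 : 0) |
        (lastName.Length > 1 ? lastName[1] : 0);
}
```

Comparing two prefetched ints resolves most searches; only when the prefetched values are equal does the loop fall back to the full Compare on TKey.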

Using the Code

The code is released under the MIT Open Source license. The IDictionary methods implemented are: Insert, TryGetValue, indexer[], and Clear. You are invited to implement the rest of the methods. Do not forget to derive your own class from SplitArrayDictionary<TKey,TValue> and implement its abstract methods, Compare(TKey,TKey) and PrefetchKeyMSB(TKey).

Benchmarks

All benchmarks ran faster on the laptop, except SortedList, which ran faster on the workstation in the following cases:

100% random data when number of elements > 500,000 (maximum that was tested).

10% random data when number of elements > 100,000.

What do you think is the reason for that?

The memory consumption of this implementation was slightly lower than SortedList's and about half of what SortedDictionary consumed. Memory and time benchmarking were done following Tony's (Tobias Hertkorn) advice.

FAQ

Q: Wouldn't it be more cost-beneficial to buy a better computer (rather than tweaking the implementation)?

A: Performance-hungry clients will spend money on new hardware whether you tweak the code or not. Any improvement on top of that is always welcome.

A: The .NET Framework is, well, a framework. I believe Microsoft should leave some room for contributors (CodeProject, anyone?). You are all invited to peer review this work and contribute so that the code becomes more complete and robust.

Q: Is unmanaged code faster?

A: Yes... for now. As Microsoft/Sun/IBM continue to tweak their compilers, higher-level languages will close the gap. Part of this work is to highlight places that need attention. Banging my head on the wall did not leave any long-lasting impression; hopefully this work will do better.

Q: What about the (poor) asymptotic performance?

A: It should be possible to use this implementation as the node of a balanced binary tree. The hybrid data structure would keep the "CPU cache locality" boost together with O(log n) asymptotic performance. Unfortunately, I don't have more time to work on it. If you are into it, start reading about AVL trees and scapegoat trees. Since the arrays are quite big, one can use "helper" objects without affecting the memory consumption much. So if, for example, we track for each array its total weight (including children) and add a back pointer to the parent node, together with some garbage collection awareness ...

Q: Who cares about a slightly faster Dictionary?

A: Dictionaries are all about fast search performance, so even a small gain will probably prove useful to someone out there. Some of the tweaks described in this work may prove useful for other data structures as well.

1. The SplitArrayDictionary has a public Count property which calls the protected SortedCyclicSplitArray.Count.

Good OO practice would require a containment relationship (not inheritance) between these two classes, in which case Count would have been public in both. However, in this article inheritance is preferred, as it is slightly faster.

2. Above a certain capacity, the dictionary is slower than SortedDictionary. What you could do is initialize the SplitArrayDictionary to the sweet-spot capacity (around 1E+4) and switch the implementation to a SortedDictionary once it gets full (similar to System.Collections.Specialized.HybridDictionary).

3. Just published a new release that supports enumerators and removing items.
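A hypothetical sketch of the hybrid approach from point 2. Since SplitArrayDictionary's code is not reproduced here, SortedList stands in for the small, flat phase; the class and the SweetSpot value are illustrative assumptions.

```csharp
using System;
using System.Collections.Generic;

// Stay flat up to the sweet-spot capacity, then hand everything to a
// SortedDictionary - the same spirit as
// System.Collections.Specialized.HybridDictionary.
sealed class HybridSortedMap<TKey, TValue>
{
    const int SweetSpot = 10000;                       // ~1E+4, per the text
    SortedList<TKey, TValue> _small = new SortedList<TKey, TValue>();
    SortedDictionary<TKey, TValue> _large;

    public void Add(TKey key, TValue value)
    {
        if (_large != null) { _large.Add(key, value); return; }
        _small.Add(key, value);
        if (_small.Count >= SweetSpot)                 // switch once "full"
        {
            _large = new SortedDictionary<TKey, TValue>(_small);
            _small = null;
        }
    }

    public bool TryGetValue(TKey key, out TValue value) =>
        _large != null ? _large.TryGetValue(key, out value)
                       : _small.TryGetValue(key, out value);

    public int Count => _large != null ? _large.Count : _small.Count;
}
```

The one-time switch copies all elements, so in practice the threshold should sit where the flat array's insert cost starts losing to O(log n).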

So, it sounds like you did quite a bit of research and spent a heck of a lot of time on this. What's the payout? I mean, if it's a 3% increase that's good, but wouldn't it be more cost-beneficial to just use a better computer? A bit more RAM? A farm, if need be?