Understanding Generic Dictionary in-depth

Introduction

Generic Dictionary is a great instrument to have in your toolset.

The Generic part keeps us type-safe and helps avoid boxing/unboxing, while the Dictionary part allows us to manage key/value pairs and access them easily. It also allows us to add, remove and seek items in constant time complexity - O(1) - that is, if you know how to use it properly.

Let's start with a simple example of adding items to a dictionary and see what the time complexity for each operation is:
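The original listing is not reproduced here; a minimal sketch consistent with the description below (the specific keys and values are illustrative) would be:

```csharp
using System;
using System.Collections.Generic;

var myDictionary = new Dictionary<int, string>(); // default constructor

myDictionary.Add(1, "Item no'1"); // O(1)
myDictionary.Add(2, "Item no'2"); // O(1)
myDictionary.Add(3, "Item no'3"); // O(1)
myDictionary.Add(4, "Item no'4"); // O(N) - the inner array must grow
myDictionary.Add(5, "Item no'5"); // O(1)
myDictionary.Add(6, "Item no'6"); // O(1)
myDictionary.Add(7, "Item no'7"); // O(1)
myDictionary.Add(8, "Item no'8"); // O(N) - the inner array must grow again
```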

We can see that adding the first three items has O(1) time complexity – which is what we expected. Adding the fourth and the eighth item has O(N) time complexity (where N is the number of items that already exist in the dictionary).

In order to understand what happened here we need to understand how Generic Dictionary manages its items.

Background

In order to provide O(1) time complexities in common add/remove/seek operations, Generic Dictionary is built using a hash table.

In short, a hash table is a data structure consisting of an array of "buckets" for storing the elements. The way hash tables handle insertion of new items is by extracting a hash code from each object and using that hash code to determine in which bucket to place the item – usually by performing an additional simple calculation to adjust the object's hash code to the size of the buckets array.

Object.GetHashCode method

One of System.Object's small group of methods is a virtual method called GetHashCode which has the following method signature:

public virtual int GetHashCode()

As System.Object is the ultimate base class in .NET, all types (even value types, which also derive from System.Object) have built-in hash code generation functionality, which can be overridden by sub-classes when needed.

This important method is used by all hash-based collections when objects are added, removed or accessed.

The default GetHashCode implementation provided by System.Object should be sufficient in most cases, but later in this article we will discuss cases where overriding the GetHashCode method and creating your own hashing algorithm is a must.

The reason the above figure describes a "simple" hash table is that it does not support collisions.

Hash codes in .NET (as returned from the object's GetHashCode virtual method) are of type Int32 – which means there are 2^32 possible values the method can return.

Does that mean we cannot have more than 2^32 objects in our application? Of course not. In fact, even the default object GetHashCode is not guaranteed to return unique values for different objects.

In addition, our buckets array is probably much smaller than that. In the figure above the buckets array is set to 30 items.

The obvious question here is - could two items get the same index in the buckets array? Sure. With more possible hash codes than buckets, the pigeonhole principle guarantees exactly that.

When two objects get the same index in the buckets array, we have a collision – we will see how the .NET Dictionary handles collisions next.

Adding items

In the example above, we used the Dictionary's default constructor.

Dictionary<int,string> myDictionary = new Dictionary<int,string>();

When instantiating a Dictionary with the default constructor, an array of 3 buckets is allocated.

After the first three items have been added, our Dictionary's inner array of buckets is full.

Note - this means that the total number of items added to the dictionary is equal to the size of the array. It does not mean that all of the buckets array's slots are occupied – as we saw above, there could be a collision where two items get the same index (after the hash code has been adjusted to the buckets array size – in this case using the MOD operator).


Let's add another item.

myDictionary.Add(4, "Item no'4"); // O(N)

When trying to add the 4th item, the dictionary checks whether the number of items is equal to the size of the array; in this case it is, so it allocates a new, larger array. Here, the new size of the array will be 7, and after adding a couple more items the dictionary will be resized again, to 17.

Three? Seven? Seventeen? Exactly. The Dictionary implementation has an inner pre-calculated list of prime numbers that are used for the array size.

When the array needs to be resized, the new size will be the next item in the prime numbers list (above) that is larger than the result of multiplying the old capacity by 2.

For example:

On the first three insertions the array size is 3 – the next item causes the Dictionary to resize to 7, by the following calculation: 3 * 2 = 6 –> next larger prime number in the list – 7.

A few items later, when we try to insert the 8th item, the Dictionary will be resized to: 7 * 2 = 14 –> next larger prime number in the list – 17.
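The growth rule can be sketched as follows. The real prime table lives in the BCL's internal HashHelpers class, so the truncated list and method below are an illustrative stand-in, not the actual implementation:

```csharp
using System.Linq;

static class ResizeSketch
{
    // A truncated, illustrative subset of the internal prime table.
    static readonly int[] Primes = { 3, 7, 11, 17, 23, 29, 37, 47, 59, 71 };

    // New capacity: the first prime in the table larger than twice the old one.
    public static int NextCapacity(int oldCapacity) =>
        Primes.First(p => p > oldCapacity * 2);
}
```

With this rule, ResizeSketch.NextCapacity(3) yields 7 and ResizeSketch.NextCapacity(7) yields 17, matching the examples above.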


Note: If more items are added than the largest number in the table (i.e. 7,199,369), the resize method will manually search for the next prime number that is larger than twice the old size.


Note: The reason the sizes are doubled on each resize is to keep the amortized cost of insertions constant – each O(N) resize is paid for by the many cheap insertions that preceded it. The prime-number sizes help spread the items evenly across the buckets even when the hash codes are poorly distributed (for example, when many hash codes share a common factor with the table size).

Going back to the first example – passing 30 as the capacity argument to the Dictionary's constructor will make sure we have O(1) time complexity while adding our items to the dictionary (up to 30 items in this case).

Always set an initial size when instantiating a dictionary – even if you only have a rough estimate of how many items will be added to it. It helps avoid re-allocating and copying large arrays.

Note: When cloning a dictionary, prefer the constructor overload that accepts a source Dictionary over naively iterating through the old collection and adding each item to the new one. That way there won't be any redundant allocations and array copying.
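Both notes translate to code as follows (the 30-item capacity mirrors the first example; the key and value used are illustrative):

```csharp
using System.Collections.Generic;

// Pre-sizing: the inner arrays start big enough for 30 items,
// so no resize (and no O(N) array copy) occurs while filling it.
var myDictionary = new Dictionary<int, string>(30);
myDictionary.Add(1, "Item no'1");

// Cloning: this constructor sizes the new arrays once and copies the
// entries, instead of growing repeatedly as items are re-added one by one.
var clone = new Dictionary<int, string>(myDictionary);
```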

Handling Collisions

There are several ways to handle collisions. Dictionary does it using a method called chaining.

Dictionary actually has two arrays with the same number of items:

Buckets array – stores the index at which the object is stored in the entries array

Entries array – stores the actual items within a special data structure (if it's a reference type – the reference to the item is stored)

Each item in the entries array is a struct with the following fields:

private struct Entry
{
    public int hashCode;  // The entry's (key's) hash code, cached so entries can be compared cheaply
    public int next;      // Index in the entries array of the next item in the collision chain; -1 if there is no collision in that entry
    public TKey key;      // Entry generic Key
    public TValue value;  // Entry generic Value
}

Note that the size of the struct in memory is determined by the types of the provided Key and Value. For example, for Dictionary<int, string>, the size of each entry in the array will be 16 bytes in a 32-bit (x86) process (the int fields take 4 bytes each, and the string field is actually a reference to the real string, which also takes 4 bytes).

The above figure represents the state of the hash table after two objects have been added. Both objects get the same index (4) when calculating the remainder of dividing the hash code by the array size. Obj1 is stored in the first slot within the entries array. Obj2 is stored in the second slot within the entries array. Obj2 holds Obj1's index. When trying to get Obj1 (by using Obj1's key), the hash code leads to index 4 in the buckets array (after the MOD calculation). That bucket holds the index of the last object that got the same index – in this case the value 1 is retrieved – which is Obj2, the root of a linked list. In order to find Obj1 (or to determine whether it already exists), a loop through the linked list is performed, checking each node for equality.
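A simplified sketch of that lookup; the field and method names here are illustrative (the real BCL internals differ in details, and generics are replaced by int keys for brevity):

```csharp
// Simplified entry, mirroring the struct shown above.
struct SketchEntry
{
    public int HashCode;
    public int Next;     // index of the next entry in the collision chain; -1 ends it
    public int Key;
    public string Value;
}

static class ChainSketch
{
    // Walks the collision chain for 'key' and returns its entries-array index, or -1.
    public static int FindEntry(int[] buckets, SketchEntry[] entries, int key)
    {
        int hashCode = key.GetHashCode() & 0x7FFFFFFF; // clear the sign bit
        int bucket = hashCode % buckets.Length;        // adjust to the array size (MOD)

        // buckets[bucket] points at the chain's root (the last item added to it).
        for (int i = buckets[bucket]; i >= 0; i = entries[i].Next)
        {
            if (entries[i].HashCode == hashCode && entries[i].Key == key)
                return i; // found - index within the entries array
        }
        return -1; // not found
    }
}
```

In the figure's terms, Obj2 (added last) is the chain's root referenced from the bucket, and its Next field leads to Obj1, whose Next of -1 ends the chain.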

Note: Interestingly, Generic Dictionary does not seem to support custom load factors as Hashtable does, and the load ratio always remains 1:1 – which means the Dictionary size cannot be less than the number of items in it (even if not all buckets are used, all entries array slots are). For more information on Hashtable load factors, please refer to the Wikipedia page.

Customizing hashing and equality algorithms

Some of Dictionary's constructors accept an IEqualityComparer<T>, which is used for both equality and hash code purposes. The IEqualityComparer<T> interface has two methods, Equals and GetHashCode. In which cases should we use a custom IEqualityComparer? Consider the following scenarios:

You are using a 3rd party type as a Dictionary key and you want to replace its GetHashCode implementation (for performance, distribution of data, etc…)

When EqualityComparer is not being passed to the Dictionary's constructor, EqualityComparer<TKey>.Default is being used.
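As a hedged sketch of such a custom comparer - here, one that makes string keys case-insensitive (the comparer logic is illustrative; note that Equals and GetHashCode must agree, i.e. keys that compare equal must produce the same hash code):

```csharp
using System;
using System.Collections.Generic;

class CaseInsensitiveComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) =>
        string.Equals(x, y, StringComparison.OrdinalIgnoreCase);

    // Normalize before hashing so "Hello" and "HELLO" land in the same bucket.
    public int GetHashCode(string obj) =>
        obj.ToUpperInvariant().GetHashCode();
}

var dict = new Dictionary<string, int>(new CaseInsensitiveComparer());
dict["Hello"] = 1;
bool found = dict.ContainsKey("HELLO"); // true - same bucket, Equals matches
```

For this particular case the framework already ships StringComparer.OrdinalIgnoreCase; a custom comparer is mainly worthwhile when you need to replace the hashing of a type you don't control.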

Automatic distribution adaptation for string keys

Having more than one Dictionary key with the same buckets-array index can be troublesome for performance. Accessing the Dictionary using such a key can cause a lot of array traversals, which calls our expected O(1) time complexity into question. For string keys only, there is a special optimization - in order to make sure that 'get' and 'add' operations do not go over more than 100 items per bucket, a collision counter is used.

If, while traversing the array to find or add an item, the collision counter goes over 100 (the limit is hard-coded) and the EqualityComparer is of type EqualityComparer<string>.Default, a new EqualityComparer<string> instance is generated with an alternative string hashing algorithm.

If such a provider is found, the Dictionary will allocate new arrays and copy the content into them using the new hash code and equality provider.

This optimization can be useful in scenarios where your string keys are somehow not distributed evenly, but it can also lead to massive allocations and wasted CPU time generating new hash codes for what could be a lot of items in the dictionary.


Note: there is another case where such a provider can be generated even if the equality comparer is not EqualityComparer<string>.Default – by implementing the internal interface IWellKnownStringEqualityComparer. As this interface is internal and not exposed externally, it wasn't covered above.

Using custom ValueTypes as Dictionary keys

All C# primitives inherit from System.ValueType (except string, which is not a value type), and all of them override GetHashCode to implement their own hashing algorithm. Hashing algorithms should follow a couple of important principles (actually there are more; you can find them on MSDN):

* GetHashCode should be fast

* GetHashCode should return the same hash code consistently on the same object while it is not modified

* Hash codes should be well distributed, to minimize collisions

For example, GetHashCode for int returns the number itself; GetHashCode for DateTime returns the inner tick count, etc.

The reason all primitive types override GetHashCode is that the default implementation of ValueType.GetHashCode is relatively slow: it is based on going through all the fields and XORing them, with special treatment for memory gaps between the fields and another special treatment for reference-type fields – all of this just to get the object's hash code. Remember the first principle of a good hashing algorithm? That's right, it should be fast.

That is why, when you implement a custom value type that is intended to be used as a dictionary key, you should make sure to override GetHashCode with a good hashing algorithm that suits your type's fields – we saw in the previous examples how important a well-distributed hash code is for minimizing collisions, and thereby for getting better add, access and remove time complexities when working with dictionaries.

As the new type you have created is a custom value type, EqualityComparer<T>.Default will return an ObjectEqualityComparer<T>. When the Dictionary tries to compare keys, it will eventually invoke the Equals(object obj) method, which boxes the object. For more information about boxing/unboxing, review the MSDN documentation.

For that reason, implementing IEqualityComparer<T> (or IEquatable<T> on the type itself) for your custom value type is a must. (Thanks Jonathan C Dickinson for highlighting this important issue)
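A sketch of such a custom value-type key (the Point2D type and its hash mix are illustrative). Overriding GetHashCode gives a fast hash over the immutable fields, and implementing IEquatable<T> lets EqualityComparer<T>.Default call the strongly-typed Equals instead of Equals(object), avoiding the boxing described above:

```csharp
using System;
using System.Collections.Generic;

struct Point2D : IEquatable<Point2D>
{
    public readonly int X;
    public readonly int Y;

    public Point2D(int x, int y) { X = x; Y = y; }

    // Strongly-typed equality - no boxing when called via EqualityComparer<T>.Default.
    public bool Equals(Point2D other) => X == other.X && Y == other.Y;

    public override bool Equals(object obj) => obj is Point2D p && Equals(p);

    // Fast hash combining the two immutable fields.
    public override int GetHashCode() => (X * 397) ^ Y;
}

var distances = new Dictionary<Point2D, double>();
distances[new Point2D(1, 2)] = 2.23;

// A distinct but equal instance finds the same entry.
double d = distances[new Point2D(1, 2)];
```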

Note: When implementing custom hash code generation algorithms, make sure the hash code depends only on fields that are logically immutable – changing a field that was used to generate the hash code while the object is being used as a dictionary key will make the item inaccessible and could lead to duplicate items and unexpected results.

Synchronization

It should go without saying, but it is important enough to mention again – use proper synchronization when working with a Dictionary concurrently. Dictionary is not thread-safe for simultaneous read and write operations (only multiple concurrent readers are supported).

The most common symptom when reading and writing concurrently is infinite loops (both for the readers and the writer). See the following post, which describes the symptoms.

If you need a thread-safe dictionary, consider using ConcurrentDictionary<K,V> (.NET Framework 4.0 and above).
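For example (a minimal sketch; the keys and values are illustrative):

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

var map = new ConcurrentDictionary<int, string>();

// Many writers may add concurrently - no external lock needed.
Parallel.For(0, 100, i => map.TryAdd(i, "Item no'" + i));

// GetOrAdd performs the lookup and the (potential) insert atomically.
string value = map.GetOrAdd(42, key => "Item no'" + key);
```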

Conclusion

Generic Dictionaries in .NET are probably the best option for key/value scenarios, and they are widely used in almost all applications; but as we've seen above, understanding their inner workings can help us get improved and consistent performance.

It looks more like separate chaining (since we have the next field in the Entry struct to look for the entry that ended up at the same index).

But I think you are right, because otherwise I don't see why they would use prime numbers when doubling the size of the underlying array.

So I guess I am a little bit (lot) confused...

Could you explain a bit in details why:
- "The reason that the sizes are being doubled while resizing the array is to make the inner-hash table operations to have asymptotic complexity": I don't really see how doubling the size of the array makes the operations asymptotic.
- "The prime numbers are being used to support double-hashing.": I am not sure if I got it right, how?

Hi Ofir.
Thank you very much for the nice article.
I have a few questions for you:
1) Since, according to you, all entries are stored in the entries array and the buckets store only indexes, how is this array implemented? LinkedList or List? If it is a LinkedList, access by index should be slow. If it is a List, what about entry deletion? That should be slow...
2) What about enumerations? There are 3 of them: KeyValuePairs, Keys and Values. How are they implemented? I am asking because you said there is an array of entries, each of which is a private struct Entry
{
public int hashCode;
public int next;
public TKey key;
public TValue value;
}
It seems the IEnumerable for these 3 data types is not generated every time you call it. Otherwise, why are we getting an exception when we try to add/remove an entry from the dictionary inside a foreach?

The Dictionary class has a field like _indexOfFirstDeleted, defaulting to -1. When you delete an item, the index of that item in the entries array becomes the _indexOfFirstDeleted. Considering you may have more items deleted, the contents of such an entry are "cleared" and its next points to what was previously the _indexOfFirstDeleted. Of course, the buckets array is also adjusted if the item happened to be the first one in its chain.

So, what happens is this: imagine you add 3 items. It is not important where they are in the buckets array (that depends on their hashes); they will become the first 3 items in the entries array. Then, you delete the second item (index 1, as it is zero-based).
The _indexOfFirstDeleted will now be 1. The content of that entry 1 will be the "default" for key and value; in the .NET implementation I think it puts -1 in the hashCode, and the next will hold -1.
When adding a new item, the first thing that's done is to verify whether there's a deleted slot. If there is, the new item will use the space of that first deleted item (that is, index 1 in this case) and _indexOfFirstDeleted will point to the "next" deleted item (or -1).

It is even funny, because if you add 3 items, then delete the 3 items in order, then add them again, this time their indexes in the entries array will be 2, 1 and 0 – the deleted slots are used first, and the "_indexOfFirstDeleted" is actually the last deleted slot, not really the first, but that's a "confusion" caused by reusing the fields in the entries array.

About the enumeration, I believe there's a "version". At least List has a version that's used to know whether the current enumerator is holding the latest version of the list. The reason for this is that adding or removing items changes the size of the list or the order of the nodes and, particularly if there's a resize, what the enumerator holds can be completely wrong.

Hi d3vi1h3aRt,
When you remove an item from the Dictionary, the relevant entry in the entries array is reset (-1 for the hashCode; the rest of the fields – value, next, etc. – are set to default values).

In addition, as the entries array holds a linked list (each item holds a "next" index), the previous item is modified to point to the next item (if we remove the second item, the first item should now point to the third).

The Dictionary is only resized when items are being added and there is no more room for them; it is not resized when items are removed.

This does not make too much difference, though, as the inner resize method creates a new dictionary and copies the items into the destination Dictionary.
If you really want to resize the Dictionary, you can create a new one and pass the old one as a constructor parameter.

Thank you for your excellent article.
I just want to know:
if Obj1 and Obj2 have the same index,
how could we get the value of Obj2?

Obj1 is stored in the first slot within the entries array.
Obj2 is stored in the second slot within the entries array.
Obj2 is holding Obj1's index.
When trying to get Obj1 (by using Obj1's key), the hash code leads to index 4 in the buckets array (after the MOD calculation). In that bucket is the index of the last object that got the same index – in this case the value 1 is retrieved – which is Obj2, the root of a linked list.

Regarding your question - when adding items to a Dictionary, the key must be unique (Dictionary won't let you add two items with the same key).
So in our case, Obj1 and Obj2 (which are keys, of course) must be different from each other.

When trying to access the value for a specific key, a hash code is calculated and adjusted to the array size using MOD, in order to find the index at which the key is located in the buckets array.

In the above example, as Obj1 and Obj2 have the same index, the values for those keys are stored in a linked list (the Entry struct holds a next field), where the latest key inserted acts as the linked list's root node. As keys are unique, going through this linked list and checking each node for equality (Equals()) will find the key you searched for.