Introduction

The new System.Collections.Generic namespace contains generic versions of the most often used data structures such as lists and dictionaries. But unfortunately, it does not contain a generic set type. The project included with this article provides a generic set type to fill this gap.

Why another set type?

Yes, I know what you are thinking. Why another set type? Several set collections have been posted to this site, for example this and this. Well, first of all, this is a generic set implementation, so it will be more convenient to use and also much faster than the aforementioned non-generic set types. Implementing a generic collection class is also a good way to see how .NET generics work.

The Interface

Since this set type tries to fill a gap in the System.Collections.Generic namespace, it is coded in a similar style so that it fits in easily. It also strives to provide good interoperability with other collection types.

I tried to provide methods that are similar in name and function to the ones provided by the other collections in the System.Collections.Generic namespace. For example, I implemented the very useful ConvertAll, TrueForAll, FindAll and ForEach methods exactly as they are in the List<T> class. Of course, I also implemented IEnumerable and ICollection in both generic and non-generic versions. So, hopefully, there will be no surprises when using the Set<T> class.

The implementation

The requirements for a set class are quite simple: the Add, Remove and Contains methods should be as fast as possible. Fortunately, the System.Collections.Generic namespace already contains a collection that has exactly these properties, namely the Dictionary<K,V> type. The Dictionary<K,V> type is a strongly typed version of the System.Collections.HashTable class.

The keys in a Dictionary<K,V> are unique. And adding, removing, or finding keys in a Dictionary<K,V> will be fast as long as the storage type K provides a good GetHashCode() implementation. So, in the interest of avoiding code duplication, it makes sense to use a Dictionary<K,V> internally to store the set elements.

Inheritance or composition

Since Dictionary<K,V> is not sealed, it would theoretically be possible to let Set<T> inherit from Dictionary<T,V>. This would in fact be the most efficient way to implement a set that is backed by a HashTable, but I decided against it for several reasons: First of all, Dictionary<K,V> contains several methods and properties which have nothing to do with sets, and which would needlessly clutter the Set<T> class and confuse the user. And second, the fact that the Set<T> class is implemented using a Dictionary<K,V> is an implementation detail that should be invisible for the user of the class. If I should decide to write my own Set<T> implementation instead of delegating all the work to Dictionary<K,V>, I could do this without breaking API compatibility.

So, I decided to use a private field to store the Dictionary<K,V> instead of inheriting from Dictionary<K,V> even though this means that one more object is created.

Efficiency?

As we have seen, the keys of a HashTable can be used to emulate a set. But what about the values? We don't need them, so we could just use arbitrary values for them. This is an often-used idiom in non-generic .NET programming.

This small code snippet demonstrates how to use a non-generic HashTable to make the values in an int[] unique. Since the value we store in the HashTable is arbitrary and we want to avoid unnecessary boxing, we use null even though it looks a bit funny.

The above code works and is reasonably efficient, but there are several places where it is less than optimal. Since the non-generic HashTable class uses object references internally for keys and values, every time we add an int to the HashTable, it is boxed and placed on the heap. And, we still need the space for storing the reference to the value even though we are not interested in it.

Generics to the rescue

The generics implementation in .NET really shines when it comes to value types. When a generic type is instantiated using a value type as type parameter, the CLR generates special IL code for this specialization. So, using value types with generics totally eliminates the boxing overhead. With this in mind, we can rewrite the above code using the generic version of a HashTable:

This code will be quite a bit faster than the non-generic version, since adding a new key to the temporary dictionary no longer has any boxing overhead. But there is still the unnecessary baggage of a reference to a value we never need.

Zero-Size structs?

I thought that I could eliminate this overhead by using a zero-size struct, but as it turns out, the current .NET runtime reserves some memory even for zero-size structs.

This will hopefully change in the future, since there are quite a few nice tricks that use zero-size structs.

Here is what I tried: since we do not care about the value, we can give it any type we want. Obviously, we want the smallest type we can get. The smallest primitive type is byte, which is 1 byte long (duh) but will usually consume between 4 and 8 bytes depending on pack size. But we can do even better by defining our own dummy type with zero size:

Alternative implementation

Using a Dictionary<T,X> to store the Set<T> data is reasonably efficient. But it has some memory overhead due to the unused dictionary values. I am currently working on a Set<T> that is backed by a List<T> sorted by the hash code. This would be much faster and more space efficient in many situations. When I am happy with the code, I will post it to the site mentioned below. It will have exactly the same public methods as this implementation, so you will be able to use it as a drop-in replacement.

Set operations

In addition to the usual collection operations, the Set<T> type supports all kinds of set specific operations. Here is an overview:

Method

Operator

Description

a.Union(b)

a|b

All elements that are in a or b

a.Difference(b)

a-b

All elements in a that are not in b

a.SymmetricDifference(b)

a^b

(a^b)=(a-b)|(b-a)

a.Intersection(b)

a&b

All elements that are in a and in b

a.Equals(b)

a==b

a and b contain exactly the same elements

-

a<b

b contains all elements in a and some additional elements

-

a<=b

b contains all elements in a and maybe some additional elements

There is a slight difference in the operator methods and the non-operator methods. While the operator methods all require sets as parameters, the non-operator methods often use an IEnumerable<T> as parameter. I did this to avoid accidental conversions to Set<T> when using the operators.

Conclusion

I found the new System.Collections.Generic namespace a real pleasure to work with. It is much faster for value types because of the elimination of the boxing overhead. The many useful methods such as the above mentioned ConvertAll, TrueForAll, FindAll and ForEach make it possible to eliminate most loops. The only thing it was missing is a Set type. But no longer.

If you can afford using .NET 2.0 features, you should by all means use the System.Collections.Generic namespace even if you don't need the type safety. For example, you should use List<object> instead of ArrayList.

About the code

The included project contains the described Set<T> type, and a small demo program that performs a few performance and correctness tests on sets. The code is under a BSD license, so feel free to use it for your own projects.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

Share

About the Author

Rüdiger Klaehn works as freelance developer in the space industry. He is very interested in functional programming languages and hopes that .NET will lead to a more widespread adoption in the industry.

Comments and Discussions

I could swear I read it on either Brad Abrams' blog or on the Wintellect website, but now that I've searched for it, I can't seem to find it. I've posted a topic on the Wintellect forums to ask about it, I'll post here when I get a reply.

Any remotely useful information on my blog will be removed immediately.Judah Himango

So, I decided to use a private field to store the Dictionary instead of inheriting from Dictionary even though this means that one more object is created

I didn't check this out, but in C++ the correct way of doing this would be private inheritance. I assume you can do that in C# (if not, then I guess you are correct to do it with a data member).
private inheritance gives you control of the public interface you show, while preserving the capabilities of the father class.

C# does not have private inheritance. So I had to use a data member. It is not that bad though. The dictionary creates a bunch of arrays internally, so one more object on the heap won't kill the performance.