These are daunting signatures to start with – four type parameters, and up to six normal method parameters (including the first "this" parameter indicating an extension method). Don’t panic. It’s actually quite simple to understand each individual bit:
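For reference, the signatures being dissected here are the standard .NET ones:

```csharp
public static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>(
    this IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector)

public static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>(
    this IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)
```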

We have two sequences (outer and inner). The two sequences can have different element types (TOuter and TInner).

For each sequence, there’s also a key selector – a projection from a single item to the key for that item. Note that although the key type can be different from the two sequence types, there’s only one key type – both key selectors have to return the same kind of key (TKey).

An optional equality comparer is used to compare keys.

A delegate (resultSelector) is used to project a pair of items whose keys are equal (one from each sequence) to the result type (TResult).

The idea is that we look through the two input sequences for pairs which correspond to equal keys, and yield one output element for each pair. This is an equijoin operation: we can only deal with equal keys, not pairs of keys which meet some arbitrary condition. It’s also an inner join in database terms – we will only see an item from one sequence if there’s a "matching" item from the other sequence. I’ll talk about mimicking left joins when I implement GroupJoin.

The documentation gives us details of the order in which we need to return items:

Join preserves the order of the elements of outer, and for each of these elements, the order of the matching elements of inner.

For the sake of clarity, it’s probably worth including a naive implementation which at least gives the right results in the right order:
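Something along these lines would do (a sketch, with argument validation omitted):

```csharp
// Naive Join: right results in the right order, but it re-reads "inner"
// once per element of "outer", and is always O(N * M).
private static IEnumerable<TResult> NaiveJoin<TOuter, TInner, TKey, TResult>(
    IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)
{
    comparer = comparer ?? EqualityComparer<TKey>.Default;
    foreach (TOuter outerElement in outer)
    {
        TKey outerKey = outerKeySelector(outerElement);
        foreach (TInner innerElement in inner)
        {
            if (comparer.Equals(outerKey, innerKeySelector(innerElement)))
            {
                yield return resultSelector(outerElement, innerElement);
            }
        }
    }
}
```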

Aside from the missing argument validation, there are two important problems with this:

It iterates over the inner sequence multiple times. I always advise anyone implementing a LINQ-like operator to only iterate over any input sequence once. There are sequences which are impossible to iterate over multiple times, or which may give different results each time. That’s bad news.

It always has a complexity of O(N * M) for N items in the inner sequence and M items in the outer sequence. Eek. Admittedly that’s always a possible complexity – two sequences which have the same key for all elements will always have that complexity – but in a typical situation we can do considerably better.

The real Join operator uses the same behaviour as Except and Intersect when it comes to how the input sequences are consumed:

Argument validation occurs eagerly – both sequences and all the "selector" delegates have to be non-null; the comparer argument can be null, leading to the default equality comparer for TKey being used.

The overall operation uses deferred execution: it doesn’t iterate over either input sequence until something starts iterating over the result sequence.

When MoveNext is called on the result sequence for the first time, it immediately consumes the whole of the inner sequence, buffering it.

The outer sequence is streamed – it’s only read one element at a time. By the time the result sequence has started yielding results from the second element of outer, it’s forgotten about the first element.

We’ve started veering towards an implementation already, so let’s think about tests.

What are we going to test?

I haven’t bothered with argument validation tests this time – even with cut and paste, the 10 tests required to completely check everything feels like overkill.

However, I have tested:

Joining two invalid sequences, but not using the results (i.e. testing deferred execution)

The way that the two sequences are consumed

Using a custom comparer

Not specifying a comparer

Using sequences of different types

See the source code for more details, but here’s a flavour – the final test:
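It looks something like this (AssertSequenceEqual is a helper extension method from the test support code):

```csharp
[Test]
public void DifferentSourceTypes()
{
    int[] outer = { 5, 3, 7 };
    string[] inner = { "bee", "giraffe", "tiger", "badger", "ox", "cat", "dog" };

    // Join numbers to strings of that length. Note that TOuter, TInner,
    // TKey and TResult are all inferred - no explicit type arguments.
    var query = outer.Join(inner,
                           outerElement => outerElement,
                           innerElement => innerElement.Length,
                           (outerElement, innerElement) => outerElement + ":" + innerElement);
    // Outer order is preserved, then inner order within each outer element.
    query.AssertSequenceEqual("5:tiger", "3:bee", "3:cat", "3:dog", "7:giraffe");
}
```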

To be honest, the tests aren’t very exciting. The implementation is remarkably simple though.

Let’s implement it!

Trivial decision 1: make the comparer-less overload call the one with the comparer.

Trivial decision 2: use the split-method technique to validate arguments eagerly but defer the rest of the operation

That leaves us with the actual implementation in an iterator block, which is all I’m going to provide the code for here. So, what are we going to do?

Well, we know that we’re going to have to read in the whole of the "inner" sequence – but let’s wait a minute before deciding how to store it. We’re then going to iterate over the "outer" sequence. For each item in the outer sequence, we need to find the key and then work out all the "inner" sequence items which match that key. Now, the idea of finding all the items in a sequence which match a particular key should sound familiar to you – that’s exactly what a lookup is for. If we build a lookup from the inner sequence as our very first step, the rest becomes easy: we can fetch the sequence of matches, then iterate over them and yield the return value of calling result selector on the pair of elements.

All of this is easier to see in code than in words, so here’s the method:
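In sketch form:

```csharp
private static IEnumerable<TResult> JoinImpl<TOuter, TInner, TKey, TResult>(
    IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)
{
    // Build the lookup from the inner sequence as the very first step.
    // (See the addendum: the real operator has to ignore null keys here.)
    var lookup = inner.ToLookup(innerKeySelector, comparer);
    foreach (var outerElement in outer)
    {
        var key = outerKeySelector(outerElement);
        // An unknown key yields an empty sequence, which is exactly
        // what gives us inner join semantics.
        foreach (var innerElement in lookup[key])
        {
            yield return resultSelector(outerElement, innerElement);
        }
    }
}
```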

Personally, I think this is rather beautiful… in particular, I like the way that it uses every parameter exactly once. Everything is just set up to work nicely.

But wait, there’s more… If you look at the nested foreach loops, that should remind you of something: for each outer sequence element, we’re computing a nested sequence, then applying a delegate to each pair, and yielding the result. That’s almost exactly the definition of SelectMany! If only we had a "yield foreach" or "yield!" I’d be tempted to use an implementation like this:
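That hypothetical version would look something like this:

```csharp
// Not valid C# - there's no "yield foreach" - but it shows the shape:
var lookup = inner.ToLookup(innerKeySelector, comparer);
yield foreach outer.SelectMany(outerElement => lookup[outerKeySelector(outerElement)],
                               resultSelector);
```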

Unfortunately there’s no such thing as a "yield foreach" statement. We can’t just call SelectMany and return the result directly, because then we wouldn’t be deferring execution. The best we can sensibly do is loop:
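A sketch of that loop, together with the kind of query expression it corresponds to (keyOf and projection are hypothetical placeholders, not part of the real code):

```csharp
var lookup = inner.ToLookup(innerKeySelector, comparer);
foreach (var result in outer.SelectMany(
             outerElement => lookup[outerKeySelector(outerElement)],
             resultSelector))
{
    yield return result;
}

// The equivalent query expression, using range variables x and y:
var query = from x in outer
            join y in inner on keyOf(x) equals keyOf(y)
            select projection(x, y);
```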

The compiler will effectively generate the same code (although admittedly I’ve used shorter range variable names here – x and y instead of outerElement and innerElement respectively). In this case the resultSelector delegate it supplies is simply the final projection from the "select" clause – but if we had anything else (such as a where clause) between the join and the select, the compiler would introduce a transparent identifier to propagate the values of both x and y through the query. It looks like I haven’t blogged explicitly about transparent identifiers, although I cover them in C# in Depth. Maybe once I’ve finished actually implementing the operators, I’ll have a few more general posts on this sort of thing.

Anyway, the point is that for simple join clauses (as opposed to join…into) we’ve implemented everything we need to. Hoorah.

Conclusion

I spoiled some of the surprise around how easy Join would be to implement by mentioning it in the ToLookup post, but I’m still impressed by how neat it is. Again I should emphasize that this is due to the design of LINQ – it’s got nothing to do with my own prowess.

This won’t be the end of our use of lookups though… the other grouping constructs can all use them too. I’ll try to get them all out of the way before moving on to operators which feel a bit different…

Addendum

It turns out this wasn’t quite as simple as I’d expected. Although ToLookup and GroupBy handle null keys without blinking, Join and GroupJoin ignore them. I had to write an alternative version of ToLookup which ignores null keys while populating the lookup, and then replace the calls of "ToLookup" in the code above with calls to "ToLookupNoNullKeys". This isn’t documented anywhere, and is inconsistent with ToLookup/GroupBy. I’ve opened up a Connect issue about it, in the hope that it at least gets documented properly. (It’s too late to change the behaviour now, I suspect.)

I’ve had a request to implement Join, which seems a perfectly reasonable operator to aim towards. However, it’s going to be an awful lot easier to implement if we write ToLookup first. That will also help with GroupBy and GroupJoin, later on.

In the course of this post we’ll create a certain amount of infrastructure – some of which we may want to tweak later on.

Essentially these boil down to two required parameters and two optional ones:

The source is required, and must not be null

The keySelector is required, and must not be null

The elementSelector is optional, and defaults to an identity projection of a source element to itself. If it is specified, it must not be null.

The comparer is optional, and defaults to the default equality comparer for the key type. It may be null, which is equivalent to specifying the default equality comparer for the key type.
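For reference, the four overloads look like this (these match the .NET signatures):

```csharp
public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)

public static ILookup<TKey, TElement> ToLookup<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector)

public static ILookup<TKey, TElement> ToLookup<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)
```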

Now we can just consider the most general case – the final overload. However, in order to understand what ToLookup does, we have to know what ILookup<TKey, TElement> means – which in turn means knowing about IGrouping<TKey, TElement>:
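Here they are as declared in .NET 4:

```csharp
public interface IGrouping<out TKey, out TElement> : IEnumerable<TElement>
{
    TKey Key { get; }
}

public interface ILookup<TKey, TElement> : IEnumerable<IGrouping<TKey, TElement>>
{
    int Count { get; }
    IEnumerable<TElement> this[TKey key] { get; }
    bool Contains(TKey key);
}
```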

The generics make these interfaces seem somewhat scary at first sight, but they’re really not so bad. A grouping is simply a sequence with an associated key. This is a pretty simple concept to transfer to the real world – something like "Plays by Oscar Wilde" could be an IGrouping<string, Play> with a key of "Oscar Wilde". The key doesn’t have to be "embedded" within the element type though – it would also be reasonable to have an IGrouping<string, string> representing just the names of plays by Oscar Wilde.

A lookup is essentially a map or dictionary where each key is associated with a sequence of values instead of a single one. Note that the interface is read-only, unlike IDictionary<TKey, TValue>. As well as looking up a single sequence of values associated with a key, you can also iterate over the whole lookup in terms of groupings (instead of the key/value pair from a dictionary). There’s one other important difference between a lookup and a dictionary: if you ask a lookup for the sequence corresponding to a key which it doesn’t know about, it will return an empty sequence, rather than throwing an exception. (A key which the lookup does know about will never yield an empty sequence.)

One slightly odd point to note is that while IGrouping is covariant in TKey and TElement, ILookup is invariant in both of its type parameters. While TKey has to be invariant, it would be reasonable for TElement to be covariant – to go back to our "plays" example, an IGrouping<string, Play> could be sensibly regarded as an IGrouping<string, IWrittenWork> (with the obvious type hierarchy). However, the interface declarations above are the ones in .NET 4, so that’s what I’ve used in Edulinq.

Now that we understand the signature of ToLookup, let’s talk about what it actually does. Firstly, it uses immediate execution – the lookup returned by the method is effectively divorced from the original sequence; changes to the sequence after the method has returned won’t change the lookup. (Obviously changes to the objects within the original sequence may still be seen, depending on what the element selector does, etc.) The rest is actually fairly straightforward, when you consider what parameters we have to play with:

The keySelector is applied to each item in the input sequence. Keys are always compared for equality using the "comparer" parameter.

The elementSelector is applied to each item in the input sequence, to project the item to the value which will be returned within the lookup.

To demonstrate this a little further, here are two applications of ToLookup – one using the first overload, and one using the last. In each case we’re going to group plays by author and then display some information about them. Hopefully this will make some of the concepts I’ve described a little more concrete:
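For example (assuming a simple Play type with Author and Name properties - hypothetical names used purely for illustration):

```csharp
// Simplest overload: group whole Play elements by author.
var playsByAuthor = plays.ToLookup(play => play.Author);
foreach (var grouping in playsByAuthor)
{
    Console.WriteLine("Plays by {0}:", grouping.Key);
    foreach (var play in grouping)
    {
        Console.WriteLine("  {0}", play.Name);
    }
}

// Most general overload: project each play to just its name, and
// compare authors case-insensitively.
var titlesByAuthor = plays.ToLookup(play => play.Author,
                                    play => play.Name,
                                    StringComparer.OrdinalIgnoreCase);
Console.WriteLine(string.Join("; ", titlesByAuthor["oscar wilde"]));
```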

The documentation pins down the ordering:

The IGrouping<TKey, TElement> objects are yielded in an order based on the order of the elements in source that produced the first key of each IGrouping<TKey, TElement>. Elements in a grouping are yielded in the order they appear in source.

Admittedly "in an order based on…" isn’t as clear as it might be, but I think it’s reasonable to make sure that our implementation yields groups such that the first group returned has the same key as the first element in the source, and so on.

What are we going to test?

I’ve actually got relatively few tests this time. I test each of the overloads, but not terribly exhaustively. There are three tests worth looking at though. The first two show the "eager" nature of the operator, and the final one demonstrates the most complex overload by grouping people into families.

I’m sure there are plenty of other tests I could have come up with. If you want to see any in particular (either as an edit to this post if they already exist in source control, or as entirely new tests), please leave a comment.

EDIT: It turned out there were a few important tests missing… see the addendum.

Let’s implement it!

There’s quite a lot of code involved in implementing this – but it should be reusable later on, potentially with a few tweaks. Let’s get the first three overloads out of the way to start with:
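In sketch form, each of those overloads just delegates to the final one, and the internal Grouping class discussed below pairs a key with its values (names follow the descriptions in the text; treat the details as illustrative):

```csharp
public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)
{
    return source.ToLookup(keySelector, element => element, EqualityComparer<TKey>.Default);
}

public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)
{
    return source.ToLookup(keySelector, element => element, comparer);
}

public static ILookup<TKey, TElement> ToLookup<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector)
{
    return source.ToLookup(keySelector, elementSelector, EqualityComparer<TKey>.Default);
}

// A grouping is just a key paired with a sequence of values.
internal sealed class Grouping<TKey, TElement> : IGrouping<TKey, TElement>
{
    private readonly TKey key;
    private readonly IEnumerable<TElement> values;

    internal Grouping(TKey key, IEnumerable<TElement> values)
    {
        this.key = key;
        this.values = values;
    }

    public TKey Key { get { return key; } }

    public IEnumerator<TElement> GetEnumerator()
    {
        return values.GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
```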

How do we feel about this? Well, groupings should be immutable. We have no guarantee that the "values" sequence won’t be changed after creation – but it’s an internal class, so we’ve just got to be careful about how we use it. We also have to be careful that we don’t use a mutable sequence type which allows a caller to get from the iterator (the IEnumerator<TElement> returned by GetEnumerator) back to the sequence and then mutate it. In our case we’re actually going to use List<T> to provide the sequences, and while List<T>.Enumerator is a public type, it doesn’t expose the underlying list. Of course a caller could use reflection to mess with things, but we’re not going to try to protect against that.

Okay, so now we can pair a sequence with a key… but we still need to implement ILookup. This is where there are multiple options. We want our lookup to be immutable, but there are various degrees of immutability we could use:

Mutable internally, immutable in public API

Mutable privately, immutable to the internal API

Totally immutable, even within the class itself

The first option is the simplest to implement, and it’s what I’ve gone for at the moment. I’ve created a Lookup class which allows a key/element pair to be added to it from within the Edulinq assembly. It uses a Dictionary<TKey, List<TElement>> to map the keys to sequences efficiently, and a List<TKey> to remember the order in which we first saw the keys. Here’s the complete implementation:
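A sketch along those lines (reconstructed from the description; details may differ from the real source):

```csharp
internal sealed class Lookup<TKey, TElement> : ILookup<TKey, TElement>
{
    private readonly Dictionary<TKey, List<TElement>> map;
    private readonly List<TKey> keys;

    internal Lookup(IEqualityComparer<TKey> comparer)
    {
        // Only the dictionary needs to know how to compare keys.
        map = new Dictionary<TKey, List<TElement>>(comparer);
        keys = new List<TKey>();
    }

    internal void Add(TKey key, TElement element)
    {
        List<TElement> list;
        if (!map.TryGetValue(key, out list))
        {
            list = new List<TElement>();
            map[key] = list;
            keys.Add(key);   // Remember first-seen key order
        }
        list.Add(element);
    }

    public int Count { get { return map.Count; } }

    public IEnumerable<TElement> this[TKey key]
    {
        get
        {
            List<TElement> list;
            if (!map.TryGetValue(key, out list))
            {
                return Enumerable.Empty<TElement>();
            }
            // Don't return the list itself - the caller could cast it back
            // to List<TElement> and mutate it. An identity Select acts as
            // a simple buffering layer.
            return list.Select(element => element);
        }
    }

    public bool Contains(TKey key)
    {
        return map.ContainsKey(key);
    }

    public IEnumerator<IGrouping<TKey, TElement>> GetEnumerator()
    {
        // Retain key order by iterating via the key list; Grouping
        // objects are created lazily, on each iteration.
        return keys.Select(key => (IGrouping<TKey, TElement>)
                                  new Grouping<TKey, TElement>(key, map[key]))
                   .GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
```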

All of this is really quite straightforward. Note that we provide an equality comparer to the constructor, which is then passed onto the dictionary – that’s the only thing that needs to know how to compare keys.

There are only two points of interest, really:

In the indexer, we don’t return the list itself – that would allow the caller to mutate it, after casting it to List<TElement>. Instead, we just call Select with an identity projection, as a simple way of inserting a sort of "buffering" layer between the list and the caller. There are other ways of doing this, of course – including implementing the indexer with an iterator block.

In GetEnumerator, we’re retaining the key order by using our list of keys and performing a lookup on each key.

We’re currently creating the new Grouping objects lazily – which will lead to fewer of them being created if the caller doesn’t actually iterate over the lookup, but more of them if the caller iterates over it several times. Again, there are alternatives here – but without any good information about where to make the trade-off, I’ve just gone for the simplest code which works for the moment.

One last thing to note about Lookup – I’ve left it internal. In .NET, it’s actually public – but the only way of getting at an instance of it is to call ToLookup and then cast the result. I see no particular reason to make it public, so I haven’t.

Now we’re finally ready to implement the last ToLookup overload – and it becomes pretty much trivial:

When you look past the argument validation, we’re just creating a Lookup, populating it, and then returning it. Simple.

Thread safety

Something I haven’t addressed anywhere so far is thread safety. In particular, although all of this is nicely immutable when viewed from a single thread, I have a nasty feeling that, theoretically, if the return value of our ToLookup implementation were exposed to another thread, that thread could observe the internal mutations we make while populating the lookup – we’re not doing anything special in terms of the memory model.

I’m basically scared by lock-free programming these days, unless I’m using building blocks provided by someone else. While investigating the exact guarantees offered here would be interesting, I don’t think it would really help with our understanding of LINQ as a whole. I’m therefore declaring thread safety to be out of scope for Edulinq in general :)

Conclusion

So that’s ToLookup. Two new interfaces, two new classes, all for one new operator… so far. We can reuse almost all of this in Join though, which will make it very simple to implement. Stay tuned…

Addendum

It turns out that I missed something quite important: ToLookup has to handle null keys, as do various other LINQ operators (GroupBy etc). We’re currently using a Dictionary<TKey, TValue> to organize the groups… and that doesn’t support null keys. Oops.

So, first steps: write some tests proving that it fails as we expect it to. Fetch by a null key of a lookup. Include a null key in the source of a lookup. Use GroupBy, GroupJoin and Join with a null key. Watch it go bang. That’s the easy part…

Now, we can do all the special-casing in Lookup itself – but it gets ugly. Our Lookup code was pretty simple before; it seems a shame to spoil it with checks everywhere. What we really need is a dictionary which does support null keys. Step forward NullKeyFriendlyDictionary, a new internal class in Edulinq. Now you might expect this to implement IDictionary<TKey, TValue>, but it turns out that’s a pain in the neck. We hardly use any of the members of Dictionary – TryGetValue, the indexer, ContainsKey, and the Count property. That’s it! So those are the only members I’ve implemented.

The class contains a Dictionary<TKey, TValue> to delegate most requests to, and it just handles null keys itself. Here’s a quick sample:

```csharp
internal bool TryGetValue(TKey key, out TValue value)
{
    if (key == null)
    {
        // This will be default(TValue) if haveNullKey is false,
        // which is what we want.
        value = valueForNullKey;
        return haveNullKey;
    }
    return map.TryGetValue(key, out value);
}

// etc
```

There’s one potential flaw here, that I can think of: if you provide an IEqualityComparer<TKey> which treats some non-null key as equal to a null key, we won’t spot that. If your source then contains both those keys, we’ll end up splitting them into two groups instead of keeping them together. I’m not too worried about that – and I suspect there are all kinds of ways that could cause problems elsewhere anyway.

With this in place, and Lookup adjusted to use NullKeyFriendlyDictionary instead of just Dictionary, all the tests pass. Hooray!

At the same time as implementing this, I’ve tidied up Grouping itself – it now implements IList<T> itself, in a way which is immutable to the outside world. The Lookup now contains groups directly, and can return them very easily. The code is generally tidier, and anything using a group can take advantage of the optimizations applied to IList<T> – particularly the Count() operator, which is often applied to groups.

I’m really pleased with this one. Six minutes ago (at the time of writing this), I tweeted about the Intersect blog post. I then started writing the tests and implementation for Except… and I’m now done.

The tests are cut/paste/replace with a few tweaks – but it’s the implementation that I’m most pleased with. You’ll see what I mean later – it’s beautiful.

The result of calling Except is the elements of the first sequence which are not in the second sequence.

Just for completeness, here’s a quick summary of the behaviour:

The first overload uses the default equality comparer for TSource; so does the second, if you pass in null as the comparer

Neither "first" nor "second" can be null

The method does use deferred execution

The result sequence only contains distinct elements; even if the first sequence contains duplicates, the result sequence won’t

This time, the documentation doesn’t say how "first" and "second" will be used, other than to describe the result as a set difference. In practice though, it’s exactly the same as Intersect: when we ask for the first result, the "second" input sequence is fully evaluated, then the "first" input sequence is streamed.

What are we going to test?

I literally cut and pasted the tests for Intersect, did a find/replace of Intersect with Except, and then looked at the data in each test. In particular, I made sure that there were potential duplicates in the "first" input sequence which would have to be removed in the result sequence. I also tweaked the details of the data in the final tests shown before (which prove the way in which the two sequences are read), but the main thrust of the tests is the same.

Nothing to see here, move on…

Let’s implement it!

I’m not even going to bother showing the comparer-free overload this time. It just calls the other overload, as you’d expect. Likewise the argument validation part of the implementation is tedious. Let’s focus on the part which does the work. First, we’ll think back to Distinct and Intersect:

In Distinct, we started with an empty set and populated it as we went along, making sure we never returned anything already in the set

In Intersect, we started with a populated set (from the second input sequence) and removed elements from it as we went along, only returning elements where an equal element was previously in the set

Except is simply a cross between the two: from Distinct we keep the idea of a "banned" set of elements that we add to; from Intersect we take the idea of starting off with a set populated from the second input element. Here’s the implementation:
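In sketch form:

```csharp
private static IEnumerable<TSource> ExceptImpl<TSource>(
    IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    // Start with a "banned" set populated from the second sequence...
    HashSet<TSource> bannedElements = new HashSet<TSource>(second, comparer);
    foreach (TSource item in first)
    {
        // ...and add to it as we go: Add returns false if the element
        // was already present (either banned, or already yielded).
        if (bannedElements.Add(item))
        {
            yield return item;
        }
    }
}
```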

The only differences from the Intersect implementation are:

The name of the local variable holding the set (bannedElements instead of potentialElements)

The method called in the loop (Add instead of Remove)

Isn’t that just wonderful? Perhaps it shouldn’t make me quite as happy as it does, but hey…

Conclusion

That concludes the set operators – and indeed my blogging for the night. It’s unsurprising that all of the set operators have used a set implementation internally… but I’ve been quite surprised at just how simple the implementations all were. Again, the joy of LINQ resides in the ability for such simple operators to be combined in useful ways.

Okay, this is more like it – after the dullness of Union, Intersect has a new pattern to offer… and one which we’ll come across repeatedly.

First, however, I should explain some more changes I’ve made to the solution structure…

Building the test assembly

I’ve just had an irritating time sorting out something I thought I’d fixed this afternoon. Fed up of accidentally testing against the wrong implementation, my two project configurations ("Normal LINQ" and "Edulinq implementation") now target different libraries from the test project: only the "Normal LINQ" configuration refers to System.Core, and only the "Edulinq implementation" configuration refers to the Edulinq project itself. Or so I thought. Unfortunately, msbuild automatically adds System.Core in unless you’re careful. I had to add this into the "Edulinq implementation" property group part of my project file to avoid accidentally pulling in System.Core:
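If memory serves, the property involved is something like this (treat the exact name as an assumption and check against your own project file):

```xml
<AddAdditionalExplicitAssemblyReferences>false</AddAdditionalExplicitAssemblyReferences>
```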

Unfortunately, at that point all the extension methods I’d written within the tests project – and the references to HashSet<T> – failed. I should have noticed them not failing before, and been suspicious. Hey ho.

Now I’m aware that you can add your own version of ExtensionAttribute, but I believe that can become a problem if at execution time you do actually load System.Core… which I will end up doing, as the Edulinq assembly itself references it (for HashSet<T> aside from anything else).

There may well be multiple solutions to this problem, but I’ve come up with one which appears to work:

Add an Edulinq.TestSupport project, and refer to that from both configurations of Edulinq.Tests

Make Edulinq.TestSupport refer to System.Core so it can use whatever collections it likes, as well as extension methods

Put all the non-test classes (those containing extension methods, and the weird and wonderful collections) into the Edulinq.TestSupport project.

Add a DummyClass to Edulinq.TestSupport in the namespace System.Linq, so that using directives within Edulinq.Tests don’t need to be conditionalized

All seems to be working now – and finally, if I try to refer to an extension method I haven’t implemented in Edulinq yet, it will fail to compile instead of silently using the System.Core one. Phew. Now, back to Intersect…

Fairly obviously, it computes the intersection of two sequences: the elements that occur in both "first" and "second". Points to note:

Again, the first overload uses the default equality comparer for TSource, and the second overload does if you pass in "null" as the comparer, too.

Neither "first" nor "second" can be null

The method does use deferred execution – but in a way which may seem unusual at first sight

The result sequence only contains distinct elements – even if an element occurs multiple times in both input sequences, it will only occur once in the result

Now for the interesting bit – exactly when the two sequences are used. MSDN claims this:

When the object returned by this method is enumerated, Intersect enumerates first, collecting all distinct elements of that sequence. It then enumerates second, marking those elements that occur in both sequences. Finally, the marked elements are yielded in the order in which they were collected.

This is demonstrably incorrect. Indeed, I have a test which would fail under LINQ to Objects if this were the case. What actually happens is this:

Nothing happens until the first result element is requested. I know I’ve said so already, but it’s worth repeating.

As soon as the first element of the result is requested, the whole of the "second" input sequence is read, as well as the first (and possibly more) elements of the "first" input sequence – enough to return the first result, basically.

Results are read from the "first" input sequence as and when they are required. Only elements which were originally in the "second" input sequence and haven’t already been yielded are returned.

We’ll see this pattern of "greedily read the second sequence, but stream the first sequence" again when we look at Join… but let’s get back to the tests.

```csharp
var query = firstQuery.Intersect(second);
using (var iterator = query.GetEnumerator())
{
    // We can get the first value with no problems
    Assert.IsTrue(iterator.MoveNext());
    Assert.AreEqual(1, iterator.Current);

    // Getting at the *second* value of the result sequence requires
    // reading from the first input sequence until the "bad" division
    Assert.Throws<DivideByZeroException>(() => iterator.MoveNext());
}
```

The first test just proves that execution really is deferred. That’s straightforward.

The second test proves that the "second" input sequence is completely read as soon as we ask for our first result. If the operator had really read "first" and then started reading "second", it could have yielded the first result (1) without throwing an exception… but it didn’t.

The third test proves that the "first" input sequence isn’t read in its entirety before we start returning results. We manage to read the first result with no problems – it’s only when we ask for the second result that we get an exception.

Let’s implement it!

I have chosen to implement the same behaviour as LINQ to Objects rather than the behaviour described by MSDN, because it makes more sense to me. In general, the "first" sequence is given more importance than the "second" sequence in LINQ: it’s generally the one which is streamed, with the second sequence being buffered if necessary.

Let’s start off with the comparer-free overload – as before, it will just call the other one:
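Along these lines:

```csharp
public static IEnumerable<TSource> Intersect<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    return first.Intersect(second, EqualityComparer<TSource>.Default);
}
```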

Now for the more interesting part. Obviously we’ll have an argument-validating method, but what should we do in the guts? I find the duality between this and Distinct fascinating: there, we started with an empty set of elements, and tried to add each source element to it, yielding the element if we successfully added it (meaning it wasn’t there before).

This time, we’ve effectively got a limited set of elements which we can yield, but we only want to yield each of them once – so as we see items, we can remove them from the set, yielding only if that operation was successful. The initial set is formed from the "second" input sequence, and then we just iterate over the "first" input sequence, removing and yielding appropriately:
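In sketch form:

```csharp
private static IEnumerable<TSource> IntersectImpl<TSource>(
    IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    // Eagerly read the whole of "second" into the candidate set...
    HashSet<TSource> potentialElements = new HashSet<TSource>(second, comparer);
    // ...then stream "first", yielding each distinct element at most once:
    // Remove returns true only the first time we see a candidate.
    foreach (TSource item in first)
    {
        if (potentialElements.Remove(item))
        {
            yield return item;
        }
    }
}
```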

Ta-da – it all works as expected. I don’t know whether this is how Intersect really works in LINQ to Objects, but I expect it does something remarkably similar.

Conclusion

After suffering from some bugs earlier today where my implementation didn’t follow the documentation, it’s nice to find some documentation which doesn’t follow the real implementation :)

Seriously though, there’s an efficiency point to be noted here. If you have two sequences, one long and one short, then it’s much more efficient (mostly in terms of space) to use longSequence.Intersect(shortSequence) than shortSequence.Intersect(longSequence). The latter will require the whole of the long sequence to be in memory at once.

Next up – and I might just manage it tonight – our final set operator, Except.

This "two overloads, one taking an equality comparer" pattern is familiar from Distinct; we’ll see that and Intersect and Except do exactly the same thing.

Simply put, Union returns the union of the two sequences – all items that are in either input sequence. The result sequence has no duplicates in it, even if one of the input sequences contains duplicates. (I’m using the term "duplicate" here to mean an element which is equal to another according to the equality comparer we’re using in the operator.)

Characteristics:

Union uses deferred execution: argument validation is basically all that happens when the method is first called; it only starts iterating over the input sequences when you iterate over the result sequence

Neither first nor second can be null; the comparer argument can be null, in which case the default equality comparer is used

The input sequences are only read as and when they’re needed; to return the first result element, only the first input element is read

It’s worth noting that the documentation for Union specifies a lot more than the Distinct documentation does:

When the object returned by this method is enumerated, Union enumerates first and second in that order and yields each element that has not already been yielded.

To me, that actually looks like a guarantee of the rules I proposed for Distinct. In particular, it’s guaranteeing that the implementation iterates over "first" before "second", and if it’s yielding elements as it goes, that guarantees that distinct elements will retain their order from the original input sequences. Whether others would read it in the same way or not, I can’t say… input welcome.

What are we going to test?

I’ve written quite a few tests for Union – possibly more than we really need, but they demonstrate a few points of usage. The tests are:

Arguments are validated eagerly

Finding the union of two sequences without specifying a comparer; both inputs have duplicate elements, and there’s one element in both

The same test as above but explicitly specifying null as the comparer, to force the default to be used

The same test as above but using a case-insensitive string comparer

Taking the union of an empty sequence with a non-empty one

Taking the union of a non-empty sequence with an empty one

Taking the union of two empty sequences

Proving that the first sequence isn’t used until we start iterating over the result sequence (using ThrowingEnumerable)

Proving that the second sequence isn’t used until we’ve exhausted the first

No new collections or comparers needed this time though – it’s all pretty straightforward. I haven’t written any tests for null elements this time – I’m convinced enough by what I saw when implementing Distinct to believe they won’t be a problem.

Let’s implement it!

First things first: we can absolutely implement the simpler overload in terms of the more complex one, and I’ll do the same for Except and Intersect. Here’s the Union method:
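A sketch of the delegating overload – the actual Edulinq source may differ in minor details, but the shape is dictated by the description above:

```csharp
// Sketch: the simpler overload just delegates, passing a null
// comparer, which the other overload treats as "use the default".
public static IEnumerable<TSource> Union<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    return first.Union(second, null);
}
```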

So how do we implement the more complex overload? Well, I’ve basically been a bit disappointed by Union in terms of its conceptual weight. It doesn’t really give us anything that the obvious combination of Concat and Distinct doesn’t – so let’s implement it that way first:
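A sketch of that approach – argument validation comes for free, because Concat validates its arguments eagerly and Distinct accepts a null comparer:

```csharp
public static IEnumerable<TSource> Union<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    // Concat throws ArgumentNullException eagerly if first or second
    // is null; Distinct treats a null comparer as the default one.
    return first.Concat(second).Distinct(comparer);
}
```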

Admittedly, writing a dedicated implementation feels like an absurd waste of code when we can achieve the same result so simply. This is the first time my resolve against implementing one operator in terms of completely different ones has wavered. Just looking at it in black and white (so to speak), I’m close to going over the edge…

Conclusion

Union was a disappointingly bland operator in my view. (Maybe I should start awarding operators marks out of ten for being interesting, challenging etc.) It doesn’t feel like it’s really earned its place in LINQ, as calls to Concat/Distinct can replace it so easily. Admittedly as I’ve mentioned in several other places, a lot of operators can be implemented in terms of each other – but rarely quite so simply.

The point of the operator is straightforward: the result sequence contains the same items as the input sequence, but with the duplicates removed – so an input of { 0, 1, 3, 1, 5 } will give a result sequence of { 0, 1, 3, 5 } – the second occurrence of 1 is ignored.

This time I’ve checked and double-checked the documentation – and in this case it really is appropriate to think of the first overload as just a simplified version of the second. If you don’t specify an equality comparer, the default comparer for the type will be used. The same will happen if you pass in a null reference for the comparer. The default equality comparer for a type can be obtained with the handy EqualityComparer<T>.Default property.

Just to recap, an equality comparer (represented by the IEqualityComparer<T> interface) is able to do two things:

Determine the hash code for a single item of type T

Compare any two items of type T for equality

It doesn’t have to give any sort of ordering – that’s what IComparer<T> is for, although that doesn’t have the ability to provide a hash code.

One interesting point about IEqualityComparer<T> is that the GetHashCode() method is meant to throw an exception if it’s provided with a null argument, but in practice the EqualityComparer<T>.Default implementations appear not to. This leads to an interesting question about Distinct: how should it handle null elements? It’s not documented either way, but in reality both the LINQ to Objects implementation and the simplest way of implementing it ourselves simply throw a NullReferenceException if you use a non-null-safe comparer and have a null element present. Note that the default equality comparer for any type (EqualityComparer<T>.Default) does cope with nulls.

There are other undocumented aspects of Distinct, too. Both the ordering of the result sequence and the choice of which exact element is returned when there are equal options are unspecified. In the case of ordering, it’s explicitly unspecified. From the documentation: "The Distinct method returns an unordered sequence that contains no duplicate values." However, there’s a natural approach which answers both of these questions. Distinct is specified to use deferred execution (so it won’t look at the input sequence until you start reading from the output sequence) but it also streams the results to some extent: to return the first element in the result sequence, it only needs to read the first element from the input sequence. Some other operators (such as OrderBy) have to read all their data before yielding any results.

When you implement Distinct in a way which only reads as much data as it has to, the answer to the ordering and element choice is easy:

The result sequence is in the same order as the input sequence

When there are multiple equal elements, the one which occurs earliest in the input sequence is the one returned as part of the result sequence.

Remember that it’s perfectly possible to have elements which are considered equal under a particular comparer, but are still clearly different when looked at another way. The simplest example of this is case-insensitive string equality. Taking the above rules into account, the distinct sequence returned for { "ABC", "abc", "xyz" } with a case-insensitive comparer is { "ABC", "xyz" }.
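As a concrete snippet, using the framework’s StringComparer.OrdinalIgnoreCase as the case-insensitive comparer:

```csharp
string[] source = { "ABC", "abc", "xyz" };
IEnumerable<string> distinct =
    source.Distinct(StringComparer.OrdinalIgnoreCase);
// Yields "ABC" then "xyz": "abc" is equal to "ABC" under this
// comparer, and "ABC" occurs earlier in the input sequence.
```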

What are we going to test?

All of the above :)

All the tests use sequences of strings for clarity, but I’m using four different comparers:

The default string comparer (which is a case-sensitive ordinal comparer)

The case-insensitive ordinal comparer

A comparer which uses object identity (so will treat two equal but distinct strings as different)

A comparer which explicitly doesn’t try to cope with null values

The tests assume that the undocumented aspects listed above are implemented with the rules that I’ve given. This means they’re over-sensitive, in that an implementation of Distinct which matches all the documented behaviour but returns elements in a different order would fail the tests. This highlights an interesting aspect of unit testing in general… what exactly are we trying to test? I can think of three options in our case:

Just the documented behaviour: anything conforming to that, however oddly, should pass

The LINQ to Objects behaviour: the framework implementation should pass all our tests, and then our implementation should as well

Our implementation’s known (designed) behaviour: we can specify that our implementation will follow particular rules above and beyond the documented contracts

In production projects, these different options are valid in different circumstances, depending on exactly what you’re trying to do. At the moment, I don’t have any known differences in behaviour between LINQ to Objects and Edulinq, although that may well change later in terms of optimizations.

None of the tests themselves are particularly interesting – although I find it interesting that I had to implement a deliberately fragile (but conformant) implementation of IEqualityComparer<T> in order to test Distinct fully.
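Such a comparer might look something like the following – a hypothetical sketch rather than the exact test comparer from Edulinq:

```csharp
// Conformant with the documented IEqualityComparer<T> contract,
// but deliberately intolerant of null values.
sealed class NullIntolerantStringComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        // Throws NullReferenceException if x is null.
        return x.Equals(y);
    }

    public int GetHashCode(string obj)
    {
        // The interface documentation permits throwing here for null.
        if (obj == null)
        {
            throw new ArgumentNullException("obj");
        }
        return obj.GetHashCode();
    }
}
```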

Let’s implement it!

I’m absolutely confident in implementing the overload that doesn’t take a custom comparer using the one that does. We have two options for how to specify the custom comparer in the delegating call though – we could pass null or EqualityComparer<T>.Default, as the two are explicitly defined to behave the same way in the second overload. I’ve chosen to pass in EqualityComparer<T>.Default just for the sake of clarity – it means that anyone reading the first method doesn’t need to check the behaviour of the second to understand what it will do.
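So the comparer-less overload is a one-liner – sketched here under the choice just described:

```csharp
public static IEnumerable<TSource> Distinct<TSource>(
    this IEnumerable<TSource> source)
{
    // Passing the default comparer explicitly rather than null, so
    // the behaviour is clear without reading the other overload.
    return source.Distinct(EqualityComparer<TSource>.Default);
}
```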

We need to use the "private iterator block method" approach again, so that the arguments can be evaluated eagerly but still let the result sequence use deferred execution. The real work method uses HashSet<T> to keep track of all the elements we’ve already returned – it takes an IEqualityComparer<T> in its constructor, and the Add method adds an element to the set if there isn’t already an equal one, and returns whether or not it really had to add anything. All we have to do is iterate over the input sequence, call Add, and yield the item as part of the result sequence if Add returned true. Simple!
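Putting that description together, the implementation looks roughly like this (a sketch following the steps above; the real Edulinq code may differ slightly):

```csharp
public static IEnumerable<TSource> Distinct<TSource>(
    this IEnumerable<TSource> source,
    IEqualityComparer<TSource> comparer)
{
    // Eager argument validation, outside the iterator block.
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    return DistinctImpl(source, comparer ?? EqualityComparer<TSource>.Default);
}

private static IEnumerable<TSource> DistinctImpl<TSource>(
    IEnumerable<TSource> source,
    IEqualityComparer<TSource> comparer)
{
    HashSet<TSource> seenElements = new HashSet<TSource>(comparer);
    foreach (TSource item in source)
    {
        // Add returns true only if the element wasn't already present.
        if (seenElements.Add(item))
        {
            yield return item;
        }
    }
}
```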

So what about the behaviour with nulls? Well, it seems that HashSet<T> just handles that automatically, if the comparer it uses does. So long as the comparer returns the same hash code each time it’s passed null, and considers null and null to be equal, it can be present in the sequence. Without HashSet<T>, we’d have had a much uglier implementation – especially as Dictionary<TKey, TValue> doesn’t allow null keys.

Conclusion

I’m frankly bothered by the lack of specificity in the documentation for Distinct. Should you rely on the ordering rules that I’ve given here? I think that in reality, you’re reasonably safe to rely on it – it’s the natural result of the most obvious implementation, after all. I wouldn’t rely on the same results when using a different LINQ provider, mind you – when fetching the results back from a database, for example, I wouldn’t be at all surprised to see the ordering change. And of course, the fact that the documentation explicitly states that the result is unordered should act as a bit of a deterrent from relying on this.

We’ll have to make similar decisions for the other set-based operators: Union, Intersect and Except. And yes, they’re very likely to use HashSet<T> too…

Aggregate is an extension method using immediate execution, returning a single result. The generalised behaviour is as follows:

Start off with a seed. For the first overload, this defaults to the first value of the input sequence. The seed is used as the first accumulator value.

For each item in the list, apply the aggregation function, which takes the current accumulator value and the newly found item, and returns a new accumulator value.

Once the sequence has been exhausted, optionally apply a final projection to obtain a result. If no projection has been specified, we can imagine that the identity function has been provided.

The signatures make all of this look a bit more complicated because of the various type parameters involved. You can consider all the overloads as dealing with three different types, even though the first two actually have fewer type parameters:

TSource is the element type of the sequence, always.

TAccumulate is the type of the accumulator – and thus the seed. For the first overload where no seed is provided, TAccumulate is effectively the same as TSource.

TResult is the return type when there’s a final projection involved. For the first two overloads, TResult is effectively the same as TAccumulate (again, think of a default "identity projection" as being used when nothing else is specified).

In the first overload, which uses the first input element as the seed, an InvalidOperationException is thrown if the input sequence is empty.

What are we going to test?

Obviously the argument validation is reasonably simple to test – source, func and resultSelector can’t be null. But there are two different approaches to testing the "success" cases.

We could work out exactly when each delegate should be called and with what values – effectively mock every step of the iteration. This would be a bit of a pain, but a very robust way of proceeding.

The alternative approach is just to take some sample data and aggregation function, work out what the result should be, and assert that result. If the result is sufficiently unlikely to be achieved by chance, this is probably good enough – and it’s a lot simpler to implement. Here’s a sample from the most complicated test, where we have a seed and a final projection:
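A reconstruction of that test along the following lines (the exact values in the Edulinq source may differ; ToInvariantString is the helper extension discussed in the edit note further down, equivalent to ToString with the invariant culture):

```csharp
[Test]
public void SeededAggregationWithResultSelector()
{
    int[] source = { 1, 4, 5 };
    int seed = 5;
    Func<int, int, int> func = (current, value) => current * 2 + value;
    // First iteration:  5 * 2 + 1 = 11
    // Second iteration: 11 * 2 + 4 = 26
    // Third iteration:  26 * 2 + 5 = 57
    // Result projection: "57"
    Assert.AreEqual("57",
        source.Aggregate(seed, func, result => result.ToInvariantString()));
}
```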

Now admittedly I’m not testing this to the absolute full – I’m using the same types for TSource and TAccumulate – but frankly it gives me plenty of confidence that the implementation is correct.

EDIT: My result selector now calls ToInvariantString. It used to just call ToString, but as I’ve now been persuaded that there are some cultures where that wouldn’t give us the right results, I’ve implemented an extension method which effectively means that x.ToInvariantString() is equivalent to x.ToString(CultureInfo.InvariantCulture) – so we don’t need to worry about cultures with different numeric representations etc.

Just for the sake of completeness (I’ve convinced myself to improve the code while writing this blog post), here’s an example which sums integers, but results in a long – so it copes with a result which is bigger than Int32.MaxValue. I haven’t bothered with a final projection though:
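A sketch of that test – accumulating int values into a long accumulator, so the total can exceed Int32.MaxValue without overflowing:

```csharp
[Test]
public void DifferentSourceAndAccumulatorTypes()
{
    int largeValue = 2000000000;
    int[] source = { largeValue, largeValue, largeValue };
    // Seeding with 0L makes TAccumulate long, while TSource is int.
    long sum = source.Aggregate(0L, (acc, value) => acc + value);
    Assert.AreEqual(6000000000L, sum);
    // Just to prove the result really is bigger than int.MaxValue...
    Assert.IsTrue(sum > int.MaxValue);
}
```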

Since I first wrote this post, I’ve also added tests for empty sequences (where the first overload should throw an exception) and a test which relies on the first overload using the first element of the sequence as the seed, rather than the default value of the input sequence’s element type.

Okay, enough about the testing… what about the real code?

Let’s implement it!

I’m still feeling my way around when it’s a good idea to implement one method by using another, but at the moment my gut feeling is that it’s okay to do so when:

You’re implementing one operator by reusing another overload of the same operator; in other words, no unexpected operators will end up in the stack trace of callers

There are no significant performance penalties for doing so

The observed behaviour is exactly the same – including argument validation

The code ends up being simpler to understand (obviously)

Contrary to an earlier version of this post, the first overload can’t be implemented in terms of the second or third ones, because of its behaviour regarding the seed and empty sequences.

It still makes sense to share an implementation for the second and third overloads though. There’s a choice around whether to implement the second operator in terms of the third (giving it an identity projection) or to implement the third operator in terms of the second (by just calling the second overload and then applying a projection). Obviously applying an unnecessary identity projection has a performance penalty in itself – but it’s a tiny penalty. So which is more readable? I’m in two minds about this. I like code where various methods all call one "central" method where all the real work occurs (suggesting implementing the second overload using the third) but equally I suspect I really think about aggregation in terms of getting the final value of the accumulator… with just a twist in the third overload, of an extra projection. I guess it depends on whether you think of the final projection as part of the general form or an "extra" step.

For the moment, I’ve gone with the "keep all logic in one place" approach:
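In sketch form, that means the second overload delegates to the third with an identity projection (again, the real Edulinq code may differ in detail):

```csharp
public static TAccumulate Aggregate<TSource, TAccumulate>(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> func)
{
    // Identity projection: the result is just the final accumulator.
    return source.Aggregate(seed, func, x => x);
}

public static TResult Aggregate<TSource, TAccumulate, TResult>(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> func,
    Func<TAccumulate, TResult> resultSelector)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (func == null)
    {
        throw new ArgumentNullException("func");
    }
    if (resultSelector == null)
    {
        throw new ArgumentNullException("resultSelector");
    }
    // Immediate execution: iterate fully and return a single result.
    TAccumulate current = seed;
    foreach (TSource item in source)
    {
        current = func(current, item);
    }
    return resultSelector(current);
}
```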

The bulk of the "real work" method is argument validation – the actual iteration is almost painfully simple.

Conclusion

The moral of today’s story is to read the documentation carefully – sometimes there’s unexpected behaviour to implement. I still don’t really know why this difference in behaviour exists… it feels to me as if the first overload really should behave like the second one, just with a default initial seed. EDIT: it seems that you need to read it really carefully. You know, every word of it. Otherwise you could make an embarrassing goof in a public blog post. <sigh>

The second moral should really be about the use of Aggregate – it’s a very generalized operator, and you can implement any number of other operators (Sum, Max, Min, Average etc) using it. In some ways it’s the scalar equivalent of SelectMany, just in terms of its diversity. Maybe I’ll show some later operators implemented using Aggregate…

Next up, there have been requests for some of the set-based operators – Distinct, Union, etc – so I’ll probably look at those soon.