Reimplementing LINQ to Objects: Part 2 – “Where”

Warning: this post is quite long. Although I’ve chosen a simple operator to implement, we’ll encounter a few of the corner cases and principles involved in LINQ along the way. This will also be a somewhat experimental post in terms of format, as I try to work out the best way of presenting the material.

We’re going to implement the "Where" clause/method/operator. It’s reasonably simple to understand in general, but goes into all of the deferred execution and streaming bits which can cause problems. It’s generic, but only uses one type parameter (which is a big deal, IMO – the more type parameters a method has, the harder I find it to understand in general). Oh, and it’s a starting point for query expressions, which is a bonus.

Before I go into what it actually does, I’ll point out a few things which are going to be common across nearly all of the LINQ operators we’ll be implementing:

These are extension methods – they’re declared in a top-level, non-nested, static class and the first parameter has a "this" modifier. They can be invoked roughly as if they were instance methods on the type of the first parameter.

They’re generic methods – in this case there’s just a single type parameter (TSource) which indicates the type of sequence we’re dealing with. So for (say) a list of strings, TSource would be string.

They take generic delegates in the Func<…> family. These are usually specified with lambda expressions, but any other way of providing a delegate will work fine too.

I fully expect that most readers will be comfortable with all of these concepts, so I won’t go into them in more detail. If any of the above makes you nervous, please familiarise yourself with them before continuing, otherwise you’re likely to have a hard time.

The purpose of "Where" is to filter a sequence. It takes an input sequence and a predicate, and returns another sequence. The output sequence is of the same element type (so if you put in a sequence of strings, you’ll get a sequence of strings out) and it will only contain elements from the input sequence which pass the predicate. (Each item will be passed to the predicate in turn. If the predicate returns true, the item will be part of the output sequence; otherwise it won’t.)

Now, a few important details about the behaviour:

The input sequence is not modified in any way: this isn’t like List<T>.RemoveAll, for example.

The method uses deferred execution – until you start trying to fetch items from the output sequence, it won’t start fetching items from the input sequence

Despite deferred execution, it will validate that the parameters aren’t null immediately

It streams its results: it only ever needs to look at one result at a time, and will yield it without keeping a reference to it. This means you can apply it to an infinitely long sequence (for example a sequence of random numbers)

It will iterate over the input sequence exactly once each time you iterate over the output sequence

Disposing of an iterator over the output sequence will dispose of the corresponding iterator over the input sequence. (In case you didn’t realise, the foreach statement in C# uses a try/finally block to make sure the iterator is always disposed however the loop finishes.)

Many of these points will be true for a lot of our other operators too.

The overload which takes a Func<TSource, int, bool> lets the predicate use the index within the sequence as well as the value. The index always starts at 0, and increments by 1 each time regardless of previous results from the predicate.

What are we going to test?

Ideally, we’d like to test all of the points above. The details of streaming and how many times the sequence is iterated over are frankly a pain to deal with, unfortunately. Given how much there is to implement already, we’ll come back to those.

Let’s have a look at some tests. First, here’s a simple "positive" test – we’re starting with an array of integers, and using a lambda expression to only include elements less than 4 in the output. (The word "filter" is omnipresent but unfortunate. It’s easier to talk about "filtering out" elements than "including" them, but the predicate is expressed in a positive way.)

I’ve kept the TestExtensions from MoreLINQ, despite NUnit coming with CollectionAssert. I find the extension methods easier to work with for three reasons:

They’re extension methods, which helps to reduce the clutter

They can use a parameter array for the expected output, which makes the test simpler to express

The message is clearer when the assertion fails

Basically, AssertSequenceEqual does what you’d expect it to – it checks that the actual result (usually expressed as the variable you call the method on) matches the expected result (usually expressed as a parameter array).

I’m not bothering to check the name within the ArgumentNullException, but importantly I’m testing that the arguments are being validated immediately. I’m not trying to iterate over the result – so if the validation is deferred, the test will fail.

The final interesting test for the moment is also around deferred execution, using a helper class called ThrowingEnumerable. This is a sequence which blows up with an InvalidOperationException if you ever try to iterate over it. Essentially, we want to check two things:

Just calling Where doesn’t start iterating over the source sequence

When we call GetEnumerator() to get an iterator and then MoveNext() on that iterator, we should start iterating, causing an exception to be thrown.

We’ll need to do something similar for other operators, so I’ve written a small helper method in ThrowingEnumerable:

These tests have all dealt with the simpler predicate overload – where the predicate is only passed the item, not the index. The tests involving the index are very similar.

Let’s implement it!

With all the tests passing when running against the real LINQ to Objects, it’s time to implement it in our production code. We’re going to use iterator blocks, which were introduced in C# 2 to make it easier to implement IEnumerable<T>. I have a couple of articles you can read if you want more background details… or read chapter 6 of C# in Depth (either edition). These give us deferred execution for free… but that can be a curse as well as a blessing, as we’ll see in a minute.

At its heart, the implementation is going to look something like this:

Simple, isn’t it? Iterator blocks allow us to write the code pretty much how we’d describe it: we iterate over each item in the source, and if the predicate returns true for that particular item, we yield (include) it in the output sequence.

Lo and behold, some of our tests pass already. Now we just need argument validation. That’s easy, right? Let’s give it a go:

Hmm. Our validation tests still seem to be red, and putting a breakpoint on the "throw" statements doesn’t help… they’re not getting hit. What’s going on?

I’ve given a few pretty broad hints already. The problem is deferred execution. Until we start trying to iterate over the result, none of our code will run. Our tests deliberately don’t iterate over the result, so validation is never performed.

We’ve just hit a design flaw in C#. Iterator blocks in C# simply don’t work nicely when you want to split execution between "immediate" (usually for validation) and "deferred". Instead, we have to split our implementation into two: a normal method for validation, which then calls the iterator method for the deferred processing:

Now the bar is green, and we’re done. Hang on though… we haven’t used it every way we can yet.

Query expressions

So far, we’ve been calling the method directly (although as an extension method) – but LINQ also provides us with query expressions. Here’s our "SimpleFiltering" test rewritten to use a query expression:

(Note that the name is different here to in the downloadable code, to stop the blog server software blocking the name of the method. Grr.)

That will produce exactly the same code as our earlier test. The compiler basically translates this form into the previous one, leaving the condition (x < 4) as a lambda expression and then converting it appropriately (into a delegate in this case). You may be surprised that this works as we have no Select method yet… but in this case we have a "no-op" select projection; we’re not actually performing a real transformation. In that case – and so long as there’s something else in the query, in this case our "where" clause – the compiler effectively omits the "select" clause, so it doesn’t matter that we haven’t implemented it. If you changed "select x" to "select x * 2", it would fail to compile against our Where-only LINQ implementation.

The fact that query expressions are just based on patterns like this is a very powerful feature for flexibility – it’s how LINQ to Rx is able to only implement the operators that make sense in that environment, for example. Similarly, there’s nothing in the C# compiler that "knows" about IEnumerable<T> when it comes to query expressions – which is how entirely separate interfaces such as IObservable<T> work just as well.

What have we learned?

There’s been a lot to take in here, in terms of both implementation and core LINQ principles:

LINQ to Objects is based on extension methods, delegates and IEnumerable<T>

Operators use deferred execution where appropriate and stream their data where possible

Operators don’t mutate the original source, but instead return a new sequence which will return the appropriate data

Query expressions are based on compiler translations of patterns; you don’t need to implement any more of the pattern than the relevant query expression requires

Your arguments seems fine but your implementation can be improved. The problem is that the exception’s stack trace will point to your ValidateInput method which may be misleading. You can handle this using something like:

@Jason: Yes – that’s one of the types of test I haven’t implemented yet. I figured that testing deferred execution and eager validation was enough for this time. I may do the disposal for Select, which would otherwise be pretty simple :)

Additional tests would be to check that the predicate is only invoked once for every element and a few others. Ideally you would test that no files have been deleted etc. You venture quite far along the testing route towards inefficiency.

“LINQ to Objects is based on extension methods, delegates and IEnumerable”

It is ofcourse based on extension methods and delegates, But there is an underlying implementation of IQueryable interface for all these extensions to be invoked from. Simply put, You can have the end-point source as IQueryable and run LINQ to Objects because its query translation is IL. You can check the EnumerableQuery which implements this, and acts the glue for IQueryable translation. How do you think the LINQ to SQL is implemented if it is only based on extension methods / lambdas / delegates?

@Fahad: LINQ to SQL isn’t LINQ to Objects. I very carefully wrote that *LINQ to Objects* is based on those things, rather than LINQ in general. Yes, you *can* go via IQueryable, but that’s still going to end up compiling the query expressions down to delegates.

Jon, My point here is that there is a set of methods that you need to implement that makes “LINQ”, and that is with the IQueryable interface and not the other way around. It would be more like “via” delegates / lambdas and straight with IQueryable.

@Fahad: But I’m not implementing all of LINQ – I’m implementing LINQ to Objects. I’ll talk about expression trees etc if and when I implement AsQueryable, but I think my statement about *LINQ to Objects* is reasonable. I think it’s best to start off with *relatively* straightforward concepts, and introduce the trickier ones later, when they become relevant.

(In particular, LINQ to Objects would still have been useful without IQuerable existing at all.)

@tobi: Why do you say Jon is “venture[ing] quite far along the testing route towards inefficiency”? I’m not the best tester in the world, and have some questions myself on how the test for deferred execution is actually working, but all of his tests seem reasonable. Which ones would you not include?

Hi Jon, Thanks for the blog series just discovered it today. Just a question on this statement “We’ve just hit a design flaw in C#. Iterator blocks in C# simply don’t work nicely when you want to split execution between “immediate” (usually for validation) and “deferred”. Instead, we have to split our implementation into two: a normal method for validation, which then calls the iterator method for the deferred processing:”

In the code above what causes it to be deferred? Is it the presence of the yield keyword in the method?

Absolutely fascinating. The part that I don’t get is that why c# doesn’t defer the checking of the validation as well……….this is most confusing to me. It seems like quantum physics – a magic wand: how does the debugger know whether a yield statement is in the where method unless it goes there and checks: yet it seems to foresee whether a yield statement exists in the where method before going there and checking – because when we add in the WhereImp method which actually does the implementation, then the debugger actually enters the where method. i hope that makes sense.

In any case my question is the same as as @Dominic – what causes it to be deferred? and how does c# know that there is a yield statement in the where method before going there to check it.

That’s simple – C# is a compiled language. The compiler looks at the method and works out what code it’s going to emit. If the method includes a yield statement, it’s an iterator and uses deferred execution. If it doesn’t contain a yield statement, it’s not an iterator. All of this is decided at compile time.