Simplex Sigillum Veri


Monthly Archives: December 2006

Today I had record stats on my blog. I passed the 200 hits per day threshold.

Since Mitch Denny told me the kind of visitor numbers he got on his blog, I’ve been operating with a bit of an inferiority complex. At the time I was peaking at about 20 hits per day, and I was pleased with those levels. Since then (about 4 months ago) I have been trying to attract more visitors to the blog. I get a lot of fun out of writing the posts (and my mum reads some of them back in the UK), but in the end there’s not much point keeping a public weblog unless the public reads it.

I started watching which posts brought the most traffic, and unsurprisingly it was the ones that told readers in advance what they were going to read about. Obscure or humorous titles got nowhere. People want to know the topic before they expend the time and mental effort visiting the page. Choosing my titles and topics more wisely, and creating cross-links to well-known sites (such as Mitch’s), has helped a lot, as have search engine registrations (especially reddit, which I had never heard of before). I also found that controversial-but-lite content (such as my Anti-Agile Gripe) got way more traffic than the more painstaking articles on configuration and the LINQ series. I don’t know whether the LINQ series will get more traffic the closer Orcas gets to release.

I haven’t gotten much in the way of comments which has been disappointing – I can’t tell whether readers are just skimming through or reading the posts! I’m not sure what to do about that. Any ideas?

Memetically, the family is stood around the foot of my bed and the priest has been called up from the village.

Memetically, I am raving while the hospital staff tie straps to my arms and legs to prevent me from hurting myself.

Memetically, I am on the back of a wagon being taken to the unmarked grave called “yuletide casualties”.

Yes, Christmas is here to provide the memetic straw that breaks this camel’s back. I was already infected by the recent wave of Wiggles that was going around (the Di Dickey Doo Dum even shows on my face, and if the office is too quiet people can hear my subvocalised Big Red Cars). Add to that the background malaise from all that twinkle twinkling and the blind mice that race up and down my spine (presumably competing for mind-space with the little piggies) at night.

But now I am moribund with memes that are thick as fleas on a dog’s back – it is Christmas, and the new rash of novelty singing toys is out. Kerry and her mother thought it was in the yuletide spirit to have about a dozen of these monstrosities wandering the house, singing songs so virulent they should be ranked alongside military-grade bioweapons.

I know the ill-disciplined mind of a child needs something seriously catchy to get an idea through the background noise of the forming mind. The effect on us adults, though, is little short of being mustard gassed in the trenches for a solid month.

That sacred space that is the core of a centred mind is, in my case, so infected with viral memes it bears more than a passing resemblance to an ambulatory slime mould.

And don’t get me started on those ruddy 50’s and 60’s Xmas tunes from the likes of Bing Crosby.

No meme would be worth its salt if it didn’t compel me to sing you a song, so here goes:

I’m the happiest Christmas tree – Ho Ho Ho, He He He. Came one day, and they found me, and took me home with them.

NASA Ames Research Center has announced a collaboration with Google to make available the gigaquads of data that it has gathered over the years. Hopefully this will include data from lunar expeditions and some very interesting satellite imagery. I don’t know whether it will include data from things like the Hubble Space Telescope, but I hope so. One thing I am sure about: it will be a priceless resource for professional and amateur astronomers alike.

I wonder who else Google is talking to? The Human Genome Project? The Visible Human Project? Interpol? Imagine if they made it possible for researchers to publish their data sets when they publish a paper… Maybe in future, readers of scientific papers can draw their own conclusions about the validity of results by running their own analysis of the results. It might improve the quality of research if scientists knew that their results would come under greater scrutiny?

As you may have already gathered, I have some reservations about the value of Agile and XP methodologies. I’ve even gone so far as to say that they are a license to cut corners.

I just read a very interesting post by Tony Wright on the relative cost of fixing a bug in different stages of a project. Do we need a better justification for applying a little bit of forethought to a problem before rushing headlong into the implementation phase of a project? I guess the justification of an iterative development plan is that there is always another requirements and design phase coming up in which defects can be fixed. The problem with that is that it blows out the project timeline, because time that could be spent developing worthwhile functionality is spent fixing defects instead. Not only that, the ‘agile’ ethos promotes a mentality that actively avoids solving problems ahead of time. That leads to short-termism that ultimately bakes bugs into the code!

Introduction

In recent weeks I’ve been decompiling LINQ with Reflector to try to better understand how expression trees get converted into code. I had some doubts about Reflector’s analysis capabilities, but Matt Warren and Nigel Watson assure me that it can resolve everything that gets generated by the C# 3.0 compiler. I am going to continue disassembling typical usage of LINQ to Objects, and will use whatever tools are available to allow me to peer beneath the hood. I’ll follow the flow of control from the creation of a query to getting the first object out of that query. At least that way I’ll know if there’s something fishy going on in LINQ or Reflector.

What I’ve found from my researches is that there is a lot going on under the hood of LINQ. To begin to understand how LINQ achieves what it does, we will need to understand the following:

What the C# 3.0 compiler does to your queries.

Building and representing expression trees.

Code generation for anonymous types, delegates, and iterators.

Converting Expression trees into IL code.

What happens when the query is enumerated.

I’ll try to answer some of these questions. I’ve already made a start in some of my earlier LINQ posts to give an outline of the strategies LINQ uses to get things done. As a rule, it tends to use a lot of IL code generation to produce anonymous types to do its bidding. In this post, I’ll try to show you how it generates expression trees and turns them into code. In some cases, to prevent you from falling asleep, I’ll have to gloss over the details a little. I read through early drafts of this post and had to admit that describing every line of code was pretty much out of the question. I hope that at the very least, you’ll come away with a better idea of how to use LINQ.

What happens to your Queries

The way your queries are compiled depends on whether your data source is an enumerable sequence or a queryable. The handling for IEnumerable is much simpler and more immediate than for IQueryable. I covered much of what happens in a previous post. The next section will show you how queries look behind the scenes.

Querying a Sequence

To illustrate what happens when you write a query in LINQ, I produced a couple of test methods with different types of query. Here’s the first:

Example 1

private static void Test3()
{
    int[] primes = { 1, 3, 13, 17, 23, 5, 7, 11 };
    var smallPrimes = from q in primes
                      where q < 11
                      select q;
    foreach (int i in smallPrimes)
    {
        Debug.WriteLine(i.ToString());
    }
}

It enumerates an array of integers, using a where clause to filter out any that are greater than or equal to 11. The bit I’m going to look at is

Example 2

from q in primes where q < 11 select q

which is a query that we store in the variable smallPrimes. When this gets compiled, the C# 3.0 compiler deduces the type of smallPrimes by working back through the query, from primes through to the output of where and the output of select. The type flows through the various invocations of the extension methods to end up as an IEnumerable<int>. In case you didn’t know, the new query syntax of C# vNext is just a bit of (very nice) syntactic sugar hiding the invocation of extension methods. Example 2 is equivalent to

Example 3

primes.Where(delegate(int a) { return a < 11; });

It has been translated into familiar C# 2.0 syntax by the C# 3.0 compiler. By the time the C# 3.0 compiler has gotten through with Example 2, the code has been converted into this:

Most of it looks the same as before, but the LINQ query has been expanded out into two private static fields on the class, called Program.<>9__CachedAnonymousMethodDelegate2. The first is a generic specification of Func<Type, bool> and the other is a specialisation for the type int (i.e. Func<int, bool>), which is how Test3 will use it. The anonymous delegate is initialised with Program.<Test3>b__0, which is a simple one-line method:

Example 5

[CompilerGenerated]
private static bool <Test3>b__0(int q)
{
    return (q < 11);
}

You can see that the query has been transformed into the code needed to test whether elements coming from an enumerator are less than 11. My previous post explains what happens inside of the call to Where (It’s a call to the static extension method Sequence.Where(this IEnumerable<T>)).
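To see the sugar at work end to end, here is a minimal self-contained sketch. Note that I’m using the shipped System.Linq names here (Enumerable.Where rather than the CTP’s Sequence.Where) – an assumption on my part, since the preview bits may still change:

```csharp
using System;
using System.Linq;

class QueryDesugaringSketch
{
    static void Main()
    {
        int[] primes = { 1, 3, 13, 17, 23, 5, 7, 11 };

        // Query syntax...
        var viaQuery = from q in primes where q < 11 select q;

        // ...is sugar for a call to the Where extension method, with the
        // predicate lifted into a compiler-generated delegate like <Test3>b__0.
        var viaMethod = primes.Where(delegate(int q) { return q < 11; });

        Console.WriteLine(string.Join(",", viaQuery));  // 1,3,5,7
        Console.WriteLine(string.Join(",", viaMethod)); // 1,3,5,7
    }
}
```

Both forms compile to exactly the same IL, which is why the rest of this article can ignore the query syntax and reason about the extension method calls directly.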

Querying a SequenceQuery

As you’ve probably also noticed, the previous example is a straightforward example of LINQ in its guise as a nice way of filtering over an enumerable. It’s all done using IEnumerable<T>. It’s easy enough to prove that to ourselves since we can substitute IEnumerable<int> in place of var. There are no repeatable queries or expression trees here – the enumerator that gets stored in smallPrimes is generated in advance by the compiler and it is just enumerated in the conventional way. The Enumerator in Sequence is different from using SequenceQuery – in code generated from a SequenceQuery, the elements from the query are not stored in a private sequence field.

Let’s see what happens when we convert the IEnumerable<int> into an IQueryable<int>. It’s pretty easy to do this: just invoke ToQueryable on your source collection. It creates a SequenceQuery; thereafter, all of the extension methods will just elaborate an expression tree.

Example 6

private static void Test4()
{
    int[] primes = { 1, 3, 13, 17, 23, 5, 7, 11 };
    IQueryable<int> smallPrimes = from q in primes.ToQueryable()
                                  where q < 11
                                  select q;
    foreach (int i in smallPrimes)
    {
        Debug.WriteLine(i.ToString());
    }
}

I converted the array of integers into an IQueryable<int> using the extension method ToQueryable(). This extension method is fairly simple: it creates a SequenceQuery out of the enumerator it got from the array. I cover some of the capabilities of the SequenceQuery in this post. This is what the test method looks like now:

Quite a difference in the outputs! The call to ToQueryable has led the compiler to generate altogether different output. It inlined the primes collection, converted smallPrimes into a SequenceQuery, and created a lambda expression containing a BinaryExpression for the less-than comparison, rather than a simple anonymous delegate. As we know from the outline in this post, the Expression will eventually get converted into an anonymous delegate through a call to System.Reflection.Emit.DynamicMethod. That bit happens later on, when the IEnumerator<T> is requested in a call to GetEnumerator on smallPrimes (in the foreach statement).
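The deferred expression tree is easy to observe for yourself. Here’s a sketch using the shipped API names (AsQueryable and EnumerableQuery rather than the CTP’s ToQueryable and SequenceQuery – an assumed correspondence):

```csharp
using System;
using System.Linq;

class QueryableSketch
{
    static void Main()
    {
        int[] primes = { 1, 3, 13, 17, 23, 5, 7, 11 };

        // AsQueryable wraps the array in a queryable; the where clause is
        // then captured as data, not compiled to a delegate up front.
        IQueryable<int> smallPrimes = from q in primes.AsQueryable()
                                      where q < 11
                                      select q;

        // The root of the captured tree is the MethodCallExpression for Where.
        Console.WriteLine(smallPrimes.Expression.NodeType);  // Call

        // Only enumeration triggers compilation of the tree into code.
        foreach (int i in smallPrimes)
            Console.Write(i + " ");  // 1 3 5 7
    }
}
```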

Building the Expression Tree

This section describes how the expression tree for our query gets built. I will guide you through a simple example of how the tree gets built out of Example 6. Each of the nested calls that eventually produces smallPrimes creates a node for insertion into the tree. By tracing through the calls to Queryable.Where, Queryable.ToQueryable and Expression.Lambda we can see that it constructs a tree as in Figure 1 below. It seems large for a simple query, but as Anders Hejlsberg pointed out, the elements of these trees are tiny (around 16 bytes each), so we get a lot of bang for our buck. The root of the tree, after calling Where, is a MethodCallExpression. In my previous post, I showed some of what happens when you try to get an enumerator from a SequenceQuery – an anonymous type is generated that iterates through the collection, deciding whether or not to yield elements depending on the result from the predicate. In this post I have a more or less accurate expression tree to work with, and I’ll explore how GetEnumerator in SequenceQuery generates code by walking the expression tree.

If you are already familiar with the idea of abstract syntax trees (ASTs) or object query languages, you can probably skip this part. If you’re one of those hardy souls who can withstand any quantity of detail, no matter how dry, then refer to the sidebar for a more detailed breakdown of how an expression tree gets created.
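For those hardy souls, the shape of the tree in Figure 1 can be reproduced by hand with the Expression factory methods. This sketch assumes the shipped factory names (Expression.Parameter, Expression.LessThan, Expression.Lambda); the CTP’s may differ slightly:

```csharp
using System;
using System.Linq.Expressions;

class TreeByHandSketch
{
    static void Main()
    {
        // Hand-build the tree the compiler produces for 'q => q < 11'.
        ParameterExpression q = Expression.Parameter(typeof(int), "q");
        BinaryExpression lessThan = Expression.LessThan(q, Expression.Constant(11));
        Expression<Func<int, bool>> predicate =
            Expression.Lambda<Func<int, bool>>(lessThan, q);

        Console.WriteLine(predicate);          // q => (q < 11)
        Console.WriteLine(lessThan.NodeType);  // LessThan
    }
}
```

Three tiny nodes – a parameter, a constant, and a binary comparison – hanging off a lambda, which matches the per-node cost Hejlsberg describes.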

Generating Code from an Expression Tree

This section describes what the LINQ runtime does with the expression tree to convert it into code. I think that the code generation phase is the most important part of LINQ to understand. It’s a tour de force of ingenuity that effectively converts C# into a strongly typed, declarative scripting language. It brings a lot of new power to the language. If you’re interested in seeing how Reflection.Emit ought to be used, you couldn’t find a better example. It’s a must for anyone interested in code generation.

The code generation phase begins when the user calls GetEnumerator on the SequenceQuery. Till that point, all you have is a tree-shaped data structure that declares the kind of results that you are after and where you want to get them from. This data structure can’t do anything. But when it is compiled, it suddenly gains power. LINQ interprets what you want, and generates the code to find it. That power is what I wanted to understand when I started digging into LINQ with Reflector. I’d built a couple of ORM systems in the past, so I had an inkling of what might get done with the expression trees – you have to turn them into queries that are comprehensible to the data source. That is easy enough with an ORM system, but how can you use the same code to query in-memory objects as database-bound ones? Well, you don’t – the extension methods just make it seem like that is what is happening. In truth, the code that processes your query is very different for each type of data source: the common syntax of LINQ hides different query engines.
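The dual-engine trick is visible in overload resolution itself: the static type of the source decides whether the compiler hands Where a compiled delegate or an expression tree. A minimal sketch, again using the shipped names as an assumption:

```csharp
using System;
using System.Linq;

class EngineDispatchSketch
{
    static void Main()
    {
        int[] primes = { 1, 3, 13, 17, 23, 5, 7, 11 };

        // Binds to Enumerable.Where(Func<int, bool>): the predicate becomes
        // an ordinary delegate, run directly while enumerating.
        var inMemory = primes.Where(x => x < 11);

        // Binds to Queryable.Where(Expression<Func<int, bool>>): the predicate
        // stays as data for whatever engine backs the IQueryable.
        var queryable = primes.AsQueryable().Where(x => x < 11);

        Console.WriteLine(string.Join(",", inMemory));  // 1,3,5,7
        Console.WriteLine(string.Join(",", queryable)); // 1,3,5,7
    }
}
```

Identical syntax and identical results here, but the second form is what gives a database-backed provider the chance to translate the tree into SQL instead of running it in memory.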

The algorithm of the code generation system is simple:

Create a dynamic method (DM) as a template for later code generation.

Find all parameters in the expression tree and use them as the parameters of the method.

Generate the body of the dynamic method: walk the expression tree, generating IL for each node according to its type – e.g. a MethodCallExpression will yield IL to call a method, whereas an LE BinaryExpression will compare its two halves using the less-than-or-equal opcode (Cle).

Create a multicast delegate to store the dynamic method in.

Iterate the source collection, passing each element to the multicast delegate – if it returns true, yield the element; otherwise, ignore it.

Inside GetEnumerator for the query, the ExpressionCompiler class is invoked to create the code. It has a Compile method that performs the algorithm above. The Compile method initially generates a top-level LambdaExpression. This lambda expression is a specification for a function – it tells the code generator what parameters the function needs to take, and how the result of the function is to be calculated. As in the lambda calculus, these functions are nameless entities that can be nested. What that means for developers is that we can compose new queries from other queries. That is ideal for ad hoc query generators that keep adding criteria to a stored query until the user hits search.

Example 9

internal Delegate Compile(LambdaExpression lambda)
{
    this.lambdas.Clear();
    this.globals.Clear();
    int num2 = this.GenerateLambda(lambda);
    ExpressionCompiler.LambdaInfo info2 = this.lambdas[num2];
    ExpressionCompiler.ExecutionScope scope2 =
        new ExpressionCompiler.ExecutionScope(
            null,
            lambda,
            this.lambdas.ToArray(),
            this.globals.ToArray(),
            info2.HasLiftedParameters);
    return info2.Method.CreateDelegate(lambda.Type, scope2);
}

The first important thing that the Compile method does is create the top-level lambda. As Example 10 shows, it first creates a compilation scope. The scope defines the visibility of variables in the method that is to be generated. We know from Example 9 that it maintains a couple of collections called lambdas and globals. Lambdas defines the parameters to the method (and, recursively, does the same for any sub-method calls that are buried deeper in the expression tree). Globals maintains a list of references that will be visible to all dynamic methods.

Next, the code generator creates the outline of an anonymous delegate using a DynamicMethod. It then goes on to generate the body of the anonymous delegate. The call to GenerateInitLiftedParameters should be familiar from my previous posts – it emits code for loading the parameters to the lambda onto the evaluation stack.

Next, GenerateLambda will create the body for the anonymous method that has just been created. It uses the same code generator, so the body is inserted into the method as it goes along. It recurses through the expression tree to generate the code for the expression. As each element is visited, code is generated for the node, then each sub-tree or leaf node is also passed to ExpressionCompiler.Generate. ExpressionCompiler.Generate is a huge switch statement that passes control off to a method dedicated to each node type. In processing the expression shown in Figure 1, we will end up calling GenerateLambda, GenerateBinaryExpression, and GenerateConstant. Each of these methods emits a bit of the IL needed to flesh out the body of the DynamicMethod.
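A toy analogue of that dispatch – one big switch on the node type, recursing into sub-trees – looks like this. The names and the string output here are purely illustrative, not the real ExpressionCompiler internals:

```csharp
using System;
using System.Linq.Expressions;

class WalkerSketch
{
    // A toy stand-in for ExpressionCompiler.Generate: switch on the node
    // type, handle the node, then recurse into its children. The real
    // thing emits IL; this one emits a string describing the opcodes.
    static string Generate(Expression e)
    {
        switch (e.NodeType)
        {
            case ExpressionType.Lambda:
                return Generate(((LambdaExpression)e).Body);
            case ExpressionType.LessThan:
                var b = (BinaryExpression)e;
                return "(" + Generate(b.Left) + " Clt " + Generate(b.Right) + ")";
            case ExpressionType.Parameter:
                return ((ParameterExpression)e).Name;
            case ExpressionType.Constant:
                return ((ConstantExpression)e).Value.ToString();
            default:
                throw new NotSupportedException(e.NodeType.ToString());
        }
    }

    static void Main()
    {
        Expression<Func<int, bool>> predicate = q => q < 11;
        Console.WriteLine(Generate(predicate));  // (q Clt 11)
    }
}
```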

In Generate, each of the parameters of the lambda is passed through the code in example 11:

This piece walks the expression tree. The parameters of the lambda expression contain the terms of the comparison function to be performed (i.e. things like ‘int q’ and ‘q < 11’). Eventually the generator will reach the sub-expressions of the lambda – the LT BinaryExpression – and GenerateBinary will be called on the ExpressionCompiler. I won’t include the code for that here; it’s 450 lines long – essentially a giant switch statement on the operator type. In this case the operator is ExpressionType.LT, so the generator produces code for evaluating the left and right sides of the operation, then emits a Clt opcode that performs the signed comparison, leaving 1 or 0 on the evaluation stack.
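The IL the generator arrives at for the predicate body can be sketched directly with Reflection.Emit – this is hand-written IL doing what GenerateBinary does for the LT node (a minimal sketch, not the CTP’s actual output):

```csharp
using System;
using System.Reflection.Emit;

class EmitSketch
{
    static void Main()
    {
        // Build a dynamic method with the predicate's signature: bool(int).
        var dm = new DynamicMethod("LessThan11", typeof(bool), new[] { typeof(int) });
        ILGenerator il = dm.GetILGenerator();

        il.Emit(OpCodes.Ldarg_0);     // push q onto the evaluation stack
        il.Emit(OpCodes.Ldc_I4, 11);  // push the constant 11
        il.Emit(OpCodes.Clt);         // 1 if q < 11, else 0
        il.Emit(OpCodes.Ret);         // return it as the bool result

        var predicate = (Func<int, bool>)dm.CreateDelegate(typeof(Func<int, bool>));
        Console.WriteLine(predicate(7));   // True
        Console.WriteLine(predicate(13));  // False
    }
}
```

Four opcodes: load the argument, load the constant, compare, return – exactly the "evaluate left, evaluate right, emit Clt" recipe described above.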

Enough code has now been generated to allow the production of the anonymous delegate that will be applied to the elements of the array of ints. At the beginning of SequenceQuery.GetEnumerator(), a lambda was produced from the (unmodified) expression tree in Figure 1. The lambda was passed to the Compile function, which invoked GenerateLambda and set off the whole recursive code generation process. Now the Compile method creates an ExecutionScope, and the DynamicMethod created during that recursive process is given the scope to create a delegate. It has access to all of the byproducts of the code generation process so far, stored in a LambdaInfo class.

A multicast delegate is then created, following the format of the DynamicMethod created at the beginning. That DynamicMethod was not just a template for later use, though: the dynamic method generated from the expression tree is added to the multicast delegate’s invocation list. We now have a delegate that we can call for each element in the collection, with a newly minted predicate from the expression tree attached to it.

The result of Compile is a delegate (as Example 9 shows) that backs the query’s enumerator. The foreach loop of Example 6 will invoke GetEnumerator, and the whole code generation process will kick in.
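From the outside, all of this machinery is wrapped up by a single public call in the shipped API – LambdaExpression.Compile, which I take to correspond to the ExpressionCompiler.Compile shown in Example 9 (an assumption; the internal plumbing may differ by release):

```csharp
using System;
using System.Linq.Expressions;

class CompileSketch
{
    static void Main()
    {
        // The expression tree for 'q < 11'...
        Expression<Func<int, bool>> predicate = q => q < 11;

        // ...compiled into a delegate backed by a DynamicMethod,
        // as the walkthrough above describes.
        Func<int, bool> compiled = predicate.Compile();

        Console.WriteLine(compiled(7));   // True
        Console.WriteLine(compiled(13));  // False
    }
}
```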

Conclusions

That’s it! The expression tree has been generated. It’s been used to create an anonymous method that was attached to a multicast delegate and called when the source data store is enumerated. Each element that matched the predicate defined in the expression tree was yield returned. When you consider what this whole process has achieved, you’ll see that it produces something that does the same as the Sequence.Where extension method that I described in my previous LINQ posts. The difference has been the use of expression trees to store the intent until the query needed to be interpreted. It is a lot more than a simple storable query – after all, storing a delegate would have achieved that. The point of all this abstraction is to provide an opportunity to supply our own implementation. There is no rule of LINQ that says we have to generate IL code in the way described here – we could generate SQL commands, as LINQ to SQL must do. In future posts I aim to show how you can provide your own interpreter of expression trees.

Some might argue that since C# 3.0 concepts can be translated into C# 2.0 easily enough, while C# 2.0 can’t be turned into C# 1.2, C# 2.0 is the more significant advance. After reading some of the code that backs up the functional programming changes in C# 3.0, I’m not so sure. I think it just demonstrates that the C# 2.0 platform was rich enough not to require significant low-level changes this time around.

BTW: I tried valiantly to get a peek at the original source for LINQ, but the LINQ team are holding their cards very close to their chests at the moment, so no dice. Consequently, this article is my best guess, and should be read with the caveats that the source may change prior to release, and that the abstractions may have clouded my view of the mechanism at work.