Simplex Sigillum Veri


Kingsley Idehen has again graciously given LinqToRdf some much-needed link-love. He mentioned it in a post that was primarily concerned with the issues of mapping between the ontology, relational and object domains. His assertion is that LinqToRdf, being an offshoot of an ORM-related initiative, is reversing the natural order of mappings. He believes that in the world of ORM systems, the emphasis should be on mapping from the relational to the object domain.

I think that he has a point, but not for the reason he’s putting forward. I think that the natural direction of mapping stems from the relative richness of the domains being mapped. The impedance mismatch between the relational and object domains stems from (1) the implicitness of meaning in the relationships of relational systems, (2) the differing representation of relationships, and (3) type mismatches.

If the object domain has greater expressiveness and explicit meaning in relationships, it has a ‘larger’ language than that expressible using relational databases. Relationships are still representable in the relational domain, but their meaning is implicit. For that reason you would have to confine your mappings to those that can be represented in the target (relational) domain. In that sense you get a priority inversion in which the lowest-common-denominator language controls what gets mapped.

The same form of inversion occurs between the ontological and object domains, only this time it is the object domain that is the lowest common denominator. OWL is able to represent such things as restriction classes and multiple inheritance and sub-properties that are hard or impossible to represent in languages like C# or Java. When I heard of the RDF2RDB working group at the W3C, I suggested (to thunderous silence) that they direct their attentions to coming up with a general purpose mapping ontology that could be used for performing any kind of mapping.

I felt that it would have been extremely valuable to have a standard language for defining mappings. Just off the top of my head I can think of the following places where it would be useful:

Object/Relational Mapping Systems (O/R or ORM)

Ontology/Object Mappings (such as in LinqToRdf)

Mashups (merging disparate data sources)

Ontology Reconciliation – finding intersects between two sets of concepts

You can see that most of these are perennial real-world problems that programmers are ALWAYS having to contend with. Having a standard language (and API?) would really help with all of these cases.

I think such an ontology would be a nice addition to OWL or RDF Schema, allowing a much richer definition of equivalence between classes (or groups or parts of classes). Right now one can define a one-to-one relationship using the owl:equivalentClass property. It’s easy to imagine that two ontology designers might approach a domain from such orthogonal directions that they find it hard to define any conceptual overlap between entities in their ontologies. A much more complex language is required to allow the reconciliation of widely divergent models.

I understand that by focusing their attentions on a single domain they increase their chances of success, but what the world needs from an organization like the W3C is the kind of abstract thinking that gave rise to RDF, not another mapping markup language!

Here’s a nice picture of how LinqToRdf interacts with Virtuoso (thanks to Kingsley’s blog).

As promised last time, I have extended the query mechanism of my little application with a LINQ Query Provider. I based my initial design on the method published by Bart De Smet, but have extended that framework, cleaned it up and tied it in with the original object deserialiser for SemWeb (a semantic web library by Joshua Tauberer).

In this post I’ll give you some edited highlights of what was involved. You may recall that last time I provided some unit tests that I was working with. For the sake of initial simplicity (and to make it easy to produce queries with SemWeb’s GraphMatch algorithm) I restricted my query language to conjunction and equality. Here’s the unit test that I worked with to drive the development process. What I produced last time was a simple scanner that went through my podcasts extracting metadata and creating objects of type Track.

This method queries the Tracks collection in an in-memory triple store loaded from a file in N3 format. It searches for any UC Berkeley podcasts produced in 2006, and performs a projection to create a new anonymous type containing the title and location of the files.
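The test described above would look something like the following reconstruction. This is a sketch, not the post’s actual code: the RdfContext class, the Track property names (Year, GenreName, Title, FileLocation) and the N3 file path are all illustrative.

```csharp
using System;
using System.Linq;
using SemWeb;

// A hedged reconstruction of the unit test described above.
public void TestQueryForUcBerkeleyPodcasts()
{
    MemoryStore store = new MemoryStore();
    store.Import(new N3Reader("tracks.n3"));   // in-memory store loaded from N3

    RdfContext ctx = new RdfContext(store);
    var q = from t in ctx.ForType<Track>()
            where t.Year == "2006" &&
                  t.GenreName == "History 5 | Fall 2006 | UC Berkeley"
            select new { t.Title, t.FileLocation };   // projection onto an anonymous type

    foreach (var track in q)
        Console.WriteLine("{0}: {1}", track.Title, track.FileLocation);
}
```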

I took a leaf from the book of LINQ to SQL to create the query object. In LINQ to SQL you indicate the type you are working with using a Table&lt;T&gt; class. In my query context class, you identify the type you are working with using a ForType&lt;T&gt;() method. This method instantiates a query object for you, and (in future) will act as an object registry to keep track of object updates.
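A minimal sketch of such a query context, following the description above (the class name RdfContext and the query class RdfN3Query are my guesses at the names; only ForType&lt;T&gt;() is named in the post):

```csharp
using System.Linq;
using SemWeb;

// A sketch of the query context described above.
public class RdfContext
{
    private readonly Store store;   // the SemWeb triple store

    public RdfContext(Store store)
    {
        this.store = store;
    }

    // Instantiates a query object bound to the store. In future this is where
    // an object registry for change tracking would live.
    public IQueryable<T> ForType<T>()
    {
        return new RdfN3Query<T>(store);
    }
}
```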

As you can see, it is pretty bare at the moment. It maintains a reference to the store, and instantiates query objects for you. But in future this would be the place to create transactional support, and perhaps maintain connections to triple stores. By and large, though, this class will be pretty simple in comparison to the query class that is to follow.

I won’t repeat all of what Bart De Smet said in his excellent series of articles on the production of LINQ to LDAP. I’ll confine myself to this implementation, and how it works. So we have to start by creating our Query Object:
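The skeleton of that query object might look like this (a sketch: the class name and field names are illustrative, but the stored state matches what the next paragraph describes):

```csharp
using System;
using System.Linq;
using SemWeb;

// A sketch of the query object. It stores the triple store and remembers the
// original type being queried, for the reasons explained below.
public class RdfN3Query<T> : IQueryable<T>
{
    private readonly Store store;   // kept for use at enumeration time
    private Type originalType;      // the type supplied to ForType<T>()
    private string query;           // the N3 fragment built from the expression tree

    public RdfN3Query(Store store)
    {
        this.store = store;
        this.originalType = typeof(T);
    }

    // CreateQuery, GetEnumerator and the rest follow in the post.
}
```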

First, it stores a reference to the triple store for later use. In a more real-world implementation this might be a URL or connection string, but for the sake of this implementation we can be happy with the MemoryStore that is used in the unit test. Next, we keep a record of the original type that is being queried against. This is important because later on you may also be dealing with a new anonymous type that will be created by the projection. That type will not have any of the Owl*Attribute classes with which to work out URLs for properties and to perform deserialisation.

The two most important methods in IQueryable&lt;T&gt; are CreateQuery and GetEnumerator. CreateQuery is the place where LINQ feeds you the expression tree that it has built from your initial query. You must parse this expression tree and store the resultant query somewhere for later use. I created a string called query to keep that in, and created a class called ExpressionNodeParser to walk the expression tree to build the query string. This is equivalent to the stage where the SQL SELECT query gets created in DLINQ. My CreateQuery looks like this:
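A sketch of what CreateQuery does, per the description that follows. Note that the May 2007 LINQ CTP that the post targets exposed the arguments of a MethodCallExpression through a Parameters collection (renamed Arguments in later releases); the property names on the new query object are illustrative.

```csharp
public IQueryable<S> CreateQuery<S>(Expression expression)
{
    // the projection may enumerate over a different (anonymous) type,
    // so a fresh query object is created for S
    RdfN3Query<S> newQuery = new RdfN3Query<S>(store);
    newQuery.OriginalType = originalType;   // survives across the projection

    MethodCallExpression call = (MethodCallExpression)expression;
    switch (call.Method.Name)
    {
        case "Where":   // the filter: parse it into the N3 query string
            newQuery.BuildQuery(call.Parameters[1]);
            break;
        case "Select":  // the projection onto a (possibly anonymous) type
            newQuery.BuildProjection(call);
            break;
    }
    return newQuery;
}
```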

You create a new query because you may be doing a projection, in which case the type you are enumerating over will not be the original type that you put into ForType&lt;T&gt;(); instead it may be the anonymous type from the projection. You transfer the vital information over to the new query object, and then handle the expression that has been passed in. I am handling two methods here: Where and Select. There are others I could handle, such as OrderBy or Take, but that will have to wait for a future post.

Where is passed the expression representing the query; Select is passed the tree representing the projection (if there is one). The work is passed off to BuildQuery and BuildProjection accordingly. These names were gratefully stolen from LINQ to LDAP.

BuildQuery in LINQ to LDAP is a fairly complicated affair, but in LINQ to RDF I have pared it right down to the bone.
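Pared down, it amounts to little more than this sketch (the Query property name is taken from the next paragraph; the rest is illustrative):

```csharp
// BuildQuery: create the StringBuilder, hand it to the recursive descent
// tree walker, and keep the assembled N3 fragment for later.
private void BuildQuery(Expression q)
{
    StringBuilder sb = new StringBuilder();
    ParseQuery(q, sb);        // walks the tree, appending N3 fragments
    Query = sb.ToString();    // stored for use when the query is enumerated
}
```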

We create a StringBuilder that can be passed down into the recursive descent tree walker to gather the fragments of the query as each expression gets parsed. The result is then stored in the Query property of the query object. BuildProjection looks like this:
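Something along these lines, following the LINQ to LDAP original it was adapted from (a sketch; the projection field and the Parameters collection of the May 2007 CTP are assumptions):

```csharp
// BuildProjection: compile the lambda passed to Select() so it can be applied
// to each deserialised result to produce the anonymous-typed objects.
private void BuildProjection(Expression expression)
{
    LambdaExpression le =
        ((MethodCallExpression)expression).Parameters[1] as LambdaExpression;
    if (le == null)
        throw new NotSupportedException("Unsupported projection expression");

    projection = le.Compile();   // a Delegate, invoked per result at enumeration time
}
```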

Much of it is taken directly from LINQ to LDAP. I have adapted it slightly because I am targeting the May 2007 CTP of LINQ. I’ve done this only because I have to use VS 2005 during the day, so I can’t use the March 2007 version of Orcas.

ParseQuery is used by BuildQuery to handle the walking of the expression tree. Again that is very simple since most of the work is now done in ExpressionNodeParser. It looks like this:
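Roughly like this (the Dispatch entry-point name on ExpressionNodeParser is my guess; only the class name comes from the post):

```csharp
// ParseQuery simply delegates the recursive descent to the node parser,
// which appends N3 fragments to the StringBuilder as it goes.
private void ParseQuery(Expression expression, StringBuilder sb)
{
    ExpressionNodeParser parser = new ExpressionNodeParser(sb);
    parser.Dispatch(expression);
}
```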

Each handler method then handles the root of the expression tree, breaking it up and passing on what it can’t handle itself. For example, the method AndAlso just takes the left and right side of the operator and recursively dispatches them:
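For AndAlso that dispatch looks something like this sketch. Since conjunction in a graph-match query is just the presence of both triples, no operator text needs to be emitted; each side is simply handled by whichever handler matches its node type:

```csharp
public void AndAlso(Expression e)
{
    BinaryExpression be = (BinaryExpression)e;
    Dispatch(be.Left);    // each side produces its own triple(s)
    Dispatch(be.Right);
}
```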

The equality expression can be formed either through the use of a binary expression with NodeType.EQ or as a MethodCallExpression on op_Equality for type string. If the handler for the MethodCallExpression spots op_Equality it passes the expression off to the EQ method for it to render instead. EQ therefore needs to spot which type of Node it’s dealing with to know how to get the left and right sides of the operation. In a BinaryExpression there are Right and Left properties, whereas in a MethodCallExpression these will be found in a Parameters collection. In our example they get the same treatment.

You’ll note that we assume that the left operand is a MemberExpression and the right is a ConstantExpression. That allows us to form clauses like this:
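For example (assuming, purely for illustration, that Track carries an integer Year and a string Title):

```csharp
// Two routes to the same EQ handler. The first comparison arrives as a plain
// BinaryExpression; the second, being between strings, arrives as a
// MethodCallExpression on string.op_Equality.
var byYear  = from t in ctx.ForType<Track>()
              where t.Year == 2006            // BinaryExpression
              select t;

var byTitle = from t in ctx.ForType<Track>()
              where t.Title == "History 5"    // op_Equality MethodCallExpression
              select t;
```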

Each of these cases will have to be handled individually, so the number of cases we need to handle can grow. As Bart De Smet pointed out, some of the operations might have to be performed after retrieval of the results since semantic web query languages are unlikely to have complex string manipulation and arithmetic functions. Or at least, not yet.

The QueryAppend method forms an N3 triple out of its parameters and appends it to the StringBuilder that was passed to the parser initially. At the end of the recursive tree walk, this StringBuilder is harvested and preprocessed to make it ready to pass to the triple store. In my previous post I described an ObjectDeserialiserQuerySink that was passed to SemWeb during the query process to harvest the results. This has been reused to gather the results of the query from within our query object.

I mentioned earlier that the GetEnumerator method was important to IQueryable. An IQueryable is a class that can defer execution of its query till someone attempts to enumerate its results. Since that’s done using GetEnumerator the query must be performed in GetEnumerator. My implementation of GetEnumerator looks like this:
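In outline it does the following (a sketch built from the description in the next paragraph; the three helper methods are named there, and the caching field is described as a List of the element type):

```csharp
// GetEnumerator is where the deferred query finally runs. 'result' caches the
// materialised objects so re-enumerating the query doesn't hit the store again.
public IEnumerator<T> GetEnumerator()
{
    if (result == null)
    {
        string queryText = ConstructQuery();   // assemble the final N3 text
        PrepareQueryAndConnection();           // placeholder for connection setup
        PresentQuery(queryText);               // runs it; the sink fills 'result'
    }
    return result.GetEnumerator();
}
```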

result is the List<TElement> variable where I cache the results for later use. What that means is that the query only gets run once. Next time the GetEnumerator gets called, result is returned directly. This reduces the cost of repeatedly enumerating the same query. Currently the methods ConstructQuery, PrepareQueryAndConnection, and PresentQuery are all fairly simple affairs that exist more as placeholders so that I can reuse much of this code for a LINQ to SPARQL implementation that is to follow.

As you’ve probably worked out, there is a huge amount of detail that has to be attended to, but the basic concepts are simple. The reason more people haven’t written LINQ query providers before now is simply the fact that there is no documentation about how to do it. When you try, though, you may find it easier than you thought.

There is a great deal more to do to LINQ to RDF before it is ready for production use, but as a proof of concept that semantic web technologies can be brought into the mainstream it serves well. The reason we use ORM systems such as LINQ to SQL is to help us overcome the impedance mismatch that exists between the object and relational domains. An equally large mismatch exists between the object and semantic domains. Tools like LINQ to RDF will have to overcome that mismatch in order to be used beyond basic domain models.

I’ve been off the air for a week or two – I’ve been hard at work on the final stages of a project at work that will go live next week. I’ve been on this project for almost 6 months now, and next week I’ll get a well earned rest. What that means is I get to do some dedicated Professional Development (PD) time which I have opted to devote to Semantic Web technologies. That was a hard sell to the folks at Readify, what with Silverlight and .NET 3 there to be worked on. I think I persuaded them that consultancies without SW skills will be at a disadvantage in years to come.

Anyway, enough of that – onto the subject of the post, which is the next stage of my mini-series about using semantic web technologies in the production of a little MP3 file manager.

At the end of the last post we had a simple mechanism for serialising objects into a triple store, with a set of services for extracting relevant information out of an object and tying it to predicates defined in an ontology. In this post I will show you the other end of the process: we need to be able to query against the triple store and get a collection of objects back.

The query I’ll show you is very simple, since the main task for this post is object deserialisation. Once we can shuttle objects in and out of the triple store, we can focus on beefing up the query process.

Querying the triple store

For this example I just got a list of artists for the user and allowed them to select one. That artist was then fed into a graph match query in SemWeb, to bring back all of the tracks whose artist matches the one chosen.

The query works in the usual way – get a connection to the data store, create a query, present it and reap the result for conversion to objects:

We’ll get to the ObjectDeserialiserQuerySink in a short while. The process of creating the query is really easy, given the simple reflection facilities I created last time. I’m using the N3 format for the queries for the sake of simplicity – we could just as easily have used SPARQL. We start with a prefix string to give us a namespace to work with, then enumerate the persistent properties of the Track type. For each property we insert a triple meaning “whatever track is selected – get its property as well”. Lastly, we add the artist name as a known fact, allowing us to specify exactly which tracks we are talking about.
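The generated query text would look roughly like this (the namespace URI, the property names and the artist are all illustrative; one variable triple per persistent property, plus the artist name as the known fact):

```n3
@prefix m: <http://example.org/ontologies/music#> .
?track m:title        ?Title .
?track m:artistName   ?ArtistName .
?track m:albumName    ?AlbumName .
?track m:year         ?Year .
?track m:fileLocation ?FileLocation .
?track m:artistName   "Rory Gallagher" .
```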

Having created a string representation of the query we’re after, we pass it to a GraphMatch object, which is a query where you specify a graph that acts as a prototype for the structure of the desired results. I also created a simple class called ObjectDeserialiserQuerySink:

For each match that the reasoner is able to find, a call gets made to the Add method of the deserialiser with a set of VariableBindings. Each of the variable bindings corresponds to solutions of the free variables defined in the query. Since we generated the query out of the persistent properties on the Track type the free variables matched will also correspond to the persistent properties of the type. What that means is that it is a straightforward job to deserialise a set of VariableBindings into an object.
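A sketch of that sink. It assumes SemWeb’s QueryResultSink lets you override an Add(VariableBindings) method and that a binding can be looked up by variable name; GetAllPersistentProperties stands in for the reflection helper from the previous post. None of these details beyond the class name are confirmed by the post.

```csharp
using System;
using System.Collections;
using System.Reflection;
using SemWeb.Query;

public class ObjectDeserialiserQuerySink : QueryResultSink
{
    private readonly Type originalType;
    public IList DeserialisedObjects = new ArrayList();

    public ObjectDeserialiserQuerySink(Type originalType)
    {
        this.originalType = originalType;
    }

    public override bool Add(VariableBindings result)
    {
        object t = Activator.CreateInstance(originalType);
        foreach (PropertyInfo pi in
                 OwlClassSupertype.GetAllPersistentProperties(originalType))
        {
            // free variables were named after persistent properties,
            // so each binding maps straight onto a property
            string value = result[pi.Name].ToString();
            pi.SetValue(t, Convert.ChangeType(value, pi.PropertyType), null);
        }
        DeserialisedObjects.Add(t);
        return true;   // keep accepting further matches
    }
}
```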

That’s it. We now have a simple triple store that we can serialise objects into and out of, with an easy persistence mechanism. But there’s a lot more that we need to do. Of the full CRUD behaviour I have implemented Create and Retrieve. That leaves Update and Delete. As we saw in one of my previous posts, that will be a mainly manual, programmatic task, since semantic web ontologies are to a certain extent static. What that means is that they model a domain as a never-changing body of knowledge about which we may deduce more facts, but where we can’t unmake (delete) knowledge.

The static nature of ontologies seems like a bit of a handicap to one who deals more often than not with transactional data, since it means we need more than one mechanism for dealing with data: deductive reasoning, and transactional processing. With the examples I have given up till now I have been dealing with in-memory triple stores, where the SemWeb API is the only easy means of updating and deleting data. When we are dealing with a relational database as our triple store, we will have the option of exploiting SQL as another tool for managing data.

For the mathematician there is no Ignorabimus, and, in my opinion, not at all for natural science either. … The true reason why [no one] has succeeded in finding an unsolvable problem is, in my opinion, that there is no unsolvable problem. In contrast to the foolish Ignorabimus, our credo avers:
We must know,
We shall know.

It’s that time of the month again – when all of the evangelically inclined mavens of Readify gather round to have the traditional debate. Despite the fact that they’ve had similar debates for years, they tend to tackle the arguments with gusto, trying to find a new angle of attack from which to sally forth in defence of their staunchly defended position. You may (assuming you never read the title of the post 🙂) be wondering what it is that could inspire such fanatical and unswerving devotion. What is it that could polarise an otherwise completely rational group of individuals into opposing poles that consider each other completely mad?

What is this Lilliputian debate? I’m sure you didn’t need to ask, considering it is symptomatic of the gaping wound in the side of modern software engineering. This flaw in software engineering is the elephant in the room that nobody talks about (although they talk an awful lot about the lack of space).

The traditional debate is, of course:

What’s the point of a database?

And I’m sure that there’s a lot I could say on the topic (there sure was yesterday) but the debate put me in a thoughtful mood. The elephant in the room, the gaping wound in the side of software engineering is just as simply stated:

How do we prove that a design is optimal?

That is the real reason we spend so much of our time rehearsing these architectural arguments, trying to win over the other side. Nobody gets evangelical about something they just know – they only evangelise about things they are not sure about. Most people don’t proclaim to the world that the sun will rise tomorrow. But like me, you may well devote a lot of bandwidth to the idea that the object domain is paramount, not the relational. As an object oriented specialist that is my central creed and highest article of faith. The traditional debate goes on because we just don’t have proof on either side. Both sides have thoroughly convincing arguments, and there is no decision procedure to choose between them.

So why don’t we just resolve it once and for all? The computer science and software engineering fraternity is probably the single largest focussed accumulation of IQ points gathered in the history of mankind. They all focus intensively on issues just like this. Surely it is not beyond them to answer the simple question of whether we should put our business logic into stored procedures or use an ORM product to dynamically generate SQL statements. My initial thought was “We Must Know, We Will Know” or words to that effect. There is nothing that can’t be solved given enough resolve and intelligence. If we have a will to do so, we could probably settle on a definitive way to describe an architecture so that we can decide what is best for a given problem.

Those of you who followed the link at the top of the post will have found references there to David Hilbert, and that should have given you enough of a clue to know that there’s a chance that my initial sentiment was probably a pipe dream. If you are still in the dark, I’m referring to Hilbert’s Entscheidungsproblem (or the Decision Problem in English) and I beg you to read Douglas Hofstadter’s magnificent Gödel, Escher, Bach – An eternal golden braid. This book is at the top of my all-time favourites list, and among the million interesting topics it covers, the decision problem is central.

The Decision Problem – a quick detour

One thing you’ll notice about the Entscheidungsproblem and Turing’s Halting Problem is that they are equivalent. They seem to be asking about different things, but at a deeper level the problems are the same. The decision problem asks whether there is a mechanical procedure to determine the truth of any mathematical statement. At the turn of the century they might have imagined some procedure that cranked through every derivation of the axioms of mathematical logic till it found a proof of the statement returning true. The problem with that brute-force approach is that mathematics allows a continual complexification and simplification of statements – it is non-monotonic. The implication is that just because you have applied every combination of the construction rules on all of the axioms up to a given length you can’t know whether there are new statements of the same length that could be found by the repeated application of growth and shrinkage rules that aren’t already in your list. That means that even though you may think you have a definitive list of all the true statements of a given length you may be wrong, so you can never give a false, only continue till you either find a concrete proof or disproof.

Because of these non-monotonic derivation rules, you can never be sure that no answer from your procedure means an answer of false; you will always have to wait and see. This is the equivalence between the Entscheidungsproblem and Alan Turing’s Halting Problem. If you knew your procedure would not halt, then you could just short-circuit the decision process and immediately answer false. If you knew that the procedure would halt, then you would just let it run and produce whatever true/false answer it came up with – either way, you would have a decision procedure. Unfortunately it’s not that easy, because the halting decision procedure has no overview of the whole of mathematics either, and can’t give an answer to the halting question. Ergo there is no decision procedure either. Besides, Kurt Gödel proved that there were undecidable problems, so the quest for a decision procedure was doomed to fail. He showed that even if you came up with a more sophisticated procedure than the brute-force attack, you would still never get a decision procedure for all mathematics.

The Architectural Decision Problem

What has this got to do with deciding on the relative merits of two software designs? Is the issue of deciding between two designs also equivalent to the decision problem? Is it a constraint optimisation problem, where you enumerate the critical factors, assign a rank to them and then sum the scores for each design? That is exactly what I did in one of my recent posts, entitled “The Great Domain Model Debate – Solved!” Of course the ‘Solved!’ part was partly tongue-in-cheek – I just provided a decision procedure for readers to distinguish between the competing designs of domain models.

One of the criticisms levelled at my offering for this problem was that my weights and scores were too subjective. I maintained that although my heuristic was flawed, it held the key to solving these design issues because there was the hope that there are objective measures of the importance of design criteria for each design, and it was possible to quantify the efficacy of each approach. But I’m beginning to wonder whether that’s really true. Let’s consider the domain model approach for a moment to see how we could quantify those figures.

Imagine that we could enumerate all of the criteria that pertain to the problem. Each represents an aspect of the value that architects place in a good design. In my previous post I considered such things as complexity, density of data storage, performance, maintainability etc. Obviously each of these figures varies in just how subjective it is. Complexity is a measure of how hard a design is to understand: one programmer may be totally at home with a design whereas another may be confused. But there are objective measures of complexity that we could use, and we could use complexity as an indicator of maintainability – the more complex a design is, the harder it will be to maintain.

This complexity measure would be more fundamental than any mere subjective measure, and would be tightly correlated with the subjective measure. Algorithmic complexity would be directly related to the degree of confusion a given developer would experience when first exposed to the design. Complexity affects our ability to remember the details of the design (as it is employed in a given context) and also our ability to mentally visualise the design and its uses. When we give a subjective measure of something like complexity, it may be due to the fact that we are looking at it from the wrong level. Yes, there is a subjective difference, but that is because of an objective difference that we are responding to.

It’s even possible to prove that such variables exist, so long as we are willing to agree that a subjective dislike that is completely whimsical is not worth incorporating into an assessment of a design’s worth. I’m thinking of knee-jerk reactions like ‘we never use that design here’ or ‘I don’t want to use it because I heard it was no good’. Such opinions, whilst strongly felt, are of no value, because they don’t pertain to the design per se but rather to a free-standing psychological state in the person who holds them. The design could still be optimal, but that wouldn’t stop them from having that opinion. Confusion, on the other hand, has its origin in some aspect of the design, and thus should be factored in.

For each subjective criterion that we currently use to judge a design, there must be a set of objective criteria that cause it. If there are none, then we can discount it – it contributes nothing to an objective decision procedure – it is just a prejudice. If there are objective criteria, then we can substitute all occurrences of the subjective criterion in the decision procedure with the set of objective criteria. If we continue this process, we will eventually be left with nothing but objective criteria. At that point are we in a position to choose between two designs?

Judging a good design

It still remains to be seen whether we can enumerate all of the objective criteria that account for our experiences with a design, and its performance in production. It also remains for us to work out ways to measure them, and weigh their relative importance over other criteria. We are still in danger of slipping into a world of subjective opinions over what is most important. We should be able to apply some rigour because we’re aiming at a stationary target. Every design is produced to fulfil a set of requirements. Provided those requirements are fulfilled we can assess the design solely in terms of the objective criteria. We can filter out all of the designs that are incapable of meeting the requirements – all the designs that are left are guaranteed to do the job, but some will be better than others. If that requires that we formally specify our designs and requirements then (for the purposes of this argument) so be it. All that matters is that we are able to guarantee that all remaining designs are fit to be used. All that distinguishes them are performance and other quality criteria that can be objectively measured.

Standard practice in software engineering is to reduce a problem to its component parts, and attempt to then compose the system from those components in a way that fulfils the requirements for the system. Clearly there are internal structures to a system, and those structures cannot necessarily be weighed in isolation. There is a context in which parts of a design make sense, and they can only be judged within that context. Often we judge our design patterns as though they were isolated systems on their own. That’s why people sometimes decide to use design patterns before they have even worked out if they are appropriate. The traditional debate is one where we judge the efficacy of a certain type of data access approach in isolation of the system it’s to be used in. I’ve seen salesmen for major software companies do the same thing – their marks have decided they are going to use the product before they’ve worked out why they will do so. I wonder whether the components of our architectural decision procedure can be composed in the same way that our components are.

In the context in which they’re to be used, will all subsystems have a monotonic effect on the quality of a system? Could we represent the quality of our system as a sum of the scores of the various sub-designs in the system, like this: (Q1 + Q2 + … + Qn)? That would assume that the quality of the system is the sum of the quality of its parts, which seems a bit naive to me – some systems will work well in combination, others will limit the damage of their neighbours, and some will exacerbate problems that would have lain dormant in their absence. How are we to represent the calculus of software quality? Perhaps the answer lies in the architecture itself. If you were to measure the quality of each unique path through the system, then you would have a way to measure the quality of that path as though it were a sequence of operations with no choices or loops involved. You could then sum the quality of each of these paths, weighted in favour of frequency of usage. That would eliminate subjective bias, and the impact of each sub-design would be proportional to the centrality of its role within the system as a whole. In most systems data access plays a part in pretty much all paths through a system, hence the disproportionate emphasis we place on it in the traditional debates.
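As a sketch of that path-weighted measure (my notation, not a claim from the debate): if a system has unique execution paths p1, …, pk, used with relative frequencies f1, …, fk, then

```latex
Q_{\text{system}} \;=\; \sum_{i=1}^{k} f_i \, Q(p_i),
\qquad \sum_{i=1}^{k} f_i = 1
```

where Q(p_i) is the quality of path p_i assessed as a straight-line sequence of operations. The weighting is what makes a heavily used sub-design (like data access) dominate the total.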

Scientific Software Design?

Can we work out what these criteria are? If we could measure every aspect of the system (data that gets created, stored, communicated, the complexity of that data etc) then we have the physical side of the picture – what we still lack is all of those thorny subjective measures that matter. Remember though that these are the subjective measures that can be converted into objective measures. Each of those measures can thus be added to the mix. What’s left? All of the criteria that we don’t know to ask about, and all of the physical measurements that we don’t know how to make, or don’t even know we should make. That’s the elephant in the room. You don’t know what you don’t know. And if you did, then it would pretty immediately fall to some kind of scientific enquiry. But will we be in the same situation as science and mathematics was at the dawn of the 20th Century? Will we, like Lord Kelvin, declare that everything of substance about software architecture is known and all the future holds for us is the task of filling in the gaps?

Are these unknown criteria like the unknown paths through a mathematical derivation? Are they the fatal flaw that unhinges any attempt to assess the quality of a design, or are they the features that turns software engineering into a weird amalgam of mathematics, physics and psychology? There will never be any way for us to unequivocally say that we have found all of the criteria that truly determine the quality of a design. Any criteria that we can think of we can objectify – but it’s the ones we can’t or don’t think of that will undermine our confidence in a design and doom us to traditional debates. Here’s a new way to state Hilbert’s 10th Problem:

Is there a way to fully enumerate all of the criteria that determine the quality of a software design?

Or to put it another way

Will we know when we know enough to distinguish good designs from bad?

The spirit of the enlightenment is fading. That much is certain. The resurgence of religiosity in all parts of the world is a backward step. It pulls us away from that pioneering spirit that Kant called a maturing of the human spirit. Maturity is no longer needing authority figures to tell us what to think. He was talking about the grand program to roll back the stifling power of the church. In software design we still cling to the idea that there are authority figures that are infallible. When they proclaim a design as sound, then we use it without further analysis. Design patterns are our scriptures, and traditional wisdom the ultimate authority by which we judge our designs. I want to see the day when scientific method is routinely brought to bear on software designs. Only then will we have reached the state of maturity where we can judge each design on its objective merits. I wonder what the Readify Tech List will be like then?

I’ve recently been learning more about the OWL web ontology language in an attempt to find a way to represent SPARQL queries in LINQ. SPARQL and LINQ are very different, and the systems that they target are also dissimilar. Inevitably, it’s difficult to imagine the techniques to employ in translating a query from one language to the other without actually trying to implement the LINQ extensions. I’ve got quite a long learning curve to climb. One thing is clear though: OWL, or some system very similar to it, is going to have a radical impact both on developers and on society at large. The reason I haven’t posted in the last few weeks is that I’ve been writing an article/paper for publication in some related journal. I decided to put up the abstract and the closing remarks of the paper here, just so you can see what I’ll be covering in more depth on this blog in the future: LINQ, OWL, & RDF.

PS: If you are the editor of a journal that would like to publish the article, please contact me by email and I’ll be in touch when it’s finished. As you will see, the article is very optimistic about the prospects of the semantic web (despite its seeming lack of uptake outside of academia), and puts forward the idea that with the right fusion of technologies and environments the semantic web will have an even faster and more pervasive impact on society than the world wide web. This is all based on the assumption that LINQ the query language is capable of representing the same queries as SPARQL.

Introduction

In recent weeks I’ve been decompiling LINQ with Reflector to try to better understand how expression trees get converted into code. I had some doubts about Reflector’s analysis capabilities, but Matt Warren and Nigel Watson assure me that it can resolve everything that gets generated by the C# 3.0 compiler. I am going to continue disassembling typical usage of LINQ to Objects, and will use whatever tools are available to let me peer beneath the hood. I’ll follow the flow of control from the creation of a query to getting the first object out of that query. At least that way I’ll know if there’s something fishy going on in LINQ or in Reflector.

What I’ve found from my researches is that there is a lot going on under the hood of LINQ. To begin to understand how LINQ achieves what it does, we will need to understand the following:

What the C# 3.0 compiler does to your queries.

Building and representing expression trees.

Code generation for anonymous types, delegates and iterators.

Converting expression trees into IL code.

What happens when the query is enumerated.

I’ll try to answer some of these questions. I’ve already made a start in some of my earlier LINQ posts to give an outline of the strategies LINQ uses to get things done. As a rule, it tends to use a lot of IL code generation to produce anonymous types to do its bidding. In this post, I’ll try to show you how it generates expression trees and turns them into code. In some cases, to prevent you from falling asleep, I’ll have to gloss over the details a little. I read through early drafts of this post and had to admit that describing every line of code was pretty much out of the question. I hope that at the very least, you’ll come away with a better idea of how to use LINQ.

What happens to your Queries

The way your queries are compiled depends on whether your data source is an enumerable sequence or a queryable. The handling for IEnumerable is much simpler and more immediate than for IQueryable. I covered much of what happens in a previous post. The next section will show you how queries look behind the scenes.

Querying a Sequence

To illustrate what happens when you write a query in LINQ, I produced a couple of test methods with different types of query. Here’s the first:

Example 1

private static void Test3()
{
    int[] primes = { 1, 3, 13, 17, 23, 5, 7, 11 };
    var smallPrimes = from q in primes
                      where q < 11
                      select q;
    foreach (int i in smallPrimes)
    {
        Debug.WriteLine(i.ToString());
    }
}

It enumerates an array of integers using a clause to filter out any that are greater than or equal to 11. The bit I’m going to look at is

Example 2

from q in primes where q < 11 select q

which is a query that we store in the variable smallPrimes. When this gets compiled, the C# 3.0 compiler deduces the type of smallPrimes by working back through the query, from primes through the output of where to the output of select. The type flows through the various invocations of the extension methods to end up as an IEnumerable<int>. In case you didn’t know, the new query syntax of C# vNext is just a bit of (very nice) syntactic sugar hiding the invocation of extension methods. Example 2 is equivalent to

Example 3

primes.Where(delegate(int a) { return a < 11; });

It has been translated into familiar C# 2.0 syntax by the C# 3.0 compiler. By the time the C# 3.0 compiler has gotten through with Example 2, the code has been converted into this:

Most of it looks the same as before, but the LINQ query has been expanded out into two private static fields on the Program class, called Program.<>9__CachedAnonymousMethodDelegate2. The first is a generic specification of Func<T, bool> and the other is a specialisation for type int (i.e. Func<int, bool>), which is how Test3 will use it. The anonymous delegate is initialised with Program.<Test3>b__0, which is a simple one-line method:

Example 5

[CompilerGenerated]
private static bool <Test3>b__0(int q)
{
    return (q < 11);
}

You can see that the query has been transformed into the code needed to test whether elements coming from an enumerator are less than 11. My previous post explains what happens inside of the call to Where (It’s a call to the static extension method Sequence.Where(this IEnumerable<T>)).
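To make the mechanism concrete, here is a minimal sketch of what an iterator-based Where looks like. This is my own sketch in the spirit of Sequence.Where, not the framework’s source:

```csharp
using System;
using System.Collections.Generic;

static class SequenceSketch
{
    // A minimal sketch of an iterator-based Where, in the spirit of
    // Sequence.Where - not the actual framework implementation.
    public static IEnumerable<T> Where<T>(this IEnumerable<T> source,
                                          Func<T, bool> predicate)
    {
        foreach (T item in source)
            if (predicate(item))
                yield return item;   // lazily yield only matching elements
    }

    static void Main()
    {
        int[] primes = { 1, 3, 13, 17, 23, 5, 7, 11 };
        foreach (int i in primes.Where(x => x < 11))
            Console.WriteLine(i);   // prints 1, 3, 5 and 7
    }
}
```

Because the body is an iterator, nothing is evaluated until MoveNext is called on the resulting enumerator – which is why the filtering in Example 1 only happens when the foreach loop runs.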

Querying a SequenceQuery

As you’ve probably noticed, the previous example is a straightforward example of LINQ in its guise as a nice way of filtering over an enumerable. It’s all done using IEnumerable<T>. It’s easy enough to prove that to ourselves, since we can substitute IEnumerable<int> in place of var. There are no repeatable queries or expression trees here – the enumerator that gets stored in smallPrimes is generated in advance by the compiler and is just enumerated in the conventional way. The enumerator in Sequence is different from the one used by SequenceQuery – in code generated from a SequenceQuery, the elements from the query are not stored in a private sequence field.

Let’s see what happens when we convert the IEnumerable<int> into an IQueryable<int>. It’s pretty easy to do: just invoke ToQueryable on your source collection. That creates a SequenceQuery; thereafter, all of the extension methods will just elaborate an expression tree.

Example 6

private static void Test4()
{
    int[] primes = { 1, 3, 13, 17, 23, 5, 7, 11 };
    IQueryable<int> smallPrimes = from q in primes.ToQueryable()
                                  where q < 11
                                  select q;
    foreach (int i in smallPrimes)
    {
        Debug.WriteLine(i.ToString());
    }
}

I converted the array of integers into an IQueryable<int> using the extension method ToQueryable(). This extension method is fairly simple: it creates a SequenceQuery out of the enumerator it got from the array. I cover some of the capabilities of SequenceQuery in this post. This is what the test method looks like now:

Quite a difference in the outputs! The call to ToQueryable has led the compiler to generate altogether different output. It inlined the primes collection, converted smallPrimes into a SequenceQuery, and created a lambda Expression containing a BinaryExpression for the less-than comparison, rather than a simple anonymous delegate. As we know from the outline in this post, the Expression will eventually get converted into an anonymous delegate through a call to System.Reflection.Emit.DynamicMethod. That bit happens later on, when the IEnumerator<T> is requested by a call to GetEnumerator on smallPrimes (in the foreach statement).
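You can see for yourself that the query is now held as data rather than code by inspecting the Expression property of the queryable. A small sketch (note that in the released framework the preview’s ToQueryable became AsQueryable, which is what I use here):

```csharp
using System;
using System.Linq;
using System.Linq.Expressions;

class InspectQuery
{
    static void Main()
    {
        int[] primes = { 1, 3, 13, 17, 23, 5, 7, 11 };

        // AsQueryable is the released name for the preview's ToQueryable.
        IQueryable<int> smallPrimes = from q in primes.AsQueryable()
                                      where q < 11
                                      select q;

        // No filtering has happened yet - the query is a tree whose root is
        // the MethodCallExpression for the call to Where.
        Console.WriteLine(smallPrimes.Expression.NodeType);   // Call
        Console.WriteLine(smallPrimes.Expression);
    }
}
```

The second WriteLine dumps the whole tree; the important point is that until the query is enumerated, all you hold is that data structure.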

Building the Expression Tree

This section describes how the expression tree for a query gets built. I will guide you through a simple example of how the tree is constructed from the query in Example 6. The sequence of nested calls that eventually produces smallPrimes each creates a node for insertion into the tree. By tracing through the calls to Queryable.Where, Queryable.ToQueryable and Expression.Lambda we can see that it constructs a tree as in Figure 1 below. It seems large for a simple query but, as Anders Hejlsberg pointed out, the elements of these trees are tiny (around 16 bytes each), so we get a lot of bang for our buck. The root of the tree, after calling Where, is a MethodCallExpression. In my previous post, I showed some of what happens when you try to get an enumerator from a SequenceQuery – an anonymous type is generated that iterates through the collection, deciding whether or not to yield elements depending on the result from the predicate. This time I have a more or less accurate expression tree to work with, and I’ll explore how GetEnumerator in SequenceQuery generates code by walking the expression tree.

If you are already familiar with the ideas of abstract syntax trees (ASTs) or object query languages, you could probably skip this part. If you’re one of those hardy souls who can withstand any quantity of detail, no matter how dry, then refer to the sidebar for a more detailed breakdown of how an expression tree gets created.

Generating Code from an Expression Tree

This section describes what the LINQ runtime does with the expression tree to convert it into code. I think that the code generation phase is the most important part to understand in LINQ. It’s a tour de force of ingenuity that effectively converts C# into a strongly typed, declarative scripting language. It brings a lot of new power to the language. If you’re interested in seeing how the CodeDOM ought to be used, you couldn’t find a better example. It’s a must for anyone interested in code generation.

The code generation phase begins when the user calls GetEnumerator on the SequenceQuery. Until that point, all you have is a tree-shaped data structure that declares the kind of results you are after and where you want to get them from. This data structure can’t do anything. But when it is compiled, it suddenly gains power. LINQ interprets what you want, and generates the code to find it. That power is what I wanted to understand when I started digging into LINQ with Reflector. I’d built a couple of ORM systems in the past, so I had an inkling of what might get done with the expression trees – you have to turn them into queries that are comprehensible to the data source. That is easy enough with an ORM system, but how can you use the same code to query in-memory objects as for database-bound ones? Well, you don’t use the same code – the extension methods just make it seem like that is what is happening. In truth, the code that processes your query is very different for each type of data source – the common syntax of LINQ hides different query engines.

The algorithm of the code generation system is simple:

Create a dynamic method (DM) as a template for later code generation.

Find all parameters in the expression tree and use them as parameters of the method.

Generate the body of the dynamic method:

Walk the expression tree.

Generate IL for each node according to its type – e.g. a MethodCallExpression will yield IL to call a method, whereas a less-than BinaryExpression will compare its two operands using the Clt opcode.

Create a multicast delegate to store the dynamic method in.

Iterate the source collection, passing each element to the multicast delegate – if it returns true, yield the element; otherwise, ignore it.
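The whole pipeline – build a lambda, compile it into a dynamic method, then filter with the resulting delegate – can be exercised from the public API side by constructing the same tree by hand with the Expression factory methods and asking for it to be compiled. A minimal sketch:

```csharp
using System;
using System.Linq.Expressions;

class TreeDemo
{
    static void Main()
    {
        // Build the tree for q => q < 11 by hand, mirroring what the
        // compiler emits for the where clause.
        ParameterExpression q = Expression.Parameter(typeof(int), "q");
        Expression<Func<int, bool>> lambda =
            Expression.Lambda<Func<int, bool>>(
                Expression.LessThan(q, Expression.Constant(11)), q);

        // Compile walks the tree and emits IL into a dynamic method.
        Func<int, bool> predicate = lambda.Compile();

        int[] primes = { 1, 3, 13, 17, 23, 5, 7, 11 };
        foreach (int i in primes)
            if (predicate(i))
                Console.WriteLine(i);   // prints 1, 3, 5 and 7
    }
}
```

The call to Compile is where the algorithm above runs: the tree is walked, IL is emitted for each node, and a delegate wrapping the dynamic method is handed back.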

Inside GetEnumerator for the query, the ExpressionCompiler class is invoked to create the code. It has a Compile method that performs the algorithm above. The Compile method initially generates a top-level LambdaExpression. This lambda expression is a specification for a function – it tells the code generator what parameters the function needs to take, and how the result of the function is to be calculated. As in the lambda calculus, these functions are nameless entities that can be nested. What that means for developers is that we can compose new queries from other queries. That is ideal for ad hoc query generators that keep adding criteria to a stored query until the user hits search.
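That compositional style is what makes ad hoc query builders straightforward – criteria can be stacked onto the stored query one at a time before anything runs. A sketch (using AsQueryable, the released name for the preview’s ToQueryable):

```csharp
using System;
using System.Linq;

class ComposeDemo
{
    static void Main()
    {
        int[] primes = { 1, 3, 13, 17, 23, 5, 7, 11 };

        // Nothing executes until enumeration, so each Where call just
        // grows the expression tree held by the queryable.
        IQueryable<int> query = primes.AsQueryable();
        query = query.Where(q => q < 11);
        query = query.Where(q => q % 2 == 1);

        foreach (int i in query)
            Console.WriteLine(i);   // prints 1, 3, 5 and 7
    }
}
```

Each assignment wraps the previous tree in a new MethodCallExpression; only the foreach loop triggers compilation of the combined predicate.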

Example 9

internal Delegate Compile(LambdaExpression lambda)
{
    this.lambdas.Clear();
    this.globals.Clear();
    int num2 = this.GenerateLambda(lambda);
    ExpressionCompiler.LambdaInfo info2 = this.lambdas[num2];
    ExpressionCompiler.ExecutionScope scope2 =
        new ExpressionCompiler.ExecutionScope(
            null,
            lambda,
            this.lambdas.ToArray(),
            this.globals.ToArray(),
            info2.HasLiftedParameters);
    return info2.Method.CreateDelegate(lambda.Type, scope2);
}

The first important thing that the Compile method does is create the top-level lambda. As Example 10 shows, it first creates a compilation scope. The scope defines the visibility of variables in the method that is to be generated. We know from Example 9 that it maintains a couple of collections called lambdas and globals. Lambdas defines the parameters to the method (and, recursively, does the same for any sub-method calls buried deeper in the expression tree). Globals maintains a list of references that will be visible to all dynamic methods.

Next, the code generator creates the outline of an anonymous delegate using a DynamicMethod. It then goes on to generate the body of the anonymous delegate. The call to GenerateInitLiftedParameters should be familiar from my previous posts – it emits code for loading the parameters of the lambda onto the evaluation stack.

Next, GenerateLambda will create the body for the anonymous method that has just been created. It uses the same code generator, so the body is inserted into the method as it goes along. It recurses through the expression tree to generate the code for the expression. As each element is visited, code is generated for the node, and then each subtree or leaf node is also passed to ExpressionCompiler.Generate. ExpressionCompiler.Generate is a huge switch statement that passes control off to a method dedicated to each node type. In processing the expression shown in Figure 1 we will end up calling GenerateLambda, GenerateBinaryExpression, and GenerateConstant. Each of these methods emits a bit of the IL needed to flesh out the body of the DynamicMethod.

In Generate, the parameters of the lambda are each passed through the code in Example 11:

This piece walks the expression tree. The parameters of the lambda expression contain the terms of the comparison function to be performed (i.e. things like ‘int q‘ and ‘q < 11‘). Eventually the generator will reach the sub-expressions of the lambda – the LT BinaryExpression. GenerateBinary will be called on the ExpressionCompiler. I won’t include the code for that here – it’s 450 lines long, essentially a giant switch statement on the operator type. In this case it’s ExpressionType.LT. As a result the generator produces code for evaluating the left and right sides of the operation, then emits a Clt opcode that performs the signed comparison, leaving 1 or 0 on the evaluation stack.
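The end product is nothing more exotic than a DynamicMethod with hand-emitted IL. A minimal sketch of the equivalent for ‘q < 11‘, written directly against Reflection.Emit, looks like this:

```csharp
using System;
using System.Reflection.Emit;

class EmitDemo
{
    static void Main()
    {
        // Hand-rolled version of what ExpressionCompiler produces for
        // q => q < 11: a dynamic method whose body loads q and the
        // constant 11, then compares them with Clt.
        var dm = new DynamicMethod("LessThan11", typeof(bool),
                                   new[] { typeof(int) });
        ILGenerator il = dm.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);      // push q
        il.Emit(OpCodes.Ldc_I4, 11);   // push the constant 11
        il.Emit(OpCodes.Clt);          // leaves 1 if q < 11, else 0
        il.Emit(OpCodes.Ret);

        var predicate =
            (Func<int, bool>)dm.CreateDelegate(typeof(Func<int, bool>));
        Console.WriteLine(predicate(7));    // True
        Console.WriteLine(predicate(13));   // False
    }
}
```

The 1 or 0 left on the evaluation stack by Clt is exactly the bool the delegate returns.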

Enough code has now been generated to allow the production of the anonymous delegate, which is applied to the elements of the array of ints. At the beginning of SequenceQuery.GetEnumerator(), a lambda was produced from the (unmodified) expression tree in Figure 1. The lambda was passed to the Compile function, which invoked GenerateLambda, which set off the whole recursive code generation process. Now the Compile method creates an ExecutionScope, and the DynamicMethod created during the recursive code generation is given the scope from which to create a delegate. It has access to all of the by-products of the code generation process so far, stored in a LambdaInfo class.

A multicast delegate is then created following the format of the DynamicMethod created at the beginning. That DynamicMethod was not just a template for later use, though – the dynamic method created from the expression tree is added to the multicast delegate’s invocation list. We now have a delegate that we can call for each element in the collection, with a newly minted predicate from the expression tree attached to it.

The result of Compile is the delegate that gets applied to each element. The foreach loop of Example 6 will invoke GetEnumerator, and the whole code generation process will kick in.

Conclusions

That’s it! The expression tree has been generated and used to create an anonymous method, attached to a multicast delegate that is called when the source data store is enumerated. Each element that matches the predicate defined in the expression tree is yield returned. When you consider what this whole process has achieved, you’ll see that it produces something that does the same as the Sequence.Where extension method that I described in my previous LINQ posts. The difference has been the use of expression trees to store the intent until the query needed to be interpreted. That is a lot more useful than a simple storable query – after all, storing a delegate would have achieved that. The point of all this massive abstraction is to provide an opportunity to supply our own implementation. There is no rule of LINQ that says we have to generate IL code in the way described here – we could generate SQL commands, as must be done by LINQ to SQL. In future posts I aim to show how you can provide your own interpreter of expression trees.
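To give a taste of what such an interpreter starts from, here is a minimal sketch of walking a tree yourself. The walker and its node handling are my own simplification (later framework versions expose a public ExpressionVisitor base class that does this job properly):

```csharp
using System;
using System.Collections.Generic;
using System.Linq.Expressions;

static class TreeWalker
{
    // A minimal sketch of a tree walk - the seed of any custom interpreter.
    // Only the node types from this post's examples are handled; everything
    // else is treated as a leaf.
    public static void Walk(Expression e, ICollection<ExpressionType> visited)
    {
        visited.Add(e.NodeType);
        LambdaExpression lambda = e as LambdaExpression;
        if (lambda != null) { Walk(lambda.Body, visited); return; }
        BinaryExpression binary = e as BinaryExpression;
        if (binary != null)
        {
            Walk(binary.Left, visited);
            Walk(binary.Right, visited);
            return;
        }
        MethodCallExpression call = e as MethodCallExpression;
        if (call != null)
            foreach (Expression arg in call.Arguments)
                Walk(arg, visited);
    }

    static void Main()
    {
        Expression<Func<int, bool>> lessThan11 = q => q < 11;
        var visited = new List<ExpressionType>();
        Walk(lessThan11, visited);
        foreach (ExpressionType t in visited)
            Console.WriteLine(t);   // Lambda, LessThan, Parameter, Constant
    }
}
```

Swap the Console.WriteLine for SQL fragment emission and you have the skeleton of a LINQ to SQL style translator.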

Some might argue that since C# 3.0 concepts can be translated into C# 2.0 easily enough, yet C# 2.0 can’t be turned into C# 1.2, C# 2.0 is the more significant advance. After reading some of the code that backs up the functional programming changes in C# 3.0, I’m not so sure. I think it just demonstrates that the C# 2.0 platform was rich enough not to require significant low-level changes this time around.

BTW: I tried valiantly to get a peek at the original source for LINQ, but the LINQ team are holding their cards very close to their chests at the moment, so no dice. Consequently, this article is my best guess, and should be taken under advisement that the source may change prior to release, and the abstractions may have clouded my view of the mechanism at work.

No posts have been forthcoming on this blog, so far, but that doesn’t mean I’ve been idle. Far from it! I’ve been hard at work till late at night working on the tidying up of the source code. Part of that involved coming up with a new name for the system. There is already a system on GotDotNet (seems like an orphan, but who can tell) called Norm, that is a .NET ORM system.

Anyway, Koan saw first light today, with end-to-end operations retrieving a collection of data from the SqlServer 2000 Northwind database. That’s more of an achievement than you might think. It involves reading schema data from the database, generation of a domain model, construction of object-based queries, dynamic creation of targeted SQL queries for the target database, presentation of the target query to the back-end API (ADO.NET/XML RAW in this case), retrieval of the data from the database, deserialisation of the data into domain objects, and registration of those objects in an instance registry for sharing across AppDomain boundaries. All in all, things are going well. Most of the central code of the system now passes FxCop’s analysis (at least in terms of naming conventions, validation, etc), so the code is way more readable than a month ago. I have also been working on Oaf, the Orm Abstraction Framework – Object Queries on Steroids!

There’s still a long way to go. Pressing tasks involve targeting of OleDb rather than SqlClient APIs that will enable you to access anything for which a data provider has been written, specifically Access and MySql, the two other platforms I really want Koan to work on. That will take some time to achieve, but will be worth it in the end.