ExceLINQ – Not Your Typical LINQ Provider

Introduction

On my last trip I had the opportunity to talk on the subject of LINQ once more. Geeky as we are, this time’s session title was “LINQ in breadth”, an orthogonal view on LINQ compared to my “LINQ in depth” talk from last year.

But what gives LINQ a certain breadth? The simple answer: its spectrum of possible LINQ “provider” implementations. Far too often people think LINQ is only extensible by means of the IQueryable&lt;T&gt; interface, which makes you opt in to every imaginable query operator. That’s one useful side of the LINQ spectrum (albeit an overkill approach most of the time). The opposite end of the spectrum is, ironically, IEnumerable&lt;T&gt;. I say ironically because in theory both interfaces are equivalent (I should really go into monads once more, but let’s suppress that urge). It’s just that making something LINQable is trivial if one can expose the data as an IEnumerable&lt;T&gt;, because from that point on one can fall back to LINQ to Objects (and that’s precisely what LINQ to XML does).

The rest of the breadth comes from all the intermediate implementations that are possible. I’ve mentioned before that, from a language perspective, LINQ is a tiny syntactical layer on top of the rest of the language. To convince yourself of that, look at my C# 3.0 Query Expression Translation Cheat Sheet. All you’re seeing there is a kind of fixpoint algorithm that gradually compiles away the C# 3.0 language’s knowledge of query comprehension syntax until nothing but method calls and operands remain.
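To make that translation concrete, here’s a tiny self-contained example (my own, not taken from the cheat sheet) showing query comprehension syntax next to what it compiles down to:

```csharp
using System;
using System.Linq;

class TranslationDemo
{
    static void Main()
    {
        int[] src = { 5, 1, 4, 2, 3 };

        // Query comprehension syntax...
        var q1 = from x in src
                 where x > 2
                 orderby x
                 select x * 10;

        // ...is nothing but syntactical sugar over method calls and lambdas:
        var q2 = src.Where(x => x > 2).OrderBy(x => x).Select(x => x * 10);

        Console.WriteLine(string.Join(",", q1.Select(i => i.ToString()).ToArray())); // 30,40,50
        Console.WriteLine(string.Join(",", q2.Select(i => i.ToString()).ToArray())); // 30,40,50
    }
}
```

Once the rewrite has happened, the compiler knows nothing about queries anymore; overload resolution against whatever Where, OrderBy and Select it can find does the rest.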

All of this means we can party along on the syntactical sugar that LINQ provides us, as shown in my “Who ever said LINQ predicates need to be Boolean-valued?” post. It turns out I’ve talked about this spectrum theory over there too. All of this can be used for samples of mere technical-curiosity value, like LINQ to Simpsons (see below), but also to create genuinely useful stuff.

(Exercise for the reader: what’s the type of x in the fragment above?)

The genesis of ExceLINQ

So, while preparing for my demos, I was thinking about potential samples of non-obvious but nevertheless useful LINQ implementations. Rather by accident I came up with the idea of bridging Excel to LINQ. Here’s a short story on how that idea came along. One evening (night, according to the post’s publication time) I was writing up a blog entry on “The M Programming Language – Part 0 – Intellipad, MrEPL, VS2008 project support and Excel integration” (more posts on that topic to follow soon). As I’m not the kind of guy that spontaneously opens Excel (which doesn’t mean I don’t like the product; I just don’t have a job that requires overdoses of spreadsheet interaction on a day-to-day basis), I took the opportunity to browse a bit more through the product in its current form. The last screenshot is where the fascination kicked in:

Clicking the little arrows on the column headers reveals a whole new world (I admit I was exposed to it in a distant past, but it was a pleasant déjà vu as you can tell; otherwise this post wouldn’t exist):

Query operators, anyone? As I already had a C# project open to prepare C# 4.0 demos on improved COM interop (posts on that to follow too), I decided to explore the Excel automation APIs a bit more. Here are a few of the interesting methods I came across:

_Application.Union – usable for simple column projections; a COM interop beauty with no less than 30 parameters, 28 of which are optional (luckily C# 4.0 will help out here – yes, you can say “finally” if you want)

Great, lots of stuff that’s applicable to querying. But let’s not settle for the obvious: a LINQ binding that results in yet another IEnumerable&lt;T&gt;, where T is an entity type representing a set of strongly-typed columns (complete with an ExcelMetal.exe export tool; sounds familiar, doesn’t it?). To illustrate the flexibility of LINQ I wanted the query to return something non-obvious. What about an Excel chart? In preparing my C# 4.0 talk I’d been exposed to Office interop code to create a chart in Excel, so that was certainly doable too.

Ah beautiful – it was the first time I didn’t curse myself for going on a speaking tour with 7 distinct talks. After all my M talk led me to dive into Excel, link it to C# 4.0 COM interop to end up with something applicable for my LINQ talk. And obviously I added PowerShell to the ingredients by showing off LINQ through PowerShell during that talk too :-).

The mission statement for ExceLINQ

Another great distinction between developers and managers: the latter need to open up Word to come up with a mission statement, while the former open up their favorite IDE to cook up a prototype. So, in the true sense of “say it with code”, here’s what I want to accomplish:

Or more graphically, a press of the button results in a chart being rendered:

We could go even further and allow any IEnumerable<T> to be imported into Excel transparently in order to create a chart, but in the sample above I’m simply using an existing Excel file (which I populated manually using Excel interop code). But hey, it’s a proof-of-concept.

The construction of ExceLINQ

There’s a magical touch to ExceLINQ. The query shown above does look like a regular LINQ query, yet there’s a bunch of stuff going on even in just declaring it. What if I told you that the &gt; operator in the first where clause is not the one you’d expect? And what if I told you that none of the query operators you’re seeing is defined in the usual spots (like Enumerable or Queryable)? Indeed, everything is custom in ExceLINQ: à la carte cuisine, no microwave food.

Regular readers of this blog will be bored by what follows in this paragraph (hopefully not by the things coming in the subsequent ones), but the query fragment above translates into:
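The translated fragment itself isn’t reproduced here, but based on the operators discussed in the remainder of this post it had this shape (an illustrative reconstruction on a single line, not the verbatim demo code; the ordering key is my guess):

```csharp
// Illustrative reconstruction; Proc, Column and the operators are the
// ExceLINQ types covered in this post.
var res = src.Where(proc => proc.WorkingSet > 20 * 1024 * 1024).OrderBy(proc => proc.WorkingSet).Select(proc => new Column[] { proc.Name, proc.WorkingSet });
```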

I kept everything on one line deliberately to emphasize what a fluent interface LINQ is. At this point there’s no query comprehension syntax left and the compiler falls back to the well-known rules for method overload resolution, processing from left to right (duh). In other words: given src, which is of type ExcelSource&lt;Proc&gt;, can we find a method called Where that takes one argument with type …:

src.Where(proc => proc.WorkingSet > 20 * 1024 * 1024)

Right, but what’s the type of the argument? We can only tell once we know the type of proc, so we can compute the type of the lambda body. One thing’s sure: the “shape” of the argument is required to be a Func&lt;A, B&gt;, a function from some type to some other type (type theorists will see a relationship to the concept of kinds).

All we need to take from this discussion is that we can provide our own Where method (possibly an instance method, or an extension method, though there’s no reason to choose the latter option in this case) and as long as it has a signature of the right shape, the compiler will be happy. Let’s reveal its signature now:

public Query<T> Where(Func<T, Filter> filter)

We see the generic parameter T comes from the surrounding type, ExcelSource<T>. Obviously T is meant to be substituted for an entity type, a strongly-typed representation of the columns in the worksheet. This mapping is nothing fancy:
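The entity type listing doesn’t appear here; a hedged reconstruction of what such a mapping looks like (the Sheet attribute name and the Ordinal property are my assumptions; ColumnAttribute is mentioned further on in this post):

```csharp
// Assumed mapping attributes: Sheet correlates the entity type to a
// worksheet, Column correlates a property to a column in that sheet.
[Sheet(Name = "Processes")]
public class Proc
{
    [Column(Name = "Name", Ordinal = 1)]
    public StringColumn Name { get; set; }

    [Column(Name = "WorkingSet", Ordinal = 2)]
    public IntColumn WorkingSet { get; set; }
}
```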

Ignore the custom attributes for now; those are pure plumbing details required for the interop layer to correlate entity types to sheets and properties to columns in that sheet. Trivial stuff to define and to use. What’s more important is the typing we’re looking at. Name is not a string, it’s a StringColumn. Similarly, WorkingSet isn’t an int, but rather an IntColumn.

So, looking back at our Where method, proc ought to be an instance of Proc, hence we need to determine the type of the lambda body:

proc.WorkingSet > 20 * 1024 * 1024

We know WorkingSet is of type IntColumn, so the > comparison operator can only come from an operator overload defined on that type:
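That operator overload isn’t shown above; its essence is the following sketch (GreaterThanFilter and LessThanFilter are hypothetical names for concrete Filter subtypes; construction plumbing is omitted here):

```csharp
public class IntColumn : Column
{
    // The > operator doesn't compare anything at all; it merely records
    // the comparison as a Filter object, our mini expression tree.
    public static Filter operator >(IntColumn column, int value)
    {
        return new GreaterThanFilter(column, value);
    }

    // C# requires relational operators to be overloaded in pairs,
    // so < has to come along for the ride.
    public static Filter operator <(IntColumn column, int value)
    {
        return new LessThanFilter(column, value);
    }
}
```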

You get the idea. In a very similar way the OrderBy method is defined on Query&lt;T&gt;. This time it doesn’t take in a lambda producing a Filter (as ordering is not filtering, right?), but rather a lambda that produces a Column. After all, the function required to carry out an ordering is a so-called key extractor (and we don’t bother with things like composite keys here):

public Query<T> OrderBy(Func<T, Column> column)

As both properties on Proc are typed to be *Column, each of which derives from the Column base class, the OrderBy call succeeds:

The return type of the projection lambda is not generic (as in other LINQ implementations, where the projection function would be a Func<T, R> where R can be anything, e.g. an anonymous type). This way we make sure the user just returns a set of columns with no fancy decoration over it (like new { Bar = x.Foo } which results in an anonymous type).

The return type of the whole Select query operator is of type Result<T>, no longer Query<T>. This means we can’t continue our query beyond the Select call (assuming we don’t provide methods like Where, OrderBy, Select on Result<T>, which we don’t). By doing this, we can rest assured the user doesn’t “keep going on” and our query is terminated properly.
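Putting those two observations together, a signature for Select consistent with this description reads:

```csharp
// The projection can only yield columns, and Result<T> deliberately
// exposes no further query operators, so the chain necessarily ends here.
public Result<T> Select(Func<T, Column[]> projection)
```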

The last point is quite important. In fact, we’ve created a state machine of types. Starting from an ExcelSource&lt;T&gt;, we transitioned into Query&lt;T&gt; to end up with Result&lt;T&gt;. You could go much further in controlling the order of operators and the number of operator calls that are allowed (e.g. a maximum of two Where calls), at the cost of creating many more types:
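As a minimal self-contained sketch of such a type state machine (all type names are hypothetical, and state bookkeeping is omitted), here’s how “at most two Where calls” could be encoded so that a third where clause simply fails to compile:

```csharp
using System;

public class Filter { }
public class Column { }
public class Result<T> { }

// Each type is a state; the operators it exposes are the legal transitions.
public class Source<T>
{
    public QueryOneWhere<T> Where(Func<T, Filter> filter)
    {
        return new QueryOneWhere<T>(); // a real implementation records the filter
    }

    public Result<T> Select(Func<T, Column[]> projection)
    {
        return new Result<T>();
    }
}

public class QueryOneWhere<T>
{
    public QueryTwoWheres<T> Where(Func<T, Filter> filter)
    {
        return new QueryTwoWheres<T>();
    }

    public Result<T> Select(Func<T, Column[]> projection)
    {
        return new Result<T>();
    }
}

public class QueryTwoWheres<T>
{
    // No Where method here: a third where clause doesn't compile.
    public Result<T> Select(Func<T, Column[]> projection)
    {
        return new Result<T>();
    }
}
```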

Finally we have a Result<T>, which has a method to turn the result into a chart:

public Image ToChart(XlChartType chartType)

where Image is the built-in System.Drawing.Image class from the .NET Framework.

The plumbing of ExceLINQ

We’re done with the conceptual part of our mission. The user can write queries that result in calls being made to our own methods on types like Query&lt;T&gt;. What’s more, we gained control over the lambda expressions that are passed to those query operators too. So we have all the information about the query the user intended to write, meaning we can turn it into execution.

That’s where the plumbing part comes in. The implementation of our query operator methods needs to collect all the information about the query. Once we have aggregated all of this knowledge into a Result&lt;T&gt; object, we need to turn it into instructions that Excel understands. And finally, we ask Excel to generate the chart upon calling Result&lt;T&gt;.ToChart, producing an Image instance somehow. All of this is quite irrelevant to the LINQ topic in this post.

I’ll omit all of the COM interop magic as it doesn’t contribute much to the core idea, but there’s one thing I should dive into a bit more here. We’ve seen that purely based on the type information available through ExcelSource<T>, more particularly because of the T part in there, we can write strongly typed queries over some entity type like the one below:

We’ve also seen that we defined operators like Where based on such an entity type, e.g. taking in a Func<T, Filter> where T is substituted for the particular entity type we’re dealing with:

where proc.WorkingSet > 20 * 1024 * 1024

The reason we did that is to be able to call the (in this case predicate-valued) delegate and get back the corresponding filter object (i.e. the runtime representation of the &gt; condition applied in the predicate) without further runtime hoops. Those hoops would come with the IQueryable style of query provider implementation, where the predicate would not be a Func&lt;T, Filter&gt; but an Expression&lt;Func&lt;T, bool&gt;&gt;. There the user could write all sorts of expressions that ultimately produce a Boolean value; the compiler would be fine with that, but we might give up at runtime because the generated expression tree is too complex (e.g. when the user wrote something like “where CallGreenManOnMars(proc).Replied(“Okay”)”, which doesn’t make sense but type-checks fine). In some sense, we pushed the Expression&lt;…&gt; part down into the function signature to constrain the lambda body to something of type Filter, which is our mini (domain-specific) expression tree.
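The contrast is easiest to see in the signatures themselves (the first line is shaped like the Queryable.Where pattern; the second is ExceLINQ’s):

```csharp
// IQueryable style: any Boolean-valued lambda type-checks, so complexity
// only surfaces at runtime, during expression tree translation.
public IQueryable<T> Where(Expression<Func<T, bool>> predicate);

// ExceLINQ style: the lambda body must be of type Filter, so only the
// handful of operations our mini expression tree supports will compile.
public Query<T> Where(Func<T, Filter> filter);
```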

There’s something missing though: in order to get the result of the predicate delegate, we need to feed it an instance of the entity type T, in this case a Proc instance. Moreover, we need to make sure all properties are properly instantiated because we’re about to call into them during the execution of the lambda body, e.g. by doing “proc.WorkingSet”.

The way we accomplish this is by new’ing up the entity type at runtime and filling in the properties dynamically:
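The listing itself is missing here; following the step-by-step walkthrough that comes next, a hedged reconstruction (the Ordinal property and the exact constructor shape are my assumptions) could look like this, assuming usings for System, System.Linq and System.Reflection:

```csharp
static T CreateInstance<T>()
{
    // First line: create an instance of T (a new() constraint would work too).
    T instance = Activator.CreateInstance<T>();

    // LINQ to Objects over reflection: pair every property with its
    // ColumnAttribute; Single throws for "foreign" unmapped properties.
    var columns = from prop in typeof(T).GetProperties()
                  let attr = prop.GetCustomAttributes(typeof(ColumnAttribute), false)
                                 .Cast<ColumnAttribute>()
                                 .Single()
                  orderby attr.Ordinal
                  select new { Property = prop, Attribute = attr };

    // Instantiate each *Column property through its two-parameter private
    // constructor; the running index doubles as the worksheet column index.
    int index = 1;
    foreach (var column in columns)
    {
        object value = Activator.CreateInstance(
            column.Property.PropertyType,
            BindingFlags.Instance | BindingFlags.NonPublic,
            null,
            new object[] { column.Attribute.Name, index++ },
            null);
        column.Property.SetValue(instance, value, null);
    }

    return instance;
}
```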

Let’s step through this a bit. On the first line we create an instance of T, nothing fancy (an alternative would have been a generic parameter constraint for a default constructor; it just turned out this way for no particularly good reason and, hey, this is just a sample). Next we use LINQ to Objects (don’t you love it?) to get all the properties and their ColumnAttribute mapping. I’m using the Single operator here, so foreign properties (which don’t have a mapping attribute) will cause our CreateInstance method to fail. I could be more polite and ignore those properties, or define my own exception type (which I really should have done), but you get the idea.

Finally, the loop going over the columns is required to instantiate each individual property. Again, we’re a bit lazy here: we expect the properties to be of a Column-derived type we know about, and we assume a two-parameter private constructor is available to new it up. To make this code production quality you’d obviously go for more thorough checking, with proper exception types used to signal failures. The one tricky bit is ordering the columns so that we can determine each column’s index in the spreadsheet, which is required by other COM plumbing code where we select Excel ranges. This is just an implementation detail and could be eliminated with some more careful study of the Office automation libraries (but I’m not an Office programming expert at all). Here’s a sample column type with the constructor used:
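That listing is missing as well; in the same hedged spirit, the Column base type and a sample derived column with the private constructor could look like:

```csharp
public abstract class Column
{
    internal string Name { get; private set; }
    internal int Index { get; private set; } // 1-based worksheet column index

    protected Column(string name, int index)
    {
        Name = name;
        Index = index;
    }
}

public class IntColumn : Column
{
    // Private on purpose: only CreateInstance's reflection plumbing
    // is supposed to new up column instances.
    private IntColumn(string name, int index)
        : base(name, index)
    {
    }
}
```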

Query<T> mutates its own state on every query operator call; it would be better to return new instances on every call as queries might act as the base for other queries (a common pattern).

A better design would be to go for the “type state machine approach” as outlined in this post, to restrict the order and multiplicity of query operators.

Anyway, if you’re curious to see how all of this works, I’ve uploaded the prototype over here. The usual warnings about sample code apply. Below is a nice view of stepping through ExceLINQ:

Now pressing F10 applies the first Where clause, causing the filter to be applied in Excel instantaneously.

Notice the filtering icon in the first row now. This live filtering makes ExceLINQ an ideal visualization of query execution, which was a big part of its appeal for my “LINQ in Breadth” demo. Next we apply ordering:

Another fantastic post Bart, and the more posts I read from you about how to use standard Func<T, TResult> for LINQ providers rather than IQueryable<T> the more I think I've made the wrong choice for the LINQ provider I'm working on.

Seeing what you can achieve with just method chaining, what would make IQueryable&lt;T&gt; a better choice, or is it becoming less and less of a good choice?

Thanks for the nice words - glad you like it. To answer your question: IQueryable<T> is still a good choice if the language you're targeting is a rich one (meaning it supports most of the operators) and semantics of CLR objects can be mapped nicely onto the target language.

SQL was an ideal candidate for IQueryable<T> primarily because of the first reason: lots of operators are supported. But it also makes sense to deal with regular .NET types (as opposed to "fake" types like StringColumn or DbString) to make mapping and integration with well-known APIs easier. Unfortunately it means there's quite some plumbing to be done in the expression tree translation in order to keep the client-side semantics as faithful as possible (ideally - well, from a translation point of view that is - you should be able to stick AsEnumerable on the Table<T> object and run the same query client-side, still getting the same results).

Having custom patterns as shown in this post helps in other scenarios: the query language is poor (but if it gets a little bit rich, the type state machine can become quite complex - we need YALPC, Yet Another Linq Provider Compiler) and/or operations have very specific semantics that deserve their own types (what's more natural: SqlMethods.Like(column, "pattern") or column.Like("pattern"), also considering discoverability). The only thing you miss is a one-on-one mapping to well-known CLR types, but you gain more compile-time checking and the types help for discoverability. Still you have all the benefits of language integration.

It's not black or white though. Mixed approaches are possible too. For instance, one could think of custom types to generate a state machine of types to reduce the available query operators, but still leave the arguments as expression trees. That improves compile-time checking on the level of the operators. Or one could use custom types for entity "parts" (to generalize the concept of "columns" in an ad hoc manner) in expression trees. Or ... anything goes as long as the compiler is fine with it: party on!

Thanks for the great response. I've actually refactored my LINQ provider to be IEnumerable&lt;T&gt; based rather than IQueryable&lt;T&gt; (not just based on this post). I can see merits in IQueryable, but full providers like L2S and EF are far beyond what a mere single developer can achieve.

MemberExpression and MethodCallExpression can be just a pain to work with, especially when some developer has done "hello world".ToString().ToString().ToString(). A Func&lt;...&gt; passed to a LINQ to Objects collection is just a lot easier to deal with!

You wouldn't happen to have any speaking tours set for Denmark in June would you :P