More fun with LINQ: SelectMany

By Jim, on January 8th, 2014

A coworker came to me with a little coding problem yesterday, and my investigation of the problem revealed a few things about C#, the .NET Framework, and LINQ. It all started with an email in which he decried the “ickiness” of this code.

Simply explained, he’s creating a list of all the work types that are used in that particular row set.

We can simplify that quite a bit by removing some unnecessary code. First of all, HashSet<T>.Add fails gracefully if the item passed to it already exists in the collection. The method returns true if the item is added to the collection. If the item already exists in the collection, then Add returns false and leaves the set unchanged.
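To make the Add behavior concrete, here is a minimal sketch. The class and method names (HashSetAddDemo, AddTwice) are made up for illustration; only HashSet<T>.Add itself comes from the framework.

```csharp
using System.Collections.Generic;

// Hypothetical demo class, not part of the original code.
public static class HashSetAddDemo
{
    // Adds the same key twice and reports what Add returned each time,
    // along with the final size of the set.
    public static (bool First, bool Second, int Count) AddTwice(long key)
    {
        var workTypes = new HashSet<long>();
        bool first = workTypes.Add(key);   // true: the key was not in the set yet
        bool second = workTypes.Add(key);  // false: already present, set is unchanged
        return (first, second, workTypes.Count);
    }
}
```

Calling HashSetAddDemo.AddTwice(42) gives (true, false, 1), so there is no need to guard the call with a Contains check.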

In addition, the return statement is unnecessarily complex. Select(wt => wt) doesn’t do anything different; it just enumerates the collection and creates another IEnumerable.
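With both changes applied, the loop might look something like the sketch below. This is a reconstruction under assumptions, not the coworker's actual code: only WorkTypeKey is named in the article, so the Row class, the AltWorkTypeKey property, and the convention that 0 means "no second work type" are all hypothetical.

```csharp
using System.Collections.Generic;

// Hypothetical row type. AltWorkTypeKey and the "0 means not set"
// convention are assumptions made for this sketch.
public class Row
{
    public long WorkTypeKey { get; set; }
    public long AltWorkTypeKey { get; set; }
}

public static class WorkTypes
{
    public static IEnumerable<long> GetWorkTypes(IEnumerable<Row> rows)
    {
        var workTypes = new HashSet<long>();
        foreach (var row in rows)
        {
            // No Contains check: Add simply returns false for a duplicate.
            workTypes.Add(row.WorkTypeKey);
            if (row.AltWorkTypeKey != 0)
            {
                workTypes.Add(row.AltWorkTypeKey);
            }
        }
        // HashSet<long> already implements IEnumerable<long>,
        // so it can be returned without an extra Select call.
        return workTypes;
    }
}
```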

That’s pretty standard code, and about as good as you’re going to get with looping constructs. It’s not aesthetically pleasing, though. You can improve it with LINQ.

Anybody who’s worked with LINQ is familiar with the Select method and its variants, which project each element of a sequence into a new form. If all he wanted to do was get a list of the individual rows’ WorkTypeKey values, for example, he could have written:

return rows.Select(r => r.WorkTypeKey).Distinct();

But that only provides one item per row. The goal in the original code is to provide at least one but possibly two items per row. Select can’t do that. But SelectMany can. For example, if he wanted to unconditionally return both items, he could write:
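A sketch of that unconditional version follows. As before, the Row class and the AltWorkTypeKey property are hypothetical stand-ins; the article only names WorkTypeKey.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type; AltWorkTypeKey is an assumed name.
public class Row
{
    public long WorkTypeKey { get; set; }
    public long AltWorkTypeKey { get; set; }
}

public static class WorkTypes
{
    public static IEnumerable<long> GetAllKeys(IEnumerable<Row> rows)
    {
        // Each row is projected to a two-element array. SelectMany flattens
        // those arrays into a single sequence, and Distinct drops repeats.
        return rows
            .SelectMany(r => new[] { r.WorkTypeKey, r.AltWorkTypeKey })
            .Distinct();
    }
}
```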

Note that I didn’t specify long in the array declarations. The compiler can infer the type, so in this case new[] resolves to new long[]. Whether you take advantage of the type inference is a matter of style.

I see two common objections to this type of code. The first is that it’s inefficient because it creates a list and then does a Distinct call to cull the duplicates. But that’s not what happens. You have to keep in mind that LINQ is using lazy evaluation and deferred execution. The calls to SelectMany and Distinct don’t actually do anything until some code tries to enumerate the resulting sequence (by using foreach or by calling ToList(), for example). Even then, it’s not as though the code calls SelectMany to create a list and then Distinct to cull the duplicates. What really happens is that Distinct creates a HashSet and then iterates over the rows collection, checking each row in turn. Essentially, the code does this:
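A rough sketch of that equivalent loop is below. It is a simplified model of the behavior rather than the framework's actual implementation, and the Row class, AltWorkTypeKey, and the DistinctKeys name are hypothetical.

```csharp
using System.Collections.Generic;

// Hypothetical row type; AltWorkTypeKey is an assumed name.
public class Row
{
    public long WorkTypeKey { get; set; }
    public long AltWorkTypeKey { get; set; }
}

public static class WorkTypes
{
    // Nothing in this method runs until the caller starts enumerating
    // the result, because it is an iterator (it uses yield return).
    public static IEnumerable<long> DistinctKeys(IEnumerable<Row> rows)
    {
        var seen = new HashSet<long>();
        foreach (var row in rows)
        {
            // The small per-row array is the only temporary; no
            // intermediate list of all keys is ever built.
            foreach (var key in new[] { row.WorkTypeKey, row.AltWorkTypeKey })
            {
                if (seen.Add(key))
                {
                    yield return key;
                }
            }
        }
    }
}
```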

So you see that the code doesn’t really create a temporary list and then de-dupe it. It does, however, create small temporary arrays, which causes some people to raise efficiency concerns. In truth, there is a small inefficiency there. But the garbage collector is optimized to efficiently handle allocation of many small, short-lived objects. So those little arrays shouldn’t cause a problem. I might think differently if the rows list were to contain millions of items, but for the handful of items it will typically contain, there’s no reason to worry about the small additional load on the garbage collector.

Unfortunately, that code doesn’t compile. For reasons that I won’t go into, you can’t use yield return in a lambda expression or anonymous method. You can, however, write a separate method to do that, resulting in:
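One way that might look is sketched here. The helper name KeysFor, the Row class, the AltWorkTypeKey property, and the rule that 0 means "no second work type" are all assumptions for illustration, not the original code.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type. AltWorkTypeKey and the "0 means not set"
// convention are assumptions made for this sketch.
public class Row
{
    public long WorkTypeKey { get; set; }
    public long AltWorkTypeKey { get; set; }
}

public static class WorkTypes
{
    // An ordinary named method may be an iterator, so yield return
    // is allowed here even though it is not allowed in a lambda.
    private static IEnumerable<long> KeysFor(Row row)
    {
        yield return row.WorkTypeKey;
        if (row.AltWorkTypeKey != 0)
        {
            yield return row.AltWorkTypeKey;
        }
    }

    public static IEnumerable<long> GetWorkTypes(IEnumerable<Row> rows)
    {
        // One or two keys per row, flattened and de-duplicated.
        return rows.SelectMany(r => KeysFor(r)).Distinct();
    }
}
```

This version also avoids the small temporary arrays, since each row's keys are produced by the iterator instead.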

2 comments to More fun with LINQ: SelectMany

Select(wt => wt) does have an effect, actually: it returns an iterator instead of the set itself, preventing it from accidentally being cast back to ICollection or something and modified. But there is still no need for it, because we could just do this in the loop instead:

if (workTypes.Add(key))
    yield return key;

That’s basically what Distinct() does.

I can think of a couple of other ways you could write the conditional without using yield:

1. Substitute the optional key with an empty enumerable and use Concat:
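A sketch of what this first option might look like, assuming a hypothetical Row class whose AltWorkTypeKey property uses 0 to mean "no second work type" (neither the property nor that convention appears in the original post):

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type; AltWorkTypeKey and the "0 means not set"
// convention are assumptions made for this sketch.
public class Row
{
    public long WorkTypeKey { get; set; }
    public long AltWorkTypeKey { get; set; }
}

public static class WorkTypes
{
    public static IEnumerable<long> GetWorkTypes(IEnumerable<Row> rows)
    {
        // The first key is always present. The optional key is appended
        // with Concat, or replaced by an empty sequence when it is not set.
        return rows
            .SelectMany(r => new[] { r.WorkTypeKey }
                .Concat(r.AltWorkTypeKey != 0
                    ? new[] { r.AltWorkTypeKey }
                    : Enumerable.Empty<long>()))
            .Distinct();
    }
}
```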