C# 4.0 Feature Focus - Part 3 - Intermezzo: LINQ's new Zip operator

After named parameters and optional parameters, we'll take a little breather and deviate a bit from the language specifics to present a new LINQ operator: Zip. Just like a zipper zips two streams of materials together, LINQ's Zip operator can zip together two sequences. Here's the signature of the new method:
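For reference, this is the declaration as it appears on System.Linq.Enumerable in the released .NET Framework 4 (declaration only, body omitted). The parameter name resultSelector comes from the shipped framework; the discussion further on refers to the same parameter as func.

```csharp
// Declared on the static class System.Linq.Enumerable (.NET Framework 4).
public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>(
    this IEnumerable<TFirst> first,
    IEnumerable<TSecond> second,
    Func<TFirst, TSecond, TResult> resultSelector)
```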

Sample

Given two sequences and a function that combines one element from each sequence, a sequence of combined results is produced. Here's a sample:

string[] names = { "Bart", "John" };
int[] ages = { 25, 60 };

names.Zip(ages, (name, age) => name + " is " + age + " years old.");

This produces a sequence with the sentences "Bart is 25 years old." and "John is 60 years old.". The lambda syntax for the passed-in function should speak for itself, and notice we're using extension method invocation here, so names is propagated to become the left-hand side of the method call.
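To make the point about extension method invocation concrete, here is a small sketch that runs the sample both ways. The class and method names (ZipSample, Sentences, SentencesStatic) are made up for this illustration; only Enumerable.Zip itself comes from the framework.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ZipSample
{
    // The sample from the text, using extension-method syntax.
    public static List<string> Sentences()
    {
        string[] names = { "Bart", "John" };
        int[] ages = { 25, 60 };
        return names.Zip(ages, (name, age) => name + " is " + age + " years old.").ToList();
    }

    // The same call as a plain static method invocation:
    // names simply moves back into the first argument position.
    public static List<string> SentencesStatic()
    {
        string[] names = { "Bart", "John" };
        int[] ages = { 25, 60 };
        return Enumerable.Zip(names, ages, (name, age) => name + " is " + age + " years old.").ToList();
    }
}
```

Both methods return the same two sentences; the extension-method form is only a different way of writing the static call.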

How Zip works

Previously I've implemented the Standard Query Operators for reference purposes on the Codeplex site at http://www.codeplex.com/LINQSQO. I won't update that sample library just yet, but here's an illustration of how easy Zip is to implement, ignoring exception handling (which is more subtle than you might think; see further on):
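A minimal sketch of such an implementation, built from Select's indexed overload and Join as explained further on, could look like this. The names ZipSketch and ZipViaJoin are chosen for this illustration, and the anonymous type members X and I follow the naming used in the explanation; treat it as one possible reading rather than the definitive listing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ZipSketch
{
    // Deliberately no argument validation here; that is discussed separately.
    public static IEnumerable<TResult> ZipViaJoin<TFirst, TSecond, TResult>(
        this IEnumerable<TFirst> first,
        IEnumerable<TSecond> second,
        Func<TFirst, TSecond, TResult> func)
    {
        // Tag every element (X) with its position (I), then join on the position.
        return first.Select((x, i) => new { X = x, I = i })
                    .Join(second.Select((x, i) => new { X = x, I = i }),
                          o => o.I,                     // key of the outer (first) sequence
                          i => i.I,                     // key of the inner (second) sequence
                          (o, i) => func(o.X, i.X));    // combine the matching pair
    }
}
```

One consequence of this design: Join reads the whole second sequence into a lookup before producing its first result, so this version does not stream the way a hand-written loop over two enumerators would.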

(Exercise for the reader: think of more alternatives.) How does this work? To explain this we need a bit of vocabulary, so let's refer to Wikipedia for a second and apply an isomorphism between textile sciences and computer sciences (something in me screams "zip, zipper, zippest"):

The signature of a Zip consists of two sequences of element types, each affixed to one of the two sequence instances to be joined, carrying tens or hundreds or anything below OutOfMemoryException conditions of regularly allocated reference- or value-typed elements. (...) The Func<TFirst, TSecond, TResult> delegate, operated by executing Zip, moves along the sequences of elements. Inside the Func<TFirst, TSecond, TResult> delegate is a strongly-typed function that meshes together or separates the opposing sequences of elements, depending on the direction of its movement. The binding and application of the Func<TFirst, TSecond, TResult> delegate on the elements causes a characteristic callvirt'ing noise, which is probably not the origin of the name zipper. The name also may have originated in the greater speed and ease with which the two sides of a Zip can be joined, compared to the time needed for executing the method above.

Well, that's exactly what happens: opposing sequences of elements are combined by a strongly-typed function. How do we determine opposing sequence elements? Using Select's overload that provides (besides the element itself) an index denoting the element's position in the original sequence. Matching the opposing elements is a matter of joining both sequences, extracting the keys (marked as I in the anonymous type) and combining the selected pairs of elements from both sequences (marked as X in the anonymous type) by feeding them into the Func<TFirst, TSecond, TResult> delegate. I told you the explanation was straightforward, or did I?

Iterators and exceptions

Most of the LINQ operators are implemented using iterators. You might wonder why this is relevant at all. We need iterators because LINQ operators have to be lazy: declaring a query doesn't cause any execution whatsoever, and the internal machinery should kick in only when you start fetching results by iterating over the sequence. Internally, iterators are implemented as state machines. Every state can do one "yield", after which the state machine is suspended until the consumer asks for the next element in the sequence being produced by calling MoveNext on the IEnumerator<T> object.
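The suspend-and-resume behavior can be observed with a tiny iterator that records when each part of its body runs. This is only a sketch; LazyDemo, Numbers and the log parameter are names invented for the illustration.

```csharp
using System;
using System.Collections.Generic;

public static class LazyDemo
{
    // An iterator that notes in 'log' when each piece of its body actually executes.
    public static IEnumerable<int> Numbers(List<string> log)
    {
        log.Add("start");
        yield return 1;       // suspended here until the next MoveNext
        log.Add("resumed");
        yield return 2;       // suspended again
        log.Add("end");
    }
}
```

Calling Numbers(log) and even GetEnumerator() leaves the log empty. The first MoveNext adds "start" and stops at the first yield, the second adds "resumed", and the third adds "end" and returns false.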

//
// Resetting the enumerator is not supported; create a new one instead by calling GetEnumerator.
// The foreach statement does this anyway.
//
void IEnumerator.Reset()
{
    throw new NotSupportedException();
}

Notice that the code above sits somewhere between the specification (paragraph 10.14 of the C# 3.0 specification) and the actual implementation in the Visual C# 2008 compiler. More specifically, I've made the discrete states (which are internally represented as integers) match the ones in the specification, but haven't gone all the way in matching the code up with the

Of all this machinery, the MoveNext method is the most interesting one from the iterator's point of view. Here a state machine is built, rewriting the original iterator block by splitting it into discrete blocks. To perform this transformation, yield return statements are replaced as follows:

yield return a;

becomes

this._current = a; this._state = State.Suspended; return true;

Similar rewrites happen for yield break statements, but that's not relevant here. Furthermore all the iterator block code is placed in a MoveNext method, switching on the current state. The specification only mentions the four states I've implemented above, but we're lucky the number of states for our sample matches up with the number of states that's specified. In cases where there are more yield statements, more states are needed to suspend the machine and resume at the same point during the next MoveNext call. Other transformations needed include rewriting of loops (things like while don't really play well with suspension points and need to be rewritten in terms of if/goto, where gotos require additional states). Applying all those tricks gives us:
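As a hand-written sketch of the idea (not the compiler's actual output), here is an enumerator for a Zip-style loop over two source enumerators, using the four states named in the specification. The class name ZipEnumerator and its field names are assumptions made for this illustration, and the Dispose logic is an addition of the sketch.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public sealed class ZipEnumerator<TFirst, TSecond, TResult> : IEnumerator<TResult>
{
    private enum State { Before, Running, Suspended, After }

    private readonly IEnumerable<TFirst> _first;
    private readonly IEnumerable<TSecond> _second;
    private readonly Func<TFirst, TSecond, TResult> _func;

    private State _state = State.Before;
    private TResult _current;
    private IEnumerator<TFirst> _e1;    // local of the iterator block, hoisted to a field
    private IEnumerator<TSecond> _e2;   // local of the iterator block, hoisted to a field

    public ZipEnumerator(IEnumerable<TFirst> first, IEnumerable<TSecond> second,
                         Func<TFirst, TSecond, TResult> func)
    {
        // Only stores the arguments; nothing is validated or enumerated yet.
        _first = first;
        _second = second;
        _func = func;
    }

    public TResult Current { get { return _current; } }
    object IEnumerator.Current { get { return _current; } }

    public bool MoveNext()
    {
        switch (_state)
        {
            case State.Before:
                // First call: run the block from the top and capture the locals.
                _state = State.Running;
                _e1 = _first.GetEnumerator();
                _e2 = _second.GetEnumerator();
                break;
            case State.Suspended:
                // Resume right after the previous yield, back at the loop condition.
                _state = State.Running;
                break;
            default:
                // After: the sequence is exhausted.
                return false;
        }

        // The 'while' condition, rewritten as an 'if'.
        if (_e1.MoveNext() && _e2.MoveNext())
        {
            // yield return func(e1.Current, e2.Current);
            _current = _func(_e1.Current, _e2.Current);
            _state = State.Suspended;
            return true;
        }

        _state = State.After;
        return false;
    }

    void IEnumerator.Reset() { throw new NotSupportedException(); }

    public void Dispose()
    {
        // Addition of this sketch: release the source enumerators, if they were created.
        if (_e1 != null) _e1.Dispose();
        if (_e2 != null) _e2.Dispose();
        _state = State.After;
    }
}
```

Because the loop body contains a single yield, one Suspended state is enough to know where to resume; an iterator with several yield statements would need one resume point per yield.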

Here the local variables used in the iterator block are captured in the initial pass through MoveNext, when state is still set to Before. Strictly speaking we move from Before to Running all the way to the first point where the code gets suspended, but in our case states can be merged quite a bit. In fact the code really produced by the compiler looks like this:

But why am I telling you all this? Because it's fascinating? Yes, for that reason too. But I promised to tell you something about exceptions, right? Let me quote from the C# 3.0 specification paragraph 10.14.4.1 first:

(...) The precise action performed by MoveNext depends on the state of the enumerator object when MoveNext is invoked:

If the state of the enumerator object is 'before', invoking MoveNext:

Changes the state to 'running'.

Initializes the parameters (including this) of the iterator block to the argument values and instance value saved when the enumerator object was initialized.

Executes the iterator block from the beginning until execution is interrupted (as described below).

(...)

The last of these steps (executing the iterator block from the beginning until execution is interrupted) is what's of interest to us, and you should read it in reverse:

The code from the beginning of the iterator block till the place where execution is interrupted (i.e. a yield occurs) is executed by MoveNext when transitioning from 'before' to 'running'.
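The effect is easy to observe with a naive iterator that puts its null checks at the top of the iterator block. This is a sketch with invented names (DeferredCheck, NaiveZip); the using statements around the enumerators are a choice made here, not something taken from the reference library.

```csharp
using System;
using System.Collections.Generic;

public static class DeferredCheck
{
    // Naive version: the checks are part of the iterator block,
    // so they only run during the first MoveNext call.
    public static IEnumerable<TResult> NaiveZip<TFirst, TSecond, TResult>(
        IEnumerable<TFirst> first,
        IEnumerable<TSecond> second,
        Func<TFirst, TSecond, TResult> func)
    {
        if (first == null) throw new ArgumentNullException("first");
        if (second == null) throw new ArgumentNullException("second");
        if (func == null) throw new ArgumentNullException("func");

        using (var e1 = first.GetEnumerator())
        using (var e2 = second.GetEnumerator())
        {
            while (e1.MoveNext() && e2.MoveNext())
                yield return func(e1.Current, e2.Current);
        }
    }
}
```

Calling NaiveZip with a null first argument returns normally; the ArgumentNullException only surfaces once something starts enumerating the result.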

So just calling the iterator does nothing that can cause the exceptions to be thrown, until a call is made to MoveNext, e.g. by using a foreach loop. As we want the exception to be thrown straight away when calling Zip with invalid arguments, the right way to implement our Zip method is:
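A sketch of that pattern: a regular (non-iterator) method validates the arguments and then hands off to a private iterator. The names EagerCheck, CheckedZip and ZipIterator are made up for this illustration, and the using statements are again a choice of the sketch.

```csharp
using System;
using System.Collections.Generic;

public static class EagerCheck
{
    // Not an iterator block: the checks run as soon as the method is called.
    public static IEnumerable<TResult> CheckedZip<TFirst, TSecond, TResult>(
        this IEnumerable<TFirst> first,
        IEnumerable<TSecond> second,
        Func<TFirst, TSecond, TResult> func)
    {
        if (first == null) throw new ArgumentNullException("first");
        if (second == null) throw new ArgumentNullException("second");
        if (func == null) throw new ArgumentNullException("func");

        return ZipIterator(first, second, func);
    }

    // The lazy part: runs only when the result is enumerated.
    private static IEnumerable<TResult> ZipIterator<TFirst, TSecond, TResult>(
        IEnumerable<TFirst> first,
        IEnumerable<TSecond> second,
        Func<TFirst, TSecond, TResult> func)
    {
        using (var e1 = first.GetEnumerator())
        using (var e2 = second.GetEnumerator())
        {
            while (e1.MoveNext() && e2.MoveNext())
                yield return func(e1.Current, e2.Current);
        }
    }
}
```

With this split, passing null throws at the call site, while the actual zipping stays deferred until the sequence is consumed.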

Conclusion

The addition of Zip to LINQ is a nice one, but not a mind-blower. So I hope you'll accept my apologies for using this as a lame excuse to nag about the implementation of iterators and their subtle impact on exceptions. Next time: generics co- and contra-variance in C# 4.0.

You're definitely right about this - rest assured our LINQ to Objects implementation does the right thing wherever required :-). There's another thing here that's missing, a null-check for the func parameter.

The LINQSQO project historically doesn't wrap the iterator blocks in using clauses; I have it on the to-do list for a future release though.

You can think of it that way if you like. Zip could well be called Select, taking in another sequence as its argument. The most relevant fact here is that the shortest sequence determines the length of the outcome, i.e. the zipper function is not called with default(TShorterSequence) arguments.
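A quick way to check the shortest-sequence behavior against the framework's Enumerable.Zip is to count how often the zipper function runs. The helper below (ShortestWins.CountCalls) is invented for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ShortestWins
{
    // Zips three letters with two numbers and counts the calls to the zipper function.
    public static int CountCalls(out List<string> results)
    {
        int calls = 0;
        string[] letters = { "a", "b", "c" };
        int[] numbers = { 1, 2 };
        results = letters.Zip(numbers, (l, n) => { calls++; return l + n; }).ToList();
        return calls;
    }
}
```

The function runs twice and the result holds "a1" and "b2"; the third letter is never paired with a default value.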

Could you comment on Zip from a PLINQ perspective? Could we call list1.Zip(list2).AsParallel() with your implementation above, or do we have to worry about synchronization between the two IEnumerable(OfT) sequences?
