Introduction

This blog post is a continuation of the series of LINQ Short takes. In this blog post, I present an implementation of a pivot, or crosstab, transformation in Linq. I have used an implementation as a Linq extension method (specifically an extension method to IEnumerable(Of T) Interface).

The Problem

The pivot, or crosstab, transformation is notoriously difficult to achieve in SQL. The original definition of SQL used as foundations concepts from relational algebra and tuple relational calculus. Linq having some of its root in SQL, shares the difficulty with SQL formulation of a pivot, or crosstab, transformation. Fortunately, there are enough of the procedural, and functional, programming aspects of C# that are included into the Linq design and implementation. The aspects of functional and procedural, programming present in Linq make expressing a generic solution to the pivot, or crosstab, transformation possible.

The following diagram attempts to show what the pivot transformation does to the data. At its core, the pivot transformation is just the swapping of the columns to rows, with the data following the transposition of the row and column vector. The difficulty comes in expressing this in a generic way.

The Solution – Design

The following are the design considerations and design criteria. These design criterions, then informed the implementation of the pivot transformation. These were the mandatory criteria.

The implementation must be an extension method (see MSDN documentation: Extension Methods (C# Programming Guide)for further details). This allows the implementation of Pivot to be applied to the just like any other Linq transformation.

The extension method must use a generic type parameter (see MSDN documentation: Generics (C# Programming Guide)) for the object type being manipulated. This allows the broadest range of objects as inputs to the pivot transform.

The extension method must be an extension to IEnumerable(Of T) Interface. This is the other half, to a broadest application criterion, for the pivot transformation. This criterion allows application of the pivot transformation to the broadest range of collections.

The return type of the pivot transformation must be as generic as possible. This implies using the IEnumerable(Of T) Interface and a generic type parameter.

The return type should preserve the type of object from the input sequence to the transformation. Failure to do this would mean that the pivot transformation is arbitrarily changing the input data. It is my belief that this would break the Linq transformation model the .Net Framework provides. Failure to adhere to this criterion would make the transformation far less usable in general.

The implementation must be correct. I define correctness, as the pivot transformation should result in the generally accepted output of a pivot transformation.

The implantation of the pivot transformation should be efficient. Here I mean that it should be relatively quick for reasonable volumes of data. I did not want an implementation that showed up as a performance hot spot in any program that used the function.

The following were the desirable design criteria:

The implementation should be reasonable robust in dealing with ragged data. What I mean by ragged, or jagged, data is that not all rows of data have the same number of columns. This may be potentially difficult, and expensive, design criterion to address.

The implementation should be able to cater for null objects in the input, transferring the null into the output.

The signature element this IEnumerable<ienumerable> pivotSource argument to the method addresses a number of the design criteria. These design criteria are:

The use of this as the preface to the first argument in the argument list addresses, in part, the requirement for the implementation to be an extension method.

The use of the IEnumerable(Of T) Interface address the requirement for the implementation to process the broadest range of collections.

The use of the TSource type parameter enables the use of the method against the broadest range of objects possible.

In combination, the complete function signature addresses the following design criterion:

The function signature addresses the requirement for type stability, or not arbitrarily change the type of the object processed, implemented through the usage of the same type argument TSource as part of the definition of the input and return types.

The Source Code For The Pivot Implementation

Below is the source code for the implementation of the pivot extension method.

The bulk of the code is devoted to the checking of the argument. These argument checks protect the implementation from invalid argument that could cause exceptions from the underlying Linq implementation.

Key Features in the Implementation

The following features of the implementation are worthwhile highlighting. These include:

To manage ragged data the extension method makes all of the rows the same length (squares out the data). The square out process is done by the following method logic. The Concat Linq Extension Method appends the data required to fill out the input rows. The call to the Enumerable Repeat Method, builds the sequence that contains the number of elements required to fill out the row. The default(TSource) obtains the default value for the type of object in the sequence.

The implementation uses an upper bound for the Enumerable.Range Method that is the longest row in the inner IEnumerable of the pivotSource argument. The code fragment which achieves the determination of the longest row is: ‘pivotSource.Select(a => a.Count( )).Max( )‘. The use of the longest row in the pivotSource, in part, addresses the design criterion requirement for the pivot implementation to deal appropriately with ragged data.

The implementation uses the Linq SelectMany extension method to flatten sequence of column values from each row of the input rows into a single IEnumerable for each output row.

Downloadable Versions of the code

The following URLs contain the source code, including the XML Documentation Comments, for the implementation of the Pivot Linq extension method. There are pdf and docx versions of the code. These are the only types of text (rich text) files that Word Press allows to be loaded to the web site.

How and Why the Pivot Transformation Works

The following sections describe the logical operations, and implementation details, of the pivot transformation. I have broken this into three segments, and Overview, The Square Out Process, and The Data Pivot Process.

Overview

The following diagram describes the broad logical flow of the implementation of the pivot extension method.

The validate arguments phase of the implementation is a relatively simply trying set of checks. The checks protect the remained of the implementation from inputs that would cause an exception to thrown. Since the implementation uses many Linq extension methods, these exceptions may originate from deep within the .Net Framework (see MSDN article: .NET Framework Conceptual Overview). Diagnosis of the reason why an exception was thrown, when thrown from within the .Net Framework can prove to be very difficult process. This would be even harder if the developer is using the pivot extension method supplied as of a class library (see MSDN article: Assemblies and the Global Assembly Cache (C# and Visual Basic)).

There are a couple of feature of the implementation of the input validation phase of the pivot transformation worthwhile noting. These features include:

Many of the validation tests use the Object.ReferenceEquals method. Using the ReferenceEquals avoids possibility that Object.Equals method and the == operator could have been overridden. What the overridden versions implement could be inappropriate for the argument testing implementation.

The argument validations use a Linq extension method worth mentioning. This is the check validates that there are no rows in the input that are object with a value of null. This check employs the Linq extension method ‘Any’ (see: Enumerable.Any(Of TSource) Method ). The Any Linq extension method is one of the Linq extension methods that I have overlooked in the past. Therefore, I make note of it, just in case the reader has not seen the ‘Any’ extension in action previously.

There is another argument validation that uses a Linq extension method that is worthy of mentioning. This is the argument validation that all of the rows in the input that are not zero elements in length. This check employs the Linq extension method ‘All’ (see: Enumerable.All(Of TSource) Method). The ‘All’ Linq extension method is another Linq extension methods that I have overlooked in the past. Therefore, I make note of ‘All’ Linq extension method here, just in case the reader has not seen it in action previously.

Pivot Transform Top Level Processes

There are two processes in the implementation that will describe in further depth. These are:

I will describe in more detail the Linq process of squaring out the input data. The square out process makes all of the rows in the inner IEnumerable the same length. Only ragged input data has this transformation applied.

I will also describe in more detail the Linq process of performing the pivot transformation. The only input to the pivot transformation is the squared out data. The output from the pivot transform is then the return value of the extension method.

The Square Out Process

The following diagram illustrates the transformation to the row from the input to form the squared output. The transformation actually uses the default keyword (see the MSDN article: default Keyword in Generic Code (C# Programming Guide) for further details) to obtain the fill value used in the square out process.

The flowing diagram describes (attempts to) the logical process flow of the Square Out process.

The Components of the Square Out Process

The Code for the Square Out Process

The following is the code used in the pivot transform to generate the squared out data.

// Get the row lengths to check if the data needs squaring out int maxRowLen = pivotSource.Select(a => a.Count( )).Max( );
int minRowLen = pivotSource.Select(a => a.Count( )).Min( );
// Set up the input to the Pivot IEnumerable<IEnumerable<TSource>> squared = pivotSource;
// If a square out is required if (maxRowLen != minRowLen)
// Fill the tail of short rows with the default value for the type squared = pivotSource.Select(row =>
row.Concat(
Enumerable.Repeat(default(TSource), maxRowLen - row.Count( ))));

The Square Out Process in Prose

There are only a couple of steps it the square out process. These steps are:

Get the minimum and maximum column count from the input sequence. These are the Count().Min() and Count().Max().

If the column count minimum value and column count maximum value, are not the same. Then data needs squaring out, and proceeds into the square out process. Otherwise, the data is already square and the data proceeds to the pivot transformation.

For each row in the input sequence. The code: pivotSource.Select(row =>

Concatenate onto the row, for output, the required filler. The Enumerable.Concat() call for each of the rows, does this.

The Data Pivot Process

The diagram Linq Pivot Transformation (blow) tries to express visually what the Linq statement that implements the Pivot does. I hope that the readers of this blog post find the diagram helps understand how the pivot transformation works.

The Pivot Process Logical Process

The keys to the operation of the pivot process are the way use of the following:

The use of the Enumerable.Range() that is used to generate the sequence of columns numbers. The pivot transformation is one that takes each of the columns of the input sequence. Then the transformation makes a row from that set of values that then forms the result.

The SelectMany() which squashes the set of values from each column, into a row for the output.

The Where(value, column) variant that exposed the column number. This allows the selection of the correct column in the row.

Future Versions or Enhancements

The following are the potential enhancements I may make to the Pivot Linq Extension method.

As I describe below, the question ‘Should the square out process be optional within the pivot?’, I have yet to satisfactorily resolve. The enhancement to enable this would be to include a Boolean argument to the Pivot method, with a default value of true. It is my belief that most of the use cases for the pivot transformation would require squared out data. Nevertheless, I cannot completely dismiss the possibility that there could be use cases where the output of a pivot without squared out data is required.

In addition, as described below, I still have the question ‘Should the square out process be a separate Linq extension method?” unresolved. Refactoring the pivot method to make the square out process a separate Linq extension method, or a private support method within the library, I will leave until sometime in the future.

Making an IEnumerable<IEnumerable<of Source Type>>

Introduction

It has become a concern to me that I have been using the generic data structure IEnumerable<ienumerable>, but have omitted any hints, or code examples, on how to build such a structure. This section will address that concern and present some of the ways I have used to create this type of data structure. Much of what I will present is from my testing unit of the pivot transformation. Although these will be short code snippets, I trust that they will give the reader, some clues for creating this type of data structure.

Code Examples

I have decided to include a link to the code examples in pdf, and docx format. The pasting of the code for the examples in here would only increase the size of this post, which I believe is getting too big anyway.

Summary and Conclusions

There are many different ways to achieve the formation of an IEnumerable<ienumerable> structure. The above links present two of the basic strategies:

The first set of examples shows an approach to achieve the structure that centre around using an array, and the IEnumerable interface that the System.Array class presents.

There is a second set of examples included above. These links show the approach of implementing the IEnumerable interface on a user defined class. Object instances, within an array definition, achieve the required structure.

As an aside, these objects also build random test data for a variety of data types. These classes built the test data for the pivot process. The inheritance hierarchy allowed creation of ragged data.

Conclusions and Observations

There are a number of areas which are worthy of noting in my final remarks.

The Pivot and Square Out Processes

I am still to reach satisfactory conclusions to the following questions:

Should the square out operation within the Pivot process be optional? As I note below, the Pivot on ragged data without the square out process, gives a results that, I would describe as unacceptable (or plain ‘wrong’). This conclusion is strictly a matter of one man’s, that man being I, opinion. There may be use cases where the type of results of a pivot transformation on unpadded (squared out) data is what the caller of the method requires. This is one of the many eternal dilemmas that library designer consistently wrestle with, and ruminate upon. The question, put succinctly, for the library designer to resolve is, ‘Does this design decision precluding some real use cases?’ The normal response to this dilemma is to make the option an argument option in the method signature and allow the user to make the choice between the behaviours. Fortunately, C# provides optional arguments with a default value, which provides one resolution to the dilemma. The library designer can provide the library user with the option in the method signature, and provide some guidance by using an optional argument, with the default value that satisfies the expected normal, and most common, use case. Before I publish this blog post, I may yet, make switching on the square out process an optional Boolean argument, with a default value of true.

Is the square out transformation useful as a standalone Linq extension method? Here again I am uncertain that there are use cases where this would be needed. In this case, I doubt that I will refactor out the square out process into a new extension method any time soon. I will wait and see if I find a need for the square out process as separate Linq extension method.

Conclusions, and Observations, from the Design and Development Process

During the writing of this blog post, I have modified the code for the Pivot extension method a number of times. There have been two major sources for the modifications.

One source of modifications was the building a set of unit tests (see MSDN article: Verifying Code by Using Unit Tests) for Pivot extension method. The process of designing, and implementing, the unit tests made me think more clearly on the validation of input arguments.

Another source of modifications was writing this blog post. Writing this article made me think more clearly about ragged data as an input.

The most notable modifications were:

The tightening up, and greater clarity, in the validation checks performed on the input data sequence.

The addition of the square out process came from the thinking more clearly, and deeply, on the processing of ragged input data. Specifically, I realised that what the transformation resulted in when processing ragged data was unacceptable. The output from a ragged input would have broken the rows swapped to columns conceptual model of the transform. The pivot transformation applied to input data without the square out resulted in data from row ending up in the wrong columns. This realisation resulted in the implementation of the square out process.

The result of these modifications is a ‘better’ implementation of the pivot transformation. The formulation of the input argument is clearer. The output from the extension method is significantly more robust. This increase in robustness comes from the introduction of the square out process that addresses the problems that ragged input data brought.

Observations from the Implementation

There were a number of ‘discoveries’ (they were always there, but I found them and had a use for them) in the standard Linq extension methods. These Linq extension methods included:

The Enumerable.Any(Of TSource) Method is a method is a method that is very useful for argument validation. I would expect the Any Linq extension method would make more appearances in my code.

The conclusion and observation I would draw, is that no matter how well you think you know the .Net Framework library, there is still more to useful classes and methods to find. Microsoft unremittingly is adding to my ‘voyage of discovery’ through the .Net Framework. Microsoft keeps C# .Net environment relevant to the emerging challenges in the IT industry, by adding new features to the .Net Framework library, and the C# language, with every new version of the C# .Net environment. For me, this only adds to the ‘fun’ of using C# .Net, there seems that there is always something new to learn how to use effectively.

Introduction

I am posting this quick tip in response to a question, which seems to on some people’s minds who are reading my blog. The search terms ‘create string from ienumerable’ appear as hits on my blog statistics quite frequently. The way I create a string from an IEnumerable is in the Example Code section below.

Special Votes of Thanks and Referenced Articles:

Whilst, the solution to the problem I present here is not the same as these articles present, these articles did push me down some of the correct paths. One thing that these articles helped me with was getting the ‘shape’ of the returned Cartesian product ‘right’ (correct by my judgement). The ‘shape’ I settled on was an IEnumerable, or a sequence of elements, each element being an IEnumerable with on element from each of the input IEnumerable objects.

The CartesianProduct Extension Method

In the following sections, I present the design, implementation, and some sample code, that describes the development process resulting in the CartesianProduct extension method.

CartesianProduct – Design Goals

There were numerous of design features that I wished to include in the implementation of CartesianProduct method. These features included:

The order of the resulting Cartesian product elements must be in the same order as the input sequences. This may well be a defining attribute of a Cartesian product, but I include this characteristic as a design goal anyway. This is a mandatory design criterion.

The returned set of Cartesian product elements should be as generic as possible. This is a mandatory design criterion.

The returned set of Cartesian product elements should be easily processed using Linq. This is a mandatory design criterion.

The method’s implementation must be as an extension method. This is a mandatory design criterion.

The method’s implementation must be an extension method that operates like any other Linq extension method (see: IEnumerable(Of T) Interface). This allows the users of the extension method to chain it within a set of other calls to Linq extension methods calls. This allows the user to assemble Linq transformations that they require. This is a mandatory design criterion.

The method should accept, as generically as possible, the set of sequences that used to generate the Cartesian product. This is a mandatory design criterion.

The method should use the Linq Join extension method to build the Cartesian product. If had to resort to the Linq Aggregate extension method, then the two articles I have referenced above demonstrate that approach. Plagiarising those works, and representing there content here, was not an acceptable outcome. This was a highly desirable, leaning heavily to mandatory, design criterion.

CartesianProduct – Function Signature

The following is the function signature for the CartesianProduct extension method. Many features of the function signature directly address the design goals.

The return type IEnumerable<IEnumerable<TSource>> addresses the design goals for a generic return type, and a return type which can be processed with Linq. If it makes it easier to understand, you can interpret this type as an outer list of inner lists, each inner list being list made up of type the TSource objects. The following diagram attempts to show graphically what this ‘looks like’.

Figure 1 – IEnumerable of IEnumerable of T

The CartesianProduct<TSource> part of the method signature identifies this method as a parameterised type method (see MSDN article: Generic Methods (C# Programming Guide) for further information).declaration serves the design goal of requiring the method to accept as generically as possible inputs, and produce a generic output.

The argument this IEnumerable<TSource> source serves a couple of the design goals. Primarily, it flags this method as an extension method that should be attached, or associated, with any object that implements the IEnumerable(of T) Interface. Secondly, this argument identifies the first sequence used to assemble the Cartesian product.

The argument IEnumerable<IEnumerable<TSource>> inputs identifies the set of sequences to be processed to form the output Cartesian product. As with the return type, a loose interpretation is an outer list that contains inner list of objects. This satisfies the any number of input sequences design goal.

CartesianProduct – Implementation

The following is the implementation of the CartesianProduct extension method. I will detail many of the significant features after the source code.

This implementation satisfies the design gaol of does preserve of the order of the input sequences and reproduces that order in output Cartesian product.

This implementation satisfies the design gaol of being generic and catering for any input type. There are some nasty hoops the caller needs to jump through when mixing objects a with C# value types. Although, it may be nasty it does work. I have an example that shows what is required.

This is an implementation uses recursion, see the call to CartesianProduct in the Join contained in the else case. The version of recursion that is utilised here is tail recursion. The removal an element from the inputs parameter at each level of recursion controls the recursion. This technique of removing an element from a control sequence until nothing is left is an approach to controlling recursion I have used frequently.

There are two overloads of the helper function ToIEnumerable. These simply wrap an IEnumerable around the objects passed in. Using these functions allows the simple assembly of the result IEnumerable of objects with the Linq Union extension method.

There is one point worth noting about these methods. The implementation may not be the most efficient way to achieve the goal. The use of the yield return results in C# compiler implementing a ‘bunch of code’ under the covers. My justification for the implementation using yield return is simple. It results in very simple code for anyone who needs to maintain, or modify, the code.