Strongly typed PowerShell CSV parser

Somehow I missed the PowerShell boat. I’ve been a .NET developer for years and I trudged through using the boring old cmd terminal, frequently mumbling about how much I missed zsh. But something snapped and I decided to really dive into PowerShell and learn why those who use it love the hell out of it. Once I realized that everyone loves it because everything is strongly typed and you can use .NET right in your shell, I was totally sold.

My first forays into PowerShell included customizing the shell environment. First I got ConEmu and made it look nice and pretty. Next was to get an ls highlighting module, since I love that about unix shells.

I set up a few fun aliases in my profile and felt ready to conquer the world! My next experiment was to try and create an actual binary cmdlet. I figured, what better way than to create a CSV reader? Now, I realize there is already an Import-Csv cmdlet that turns CSV rows into objects, but I figured I’d write one from scratch, since apparently that’s what I tend to do (instead of inventing anything new).

My hope was to make it so that it would emit strongly typed objects (which it does), but fair warning: you don’t get IntelliSense on them in the shell. That’s because the types are generated at runtime, not compile time.

The Plan

At first I thought I’d just wrap the F# CSV type provider, but I realized that the type provider needs a compile-time sample to generate its internal data classes. That won’t do here, because the cmdlet needs to accept any arbitrary CSV file and strongly type it at runtime.

To solve that, I figured I could leverage the FSharp.Data CSV library to do the actual CSV parsing, and then emit bytecode at runtime to create data classes representing the header values.

As emitting bytecode is a pain in the ass, I wanted to keep my data classes simple: one public field per header, plus a constructor that assigns them, since that’s the bare minimum PowerShell needs to display the type.
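To make it concrete, here’s a hypothetical two-column CSV and the bare-minimum class shape the emitter needs to produce for it (the file contents and all names here are just for illustration):

```fsharp
// Hypothetical CSV (contents for illustration only):
//   Name,Age
//   Anton,30
//
// The bare-minimum class for it: public fields named after the
// headers, plus a constructor that assigns them.
type CsvRow =
    val Name : string
    val Age : string
    new (name, age) = { Name = name; Age = age }
```

PowerShell can display any object with public fields, so nothing more (no properties, no interfaces) is required.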

Emitting bytecode

First, let’s look at the final result of what we need. The best way to do this is to compile a sample type into an assembly and then use Ildasm (an IL disassembler) to view the bytecode the compiler generates for it.
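For example, a simple one-field class like this (the class itself is just an illustration) disassembles to only a handful of constructor opcodes:

```fsharp
// A simple one-field class to disassemble:
type Sample =
    val Value : string
    new (value) = { Value = value }

// Running Ildasm over the compiled assembly shows the constructor
// boils down to roughly:
//
//   ldarg.0                                // load 'this'
//   call instance void [mscorlib]System.Object::.ctor()
//   ldarg.0                                // load 'this' again
//   ldarg.1                                // load the first argument
//   stfld string Sample::Value             // store it into the field
//   ret
```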

While I didn’t just divine how to write bytecode by looking at the IL (I followed some other blog posts), when I got an “invalid program” CLR runtime error, it was nice to be able to compare what I was emitting with what I expected to emit. This way simple errors (like forgetting to load something on the stack) became pretty apparent.

To emit the proper bytecode, we need a few boilerplate items: an assembly name, an assembly builder, a module builder, a type builder, and field builders. These carry the metadata you need before you can finally emit your built type.
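A sketch of that boilerplate (identifier names are mine; I’m also using the static AssemblyBuilder.DefineDynamicAssembly from modern .NET, where on the full framework this hung off AppDomain.CurrentDomain):

```fsharp
open System
open System.Reflection
open System.Reflection.Emit

// Everything needed to get from "nothing" to "a type we can add
// fields to". Run-only access: we never save this assembly to disk.
let assemblyName = AssemblyName("DynamicCsvTypes")

let assemblyBuilder =
    AssemblyBuilder.DefineDynamicAssembly(assemblyName, AssemblyBuilderAccess.Run)

let moduleBuilder = assemblyBuilder.DefineDynamicModule("MainModule")

let typeBuilder =
    moduleBuilder.DefineType("CsvRow", TypeAttributes.Public ||| TypeAttributes.Class)

// One field builder per CSV header; every field is typed as string.
let fieldBuilder =
    typeBuilder.DefineField("Name", typeof<string>, FieldAttributes.Public)
```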

None of this is really all that interesting and hopefully is self explanatory.

The fieldBuilder is important since that will let us declare our fields. In fact, once we’ve declared the fields using the builder, the only bytecode we have to emit is the constructor (which accepts arguments and assigns them to the fields).
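A sketch of that constructor emission, assuming string-typed fields and with helper names of my own choosing:

```fsharp
open System
open System.Reflection
open System.Reflection.Emit

// Given a type builder and the fields we declared on it, emit a
// constructor that takes one string argument per field and assigns it.
let emitConstructor (typeBuilder: TypeBuilder) (fields: FieldBuilder list) =
    let argTypes = fields |> List.map (fun _ -> typeof<string>) |> Array.ofList
    let ctor =
        typeBuilder.DefineConstructor(
            MethodAttributes.Public, CallingConventions.Standard, argTypes)
    let il = ctor.GetILGenerator()

    // call the base System.Object constructor on 'this'
    il.Emit(OpCodes.Ldarg_0)
    il.Emit(OpCodes.Call, typeof<obj>.GetConstructor(Type.EmptyTypes))

    // for each field: load 'this', load the matching argument, store it
    fields
    |> List.iteri (fun i field ->
        il.Emit(OpCodes.Ldarg_0)
        il.Emit(OpCodes.Ldarg, i + 1)   // argument 0 is 'this'
        il.Emit(OpCodes.Stfld, field))

    il.Emit(OpCodes.Ret)
```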

It’s important to note that values is an obj []. Because it’s an object array, we can pass it to the Activator.CreateInstance overload that takes a params obj[], which treats each object in the array as another argument to the constructor.
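A small illustration, using an ordinary class to stand in for a generated one:

```fsharp
open System

// Person stands in for a runtime-generated type here.
type Person =
    val Name : string
    val Age : string
    new (name, age) = { Name = name; Age = age }

// Because values is an obj[], the params obj[] overload of
// Activator.CreateInstance spreads its elements across the
// constructor arguments, equivalent to Person("Anton", "30").
let values : obj[] = [| box "Anton"; box "30" |]
let instance = Activator.CreateInstance(typeof<Person>, values) :?> Person
```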

Dynamic static typing of CSVs

Now that there is a way to dynamically create classes at runtime, it’s easy to leverage it for the CSV strong typing. In fact, the entire reader boils down to parsing the headers, emitting a matching type, and handing you back a list of strongly typed entries:
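A sketch of what that reader can look like. To keep it self-contained I’ve swapped FSharp.Data’s CsvFile parsing for a naive comma split, and the helper names are mine, not necessarily the original’s:

```fsharp
open System
open System.Reflection
open System.Reflection.Emit

// Build a type with one public string field per header, plus a
// constructor assigning them (the same emission shown earlier).
let private defineRowType (moduleBuilder: ModuleBuilder) (headers: string[]) =
    let randomName () = "CsvRow_" + Guid.NewGuid().ToString("N")
    let tb = moduleBuilder.DefineType(randomName (), TypeAttributes.Public)
    let fields =
        headers
        |> Array.map (fun h -> tb.DefineField(h, typeof<string>, FieldAttributes.Public))
    let ctor =
        tb.DefineConstructor(
            MethodAttributes.Public, CallingConventions.Standard,
            headers |> Array.map (fun _ -> typeof<string>))
    let il = ctor.GetILGenerator()
    il.Emit(OpCodes.Ldarg_0)
    il.Emit(OpCodes.Call, typeof<obj>.GetConstructor(Type.EmptyTypes))
    fields
    |> Array.iteri (fun i f ->
        il.Emit(OpCodes.Ldarg_0)
        il.Emit(OpCodes.Ldarg, i + 1)
        il.Emit(OpCodes.Stfld, f))
    il.Emit(OpCodes.Ret)
    tb.CreateType()

// Parse the CSV text, emit a type for its headers, and return one
// strongly typed instance per data row.
let readCsv (text: string) : obj list =
    let lines =
        text.Split([| '\n' |], StringSplitOptions.RemoveEmptyEntries)
        |> Array.map (fun l -> l.TrimEnd('\r'))
    let split (l: string) = l.Split(',')   // naive; the real thing uses FSharp.Data
    let headers = split lines.[0]
    let asm =
        AssemblyBuilder.DefineDynamicAssembly(
            AssemblyName("CsvTypes"), AssemblyBuilderAccess.Run)
    let rowType = defineRowType (asm.DefineDynamicModule("Rows")) headers
    lines
    |> Array.skip 1
    |> Array.map (fun line -> Activator.CreateInstance(rowType, split line |> Array.map box))
    |> List.ofArray
```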

The randomName() is a silly workaround to make sure I don’t create the same type twice in an assembly. Each time you run the CSV reader it creates a new, randomly named type representing that CSV’s data. I could have optimized this so that a call with the same list of headers re-uses the existing type instead of creating a duplicate, but oh well.

Using the reader from the cmdlet

Like I mentioned in the beginning, there is a major flaw here. The issue is that since my types are generated at runtime (which was really fun to do), the shell can’t help me at all. Cmdlets need to expose their output types via an OutputType attribute, and since attribute arguments are fixed at compile time, I can’t expose the type dynamically.
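The cmdlet itself is then just a thin wrapper. A sketch of what it might look like (the verb, class, and parameter names are illustrative, CsvReader.readAllTyped is a hypothetical stand-in for the reader’s entry point, and this needs the PowerShell SDK to compile):

```fsharp
open System.IO
open System.Management.Automation

// Inheriting PSCmdlet (rather than Cmdlet) gives us access to the
// provider path APIs used to expand wildcards like some*.
[<Cmdlet(VerbsCommunications.Read, "Csv")>]
type ReadCsvCommand() =
    inherit PSCmdlet()

    [<Parameter(Position = 0, Mandatory = true)>]
    member val Path : string = "" with get, set

    override this.ProcessRecord() =
        // GetResolvedProviderPathFromPSPath expands a wildcard
        // pattern into every matching file path.
        let resolved, _ = this.GetResolvedProviderPathFromPSPath(this.Path)
        for file in resolved do
            // hand each file to the reader, then push the typed rows
            // into the pipeline as an array (enumerated)
            let rows = CsvReader.readAllTyped (File.ReadAllText file)
            this.WriteObject(Seq.toArray rows, true)
```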

This reads in a file name (or a wildcard pattern) and leverages the inherited PSCmdlet class to resolve the path from the passed-in file (or expand any wildcarded files like some*). All we do now is pass each file stream to the reader, convert the result to an array, and pass it to the next item in the PowerShell pipeline.

See it in action

Maybe this whole exercise was overkill, but let’s finish it out anyways. Let’s say we have a CSV like this:
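Something along these lines, with made-up sample data and a hypothetical cmdlet name:

```
Name,Age
Anton,30
Faisal,31
```

```powershell
PS> Read-Csv .\people.csv | Get-Member
# Get-Member should list the randomly named generated type with its
# Name and Age fields, and rows dot-access like any .NET object:
PS> (Read-Csv .\people.csv)[0].Name
```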

Cleanup

After getting draft one done, I thought about the handling of the IL generator in the Data Emitter. There are two things I wanted to accomplish:

1. Clean up having to thread the generator reference through all the functions
2. Clean up passing an auto-incremented index to the field initializer

After some mulling I realized that a computation expression to handle the seeded state would be perfect for both scenarios. We can create an IlBuilder computation expression that holds onto the reference of the generator and passes it to any function used with do! syntax. We can do the same for the auto-incremented index with a different builder. Let me show you the final result and then the builders:
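Here’s a sketch of what the IlBuilder can look like, with a small usage example (the helper names are mine, not necessarily the original’s; the index builder would follow the same pattern with a mutable counter):

```fsharp
open System
open System.Reflection
open System.Reflection.Emit

// Each emission step is just a function waiting on the generator.
type IlStep = ILGenerator -> unit

// The builder holds the ILGenerator reference and feeds it to every
// step used with do!, so the helper functions keep their signatures
// and never see the generator at the call site.
type IlBuilder(il: ILGenerator) =
    member _.Bind(step: IlStep, cont: unit -> unit) = step il; cont ()
    member _.Zero() = ()

// Emission helpers written as IlSteps:
let callBaseCtor : IlStep = fun il ->
    il.Emit(OpCodes.Ldarg_0)
    il.Emit(OpCodes.Call, typeof<obj>.GetConstructor(Type.EmptyTypes))
let loadThis : IlStep = fun il -> il.Emit(OpCodes.Ldarg_0)
let loadArg (i: int) : IlStep = fun il -> il.Emit(OpCodes.Ldarg, i)
let storeField (f: FieldInfo) : IlStep = fun il -> il.Emit(OpCodes.Stfld, f)
let ret : IlStep = fun il -> il.Emit(OpCodes.Ret)

// Usage: emitting a one-string-field constructor with the workflow.
let buildSampleType () =
    let asm =
        AssemblyBuilder.DefineDynamicAssembly(
            AssemblyName("IlBuilderDemo"), AssemblyBuilderAccess.Run)
    let tb = asm.DefineDynamicModule("M").DefineType("Row", TypeAttributes.Public)
    let field = tb.DefineField("Name", typeof<string>, FieldAttributes.Public)
    let ctor =
        tb.DefineConstructor(
            MethodAttributes.Public, CallingConventions.Standard, [| typeof<string> |])
    let emit = IlBuilder(ctor.GetILGenerator())
    emit {
        do! callBaseCtor
        do! loadThis
        do! loadArg 1
        do! storeField field
        do! ret
    }
    tb.CreateType()
```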

Now all mutability and state are contained in the expression. I think this is a much cleaner implementation, and the functions I used in the builder workflow didn’t have to have their signatures changed!

Conclusion

Sometimes you just jump in and don’t realize the end goal won’t work, but I learned a whole lot figuring this out, so the time wasn’t wasted.