Wednesday, October 9, 2013

How to reimplement the conduit parsing API in 50 lines of pipes code

Michael's recent blog posts highlighted several deficiencies of pipes-based parsing. In particular, he emphasized that it was incompatible with idioms from the core pipes library, and I agree with that assessment. Programming with pipes-parse is a different beast from programming with vanilla pipes, and pipes-parse idioms more closely resemble conduit idioms.

Several comments in response to Michael's post asked if one could internally implement conduit on top of pipes, in order to simplify conduit internals. This post answers half that question by showing how to implement conduit sans finalization on top of pipes using the tools from pipes-parse.

This code is short, but very dense, so I will walk through the implementation step-by-step, explaining the underlying pipes-parse idioms that I'm using to reconstruct conduit functionality. If you just want to skip to the complete code then go straight to the Appendix at the end of this post.

The Conduit Type

The way you internally represent a conduit-like parser using pipes is the following data type:

To recap, a ConduitM i o m r has an input of type i and an output of type o, and the output is distinct from the return value, just like pipes.

I model this as a Producer of os that reads from and writes to a Producer of is. The Producer of is is our conduit's upstream input end. Awaiting a value pops an element off of this Producer, and adding back a leftover pushes an element back onto this Producer.

This representation differs from conduit's implementation in one important way: it makes no distinction between leftovers and future input. Both are stored together within the inner Producer. This is one neat trick of reifying future input as a Producer: you now have an obvious place to store leftovers.
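A minimal sketch of the representation described above, written against the pipes and transformers packages. The field name unConduitM, the choice to fix the inner Producer's return type to (), and the emptyC value are assumptions made for illustration only:

```haskell
import Control.Monad.Trans.State.Strict (StateT)
import Pipes (Producer)

-- Hypothetical reconstruction: a Producer of outputs `o` layered over a
-- StateT whose state is the upstream Producer of inputs `i`.
-- Leftovers and future input both live in that inner Producer.
newtype ConduitM i o m r = ConduitM
    { unConduitM :: Producer o (StateT (Producer i m ()) m) r }

-- Hypothetical example value: a conduit that consumes and emits nothing.
emptyC :: Monad m => ConduitM i o m ()
emptyC = ConduitM (return ())
```

Because the state is itself a Producer, there is no separate leftover buffer to maintain in this sketch.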

Primitives

The next step is to implement await, which is just a thin wrapper around draw from pipes-parse:
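A minimal sketch of such a wrapper, assuming a recent pipes-parse in which draw returns a Maybe (older releases returned an Either) and assuming the newtype shape sketched from the description earlier. The leftover definition follows the same pattern using unDraw, and demo is a hypothetical usage helper:

```haskell
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.State.Strict (StateT, evalStateT)
import Pipes (Producer, each, runEffect)
import qualified Pipes.Parse as P

-- Assumed representation (field name is hypothetical).
newtype ConduitM i o m r = ConduitM
    { unConduitM :: Producer o (StateT (Producer i m ()) m) r }

-- Pop the next element off the stored input Producer, if any.
await :: Monad m => ConduitM i o m (Maybe i)
await = ConduitM (lift P.draw)

-- Push an element back onto the front of the stored input Producer.
leftover :: Monad m => i -> ConduitM i o m ()
leftover i = ConduitM (lift (P.unDraw i))

-- Hypothetical usage: await, push a different value back, await again.
demo :: IO (Maybe Int, Maybe Int)
demo = evalStateT (runEffect prog) (each [1, 2, 3])
  where
    prog = do
        a <- unConduitM await
        unConduitM (leftover 9)
        b <- unConduitM await
        return (a, b)
```

The second await sees the pushed-back 9 rather than the 2 still waiting in the input, which is the "leftovers and future input share one Producer" behavior.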

The key is the runStateP function from Pipes.Lift, which has the following (simplified) type:

runStateP :: s -> Producer a (StateT s m) r -> Producer a m (r, s)

Compare this with the type for runStateT:

runStateT :: StateT s m r -> s -> m (r, s)

runStateP differs from runStateT in two ways:

runStateP unwraps a StateT buried inside of a pipe

runStateP takes arguments in the opposite order from runStateT

runStateP takes care to thread state as it unwraps the internal StateT, so it behaves just like runStateT. Once you familiarize yourself with how runStateP works, the solution is a matter of type-chasing. In fact, what you will discover is that if you restrict yourself to runStateP, there is only one solution that type-checks.
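To make the state threading concrete, here is a small sketch that is independent of the conduit encoding. The names counter, collect, and demo are hypothetical helpers; runStateP, next, and yield are the real pipes functions:

```haskell
import Control.Monad (replicateM_)
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.State.Strict (StateT, get, put)
import Pipes (Producer, next, yield)
import Pipes.Lift (runStateP)

-- Yields the current counter three times, incrementing it after each yield.
counter :: Monad m => Producer Int (StateT Int m) ()
counter = replicateM_ 3 $ do
    n <- lift get
    yield n
    lift (put (n + 1))

-- Hypothetical helper: drain a Producer into a list plus its return value.
collect :: Monad m => Producer a m r -> m ([a], r)
collect p = do
    step <- next p
    case step of
        Left r -> return ([], r)
        Right (a, rest) -> do
            (as, r) <- collect rest
            return (a : as, r)

-- The StateT layer is gone; the final state comes back with the result.
demo :: IO ([Int], ((), Int))
demo = collect (runStateP 10 counter)
```

Starting from 10, the Producer emits 10, 11, 12 and returns the final state 13 next to the () result, mirroring what runStateT would return for a plain StateT action.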

This now looks just like a parser combinator. It takes an input stream of values of type a and generates an output stream of values of type b, returning unused input alongside the () return value. We're not interested in this () return value, so we'll use execStateP instead:
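The parser-combinator shape can be illustrated with a small sketch. This is not the composition operator itself, only an example of unwrapping one stage with execStateP; doubleTwo, runStage, collect, and demo are hypothetical names, and draw is assumed to return a Maybe as in recent pipes-parse releases:

```haskell
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.State.Strict (StateT)
import Pipes (Producer, each, next, yield)
import Pipes.Lift (execStateP)
import qualified Pipes.Parse as P

-- Hypothetical stage: draw up to two inputs and yield each one doubled.
doubleTwo :: Monad m => Producer Int (StateT (Producer Int m ()) m) ()
doubleTwo = go (2 :: Int)
  where
    go 0 = return ()
    go n = do
        ma <- lift P.draw
        case ma of
            Nothing -> return ()
            Just a  -> yield (a * 2) >> go (n - 1)

-- Input stream in, output stream out; the return value is the unused input.
runStage :: Monad m => Producer Int m () -> Producer Int m (Producer Int m ())
runStage input = execStateP input doubleTwo

-- Hypothetical helper: drain a Producer into a list plus its return value.
collect :: Monad m => Producer a m r -> m ([a], r)
collect p = do
    step <- next p
    case step of
        Left r -> return ([], r)
        Right (a, rest) -> do
            (as, r) <- collect rest
            return (a : as, r)

demo :: IO ([Int], [Int])
demo = do
    (outs, rest) <- collect (runStage (each [1, 2, 3, 4]))
    (unused, ()) <- collect rest
    return (outs, unused)
```

Here the stage consumes 1 and 2, emits 2 and 4, and hands back a Producer that still yields 3 and 4.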

Identity

If that is composition, what is the identity? Why, it's just input from pipes-parse:

idP :: (Monad m) => ConduitM a a m ()
idP = ConduitM (void input)

Neat how that works out. This is equivalent in behavior to:

idP = do
    ma <- await
    case ma of
        Nothing -> return ()
        Just a  -> do
            yield a
            idP

Connect and Resume

Last but not least we need connect and resume. Like I said before, this will ignore finalization concerns, so I will only implement a variation on ($$+) that returns a new Source, rather than a ResumableSource (which is just a Source tagged with a finalizer).
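A minimal sketch of what such a finalizer-free variant could look like, assuming the newtype shape described earlier and using the X type from pipes for the closed output. The signature, headC, collect, and demo are assumptions for illustration, not the original listing:

```haskell
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.State.Strict (StateT)
import Pipes (Producer, X, each, next, runEffect)
import Pipes.Lift (runStateP)
import qualified Pipes.Parse as P

-- Assumed representation (field name is hypothetical).
newtype ConduitM i o m r = ConduitM
    { unConduitM :: Producer o (StateT (Producer i m ()) m) r }

-- Run a sink against a source, returning the result and the unused source.
-- With output type X, the unwrapped Producer is an Effect and can be run.
($$+) :: Monad m => Producer a m () -> ConduitM a X m b -> m (b, Producer a m ())
src $$+ ConduitM sink = runEffect (runStateP src sink)

-- Hypothetical sink: take the first element, if any.
headC :: Monad m => ConduitM a X m (Maybe a)
headC = ConduitM (lift P.draw)

-- Hypothetical helper: drain a Producer into a list plus its return value.
collect :: Monad m => Producer a m r -> m ([a], r)
collect p = do
    step <- next p
    case step of
        Left r -> return ([], r)
        Right (a, rest) -> do
            (as, r) <- collect rest
            return (a : as, r)

demo :: IO (Maybe Int, [Int])
demo = do
    (h, rest) <- each [1, 2, 3] $$+ headC
    (remaining, ()) <- collect rest
    return (h, remaining)
```

The returned Producer plays the role of the resumable source: it can be connected to another sink later and picks up where the first one stopped.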

Conclusion

The purpose of this post is not to suggest that Michael necessarily should implement conduit in terms of pipes, especially since this does not contain finalization code, yet. Rather, I wanted to exhibit that pipes is a powerful tool that you can use to build other abstractions concisely and with less room for error.

Appendix

The minimal test implementation is 50 lines of code, which I've included here: