Wednesday, February 5, 2014

pipes-parse-3.0: Lens-based parsing

pipes-parse-3.0.0 introduces a new lens-based parsing mechanism. These lenses improve the library in two ways:

They greatly simplify the API

They correctly propagate leftovers further upstream

Overview

The new parsing API consists of three main types, which roughly parallel the three pipes abstractions:

Producers, which are unchanged from pipes

Producer a m x

Parsers, which are the parsing analog of Consumers:

type Parser a m r = forall x . StateT (Producer a m x) m r

Lens'es between Producer's, which are the parsing analog of Pipes:

Lens' (Producer a m x) (Producer b m y)

What's neat is that pipes-parse does not need to provide any operators to connect these three abstractions. All of the tools you need already exist in either transformers and lens (or lens-family-core, if you prefer a simpler lens alternative).

For example, you connect a Parser to a Producer using either runStateT, evalStateT, or execStateT:

Leftovers

There is a subtle difference between pipes-parse and conduit: pipes-parse correctly propagates leftovers further upstream whereas conduit does not. To illustrate this, let's begin from the following implementation of peek for conduit:

The internal loop function in splitAt corresponds to the internal loop from isolate. The extra magic lies within the first line:

splitAt n0 k p0 = fmap join (k (loop n0 p0))

Lens aficionados will recognize this as the dependency-free version of:

splitAt n0 = iso (loop n0) join

This not only instructs the Lens in how to isolate out the first ten elements (using loop), but also how to reverse the process and merge the remaining elements back in (using join). This latter information is what makes leftover propagation work.

Getters

Note that not all transformations are reversible and therefore cannot propagate leftovers upstream. Fortunately, the lens model handles this perfectly: just define a Getter instead of a Lens' for transformations that are not reversible:

getter :: Getter (Producer a m x) (Producer b m y)

A Getter will only type-check as an argument to (^.) and not zoom, so you cannot propagate unDraw through a Getter like you can with a Lens':

zoom getter unDraw -- Type error

However, you can still use the Getter to transform Producers:

producer ^. getter -- Type checks!

This provides a type-safe way to distinguish transformations that can propagate leftovers from those that cannot.

In practice, pipes-based parsing libraries just provide functions between Producers instead of Getters for simplicity:

getter :: Producer a m x -> Producer b m y

... but you can keep using lens-like syntax if you promote them to Getters using the to function from lens:

to :: (a -> b) -> Getter a b
producer ^. to getter

Lens Support

Some people may worry about the cost of using lens in conjunction with pipes-parse because lens is not beginner-friendly and has a large dependency graph, so I'd like to take a moment to advertise Russell O'Connor's lens-family-core library. lens-family-core is a beginner-friendly lens-alternative that is (mostly) lens-compatible. It provides much simpler, beginner-friendly types and has a really tiny dependency profile.

Note that pipes-parse does not depend on either lens library. This is one of the beautiful aspects about lenses: you can write a lens-compatible library using nothing more than stock components from the Prelude and the transformers library. You can therefore combine pipes-parse with either lens-family-core or lens without any conflicts. This provides a smooth transition path from the beginner-friendly lens-family-core library to the expert-friendly lens library.

Laws

The lens approach is nice because you get many laws for free. For example, you get several associativity laws, like the fact that (^.) associates with (.):

Limiting one parser to n elements is not the same as limiting its two sub-parsers to n elements each. So if you use pipes-parse lenses you cannot rely on zoom being a monad morphism when doing equational reasoning. This was a tough call for me to make, but I felt that delimiting parsers were more important than the monad morphism laws for zoom. Perhaps there is a more elegant solution that I missed that resolves this conflict, but I'm still pleased with the current solution.

Generality

Notice how all of these functions and laws are completely pipes-independent. Any streaming abstraction that has some Producer-like type can implement lens-based parsing, too, and also get all of these laws for free, including pure streams (such as strict or lazy Text) or even lazy IO. Other streaming libraries can therefore benefit from this exact same trick.

Batteries included

Downstream libraries have been upgraded to use the pipes-parse API, including pipes-bytestring, pipes-binary, and pipes-attoparsec.This means that right now you can do cool things like:

The above parser composes two lenses so that it zooms in on a stream of decoded ints that consume no more than 100 bytes. This will transmute draw to now receive decoded elements and unDraw will magically re-encode elements when pushing back leftovers:

Also, Michael Thompson has released a draft of pipes-text on Hackage. This means you can parse a byte stream through a UTF-8 lens and any undrawn input will be encoded back into the original byte stream as bytes. Here is an example program show-casing this neat feature:

The remainder of the first line is undrawn by the decoder and restored back as the original encoding bytes.

Note that the above example is line-buffered, which is why the program does not output the Text chunks immediately after the 10th input character. However, if you disable line buffering then all chunks have just a single character and the example wouldn't illustrate how leftovers worked.

The above example could have also been written as a single Parser:

parser :: Parser ByteString IO ()
parser = do
texts

... but I wanted to make the leftover passing explicit to emphasize that the leftover behavior holds correctly whether or not you exit and re-enter pipes.

Conclusion

The pipes-parse API lets you propagate leftovers upstream, encode leftover support in the type system, and equationally reason about code with several theoretical laws. Additionally, pipes-parse reuses existing functions and concepts from lens and StateT rather than introducing a new set of abstractions to learn.

Note that pipes-parse used to have a some FreeT-based operations as well. These have been moved to a separate pipes-group library (and upgraded to use lenses) since they are conceptually orthogonal to parsing and I will blog about this library in a separate post.