Enumerators Tutorial Part 1: Iteratee

September 30, 2010

Michael Snoyman

This content is now part of the Yesod book. I recommend reading it there, since that version is more up to date.

Introduction

One of the up-and-coming patterns in Haskell is enumerators. Unfortunately, it's very difficult to get started with them since:

There are multiple implementations, all with slightly different approaches.

Some of the implementations (in my opinion) use incredibly confusing naming.

The tutorials that get written usually don't directly target an existing implementation, and work more on building up intuition than giving instructions on how to use the library.

I'm hoping that this tutorial will fill the gap a bit. I'm going to be basing this on the enumerator package. I'm using version 0.4.0.2, but this should be applicable to older and hopefully newer versions as well. This package is newer and less used than the iteratee package, but I've chosen it for three reasons:

That said, both packages are built around the same basic principles, so learning one will definitely help you with the other.

Three Parts

The title of this post says this is part 1. In theory, there will be three parts (though I may do more or less, I'm not certain yet). There are really three main concepts to learn to use the enumerator package: iteratees, enumerators and enumeratees. A basic definition would be:

Iteratees are consumers: they are fed data and do something with it.

Enumerators are producers: they feed data to an iteratee.

Enumeratees are pipes: they are fed data from an enumerator and then feed it to an iteratee.

What good are enumerators?

But before you really get into this library, let me give some motivation for why we would want to use it. Here are some real-life examples of what I use the enumerator package for:

When reading values from a database, I don't necessarily want to pull all records into memory at once. Instead, I would like to have them fed to a function which will consume them bit by bit.

When processing a YAML file, instead of reading in the whole structure, maybe you only need to grab the value of one or two records.

If you want to download a file via HTTP and save the results in a file, it would be a waste of RAM to store the whole file in memory and then write it out. Enumerators let you perform interleaved IO actions easily.

A lot of these problems can also be solved using lazy I/O. However, lazy I/O is not necessarily a panacea: you might want to read some of Oleg's stuff on the pitfalls of lazy I/O.

It's fairly annoying to have to write two completely separate sum functions just because our data source changed. Ideally, we would like to generalize things a bit. Let's start by noticing a similarity between these two functions: they both only yield a value when they are informed that there are no more numbers. In the case of sum1, we check for an empty list; in sum2, we check for Nothing.
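As a reference point for this comparison, here is a minimal sketch of two such functions: sum1 sums a list that is already in memory, and sum2 sums numbers read from the user one at a time. The helper getNum, and the convention that any line that is not a number ends the input, are assumptions made for this illustration.

```haskell
import Text.Read (readMaybe)

-- Sum a list that is already fully in memory.
-- The result only becomes available once we hit the empty list.
sum1 :: [Int] -> Int
sum1 []     = 0
sum1 (x:xs) = x + sum1 xs

-- Hypothetical helper: read one line from stdin. A line that parses as a
-- number gives Just that number; anything else is taken to mean "no more
-- input" and gives Nothing.
getNum :: IO (Maybe Int)
getNum = fmap readMaybe getLine

-- Sum numbers pulled from the user, one at a time.
-- The result only becomes available once getNum returns Nothing.
sum2 :: IO Int
sum2 = do
    maybeNum <- getNum
    case maybeNum of
        Nothing  -> return 0
        Just num -> do
            rest <- sum2
            return (num + rest)
```

Both functions have the same shape: keep accumulating until an "end of input" signal arrives, then produce the total.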

The Stream datatype

The first datatype defined in the enumerator package is:

data Stream a = Chunks [a] | EOF

The EOF constructor indicates that no more data is available. The Chunks constructor simply allows us to put multiple pieces of data together for efficiency. We could now rewrite sum2 to use this Stream datatype:
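A minimal sketch of such a rewrite follows. Stream is redefined locally so the example compiles on its own. The function is called sum4 and the data source getNum only to line up with the names used in the surrounding text; toStream and the "non-number means end of input" convention are assumptions of mine.

```haskell
import Text.Read (readMaybe)

-- Local copy of the datatype shown above.
data Stream a = Chunks [a] | EOF

-- Hypothetical helper: turn one line of user input into a Stream value.
-- A line that is not a number is assumed to mean "no more input".
toStream :: String -> Stream Int
toStream line = case readMaybe line of
    Nothing  -> EOF
    Just num -> Chunks [num]

-- Hypothetical data source: read one line from stdin as a Stream.
getNum :: IO (Stream Int)
getNum = fmap toStream getLine

-- Pull Stream values from getNum until EOF, summing every chunk on the way.
sum4 :: IO Int
sum4 = do
    stream <- getNum
    case stream of
        EOF         -> return 0
        Chunks nums -> do
            rest <- sum4
            return (sum nums + rest)
```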

That's all well and good, but let's pretend we want to have two data sources to sum over: values the user enters on the command line, and some numbers we read over an HTTP connection, perhaps. The problem here is one of control: sum4 is running the show here by calling getNum. This is a pull data model. Enumerators use an inversion-of-control (push) model, putting the enumerator in charge. This allows cool things like feeding in multiple data sources, and also makes it easier to write enumerators that properly deal with resource allocation.

The Step datatype

So we need a new datatype that will represent the state of our summing operation. We're going to allow our operations to be in one of three states:

Waiting for more data.

Already calculated a result.

For convenience, we also have an error state. This isn't strictly necessary (it could be modeled by choosing an EitherT kind of monad, for example), but it's simpler.

As you might guess, these states correspond to the three constructors of the Step datatype. The error state is modeled by Error SomeException, building on top of Haskell's extensible exception system. The already calculated constructor is:

Yield b (Stream a)

Here, a is the input to our iteratee and b is the output. This constructor allows us to simultaneously produce a result and save any "leftover" input for another iteratee that may run after us. (This won't be the case with the sum function, which always consumes all its input, but we'll see some other examples that do no such thing.)

Now the question is how to represent the state of an iteratee that's waiting for more data. You might at first want to declare some datatype to represent the internal state and pass that around somehow. That's not how it works: instead, we simply use a function (very Haskell of us, right?):

Continue (Stream a -> Iteratee a m b)

Eureka! We've finally seen the Iteratee datatype! Actually, Iteratee is a very boring datatype that is only present to allow us to declare cool instances (e.g., Monad) for our functions. Iteratee is defined as:

newtype Iteratee a m b = Iteratee (m (Step a m b))

This is important: Iteratee is just a newtype wrapper around a Step inside a monad. Just keep that in mind as you look at definitions in the enumerator package. So knowing this, we can think of the Continue constructor as:

Continue (Stream a -> m (Step a m b))

That's much easier to approach: that function takes some input data and returns a new state of the iteratee. Let's see what our sum function would look like using this Step datatype:
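Here is a minimal sketch of a summing iteratee written directly against these constructors. The three datatypes are redefined locally, assembled from the pieces shown above, so that the example compiles without the package; the record field runIteratee is there only to unwrap the newtype. The name sum5 is chosen to line up with the later text, and the body is my own reconstruction rather than a verbatim listing.

```haskell
import Control.Exception (SomeException)

-- Local copies of the datatypes discussed above.
data Stream a = Chunks [a] | EOF

data Step a m b
    = Continue (Stream a -> Iteratee a m b)  -- waiting for more data
    | Yield b (Stream a)                     -- result plus leftover input
    | Error SomeException                    -- something went wrong

newtype Iteratee a m b = Iteratee { runIteratee :: m (Step a m b) }

sum5 :: Monad m => Iteratee Int m Int
sum5 = Iteratee . return . Continue $ go 0
  where
    -- The first argument is the running sum: the state of the iteratee.
    go :: Monad m => Int -> Stream Int -> Iteratee Int m Int
    -- More input: add it to the running sum and keep waiting.
    go runningSum (Chunks nums) =
        Iteratee . return . Continue $ go (runningSum + sum nums)
    -- End of input: yield the total, with no data left over.
    go runningSum EOF =
        Iteratee . return $ Yield runningSum EOF
```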

The first real line (Continue $ go 0) initializes our iteratee to its starting state. Just like every other sum function, we need to explicitly state that we are starting from 0 somewhere. The real workhorse is the go function. Notice how we are really passing the state of the iteratee around as the first argument to go: this is also a very common pattern in iteratees.

We need to handle two different cases: when handed an EOF, the go function must Yield a value. (Well, it could also produce an Error value, but it definitely cannot Continue.) In that case, we simply yield the running sum and say there was no data left over. When we receive some input data via Chunks, we simply add it to the running sum and create a new Continue based on the same go function.

Now let's work on making that function a little bit prettier by using some built-in helper functions. The pattern Iteratee . return is common enough to warrant a helper function, namely:
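The sketch below shows what such helpers can look like and how they tidy up the summing iteratee. As far as I know the package exports helpers named returnI, yield and continue; they are defined locally here so the example compiles on its own, so treat these exact definitions, and the body of sum6, as assumptions rather than the package's source.

```haskell
import Control.Exception (SomeException)

-- Local copies of the datatypes discussed above.
data Stream a = Chunks [a] | EOF

data Step a m b
    = Continue (Stream a -> Iteratee a m b)
    | Yield b (Stream a)
    | Error SomeException

newtype Iteratee a m b = Iteratee { runIteratee :: m (Step a m b) }

-- Wrap a Step into an Iteratee: the "Iteratee . return" pattern.
returnI :: Monad m => Step a m b -> Iteratee a m b
returnI = Iteratee . return

-- Produce a result together with any leftover input.
yield :: Monad m => b -> Stream a -> Iteratee a m b
yield x extra = returnI (Yield x extra)

-- Ask for more input, to be handled by the given function.
continue :: Monad m => (Stream a -> Iteratee a m b) -> Iteratee a m b
continue k = returnI (Continue k)

-- The same summing iteratee, now without explicit wrapping.
sum6 :: Monad m => Iteratee Int m Int
sum6 = continue (go 0)
  where
    go :: Monad m => Int -> Stream Int -> Iteratee Int m Int
    go runningSum (Chunks nums) = continue (go (runningSum + sum nums))
    go runningSum EOF           = yield runningSum EOF
```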

Monad instance for Iteratee

This is all very nice: we now have an iteratee that can be fed numbers from any monad and sum them. It can even take input from different sources and sum them together. (By the way, I haven't actually shown you how to feed those numbers in: that is in part 2 about enumerators.) But let's be honest: sum5 is an ugly function. Isn't there something easier?

In fact, there is. Remember how I said Iteratee really just existed to facilitate typeclass instances? This includes a monad instance. Feel free to look at the code to see how that instance is defined, but here we'll just look at how to use it:
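The following sketch shows two iteratees written with do-notation: sum7, which sums its input one element at a time, and printAll, which prints every line it receives. To keep it self-contained I redefine the datatypes and write simplified Functor, Applicative, Monad and MonadIO instances myself; they are an approximation for illustration, not the package's code. head here is a local stand-in for the package's function of the same name, and the names sum7 and printAll are hypothetical.

```haskell
import Prelude hiding (head)
import Control.Exception (SomeException)
import Control.Monad.IO.Class (MonadIO, liftIO)

data Stream a = Chunks [a] | EOF

data Step a m b
    = Continue (Stream a -> Iteratee a m b)
    | Yield b (Stream a)
    | Error SomeException

newtype Iteratee a m b = Iteratee { runIteratee :: m (Step a m b) }

-- Simplified instances written for this sketch (an assumption, not the
-- package's own definitions).
instance Monad m => Functor (Iteratee a m) where
    fmap f it = it >>= (pure . f)

instance Monad m => Applicative (Iteratee a m) where
    pure x = Iteratee (return (Yield x (Chunks [])))
    itF <*> itX = itF >>= \f -> fmap f itX

instance Monad m => Monad (Iteratee a m) where
    it >>= f = Iteratee $ do
        step <- runIteratee it
        case step of
            -- Still waiting: bind once the continuation has been fed.
            Continue k          -> return (Continue (\s -> k s >>= f))
            Error err           -> return (Error err)
            -- Finished with nothing left over: run the next iteratee.
            Yield x (Chunks []) -> runIteratee (f x)
            -- Finished with leftovers: hand them to the next iteratee.
            Yield x extra       -> do
                step' <- runIteratee (f x)
                case step' of
                    Continue k' -> runIteratee (k' extra)
                    Error err   -> return (Error err)
                    Yield y _   -> return (Yield y extra)

instance MonadIO m => MonadIO (Iteratee a m) where
    liftIO io = Iteratee (liftIO io >>= \x -> return (Yield x (Chunks [])))

-- Local stand-in for head: take one element, or Nothing at end of input.
head :: Monad m => Iteratee a m (Maybe a)
head = Iteratee (return (Continue go))
  where
    go (Chunks [])     = head
    go (Chunks (x:xs)) = Iteratee (return (Yield (Just x) (Chunks xs)))
    go EOF             = Iteratee (return (Yield Nothing EOF))

-- Sum the input by repeatedly asking for the next element.
sum7 :: Monad m => Iteratee Int m Int
sum7 = do
    maybeNum <- head
    case maybeNum of
        Nothing -> return 0
        Just i  -> do
            rest <- sum7
            return (i + rest)

-- Print every line received; only the side effects matter.
printAll :: MonadIO m => Iteratee String m ()
printAll = do
    maybeLine <- head
    case maybeLine of
        Nothing   -> return ()
        Just line -> do
            liftIO (putStrLn line)
            printAll
```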

The liftIO function comes from the transformers package, and simply promotes an action in the IO monad to any arbitrary MonadIO action. Notice how we don't really track any state with this iteratee: we don't care about its result, only its side effects.

Like our sum6 function, this also wraps an inner "go" function with a continue. However, we now have three clauses for our go function. The first handles the case of Chunks []. To quote the enumerator docs:

(Chunks []) is used to indicate that a stream is still active, but currently has no available data. Iteratees should ignore empty chunks.

The second clause handles the case where we are given some data. In this case, we yield the first element in the list, and return the rest as leftover data. The third clause handles the end of input by returning Nothing.
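The three clauses described here can be sketched as follows. The datatypes and the continue and yield helpers are redefined locally so the example compiles on its own; the body of head is my reconstruction from the description, not a copy of the package's source.

```haskell
import Prelude hiding (head)
import Control.Exception (SomeException)

data Stream a = Chunks [a] | EOF

data Step a m b
    = Continue (Stream a -> Iteratee a m b)
    | Yield b (Stream a)
    | Error SomeException

newtype Iteratee a m b = Iteratee { runIteratee :: m (Step a m b) }

-- Local versions of the helper functions.
yield :: Monad m => b -> Stream a -> Iteratee a m b
yield x extra = Iteratee (return (Yield x extra))

continue :: Monad m => (Stream a -> Iteratee a m b) -> Iteratee a m b
continue k = Iteratee (return (Continue k))

head :: Monad m => Iteratee a m (Maybe a)
head = continue go
  where
    -- Empty chunk: the stream is still active, so keep waiting.
    go (Chunks [])     = continue go
    -- Some data: yield the first element, hand back the rest as leftover.
    go (Chunks (x:xs)) = yield (Just x) (Chunks xs)
    -- End of input: there is no element to return.
    go EOF             = yield Nothing EOF
```

Unlike the sum functions, this iteratee does not consume all of its input: whatever follows the first element is returned in the Yield so that another iteratee can pick it up.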

Exercises

Implement the consume function using first the high-level functions like head, and then using only low-level stuff.

Write a modified version of consume that only keeps every other value, once again using high-level functions and then low-level constructors.

Next time

Well, now you can write iteratees, but they're not very useful if you can't actually use them. Next time we'll cover what an enumerator is, some basic enumerators included with the package, how to run these things and how to write your own enumerator.

Summary

Here is what I consider most important to glean from this tutorial:

Iteratee is a simple wrapper around the Step datatype to allow for cool typeclass instances.

Using the Monad instance of Iteratee can allow you to build up complicated iteratees from simpler ones.

The three states an iteratee can be in are Continue (still processing data), Yield (a result is ready) and Error (duh).

Well-behaved iteratees will never return a Continue after receiving an EOF.