Sunday, July 19, 2015

Thoughts on Conduits

Summary: I'm growing increasingly fond of the Conduit library. Here I give my intuitions and some hints I'd have found useful.

Recently I've been working on converting the Hoogle database generation to use the Conduit abstraction, in an effort to reduce the memory usage and improve the speed. It worked - database generation has gone from 2Gb of RAM to 320Mb, and the time has dropped from several minutes (or over 15 minutes on memory-constrained machines) to 45s. These changes are all in the context of Hoogle 5, which should hopefully be out in a month or so.

The bit that I've converted to Conduit is something that takes in a tarball containing one text file per Hackage package, namely the Haddock output with one definition per line (a 22Mb file). It processes each definition, saves it to a single binary file (with compression and some processing), and returns some compact information about the definition for later processing. I don't expect the process to run in constant space, as it is accumulating some return information, but it is important that most of the memory used by one definition is released before the next definition. I originally tried lazy IO, and while it somewhat worked, it was hard to abstract properly and very prone to space leaks. Converting to Conduit was relatively easy, and the result is simpler and more robust.

The Conduit model

My mental model for a conduit Conduit a m b is roughly a function [a] -> m [b] - a values go in and b values come out (but interleaved with the monadic actions). More concretely you ask for an a with await and give back a b with yield, doing stuff in the middle in the m Monad.
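To make that concrete, here is a minimal sketch using the Data.Conduit API (doubler is a made-up name for illustration):

```haskell
import Data.Conduit
import qualified Data.Conduit.List as CL
import Data.Functor.Identity (runIdentity)

-- A Conduit Int m Int behaves roughly like [Int] -> m [Int]:
-- ask for each value with await, hand results back with yield.
doubler :: Monad m => Conduit Int m Int
doubler = awaitForever $ \x -> yield (x * 2)

-- Feeding [1,2,3] through doubler produces [2,4,6].
example :: [Int]
example = runIdentity $ runConduit $
    CL.sourceList [1,2,3] =$= doubler =$= CL.consume
```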

A piece of conduit is always either running (doing its actual work), waiting after a yield for the downstream person to ask for more results (with await), or waiting after an await for the upstream person to supply a value (with yield). You can think of a conduit as making explicit the demand-order inherent in lazy evaluation.

Things to know about Conduit

I think it's fair to say Conduit shows its history - this is good for people who have been using it for a while (your code doesn't keep breaking), but bad for people learning it (there are lots of things you don't care about). Here are some notes I made:

The Data.Conduit module in the conduit library is not the right module to use - it seems generally accepted to use the Conduit module from the conduit-combinators package. However, I decided to build my own conduit-combinators style replacement in the Hoogle tree, see General.Conduit - the standard Conduit module has a lot of dependencies, and a lot of generalisations.

Don't use Source or Sink - use Producer and Consumer - the former are just a convenient way to get confusing error messages.

Don't use =$ or $=, always use =$= between conduits. The =$= operator is essentially flip (.).

Given a Conduit you can run it with runConduit. Alternatively, given a =$= b =$= c you can replace any of the =$= with $$ to run the Conduit as well. I find that a bit ugly, and have stuck to runConduit.
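For example, these two ways of running the same pipeline are equivalent (a small sketch with throwaway names):

```haskell
import Data.Conduit
import qualified Data.Conduit.List as CL

-- The same pipeline run with runConduit, and with $$ replacing one =$=.
-- Both produce [2,3,4].
viaRunConduit, viaDollars :: Monad m => m [Int]
viaRunConduit = runConduit $ CL.sourceList [1,2,3] =$= CL.map (+1) =$= CL.consume
viaDollars    = CL.sourceList [1,2,3] $$ CL.map (+1) =$= CL.consume
```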

Conduit and ConduitM have their type arguments in different orders, which is just confusing. However, generally I use either Conduit (a connector) or Producer (something with a result). You rarely need something that is both a connector and has a return value.

You call await to see if a value is available to process. The most common bug I've had with conduits is forgetting to make the function processing items recursive - usually you want awaitForever, not just await.
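As an illustration of the bug, compare these two identity conduits (the names are made up):

```haskell
import Data.Conduit

-- Buggy: await is called once, so only the first item is passed
-- through and the conduit then terminates.
identityOnce :: Monad m => Conduit a m a
identityOnce = do
    x <- await
    case x of
        Nothing -> return ()
        Just v  -> yield v  -- forgot to recurse!

-- Correct: awaitForever keeps asking until upstream is exhausted.
identityAll :: Monad m => Conduit a m a
identityAll = awaitForever yield
```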

The ByteString lines Conduit function was accidentally O(n^2) - I spotted and fixed that. Using difference lists does not necessarily make your code O(n)!

Useful functions

When using Conduit I found a number of functions seemingly missing, so I defined them myself.

First up is countC which counts the number of items that are consumed. Just a simple definition on top of sumC.
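One plausible definition (a sketch of the idea; the version in Hoogle's General.Conduit may differ) replaces every element with 1 and sums:

```haskell
import Data.Conduit
import Data.Conduit.Combinators (sumC)
import qualified Data.Conduit.List as CL

-- Count the consumed items by mapping each to 1 and summing.
countC :: (Monad m, Num c) => ConduitM a o m c
countC = CL.map (const 1) =$= sumC
```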

I regularly find zipFrom i = zip [i..] very useful in strict languages, and since Conduit can be viewed as a strict version of lazy lists (through very foggy glasses) it's no surprise a Conduit version is also useful.
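A direct translation might look like this (zipFromC is my name for the sketch, not necessarily what Hoogle calls it):

```haskell
import Data.Conduit

-- Pair each element with an increasing index, like zip [i..] on lists.
zipFromC :: (Monad m, Enum i) => i -> ConduitM a (i, a) m ()
zipFromC i = do
    x <- await
    case x of
        Nothing -> return ()
        Just v  -> do
            yield (i, v)
            zipFromC (succ i)  -- recurse with the next index
```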

Finally, it's useful to zip two conduits. I was surprised how fiddly that was with the standard operators (you have to use newtype wrappers and an Applicative instance), but a simple |$| definition hides that complexity away.
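The newtype in question is ZipConduit from Data.Conduit; a |$| along these lines (a sketch, not necessarily Hoogle's exact definition) feeds the same input to two conduits and pairs their results:

```haskell
import Data.Conduit

-- Run two conduits over the same input stream, pairing their results.
-- ZipConduit's Applicative instance distributes the input to both.
(|$|) :: Monad m => ConduitM i o m a -> ConduitM i o m b -> ConduitM i o m (a, b)
a |$| b = getZipConduit $ (,) <$> ZipConduit a <*> ZipConduit b
```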

Why not Pipes?

I am curious whether all Pipes users get asked "Why not use Conduit?", or whether this FAQ is asymmetric.

I realise pipes is billed as the more "principled" choice for this type of programming, but I've yet to see anywhere that Conduit seems fundamentally unprincipled. I use WAI/Warp and http-conduit, so learning Conduit gives me some help there.

I wouldn't necessarily call these additions "unprincipled", except insofar as they extend the semantics of the term algebra in ways that no longer allow certain mathematical constructions, like categories, to be formed. Obeying fewer laws means the type is less constrained in its behavior, which does reduce reasoning power.

Where I think the real question lies is: How much functionality should one "bake in" when other approaches are possible? For example, leftover support can be added externally using StateT, and is the approach taken by pipes-parse. Yes, it is more difficult to learn at the beginning, but isn't Haskell about wearing the hair shirt, and communicating as much of a program's behavior in types as we can get away with, to write better programs?

Conduit aims at a certain intuitive simplicity by providing a semantically rich core type, and for its chosen crowd of users, this may be exactly what they want. I tend to prefer the core simplicity of pipes, and like the fact that to get fancy behavior out of my pipelines, I need to use fancier types.

John: Thanks for that explanation. I suggest you put it on a blog post somewhere (or School of Haskell or some such) as I think it would be useful for others - it's the clearest comparison I've seen, the kind of thing I was looking for. I think in my case I'm using the subset that makes use of neither of the extensions, so my pipe and conduit experiences are likely to be the same.

In terms of your experience using both libraries probably being the same: Although you don't explicitly make use of conduit's extra features, you probably do so implicitly. For example, notification of upstream termination allows conduit to collect elements into lists using a Consumer (sinkLazy). In pipes, such collection must be done by a Producer to list function (toListM). Receiving Producers as arguments is a common pipes idiom for detecting Producer termination. You'll even see functions from Producer to Producer, that in conduit could just be Conduits.

An example where leftovers become significant is takeWhile, since pipes can't "put back" the element that fails the test. Currently, there is no way to access that element at all (although I've opened a pull request to remedy this). Even after my pull request, the user of takeWhile must consciously use that element first, before calling await to get the next element, which is a requirement *not* present in the types.

Lastly there is prompt finalization, which means that a file can be closed as soon as its last chunk is read. This can only sort of be done in pipes, by testing for EOF and closing the handle before transmitting the last chunk. However, even that doesn't help if downstream exits before upstream has a chance to read that chunk. Closure is then deferred until the entire SafeT session has ended, where the pipes bracket finalizer was registered. In conduit the right behavior happens naturally for any user of bracket, meaning there are fewer ways to get it wrong.

Whatever the Next Generation of streaming libraries may be, my hope is that as many gotchas like this are eliminated by the types as possible, while retaining maximal expressiveness about the semantics of those types (and so, more laws).