The Worst Function in Conduit

May 7, 2017

This blog post addresses a long-standing FIXME in the
conduit-combinators documentation, as well as
a question on Twitter. It
assumes familiarity with the Conduit streaming data
library; if you'd like to read up on it first, please
check out the tutorial. The
full executable snippet is at the end of this blog post, but we'll
build up intermediate bits along the way. First, the
Stack script header, import
statement, and some minor helper functions.
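A minimal sketch of that preamble, assuming a current conduit release in which the Conduit module ships with the conduit package. The resolver named in the comment is an assumption, and main is only there so the file runs on its own:

```haskell
-- To run this file as a Stack script, start it with these two lines
-- (the resolver is an assumption; any snapshot that ships conduit should do):
--   #!/usr/bin/env stack
--   -- stack script --resolver lts-22.0 --package conduit
import Conduit

-- | Yield the numbers 1 through 10.
src10 :: Monad m => ConduitM i Int m ()
src10 = yieldMany [1 .. 10]

-- | Count whatever is left upstream and print the total.
remaining :: MonadIO m => ConduitM i o m ()
remaining = do
  n <- lengthC
  liftIO $ putStrLn $ "Remaining: " ++ show (n :: Int)

main :: IO ()
main = runConduit $ src10 .| remaining -- prints "Remaining: 10"
```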

src10 just provides the numbers 1 through 10 as a source, and
remaining tells you how many values are remaining from
upstream. Cool.

Now let's pretend that the Conduit libraries completely forgot to
provide a drop function. That is, a function that will take an Int
and discard that many values from the upstream. We could write one
ourselves pretty easily; call it dropSink, since it consumes values
without passing anything downstream:
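A sketch of one way to write it, under the name dropSink used in the rest of this post. The src10 and remaining helpers are repeated so the snippet compiles on its own:

```haskell
import Conduit

src10 :: Monad m => ConduitM i Int m ()
src10 = yieldMany [1 .. 10]

remaining :: MonadIO m => ConduitM i o m ()
remaining = do
  n <- lengthC
  liftIO $ putStrLn $ "Remaining: " ++ show (n :: Int)

-- | Discard up to @cnt@ values from upstream; never yields anything.
dropSink :: Monad m => Int -> ConduitM i o m ()
dropSink cnt
  | cnt <= 0 = return ()
  | otherwise = await >>= maybe (return ()) (\_ -> dropSink (cnt - 1))

main :: IO ()
main = runConduit $ src10 .| (dropSink 5 >> remaining)
```

Run as written, this should print Remaining: 5, because remaining is monadically bound after dropSink and picks up whatever dropSink left behind.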

Well, there's another formulation of this drop function. Instead of
letting the next monadically bound component pick up remaining values,
we could pass the remaining values downstream. Fortunately it's
really easy to implement this function, which I'll call dropTrans,
in terms of dropSink:
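A sketch of that definition: run dropSink first, then forward everything else. Using mapC id to forward the rest is an assumption here; any pass-through conduit would do. The helpers are repeated so the snippet compiles on its own:

```haskell
import Conduit

src10 :: Monad m => ConduitM i Int m ()
src10 = yieldMany [1 .. 10]

remaining :: MonadIO m => ConduitM i o m ()
remaining = do
  n <- lengthC
  liftIO $ putStrLn $ "Remaining: " ++ show (n :: Int)

dropSink :: Monad m => Int -> ConduitM i o m ()
dropSink cnt
  | cnt <= 0 = return ()
  | otherwise = await >>= maybe (return ()) (\_ -> dropSink (cnt - 1))

-- | Discard @cnt@ values, then pass every later value downstream.
dropTrans :: Monad m => Int -> ConduitM i i m ()
dropTrans cnt = dropSink cnt >> mapC id

main :: IO ()
main = runConduit $ src10 .| dropTrans 5 .| remaining
```

This should also print Remaining: 5, but here remaining sits downstream of dropTrans rather than being bound after it.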

Many may argue that this is more natural. To some extent, it mirrors
the behavior of take more closely, as take passes the initial
values downstream. On the other hand, dropTrans cannot guarantee
that the values will be removed from the stream; if instead of
dropTrans 5 .| remaining I simply did dropTrans 5 .| return (),
then the dropTrans would never have a chance to fire, since
execution is driven from downstream. Also, as demonstrated, it's
really easy to capture this transformer behavior from the sink
behavior; the other way is trickier.
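The point about dropTrans never firing can be checked with a small sketch. It assumes the dropSink and dropTrans definitions described above, repeated here so the snippet compiles on its own, and it counts what is left with remaining:

```haskell
import Conduit

src10 :: Monad m => ConduitM i Int m ()
src10 = yieldMany [1 .. 10]

remaining :: MonadIO m => ConduitM i o m ()
remaining = do
  n <- lengthC
  liftIO $ putStrLn $ "Remaining: " ++ show (n :: Int)

dropSink :: Monad m => Int -> ConduitM i o m ()
dropSink cnt
  | cnt <= 0 = return ()
  | otherwise = await >>= maybe (return ()) (\_ -> dropSink (cnt - 1))

dropTrans :: Monad m => Int -> ConduitM i i m ()
dropTrans cnt = dropSink cnt >> mapC id

main :: IO ()
main = do
  -- Downstream is `return ()`, which never awaits, so dropTrans never
  -- runs and all ten values are still there for remaining.
  runConduit $ src10 .| ((dropTrans 5 .| return ()) >> remaining)
  -- dropSink consumes its five values no matter what follows it.
  runConduit $ src10 .| (dropSink 5 >> remaining)
```

The first pipeline should report Remaining: 10 and the second Remaining: 5.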

My point here is that we have two legitimate definitions of a
function. And from my experience, different people expect different
behavior for the function. In fact, some people (myself included)
intuitively expect different behavior depending on the circumstance!
This is what earns drop the title of worst function in conduit.

To make it even more clear how bad this is, let's see how you can
misuse these functions unintentionally.
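For dropSink, the tempting mistake is to fuse it into the pipeline as if it were a transformer. A sketch, with the helpers repeated so it compiles on its own:

```haskell
import Conduit

src10 :: Monad m => ConduitM i Int m ()
src10 = yieldMany [1 .. 10]

remaining :: MonadIO m => ConduitM i o m ()
remaining = do
  n <- lengthC
  liftIO $ putStrLn $ "Remaining: " ++ show (n :: Int)

dropSink :: Monad m => Int -> ConduitM i o m ()
dropSink cnt
  | cnt <= 0 = return ()
  | otherwise = await >>= maybe (return ()) (\_ -> dropSink (cnt - 1))

-- dropSink is fused with .| instead of being sequenced with >>.
main :: IO ()
main = runConduit $ src10 .| dropSink 5 .| remaining
```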

This code looks perfectly reasonable, and if we just replaced
dropSink with dropTrans, it would be correct. But instead of
saying, as expected, that we have 5 values remaining, this will
print 0. The reason: src10 yields 10 values to
dropSink. dropSink drops 5 of those and leaves the remaining 5
untouched. But dropSink never itself yields a value downstream, so
remaining receives nothing.

Because of the type system, it's slightly trickier to misuse
dropTrans. Let's first do the naive thing of just assuming it's
dropSink:
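A sketch of that naive attempt, with dropTrans defined as dropSink followed by a pass-through (an assumption, as is the helper name inner). The offending definition is left commented out because GHC rejects it; inner shows the type the nested component actually has:

```haskell
import Conduit

src10 :: Monad m => ConduitM i Int m ()
src10 = yieldMany [1 .. 10]

remaining :: MonadIO m => ConduitM i o m ()
remaining = do
  n <- lengthC
  liftIO $ putStrLn $ "Remaining: " ++ show (n :: Int)

dropSink :: Monad m => Int -> ConduitM i o m ()
dropSink cnt
  | cnt <= 0 = return ()
  | otherwise = await >>= maybe (return ()) (\_ -> dropSink (cnt - 1))

dropTrans :: Monad m => Int -> ConduitM i i m ()
dropTrans cnt = dropSink cnt >> mapC id

-- Rejected by the type checker: the pipeline's output type is Int,
-- but runConduit needs it to be Void.
-- badDropTrans :: IO ()
-- badDropTrans = runConduit $ src10 .| (dropTrans 5 >> remaining)

-- The nested component is well typed on its own, with output type Int.
-- (inner is a hypothetical name used only for this illustration.)
inner :: ConduitM Int Int IO ()
inner = dropTrans 5 >> remaining

main :: IO ()
main = putStrLn "inner type checks; the commented-out badDropTrans does not"
```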

The problem is that runConduit expects a pipeline where the final
output value is Void. However, dropTrans has an output value of
type Int. And if it's yielding Ints, so must remaining. This is
definitely an argument in favor of dropTrans being the better
function: the type system helps us a bit. (It's also an argument in
favor of keeping
the type signature of runConduit as-is.)

However, it's still possible to accidentally screw things up in bigger
pipelines, e.g.:
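A sketch of such a pipeline, assuming the dropTrans definition built from dropSink plus a pass-through. Adding a downstream consumer (sinkList here, chosen for illustration) is what lets the misuse get past the type checker:

```haskell
import Conduit

src10 :: Monad m => ConduitM i Int m ()
src10 = yieldMany [1 .. 10]

remaining :: MonadIO m => ConduitM i o m ()
remaining = do
  n <- lengthC
  liftIO $ putStrLn $ "Remaining: " ++ show (n :: Int)

dropSink :: Monad m => Int -> ConduitM i o m ()
dropSink cnt
  | cnt <= 0 = return ()
  | otherwise = await >>= maybe (return ()) (\_ -> dropSink (cnt - 1))

dropTrans :: Monad m => Int -> ConduitM i i m ()
dropTrans cnt = dropSink cnt >> mapC id

-- dropTrans is sequenced with >> as if it were dropSink, and the
-- downstream sinkList makes the types line up.
main :: IO ()
main = runConduit
  $ src10
  .| (dropTrans 5 >> remaining)
  .| (sinkList >>= liftIO . print)
```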

This code may look a bit contrived, but in real-world Conduit code
it's not at all uncommon to deeply nest these components in such a way
that the error would not be apparent. You may be surprised to hear that
the output of this program is:

Remaining: 0
[6,7,8,9,10]

The reason is that the sinkList is downstream from dropTrans, and
grabs all of its output. dropTrans itself will drain all output from
src10, leaving nothing behind for remaining to grab.

The Conduit libraries use the dropSink variety of function. I wish
there were a better approach here that felt more intuitive to
everyone. The closest I can think of is deprecating drop and
replacing it with more explicitly named dropSink and dropTrans,
but I'm not sure how I feel about that (feedback welcome, and other
ideas certainly welcome).
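For reference, a short sketch of both usages with the library function, assuming the unqualified name dropC exported by the Conduit module. The first pipeline uses it sink-style; the second recovers the transformer behavior by following it with a pass-through:

```haskell
import Conduit

main :: IO ()
main = do
  -- Sink-style: the next monadically bound component sees the rest.
  n <- runConduit $ yieldMany [1 .. 10 :: Int] .| (dropC 5 >> lengthC)
  putStrLn $ "Remaining: " ++ show (n :: Int)
  -- Transformer-style: forward what is left after dropping.
  xs <- runConduit $ yieldMany [1 .. 10 :: Int] .| (dropC 5 >> mapC id) .| sinkList
  print xs
```

This should print Remaining: 5 followed by [6,7,8,9,10].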