Sunday, May 27, 2012

I want to preface this by saying that this post is not intended to be mean-spirited and I will offer some insights on how to fix these problems.

I find that violations of type-class laws almost invariably lead to subtle bugs, which is why I go to so much effort to ground my pipes library in theoretically-derived type-classes and enforce their corresponding laws very strictly. In this post I'm going to illustrate the importance of this for conduit by showing several conduit bugs and demonstrating how they stem directly from type-class law violations. I based these bugs on conduit-0.4.2.

The right-associative grouping accidentally drops the finalizer. Conduits don't guarantee deterministic finalization in certain corner cases, again because they violate the monad laws.

More generally, any conduit that sequences two buffered conduits is prone to the above two bugs. The source of both violations is the attempt to "push back" input by attaching residual input to the Done constructor. I can't yet offer a solution for this because I have not yet tackled the parsing problem. However, removing this feature will at least prevent the above case of dropped finalizers.

Conduits violate the Category laws

I know that conduits do not advertise being a Category, but I investigated where conduits deviated from a Category and found more bugs in the process.

For example, conduits lack an identity conduit. This is apparent from the type of conduit composition:

(=$=) :: Monad m
=> Pipe a b m ()
-> Pipe b c m r
-> Pipe a c m r

The upstream conduit is constrained to return (), so you immediately lose the ability to return values from upstream like you can with pipes and you cannot form a proper upstream identity pipe. However, we could type-restrict composition to only return () and see if that forms a category, even if not necessarily as powerful as the pipes category:

(=$=) :: Monad m
=> Pipe a b m ()
-> Pipe b c m ()
-> Pipe a c m ()

The closest this composition comes to an identity conduit is:

idC = await >>= maybe (return ()) (\a -> yield a >> idC)

Unfortunately, idC does not serve as an identity conduit when you place it upstream of a conduit:

This seemingly innocuous identity law violation actually represents a known subtle bug in conduits where if they are fed a Nothing followed by a Just x they break. Some conduit library code depends on this scenario never happening in order to guarantee safety and correctness. Michael already knows this, because he notes this in the comments of composition:

-- [...] However, it is not
-- recommended to give input to a pipe after it has
-- been told there is no more input. [...]

I find it interesting that despite not attempting to make conduits a category, the invariant Michael requires is precisely the invariant necessary to make the identity law work. This exemplifies how even when we don't program using theoretical constructs, our intuition for correct behavior exactly matches the laws for the theoretically-grounded classes.

The lack of a proper upstream identity is also a source of data loss. When you compose two pipes, the residual input for the downstream pipe is discarded under all circumstances, and there is no way to solve this without introducing a proper upstream identity conduit.

pipes has solved this issue, but the solution is precisely what necessitated the parametrized monad. I can discuss this with Michael if he is interested, because I think he would not be as reluctant to use extensions to allow do notation for parametrized monads. The other advantage of switching to the pipes solution is that you gain the ability to finalize upstream early without terminating, which conduit does not presently have.

Conduits violate Monad Transformer laws

This is my fault. The first release of my library violated the monad transformer laws and while I've fixed it in version 2.0, conduit still has the original version found in pipes.

The original version violates both the identity and composition laws of monad transformers. You can see this if you just count the number of PipeM constructors generated:

This is the reason I switched to using an approach based on a free monad transformer (i.e. FreeT), instead of using an ordinary free monad.

This makes the monad bind mandatory at each step, which actually leads to very little degradation of performance, so I had no problem making the switch. More importantly, though, it allows more powerful optimizations using rewrite rules than are currently possible when the monad is optional. For example, the following rewrite rule is safe once you have a correct instance for MonadTrans:

I will discuss rewrite rules later in a post about optimizing pipes and frames, but one of my big motivations for enforcing all the class laws as strictly as possible is that it then permits rewrites so powerful that all the pipe code completely disappears, leaving behind the hand-written loop code (think: stream fusion on steroids).

Other bugs

There are other bugs in conduits that I found in the process, mainly associated with registering finalizers. For example, conduit has a nasty habit of releasing resources multiple times, but this is hidden by the ResourceT machinery, which ignores duplicate resource release requests.

However, this is quite easy for conduits to fix. All you do is remove the finalizer field from the PipeM constructor and have pipeClose ignore PipeM actions completely, only associating finalizers with HaveOutput. This fixes the issue of multiple finalizations and also fixes the issue of accidentally drawing one last chunk of data before finalizing if you aren't careful about writing the source.

Even if you fix that, though, there are still other finalization problems. For example, while Michael provides a mechanism for bidirectionally-safe finalization, he never uses it for sources and sinks, meaning that if you elevate them to conduits they won't finalize correctly. However, this is easy to fix and he will know how to do it. He also never exposes any utility function for bidirectionally-safe finalization equivalent to finallyP in pipes.

Discussion

This post presents some examples where if you are close to a theoretically-grounded type-class but not quite there, the difference is most likely a bug. This means that you can identify bugs in a library rapidly just by examining where it approximates theoretical type-classes and then studying where they fail to observe the corresponding class laws.

Monday, May 21, 2012

I'm happy to announce that pipes-2.0 now includes an elegant and sound way to finalize resources promptly and deterministically. You can find the latest version of the library here.

Before I continue I want to acknowledge Paolo Capriotti, who contributed a lot of instrumental discussion that led to this solution, particularly for how to manage downstream finalization.

The library introduces a new higher order data-type called a Frame complete with its own Category that you can use to solve all your finalization problems. It's layered on top of ordinary Pipes and it has plenty of nice properties that fall out of enforcing the Category laws. However, this post is not documentation, so I encourage you to read the expanded tutorial, starting from the "Frames" section before you continue.

The real topic of this post is a really quick and dirty meta-discussion containing a lot of comments on the release that didn't belong in the documentation. I will also touch upon issues with finalization that I grappled with when working on finalizing pipes while satisfying the Category laws. I'm hoping it will help guide other iteratee libraries which might have different goals but still deal with the same issues. I will discuss these topics in more depth in future posts, but I felt all of these are worth briefly mentioning right now.

Parametrized Monads

The first big issue is that prompt finalization is inherently not a monad, but rather a parametrized monad. By prompt finalization, I mean finalizing upstream resources before the current pipe is done. What we'd like is some sort of way to write:

do someCode
   finalizeUpstream
   codeThatCan'tAwait

The problem is that if it's a monad, there is no restriction on the ordering of commands, so nothing prevents the user from calling an await after upstream is finalized. With a parametrized monad (a.k.a. an indexed monad) and GADTs, you can solve this problem by marking the finalization point with a phantom type that records that upstream has been finalized. There are actually more elegant ways to do it, but that's the most straightforward way. To learn more about parametrized monads, I highly recommend Dan Piponi's introduction to them.
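To make the phantom-type idea concrete, here is a minimal sketch. It is not the pipes implementation: Action, Open, Closed, ireturn, ibind, awaitA and yieldA are names I made up for illustration, and the "pipe" only records a log of what it did. The two type indices track the state of the upstream end before and after each action.

```haskell
-- Phantom indices describing the state of the upstream end (hypothetical names).
data Open
data Closed

-- An action that starts in state i, ends in state j, and logs what it did.
newtype Action i j a = Action { runAction :: ([String], a) }

-- Indexed return: does nothing, so the state does not change.
ireturn :: a -> Action i i a
ireturn a = Action ([], a)

-- Indexed bind: the end state of the first action must match
-- the start state of the second.
ibind :: Action i j a -> (a -> Action j k b) -> Action i k b
ibind (Action (w1, a)) k =
    let Action (w2, b) = k a
    in  Action (w1 ++ w2, b)

-- Awaiting is only allowed while upstream is still open.
awaitA :: Action Open Open ()
awaitA = Action (["await"], ())

-- Finalizing upstream moves the index from Open to Closed.
finalizeUpstream :: Action Open Closed ()
finalizeUpstream = Action (["finalize upstream"], ())

-- Yielding is allowed in any state.
yieldA :: Action i i ()
yieldA = Action (["yield"], ())

-- Accepted: await, then finalize upstream, then keep yielding.
ok :: Action Open Closed ()
ok = awaitA `ibind` \_ -> finalizeUpstream `ibind` \_ -> yieldA

-- Rejected by the type-checker, because Closed does not match Open:
-- bad = finalizeUpstream `ibind` \_ -> awaitA
```

Load it into ghci and uncomment bad to see the type error. The point is only that the ordering restriction lives in the types, so no run-time check is needed.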

However, for the time being I chose to not use parametrized monads and stuck to just splitting it into two separate monads: one for before finalizing upstream and one for after finalizing upstream. The first monad would return the second one as a return value, so you have something that looks like:

Pipe a b m (Producer b m r)

The second block is forbidden from awaiting anything other than (), so you can safely finalize upstream at the transition between the two monads.

Besides avoiding extensions, there is a more important reason that I haven't released a version of pipes that uses parametrized monads. It turns out you can use them to communicate more complicated session types than just ordinary streams and I wanted more time to experiment with that generalization of pipes before incorporating it into the library.

The choice not to use parametrized monads complicated the underlying type for Frame slightly, but when I do include parametrized monads, it should clean up the type considerably. What that means for now is that there are two types you switch between, one of which is a Monad (the Ensure type), and the other of which is a Category (the Frame type). Once I figure out the most elegant way to do parametrized monads they should become the same type. Fortunately, it's not hard to switch between them and the documentation explains their relationship, especially Control.Pipe.Final, which goes into more depth about the types.

Directionality

Another issue that I grappled with was whether or not bidirectional pipes can possibly form a Category. Bidirectional pipes would have made finalization a lot easier, but unfortunately I could never come up with a solution that formed a proper Category. I can't definitively rule out the possibility of them, but I am reasonably sure that you can't implement them and retain anything that remotely resembles my original Pipe type. That's obviously a vague statement since I can't quantify my notion of what it means to resemble the original Pipe type.

The key breakthrough in designing finalized pipes was to not fight the unidirectionality of pipes at all and to instead just "go with the flow" (literally). This means that I had to use two separate tricks to ensure that finalization worked in both directions. This was disconcerting at first until I noticed a very general dual pattern underlying the two tricks that emerged...

Monoids

So let's imagine for a second that I'm right and pipes really must be unidirectional if you want to form a category. Now let's try to figure out how to finalize pipes correctly if a downstream pipe terminates before them. Well, if pipes are unidirectional, there is absolutely no way to communicate back upstream that the pipe terminated. Instead, we just "go with the flow" and do the reverse: every time a pipe yields it passes its most up-to-date finalizer alongside the yielded value. Downstream pipes are then responsible for calling the finalizers passed to them if they terminate first before awaiting a new value from the pipe that yielded.

What composition does is remember the last finalizer each frame yielded (defaulting to the empty finalizer for pipes that haven't run yet) then it combines the most current finalizer of each pipe with the most current finalizers of every pipe upstream. So if we had three pipes composed like so:

p1
Then you can make a diagram showing how finalizers are combined automatically by composition behind the scenes:

p1
... where I'm using (*) to denote mappend (i.e. monoid multiplication), and 1 to denote mempty (i.e. monoid unit). f1 would be the most up-to-date finalizer of pipe p1, f2 would be the most up-to-date finalizer of pipe p2, and f3 would be the most up-to-date finalizer of pipe p3. The monoid in this case is just monad sequencing where (*) = (>>) and 1 = return ().

So let's say that pipe p1 were to close its input end. Composition would then take the collected upstream finalizers coming into pipe p1, which in this case is (f3*f2) = f3 >> f2, and call them since it knows it is safe to finalize them.
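The monoid described above can be sketched in a few lines. This is only an illustration under my own assumptions: Finalizer, Log, f2, f3 and upstreamOfP1 are hypothetical names, not the library's internals, and I use a logging tuple monad in place of IO so that the order of finalization is visible as a list.

```haskell
-- A finalizer is just an action in the base monad (hypothetical wrapper type).
newtype Finalizer m = Finalizer { runFinalizer :: m () }

-- (*) = (>>): run the left finalizer, then the right one.
instance Monad m => Semigroup (Finalizer m) where
    Finalizer a <> Finalizer b = Finalizer (a >> b)

-- 1 = return (): the empty finalizer for pipes that haven't run yet.
instance Monad m => Monoid (Finalizer m) where
    mempty = Finalizer (return ())

-- A stand-in base monad that only records which finalizers ran, in order.
type Log = (,) [String]

f2, f3 :: Finalizer Log
f2 = Finalizer (["finalize p2"], ())
f3 = Finalizer (["finalize p3"], ())

-- What p1 sees coming in from upstream: (f3 * f2) = f3 >> f2.
upstreamOfP1 :: Finalizer Log
upstreamOfP1 = mconcat [mempty, f3, f2]
```

Evaluating fst (runFinalizer upstreamOfP1) in ghci shows p3's finalizer running before p2's, which is the upstream-to-downstream ordering described above.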

All of this occurs completely behind the scenes. From the user's perspective, all they do is yield the individual finalizers alongside ordinary output and the process of collecting finalizers and calling them upon termination is completely a black box:

So the process of collecting upstream finalizers and running them behaves like a monoid under the hood where it folds all the upstream finalizers towards the pipe that is terminating so that it knows exactly what to call to finalize all of them. Where it gets really interesting is that finalizing downstream behaves dually as a comonoid.

Comonoids

Finalizing downstream can use the ordinary metaphor of exception propagation. It's not quite as trivial as it sounds (and it was in fact the harder of the two halves to get correct, surprisingly), but if you understand exceptions you are 90% of the way towards understanding how to finalize downstream pipes.

Once you get this, it's not hard to see how this behavior is inherently comonoidal. All we have to do is reverse all the arrows in our previous diagram:

Now, this time (*) is comultiplication and 1 is counit. Comultiplication consists of splitting off the exceptional value for each pipe to handle, which ensures that no pipe accidentally swallows the exception. Counit discards the exception and doesn't bother to handle it.

Again, the user doesn't see any of this propagation or unfolding and is not responsible for rethrowing the exception. All the user sees is that they await input and get an exception instead. The rest is a black box automatically handled by composition:

Synthesis

There is another way to think of how this monoid/comonoid duality solves the issue of unidirectionality. For the upstream half of finalization, the finalizer is transmitted to the exception (which is the termination in this case). For the downstream half, the exception (i.e. termination) is transmitted to the finalizer.

If you study the source code, you'll notice that I define composition as follows:

p1
Where mult is what does all the monoidal folding of the finalizers, comult is what does the comonoidal unfolding of the exception. ( just simulates the parametrized monad. But where are unit and counit? If you've read the documentation, the answer is that they are right there under your nose in the form of yieldF (which is unit) and awaitF (which is counit). And if you check out the definition for the identity frame, it is just:

idF = forever $ awaitF >>= yieldF

In other words, idF is both the unit for the monoidal fold and the counit for the comonoidal unfold. Diagrammatically, this looks like:

...
... where the top half is the monoid and the bottom half is the comonoid.

Strictness

One important mistake I made in the first release of my library was releasing the Strict category, which wasn't a true category. It's easy to see this because it violates one of the identity laws:

(p >> discard)
So that got removed from the library in this release. However, one thing I realized from countless failed attempts to fix it is that the category of pipes is inherently lazy. When I finally solved the finalization issue, though, I discovered that you could implement strictness anyway on top of the lazy core in a way that is very elegant and compositional, and I describe this in the tutorial.

The "inherent" laziness of pipes/frames is important for another reason, though. You'll notice that I mentioned the capability to finalize upstream promptly but I never mentioned being able to finalize downstream promptly before the pipe is done. It turns out this is not possible and in the same way that pipe/frame composition inherently prefers to be lazy, it also prefers to order finalizers from upstream to downstream because it's lazy.

To see why, just spend a minute thinking about what it would mean to finalize downstream before a frame was done under lazy semantics.

Speed

One big problem is that the finalization machinery now incurs a large overhead. Some naive benchmarks I ran show that for IO the frame overhead takes about as much time as the IO operations it manages. However, I have done absolutely nothing to optimize the implementation at all (i.e. no Codensity transformation, no INLINE pragmas, no rewrite rules), so if you are good at optimizing Haskell code and are interested in contributing to the pipes library then I could really use your help as improving speed is a major goal for the next patch. However, I might split the faster code into a separate library if it complicates the source code considerably, because I would like to keep the main library as a reference implementation that is easier to understand.

Exceptions

The implementation only covers finalization for now, but the same principles and techniques can be adapted to solve the issue of exceptions as well. For example, while the implementation currently uses Nothing to transmit the termination exception, it can be modified to transmit a value of any kind (although not as trivially as it sounds!). Similarly, the finalizers need not be limited to the base monad, nor do they even need to be finalizers (or even monads!). Really, anything that is a monoid or comonoid can be used. I chose to go with the simplest and most practical implementation that demonstrates the basic underlying principle in order to give other people ideas.

FreeT

FreeT is the monad transformer version of the free monad that I use to refactor the implementation of the Pipe type (and it's quite common in iteratee/coroutine libraries, where it appears under various names). I'm working with Ross to include it in the transformers package, but until that happens I'm rolling it into the pipes library for now. Once it is added to transformers I will remove it from pipes and use the transformers version, so don't make pipes a dependency for FreeT.
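For readers who haven't seen a free monad transformer before, here is a minimal sketch of the usual presentation. The names FreeF, Pure, Free, runFreeT and liftT are assumptions chosen for illustration and need not match what pipes ships. The thing to notice is that every layer is wrapped in the base monad m, which is what makes the bind in m mandatory at each step.

```haskell
import Control.Monad (ap, liftM)

-- One layer: either a final result or one step of the base functor f.
data FreeF f a x = Pure a | Free (f x)

-- Every layer lives inside the base monad m.
newtype FreeT f m a = FreeT { runFreeT :: m (FreeF f a (FreeT f m a)) }

instance (Functor f, Monad m) => Functor (FreeT f m) where
    fmap = liftM

instance (Functor f, Monad m) => Applicative (FreeT f m) where
    pure a = FreeT (return (Pure a))
    (<*>)  = ap

instance (Functor f, Monad m) => Monad (FreeT f m) where
    m >>= k = FreeT $ do
        step <- runFreeT m                       -- always binds in m first
        case step of
            Pure a -> runFreeT (k a)             -- finished: continue with k
            Free w -> return (Free (fmap (>>= k) w))  -- push k under the step

-- The analogue of lift: run one base-monad action and return its result.
liftT :: Monad m => m a -> FreeT f m a
liftT m = FreeT (liftM Pure m)
```

With this shape, liftT (return x) and return x unfold to the same value, which is the kind of monad transformer identity law discussed in this post.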

Other Changes

Also, if you haven't been following the library closely, pipes is now BSD-licensed, so feel free to use it as you please.

Conclusion

Like with the original pipe composition, I verified that frame composition enforces the category laws (and this was an extraordinarily tedious and headache-inducing proof). This alone is probably the most significant contribution of this library as it is the only existing implementation that has a finalization mechanism that is:

Safe: Finalizers never get duplicated or dropped.

Modular: It completely decouples the finalization code of each frame and lets you reason about their finalization properties independently of each other.

Scalable: It is very easy to build long pipelines with no increase in complexity at all.

Easy: No glue code is required to chain together the finalization mechanisms of pipes.

This means that we can now skip type-level programming and implement everything within lambda calculus at the value level. Things that previously required elaborate extensions now only require ordinary Haskell functions.

... instead of attempting a bunch of type-class hackery that doesn't work. More importantly, we can now use our more featureful value-level programming tools to do what was incredibly difficult to do at the type level.

Class maintenance

One big issue in Haskell is maintaining class APIs. However, when we implement classes at the value level, this problem completely disappears.

For example, let's say that I realize in retrospect that my Monad class needed to be split into two classes, one named Pointed to hold return and one named Monad that has Pointed as a superclass. If people use my Monad class extensively, then I'd have to break all their Monad instances if I split it into two separate classes because now they would have to spin off all of their return implementations into separate instances for Pointed.

Now, had I implemented it as a data type, it wouldn't even matter. I'd just write:

Now users can automatically derive Pointed instances from their old Monad instances, or they can choose to write a Pointed instance and then build a Monad instance on top of it.
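As a rough sketch of what such a split could look like (this is my reconstruction with hypothetical names, not the post's original listing): MonadI and _bind follow the naming used later in this post, while _return, PointedI, _point, pointedFromMonad and monadFromPointed are assumptions. It uses RankNTypes for the polymorphic record fields.

```haskell
{-# LANGUAGE RankNTypes #-}

-- The original, unsplit instance: a plain record of functions.
data MonadI m = MonadI
    { _return :: forall a. a -> m a
    , _bind   :: forall a b. m a -> (a -> m b) -> m b
    }

-- The piece split off after the fact.
newtype PointedI m = PointedI { _point :: forall a. a -> m a }

-- Old Monad instances give rise to Pointed instances via an ordinary function.
pointedFromMonad :: MonadI m -> PointedI m
pointedFromMonad i = PointedI (_return i)

-- Or go the other way: start from a Pointed instance and supply a bind.
monadFromPointed
    :: PointedI m
    -> (forall a b. m a -> (a -> m b) -> m b)
    -> MonadI m
monadFromPointed p bind = MonadI { _return = _point p, _bind = bind }

-- An example instance, named by hand.
monad'Maybe :: MonadI Maybe
monad'Maybe = MonadI
    { _return = Just
    , _bind   = \m f -> maybe Nothing f m
    }
```

Nothing that already uses monad'Maybe has to change when PointedI is introduced, because the conversion is just a function applied to an existing value.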

Backwards compatibility

Similarly, let's say that I forgot to make Functor a superclass of Monad. What's incredibly painful for the Haskell community to solve at the type-level is utterly straightforward to fix after-the-fact at the value level:
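One way this could look, as a sketch with hypothetical names (FunctorI, _fmap, _return and functorFromMonad are my assumptions, not the post's original listing):

```haskell
{-# LANGUAGE RankNTypes #-}

-- A value-level Monad instance.
data MonadI m = MonadI
    { _return :: forall a. a -> m a
    , _bind   :: forall a b. m a -> (a -> m b) -> m b
    }

-- A value-level Functor instance, introduced later.
newtype FunctorI f = FunctorI
    { _fmap :: forall a b. (a -> b) -> f a -> f b }

-- Every existing MonadI yields a FunctorI without touching old code.
functorFromMonad :: MonadI m -> FunctorI m
functorFromMonad i = FunctorI (\f m -> _bind i m (\a -> _return i (f a)))

-- An example instance for lists.
monad'List :: MonadI []
monad'List = MonadI
    { _return = \a -> [a]
    , _bind   = \m f -> concatMap f m
    }
```

The "superclass" relationship is now just a function from one record to another, so adding it later breaks nothing.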

Now we're actually writing in a true functional style where sum and product are true functions of the instance, rather than fake functions of a class constraint using awkward newtypes.

Value-level programming is safer

Type classes are used most often for operator overloading. The dark side to this is that your overloaded function will type-check on anything that is an instance of that class, including things you may not have intended it to type-check on.

For example, let's say I'm trying to write the following code using the ever-so-permissive Binary class:

main = encodeFile "test.dat" (2, 3)

... but it's 3:00 in the morning and I make a mistake and instead type:

main = encodeFile "test.dat" (2, [3])

This type-checks and silently fails! However, had I explicitly passed the instance I wished to use, this would have raised a compile-time error:

You might say, "Well, I don't want to have to annotate the type I'm using. I want it done automatically." However, this is the exact same argument made for forgiving languages like Perl or PHP, where people advocate that in ambiguous situations the language or library should attempt to silently guess what you intended to do instead of complaining loudly. This is exactly the antithesis of a strongly typed language!

Also, in the above case you would have had to annotate it anyway, because Binary wouldn't have been able to infer the specific type of the numeric literals!

main = encodeFile "test.dat" (2 :: Int, 3 :: Int)

Or what if I wanted to implement two different ways to encode a list, one which was the naive encoding and one which used more efficient arrays for certain types:

If I was really clever, I could even implement both instances using the same binList function and then have it select whether to encode a list or array based on the sub-instance passed to it! That's not even possible using type-classes.
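The explicit-instance idea in this section can be sketched with a toy encoder. Everything here is hypothetical: EncodeI, _encode and the encode'... values are names I invented, and the encoding to String merely stands in for a real binary format. It is not the binary package's API.

```haskell
-- A value-level stand-in for a serialization class (hypothetical).
newtype EncodeI a = EncodeI { _encode :: a -> String }

encode'Int :: EncodeI Int
encode'Int = EncodeI show

-- Instances for compound types are functions of their sub-instances.
encode'Pair :: EncodeI a -> EncodeI b -> EncodeI (a, b)
encode'Pair ea eb = EncodeI (\(a, b) -> _encode ea a ++ "," ++ _encode eb b)

-- Two different encodings for the same list type, selected by name.
encode'ListNaive :: EncodeI a -> EncodeI [a]
encode'ListNaive ea = EncodeI (concatMap (\a -> _encode ea a ++ ";"))

encode'ListCounted :: EncodeI a -> EncodeI [a]
encode'ListCounted ea =
    EncodeI (\as -> show (length as) ++ ":" ++ concatMap (_encode ea) as)

-- The instance pins down the type, so no annotations on the literals.
good :: String
good = _encode (encode'Pair encode'Int encode'Int) (2, 3)

-- The 3:00 AM mistake is now a compile-time error, since [a] is not Int:
-- bad = _encode (encode'Pair encode'Int encode'Int) (2, [3])
```

Both list encoders coexist for the same type, which is the part a single class instance per type cannot express without newtypes.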

No type annotations

Here's another example of an incredibly awkward use of typeclasses:

class Storable a where
    ...
    sizeOf :: a -> Int

Anybody who has ever had to use this knows how awkward it is when you don't have a value of type a to provide it, which is common. You have to do this:

sizeOf (undefined :: a)

That's just horrible, especially when the solution with value-level instances is so simple in comparison:

In fact, with value-level instances, type annotations are never ever necessary. Instead of:

readInt :: String -> Int
readInt = read

... or:

read "4" :: Int

... we'd just use:

read read'Int "4"

In other words, the value-level instance is all the information the function needs, and it's guaranteed to be sound and catch incorrect instance errors at compile-time.
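Here is a minimal sketch of both ideas from this section. The names StorableI, _sizeOf, ReadI, _read, readI and bytesFor are assumptions of mine; I call the reading function readI instead of read only to avoid clashing with the Prelude, and read'Int follows the naming used above.

```haskell
import Data.Int (Int32)

-- The size is a plain field; the type parameter is only a phantom tag,
-- so no value of type a (and no undefined) is ever needed.
newtype StorableI a = StorableI { _sizeOf :: Int }

storable'Int32 :: StorableI Int32
storable'Int32 = StorableI 4

storable'Double :: StorableI Double
storable'Double = StorableI 8

-- Bytes needed for n elements, given the instance.
bytesFor :: StorableI a -> Int -> Int
bytesFor i n = n * _sizeOf i

-- A value-level Read instance.
newtype ReadI a = ReadI { _read :: String -> a }

read'Int :: ReadI Int
read'Int = ReadI read

-- The instance argument fixes the result type, so no annotation is needed.
readI :: ReadI a -> String -> a
readI = _read
```

For example, bytesFor storable'Double 3 and readI read'Int "4" both type-check with no annotations at the use site.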

Powerful Approach

I wanted to demonstrate that this is a really industrial-strength replacement to type classes, so I took the mtl's StateT, ReaderT, and Identity and implemented them entirely in value-level instances. The code is provided in the Appendix of this post. This implementation allows you to straightforwardly translate:

It's implemented with only a single extension: Rank2Types. No UndecidableInstances required.

No type signatures or type annotations are necessary. You can delete every single type signature in the file, which is completely self-contained, and the compiler infers every single type correctly. Try it!

More tricks

This is just scratching the surface. This post doesn't even really cover all the things that are only possible with value-level instances like:

In other words, what I'm trying to say is that value-level instances are right now above us in the "power spectrum" of Haskell programming and you don't really get a feel for how incredibly useful they are until you actually start using them.

Simplicity

Another feature of value-level instances is their conceptual simplicity and elegance. Before, there was a type-class checker and a type-checker. Now there is just a type-checker. You don't really appreciate how great this is until you try it and start getting amazingly clear compiler errors. Programming without type-classes is very intuitive! Really, the hardest part about it is simply naming things!

Also, the fact that it's implementable purely using ordinary functional programming is a very big win. If anything, it would make the GHC compiler writers' jobs much easier by not requiring them to entertain any of the half-baked type-class extensions that people propose. This approach allows you to completely remove type-classes from the language. I'm just putting that out there.

Flaws

On that note, that brings me to the last section, where I will frankly discuss all the huge problems with it. The four biggest problems are:

No ecosystem for it. To make effective use of it, you'd need new versions of most Haskell libraries.

No do syntactic sugar. This one hurts.

Verbosity. Every instance has to be named and passed around.

Inertia. Programmers used to overloading will be reluctant to start specifying the instance they want.

The first issue is a huge problem and can only be solved if the community agrees this is actually a good idea. I'm only one person and that's about all my opinion counts for. All I can do is mention that more recent data types are already moving in this direction, with Lens (from data-lens) being the best example. Just imagine how impossible it would have been to implement Lens as a class:

class Lens a b where
    get :: a -> b
    set :: b -> a -> a

It fails horrendously, for the exact same reason the Isomorphism class crashes and burns. When implemented as a data type, it works completely flawlessly at the expense of extra verbosity. So if you liked Lens, chances are you'll like value-level instances in general.
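For comparison, here is a simplified sketch of a lens as a data type. This is a plain getter/setter record for illustration and not data-lens's actual representation; fstL, sndL and compose are hypothetical names.

```haskell
-- A lens from a to b as an ordinary record of two functions.
data Lens a b = Lens
    { get :: a -> b
    , set :: b -> a -> a
    }

-- Two different lenses can share the same pair of types, e.g. both
-- fstL and sndL can be used at type Lens (Int, Int) Int. A class
-- indexed by those two types would have to pick just one of them.
fstL :: Lens (x, y) x
fstL = Lens fst (\x (_, y) -> (x, y))

sndL :: Lens (x, y) y
sndL = Lens snd (\y (x, _) -> (x, y))

-- Lenses compose like functions: focus with ab first, then with bc.
compose :: Lens b c -> Lens a b -> Lens a c
compose bc ab = Lens
    { get = get bc . get ab
    , set = \c a -> set ab (set bc c (get ab a)) a
    }
```

For instance, compose fstL sndL focuses on the first component of the second component of a nested pair, and the caller always says which lens they mean by naming it.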

The second issue of syntactic sugar can be solved by something like RebindableSyntax and having do notation use whatever (>>=) is in scope. You would then specify which MonadI instance you use for each do block:

let (>>=) = _bind m
in do ...

... or you pass the MonadI instance as a parameter to the do block.
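A small sketch of how that might look with GHC's RebindableSyntax extension. The MonadI record repeats the shape assumed elsewhere in this post, and _return, monad'Maybe and addBoth are hypothetical names. Note that RebindableSyntax implies NoImplicitPrelude, hence the explicit Prelude import, and that it also rebinds other syntax such as if and numeric literals to whatever is in scope.

```haskell
{-# LANGUAGE RebindableSyntax, RankNTypes #-}

import Prelude

-- A value-level Monad instance.
data MonadI m = MonadI
    { _return :: forall a. a -> m a
    , _bind   :: forall a b. m a -> (a -> m b) -> m b
    }

monad'Maybe :: MonadI Maybe
monad'Maybe = MonadI
    { _return = Just
    , _bind   = \m f -> maybe Nothing f m
    }

-- The do block desugars to the locally bound (>>=), so the instance
-- passed as the first argument decides what the block means.
addBoth :: MonadI m -> m Int -> m Int -> m Int
addBoth i mx my =
    let (>>=)  = _bind i
        return = _return i
    in  do x <- mx
           y <- my
           return (x + y)
```

Here addBoth takes the instance as a parameter, which is the second option mentioned above. The local bindings shadow the Prelude's (>>=) and return, so GHC may emit shadowing warnings.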

This is not ideal, unfortunately, and ties into the third issue of verbosity. All I can say is that the only way you can understand that the verbosity is "worth it" is if you try it out and see how much more powerful and easier it is than type-class programming. Also, value-level instances admit the exact same tricks to clean up code as normal parameter passing. For example, you can use Reader (MonadI m) to avoid explicitly passing a monad instance around.

However, this still doesn't solve the problem of just coming up with names for the instances, which is uncomfortable until you get used to it and come up with a systematic nomenclature. This is a case where a more powerful name-spacing system would really help.

The last problem is the most insidious one, in my opinion, which is that we as Haskell programmers have been conditioned to believe that it is correct and normal to have operators change behavior silently when passed different arguments, which completely subverts type-safety. I'm going to conclude by saying that this is absolutely wrong and that the most important reason that you should adopt value-level instances is precisely because they are the type-safe approach to ad-hoc polymorphism.

Appendix

The following code implements StateT, ReaderT, Identity, MonadState, and MonadReader from the mtl, along with some example functions. The code is completely self-contained and can be loaded directly into ghci. Every function is annotated with a comment showing how the mtl implements the exact same class or instance so you have plenty of examples for how you would translate the type-class approach into the value-level instance approach.