Wednesday, January 28, 2015

It's time for another anti-tutorial! Whereas a tutorial is an advanced user giving step-by-step instructions to help newbies, an anti-tutorial is a new user describing their path to enlightenment. My approach is usually to follow the types, so my anti-tutorials are also examples of how to do that.

Today, inspired by a question from Syncopat3d, I'll try to learn how to use Simon Marlow's Haxl library. I think Haxl is supposed to improve the performance of complicated queries which use multiple data sources, such as databases and web services, by somehow figuring out which parts of the query should be executed in parallel and which ones should be batched together in one request. Since Syncopat3d is looking for a way to schedule the execution of a large computation which involves running several external processes in parallel, caching the results which are used more than once, and batching together the processes which use the same input, Haxl seemed like a good fit!

Black triangle

To understand the basics of the library, I'd like to create a black triangle, that is, a trivial program which nevertheless goes through the whole pipeline. So as a first step, I need to figure out what the stages of Haxl's pipeline are.

Since I'm using a type-directed approach, I need some type signature from which to begin my exploration. Hunting around Haxl's Hackage page for something important-looking, I find GenHaxl, "the Haxl monad". Despite the recent complaints about the phrase "the <something> monad", finding that phrase here is quite reassuring, as it gives me a good idea of what to expect in this package: a bunch of commands which I can string together into a computation, and some function to run that computation.

Thus, to a first approximation, the Haxl pipeline has two stages: constructing a computation, and then running it.

A trivial computation

Since GenHaxl is a monad, I already know that return 42 is a suitably trivial and valid computation, so all I need now is a function to run a GenHaxl computation.

That function is typically right after the definition of the datatype, and indeed, that's where I find runHaxl. I see that in addition to my trivial GenHaxl computation, I'll need a value of type Env u. How do I make one?

Clicking through to the definition of Env, I see that emptyEnv can make an Env u out of a u. Since there are no constraints on u so far, I'll simply use (). I fully expect to revisit that decision once I figure out what the type u represents in the type GenHaxl u a.
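Put together, the steps so far might look like the following minimal sketch. It assumes the haxl package is installed and that emptyEnv runs in IO; the exact signatures may differ between Haxl versions, so treat it as an illustration rather than a reference.

```haskell
import Haxl.Core (emptyEnv, runHaxl)

main :: IO ()
main = do
  -- () stands in for the user environment u, which is unconstrained so far.
  env <- emptyEnv ()
  -- The trivial computation: no data source is involved yet.
  answer <- runHaxl env (return 42) :: IO Int
  print answer
```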

Good, we now have a base on which to build! Let's now make our computation slightly less trivial.

What's a data source?

There are a bunch of GenHaxl commands listed after runHaxl, but most of them seem to be concerned with auxiliary matters such as exceptions and caching. Except for one:

dataFetch :: (DataSource u r, Request r a) => r a -> GenHaxl u a

That seems to be our link to another stage of Haxl's pipeline: data sources. So the pipeline has three stages: first we define a data source, then we describe a computation which fetches from that data source, and finally we run the computation.

So, I want an r a satisfying DataSource u r. Is there something simple I could use for r? The documentation for DataSource doesn't list any instances, so I guess I'll have to define one myself. Let's see, there is only one method to implement, fetch, and it uses both u and r. The way in which they're used should give me a hint as to what those type variables represent.

fetch :: State r
      -> Flags
      -> u
      -> [BlockedFetch r]
      -> PerformFetch

I find it surprising that neither u nor r seems to constrain the output type. In particular, u is again completely unconstrained, so I'll keep using (). The description of the u parameter, "User environment", makes me think that indeed, I can probably get away with any concrete type of my choosing. As for r, which seems to be the interesting part here, we'll have to look at the definitions for State and BlockedFetch to figure out what it's about.

Okay, so State r is an associated type in an otherwise-empty typeclass, so I can again pick whatever I want. BlockedFetch r is much more interesting: it has an existential type a, which ties the r a to its ResultVar a. The documentation for BlockedFetch explains this link very clearly: r a is a request with result a, whose result must be placed inside the ResultVar a. This explains why r wasn't constraining fetch's output type: this ResultVar is the Haskell equivalent of an output parameter. So instead of being a pure function returning something related to r, this fetch method must be an imperative computation which fills in its output parameters before returning to the caller. fetch's return type, PerformFetch, is probably some monad which has commands for filling in ResultVars.

data PerformFetch = SyncFetch (IO ()) | ...

At least in the simple case, PerformFetch is a simple wrapper around IO (), so I guess ResultVar must be a simple wrapper around MVar or IORef.

A trivial data source

Anyway, we now have a clear idea of what r a is: a request whose result has type a. Let's create a simple data source, Deep Thought, which only knows how to answer a single request.
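To make the "output parameter" idea concrete without depending on Haxl itself, here is a toy model that uses only base. ToyResultVar, ToyBlockedFetch, toyFetch and the constructor name AnswerToEverything are all hypothetical stand-ins invented for this sketch; they mimic the shapes described above but are not Haxl's definitions.

```haskell
{-# LANGUAGE GADTs #-}
import Data.IORef (IORef, newIORef, readIORef, writeIORef)

-- A request type indexed by its result type, like the `r a` in the text.
data DeepThought a where
  AnswerToEverything :: DeepThought Int  -- hypothetical constructor name

-- Toy stand-in for a result variable: a mutable slot, empty until filled.
newtype ToyResultVar a = ToyResultVar (IORef (Maybe a))

-- Toy stand-in for a blocked fetch: the existential `a` ties a request
-- to the variable which must receive its result.
data ToyBlockedFetch r where
  ToyBlockedFetch :: r a -> ToyResultVar a -> ToyBlockedFetch r

-- Fill in the output parameter of a single request.
fill :: ToyBlockedFetch DeepThought -> IO ()
fill (ToyBlockedFetch AnswerToEverything (ToyResultVar ref)) =
  writeIORef ref (Just 42)

-- An imperative "fetch": nothing is returned, the outputs are written in place.
toyFetch :: [ToyBlockedFetch DeepThought] -> IO ()
toyFetch = mapM_ fill

-- Create an empty slot, hand it to the data source, then read it back.
askDeepThought :: IO (Maybe Int)
askDeepThought = do
  ref <- newIORef Nothing
  toyFetch [ToyBlockedFetch AnswerToEverything (ToyResultVar ref)]
  readIORef ref

main :: IO ()
main = askDeepThought >>= print  -- prints: Just 42
```

Note that nothing in the types forces fill to write to the slot: if it forgot to, askDeepThought would simply return Nothing at runtime.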

A trivial state

...but the execution fails at runtime. Now that I think about it, it makes a lot of sense: even though I don't use it, fetch receives a value of type State DeepThought, but since this is a custom type and I haven't given any of its inhabitants to anything, there is no way for Haxl to conjure one up from thin air. There must be a way to initialize the state somehow.

I must say that I'm a bit disappointed by how imperative Haxl's API has been so far. Whether we're assigning values to result variables or initializing a state, correctness requires us to perform actions which aren't required by the types and thus can't be caught until runtime. This is unusual for a Haskell library, and if the rest of the API is like this, I'm afraid following the types won't be a very useful exploration technique.

Anyway, I couldn't find any function with "init" in the name, but by looking for occurrences of State in the types, I figured out how to perform the initialization: via the environment u which I had left empty until now.

It worked! We have a trivial data source, we have a trivial expression which queries it, we can run our expression, and we obtain the right answer. That's our black triangle!

Multiple data sources, multiple requests

Next, I'd like to try a slightly more complicated computation. Syncopat3d gives the following example:

F_0(x, y, z) = E(F_1(x, y), F_2(y, z))

Here we clearly have two different data sources, E and F. Syncopat3d insists that E is computed by an external program, which is certainly possible since our data sources can run any IO code, but I don't think this implementation detail is particularly relevant to our exploration of Haxl, so I'll create two more trivial data sources.

This is not the trace I was hoping to see. Since fetch is receiving a list of request/var pairs, I expected Haxl to send me multiple requests at once, in case my data source knows how to exploit commonalities in the requests. But it doesn't look like Haxl figured out that the F_1 and F_2 requests could be performed at the same time.

It turns out that this is a well-known problem with Haxl's monadic interface. I remember it now: it was described in a presentation about Haxl (slide 45) when the library came out. The solution is to use the Applicative syntax to group the parts which are independent of each other.
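Why the Applicative syntax helps can be seen on a toy model of batching, written with base only. The Fetch type below is an assumption made for illustration and is not Haxl's implementation: a computation is either finished or blocked on one round of requests, and only the Applicative instance is allowed to merge two blocked rounds into one.

```haskell
-- Toy model of batching (an illustration, not Haxl's implementation).
data Fetch a
  = Done a
  | Blocked [String] (Fetch a)  -- requests for this round, then the rest

instance Functor Fetch where
  fmap f (Done a)       = Done (f a)
  fmap f (Blocked rs k) = Blocked rs (fmap f k)

instance Applicative Fetch where
  pure = Done
  Done f       <*> x              = fmap f x
  Blocked rs k <*> Done a         = Blocked rs (fmap ($ a) k)
  -- Both sides are blocked: merge their requests into a single round.
  Blocked rs k <*> Blocked rs' k' = Blocked (rs ++ rs') (k <*> k')

instance Monad Fetch where
  Done a       >>= f = f a
  -- The continuation needs the result, so its requests must wait a round.
  Blocked rs k >>= f = Blocked rs (k >>= f)

-- A request whose "result" is simply its own name.
request :: String -> Fetch String
request name = Blocked [name] (Done name)

-- The list of rounds, each with the requests it would send together.
rounds :: Fetch a -> [[String]]
rounds (Done _)       = []
rounds (Blocked rs k) = rs : rounds k

monadic :: Fetch (String, String)
monadic = do
  a <- request "F_1"
  b <- request "F_2"
  return (a, b)

applicative :: Fetch (String, String)
applicative = (,) <$> request "F_1" <*> request "F_2"

main :: IO ()
main = do
  print (rounds monadic)      -- [["F_1"],["F_2"]]
  print (rounds applicative)  -- [["F_1","F_2"]]
```

The monadic version produces two rounds of one request each, because bind cannot look inside its continuation, while the applicative version sees both requests up front and puts them in the same round.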

Style

I don't like the way in which we have to write our computations. Consider a slightly more complicated example:

E(
E(F_1(x,y), F_2(y,z)),
E(F_1(x',y'), F_2(y',z'))
)

Since the four F_1 and F_2 requests at the leaves are all independent, it would make sense for Haxl to batch them all together. But in order to obtain this behaviour, I have to list their four subcomputations together.
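One way to avoid listing the leaves by hand is to have the intermediate function take its arguments as computations and combine them applicatively before issuing its own request. The sketch below shows this on a base-only toy model of batching; Fetch, request, rounds, f1, f2 and e are hypothetical names invented here, not part of Haxl.

```haskell
-- Toy model of batching (an illustration, not Haxl's implementation).
data Fetch a = Done a | Blocked [String] (Fetch a)

instance Functor Fetch where
  fmap f (Done a)       = Done (f a)
  fmap f (Blocked rs k) = Blocked rs (fmap f k)

instance Applicative Fetch where
  pure = Done
  Done f       <*> x              = fmap f x
  Blocked rs k <*> Done a         = Blocked rs (fmap ($ a) k)
  Blocked rs k <*> Blocked rs' k' = Blocked (rs ++ rs') (k <*> k')

instance Monad Fetch where
  Done a       >>= f = f a
  Blocked rs k >>= f = Blocked rs (k >>= f)

-- A request's "result" is just its own text, so results stay readable.
request :: String -> Fetch String
request name = Blocked [name] (Done name)

-- The list of rounds, each with the requests it would send together.
rounds :: Fetch a -> [[String]]
rounds (Done _)       = []
rounds (Blocked rs k) = rs : rounds k

-- The leaves take plain arguments.
f1, f2 :: String -> String -> Fetch String
f1 a b = request ("F_1(" ++ a ++ "," ++ b ++ ")")
f2 a b = request ("F_2(" ++ a ++ "," ++ b ++ ")")

-- e takes computations, runs them side by side, then issues its own request.
e :: Fetch String -> Fetch String -> Fetch String
e mx my = do
  (x, y) <- (,) <$> mx <*> my
  request ("E(" ++ x ++ "," ++ y ++ ")")

-- The nested expression from the text, written as plain nested calls.
nested :: Fetch String
nested = e (e (f1 "x" "y") (f2 "y" "z"))
           (e (f1 "x'" "y'") (f2 "y'" "z'"))

main :: IO ()
main = mapM_ print (rounds nested)
```

In this model, nested is executed in three rounds containing four, two and one requests respectively: the four leaves, then the two inner E requests, then the outer E.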

Conclusion

I now understand Haxl's purpose much better. With the appropriate intermediate functions, Haxl allows us to describe a computation very concisely, as nested function calls. Haxl executes this computation one layer at a time: all of the leaves, then all the requests which only depend on the leaves, and so on. Within a single layer, the requests are subdivided again, this time according to their respective data sources. Finally, for a given data source, it is fetch's responsibility to find and exploit opportunities for reusing work across the different requests belonging to the same batch. There are also some features related to caching and parallelism which I didn't explore.

I also understand Haxl's implementation much better, having reimplemented part of it myself. In fact, I'd be interested in writing a follow-up post named "Homemade Haxl", in the same vein as my "Homemade FRP" series. What do you think? Are you more interested in watching me learn some new libraries, watching me reimplement libraries, or watching me implement new stuff? I'll be doing all three anyway, I just want to know which of those activities I should blog about :)

Really, your feedback would be greatly appreciated, as the only reason I started this anti-tutorial series in the first place is that my first write-up on understanding Pipes was so surprisingly popular. I've streamlined the format a lot since that first post, and I want to make sure I haven't lost any of the magic in the process!

I'm not sure I understand what you're trying to do; is it the data sources which mutate a data structure, or is it the code which fetches data from those data sources? Would it work for you to perform the pure part of the computation inside GenHaxl, and then to perform the mutations inside IO, once you have received the results from runHaxl?