logji

Wednesday, February 12, 2014

This post is being resurrected from the dustbin of history - it was originally posted on the precog.com engineering blog, which has since been lost to acquisition and bitrot. My opinion on the applicability of the following techniques has changed somewhat since I originally wrote this post; I will address this in a followup post in the near future. Briefly, though, I believe that parameterizing each method with an implicit monad constraint is preferable where possible; it provides the user with greater flexibility.

In our last blog post on Precog development, Daniel wrote about how we use the Cake Pattern to structure our codebase and to leave the implementation types abstract as long as possible. As he showed in that post, this is an extremely powerful concept; by keeping a type existential, values of that type remain opaque to any modules that aren’t “aware” of the eventual type chosen, and so are prevented by the compiler from breaking encapsulation boundaries.

In today’s post, we’re going to extend this notion beyond types to handle type constructors, and in so doing will show a mechanism that allows us to switch out entire models of computation.

If you’ve been working with Scala for any length of time, you’ve undoubtedly heard the word “monad” floating around in one context or another, perhaps in a discussion about the syntactic sugar provided by Scala’s ‘for’ keyword or a blog post discussing how the Option type can be used to avoid the pitfalls of null references. While a significant amount of discussion of monads in Scala focuses on the “container” types, a few types common in the Scala ecosystem display a more interesting facet of monadic composition – delimited computation. While all monadic types exhibit this in composition, perhaps the most commonly used monadic type in Scala that exemplifies this sort of use directly is akka.dispatch.Future, (which is scheduled to replace Scala’s current Future interface in the standard library in Scala 2.10) which encodes asynchronous computation. It embodies the aspect of monadic composition that we’re most concerned with here by providing a flexible way to order the steps of a computation.

I’d like to step back a moment here and state that this post isn’t intended to function as a monad tutorial; there are numerous (perhaps too many!) articles about monads, and their relevance to programming in Scala exist elsewhere. If you’re new to the concept it will be useful for you to take advantage of one or more of these resources before continuing here. It is, however, important to point out at first that the use of monads in Scala, while pervasive (as evidenced by the presence of ‘for’ as syntactic sugar for monadic composition) is somewhat idiosyncratic in that the Scala standard libraries actually provide no Monad type. For this, we have to look outside of the standard library to the excellent scalaz project. Scalaz’s encoding of the monadic abstraction relies upon the implicit typeclass pattern. The base Monad type is shown here, simplified, for reference:

You’ll note that the Monad trait is not parameterized by a specific type, but instead a type constructor of one argument. The methods defined inside of Monad are then parametrically polymorphic, which means that they must provide a specific type to be inserted into the “hole” at the invocation point. This will be important later, when we talk about how to actually take advantage of this abstraction.

Scalaz provides implementations of this type for most of the monadic types in the Scala standard library, as well as several more sophisticated monadic types, which we’ll return to in a moment. For now, however let’s talk a bit about Akka’s Futures.

An Akka Future represents a computation whose value is produced asynchronously, and which may fail. Also, as I noted before, akka.dispatch.Future is monadic; that is, it is a type for which the Monad trait above can be trivially implemented and which satisfies the monad laws, and so it provides an extremely useful primitive for composing asynchronous computations without all sorts of tedious mucking about with manual management of threads and shared mutable state. At Precog, we use Futures extensively, both in a direct fashion and to allow us a composable way to interact with subsystems that are implemented atop Akka’s actor framework. Futures are arguably one of the best tools we have for reining in the complexity of asynchronous programming, and so our many of our early versions of APIs in our codebase exposed Futures directly. For example, here’s a snippet of one of our internal APIs, which follows the Cake pattern as described previously.

trait DatasetModule {
type Dataset
trait DatasetLike {
/** The members of this dataset will be used to determine what sets to
load, and the resulting sets will be unioned together */
def load: Future[Dataset]
/** Sorts the dataset by the specified value function. */
def sort(sortBy: /*...*/): Future[Dataset]
/** Retains a prefix of this dataset. */
def take(size: Int): Dataset
/** Map members of the dataset into the A type using the specified value
function, then combine using the resulting monoid */
def reduce[A: Monoid](mapTo: /*...*/): Future[A]
}
}

The Dataset type here is something of a strawman, but is loosely representative of the type that we use internally to represent an intermediate result of a computation - a lazy data structure with a number of operations that can be used to manipulate it, some of which may involve actually evaluating a function over the entire dataset and which may involve I/O, distributed evaluation, and asynchronous computation. Based on this interface, it’s easy to see that evaluation of some query with respect to a dataset might involve a load, a sort, taking a prefix, and a reduction of that prefix. Moreover, such an evaluation will not rely upon anything except the monadic nature of Future to compose its steps. What this means is that from the perspective of the consumer of the DatasetModule interface, the only aspect of Future that we’re relying upon is the ability to order operations in a statically checked fashion; the sequencing, rather than any particular semantics related to Future’s asynchrony, is the relevant piece of information provided by the type. So, the following generalization becomes natural:

trait DatasetModule[M[+_]] {
type Dataset
implicit def M: Monad[M]
trait DatasetLike {
/** The members of this dataset will be used to determine what sets to
load, and the resulting sets will be unioned together */
def load: M[Dataset]
/** Sorts the dataset by the specified value function. */
def sort(sortBy: /*...*/): M[Dataset]
/** Retains a prefix of this dataset. */
def take(size: Int): Dataset
/** Map members of the dataset into the A type using the specified value
function, then combine using the resulting monoid */
def reduce[A: Monoid](mapTo: /*...*/): M[A]
}
}

and, of course, down the road some concrete implementation of DatasetModule will refine the type constructor M to be Future:

In practice, we may actually leave M abstract until “the end of the universe.” In the Precog codebase, the M type will frequently represent the bottom of a stack of monad transformers including StateT, StreamT, EitherT and others that the actual implementation of the Dataset type depends upon.

This generalization has numerous benefits. First, as with the previous examples of our use of the Cake pattern, consumers of the DatasetModule trait are completely and statically insulated from irrelevant details of the implementation type. An important such consumer is a test suite. In a test, we probably don’t want to worry about the fact that the computation is being performed asynchronously; all that we care about is that we obtain a correct result. If our M is in fact at the bottom of a transformer stack, we can trivially replace it with the identity monad and use the “copointed” nature of this monad (the ability to “extract” a value from the monadic context). This allows us to build a similarly generic test harness:

For most cases, we’ll use the identity monad for testing. Suppose that we’re testing the piece of functionality described earlier, which has computed a result from the combination of a load, a sort, take and reduce. The test framework need never consider the monad that it’s operating in.

In the case that we have a portion of the implementation that actually depends upon the specific monadic type (say, for example, that our sort implementation relies on Akka actors and the “ask” pattern under the hood, so that we’re using Futures) we can simply encode this in our test in a straightforward fashion:

Future is, of course, not properly copointed (since Await can throw an exception) but for the purposes of testing (and testing exclusively) this construction is ideal. As before, we get exactly the type that we need, statically determined, at exactly the place that we need it.

In practice, we’ve found that abstracting away the particular monad that our code is concerned with has aided tremendously with keeping the concerns of different parts of our codebase well isolated, and ensuring that we’re simply not able to sidestep the sequencing requirements that are necessary to make a large, functional codebase work together as a coherent whole. As an added benefit, many parts of our application that were not initially designed thinking in terms of parallel execution are able to execute concurrently. For example, in many cases we’ll be computing a List[M[...]] and then using the “sequence” function provided by scalaz.Traverse to turn this into an M[List[...]] - and when M is future, each element may be computed in parallel, with the final sequenced result becoming available only when all the computations to produce the members of the list are complete. And, ultimately, even this merely touches the surface of a deep pool of composing our computation that is made possible by making this abstraction.

Thursday, November 29, 2012

This post is intended to address a question that I seem to see come up every few months on the Scala mailing
list or the #scala IRC channel on freenode. The question is often posed in terms of the phrase
"recursive type" and usually involves someone having some difficulty or other (and many arise) with
a construction like this:

This sort of self-referential type constraint is known formally as F-bounded type polymorphism and is usually attempted when someone is
trying to solve a common problem of abstraction in object-oriented languages; how to define a polymorphic
function that, though defined in terms of a supertype, will when passed a value of some subtype will always
return a value of the same subtype as its argument.

It's an interesting construct, but there are some subtleties that can trip up the unwary user, and which make
naive or incautious use of the pattern problematic. The first issue that must be dealt with is how to
properly refer to the abstract supertype, rather than some specific subtype. Let's begin with a simple
example; let's say that we need a function that, given an account, will apply a transaction fee for adding
funds below a certain threshold.

This is straightforward; the type bound is enforced via polymorphism at the call site. You'll notice that the
type ascribed to the "account" argument is T, and not Account[T] - the bound on T gives us all the constraints
that we want. This does what we want it to do when we're talking about working with one account at a time.
But, what if we want instead to perform some action with a collection of accounts of varying types; suppose
we need a method to debit all of a customer's accounts for a maintenance fee? We can expect our type bounds
to hold, but things begin to get a little complicated; we're forced to use bounded existential types.
Here is the correct way to do so:

The important thing to notice here is that the type of individual members of the list are existentially
bounded, rather than the list being existentially bounded as a whole. This is important, because it means
that the type of elements may vary, rather than something like "List[T] forSome { type T <: Account[T] }"
which states that the values of the list are of some consistent subtype of T.

So, this is a bit of an issue, but not a terrible one. The existential types clutter up our codebase and
sometimes give the type inferencer headaches, but it's not intractable. The ability to state these existential
type bounds does, however, showcase one advantage that Scala's existentials have over Java's wildcard types,
which cannot express this same construct accurately.

The most subtle point about F-bounded types that is important to grasp is that the type bound is *not*
as tight as one would ideally want it to be; instead of stating that a subtype must be eventually
parameterized by itself, it simply states that a subtype must be parameterized by some (potentially
other) subtype. Here's an example.

This will compile without error, and presents a bit of a pitfall. Fortunately, the type bounds that we
were required to declare at the use sites will prevent many of the failure scenarios that we might be
concerned about:

but it's a little disconcerting to realize that this level of strictness is *only* available at the use site,
and cannot be readily enforced (except perhaps by implicit type witness trickery, which I've not yet tried)
in the declaration of the supertype, which is what we were really trying to do in the
first place.

Finally, it's important to note that F-bounded type polymorphism in Scala falls on its face when you start
talking about higher-kinded types. Suppose that one desired to state a supertype for Scala's structurally
monadic types (those which can be used in for-comprehensions):

This fails to compile outright, complaining about cyclic references in the type constructor M.

In conclusion, my experience has been that F-bounded type polymorphism is tricky to get right and causes typing clutter in the codebase. That isn't to say that it's without value, but I think it's best to consider very carefully whether it is actually necessary to your particular application before you step into the rabbit hole. Most of the time, there's no wonderland at the bottom.

The problem here is that this has a subtly different meaning; this says that for debitAll2, all the members
of the list must be of the *same* subtype of Account. This becomes apparent when we actually try to use the
method with a list where the subtype varies. In both constructions, actually, you end up having to explicitly
ascribe the type of the list, but I've not been able to find a variant for the debitAll2 where the use site
will actually compile with such a variant-membered list.

Saturday, February 18, 2012

I'm writing this post because I want to address a problem that I see time and time again: people who are trying to figure out how to encode algebraic data types in languages that do not support them come up with all kinds of crazy solutions. But, there is a simple and effective encoding of algebraic data types that everyone knows about, but had just been doing wrong: the Visitor pattern. At this point, I expect many devotees of functional programming to start screaming "but... mutability!!!" And, yes, the mutability required by the textbook definition of the Visitor pattern is indeed a major problem - the problem that I intend to show how to correct right here. The solution is remarkably simple. Here's a tree traversal using the Visitor pattern, implemented *correctly* in Java:

This is exactly the traditional Visitor pattern, with one minor variation: the 'accept' method is generic (parametrically polymorphic), returning whatever return type is defined by a particular TreeVisitor instance instead of void. This is a vitally important distinction; by making accept polymorphic and non-void returning, it allows you to escape the curse of being forced to rely on mutability to accumulate a result. Here's an example of the implementation of the 'depth' method from the previously mentioned blog post, and an example of its use. You'll note that no mutable variables were harmed (or indeed used) in the creation of this example:

TreeVisitor encodes the f-algebra for the Tree data type; the accept method is the catamorphism for Tree. Moreover, given this definition, you can also see that the visitor forms a monad (example in Scala), giving rise to lots of nice compositional properties. Implemented in this fashion, Visitor is actually nothing more (and nothing less) than a multiple dispatch function over the algebraic data type in question. So stop returning void from your visitors, and ramp up their power in the process!

Wednesday, August 17, 2011

When using Specs with ScalaCheck, I have occasionally found myself annoyed that, when a function under test threw an exception, ScalaCheck didn't show me enough of a stack trace to actually figure out what was going on. There is a means to configure this, but to the best of my knowledge it's essentially not documented anywhere. In any case, to get full stack traces in your spec, all that you need is this:

Sunday, February 27, 2011

I had a ton of fun yesterday at Code Retreat Boulder, even though I ended up as the Scala guy and thus didn't get to play in other languages as much as I would have liked.
Anyway, this morning I decided to recreate my favorite of the solutions. Small, simple, pure, and relatively flexible:

Thursday, July 29, 2010

I was working on a problem today where some of the computations could fail, but degrade gracefully while providing information about how exactly they failed so that clients could choose whether or not to use the result. This is what I came up with:

Here each of the various mergeX methods return an Attempt[List[String], X] where X is something needed to build the ultimate report. Here I'm just aggregating lists of strings describing the errors, but of course any type for which (E, E) => E is defined would work.

Attempt. For all those times where you want a monad that says, "Hey, I maybe couldn't get exactly what you asked for. Maybe it's little broken, maybe it won't work right, but this the best I could do. You decide."

EDIT:

After a bit of thinking, I realized that this monad is really more general than being simply related to success or failure - it simply models a function that may or may not produce some additional metadata about its result. Then a lightbulb went off, and quick google search confirmed... yup, I just reinvented the writer monad. It's not *exactly* like Writer, because it just requires a semigroup for E instead of a monoid, and the presence of a "log" is optional, so maybe it's better suited than Writer for a few instances.

The one really nice thing about rediscovering well-known concepts is that doing the derivation for yourself, in the context of some real problem, gives you a much more complete understanding of where the thing you reintevented is applicable!

Inspired by michid's example in the comments below, here's a simpler example that demonstrates some utility his doesn't quite capture.

Thursday, July 15, 2010

I've been working a bit with Hadoop lately. As many will know, there are essentially two versions of Hadoop that coexist within the Hadoop API; a deprecated legacy API and a newer one that is not nearly so feature-complete. It's a source of confusion to say the least.

Earlier today, I asked in the #hadoop channel on freenode, why were the breaking changes not simply made on a separate branch, the major version number incremented, and so forth such that everyone could move forward? The answers that I got back illuminated a troublesome trend that I think is worth considering for those projects (like my beloved Scala) that are moving towards professional support models of funding.

During the discussion, I commented that there is an unfortunate reality that the largest companies, who may provide the greatest support any given open-source project are also those who will likely be most conservative with respect to change. So as soon as open-source projects grow large enough that companies selling commercial support for those projects survive, there is an immediate disincentive for the support companies to make breaking changes. And stagnation ensues.

The response was startling. People who had been involved with numerous prominent businesses and projects immediately replied with stories of their own projects suffering such a fate, sometimes to the point that the project died entirely.

I think that this "success effect" can be traced to a single, simple cause, with a single, simple solution. Companies (and the individuals that make them up) need to stop thinking of mission-critical software as something that is ever complete. Marketing is never complete; sales is never done with their job, purchasing and HR are never done, nor is accounting. Software is no different - as engineers, we do not (and should not purport to) produce solutions to the problems that companies have; instead, we automate and support the operation of businesses so that businesses can do more, serve more customers, and in the end create more wealth. A project is finished when nobody is using it; never before then.