Big Ideas for Haskell

In a conversation last night, a friend and I talked about the following thought experiment:

If you had a few million dollars and a few years to spend on hiring/recruiting and leading a strike force to make some big improvements in the infrastructure of your favorite programming language… what would you do?

Note that the changes have to be to infrastructure. We aren’t interested in language changes, which tend to get too much play already; but rather in the tools and fundamental libraries available to us. Here’s my list for Haskell.

Project #1: Fix Dependencies in GHC/Cabal

Managing dependencies is a HUGE issue in Haskell right now. Any substantial software project seems to require nearly as much time managing the various versions of different dependent libraries as writing code. And there’s one really big thing that can be done about it: don’t expose dependencies from a library unless they are really, truly needed.

Let’s look at an example. A gazillion different libraries use Parsec internally to parse various things from text, and they use various different versions of Parsec. Even though there is no reason two different versions of Parsec can’t coexist in the same program, Cabal’s dependency resolution will still try very hard to avoid combining those two libraries, just because of an inconsequential implementation detail. Parsec isn’t the only such package, either: QuickCheck, for years, split the Haskell world because different packages depended on different versions of QuickCheck for their internal testing. The situation with mtl and transformers is a little more complicated, since it’s possible in principle that libraries that depended on both actually needed the same versions; but poking around a bit reveals that for the most part, these libraries used mtl or transformers internally to build monads that could very well have been wrapped up in opaque newtype wrappers at the package boundary.

Basically, we have a lot of confusion between implementation dependencies, and exposed dependencies.

So this idea has two steps. Step #1: instead of just telling GHC about all of the packages your code depends on, you should be able to give it a list of hidden dependencies and a separate list of exposed dependencies. It should check that any type accessible via the exports of your package’s exposed modules comes from the list of exposed dependencies. Step #2: Cabal should get separate fields for exposed and hidden dependencies and pass them along to GHC, and its constraint solver should never fail to install a package merely because two hidden dependencies want different versions.
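To make the proposal concrete, a hypothetical `.cabal` syntax might look like the sketch below. The field names `exposed-build-depends` and `hidden-build-depends` are my invention; nothing like them exists in Cabal today.

```cabal
library
  exposed-modules:       Data.MyLib
  -- Dependencies whose types appear in the exposed API; these
  -- must be version-consistent across the whole install plan.
  exposed-build-depends: base >= 4 && < 5, containers
  -- Implementation-only dependencies; the solver is free to pick
  -- any version, even one that conflicts with another package's
  -- choice, because nothing leaks across the package boundary.
  hidden-build-depends:  parsec >= 3.1
```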

I don’t know an easy way to measure this, but my strong suspicion is that this would save 80% of the time Haskell programmers currently spend on maintaining the dependencies of large Haskell projects. It would be a huge obstacle to real-world use of Haskell, removed in one fell swoop. It easily gets the #1 position on my list.

Project #2: Expand STM to External Systems

In my opinion, one of the most underestimated bits of potential in Haskell is software transactional memory. The STM library is, as Don Stewart so elegantly pointed out recently, done and working and quite usable. The problem is that it stops at transactional memory.

I’d say that from my experience, a very dangerous source of software bugs comes from the fact that so much software is written from the perspective of someone standing on the outside, looking in to a transaction. Managing transactions properly can be very difficult, and many programmers just plain get it wrong. I shudder to think of the number of web applications we trust every day that probably have data loss bugs because of poor transaction handling. This is something that desperately needs a solution.

One solution is database stored procedures, and they are used liberally for this purpose. The ability to write code that sees the world from inside of the transaction is, I think, at the root of why it’s generally considered safer to write data-related code in stored procedures rather than in applications. But stored procedures are specific to a single data source. Most significant information stores provide some support for distributed transactions these days… but using them requires doing your data manipulation in application code, which means back outside the transaction. Very little work has been done on writing code at the application level that nevertheless lives inside of a (distributed) transaction. A big reason for that is that applications have always been effectful, and we’ve had no good way to take the side effects of the code in the application itself and manage even that central bit in a transaction-safe way.

Enter STM. Now we do have a nice, working, fully functional system for writing application level code that sees the world from inside of a transaction. STM was developed as a means of writing composable code that gets speedups from parallelism; I think that’s missing the bigger picture. Speedups from parallelism are nice, but what STM really gives us is a way to write composable code that’s organized as concurrent processes acting on data with safety offered by transactions. The next step is to expand those transactions. I want to write code that makes changes to both a database and my data structures in memory, and have that code run as a transaction, which either succeeds or fails as a whole.
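As a baseline, here is what today’s in-memory STM already gives us: a minimal, runnable account-transfer example using the real `stm` package API. The dream described above is for `from` and `to` to live in an external database rather than in `TVar`s, with the same compositional guarantees.

```haskell
import Control.Concurrent.STM

-- Transfer between two in-memory accounts; the whole block
-- commits or retries as one atomic unit.
transfer :: TVar Int -> TVar Int -> Int -> STM ()
transfer from to amount = do
  balance <- readTVar from
  check (balance >= amount)      -- blocks (retries) until funds suffice
  writeTVar from (balance - amount)
  modifyTVar' to (+ amount)

main :: IO ()
main = do
  a <- newTVarIO 100
  b <- newTVarIO 0
  atomically (transfer a b 30)
  print =<< atomically ((,) <$> readTVar a <*> readTVar b)  -- (70,30)
```

Note that `check` is exactly the `retry`-based operation the next paragraph identifies as hard to support against an external transaction system.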

There are challenges here, to be sure: STM’s crucial retry operation is unlikely to be supported by any kind of external transaction system, for example, and the transaction models of different external systems may be hard to bring together under a single interface (savepoints? nested transactions? isolation levels?). But even a simple lowest common denominator here would be a huge step forward.

I feel like Haskell has a huge opportunity here to be the first popular general-purpose language in which it’s possible to easily write and compose code that is really transaction-safe. It would be a true shame to miss the opportunity.

Project #3: Universal Serialization

This is cheating a little bit, because it’s almost asking for a language feature… but I can get away with it because it doesn’t actually require a change to the language: just exposing a somewhat magical new API from the compiler.

Basically, it’s possible to take a lot of types in Haskell, and turn them into some kind of serialized form from which they can be recovered later. This is what Data.Binary (from the binary package) and Data.Serialize (from the cereal package) both do. But it’s not possible for all types. First class functions, for example, cannot be serialized. It’s not just that no one has written the code yet; in Haskell, it simply can’t happen.
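To illustrate both halves of that claim, here is a round trip through `Data.Binary` for a plain data type, using the generic deriving that later versions of the binary package support, alongside the case that no library can handle:

```haskell
{-# LANGUAGE DeriveGeneric #-}
import Data.Binary  (Binary, decode, encode)
import GHC.Generics (Generic)

data Point = Point Int Int
  deriving (Show, Eq, Generic)

instance Binary Point  -- derived generically; fine for plain data

roundTrip :: Bool
roundTrip = decode (encode (Point 3 4)) == Point 3 4  -- True

-- But there is no way to write:  instance Binary (Int -> Int)
-- A closure's code pointer and captured environment are simply
-- invisible to library code; only the compiler knows them.
```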

But in a magic library provided by a Haskell compiler, it definitely could happen.

You’d have to be willing to live with a few limitations: recompiling your code would most likely invalidate serialized first-class functions. The interpreter is an even harder challenge. Serialized values would need to embed checksums of various bits of the code, and I haven’t even begun to think through which checksums would be needed. But it should be possible. After all, a closure is just a function pointer (to a function which has some offset in the compiled binary code) together with an environment consisting of values of other types. Throw in some smart handling of cyclic data structures based on a lookup table, and you’ve got something.
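What might the magic library look like? A purely hypothetical sketch follows; every name here is invented, nothing like this module exists in GHC, and the bodies are left undefined because only the runtime system could supply them.

```haskell
-- Hypothetical compiler-provided module; all names are invented.
module GHC.Serialize.Magic where

import Data.ByteString (ByteString)

-- Serialize *any* value, closures included, embedding checksums
-- of the relevant compiled code alongside the environment.
serializeAny :: a -> IO ByteString
serializeAny = undefined

-- Returns Nothing when the embedded checksums do not match the
-- running binary, e.g. after a recompile.
deserializeAny :: ByteString -> IO (Maybe a)
deserializeAny = undefined
```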

Even a half-baked attempt to make higher-order programming with functions and closures compatible with the existing world of SQL databases and binary network protocols would be a huge benefit.

Of course, the *big* problem that separates Haskell from Clean in this regard is the open world assumption, which Haskell does have, and Clean does not.

Basically, this means that Clean assumes that all code that will ever be run as part of the program was seen at the time of compilation. This makes it possible to guarantee that the particular numbering of types chosen is the same for all parts of the code.

With Haskell’s open world assumption, numbering of types becomes a much harder problem. How do you consistently number your types (and thereby specify what the exact types of e.g. a higher order function are) in the case that there are multiple numbering passes?

I’m not sure what numbering passes you’re referring to, but the standard way to designate code in open networks is via cryptographic identifiers. In this case, I would impose a total order on the constructors making up a type (say, alphanumeric), then construct a cryptographic hash of the resulting type from its constructors. This also supplies versioning based on the shape of a type: upgrades that change the shape are incompatible with existing code, while upgrades that don’t are compatible. Other hashing approaches with different tradeoffs are possible as well; for instance, adding constructors keeps compatibility, but changing or removing existing ones does not.
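A minimal sketch of that fingerprinting idea, with my own invented representation of a type’s shape; I use `Data.Hashable`’s `hash` as a stand-in where a real scheme would want a cryptographic hash such as SHA-256:

```haskell
import Data.Hashable (hash)  -- stand-in for a cryptographic hash
import Data.List     (sort)

-- A type's "shape": each constructor name paired with the (already
-- fingerprinted, here just named) types of its fields.
typeFingerprint :: [(String, [String])] -> Int
typeFingerprint ctors = hash (show (sort ctors))

-- Sorting first means declaration order doesn't matter:
-- typeFingerprint [("Leaf",[]),("Node",["Tree","Tree"])]
--   == typeFingerprint [("Node",["Tree","Tree"]),("Leaf",[])]
```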

I agree whole-heartedly with the sentiment that every modern language should be able to serialize any construct in the language. This is absolutely necessary for long-running, robust distributed systems like web programs.

Any and all contributions to Cabal are a big boon to the community. Cabal has a large impact on productivity. For example, I’m really excited about what the new test-suite support will do for code quality. No more cut-n-pasting large Makefiles between projects just to run tests!

Just stumbled over this post (from Haskell Weekly News).
About the serialisation, you might be interested in this paper: Orthogonal Serialisation for Haskell, accepted for publication in LNCS.
In short: use a runtime system feature that exists in the parallel runtime system, and add some Haskell code on top to make it reasonably type safe. It has severe limitations (you already mention them; the recompilation one is the most severe), but it can be used for a few nice things even so.

Arbitrary instances for datatypes are also exported, so you cannot always simply hide a dependency on QuickCheck. If you do not need custom Arbitrary instances, you can import QuickCheck only in the test executable, and disable building that executable and the QuickCheck dependency by default using a Cabal flag.

Well, if your Arbitrary instances are defined in modules that aren’t imported from any exposed modules, certainly you can. If you export Arbitrary instances, then yes, the version of QuickCheck would indeed be an exposed dependency. I’d guess that most people probably don’t want to expose Arbitrary instances anyway (though we’re brushing up against the orphan instance debate there).

1. I’ve been using Scala for eight months and I have never once had an sbt install (sbt is the Scala build tool and packaging system) ruin my entire set of Scala libraries. This happened so often with GHC and cabal that I was afraid to install new packages.

3. I misunderstood your last point. Serializing closures ain’t possible in Scala, but it does have the standard java.io.Serializable libraries.

cdsmith / Jun 12 2011 3:58 pm

In general, it looks like the only thing you’re talking about that relates to one of the points in my post is Akka incorporating distributed transactions, assuming that STM will be a player among those transactions. There’s no indication that sbt addresses the dependency issues that cause problems for GHC and Cabal (not a surprise; Nix is about the only dependency management system I’ve seen in any context that actually solves them). Stambecco doesn’t seem to have STM at all, that I can tell from its documentation. And right, first-class functions are definitely not serializable in Scala.