Monday, December 21, 2015

The GCHQ 2015 Christmas puzzle is a Nonogram puzzle, which involves filling in squares on a grid to reveal a picture, guided by constraints on the rows and columns. For a computer, a nice way to solve this problem is using a SAT solver. But humans aren't great at SAT solving, and I was given a print-out of this puzzle while on holiday, with no computer. I'd never encountered such a puzzle before, so working with a friend (and some wine) we came up with an approach, and set about applying it. Alas, robustly applying an algorithm with many steps is not easy for a human, and we eventually ended up with contradictions. On returning from holiday, I automated our approach, and tested it. Our approach worked, and the code is below.

The Problem

The puzzle is:

It comprises a 25x25 grid, some filled in squares, and alongside each row/column are the number of consecutive squares that must be filled in each line. For example, the 8th row down must have two runs of three filled squares, with a gap in between, and potentially gaps before or after.

Our Approach

Our approach was to take each line and compute the number of "free" gaps - how many spaces could be inserted with choice. For one row (4 from the bottom) the entire grid is constrained, with no free gaps. Starting with the most constrained lines, we tried to figure out where the pieces could go, based on the existing squares. We quickly realised that negative information was important, so tagged each square with "don't know" (left blank), must be filled (we shaded it in) or must be unfilled (we drew a circle in it). For each line in isolation, looking at the constraints, we inferred squares to be filled or unfilled by examining the possible locations of each run.

The Code

I implemented our approach in Haskell, complete code is available here.

Our constraint system works over a grid where each square is in one of three states. We can encode the grid as [[Maybe Bool]]. The [[.]] is a list of lists, where the outer list is a list of rows, and the inner list is a list of squares. Each of the inner lists must be the same length, and for the GCHQ puzzle, they must all be 25 elements long. For the squares we use Maybe Bool, with Nothing for unknown and Just for known, using True as filled and False as unfilled.

Given the [[Maybe Bool]] grid and the constraints, our approach was to pick a single line, and try to layout the runs, identifying squares that must be True/False. To replicate that process on a computer, I wrote a function tile that given the constraints and the existing line, produces all possible lines that fit. The code is:

The first equation (second line) says that if there are no remaining constraints we set all remaining elements to False. We use the ~> operator to check our desired assignment is consistent with the information already in the line:

This function takes a line of the grid (which may have unknowns), and a possible line (which is entirely concrete), and either returns Nothing (inconsistent) or Just the proposed line. We first check the sizes are consistent, then that everything which is concrete (not a Nothing) matches the proposed value.

Returning to the second equation in tile, the idea is to compute how many spaces could occur at this point. Taking the example of a line 25 long, with two runs of size 3, we could have anywhere between 0 and 18 (25-3-3-1) spaces first. For each possible size of gap, we split the line up (the splitAt calls), then constrain each piece to match the existing grid (using ~>).

Given a way of returning all possible lines, we then collapse that into a single line, by marking all squares which could be either True or False as Nothing:

If there are no satisfying assignments for the line, we return Nothing - that implies the constraints are unsatisfiable. Next, we scale up to a side of constraints, by combining all the constraints and lines:

To constrain the whole problem we apply constrainGrid repeatedly, until it returns Nothing (the problem is unsatisfiable), we have a complete solution (problem solved), or nothing changes. If nothing changes then there might be two solutions, or our approach might not be powerful enough without using search.

The Result

After four iterations we end up with a fully constrained answer. To see the progress, after one iteration we have:

Wednesday, December 09, 2015

The MinGHC project was started to produce a minimal Windows installer which didn't contain many packages, but which could install many packages - in particular the network package. But time has moved on, and Stack now offers everything MinGHC does, but cross-platform and better. To install GHC using Stack, just do stack setup, then stack exec -- my command. Even if you prefer to use Cabal, stack exec -- cabal install hlint is a reasonable approach.

A few final remarks on MinGHC:

MinGHC was an experiment started by Michael Snoyman, which myself (Neil Mitchell) and Elliot Cameron joined in with. I had fun working with those guys.

The ideas and approaches behind MinGHC got reused in Stack, so the knowledge learnt has transferred.

Tuesday, December 08, 2015

Stackage provides a set of precise versions of a subset of Hackage which all place nicely together. At any one time, there are two such version sets:

Nightly, which aims to be the latest version of all packages.

LTS (Long Term Support), which with every major release updates every package, and every minor release updates packages that only changed their minor version.

Stack currently defaults to LTS. I don't think LTS fulfils any need, and does cause harm (albeit very mild harm), so should be abandoned. Instead, people should use a specific nightly that suits them, and upgrade to later nightlies on whatever schedule suits them. I share these somewhat half-baked thoughts to hopefully guide how I think Stackage should evolve, and there may be good counterarguments I haven't thought of.

Why is Nightly better?

Nightly always has newer versions of packages. I believe that newer versions of packages are better.

As a package author, with each new release, I fix bugs and evolve the library in ways which I think are beneficial. I make mistakes, and new versions are how I fix them, so people using old versions of my software are suffering for my past mistakes.

As a package user, if I find a bug, I will endeavour to report it to the author. If I'm not using the latest version, it's highly likely the author will ask me to upgrade first. If a feature I want gets added, it's going to be to the latest version. The latest version of a package always has better support.

If the latest version of a package does not suit my needs, that's an important alarm bell, and I need to take action. In some cases, that requires finding a fork or alternative package to do the same thing. In some cases, that involves talking to the author to make them aware of my particular use case. In either case, doing this sooner rather than later makes it easier to find a good solution.

What's the benefit of LTS?

As far as I am aware, LTS major releases are just Nightly at a particular point in time - so if you only ever use x.0 LTS, you might as well just use Nightly. The main benefit of LTS comes from picking a major LTS version, then upgrading to the subsequent minor LTS versions, to access new minor versions of your dependencies.

Upgrading packages is always a risk, as LTS Haskell says. What LTS minor releases do are minimize the risk of having to fix compilation errors, at the cost of not upgrading to the latest packages. However, the reasons I do upgrades are:

Sometimes, if one package has a known bug that impacts me, I specifically upgrade just that one package. I don't want to upgrade other packages (it introduces unnecessary risk), so I take my package set (be it LTS or Nightly) and replace one constraint.

Every so often, I want to upgrade all my packages, so that I'm not missing out on improvements and so I'm not falling behind the latest versions - reducing the time to upgrade in future. I do that by picking the latest version of all packages, fixing breakages, and upgrading.

When upgrading, compile-time errors are a minor issue at worst. Before Stackage, my major headache was finding a compatible set of versions, but now that is trivial. Fixing compile-time errors is usually a little work, but fairly easy and predictable. Checking for regressions is more time consuming, and running the risk of regressions that escape the test suite has a significant cost. Tracking down if there are any resulting regressions is very time-consuming.

I don't see a use case for upgrading "a little bit", since I get a lot less benefit for only marginally less work.

Why is LTS better?

When I asked this question on Twitter, I got two reasons that did seem plausible:

Matt Parsons suggested that LTS provided a higher likelihood of precompiled binaries shared between projects. That does make sense, but a similar benefit could be achieved by preferring Nightly on the 1st on the month.

Gabriel Gonzalez suggested that when including a resolver in tutorials, LTS looks much better than a Nightly. I agree, but if there only was Nightly, then it wouldn't look quite as bad.

If Haskell packages had bug fixes that were regularly backported to minor releases, then the case for an LTS version would be stronger. However, my experience is that fixes are rarely backported, even for the rare security issues.

Is this a problem?

Currently everyone developing Haskell has two choices, and thanks to Stack, might not even be aware which one they've ended up making (Stack will pick nightly on its own if it thinks it needs to). Reducing the choices and simplifying the story removes work for the Stackage maintainers, and cognitive burden from the Stackage users.

Stackage was a radical experiment in doing things in an entirely different way. I think it's fair to say it has succeeded. Now, after experience using the features, I think it's right to critically look at tweaks to the model.

Given a function whose body is a lambda, you can use the function syntactic sugar to move the lambda arguments to the left of the = sign. One side condition is that you can't have a where binding, for example:

function x = \y -> xx + y
where xx = trace "hit" x

This is equivalent to:

function x = let xx = trace "hit" x in \y -> xx + y

Moving a let under a lambda can cause arbitrary additional computation, as I previously described, so is not allowed (hint: think of map (function 1) [2..5]).

View Patterns

One side condition I hadn't anticipated is that if x is a view pattern, the transformation can still reduce sharing. Consider:

function (trace "hit" -> xx) = \y -> xx + y

This is equivalent to:

function x = case trace "hit" x of xx -> \y -> xx + y

And moving y to the right of the = causes trace "hit" to be recomputed for every value of y.

I've now fixed HLint 1.9.22 to spot this case. Using Uniplate, I added the side condition:

null (universeBi pats :: [Exp_])

Specifically, there must be no expressions inside the pattern, which covers the PViewPat constructor, and any others that might harbour expressions (and thus computation) in future.

The problem with function definitions also applies equally to \p1 -> \p2 -> e, which cannot be safely rewritten as \p1 p2 -> e if p1 contains a view pattern.

The Problem Worsens (Pattern Synonyms)

Pattern synonyms make this problem worse, as they can embody arbitrary computation in a pattern, which is lexically indistinguishable from a normal constructor. As an example:

pattern Odd <- (odd -> True)
f Odd = 1
f _ = 2

However, putting complex computation behind a pattern is probably not a good idea, since it makes it harder for the programmer to understand the performance characteristics. You could also argue that using view patterns and lambda expressions to capture computation after partial application on definitions then lambda expressions is also confusing, so I've refactored Shake to avoid that.

Potential Fixes

I think it should be possible to fix the problem by optimising the desugaring of functions, ensuring patterns are matched left-to-right where allowable, and that each match happens before the lambda requesting the next argument. The question is whether such a change would improve performance generally. Let's take an example:

Which goes 3-4x faster when running map (test [1..]) [1..n] at -O2 (thanks to Jorge Acereda Maciá for the benchmarks). The trace is required to avoid GHC deoptimising the second variant to the first, as per GHC bug #11029.

There are two main downsides I can think of. Firstly, the desugaring becomes more complex. Secondly, these extra lambdas introduce overhead, as the STG machine GHC uses makes multiple argument lambdas cheaper. That overhead could be removed using call-site analysis and other optimisations, so those optimisations might need improving before this optimisation produces a general benefit.

Monday, October 26, 2015

Summary: When thinking about the filepath .., remember symlinks, or you will trip yourself up.

As the maintainer of the Haskell filepath package, one common path-related mistake I see is the assumption that filepaths have the invariant:

/bob/home/../cookies == /bob/cookies

I can see the conceptual appeal - go down one directory, go up one directory, end up where you started. Unfortunately, it's not true. Consider the case where home is a symlink to tony/home. Now, applying the symlink leaves us with the case:

/tony/home/../cookies == /bob/cookies

And, assuming /tony/home is itself not a symlink, that reduces to:

/tony/cookies == /bob/cookies

This is clearly incorrect (assuming no symlinks), so the original invariant was also incorrect, and cannot be relied upon in general. The subtle bit is that descending into a directory might move somewhere else, so it's not an operation that can be undone with ... Each step of the path is interpreted based on where it ends up, not based on the path it took to the current point.

While this property isn't true in general, there are many special cases where it is reasonable to assume. For example, the shake package contains a normaliseEx function that normalises under this assumption, but nothing in the standard filepath package assumes it.

This Shake talk took a distinct approach from my previous talks - partly because I now have a better understanding of how large Shake build systems are typically structured. The key theoretical innovation of Shake is monadic dependencies, and the key technical innovation is using Haskell as an embedded language. Putting these pieces together lets Shake users write a build system interpreter, rather than a monolithic build system.

In the talk I suggest that build systems can be split into metadata (changes frequently, can be tracked, custom format for each project relying on conventions) and an interpreter for that metadata (written using Shake). The result seems to work quite nicely, especially in larger projects where everyone is expected to update the metadata, but changes to the interpreter are expected to be rarer and more carefully thought out.

One of the goals of the talk was to convert some of the audience to using Shake. It seems to have worked, as @krisajenkinstweeted:

First fruit of #haskellx - switched my makefiles over to Shake. Parallelised builds & a readable syntax. Much win.

Tuesday, September 22, 2015

Summary: Using the technique from the previous post, here are three space leaks I found.

Every large Haskell program almost inevitably contains space leaks. This post examines three space leaks I found while experimenting with a space-leak detection algorithm. The first two space leaks have obvious causes, but I remain mystified by the third.

Hoogle leak 1

The motivation for looking at space leak detection tools was that Hoogle 5 kept suffering space leaks. Since Hoogle 5 is run on a virtual machine with only 1Gb of RAM, a space leak will often cause it to use the swap file for the heap, and destroy performance. I applied the detection techniques to the hoogle generate command (which generates the databases), which told me that writeDuplicates took over 100K of stack. The body of writeDuplicates is:

(Debugging tip: always use print for debugging and never for real code, that way getting rid of all debugging output is easy.) It failed after printing step 2, but before printing step 3. Pulling each subexpression out and repeating the evaluate/rnf pattern I reduced the expression to:

Map.fromListWith (\(x1,x2) (y1,y2) -> (min x1 y1, x2 ++ y2)) xs

The fromListWith function essentially performs a foldl over values with duplicate keys. I was using Data.Map.Strict, meaning it the fold was strict, like foldl'. However, the result is a pair, so forcing the accumulator only forces the pair itself, not the first component, which contains a space leak. I effectively build up min x1 (min x1 (min x1 ... in the heap, which would run faster and take less memory if reduced eagerly. I solved the problem with:

Map.fromListWith (\(x1,x2) (y1,y2) -> (, x2 ++ y2) $! min x1 y1) xs

After that the stack limit could be reduced a bit. Originally fixed in commit 102966ec, then refined in 940412cf.

I had already added ! in the definition of f when writing it, on the grounds it was likely a candidate for space leaks (an accumulating map), so was immediately suspicious that I hadn't got it right. However, adding bang patterns near real made no difference, so I tried systematically reducing the bug.

Since this code isn't in IO, the evaluate technique from the previous leak doesn't work. Fortunately, using seq works, but is a bit more fiddly. To check the argument expression (reverse . sortOn) wasn't leaking I made the change:

I was slightly worried that the GHC optimiser may break the delicate seq/trace due to imprecise exceptions, but at -O0 that didn't happen. Successive attempts at testing different subexpressions eventually lead to genericLength xs, which in this case returns a Word16. The definition of genericLength reads:

genericLength [] = 0
genericLength (_:l) = 1 + genericLength l

Alas, a very obvious space leak. In addition, the base library provides two rules:

If you use genericLength on Int or Integer then it is replaced with a strict version without a space leak - but on Word16 the space leak remains. To solve this space leak I replaced genericLength xs with fromIntegral (length xs) in commit 12c46e93, which worked. After that change, the Hoogle test suite can be run with 1Kb of stack - a test that has been added to the continuous integration.

Shake leak

After solving the space leak from the original post, I was then able to run the entire test suite with 1Kb stack on my Windows machine. I made that an RTS option to the Cabal test suite, and my Linux continuous integration started failing. Further experimentation on a Linux VM showed that:

The entire test failed at 50K, but succeeded at 100K.

The excessive stack usage could be replicated with only two of the tests - the tar test followed by the benchmark test. The tar test is incredibly simple and likely any of the tests before before benchmark would have triggered the issue.

The tests succeeded in 1K if running benchmark followed by tar.

The initial assumption was that some CAF was being partially evaluated or created by the first test, and then used by the second, but I have yet to find any evidence of that. Applying -xc suggested a handful of possible sites (as Shake catches and rethrows exceptions), but the one that eventually lead to a fix was extractFileTime, defined as:

There is a small (constant sized) space leak here - the result does not force extractTime, but returns a pair containing thunks. In fact, getFileStatus from the unix library allocates a ForeignPtr to store s, so by not forcing the pair we cause the ForeignPtr to live much longer than would be otherwise required. The fix from commit 2ee36a99 is simple:

Monday, September 21, 2015

Summary: Below is a technique for easily detecting space leaks. It's even found a space leak in the base library.

Every large Haskell program almost inevitably contains space leaks. Space leaks are often difficult to detect, but relatively easy to fix once detected (typically insert a !). Working with Tom Ellis, we found a fairly simple method to detect such leaks. These ideas have detected four space leaks so far, including one in the base library maximumBy function, which has now been fixed. For an introduction to space leaks, see this article.

Our approach is based around the observation that most space leaks result in an excess use of stack. If you look for the part of the program that results in the largest stack usage, that is the most likely space leak, and the one that should be investigated first.

Method

Given a program, and a representative run (e.g. the test suite, a suitable input file):

Run the program with a specific stack size, e.g. ./Main +RTS -K100K to run with a 100Kb stack.

Increase/decrease the stack size until you have determined the minimum stack for which the program succeeds, e.g. -K33K.

Reduce the stack by a small amount and rerun with -xc, e.g. ./Main +RTS -K32K -xc.

The -xc run will print out the stack trace on every exception, look for the one which says stack overflow (likely the last one) and look at the stack trace to determine roughly where the leak is.

Attempt to fix the space leak, confirm by rerunning with -K32K.

Repeat until the test works with a small stack, typically -K1K.

Add something to your test suite to ensure that if a space leak is ever introduced then it fails, e.g. ghc-options: -with-rtsopts=-K1K in Cabal.

I have followed these steps for Shake, Hoogle and HLint, all of which now contain -K1K in the test suite or test scripts.

Example: Testing on Shake

Applying these techniques to the Shake test suite, I used the run ./shake-test self test, which compiles Shake using Shake. Initially it failed at -K32K, and the stack trace produced by -xc was:

*** Exception (reporting due to +RTS -xc): (THUNK_STATIC), stack trace:
Development.Shake.Profile.generateSummary,
called from Development.Shake.Profile.writeProfile,
called from Development.Shake.Core.run.\.\.\,
called from Development.Shake.Core.run.\.\,
called from Development.Shake.Database.withDatabase.\,
called from Development.Shake.Storage.withStorage.continue.\,
called from Development.Shake.Storage.flushThread,
called from Development.Shake.Storage.withStorage.continue,
called from Development.Shake.Storage.withStorage.\,
called from General.FileLock.withLockFile.\,
called from General.FileLock.withLockFile,
called from Development.Shake.Storage.withStorage,
called from Development.Shake.Database.withDatabase,
called from Development.Shake.Core.run.\,
called from General.Cleanup.withCleanup,
called from Development.Shake.Core.lineBuffering,
called from Development.Shake.Core.run,
called from Development.Shake.Shake.shake,
called from Development.Shake.Args.shakeArgsWith,
called from Test.Type.shakeWithClean,
called from Test.Type.shaken.\,
called from Test.Type.noTest,
called from Test.Type.shaken,
called from Test.Self.main,
called from Test.main,
called from :Main.CAF:main
stack overflow

Looking at the generateSummary function, it takes complete profile information and reduces it to a handful of summary lines. As a typical example, one line of the output is generated with the code:

Most of the code is map, maximum and sum in various combinations. By commenting out pieces I was able to still produce the space leak using maximumBy alone. By reimplementing maximumBy in terms of foldl', the leak went away. Small benchmarks showed this space leak was a regression in GHC 7.10, which I reported as GHC ticket 10830. To fix Shake, I added the helper:

maximumBy' cmp = foldl1' $ \x y -> if cmp x y == GT then x else y

After switching to maximumBy' I was able to reduce the stack to -K1K. While this space leak was not problematic in practice (it's rarely used code which isn't performance sensitive), it's still nice to fix. I modified the Shake test suite to pass -K1K so if I ever regress I'll get an immediate notification. (Shake actually had one additional Linux-only space leak, also now fixed, but that's a tale for a future post.)

Caveats

This method has found several space leaks - two in Shake and two in Hoogle (I also ran it on HLint, which had no failures). However, there are a number of caveats:

GHC's strictness analyser often removes space leaks by making accumulators strict, so -O2 tends to remove some space leaks, and profiling may reinsert them by blocking optimisations. I currently check my code using -O0, but using libraries I depend on with whatever optimisation they install with by default. By ensuring optimisations do not remove space leaks, it is less likely that minor code tweaks will introduce space leaks due to missed optimisations.

The stack trace produced by -xc omits duplicate adjacent elements, which is often the interesting information when debugging a stack overflow. In practice, it's a little inconvenient, but not terrible. Having GHC provide repetition counts (e.g. Main.recurse *12) would be useful.

The stack traces don't contain entries for things in imported libraries, which is unfortunate, and often means the location of the error is a 20 line function instead of the exact subexpression. The lack of such information makes fixing leaks take a little longer.

The -xc flag prints stack information on all exceptions, which are often numerous. Lots of IO operations make use of exceptions even when they succeed. As a result, it's often easier to run without -xc to figure out the stack limit, then turn -xc on. Usually the stack overflow exception is near the end.

There are sometimes a handful of exceptions after the stack overflow, as various layers of the program catch and rethrow the exception. For programs that catch exceptions and rethrow them somewhat later (e.g. Shake), that can sometimes result in a large number of exceptions to wade through. It would be useful if GHC had an option to filter -xc to only certain types of exception.

Some functions in the base libraries are both reasonable to use and have linear stack usage - notably mapM. For the case of mapM in particular you may wish to switch to a constant stack version while investigating space leaks.

This technique catches a large class of space leaks, but certainly not all. As an example, given a Map Key LargeValue, if you remove a single Key but don't force the Map, it will leak a LargeValue. When the Map is forced it will take only a single stack entry, and thus not be detected as a leak. However, this technique would have detected a previous Shake space leak, as it involved repeated calls to delete.

Feedback

If anyone manages to find space leaks using this technique we would be keen to know. I have previously told people that there are many advantages to lazy programming languages, but that space leaks are the big disadvantage. With the technique above, I feel confident that I can now reduce the number of space leaks in my code.

Monday, September 14, 2015

Summary: We have a version of mapM for IO that takes O(1) stack and is faster than the standard Haskell/GHC one for long lists.

The standard Haskell/GHC base library sequence function in IO takes O(n) stack space. However, working with Tom Ellis, we came up with a version that takes O(1) stack space. Our version is slower at reasonable sizes, but faster at huge sizes (100,000+ elements). The standard definition of sequence (when specialised for both IO and []) is equivalent to:

First, we traverse the list using apply. Note that we pass the RealWorld token down the list (ensuring the items happen in the right order), but we do not pass it back.

To ensure all the IO actions performed during apply happen before we return any of the list, we then demand the list, ensuring the [] element has been forced.

Both these traversals use O(1) stack. The first runs the actions and constructs the list. The second ensures evaluation has happened before we continue. The trick in this algorithm is:

ys <- IO $ \r -> (# r, apply r xs #)

Here we cheat by duplicating r, which is potentially unsafe. This line does not evaluate apply, merely returns a thunk which when evaluated will force apply, and cause the IO to happen during evaluation, at somewhat unspecified times. To regain well-defined evaluation order we force the result of apply on the next line using demand.

Benchmarks

We benchmarked using GHC 7.10.2, comparing the default sequence (which has identical performance to the specialised monomorphic variant at the top of this post), and our sequenceIO. We benchmarked at different lengths of lists. Our sequenceIO is twice as slow at short lists, draws even around 10,000-100,000 elements, and goes 40% faster by 1,000,000 elements.

Our algorithm saves allocating stack, at the cost of iterating through the list twice. It is likely that by tuning the stack allocation flags the standard algorithm would be faster everywhere.

Using sequence at large sizes

Despite improved performance at large size, we would not encourage people to use sequence or mapM at such large sizes - these functions still require O(n) memory. Instead:

If you are iterating over the elements, consider fusing this stage with subsequence stages, so that each element is processed one-by-one. The conduit and pipes libraries both help with that approach.

If you are reducing the elements, e.g. performing a sum, consider fusing the mapM and sum using something like foldM.

For common operations, such a concat after a mapM, an obvious definition of concatMapM is:

Saturday, August 15, 2015

Recently, someone suggested to me that, thanks to test suites, things like changing compiler version or versions of library dependencies was "no big deal". If dependency changes still result in a passing test suite, then they have caused no harm. I disagree, and fortunately for me, Dijkstra explains it far more eloquently than I ever could:

Testing shows the presence, not the absence of bugs. Dijkstra (1969)

While a test suite can give you confidence in changes you make, it does not provide guarantees. Below are just a few reasons why.

The test suite does not cover all the code

For any reasonably sized code base (> 100 lines), covering all the lines of code is difficult. There are a number of factors that mean that mean a test suite is unlikely to provide 100% coverage:

Producing tests is a resource intensive activity, and most projects do not have the necessary manpower to test everything.

Sometimes there is no good way to test simple sugar functions - the definition is a specification of what the function should do.

Testing corner cases is difficult. As the corners get more obscure, the difficulty increases.

Testing error conditions is even harder. Some errors conditions have code to deal with them, but are believed to be unreachable.

The test suite does not cover all the ways through the code

Assuming the test suite really does cover every line of the code, making it cover every path through the code is almost certainly computationally infeasible. Consider a program taking a handful of boolean options. While it might be feasible to test each individual option in the true and false states, testing every state in conjunction with every other state requires an exponential amount of time. For programs with loops, testing every number of loop iterations is likely to be highly time consuming.

There is plenty of code you can't see

Even if you cover every line of source code, the compiler may still thwart your valiant efforts. Optimising compilers like to inline code (make copies of it) and specialise code (freeze in some details that would otherwise be dynamic). After such transformations, the compiler might spot undefined behaviour (something almost all C/C++ programs contain) and make modifications that break your code. You might have tested all the source code, but you have not tested all the code generated by the compiler. If you are writing in C/C++, and undefined behaviour and optimisation doesn't scare you a lot, you should read this LLVM article series.

Functions have huge inputs

Testing functions typically involves supplying their input and inspecting their output. Usually the input space is too large to enumerate - which is likely to be the case even if your function takes in an integer. As soon as your function takes a string or array, enumeration is definitely infeasible. Often you can pick cases at which the code is likely to go wrong (0, 1, -1, maxBound) - but maybe it only fails for Carmichael numbers. Random testing can help, and is always advisable, but the effort to deploy random testing is typically quite a bit higher than input/output samples, and it is no panacea.

Functions are not functions

Testing functions usually assumes they really are functions, which depend only on their input. In pure functional languages that is mostly true, but in C/C++ it is less common. For example, functions that have an internal cache might behave differently under parallelism, especially if their cache is not managed properly. Functions may rely on global variables, so they might perform correctly until some seemingly unrelated operation is performed. Even Haskell programs are likely to depend on global state such as the FPU flags, which may be changed unexpectedly by other code.

In my experience, the non-functional nature of functions is one of the biggest practical difficulties, and is also a common place where dependency changes cause frustration. Buggy code can work successfully for years until an improved memory allocator allows a race condition to be hit.

Performance testing is hard

Even if your code gives the correct results, it may take too long or use too much memory. Alas, testing for resource usage is difficult. Resource numbers, especially runtime, are often highly variable between runs - more so if tests are run on shared hardware or make use of parallelism. Every dependency change is likely to have some impact on resource usage, perhaps as dependencies themselves chose to trade time for memory. Spotting erroneous variations often requires a human to make a judgement call.

What is the solution?

Tests help, and are valuable, and you should aim to test as much as you can. But for any reasonably sized program, your tests will never be complete, and the program will always contain unknown bugs. Most likely someone using your code will stumble across one of these bugs. In this case, it's often possible (and indeed, highly desirable) to add a new test case specifically designed to spot this error. Bugs have a habit of recurring, and a bug that happens twice is just embarrassing.

Thinking back to dependency versions, there is often strength in numbers. If all your users are on the same version of all the dependencies, then any bug that is very common is likely to be found by at least one user and fixed for all.

Thinking more generally, it is clear that many of these issues are somewhat ameliorated by pure functional programming. I consider testability and robustness to be one of the great strengths of Haskell.

Shake is a Haskell build system, an alternative to Make, but with more powerful and accurate dependencies. I'll cover how to build things with Shake, and why I laugh at non-Monadic build systems (which covers most things that aren't Shake). Shake is an industrial quality library, with a website at http://shakebuild.com.

Bake is a Haskell continuous integration system, an alternative to Travis/Jenkins, but designed for large semi-trusted teams. Bake guarantees that all code arriving in your master branch passes all tests on all platforms, while using as few resources as possible, allowing you to have hours of tests, 100's of commits a day and one a few lonely test servers. Bake is held together with duct tape.

Thursday, July 30, 2015

Summary: I wrote a Conduit combinator which makes the upstream and downstream run in parallel. It makes Hoogle database generation faster.

The Hoogle database generation parses lines one-by-one using haskell-src-exts, and then encodes each line and writes it to a file. Using Conduit, that ends up being roughly:

parse =$= write

Conduit ensures that parsing and writing are interleaved, so each line is parsed and written before the next is parsed - ensuring minimal space usage. Recently the FP Complete guys profiled Hoogle database generation and found each of these pieces takes roughly the same amount of time, and together are the bottleneck. Therefore, it seems likely that if we could parse the next line while writing the previous line we should be able to speed up database generation. I think of this as analogous to CPU pipelining, where the next instruction is decoded while the current one is executed.

I came up with the combinator:

pipelineC :: Int -> Consumer o IO r -> Consumer o IO r

Allowing us to write:

parse =$= pipelineC 10 write

Given a buffer size 10 (the maximum number of elements in memory simultaneously), and a Consumer (write), produce a new Consumer which is roughly the same but runs in parallel to its upstream (parse).

The Result

When using 2 threads the Hoogle 5 database creation drops from 45s to 30s. The CPU usage during the pipelined stage hovers between 180% and 200%, suggesting the stages are quite well balanced (as the profile suggested). The parsing stage is currently a little slower than the writing, so a buffer of 10 is plenty - increasing the buffer makes no meaningful difference. The reason the drop in total time is only by 33% is that the non-pipelined steps (parsing Cabal files, writing summary information) take about 12s.

Note that Hoogle 5 remains unreleased, but can be tested from the git repo and will hopefully be ready soon.

The Code

The idea is to run the Consumer on a separate thread, and on the main thread keep pulling elements (using await) and pass them to the other thread, without blocking the upstream yield. The only tricky bit is what to do with exceptions. If the consumer thread throws an exception we have to get that back to the main thread so it can be dealt with normally. Fortunately async exceptions fit the bill perfectly. The full code is:

We are using a channel chan to move elements from producer to consumer, a quantity semaphore sem to limit the number of items in the channel, and a barrier bar to store the return result (see about the barrier type). On the consumer thread we read from the channel and yield to the consumer. On the main thread we awaitForever and write to the channel. At the end we move the result back from the consumer thread to the main thread. The full implementation is in the repo.

Enhancements

I have specialised pipelineC for Consumers that run in the IO monad. Since the Consumer can do IO, and the order of that IO has changed, it isn't exactly equivalent - but relying on such IO timing seems to break the spirit of Conduit anyway. I suspect pipelineC is applicable in some other moands, but am not sure which (ReaderT and ResourceT seem plausible, StateT seems less likely).

Acknowledgements: Thanks to Tom Ellis for helping figure out what type pipelineC should have.

Sunday, July 19, 2015

Summary: I'm growing increasingly fond of the Conduit library. Here I give my intuitions and some hints I'd have found useful.

Recently I've been working on converting the Hoogle database generation to use the Conduit abstraction, in an effort to reduce the memory and improve the speed. It worked - database generation has gone from 2Gb of RAM to 320Mb, and time has dropped from several minutes (or > 15 mins on memory constrained machines) to 45s. These changes are all in the context of Hoogle 5, which should hopefully be out in a month or so.

The bit that I've converted to Conduit is something that takes in a tarball of one text files per Hackage file, namely the Haddock output with one definition per line (this 22Mb file). It processes each definition, saves it to a single binary file (with compression and some processing), and returns some compact information about the definition for later processing. I don't expect the process to run in constant space as it is accumulating some return information, but it is important that most of the memory used by one definition is released before the next definition. I originally tried lazy IO, and while it somewhat worked, it was hard to abstract properly and very prone to space leaks. Converting to Conduit was relatively easy and is simpler and more robust.

The Conduit model

My mental model for a conduit Conduit a m b is roughly a function [a] -> m [b] - a values go in and b values come out (but interleaved with the monadic actions). More concretely you ask for an a with await and give back a b with yield, doing stuff in the middle in the m Monad.

A piece of conduit is always either running (doing it's actual work), waiting after a yield for the downstream person to ask for more results (with await), or waiting after an await for the upstream person to give the value (with yield). You can think of a conduit as making explicit the demand-order inherent in lazy evaluation.

Things to know about Conduit

I think it's fair to say Conduit shows its history - this is good for people who have been using it for a while (your code doesn't keep breaking), but bad for people learning it (there are lots of things you don't care about). Here are some notes I made:

The Data.Conduit module in the conduit library is not the right module to use - it seems generally accepted to use the Conduit module from the conduit-combinators package. However, I decided to build my own conduit-combinators style replacement in the Hoogle tree, see General.Conduit - the standard Conduit module has a lot of dependencies, and a lot of generalisations.

Don't use Source or Sink - use Producer and Consumer - the former are just a convenient way to get confusing error messages.

Don't use =$ or $=, always use =$= between conduits. The =$= operator is essentially flip (.).

Given a Conduit you can run it with runConduit. Alternatively, given a =$= b =$= c you can replace any of the =$= with $$ to run the Conduit as well. I find that a bit ugly, and have stuck to runConduit.

Conduit and ConduitM have their type arguments in different orders, which is just confusing. However, generally I use either Conduit (a connector) or Producer (something with a result). You rarely need something with a result and a return value.

You call await to see if a value is available to process. The most common bug I've had with conduits is forgetting to make the function processing items recursive - usually you want awaitForever, not just await.

The ByteString lines Conduit function was accidentally O(n^2) - I spotted and fixed that. Using difference lists does not necessarily make your code O(n)!

Useful functions

When using Conduit I found a number of functions seemingly missing, so defined them myself.

First up is countC which counts the number of items that are consumed. Just a simple definition on top of sumC.

I regularly find zipFrom i = zip [i..] very useful in strict languages, and since Conduit can be viewed as a strict version of lazy lists (through very foggy glasses) it's no surprise a Conduit version is also useful.

Finally, it's useful to zip two conduits. I was surprised how fiddly that was with the standard operators (you have to use newtype wrappers and an Applicative instance), but a simple |$| definition hides that complexity away.

Why not Pipes?

I am curious if all Pipes users get asked "Why not use Conduit?", or if this FAQ is asymmetrical?

I realise pipes are billed as the more "principled" choice for this type of programming, but I've yet to see anywhere Conduit seems fundamentally unprincipled. I use WAI/Warp and http-conduit, so learning Conduit gives me some help there.

Saturday, June 20, 2015

Summary: I wrote an extra library, which contains lots of my commonly used functions.

When starting to write Bake, my first step was to copy a lot of utility functions from Shake - things like fst3 (select the first element of a triple), concatMapM (monadic version of concatMap), withCurrentDirectory (setCurrentDirectory under bracket). None of these functions are specific to either Bake or Shake, and indeed, many of the ones in Shake had originally came from HLint. Copy and pasting code is horrible, so I extracted the best functions into a common library which I named extra. Unlike the copy/paste versions in each package, I then wrote plenty of tests, made sure the functions worked in the presence of exceptions, did basic performance optimisation and filled in some obvious gaps in functionality.

I'm now using the extra library in all the packages above, plus things like ghcid and Hoogle. Interestingly, I'm finding my one-off scripts are making particularly heavy use of the extra functions. I wrote this package to reduce my maintenance burden, but welcome other users of extra.

My goal for the extra library is simple additions to the standard Haskell libraries, just filling out missing functionality, not inventing new concepts. In some cases, later versions of the standard libraries provide the functions I want, so there extra makes them available all the way back to GHC 7.2, reducing the amount of CPP in my projects. A few examples:

System.Environment.Extra.lookupEnv is a function available in GHC 7.6 and above. On GHC 7.6 and above this package reexports the version from System.Environment while on GHC 7.4 and below it defines an equivalent version.

The module Extra documents all functions provided by the library, so is a good place to go to see what is on offer. Modules such as Data.List.Extra provide extra functions over Data.List and also reexport Data.List. Users are recommended to replace Data.List imports with Data.List.Extra if they need the extra functionality.

Which functions?

When selecting functions I have been guided by a few principles.

I have been using most of these functions in my packages - they have proved useful enough to be worth copying/pasting into each project.

The functions follow the spirit of the original Prelude/base libraries. I am happy to provide partial functions (e.g. fromRight), and functions which are specialisations of more generic functions (whenJust).

Most of the functions have trivial implementations. If a beginner couldn't write the function, it probably doesn't belong here.

I have defined only a few new data types or type aliases. It's a package for defining new utilities on existing types, not new types or concepts.

Testing

One feature I particularly like about this library is that the documentation comments are tests. A few examples:

These equalities are more semantic equality than Haskell's value equality. Things with lambda's are run through QuickCheck. Things which print to stdout are captured, so the print 1 test really does a print, which is scraped and compared to the LHS. I run these tests by passing them through a preprocessor, which spits out this code, which I then run with some specialised testing functions.

Friday, June 19, 2015

When I started writing the continuous integration server Bake, I thought I had a good idea of what would go fast. It turned out I was wrong. The problem Bake is trying to solve is:

You have a set of tests that must pass, which take several hours to run.

You have a current state of the world, which passes all the tests.

You have a stream of hundreds of incoming patches per day that can be applied to the current state.

You want to advance the state by applying patches, ensuring the current state always passes all tests.

You want to reject bad patches and apply good patches as fast as possible.

I assume that tests can express dependencies, such as you must compile before running any tests. The test that performs compilation is special because it can go much faster if only a few things have changed, benefiting from incremental compilation.

Both my wrong solution, and my subsequent better solution, are based on the idea of a candidate - a sequence of patches applied to the current state that is the focus of testing. The solutions differ in when patches are added/removed from the candidate and how the candidates are compiled.

A Wrong Solution

My initial solution compiled and ran each candidate in a separate directory. When the directory was first created, it copied a nearby candidate to try and benefit from incremental compilation.

Each incoming patch was immediately included in the candidate, compiled, and run on all tests. I would always run the test that had not passed for the longest time, to increase confidence in more patches. Concretely, if I have run test T1 on patch P1, and P2 comes in, I start testing T2 on the combination of P1+P2. After that passes I can be somewhat confident that P1 passes both T1 and T2, despite not having run T2 on just P1.

If a test fails, I bisect to find the patch that broke it, reject the patch, and immediately throw it out of the candidate.

The Problems

There are three main problems with this approach:

Every compilation starts with a copy of a nearby candidate. Copying a directory of lots of small files (the typical output of a compiler) is fantastically expensive on Windows.

When bisecting, I have to compile at lots of prefixes of the candidate, the cost of which varies significantly based on the directory it starts from.

I'm regularly throwing patches out of the candidate, which requires a significant amount of compilation, as it has to recompile all patches that were after the rejected patch.

I'm regularly adding patches to the candidate, each of which requires an incremental compilation, but tends to be dominated by the directory copy.

This solution spent all the time copying and compiling, and relatively little time testing.

A Better Solution

To benefit from incremental compilation and avoid copying costs, I always compile in the same directory. Given a candidate, I compile each patch in the series one after another, and after each compilation I zip up the interesting files (the test executables and test data). To bisect, I unzip the relevant files to a different directory. On Windows, unzipping is much more expensive than zipping, but that only needs to be done when bisecting is required. I also only need to zip the stuff required for testing, not for building, which is often much smaller.

When testing a candidate, I run all tests without extending the candidate. If all the tests pass I update the state and create a new candidate containing all the new patches.

If any test fails I bisect to figure out who should be rejected, but don't reject until I've completed all tests. After identifying all failing tests, and the patch that caused each of them to fail, I throw those patches out of the candidate. I then rebuild with the revised candidate and run only those tests that failed last time around, trying to seek out tests where two patches in a candidate both broke them. I keep repeating with only the tests that failed last time, until no tests fail. Once there are no failing tests, I extend the candidate with all new patches, but do not update the state.

As a small tweak, if there are two patches in the queue from the same person, where one is a superset of the other, I ignore the subset. The idea is that if the base commit has an error I don't want to track it down twice, once to the first failing commit and then again to the second one.

Using this approach in Bake

First, the standard disclaimer: Bake may not meet your needs - it is a lot less developed than other continuous integration systems. If you do decide to use Bake, you should run from the git repo, as the Hackage release is far behind. That said, Bake is now in a reasonable shape, and might be suitable for early adopters.

In Bake this approach is implemented in the StepGit module, with the ovenStepGit function. Since Bake doesn't have the notion of building patches in series it pretends (to the rest of Bake) that it's building the final result, but secretly caches the intermediate steps. If there is a failure when compiling, it caches that failure, and reports it to each step in the bisection, so Bake tracks down the correct root cause.

I am currently recommending ovenStepGit as the "best" approach for combining git and an incremental build system with Bake. While any incremental build system works, I can't help but plug Shake, because its the best build system I've ever written.

Sunday, June 07, 2015

Summary: I've released a new version of ghcid, which is faster and works better for Vim users.

I've just released version 0.4 of ghcid. For those who haven't tried it, it's the simplest/dumbest "IDE" that could possibly exist. It presents a small fixed-height console window that shows the active warnings/errors from ghci, which is updated on every save. While that might not seem like much, there are a number of improvements over a standard workflow:

You don't have to switch to a different window and hit :reload, just saving in your editor is enough.

It includes warnings from modules that successfully compiled previously, as opposed to ghci which omits warnings from modules that didn't reload.

It reformats the messages to the height of your console, so you can see several messages at once.

I find it gives me a significant productivity boost, and since there is no editor integration, it adds to my existing tools, rather than trying to replace them (unlike a real IDE).

Version 0.4 offers a few additional improvements:

We use fsnotify to watch for changes, rather than polling as in older versions. The result is a significant decrease in battery usage, combined with faster responses to changes.

If you change the .cabal or .ghci file then ghcid will restart the ghci session to pick up the changes.

If you are a Vim user, then the sequence of temporary files Vim uses on save could upset ghcid. I believe these issues are now all eliminated.

The number of errors/warnings is displayed in the titlebar of the console.

There is a feature to run a test after each successful save (using --test). I am expecting this to be a niche feature, but feel free to prove me wrong.

Thursday, May 21, 2015

Summary: The development version of ghcid seemed to have some problems with terminating when Control-C was hit, so I investigated and learnt some things.

Given a long-running/interactive console program (e.g. ghcid), when the user hits Control-C/Ctrl-C the program should abort. In this post I'll describe how that works in Haskell, how it can fail, and what asynchronous exceptions have to do with it.

What happens when the user hits Ctrl-C?

When the user hits Ctrl-C, GHC raises an async exception of type UserInterrupt on the main thread. This happens because GHC installs an interrupt handler which raises that exception, sending it to the main thread with throwTo. If you install your own interrupt handler you won't see this behaviour and will have to handle Ctrl-C yourself.

There are reports that if the user hits Ctrl-C twice the runtime will abort the program. In my tests, that seems to be a feature of the shell rather than GHC itself - in the Windows Command Prompt no amount of Ctrl-C stops an errant program, in Cygwin a single Ctrl-C works.

What happens when the main thread receives UserInterrupt?

There are a few options:

If you are not masked and there is no exception handler, the thread will abort, which causes the whole program to finish. This behaviour is the desirable outcome if the user hits Ctrl-C.

If you are running inside an exception handler (e.g. catch or try) which is capable of catching UserInterrupt then the UserInterrupt exception will be returned. The program can then take whatever action it wishes, including rethrowing UserInterrupt or exiting the program.

If you are running with exceptions masked, then the exception will be delayed until you stop being masked. The most common way of running while masked is if the code is the second argument to finally or one of the first two arguments to bracket. Since Ctrl-C will be delayed while the program is masked, you should only do quick things while masked.

How might I lose UserInterrupt?

The easiest way to "lose" a UserInterrupt is to catch it and not rethrow it. Taking a real example from ghcid, I sometimes want to check if two paths refer to the same file, and to make that check more robust I call canonicalizePath first. This function raises errors in some circumstances (e.g. the directory containing the file does not exist), but is inconsistent about error conditions between OS's, and doesn't document its exceptions, so the safest thing is to write:

If there is any exception, just return the original path. Unfortunately, the catch will also catch and discard UserInterrupt. If the user hits Ctrl-C while canonicalizePath is running the program won't abort. The problem is that UserInterrupt is not thrown in response to the code inside the catch, so ignoring UserInterrupt is the wrong thing to do.

What is an async exception?

In Haskell there are two distinct ways to throw exceptions, synchronously and asynchronously.

Synchronous exceptions are raised on the calling thread, using functions such as throw and error. The point at which a synchronous exception is raised is explicit and can be relied upon.

Asynchronous exceptions are raised by a different thread, using throwTo and a different thread id. The exact point at which the exception occurs can vary.

How is the type AsyncException related?

In Haskell, there is a type called AsyncException, containing four exceptions - each special in their own way:

StackOverflow - the current thread has exceeded its stack limit.

HeapOverflow - never actually raised.

ThreadKilled - raised by calling killThread on this thread. Used when a programmer wants to kill a thread.

UserInterrupt - the one we've been talking about so far, raised on the main thread by the user hitting Ctrl-C.

While these have a type AsyncException, that's only a hint as to their intended purpose. You can throw any exception either synchronously or asynchronously. In our particular case of caonicalizePathSafe, if canonicalizePath causes a StackOverflow, we probably are happy to take the fallback case, but likely the stack was already close to the limit and will occur again soon. If the programmer calls killThread that thread should terminate, but in ghcid we know this thread won't be killed.

How can I catch avoid catching async exceptions?

There are several ways to avoid catching async exceptions. Firstly, since we expect canonicalizePath to complete quickly, we can just mask all async exceptions:

We use catchJust to only catch exceptions which aren't of type AsyncException, so UserInterrupt will not be caught. Of course, this actually avoids catching exceptions of type AsyncException, which is only related to async exceptions by a partial convention not enforced by the type system.

Unfortunately, I don't know what the relevant exceptions are - on Windows canonicalizePath never seems to throw an exception. However, IOException seems like a reasonable guess.

How to robustly deal with UserInterrupt?

I've showed how to make canonicalizePathSafe not interfere with UserInterrupt, but now I need to audit every piece of code (including library functions I use) that runs on the main thread to ensure it doesn't catch UserInterrupt. That is fragile. A simpler alternative is to push all computation off the main thread:

We are using the Barrier type from my previous blog post, which is available from the extra package. We create a Barrier, run the main action on a forked thread, then marshal completion/exceptions back to the main thread. Since the main thread has no catch operations and only a few (audited) functions on it, we can be sure that Ctrl-C will quickly abort the program.

Using version 1.1.1 of the extra package we can simplify the code to ctrlC = join . onceFork.

What about cleanup?

Now we've pushed most actions off the main thread, any finally sections are on other threads, and will be skipped if the user hits Ctrl-C. Typically this isn't a problem, as program shutdown automatically cleans all non-persistent resources. As an example, ghcid spawns a copy of ghci, but on shutdown the pipes are closed and the ghci process exits on its own. If we do want robust cleanup of resources such as temporary files we would need to run the cleanup from the main thread, likely using finally.

Should async exceptions be treated differently?

At the moment, Haskell defines many exceptions, any of which can be thrown either synchronously or asynchronously, but then hints that some are probably async exceptions. That's not a very Haskell-like thing to do. Perhaps there should be a catch which ignores exceptions thrown asynchronously? Perhaps the sync and async exceptions should be of different types? It seems unfortunate that functions have to care about async exceptions as much as they do.

Combining mask and StackOverflow

As a curiosity, I tried to combine a function that stack overflows (using -O0) and mask. Specifically:

main = mask_ $ print $ foldl (+) 0 [1..1000000]

I then ran that with +RTS -K1k. That prints out the value computed by the foldl three times (seemingly just a buffering issue), then fails with a StackOverflow exception. If I remove the mask, it just fails with StackOverflow. It seems that by disabling StackOverflow I'm allowed to increase my stack size arbitrarily. Changing print to appendFile causes the file to be created but not written to, so it seems there are oddities about combining these features.

Disclaimer

I'm certainly not an expert on async exceptions, so corrections welcome. All the above assumes compiling with -threaded, but most applies without -threaded.

Wednesday, May 06, 2015

Summary: The library js-jquery makes it easy to get at the jQuery Javascript code from Haskell. I've just released a new version.

I've just released the Haskell library js-jquery 1.11.3, following the announcement of jQuery 1.11.3. This package bundles the minified jQuery code into a Haskell package, so it can be depended upon by Cabal packages. The version number matches the upstream jQuery version. It's easy to grab the jQuery code from Haskell using this library, as an example:

Make it easier for jQuery users to use and upgrade jQuery in Haskell packages. You can upgrade jQuery without huge diffs and use it without messing around with extra-source-files.

Make it easier for upstream packagers like Debian. The addition of a jQuery file into a Haskell package means you are mixing licenses, authors, and distributions like Debian also require the source (unminified) version of jQuery to be distributed alongside. By having one package provide jQuery they only have to do that work once, and the package has been designed to meet their needs.

It's pretty easy to convert something that has bundled jQuery to use the library, as some examples:

The library only depends on the base library so it shouldn't cause any version hassles, although (as per all Cabal packages) you can't mix and match libraries with incompatible js-jquery version constraints in one project.

As a companion, there's also js-flot, which follows the same ideas for the Flot library.

Saturday, April 25, 2015

Summary: Sometimes source files get deleted, and build products become stale. Using Shake, you can automatically delete them.

Imagine you have a build system that compiles Markdown files into HTML files for your blog. Sometimes you rename a Markdown file, which means the corresponding HTML will change name too. Typically, this will result in a stale HTML file being left, one that was previously produced by the build system, but will never be updated again. You can remove that file by cleaning all outputs and running the build again, but with the Shake build system you can do better. You can ask for a list of all live files, and delete the build products not on that list.

A basic Markdown to HTML converter

Let's start with a simple website generator. For each Markdown file, with the extension .md, we generate an HTML file. We can write that as:

Using a new feature in Shake 0.15, we can name save this script as Shakefile.hs and then:

shake will build all the HTML files.

shake -j0 will build all the files, using one thread for each processor on our system.

shake output/foo.html will build just that one HTML file.

shake clean will delete all the HTML files.

Removing stale files

Now let's imagine we've added a blog post using-pipes.md. Before publishing we decide to rename our post to using-conduit.md. If we've already run shake then there will be a stale file output/using-pipes.html. Since there is no source .md file, Shake will not attempt to rebuild the file, and it won't be automatically deleted. We can do shake clean to get rid of it, but that will also wipe all the other HTML files.

We can run shake --live=live.txt to produce a file live.txt listing all the live files - those that Shake knows about, and has built. If we run that after deleting using-pipes.md it will tell us that using-conduit.md and output/using-conduit.md are both "live". If we delete all files in output that are not mentioned as being live, that will clean away all our stale files.

Now when running shake --prune it will build all files, then delete all stale files, such as output/using-pipes.html. We are using the shakeArgsPrune function (just sugar over --live) which lets us pass a pruner function. This function gets called after the build completes with a list of all the live files. We use listFilesRecursive from the extra package to get a list of all files in output, then do list difference (\\) to delete all the files which are present but not live. To deal with the / vs \ path separator issue on Windows, we apply toStandard to all files to ensure they match.

A few words of warning:

If you run shake output/foo.html --prune then it will only pass output/foo.html and foo.md as live files, since they are the only ones that are live as you have asked for a subset of the files to be built. Generally, you want to enable all sensible targets (typically no file arguments) when passing --prune.

Sometimes a rule will generate something you care about, and a few files you don't really bother tracking. As an example, building a GHC DLL on Windows generates a .dll and a .dll.a file. While the .dll.a file may not be known to Shake, it probably doesn't want to get pruned. The pruning function may need a few special cases, like not deleting the .dll.a file if the .dll is live.