Monday, October 31, 2011

This article is about a GHC bug I encountered recently, but it's really an excuse to talk about some GHC internals at an intro level. (In turn, an excuse for me to learn about those internals.)

I'll assume you're familiar with the basics of Haskell and lazy evaluation.

The bug

I spoke before of using global locks in Haskell to protect a thread-unsafe C library. Unfortunately a GHC bug prevents this from working. Using unsafePerformIO at the top level of a file can result in IO that happens more than once.

This bug was reported two weeks ago and is already fixed in GHC HEAD. I tested with GHC 7.3.20111026, aka g6f5b798, and the problem seemed to go away.

Unfortunately it will be some time before GHC 7.4 is widely deployed, so I'm thinking about workarounds for my original global locking problem. I'll probably store the lock in a C global variable via StablePtr, or failing that, implement all locking in C. But I'd appreciate any other suggestions.

The remainder of this article is an attempt to explain this GHC bug, and the fix committed by Simon Marlow. It's long because I try not to assume you know anything about how GHC works. I don't know very much, myself.

THUNK objects represent computations which have not yet happened. Suppose we write:

let x = 2 + 2 in f x x

This code will construct a THUNK object for x and pass it to the code for f. Some time later, f may force evaluation of its argument, and the thunk will, in turn, invoke (+). When the thunk has finished evaluating, it is overwritten with the evaluation result. (Here, this might be an I# CONSTR holding the number 4.) If f then forces its second argument, which is also x, the work done by (+) is not repeated. This is the essence of lazy evaluation.

When a thunk is forced, it's first overwritten with a BLACKHOLE object. This BLACKHOLE is eventually replaced with the evaluation result. Therefore a BLACKHOLE represents a thunk which is currently being evaluated.

Identifying this case helps the garbage collector, and it also gives GHC its seemingly magical ability to detect some infinite loops. Forcing a BLACKHOLE indicates a computation which cannot proceed until the same computation has finished. The GHC runtime will terminate the program with a <<loop>> exception.

We can't truly update thunks in place, because the evaluation result might be larger than the space originally allocated for the thunk. So we write an indirection pointing to the evaluation result. These IND objects will later be removed by the garbage collector.

Static objects

Dynamically-allocated objects make sense for values which are created as your program runs. But the top-level declarations in a Haskell module don't need to be dynamically allocated; they already exist when your program starts up. GHC allocates these static objects in your executable's data section, the same place where C global variables live.
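For concreteness, here is the kind of module the discussion has in mind (a reconstruction; the original post's example isn't shown in this excerpt, and the definitions of f and x are mine):

```haskell
-- Top-level declarations become static objects in the executable's
-- data section. 'main' itself is a THUNK_STATIC, i.e. a CAF.
x :: Int
x = 2 + 2

f :: Int -> Int -> Int
f a b = a * b

main :: IO ()
main = print (f x 3)
```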

main is a THUNK_STATIC object. It represents the unevaluated expression formed by applying the function print to the argument (f x 3). A static thunk is also known as a constant applicative form, or a CAF for short. Like any other thunk, a CAF may or may not get evaluated. If evaluated, it will be replaced with a black hole and eventually the evaluation result. In this example, main will be evaluated by the runtime system, in deciding what IO to perform.

Black holes and revelations

That's all fine for a single-threaded Haskell runtime, but GHC supports running many Haskell threads across multiple OS threads. This introduces some additional complications. For example, one thread might force a thunk which is currently being evaluated by another thread. The thread will find a BLACKHOLE, but terminating the program would be incorrect. Instead the BLACKHOLE puts the current Haskell thread to sleep, and wakes it up when the evaluation result is ready.

If two threads force the same thunk at the same time, they will both perform the deferred computation. We could avoid this wasted effort by writing and checking for black holes using expensive atomic memory operations. But this is a poor tradeoff; we slow down every evaluation in order to prevent a rare race condition.

As a compiler for a language with pure evaluation, GHC has the luxury of tolerating some duplicated computation. Evaluating an expression twice can't change a program's behavior. And most thunks are cheap to evaluate, hardly worth the effort of avoiding duplication. So GHC follows a "lazy black-holing" strategy. Threads write black holes only when they enter the garbage collector. If a thread discovers that one of its thunks has already been claimed, it will abandon the duplicated work-in-progress. This scheme avoids large wasted computations without paying the price on small computations. You can find the gritty details within the function threadPaused, in rts/ThreadPaused.c.

unsafe[Dupable]PerformIO

You may remember that we started, all those many words ago, with a program that uses unsafePerformIO. This breaks the pure-evaluation property of Haskell. Repeated evaluation will affect semantics! Might lazy black-holing be the culprit in the original bug?

The core behavior is implemented by unsafeDupablePerformIO, using GHC's internal representation of IO actions (which is beyond the scope of this article, to the extent I even have a scope in mind). As the name suggests, unsafeDupablePerformIO provides no guarantee against duplicate execution. The more familiar unsafePerformIO builds this guarantee by first invoking the noDuplicate# primitive operation.

The implementation of noDuplicate#, written in GHC's Cmm intermediate language, handles a few tricky considerations. But it's basically a call to the function threadPaused, which we saw is responsible for lazy black-holing. In other words, thunks built from unsafePerformIO perform eager black-holing.

Since threadPaused has to walk the evaluation stack, unsafeDupablePerformIO might be much faster than unsafePerformIO. In practice, this will matter when performing a great number of very quick IO actions, like peeking a single byte from memory. In that case it is safe to duplicate IO, provided the buffer is unchanging.
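The post's benchmark code isn't included in this excerpt, but the byte-peeking case can be sketched like this (the names here are mine, not the post's):

```haskell
import Foreign (Ptr, Word8, peek)
import System.IO.Unsafe (unsafeDupablePerformIO, unsafePerformIO)

-- Peeking one byte from a buffer that never changes: duplicated
-- execution is harmless, so the cheaper variant is acceptable here.
peekByteDup :: Ptr Word8 -> Word8
peekByteDup p = unsafeDupablePerformIO (peek p)

-- The safer version pays for the noDuplicate# machinery on every call.
peekByteSafe :: Ptr Word8 -> Word8
peekByteSafe p = unsafePerformIO (peek p)
```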

So performance-critical idempotent actions can benefit from unsafeDupablePerformIO. But most code should use the safer unsafePerformIO, as our bug reproducer does. And the noDuplicate# machinery for unsafePerformIO makes sense, so what's causing our bug?

The reproducer's top-level binding (something like lock = unsafePerformIO $ newMVar ()) is an application of the function ($) to the argument unsafePerformIO. So it's a static thunk, a CAF. Here's the old description of how CAF evaluation works, from Storage.c:

The entry code for every CAF does the following:

builds a BLACKHOLE in the heap

pushes an update frame pointing to the BLACKHOLE

calls newCaf, below

updates the CAF with a static indirection to the BLACKHOLE

Why do we build a BLACKHOLE in the heap rather than just updating the thunk directly? It's so that we only need one kind of update frame - otherwise we'd need a static version of the update frame too.

So here's the problem. Normal thunks get blackholed in place, and a thread detects duplicated evaluation by noticing that one of its thunks-in-progress became a BLACKHOLE. But static thunks — CAFs — are blackholed by indirection. Two threads might perform the above procedure concurrently, producing two different heap-allocated BLACKHOLEs, and they'd never notice.

As Simon Marlow put it:

Note [atomic CAF entry]

With THREADED_RTS, newCaf() is required to be atomic (see #5558). This is because if two threads happened to enter the same CAF simultaneously, they would create two distinct CAF_BLACKHOLEs, and so the normal threadPaused() machinery for detecting duplicate evaluation will not detect this. Hence in lockCAF() below, we atomically lock the CAF with WHITEHOLE before updating it with IND_STATIC, and return zero if another thread locked the CAF first. In the event that we lost the race, CAF entry code will re-enter the CAF and block on the other thread's CAF_BLACKHOLE.

I can't explain precisely what a WHITEHOLE means, but they're used for spin locks or wait-free synchronization in various places. For example, the MVar primitives are synchronized by the lockClosure spinlock routine, which uses WHITEHOLEs.

We grab the CAF's info table pointer, which tells us what kind of object it is. If it's not already claimed by another thread, we write a WHITEHOLE — but only if the CAF hasn't changed in the meantime. This step is an atomic compare-and-swap, implemented by architecture-specific code. The function cas is specified by this pseudocode:
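The pseudocode itself isn't reproduced in this excerpt, but the semantics of cas can be modeled in Haskell (the real operation is a single atomic instruction or LL/SC loop, not this sketch):

```haskell
import Data.IORef

-- cas p cmp new: if *p equals cmp, store new into *p; either way,
-- return the value *p held beforehand. The whole step is atomic.
cas :: Eq a => IORef a -> a -> a -> IO a
cas p cmp new = atomicModifyIORef' p $ \cur ->
  (if cur == cmp then new else cur, cur)
```

A caller knows it won the race exactly when the returned value equals cmp.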

There are some interesting variations between architectures. SPARC and x86 use single instructions, while PowerPC and ARMv6 have longer sequences. Old ARM processors require a global spinlock, which sounds painful. Who's running Haskell on ARMv5 chips?

*deep breath*

Thanks for reading / skimming this far! I learned a lot by writing this article, and I hope you enjoyed reading it. I'm sure I said something wrong somewhere, so please do not hesitate to correct me in the comments.

Monday, October 24, 2011

This quasicrystal is full of emergent patterns, but it can be described in a simple way. Imagine that every point in the plane is shaded according to the cosine of its y coordinate. The result would look like this:

Now we can rotate this image to get other waves, like these:

Each frame of the animation is a summation of such waves at evenly-spaced rotations. The animation occurs as each wave moves forward.
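In code, each such wave might look like this (a sketch; the original program computes these over Repa arrays, and the names here are mine):

```haskell
type Point = (Float, Float)

-- The cosine shading above, rotated by angle th and advanced by phase ph.
-- Projecting (x, y) onto the direction (cos th, sin th) rotates the wave;
-- the (+ 1) / 2 maps cosine's [-1, 1] range into [0, 1].
wave :: Float -> Float -> Point -> Float
wave th ph (x, y) = (cos (cos th * x + sin th * y + ph) + 1) / 2
```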

I recommend viewing it up close, and then from a few feet back. There are different patterns at each spatial scale.

The code

To render this animation I wrote a Haskell program, using the Repa array library. For my purposes, the advantages of Repa are:

To combine several functions, we sum their outputs, and wrap to produce a result between 0 and 1. As n increases, (wrap n) will rise to 1, fall back to 0, rise again, and so on. sequence converts a list of functions to a function returning a list, using the monad instance for ((->) r).
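A sketch of that combination step, under the assumption that wrap is a triangle wave built from properFraction (the post's actual definitions may differ):

```haskell
-- Triangle wave: rises from 0 to 1 on even integer intervals,
-- falls back from 1 to 0 on odd ones.
wrap :: Float -> Float
wrap n = let (k, v) = properFraction n :: (Int, Float)
         in if odd k then 1 - v else v

-- Sum the outputs of several point functions, then wrap into [0, 1].
-- sequence here uses the ((->) r) monad: [p -> Float] becomes p -> [Float].
combine :: [p -> Float] -> p -> Float
combine fs = wrap . sum . sequence fs
```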

We convert an array of floating-point values to an image in two steps. First, we map floats in [0,1] to bytes in [0,255]. Then we copy this to every color channel. The result is a 3-dimensional array, indexed by (row, column, channel). repa-devil takes such an array and outputs a PNG image file.
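The first of those steps might be written as follows (a sketch; the clamping is my addition, for safety against out-of-range values):

```haskell
import Data.Word (Word8)

-- Map a float in [0, 1] to a byte in [0, 255], clamping stray values.
toByte :: Float -> Word8
toByte x = round (255 * max 0 (min 1 x))
```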

Note that repa-devil silently refuses to overwrite an existing file, so you may need to rm out.png first.

On my 6-core machine, this parallel code ran in 3.72 seconds of wall-clock time, at a CPU utilization of 474%. The same code compiled without -threaded took 14.20 seconds, so the net efficiency of parallelization is 382%. This is a good result; what's better is how little work it required on my part. Cutting a mere 10 seconds from a single run is not a big deal. But it starts to matter when rendering many frames of animation, and trying out variations on the algorithm.

As a side note, switching from Float to Double increased the run time by about 30%. I suspect this is due to increased demand for memory bandwidth and cache space.

You can grab the Literate Haskell source and try it out on your own machine. This is my first Repa program ever, so I'd much appreciate feedback on improving the code.

Friday, October 21, 2011

The Traveling Salesperson Problem (TSP) is a famous optimization problem with applications in logistics, manufacturing, and art. In its planar form, we are given a set of "cities", and we want to visit each city while minimizing the total travel distance.

Finding the shortest possible tour is NP-hard, and quickly becomes infeasible as the number of cities grows. But most applications need only a heuristically good solution: a tour which is short, if not the shortest possible. The Lin-Kernighan heuristic quickly produces such tours.
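Lin-Kernighan itself is intricate, but the flavor of a constructive tour heuristic shows up in the much simpler nearest-neighbor method (not what linkern does; shown only for illustration):

```haskell
import Data.List (minimumBy, delete)
import Data.Ord (comparing)

type Pt = (Double, Double)

dist :: Pt -> Pt -> Double
dist (x1, y1) (x2, y2) = sqrt ((x1 - x2)^2 + (y1 - y2)^2)

-- Build a tour greedily: from each city, go to the nearest unvisited one.
-- Fast and simple, but typically well short of optimal; Lin-Kernighan
-- instead improves a tour by repeatedly exchanging its edges.
nearestNeighbor :: [Pt] -> [Pt]
nearestNeighbor []     = []
nearestNeighbor (c:cs) = go c cs
  where
    go cur []   = [cur]
    go cur rest = cur : go next (delete next rest)
      where next = minimumBy (comparing (dist cur)) rest
```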

The Concorde project provides a well-regarded collection of TSP solvers. I needed TSP heuristics for a Haskell project, so I wrote a Haskell interface to Concorde's Lin-Kernighan implementation. Concorde provides a C library, but it's far from clear how to use it. Instead I chose to invoke the linkern executable as a subprocess.

tsp lets you represent the points to visit using any type you like. You just provide a function to get the coordinates of each point. The Config parameter controls various aspects of the computation, including the time/quality tradeoff. Defaults are provided, and you can override these selectively using record-update syntax. All considered, it's a pretty simple interface which tries to hide the complexity of interacting with an external program.

Visualizing a tour

Here's an example program which computes a tour of 1,000 random points. We'll visualize the tour using the Diagrams library.

which can be managed through the usual module import/export mechanism.

Why global state?

Global state is a sign of bad software design, especially in Haskell. Why would we ever need it? Suppose you're wrapping a C library which is not thread-safe. Using a (hidden!) global lock, you can expose an interface which is simple and safe. In other words, you're using global state to compensate for others using global state. [1] Another use case is generating unique identifiers to speed up comparison of values. This can be done without breaking referential transparency, but you need a source of IDs which is really and truly global.

In these situations it's typical to create global variables using a hack such as

ref :: IORef Int
ref = unsafePerformIO (newIORef 3)
{-# NOINLINE ref #-}
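For instance, the global-lock use case from earlier might look like this (a sketch; the names are illustrative):

```haskell
import Control.Concurrent.MVar (MVar, newMVar, withMVar)
import System.IO.Unsafe (unsafePerformIO)

-- A hidden, module-private lock. NOINLINE is essential: without it,
-- GHC might inline the unsafePerformIO and create more than one lock.
lock :: MVar ()
lock = unsafePerformIO (newMVar ())
{-# NOINLINE lock #-}

-- Serialize every call into the thread-unsafe C library.
withLock :: IO a -> IO a
withLock act = withMVar lock (\_ -> act)
```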

My library is just a set of Template Haskell macros for the same hack. If global variables are seldom needed, then what good are these macros?

Writing out the hack each time is unsafe. I might forget the NOINLINE pragma, or subvert the type system with a polymorphic reference. The safe-globals library prevents these mistakes. I'm of the opinion — and I know it's not shared by all — that even questionable techniques should be made as safe as possible. Call it "harm reduction" if you like.

In ten years, if GHC 9 requires an extra pragma for safety, then safe-globals can be updated, without changing every package that uses it. If JHC's ACIO feature is ported to GHC, then safe-globals can take advantage and get rid of the hacks entirely.

But the direct impetus to write safe-globals was the appearance of the global-variables library, which drew some attention in the Haskell community. global-variables aims to solve the same problem, using a different approach with a number of drawbacks. The rest of this article outlines some of these drawbacks.

Spooky action at a distance

Among the stated features of global-variables are

Avoid having to pass references explicitly throughout the program in order to let distant parts communicate. Enable a communication by convention scheme, where e.g. different libraries may communicate without code dependencies.

This refers to the fact that two global refs with the same name string will become entangled, no matter where in a program they were declared. This is certainly a bug, not a feature. Untracked interactions between different components are the archetypal defect in software engineering.

Neither is there a clear way for a user of global-variables to opt out of this misfeature. The best you can do is augment your names with some prefix which you hope is unique — the same non-solution used by C libraries. Haskell solves namespace problems with a module system and a package system. global-variables circumvents both.

Still, suppose that you choose "communication by convention" for your library. You'll need to manually document the name and type of every ref used by this communication, since they aren't tracked by the type system. A mismatch (as from a library upgrade) will cause silent breakage. Worse, you need to tell every library user how to initialize your library's own variables, and hope that they do it correctly. When a ref is given different initializers in different declarations, the result is indeterminate.

Type clashes

A polymorphic reference, with a type like ∀ t. IORef t, breaks the type system. You can write a value of one type and then read it with another type. So it's important for global-variables to disallow polymorphic refs. The mechanism it uses is that each declaration is implicitly a family of refs, one for each monomorphic type (via Typeable).

This will print 1, not 120. The ref is written at type Int (the return type of length) and an implicitly different ref is read at type Integer (because of the subsequent call to fact).

You can certainly argue that top-level refs should always be declared with a monomorphic type signature. Indeed, my library enforces this. But global-variables doesn't, and can't. Making type clashes a run-time error would be a step in the right direction.

[1] A common response is that locking should be added in C code; however, concurrent programming in C is cumbersome and dangerous. It's much easier, if a bit ugly, to implement locking on the Haskell side. Alternatively, you could store an MVar lock in a C global variable via StablePtr. Has anyone done this? ↩

Saturday, October 15, 2011

This program runs in 16-bit x86 real mode, without any operating system. It's formatted as a PC master boot record, which is 512 bytes long. Subtracting out space reserved for a partition table, we have only 446 bytes for code and data.

Programming in such a restricted environment is quite a challenge. It's further complicated by real mode's segmented addressing. Indexing an array bigger than 64 kB requires significant extra code — and that goes double for the video frame buffer. With two off-screen buffers and a 640 × 480 × 1 byte video mode, much of my code is devoted to segment juggling.

I spent a long time playing with code compression. In the end, I couldn't find a scheme which justifies the fixed size cost of its own decoder. It seems that 16-bit x86 machine code is actually pretty information-dense. For a bigger demo or 32-bit mode (with bigger immediate operands) I'd definitely want compression.

It's totally feasible to enter 32-bit protected mode within 446 bytes, but there's little gained by doing so. You lose easy access to the PC BIOS, which is the only thing you have that resembles an operating system or standard library.

You can browse the assembly source code or grab the MBR itself. It runs well in QEMU, with or without KVM, and I also tested it on a few real machines via USB boot. With QEMU on Linux it's as simple as

$ qemu -hda phosphene.mbr

Thanks to Michael Rule for the original idea and for tons of help with tweaking the rendering algorithm. His writeup has more information about this project.

Wednesday, October 12, 2011

Yesterday I gave a talk on the topic of "Why learn Haskell?", and I've posted the slides [PDF]. Thanks to MIT's SIPB for organizing these talks and providing tasty snacks. Thanks also to the Boston Haskell group for lots of useful feedback towards improving my talk.

Monday, October 10, 2011

Shell scripts make it easy to pass data between external commands. But shell script as a programming language lacks features like non-trivial data structures and easy, robust concurrency. These would be useful in building quick solutions to system administration and automation problems.

As others have noted, Haskell is an interesting alternative for these scripting tasks. I wrote the shqq library to make it a little easier to invoke external programs from Haskell. With the sh quasiquoter, you write a shell command which embeds Haskell variables, execute it as an IO action, and get the command's standard output as a String. In other words, it's a bit like the backtick operator from Perl or Ruby.

For efficiency, we find potential duplicates by size, and then checksum only these files. We use external shell commands for checksumming as well as the initial directory traversal. At the end we print the names of duplicated files, one per line, with a blank line after each group of duplicates.
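The script itself isn't reproduced in this excerpt, but the core grouping logic can be sketched in plain Haskell (without shqq; the key extraction, by size or by checksum, is abstracted out):

```haskell
import qualified Data.Map.Strict as M

-- Group items by a key (file size, or checksum), keeping only
-- groups with more than one member: the potential duplicates.
duplicatesBy :: Ord k => (a -> k) -> [a] -> [[a]]
duplicatesBy key = filter ((> 1) . length)
                 . M.elems
                 . M.fromListWith (++)
                 . map (\x -> (key x, [x]))
```

Applying duplicatesBy twice, first with a cheap size key and then with a checksum over the surviving groups, gives the two-pass structure described above.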

I included type signatures for clarity, but you wouldn't need them in a one-off script. Not counting imports and the LANGUAGE pragma, that makes 10 lines of code total. I'm pretty happy with the expressiveness of this solution, especially the use of parallel IO for an easy speedup.