Sunday, May 22, 2011

Last week I went to TFP 2011, and gave a talk on Hoogle entitled "Finding Functions from Types". The slides are now available online. These slides give some information about how the type searching works in Hoogle, and I intend to write further details in the future.

Sunday, May 08, 2011

Summary: CmdArgs can be used purely, and even if you choose to use it in an impure manner, you don't need to worry.

As a result of my blog post yesterday, a few people have said they have been put off using CmdArgs. There are three reasons why you shouldn't be put off using CmdArgs, even though it has some impure code within it.

1: You can use CmdArgs entirely purely

You do not need to use the impure code at all. The module System.Console.CmdArgs.Implict provides two ways to write annotated records. The first way is impure, and uses cmdArgs and &=. The second way is pure, and uses cmdArgs_ and +=. Both ways have exactly the same power. For example, you can write either of:

The first definition is impure. The second is pure. Both are equivalent. I prefer the syntax of the first version, but the second version is not much longer or uglier. If the impurities scare you, just switch. The Implicit module documentation describes four simple rules for converting between the two methods.

2: You do not need to use annotated records at all

Notice that the above module is called Implicit, CmdArgs also features an Explicit version where you create a data type representing your command line arguments (which is entirely pure). For example:

Here you construct modes and flags explicitly, and pass functions which describe how to update the command line state. Everything in the Implicit module maps down to the Explicit module. In addition, you can use the GetOpt compatibility layer, which also maps down to the Explicit parser.

Having written command line parsers with annotated records, I'd never go back to writing them out in full. However, if you want to map your own command line parser description down to the Explicit version you can get help messages and parsing for free.

3: I fight the optimiser so you don't have to

Even if you choose to use the impure implicit version of CmdArgs, you don't need to do anything, even at high optimisation levels. I have an extensive test suite, and will continue to ensure CmdArgs programs work properly - I rely on it for several of my programs. While I find the implicit impure nicer to work with, I am still working on making a pure version with the same syntax, developing alternative methods of describing the annotations, developing quasi-quoters etc.

CmdArgs is a Haskell library for concisely specifying the command line arguments, see this post for an introduction. I have just released versions 0.6.10 and 0.7 of CmdArgs, which are a strongly recommended upgrade, particular for GHC 7 users. (The 0.6.10 is so anyone who specified 0.6.* in their Cabal file gets an automatic upgrade, and 0.7 is so people can write 0.7.* to require the new version.)

Why CmdArgs is impure

CmdArgs works by annotating fields to determine what command line argument parsing you want. Consider the data type:

data Opts = Opts {file :: FilePath}

By default this will create a command line which would respond to myprogram --file=foo. If instead we want to use myprogram foo we need to attach the annotation args to the file field. We can do this in CmdArgs with:

cmdArgs $ Opts {file = "" &= args}

However, CmdArgs does not have to be impure - you can equally use the pure variant of CmdArgs and write:

cmdArgs_ $ record Opts{} [file := "" += args]

Sadly, this code is a little more ugly. I prefer the impure version, but everything is supported in both versions.

I am still experimenting with other ways of writing an annotated record, weighing the trade-offs between purity, safety and the syntax required. I have been experimenting with pure variants tonight, and hope in a future release to make CmdArgs pure by default.

GHC vs CmdArgs

Because CmdArgs uses unsafe and untracked side effects, GHC's optimiser can manipulate the program in ways that change the semantics. A standard example where GHC's optimiser can harm CmdArgs is:

Opts2 {file1 = "" &= typFile, file2 = "" &= typFile}

Here the subexpression "" &= typFile is duplicated, and if GHC spots this duplication, it can use common sub-expression eliminate to transform the program to:

let x = "" &= typFile in Opts2 {file1 = x, file2 = x}

Unfortunately, because CmdArgs is impure, this program attaches the annotation to file1, but not file2.

This optimisation problem happens in practice, and can be eliminated by writing {-# OPTIONS_GHC -fno-cse #-} in the source file defining the annotations. However, it is burdensome to require all users of CmdArgs to add pragmas to their code, so I investigated how to reduce the problem.

Beating GHC 6.10.4

The key function used to beat GHC's optimiser is &=. It is defined as:

In order to stop CSE, I use two tricks. Firstly, I mark &= as INLINE, so that it's definition ends up in the annotations - allowing me to try and modify the expression so it doesn't become suitable for CSE. For GHC 6.10.4 I then made up increasingly random expressions, with increasingly random pragmas, until the problem went away. The end solution was:

Unfortunately, after upgrading to GHC 7.0.1, the problem happened again. I asked for help, and then started researching GHC's CSE optimiser (using the Hoogle support for GHC I added last weekend - a happy coincidence). The information I found is summarised here. Using this information I was able to construct a much more targeted solution (which I can actually understand!):

As before, we INLINE the &=, so it gets into the annotations. Now we want to make all annotations appear different, even though they are the same. We use unique which is equivalent to id, but when wrapped around an expression causes all instances to appear different under CSE. The unit binding has the value (), but in a way that GHC can't reduce, so the case does not get eliminated. GHC does not CSE case expressions, so all annotations are safe.

How GHC will probably defeat CmdArgs next

There are three avenues GHC could explore to defeat CmdArgs:

Reorder the optimisation phases. Performing CSE before inlining would defeat everything, as any tricks in &= would be ignored.

Better CSE. Looking inside case would defeat my scheme.

Reduce unit to (). This reduction could be done in a number of ways:

Constructor specialisation of reverse should manage to reduce reverse to "", which then seq can evaluate, and then eliminate the case. I'm a little surprised this isn't already happening, but I'm sure it will one day.

Supercompilation can inline recursive functions, and inlining reverse would eliminate the case.

If GHC could determine reverse "" was total it could eliminate the seq without knowing it's value. This is somewhat tricky for reverse as it isn't total for infinite lists.

Of these optimisations, I consider the reduction of unit to be most likely, but also the easiest to counteract.

How CmdArgs will win

The only safe way for CmdArgs to win is to rewrite the library to be pure. I am working on various annotation schemes and hope to have something available shortly.