Please report any overly slow GHC-compiled programs. Since GHC doesn't have any credible competition in the performance department these days, it's hard to say what overly slow means, so just use your judgement! Of course, if a GHC-compiled program runs slower than the same program compiled with another Haskell compiler, then it's definitely a bug. Furthermore, if an equivalent OCaml, SML or Clean program is faster, this might be a bug.

If you can't reduce the GC cost any further, then using more memory by tweaking the GC options will probably help. For example, increasing the default heap size with +RTS -H128m will reduce the number of GCs.
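
To check whether a larger heap really reduces the number of collections, a program can read the runtime's own counters. The sketch below assumes a GHC recent enough to provide `GHC.Stats.getRTSStats` (base 4.10 or later) and that statistics were switched on with `+RTS -T`. The name `gcCount` is made up for this illustration.

```haskell
import Data.Word (Word32)
import GHC.Stats (RTSStats (gcs), getRTSStats, getRTSStatsEnabled)

-- | Number of garbage collections performed so far, or Nothing when the
-- program was started without RTS statistics (e.g. without +RTS -T).
gcCount :: IO (Maybe Word32)
gcCount = do
  enabled <- getRTSStatsEnabled
  if enabled
    then Just . gcs <$> getRTSStats  -- getRTSStats throws if stats are off
    else pure Nothing
```

Printing `gcCount` at the end of a run with `+RTS -T` and again with `+RTS -T -H128m` lets you compare the two counts. Recent GHC versions may need the program to be linked with `-rtsopts` before they accept these flags.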

3 Unboxed types

When you are really desperate for speed and want to get right down to the “raw bits”, see GHC Primitives for some information about using unboxed types.

This should be a last resort, however, since unboxed types and primitives are non-portable. Fortunately, it is usually not necessary to resort to using explicit unboxed types and primitives, because GHC's optimiser can do the work for you by inlining operations it knows about, and unboxing strict function arguments (see Performance:Strictness). Strict and unpacked constructor fields can also help a lot (see Performance:Data Types). Sometimes GHC needs a little help to generate the right code, so you might have to look at the Core output to see whether your tweaks are actually having the desired effect.
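
As a minimal sketch of the portable alternative, the accumulator below uses strict, unpacked constructor fields and a strict left fold, leaving the actual unboxing to GHC's optimiser. The names `Stats` and `mean` are invented for this example, and whether the fields end up unboxed in the inner loop still depends on compiling with optimisation (`-O` or `-O2`).

```haskell
import Data.List (foldl')

-- Strict fields with UNPACK are stored directly in the constructor,
-- rather than as pointers to separately boxed Int and Double values.
data Stats = Stats {-# UNPACK #-} !Int {-# UNPACK #-} !Double

-- | Arithmetic mean in a single pass (NaN for the empty list).
mean :: [Double] -> Double
mean xs = total / fromIntegral count
  where
    Stats count total = foldl' step (Stats 0 0) xs
    -- Building a new Stats forces both fields, so no thunks pile up.
    step (Stats n s) x = Stats (n + 1) (s + x)
```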

One thing that can be said for using unboxed types and primitives is that you know you're writing efficient code, rather than relying on GHC's optimiser to do the right thing, and being at the mercy of changes in GHC's optimiser down the line. This may well be important to you, in which case go for it.

3.1 An Example

Usually unboxing is not explicitly required (see the Core tutorial below). However, there
are circumstances where you require precise control over how your code is
unboxed. The following program was at one point an entry in the
Great Language Shootout.
GHC did a good job unboxing the loop, but wouldn't generate the best loop. The
solution was to unbox the loop function by hand, resulting in better code.

The hand-unboxed version contains one fewer case expression and runs as fast as C; the
original is a bit slower. A similar problem was also solved with explicit unboxing in the recursive benchmark entry.
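
To give a feel for what unboxing a loop by hand looks like, here is a small sketch. It is not the shootout entry; it is a toy summation written twice, once with ordinary boxed `Int` arguments and once with `Int#`. The function names are invented for the example, and both versions assume a non-negative argument.

```haskell
{-# LANGUAGE MagicHash #-}

import GHC.Exts (Int (I#), Int#, isTrue#, (+#), (-#), (==#))

-- Boxed version: relies on GHC's strictness analysis to unbox the loop.
sumToBoxed :: Int -> Int
sumToBoxed n = go 0 n
  where
    go :: Int -> Int -> Int
    go acc 0 = acc
    go acc k = go (acc + k) (k - 1)

-- Hand-unboxed version: the loop works on raw machine integers only,
-- and boxing happens exactly once, on the final result.
sumToUnboxed :: Int -> Int
sumToUnboxed (I# n) = I# (go 0# n)
  where
    go :: Int# -> Int# -> Int#
    go acc k
      | isTrue# (k ==# 0#) = acc
      | otherwise          = go (acc +# k) (k -# 1#)
```

With optimisation on, GHC will often compile the first version to much the same loop as the second; comparing the Core for both is the only way to be sure.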

4 Primops

If you really, really need the speed, and other techniques don't seem to
be helping, programming your code in raw GHC primops can sometimes do
the job. As with unboxed types, you get some guarantees that your code's
performance isn't subject to changes to the GHC optimisations, at the
cost of less readable code.

For example, in an imperative benchmark program a bottleneck was
swapping two values. Raw primops solved the problem:
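
A minimal sketch of such a swap, assuming the values are `Int`s stored in a mutable byte array. This is not the benchmark's code; `swap#` and `swapPair` are names invented for the illustration, and `swapPair` exists only so the primop code can be called from ordinary Haskell.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}

import GHC.Exts
  ( Int (I#), Int#, MutableByteArray#, State#
  , newByteArray#, readIntArray#, writeIntArray# )
import GHC.ST (ST (ST), runST)

-- Swap elements i and j of an Int array, threading the state token by hand.
swap# :: MutableByteArray# s -> Int# -> Int# -> State# s -> State# s
swap# arr i j s0 =
  case readIntArray# arr i s0 of
    (# s1, x #) -> case readIntArray# arr j s1 of
      (# s2, y #) -> case writeIntArray# arr i y s2 of
        s3 -> writeIntArray# arr j x s3

-- Wrapper: store a pair in a fresh array, swap it, and read it back.
-- 16 bytes is room for two Ints, assuming an Int is at most 8 bytes wide.
swapPair :: (Int, Int) -> (Int, Int)
swapPair (I# a, I# b) = runST (ST (\s0 ->
  case newByteArray# 16# s0 of
    (# s1, arr #) -> case writeIntArray# arr 0# a s1 of
      s2 -> case writeIntArray# arr 1# b s2 of
        s3 -> case swap# arr 0# 1# s3 of
          s4 -> case readIntArray# arr 0# s4 of
            (# s5, x #) -> case readIntArray# arr 1# s5 of
              (# s6, y #) -> (# s6, (I# x, I# y) #)))
```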

5 Inlining

Sometimes (often?) explicitly inlining critical chunks of code can help.
The INLINE pragma can be used for this purpose. In GHC 6.4.1 (at least)
some INLINE pragmas are ignored, so inlining by hand can
occasionally help.
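
A small sketch of both approaches, using invented names: `square` carries an INLINE pragma, and `sumSquaresManual` is the same loop with the helper substituted by hand.

```haskell
import Data.List (foldl')

-- A small, hot helper: ask GHC to inline it at every call site.
{-# INLINE square #-}
square :: Double -> Double
square x = x * x

-- Uses the helper and relies on the pragma being honoured.
sumSquares :: [Double] -> Double
sumSquares = foldl' (\acc x -> acc + square x) 0

-- The same loop with the helper inlined by hand.
sumSquaresManual :: [Double] -> Double
sumSquaresManual = foldl' (\acc x -> acc + x * x) 0
```

If the two versions perform differently, the Core output will show whether the pragma was applied.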

6 Looking at the Core

Core, GHC's intermediate language, can be very useful for improving
the performance of your code. Core is a functional language much like a very
stripped down Haskell (by design), so it's still readable, and still purely
functional. The general technique is to iteratively inspect how the critical
functions of your program are compiled to Core, checking that they're compiled
as efficiently as you expect. Sometimes GHC doesn't quite manage to unbox your
function arguments, float out common subexpressions, or unfold loops ideally --
but you'll only know if you read the Core.

Compiled with -O2, the program runs, but the performance is really bad:
it needs more than 128M of heap and eventually runs out of
memory. This is a classic space leak, so let's look at the generated Core.

7.1 Inspect the Core

The best way to check the Core that GHC generates is with the
-ddump-simpl flag (dump the results after code simplification, and
after all optimisations are run). The result can be verbose, so pipe it into a pager.
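
As a concrete way to try this, here is a tiny stand-alone module (not the program discussed in this section) together with the command that dumps its Core. The file name `Loop.hs` and the function `harmonic` are assumptions made for the example.

```haskell
-- Save as Loop.hs and dump the optimised Core into a pager with:
--
--   ghc -O2 -ddump-simpl Loop.hs | less
--
-- Adding -dsuppress-all hides most of the type and coercion noise.
module Loop (harmonic) where

-- | Harmonic sum 1/1 + 1/2 + ... + 1/n, written as an accumulating loop.
harmonic :: Int -> Double
harmonic n = loop 1 0
  where
    loop :: Int -> Double -> Double
    loop k acc
      | k > n     = acc
      | otherwise = loop (k + 1) (acc + 1 / fromIntegral k)
```

In the dump, look for the worker generated for `loop` (GHC usually gives workers a `$w` prefix) and check whether its arguments have unboxed types such as `Int#` and `Double#`.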

Looking for the 'loop', we find that it has been compiled to a function with
the following type:

Here the first guard is purely a syntactic trick to inform GHC that the
arguments should be strictly evaluated. I've played a little game here: using
! for `seq` is reminiscent of the new bang-pattern proposal for
strictness. Let's see how this compiles. With all arguments made strict, GHC produces an
inner loop of:
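
For reference, the guard trick itself can be sketched at the source level on a much smaller loop. This is neither the program discussed here nor GHC's Core output; `harmonicStrict` is an invented name, and the operator is written with spaces on both sides so that newer GHC versions parse it as an ordinary infix operator.

```haskell
infixl 9 !

-- `x ! y` forces x to weak head normal form, then returns y.
(!) :: a -> b -> b
x ! y = x `seq` y

-- | Harmonic sum 1/1 + 1/2 + ... up to n, with both loop arguments
-- forced on every iteration.
harmonicStrict :: Double -> Double
harmonicStrict n = loop 1 0
  where
    loop :: Double -> Double -> Double
    loop k acc
      -- Always False, so this branch is never taken; evaluating the
      -- guard forces k and acc before the real guards are tried.
      | () ! k ! acc ! False = undefined
      | k > n                = acc
      | otherwise            = loop (k + 1) (acc + 1 / k)
```

In current GHC the same effect is usually written with the `BangPatterns` extension, as `loop !k !acc`.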

7.5 Strength reduction

Finally, another trick -- manual strength reduction. When I checked the C
entry, it used an integer for the k parameter to the loop, and cast it
to a double for the math each time around, so perhaps we can make it an
Int parameter. Secondly, the alt parameter only has its sign flipped
each time, so perhaps we can factor out the alt / k arg (it's either 1 /
k or -1 / k), saving a division. Thirdly, (k ** (-0.5)) is just a
slow way of computing 1 / sqrt k.
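
The three changes can be sketched on a cut-down loop with only three running sums. This is an illustration under assumed names (`sumsBefore`, `sumsAfter`), not the benchmark program, and it uses the `BangPatterns` extension for strictness.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Before: a Double counter, (**) for the root, and a division per sum.
sumsBefore :: Double -> (Double, Double, Double)
sumsBefore n = loop 1 1 0 0 0
  where
    loop !k !alt !a !b !c
      | k > n     = (a, b, c)
      | otherwise = loop (k + 1) (-alt)
                         (a + k ** (-0.5))  -- sum of k^(-1/2)
                         (b + 1 / k)        -- harmonic series
                         (c + alt / k)      -- alternating harmonic series

-- After: an Int counter converted once per iteration, 1 / k shared
-- between two sums, and sqrt in place of (**).
sumsAfter :: Int -> (Double, Double, Double)
sumsAfter n = loop 1 1 0 0 0
  where
    loop :: Int -> Double -> Double -> Double -> Double
         -> (Double, Double, Double)
    loop !i !alt !a !b !c
      | i > n     = (a, b, c)
      | otherwise =
          let k  = fromIntegral i :: Double
              dk = 1 / k                    -- the only division by k
          in loop (i + 1) (-alt)
                  (a + 1 / sqrt k)
                  (b + dk)
                  (c + alt * dk)
```

The two versions agree up to floating-point rounding. Whether the second is faster on a given GHC version is something to confirm by timing it and reading the Core.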