Please report any overly-slow GHC-compiled programs. Since GHC doesn't have any credible competition in the performance department these days, it's hard to say what "overly slow" means, so just use your judgement! Of course, if a GHC-compiled program runs slower than the same program compiled with another Haskell compiler, then it's definitely a bug. Furthermore, if an equivalent OCaml, SML or Clean program is faster, this might be a bug.

The following compile-time flags affect performance:

* Optimise, using <tt>-O</tt> or <tt>-O2</tt>: this is the most basic way to make your program go faster. Compilation time will be slower, especially with <tt>-O2</tt>. At present, <tt>-O2</tt> is nearly indistinguishable from <tt>-O</tt>.

* Do NOT use <tt>-O3</tt>: it actually gives less optimization than <tt>-O2</tt> (see [http://hackage.haskell.org/trac/ghc/ticket/1261 GHC ticket #1261]).

* <tt>-funfolding-use-threshold=16</tt>: demand more inlining.

* <tt>-fexcess-precision</tt>: see [[Performance/Floating_point]].

* <tt>-optc-O3</tt>: enables a suite of optimizations in the GCC compiler; see the [http://www.openbsd.org/cgi-bin/man.cgi?query=gcc&sektion=1 gcc(1) man-page] for details (a C-compiler option).

* <tt>-optc-ffast-math</tt>: a C-compiler option which allows GCC to be less strict with respect to the standard when compiling IEEE 754 floating-point arithmetic. Math operations will not trap if something goes wrong, and they will assume that NaN and ±Infinity do not occur in arguments or results. For most practical floating-point processing this is a non-issue, and enabling the flag can speed up FP arithmetic by a considerable amount. Also see the gcc(1) man-page.

Other useful flags:

* <tt>-ddump-simpl > core.txt</tt>: generate a core.txt file (see below).
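Putting several of these together, a typical compile line might look like this (<tt>Main.hs</tt> and <tt>main</tt> are placeholder names):

<pre>
$ ghc -O2 -fexcess-precision -optc-O3 -optc-ffast-math --make Main.hs -o main
</pre>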

== Measuring performance ==

The first thing to do is measure the performance of your program, and find out whether all the time is being spent in the garbage collector or not. Run your program with the <tt>+RTS -sstderr</tt> option:
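For example (<tt>prog</tt> is a placeholder for your program's name):

<pre>
$ ./prog +RTS -sstderr
</pre>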

This tells you how much time is being spent running the program itself (MUT time), and how much time is being spent in the garbage collector (GC time).

If your program is doing a lot of GC, then your first priority should be to check for [[Memory leak|Space Leaks]] using [http://www.haskell.org/ghc/docs/latest/html/users_guide/prof-heap.html heap profiling], and then to try to reduce allocations by [http://www.haskell.org/ghc/docs/latest/html/users_guide/prof-time-options.html time and allocation profiling].

If you can't reduce the GC cost any further, then using more memory by tweaking the [http://www.haskell.org/ghc/docs/latest/html/users_guide/runtime-control.html#rts-options-gc GC options] will probably help. For example, increasing the default heap size with <tt>+RTS -H128m</tt> will reduce the number of GCs.

If your program isn't doing too much GC, then you should proceed to [http://www.haskell.org/ghc/docs/latest/html/users_guide/prof-time-options.html time and allocation profiling] to see where the big hitters are.

== Modules and separate compilation ==


In general, splitting code across modules should not make programs less efficient. GHC does quite aggressive cross-module inlining: when you import a function f from another module M, GHC consults the "interface file" M.hi to get f's definition.


For best results, ''use an explicit export list''. If you do, GHC can inline any non-exported functions that are only called once, even if they are very big. Without an explicit export list, GHC must assume that every function is exported, and hence (to avoid code bloat) is more conservative about inlining.
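For example, a module with an explicit export list might look like this (the module and function names here are made up purely for illustration):

<haskell>
{-# LANGUAGE BangPatterns #-}
module Mean (mean) where

-- Exported: callable from other modules, so GHC must keep it around.
mean :: [Double] -> Double
mean xs = s / fromIntegral n
  where
    (s, n) = sumAndLength xs

-- Not exported and called only once, so GHC is free to inline it
-- at its single call site, however large it is.
sumAndLength :: [Double] -> (Double, Int)
sumAndLength = go 0 0
  where
    go !s !n []       = (s, n)
    go !s !n (y : ys) = go (s + y) (n + 1) ys
</haskell>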


There is one exception to the general rule that splitting code across modules does not harm performance. As mentioned above, if a non-exported non-recursive function is called exactly once, then it is inlined ''regardless of size'', because doing so does not cause code duplication. But if it's exported and is large, then its inlining is not exposed -- and even if it were it might not be inlined, because doing so duplicates its code an unknown number of times. You can change the threshold for (a) exposing and (b) using an inlining, with flags <tt>-funfolding-creation-threshold</tt> and <tt>-funfolding-use-threshold</tt> respectively.





== Unboxed types ==

Use unboxed types when you are really desperate for speed and you want to get right down to the “raw bits”. Please see GHC Primitives for some information about using unboxed types.

This should be a last resort, however, since unboxed types and primitives are non-portable. Fortunately, it is usually not necessary to resort to using explicit unboxed types and primitives, because GHC's optimiser can do the work for you by inlining operations it knows about, and unboxing strict function arguments (see [[Performance/Strictness]]). Strict and unpacked constructor fields can also help a lot (see [[Performance/Data Types]]). Sometimes GHC needs a little help to generate the right code, so you might have to look at the Core output to see whether your tweaks are actually having the desired effect.

One thing that can be said for using unboxed types and primitives is that you know you're writing efficient code, rather than relying on GHC's optimiser to do the right thing, and being at the mercy of changes in GHC's optimiser down the line. This may well be important to you, in which case go for it.

=== An example ===

Usually unboxing is not explicitly required (see the Core tutorial below); however, there are circumstances where you need precise control over how your code is unboxed. The following program was at one point an entry in the Great Language Shootout. GHC did a good job unboxing the loop, but wouldn't generate the best loop. The solution was to unbox the loop function by hand, resulting in better code which contains one less case statement. The second version runs as fast as C, the first a bit slower. A similar problem was also solved with explicit unboxing in the recursive benchmark entry.
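The Shootout code itself is not reproduced here, but a minimal sketch of what unboxing a loop by hand looks like (the loop body and its bound are invented) is:

<haskell>
{-# LANGUAGE MagicHash, BangPatterns #-}
module Unboxed where

import GHC.Exts

-- Boxed version: with -O, GHC's strictness analyser will usually
-- unbox this itself, but the exact loop produced is up to the optimiser.
loopBoxed :: Double -> Double -> Double
loopBoxed !k !acc
  | k > 1000000 = acc
  | otherwise   = loopBoxed (k + 1) (acc + 1 / sqrt k)

-- Hand-unboxed version: written directly over Double#, so the shape
-- of the generated loop no longer depends on the optimiser.
-- (On current GHCs comparison primops return Int#, hence isTrue#.)
loopUnboxed :: Double# -> Double# -> Double
loopUnboxed k acc
  | isTrue# (k >## 1000000.0##) = D# acc
  | otherwise =
      loopUnboxed (k +## 1.0##) (acc +## (1.0## /## sqrtDouble# k))
</haskell>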

== Primops ==

If you really, really need the speed, and other techniques don't seem to be helping, programming your code in raw GHC primops can sometimes do the job. As with unboxed types, you get some guarantee that your code's performance isn't subject to changes in GHC's optimisations, at the cost of less readable code.

For example, in an imperative benchmark program a bottleneck was
swapping two values. Raw primops solved the problem:
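The benchmark code itself isn't shown here, but a rough sketch of what raw-primop code for swapping two values looks like (the array, indices and function names are invented) is:

<haskell>
{-# LANGUAGE MagicHash, UnboxedTuples #-}
module Swap where

import GHC.Exts
import GHC.IO (IO (..))

-- Swap the Int#s at word indices i and j of a MutableByteArray#,
-- threading the State# token through each primop by hand.
swapInts# :: MutableByteArray# RealWorld -> Int# -> Int#
          -> State# RealWorld -> State# RealWorld
swapInts# arr i j s0 =
  case readIntArray# arr i s0 of
    (# s1, a #) ->
      case readIntArray# arr j s1 of
        (# s2, b #) ->
          case writeIntArray# arr i b s2 of
            s3 -> writeIntArray# arr j a s3

-- A small wrapper so the primop code can be called from ordinary IO code.
swapInts :: MutableByteArray# RealWorld -> Int -> Int -> IO ()
swapInts arr (I# i) (I# j) = IO (\s -> (# swapInts# arr i j s, () #))
</haskell>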

== Inlining ==

GHC does a lot of inlining, which has a dramatic effect on performance.

Without -O, GHC does inlining within a module, but no cross-module inlining.

With -O, it does a lot of cross-module inlining. Indeed, generally
speaking GHC will inline across modules just as much as it does
within modules, with a single large exception. If GHC sees that a
function 'f' is called just once, it inlines it regardless of how big
'f' is. But once 'f' is exported, GHC can never see that it's called
exactly once, even if that later turns out to be the case. This
inline-once optimisation is pretty important in practice.

So: if you care about performance, do not export functions that are not used outside the module (i.e. use an explicit export list, and keep it as small as possible).

Sometimes explicitly inlining critical chunks of code can help.
The INLINE pragma can be used for this purpose; but not for recursive functions, since inlining them forever would obviously be a bad idea.

If a function you want inlined contains a slow path, it can help a
good deal to separate the slow path into its own function and NOINLINE
it.
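For instance (a minimal sketch; the table type and functions are invented for illustration):

<haskell>
module SlowPath (insert) where

-- An invented growable table, purely for illustration.
data Table = Table { capacity :: Int, elems :: [Int] }

-- Small and inlined: every call site gets the cheap test and the
-- common case, without dragging the slow path along.
{-# INLINE insert #-}
insert :: Int -> Table -> Table
insert x t
  | length (elems t) < capacity t = insertFast x t  -- common case
  | otherwise                     = insertSlow x t  -- rare case

insertFast :: Int -> Table -> Table
insertFast x t = t { elems = x : elems t }

-- The big, rarely-taken path, kept out of line so it is not
-- duplicated at every call site of 'insert'.
{-# NOINLINE insertSlow #-}
insertSlow :: Int -> Table -> Table
insertSlow x t = insertFast x (t { capacity = 2 * capacity t })
</haskell>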

== Looking at the Core ==

GHC's intermediate language, Core, can be very useful for improving
the performance of your code. Core is a functional language much like a very
stripped-down Haskell (by design), so it's still readable, and still purely
functional. The general technique is to iteratively inspect how the critical
functions of your program are compiled to Core, checking that they're compiled
optimally. Sometimes GHC doesn't quite manage to unbox your
function arguments, float out common subexpressions, or unfold loops ideally --
but you'll only know if you read the Core.

== A case study: the partial sums benchmark ==

Here's a step-by-step guide to optimising a particular program, the [http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums&lang=ghc&id=2 partial-sums problem] from the [http://shootout.alioth.debian.org Great Language Shootout]. We developed a number of increasingly optimised versions of the entry.

Compiled with -O2 it runs, but the performance is really bad: the heap grows past 128M, and in fact the program eventually runs out of memory. A classic space leak. So look at the generated Core.

=== Inspect the Core ===

The best way to check the Core that GHC generates is with the
-ddump-simpl flag (dump the results after code simplification, and
after all optimisations are run). The result can be verbose, so pipe it into a pager.
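For example, to look at the optimised Core for a module (<tt>Foo.hs</tt> is a placeholder name):

<pre>
$ ghc -O2 -ddump-simpl Foo.hs | less
</pre>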

Looking for the 'loop', we find that it has been compiled to a function with
the following type:

Here the first guard is purely a syntactic trick to inform GHC that the
arguments should be strictly evaluated. I've played a little game here:
using ! for `seq` is reminiscent of the new bang-patterns proposal for
strictness. Let's see how this compiles. Strictifying all the arguments, GHC produces an
inner loop of:
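Neither the source nor the Core of the actual entry is reproduced here, but the guard trick being described looks roughly like this (the loop body and bound are invented):

<haskell>
module Strictify where

-- seq in operator clothing, so the guard below reads a little like
-- the (then new) bang-patterns proposal.
(!) :: a -> b -> b
(!) = seq

loop :: Double -> Double -> Double
loop k acc
  | k ! acc ! False = undefined   -- never taken: only forces k and acc
  | k > 1000000     = acc
  | otherwise       = loop (k + 1) (acc + k ** (-0.5))
</haskell>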

=== Strength reduction ===

Finally, another trick: manual strength reduction. When I checked the C
entry, it used an integer for the k parameter to the loop, and cast it
to a double for the math each time around, so perhaps we can make it an
Int parameter. Secondly, the alt parameter only has its sign flipped
each time, so perhaps we can factor out the alt / k argument (it's either 1/k
or -1/k), saving a division. Thirdly, (k ** (-0.5)) is just a
slow way of doing a sqrt.
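A sketch of the first and third rewrites (the loop is invented, not the actual entry; the alt / k factoring is specific to the alternating series and is omitted):

<haskell>
-- Before: Double loop counter and a call to (**).
before :: Double -> Double -> Double
before k acc
  | k > 1000000 = acc
  | otherwise   = before (k + 1) (acc + k ** (-0.5))

-- After: Int counter converted once per iteration, and (** (-0.5))
-- replaced by the cheaper 1 / sqrt.
after :: Int -> Double -> Double
after k acc
  | k > 1000000 = acc
  | otherwise   = after (k + 1) (acc + 1 / sqrt (fromIntegral k))
</haskell>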

The resulting entry in fact runs faster than hand-optimised (and vectorised) GCC! It is only slower than
optimised Fortran. Lesson: Haskell can be very, very fast.

So, by carefully tweaking things, we first squished a space leak, and then
gained another 45%.

=== Summary ===

* Manually inspect the Core that is generated.
* Use strictness annotations to ensure loops are unboxed.
* Watch out for optimisations, such as CSE and strength reduction, that are missed.
* Read the generated C for really tight loops.
* Use <tt>-fexcess-precision</tt> and <tt>-optc-ffast-math</tt> for doubles.

== Parameters ==

On x86 (possibly others), adding parameters to a loop is rather
expensive, and it can be a large win to "hide" your parameters in a
mutable array. (Note that this is the kind of thing quite likely to
change between GHC versions, so measure before using this trick!)
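A sketch of the trick (an invented example; as noted above, measure before relying on it):

<haskell>
{-# LANGUAGE BangPatterns #-}
module Params where

import Data.Array.IO

-- Instead of threading the fixed parameters hi and step through every
-- recursive call, park them in a mutable unboxed array and pass only
-- the values that actually change.
sumBy :: Int -> Int -> Int -> IO Int
sumBy lo hi step = do
  params <- newListArray (0, 1) [hi, step] :: IO (IOUArray Int Int)
  let go !acc !i = do
        hi'   <- readArray params 0
        step' <- readArray params 1
        if i > hi' then return acc else go (acc + i) (i + step')
  go 0 lo
</haskell>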

== Pattern matching ==

On rare occasions pattern matching can give improvements in code that
needs to repeatedly take apart data structures.
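The original example is not reproduced here; the flavour of it is taking a structure apart with one pattern instead of repeated accessor calls (an invented example):

<haskell>
data Tree = Leaf | Node Tree Int Tree

-- Pattern matching takes a node apart once per call:
sumTree :: Tree -> Int
sumTree Leaf         = 0
sumTree (Node l x r) = sumTree l + x + sumTree r

-- The accessor-style version re-examines the constructor several
-- times per node:
sumTree' :: Tree -> Int
sumTree' t
  | isLeaf t  = 0
  | otherwise = sumTree' (left t) + value t + sumTree' (right t)
  where
    isLeaf Leaf = True
    isLeaf _    = False
    left  (Node l _ _) = l
    value (Node _ x _) = x
    right (Node _ _ r) = r
</haskell>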

== Arrays ==

If you are using array access and GHC primops, do not be too eager to
use raw Addr#esses; MutableByteArray# is just as fast and frees you
from memory management.

== Memory allocation and arrays ==

When you are allocating arrays, it may help to know a little about GHC's memory allocator. There are lots of details in the GHC Commentary, but here are some useful facts:

For larger objects, GHC has an allocation granularity of 4k: it always uses a multiple of 4k bytes, which can lead to wastage of up to 4k per array. Furthermore, a byte array has some overhead: it needs one word for the heap-cell header and another for the length. So if you allocate a 4k byte array, it actually uses 8k. The trick is therefore to allocate 4k minus the overhead; this is what the Data.ByteString library does.

GHC allocates memory from the OS in units of a "megablock", currently 1Mbyte. So if you allocate a 1Mb array, the storage manager has to allocate 1Mb + overhead, which will cause it to allocate a 2Mb megablock. The surplus will be returned to the system in the form of free blocks, but if all you do is allocate lots of 1Mb arrays, you'll waste about half the space, because there's never enough contiguous free space to contain another 1Mb array. There is a similar problem for 512k arrays: the storage manager allocates a 1Mb block and returns slightly less than half of it as free blocks, so each 512k allocation takes a whole new 1Mb block.
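For example, an allocation-friendly chunk size can be computed like this (the same idea as Data.ByteString's default chunk size):

<haskell>
import Foreign.Storable (sizeOf)

-- One word for the heap-cell header plus one word for the length.
byteArrayOverhead :: Int
byteArrayOverhead = 2 * sizeOf (undefined :: Int)

-- Ask for a little less than 4k, so that payload plus overhead still
-- fits in a single 4k allocation unit.
goodChunkSize :: Int
goodChunkSize = 4 * 1024 - byteArrayOverhead
</haskell>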

== Rewrite rules ==

Algebraic properties in your code might be missed by the GHC optimiser.
You can use user-supplied rewrite rules to
teach the compiler to optimise your code using domain-specific
optimisations.
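For example, a rewrite rule is written with a RULES pragma. This is the classic map-fusion rule from the GHC user's guide; your own rules would state identities that hold for your own functions:

<haskell>
module MyRules where

-- Teach GHC that two consecutive maps can be fused into one,
-- avoiding the intermediate list.
{-# RULES
"map/map"  forall f g xs.  map f (map g xs) = map (f . g) xs
  #-}
</haskell>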