This is a call to arms: to improve the OCaml compiler, we
sorely need a suite of representative benchmarks. If you have ever
written performance-sensitive programs in OCaml, you can
contribute.

PS: When I originally started to write this blog post, it was about
a small experiment on optimizing a small program for float
unboxing. Writing a general introduction on the few joys and numerous
pitfalls of compiler optimizations suddenly turned it into a totally
different post. Oh well.

Having a simple compiler is a valid design choice

The OCaml native compiler is well-known for its simple and robust
behavior. Thanks to efficient choices of value representation and
a sophisticated garbage collector (GC), it is able to make good OCaml
programs run fast. But it doesn't actually optimize in clever way:
efficient code will give an efficient program, but little is done to
make inefficient code efficient.

This approach has some nice advantages. On the user side, it makes for
a predictable compiler (no hope of the killer optimization kicking in
when it actually doesn't due to a small change). On the developer
side we have a well-written and rather simple post-frontend that is
easy to understand and maintain. Compilation times are also usually
very small: with the additional good support for separate compilation,
compilation times is rarely a problem in OCaml as it can be in some
other compiled languages (cough C++ cough).

Yet this also comes with drawbacks. For example, "inefficient" code is
rarely written by OCaml programmers (smart men and women), of
course :), but is easy to obtain when you're doing automatic code
generation; you may have to be a bit careful if you produce OCaml
programs to not spit out stupid code, while a general whole-program
optimizer could have taken care of this for you.

Another problem is that some abstractions are more costly that we
would like them to be. It's hard to write monadic code, for example,
without generating a lot of (often unnecessary) closures. While the
cost of the additional closure allocations and function calls are
often drowned by the overall monadic logic (for example thread
scheduling in monadic cooperative threading libraries such as Lwt and
Async), fine-grained monad style can still have a sensible overhead.

For example, Arnaud Spiwak
reported
that switching the Coq internal tactics to a pervasive monadic style
incurred sensible overhead (~15% slowdown on the whole Coq archive,
with some individual tactics getting several times slower). Using
Coq's inlining engine to maximally reduce monadic closures, he was
able to get a 10% performance gain over what "idiomatic monadic OCaml
code" produces through the OCaml compiler. (Remark that inlining is
easier to do in Coq thanks to a pure (including total) programming
language with sensible strong reduction semantics.)

Finally, not every programming style and application domain is equal
in the face of OCaml compilation choices. Memory allocation, tail
calls, exceptions and pattern matching on algebraic datatypes are
extremely fast, and code using lot of them (typically: a compiler,
a proof assistant, symbolic computations in general) will have you
extremely satisfied. But this sometimes required compromises that have
downsides for other language features, and the sophistication effort
that went there didn't necessarily find itself reproduced in other
parts of the language features that were of less interest at the time
of the compiler's inception.

As a concrete example, OCaml does not shine at number crunching: its
integers are tagged to make the GC life easier, and its
(double-precision) floating points are boxed to keep a uniform value
representation. The compiler is well aware of these facts and tries to
suppress this overhead when it is easy to do so, but those
optimization are relatively simple and have room for improvement.

This doesn't mean that OCaml should not be used when a problem domain
involves some number crunching: the benefit you can gain from
expressive language features and a reasonable type system may very
well outweigh, in terms of productivity and final code efficiency, the
approximate halving of performance (when compared to C) on numeric
benchmarks. A large number of people pay a sensibly higher performance
price to use less-efficient language everyday, for example Javascript
(even on today's highly-optimized runtimes, that took exponentially
more effort to develop than the OCaml native compiler).

Compiler optimization is an ungrateful business

Just above I've mentioned a few fronts (function inlining, local
unboxing or untagging) on which the OCaml compiler could possibly help
with more aggressive optimizations. There are other specific areas
where compilation choices could be improved. The problem with compiler
optimizations is that there is a large disconnect between what we,
compiler programmers and language geeks, find satisfying to work on,
and the hard reality of actual program performances.

It is easy to spend some months implementing and debugging, say,
a state-of-the-art register allocator, only to find out that it makes
practically no performance difference on the programs you have at
hand. Why then would the maintainer accept a large code change that
certainly introduces bug and doesn't provably improves over the naive
but simple and debugged-by-time current implementation?

If advertising for a change in a compiler with the argument that "the
generated code is much nicer", you're doing it wrong. It is
surprisingly often the case that while the generated code looks much
nicer to the naked eye, it is not actually faster on today's
overly-complicated processors. More importantly, it is generally the
case that said nicer code never actually appears in the hot section of
your program, and therefore produces no observable performance
change. Again, "if it ain't broke, don't fix it", it would be
unreasonable to accept a change that cannot demonstrate a real
improvement on a real program -- while it's generally easy to craft
micro-benchmark that demonstrate impressive improvements.

I have personally experienced this problem in the work on
PR#4800. Currently,
OCaml optimizes tupled assignments such as let (x,y,z) = (foo, bar, baz)
to remove the tuple construction and destruction that would
happen per the naive pattern-matching semantics. However this only
works if the right-hand-side is syntactically a tuple, and would fail
for example on let (x,y) = if foo then (bar, baz) else (foobar, foobaz).

Alain Frisch proposed a patch to optimize this (and some other)
situation, to make this multi-binding style more convenient to
use. I've spent some time reviewing the patch and proposing an
alternative implementation, but we've both been unable to find an
actual real-world example where adding this optimization made
a significant difference.

This is a delicate point because those optimizations are here to make
new abstractions available that were previously avoided. If we don't
observe any difference, maybe it is because OCaml programmers have
avoided that idiom so far, but would make convenient use of it if it
was cost-free. However, if we can't demonstrate a real example where
this optimization is important, it's probably not a good idea to
include it upstream.

We really need a benchmark suite, and you can help

What we really need is a benchmark suite that would contain
representative examples of real-world OCaml code. There is a handful
of benchmarks distributed with the compiler but they don't cover
much. Likewise, the programs in the
Shootout are
too small to be representative of actual uses of OCaml.

This is why we currently use "big software" in OCaml to evaluate
performance changes (eg. bootstrap the OCaml compiler itself, or
compile Coq). Some big OCaml users have internal performance
testsuites (for example Pascal Cuoq, which works on Frama-C,
infrequently contributes performance-improving changes to the
OCaml compiler), but we severely lack an easily-available, informative
benchmark suite to evaluate performance changes.

This is a losing situation because optimization-related changes cannot
be reliably assessed in absence of representative code, and specific
problem domains that we can't measure against will not be improved.

The good thing with benchmark needs is that it's a problem the
community at large can efficiently solve in parallel. If everyone
contributed a piece of code that comes from real-world OCaml software
and is representative of its performance-critical part, we would have
a complete benchmark suite by now.

Feel free to contribute by providing, either here in the blog
comments or by mail (gabriel dot scherer at inria, or at gmail), your
benchmark program(s). Code that has few external dependencies and that
comes with a simple interface to parametrize running time is highly
preferred. Of course, an open source license is required (say MIT/BSD
or LGPL with linking exception).

What is a representative benchmark? A benchmark is representative when
performance changes observed in the benchmark will reliably be
reproducible against (some use cases of) the whole software it was
extracted or inspired from. The changes must be comparable in
magnitude: if a 80% performance change (improvement or degradation) in
the benchmark code results in a 0.5% performance change in the actual
software, it's difficult to rely on benchmark results to make design
compromises.