Writing Performance Sensitive OCaml Code

1) OCaml Performance

notes on reading OCaml assembly code so that you may understand
what your own OCaml programs do on the CPU.

When possible have tried to motivate these with snippets of assembly
code generated by ocamlopt.

As a caveat, performance optimizations are sometimes at odds with
desirable programming styles such as preserving abstraction boundaries
and higher order functional patterns. Hand inlining code across
abstraction boundaries and avoiding closure allocations buy you some
performance, however these also make code brittle and inflexible. One
should do these sorts of optimizations sparingly and only in the heart
of tight loops.

Possibly the most useful performance tuning tip is to identify the
places where your program spends most time and see if there are high
level algorithmic approaches and changes in representation of data
that can speed up things. These high level approaches will usually buy
you more than machine level optimizations. Resort to low level tuning
described in this document only after you have tried the former.

1.1) Compiling

Most of the discussion below assumes a 64-bit platform. Assembly code
presented here is x86_64 asm generated by using ocamlopt with the
compiler flags -S, -inline 20 and -nodynlink on a 64-bit Linux
platform.

You can add the -S flag, which instructs the compiler to generate
assembly code, by adding

OCAMLOPTFLAGS += -S

to your local OMakefile.

It is best to generate assembly code with full inlining and other
optimizations that the compiler supports. Even thought inlining and
optimization might make the code a bit harder to read, it will give
you a more accurate picture of what executes on the CPU. Profiling
tool like gprof where the inlining and such only serves to confuse
the profiler.

2) Reading OCaml Assembly

This section introduces the minimal assembly instruction patterns that
one must know to follow some of the cost arguments later. This is not
a general introduction to x86 assembly.

2.1) Named functions are mangled but recognizable.

In the file called test_asm.ml the function f

let f a = a+1

The generated asm looks like the following:

camlTest_asm__f_33700:
.L119:
addq $2, %rax
ret

Note that the file/module name and the name of the function appear in
the generated label. An anonymous function however will just have some
numeric label making it difficult to locate.

2.2) Memory allocation

When OCaml allocates memory it decrements the "heap top" which is
typically in %r15 followed by an overflow check. If there is an
overflow it calls the GC and the allocation is retried.

The snippet below allocates 24 bytes (three 64-bit words).

.L121: subq $24, %r15
cmpq caml_young_limit(%rip), %r15
jb .L122

This is the code at label .L122 that has the call to the GC followed
by the jump to retry allocation with the new %r15 and young limit
value after garbage collection.
.L122: call caml_call_gc
.L123: jmp .L121

2.3) Boxed Floats and Immediate ints

OCaml uses IEEE 64-bit floats and are usually heap allocated
(ie. boxed). The main reason for this is that OCaml needs to tag its
values such that the GC can identify them at runtime. OCaml allocates
floats on the heap using two words - the first word is a tag that says
that the next is a float.

In asm the following snippet indicates a typical boxing operation
where the actual float value is in the register %xmm0. Looking out
for these when trying to track down boxed floats in your code.

First allocate the 2 words.

.L120: subq $16, %r15
cmpq caml_young_limit(%rip), %r15
jb .L121

Assign the tag value (the first word).

leaq 8(%r15), %rax
movq $1277, -8(%rax)

Assign the actual value (the second word).

movlpd %xmm0, (%rax)-

NOTE: An important thing to note is that a pointer to a heap allocated
value does not point to the tag word but the actual first value. This
is why the "load-effective-address" (lea) instruction adds the offset
of 8 bytes. This is true for all heap allocated values, not just
floats.

Unlike floats, ints are represented as immediates, i.e., they are not
allocated on the heap. To tag ints OCaml steals the LSB bit of an int
and sets it to 1 always. This means that we get 63 bits of precision
in OCaml ints and any int value n will appear in the assembly as 2*n+1
(the LSB is always 1 and the bits of the number shifted one place in).

2.4) Tuples

Tuples are heap allocated. A pair will cause the allocation of 3
words: one word to indicate that it is a pair and two words for the
actual values.

If a tuple contains floats, the floats will be boxed. If one wishes to
return a pair of floats, a record containing the two floats may be more
efficient than using a tuple - more on this later.

2.5) Option types

Option types are heap allocated. Passing a value as an option will
cause memory allocation. The value None is however immediate and
represented as 1, the same as the representation of the int value 0.

2.6) Write barriers

A write barrier is a basically a call to caml_modify and looks like
the following:

call caml_modify

This webpage
(pdf)
and
this one
(pdf)
give a reasonable overview of what caml_modify does. Any introductory text on
garbage collection should give you some insight into why write
barriers exist.

2.7) Asserts are helpful in reading asm.

If you have a large function and wish to see how a particular part of
the function translates into asm, a simple way to track down the
fragment in assembly is to put asserts around it. This is just a hack,
but in practice it saves you some time reading asm.

Sometimes OCaml tends to optimize away provably correct asserts - this
is something to watch out for. In the generated asm, the presence of
the assert is indicated by:

leaq caml_exn_Assert_failure(%rip), %rbx

Hence if you are looking for a snippet of code in function f, first
find f and then look for the asserts that you put in it. A more
general method is the line numbers methods below:

2.8) Approximating line numbers in OCaml assembly

Assembly generated by OCaml does not have source line numbers. Having
line numbers is useful when comparing the generated asm back to OCaml
source. This hack lets one simulate have source line numbers in
generated assembly. This relies on the fact that OCaml does not inline
recursive functions.

To start, we define a handy set of recursive definitions to act as
line number markers:

Note that the .loc annotation also gives us information about where
the function being called is defined in the original source file.
This approach to annotating the asm makes the effects of inlining
operations and such done by the compiler evident in the assembly.

3) Cost of OCaml constructs

3.1) Cost of memory allocation

Having seen the OCaml memory allocation instructions its easy to see
that memory allocation itself is cheap - its just one addition and a
test, in the common case.

The real cost of memory allocation comes during GC time. The more
memory you allocate, even though the allocations are cheap, the more
time your program will spend in the GC. So a general rule of thumb for
speeding up programs is to reduce allocation rates. Useful reading
includes:

3.2) Closure allocation

One significant source of memory allocation is the allocation of
closures. Closure allocation basically involves allocating memory for
each of the free variables of the function and for a code (memory
operations are more expensive than register operations) and pointer.

If a function is called just once or not called at all, try and
restructure your code to not allocate the closure. For example: if
one is going to fold or map over a list that is frequently empty,
testing to see if the list is empty before doing the map operation
will avoid the closure allocation.

To see if a function will be inlined or not you can look at the output
of ocamlobjinfo. The inlining of a function is decided at
declaration time, so a function is either always inlined, either
never. Unless the .cmx file containing it is not available when
compiling a unit using this function.

3.4) Stack allocation of arguments

Only a small number of arguments are passed in registers to
functions - the actual number varies according to platform and
possibly other optimizations that the compiler performs.

The remainder of the arguments have to passed via the stack. Hence
functions that receive a large number of arguments will involve
several memory operations for processing the callee saved register and
then the stack allocation of formal arguments. In cases where it is
feasible, breaking up the called function into smaller functions will
allow for the arguments to be passed in registers and in possibly
allow for inlining of the resulting smaller functions.

3.5) Polymorphic compare

OCaml provides a polymorphic compare caml_compare for datatypes. It
works in much the same way the GC does by looking at the tags on
objects to determine their types at runtime and then using the
appropriate compare functions. This is more expensive than using a
compare operation written specifically for the actual type (for
example caml_float_compare for comparing floats).

3.6) Inlining and polymorphic compare

In the following snippets the compiler can infer that the compare is
only float comparison, however the inliner does not specialize the
comparison operation.

let cmp_float2 (a : float) (b : float) =
cmp a b

Note that the inlining has happened as expected, but no specialization:
camlTest_asm__cmp_float2_33676:
subq $8, %rsp
movq %rax, %rdi
movq %rbx, %rsi
leaq caml_compare(%rip), %rax
call caml_c_call
addq $8, %rsp
ret
Doing just renaming also does not buy us anything, it is just as bad.

3.7) Float boxing. Time is usually represented as a float.

Why are floats boxed/unboxed? GC, IEEE format, SSE instructions,
tagging... this is a long story and one should look for descriptions
of this online. Here are some general rules about when floats occur
boxed and unboxed:

Float arguments to (non-inlined) functions are boxed.

Float return values from (non-inlined) functions are boxed.

Floats have to be unboxed for doing calculations - they usually use
the SSE instructions of the CPU.

An array of floats uses an immediate representation - ie. each
float is not boxed.

A record of consisting only of floats uses an immediate
representation. This does not apply to polymorphic records that get
specialized to floats. This does not apply to abstract types that
are internally defined to be floats.

An record that has floats and non-floats has the floats boxed.

A tuple of floats has the individual floats boxed.

3.8) Float compare

Compare for floats caml_float_compare requires the floats to be
boxed. This can be terrible when you have immediate floats.

3.9) Mutable fields in records

Mutations have an additional cost when the mutated value is heap
allocated - ie. when the value is a boxed float, a list, a tuple, a
record etc.

These mutations involve calls to the caml write barrier. A simple
benchmark shows that invoking the write barrier repeatedly makes thing
about 5-6 times slower. Consider the following snippet:
Consider:

type t = {
x : int;
mutable y : float;
}
let set_y t y =
t.y

Which generates the following. Here there are calls to the write
barrier.
camlTest_asm__set_y_33718:
subq $8, %rsp
.L136:
addq $8, %rax
movq %rax, %rdi
movq %rbx, %rsi
call caml_modify
movq $1, %rax
addq $8, %rsp
ret
Contrast the above with the following. Here the only difference is the
type of x which causes the float to have an immediate representation
(remember that in mixed records floats are boxed, whereas they are not
in float-only records).

type t = {
x : float;
mutable y : float;
}
let set_y t y =
t.y

Which generates:
camlTest_asm__set_y_33727:
.L137:
movlpd (%rbx), %xmm0
movlpd %xmm0, 8(%rax)
movq $1, %rax
ret
Changing representation of your data to avoid calls to the write
barrier will help performance. As a side effect, your code may have
the additional overhead of introducing unboxing and boxing
instructions and so the next result may or may not be advantageous -
one has to profile to sure.

3.10) Mutable single-field float records can outperform float refs.

As a direct consequence of the above, float ref represents the float
as a boxed value. Mutating the ref causes calls to the write
barrier. The following type does not:

type float_ref = {
mutable x : float;
}

While it is tempting to try and represent floats in an immediate way
as much as possible, however this sometimes hinder performance rather
than help it. The fact that function arguments/returns are boxed can
cause repeated boxing and unboxing as the floats are passed around.

Note: A useful experiment to try is to use the above float_ref type
everywhere that one uses floats and explicitly cast to float only when
one needs to do calculations. With respect to memory layout,
float_ref is essentially the same as a boxed float but we explicitly
have to insert the boxing and unboxing operations - making these
explicit can help us optimize away needless cases and save on
mutations.

3.11) Inlining avoids boxing (sometimes).

If the following function is inlined the argument and return value are
not boxed.

let f x = x + 0.1

However, in the following function it is possible that x is used as
a boxed float sometimes, the OCaml compiler will always box the float
on entry to the function.

let f x =
if rarely_true_flag
then ls := (x :: !ls);
x + 0.1

NOTE: This could be something that can be fixed in the compiler by
pushing the box allocation into the conditional.

3.12) Local module definitions are not free

Local module definitions have a runtime cost - they are not purely
syntactic.

let mod_test x =
let module M = T1.T2 in
M.f x

Generates:
camlTest_asm__mod_test_33742:
movq camlTest_asm + 120(%rip), %rbx
movq 8(%rbx), %rbx
movq (%rax), %rax
ret
Notice that the value in %rbx has no real purpose. The inlined call
to M.f is only the instruction movq (%rax), %rax. If we write:

module M = T1.T2
let mod_test x =
M.f x

We have:
camlTest_asm__mod_test_33745:
movq (%rax), %rax
ret

3.13) Functor application is a runtime operation.

Generates:
call camlTest_asm__S_33743
The name of the functor S shows up as a function name. There is not
much one can do about this other than to move functor applications
outside function definitions such that they are not repeatedly
applied.

The compiler does not always do this. I don't know why - one should
check.

NOTE: The above code is not as optimal as it could be - in particular
the anonymous function itself is not inlined. Even if the inlining is
not done it could still be better by moving the functional address
lookup inside the conditional and avoiding the assignment to
%rax. We should ask the compiler folk to fix this.

One should hand-inline the above code when perf matters.

let opt_iter opt =
match opt with
Some x -> x_ref := x
None -> ()

3.17) Optional arguments are not free even when inlined.

When a function with optional arguments (and default values) is marked
as an inlined function, the inlining does not eliminate the cost of
allocating the intermediate option type. Details below.

1) The call to f2 has indeed been inlined. However the inlining has
only saved us the cost of making a function call (roughly, the cost of
doing one jump). It has not optimized away the cost of the optional
arguments which can now be fully determined at compile time.

2) The argument ~i1:200 is allocated at compile time to the constant
Some 200. At runtime the compiler checks this constant to see if
it is a None or Some.

3) The argument ~i2 is allocated to a Some i2 at runtime and is
immediately checked for at runtime to see if it None or Some.

This should really be fixed in the compiler because optional arguments
that are properly inlined are a very valuable feature.

4) Misc

Some misc perf notes.

1) glibc pow() and the OCaml ** operator.

OCaml's ** operator calls pow() in glibc. Until glibc version
2.12, the execution time of pow() from the standard math lib varies
significantly depending on the numeric values of arguments. This
behavior is apparently fixed in glibc 2.14. Consider:

The typical execution time for this function is 60-70ns, but in some
specific cases it can be as bad as 400us. One workaround for the
problem is to reduce the precision of x in pow(x, y), by doing
(Float.round_nearest (x *. 1e8)) *. 1e-8..

There is surprisingly little about this issue on the Internet. Here
are two somewhat relevant links:

2) Strings, Bigstrings and the OCaml runtime lock.

In the following code, the OCaml runtime lock is being held while the
time consuming LZ4 compression routine is run. The OCaml runtime lock
forces all calls into the C code to be sequential, since it behaves
like a critical section.

The snippet is faster in the sense that it allows the OCaml runtime to
schedule other threads that may use the runtime lock. For example,
this may be the case with an Async program where many jobs are trying
to compress strings and multiple such compression routines can happen
on different worker threads.

This does not work with strings. In particular, doing the following is
unsafe:

The reason is that OCaml strings live on the OCaml heap. By releasing
the runtime lock, we allow the OCaml GC to run. If the OCaml GC moves
the string in memory while the C code is using it, then things fail.

This is safe to do with Bigstrings since they are allocated on the C
heap. The OCaml GC never moves them. The GC may move the stub that
point to the C heap, but the values of the src_c and dst_c pointers
always point to the right data.

3) Fast, Slow and Incorrect Array blits

Array blits in OCaml, in the general case, are slow -- i.e. they are
not merely memmove/memcpy calls as one might assume.

Array.blit as provided by Core and also supplied by OCaml call
caml_array_blit provided by the runtime. In the OCaml runtime,
there are arrays tagged as "double arrays" -- these are typically
float arrays. Here are the rules:

1) caml_array_blit does a memmove for double arrays. (For float
arrays)
2) caml_array_blit does a memmove for arrays in the minor
heap. (i.e. for arrays created recently).
3) For every other array type there is a for-loop that calls
caml_modify.

Here, caml_modify is OCaml's write barrier (see notes earlier about
this). Blits that fall under case 3 are an order of magnitude
slower. Long lived int arrays, bool arrays and arrays of other simple
immediate value representations will fall under the third case thereby
incurring the cost of calling caml_modify.

1) Compiler patching

It may be possible to patch the compiler to create arrays with
representation guarantees at initialization time, such they are not
conservatively treated as arrays of mutable references. We don't have
this yet.

2) Use Bigarrays

One can use Bigarrays instead of regular OCaml arrays for immediate
values. One approach to blitting Bigarrays is to use the provided blit
function:

This is indeed a fast blit and works between Bigarrays of immediate
types such as floats, ints, bools, chars etc. Further, for large
enough blits one can give up the caml runtime lock, thereby allow
other ocaml code to run in parallel with the blit. (see Section on the
OCaml runtime lock.)

To blit from a bigarray to an OCaml array one would do something like:

The above snippet works well for Bigarrays of floats but not for much
else, i.e. this approach does not work well if one needs to blit
values our from Bigarrays to regular OCaml arrays of any type.

The reason this is so is that the when assigning a value into a
Bigarray, the library changes the values representation from its OCaml
representation to a representation compatible with native C
code. (Bigarrays were after all originally designed for interop with C
code). In the case of floats, OCaml does change the word level
representation of floats and so the above snippet words fine. Hence
the following function works as expected:

However for ints and other OCaml immediate types, the OCaml
representation involves setting some tag bits. When values are
assigned to Bigarrays, "bigarr.{i} > 1. When values are read back using the getter
provided, "<- bigarr.{i}", the tag bits are reconstructed:
((intnat)(x) << 1) + 1.

If one were to blit/memove the values from Bigarrays to OCaml arrays,
the tag bits are not reconstructed causing numbers to be garbled and
confusing the GC. The following does work however, but it is slow:

3) Use FFI and build your own.

The most direct solution is to write your own FFI and call a memmove
in C. To ensure that this is not misused for array types that may have
references, the types should be fixed to the concrete types that this is
allowed for.

While this gives us fast blits on immediate arrays, we do lose the
ability to give up the caml runtime lock for large blits.

4) Time.now() and RDTSC

What may arguably be the best approach to timing OCaml code is to
manually instrument your code using Time.now. One can experiment with
various configurations of the code by commenting out parts of the code
to understand the relative costs of particular functions.
The downside to using Time.now() is that (1) it is an expensive call
and (2) it has limited resolution and is hence not useful in measuring
snippets that may only take milliseconds.

RDTSC is a light weight CPU counter which can be used instead of
Time.now(). It gives a measure of CPU cycles elapsed between
subsequent calls and it useful for timing.