An Indentation Engine for OCaml

Since our last activity report we have released the first stable versions of two projects: OPAM, an installation manager for OCaml source packages, and ocp-indent, an indentation tool.

We have already described the basics of OPAM in two precedent blogposts, so today we will focus on the release of ocp-indent.

Indentation should be consistent across editors

When you work on a very large code-base, it is crucial to keep a consistent indentation scheme. This is not only good for code review purposes (when the indentation carries semantic properties) but also when your code is starting to evolve and when the one who makes the change is not the one who wrote the initial piece of code. In the latter case, the variety of editors and local configurations usually leads to lot of small changes carrying no semantic value at all (such as changing tabs to spaces, adding few spaces at the beginning or end of lines, and so on). This semantic noise considerably decreases the efficiency of any code-review and change process and is usually very good at hiding hard-to-track bugs in your code-base.

A few months ago, the solutions for OCaml to this indentation problem were limited. For instance, you could write coding guidelines and hope that all the developers in your project would follow them. If you wanted to be more systematic, you could create and share a common configuration file for some popular editors (most OCaml developers use the emacs' tuareg-mode or vim) but it is very hard to get consistent indentation result across multiple tools. Moreover, having to rely on a specific editor mode means that it is harder to fully automatize the indentation process, for instance when setting-up a VCS hook.

In order to overcome these pitfalls, Jane Street asked us to design a new external tool with the following high-level specification:

it should be easy to use inside and outside any editor;

it should understand the OCaml semantics and reflect it in the indentation;

it should be easy to maintain and to extend;

So we started to look at the OCaml tools' ecosystem and we found an early prototype of Jun Furuse's ocaml-indent. The foundation looked great but the result on real-world code sources was not as nice as it could be, so we decided to start from this base to build our new tool, that we called ocp-indent. Today, ocp-indent and ocaml-indent do not have much code in common anymore, but the global architecture of the system remains the same.

Writing an indentation engine for OCaml

An indentation engine may seem like a rather simple problem: given any line in the program, we want to compute its indentation level, based on the code structure.

It turns out to be much more difficult than that, mainly because indentation is only marginally semantic, and, worse, is a matter of taste and "proper layout". In short, it's not a problem that can be expressed concisely, because one really does want lots of specific cases handled "nicely", depending on the on-screen layout -- position of line breaks -- rather than the semantic structure. Ocp-indent does contain lots of ad-hoc logic for such cases. To make things harder, the OCaml syntax is known to be difficult to handle, with a few ambiguities.

Indent process

Ocp-indent processes code in a simple and efficient way:

We lex the input with a modified version of the OCaml lexer, to guarantee complete consistency with OCaml itself. The parser had to be modified to be more robust (ocaml fails on errors, the indentation tool should not) and to keep tokens like comments, quotations, and, in the latest version, some ocamldoc block delimiters.

Taking the token stream as input, we maintain a "block" stack that keeps informations like the kinds of blocks we have been through to get to the cursor position, the column and the indentation parameters. For instance, the "block" stack [KBody KFfun; KLet; KBody KModule] corresponds to the position of X in the following piece of (pseudo-) code:

...moduleFoo=struct...let f =fun a ->X

Each token may look up the stack to find its starting counterpart (in will look for KLet, etc.), or disambiguate (= will look for KLet, stopping on opening tokens like KBracket, and will be inserted as an operator if none is found). This is flexible enough to allow for "breaking" the stack when incorrect grammar is found. For example, the unclosed paren in module let x = ( end should not break indent after the end. Great care was taken in deciding what tokens should be able to remove from the stack in which conditions.

The stack can also be used to find a token that we want to align on, typically bars | in a pattern-matching.

On every line break, the stack can be used to compute the indentation of the next line.

In the case of partial file indentation (typically, reindenting one line or a single block), on lines that shouldn't be reindented the stack is reversely updated to adapt to the current indentation.

Priorities

The part where some abstraction can be put into the engine is the knowledge of the semantics, and more precisely of the scope of the operations. It's also in that case that the indenter can help you write, and not only read, your code. On that matter, ocp-indent has a knowledge of the precedence of operators and constructs that is used to know how far to unwind the stack, and what to align on. For example, a ; will flush function applications and most operators.

It is that part that gives it the most edge over tuareg, and avoids semantically incorrect indents. All infix operators are defined with a priority, a kind of indentation (indentation increment or alignment over the above concerned expression), and an indentation value (number of spaces to add). So for example most operators have a priority lower than function application, but not ., which yields correct results for:

let f =
somefun
record.
field
y
+ z

Boolean operators like && and || are setup for alignment instead of indentation:

let r = a
|| b
&& c
|| d

Additionally, some special operators are wanted with a negative alignment in some cases. This is also handled in a generic way by the engine. In particular, this is the case for ; or |:

type t =A|Blet r ={ f1 = x
; f2 = y
}

A note on the integration in editors

ocp-indent can be used on the command-line to reindent whole files (or part of them with --lines), but the most common use of an indenter is from an editor. If you are lucky enough to be able to call OCaml code from your editor, you can use it directly as a library, but otherwise, the preferred way is to use the option --numeric: instead of reprinting the file reindented, it will only output indentation levels, which you can then process from your editor (for instance, using indent-line-to with emacs). That should be cheaper and will help preserve cursor position, etc.

Currently, a simple emacs binding working on either the ocaml or the tuareg mode is provided, together with a vim mode contributed by Raphaël Proust and David Powers.

Results

We've built ocp-indent based on a growing collection of unit-tests. If you find an indentation bug, feel free to send us a code snippet that we will incorporate into our test suite.

Our tests clearly show that the deep understanding that ocp-indent has of the OCaml syntax makes it shines on specific cases. We are still discussing and evaluating the implementation of few corner-cases related, see for instance the currently failing tests.

We have also run some benchmarks on real code-bases and the result is quite conclusive: ocp-indent is always better than tuareg! This is a very nice result as most of the existing source files are either indented manually or are following tuareg standards. But ocp-indent is also orders of magnitude faster, which means you can integrate it seamlessly into any automatic process.