I recently started an internship in Gallium, working on the HeVeA
LaTeX to HTML compiler from and with Luc Maranget. I'll stay here for
two months, which is sufficient for having to "blog about it". In my
civil life, I'm interested in about approximately everything
concerning language(-related) theory.

My first job for HeVeA was changing the structure used by the Out
module. It was yielding the ouput by concatenating together bare
strings, thus blowing up the maximal size of strings on 32 bits
machines. A data structure fit for concatenations was therefore
needed, and a good choice for it is Ropes.

Ropes

Basically, ropes are just balanced binary-trees, the leaves being
strings. Each node is also weighted for example by the length of the
rope which it is the root of (it could also be by the length of its left
child, which makes no difference but administrative ones). The leaves
are designed to be small enough to see the difference, and big enough
to avoid too deep trees (which could have a negative effect on
performances). Typically, the maximum length of leaves is 256, and the
height is only limited by integer size in OCaml.

A rope for the string "Gagallium" could be :

9
__|__
/ \
5 4
/ \ / \
Ga gal li um

On this data structure, we win on complexity against bare strings for
most costly operations, while losing with O(log n) vs. O(1) for
a few ones (like reading the character at a given position). Let's just
link Wikipedia
for the administrative details. Bottom line is, on big strings we
positively gain something (more than simply being able to run HeVeA on
old machines).

In HeVeA

The main place in HeVeA where this structure finds its use is in a file
called out.ml. It provides functions to manipulate "buffers", which
are called by the html-generating function to produce the final output.
Before my intervention, the buffer type looked like :

buff is the string containing the output, bp is the position of the
cursor in this string, and len is String.length buff (probably for
optimization reasons). When you wrote something to the buff, it is
blitted at bp position, and when you retrieve something (for example
to put it in the final output), you get all the part between the
beginning and bp. Chan and Null are used for unknown (at my global
understanding level) purposes and are not changed with ropes.

As HeVeA globally works a lot by mutability, I use a reference to a
rope, which is an immutable structure. All I had to do then was changing
the code of the module function by function, which was quite straightforward
once I understood the exact meaning of the old code. Well, I still had
strange bugs, mainly for reading too quickly the former version, but
also for unexpected behaviours from external modules which I did not
really know about (like Format from the standard library, which does
strange things with formatters; you should read carefully about it before
thinking you know how to use them).

The result

In conclusion, HeVeA is now 20% slower, which is quite a good
result considering the optimizations which were done before (and the way
we totally did not care about it when using ropes). It will probably
stay as it is concerning this point, because it is quick enough for main
purpose (namely less than one second on reasonable documents). But if it
has to be sped up one day, it would be possible due to the stronger
architecture of the Out module, and by using ropes on the other parts of
the software, instead of converting from rope to (small) string and
reciprocally on many places.