In 2011, I designed a formally verified C parser for my master's
internship. You can refer to
our ESOP
paper (with François Pottier and Xavier Leroy) to see the
details. Here, I am going to explain what problems (and some of
their solutions) I have to overcome in order to use it
in Compcert, the certified
compiler developed at Gallium in Coq.

The parsing engine, an overview

I am not going to go into the details here, but let's get an overview
of what was my work of 2011. Here is the big picture:

Basically, I added an instrumented Coq backend to the
Menhir parser
generator. It generates a Coq version of the Grammar, together with an
LR(1)-like automaton and a certificate for this automaton. A
validator, written in Coq, checks that this certificate is valid, that
is that the automaton is related, in some sense, to the grammar.

In order to execute the parser, I developed in Coq an interpreter for
LR(1) automatons. It takes a token stream in input and returns a
semantic value.

Then, I give a formal proof of the Coq code. Informally, it says that
if a given automaton has been validated with respect to a grammar,
using some certificate, then I get a correct and complete parser by
using the interpreter and the automaton.

What is a correct and complete parser? Correct means that when the
parser accepts the input, then returned semantic value is a valid
semantic value for the input (which implies that the input word is in
the grammar's language). Complete means that if the input word has
some semantic value, then the parser will return this semantic value.

One advantage of this approach is that I do not need to prove the
generator itself, which is a pretty complicated task. The validation
process is not as complicated, as some simple information over the
invariants of the automaton is enough. Moreover, one can change the
variant of LR(1) automaton (LR(0), LALR, SLR, pager-LR(1), canonical
LR(1)...) one wants to use without much work. Note that it does not
suffers the typical drawback of translation validation in verified
compilation. Typically, the validator performs at run time, and it
does not need to validate its input. When it does not, either
because it is incomplete or because its input is invalid, the
compiler has to fail unexpectedly. Here, as the validation is done
at parser generation time, we are sure, once it is generated, that
it will perform well.

In order to get more details on this parts, please refer to our
ESOP
paper.

When theory goes into practice

My final aim when I began this work was to integrate it into
Compcert. Besides, before writing the paper, I implemented a
proof-of-concept version of Compcert including a verified
parser. However, this parser had several more or less critical
downsides:

The context sensitivity of the C grammar makes it impossible to
parse it using pure LR(1) technology. This
good blog
article explains the problem if you are not aware of
it. Without entering into the details, our solution worked, but
was not fully satisfactory, because the "lexer" was so complicated
that it included a full non-certified parser, which could, if
buggy, in some corner cases, introduce bugs in the whole parser.

Parser is a part of a compiler where the performance matters,
because it manipulates quite a lot of data. Our approach appears
not to be disastrous in terms of performances, but still, in our
first implementation, the parser and lexer ran 5 times slower than
the previous non-verified Menhir-generated parser.

It is much more convenient to describe a grammar in the Menhir's
description language than directly in Coq. However, the validation
of the automaton with respect to a grammar must be done using a
Coq form of the grammar. So, actually, one must be confident over
the translation of the grammar from Menhir form to Coq form. I
tried to make this translation very readable so that someone could
check it by hand, but still, it is not very satisfactory

Performance of the extracted code

Internalizing types the right way

After some benchmarking, it appeared that the main bottleneck was
the polymorphic hash function of OCaml. Why? Simply because the Coq
code needed to manipulate some data that had a type I did not want
to formalize in Coq. More details below.

This problem appears typically when you want to manipulate
identifiers in Coq: as the naive encoding of strings as lists of
characters in Coq is terribly slow, the typical way of using them as
identifiers in a Coq code is to build a hash map between some
positive numbers and all the identifiers appearing in the source
code. The construction of this hash map is done in some OCaml
trusted code. Then you can do "efficiently" multiple things on
identifiers in the Coq code: you can compare them, you can build
maps using them as keys...

However, in my case, I used this technique on the type of location
information (that is, the data associated to most of the nodes of
the AST to indicate where it comes from). It was very inefficient,
because I needed to hash one location information for every very
token. Moreover, I do not need to do complex computations on these
data. So I figured out that I could use an "abstract type" for these
data: I declare it as a Parameter in Coq, and I force
extraction to replace it by the right OCaml type. This improved
performances a lot.

Using Int31

The other bottleneck was dynamic checks: in order to avoid using
too much dependent types in the interpreter (there are already a few
of them, and the proofs are complicated enough like that), I put
some dynamic checks. I have proved that they will never fail, so
there is no problem of correctness here. But still, they exist and
they are executed in the extracted code. It was very frustrating
because they were consuming quite a lot of time, but I knew they
were not useful at all.

These dynamic checks typically verify that the symbols present at
the top of the stack correspond to what is expected. This is, from
an OCaml point of view, very simple: they consists in comparison of
two elements of inductive types of hundreds of constructors without
parameters.

However, from a Coq point of view, this is harder: in order to
compare values of a type, one needs to provide a decision procedure
for equality over this type. Naively, the size of such a decision
procedure is (at least) quadratic over the size of the inductive
type. Let's take, for example, the case of this simple inductive
type:

This code has quadratic size in the number of constructors. The
extraction is smart enough to factorize cases and get some linear
length code, but the Coq kernel has to deal with these big
terms. Actually, the problem is even worse than it seems: in order
to prove that this term is correct directly, it seems necessary to
give a term of cubic size...

This is not tractable in our case: the types of data we are
comparing typically have a few hundreds of constructors, and the
generated term would be enormous, and would take ages to typecheck.
A smarter solution is to provide an injection from our type to
another Coq type for which I can decide equality easily. Typically,
I can provide an injection to the type positive, and
that is what I did. This is quite inefficient, because at each
comparison, OCaml has to find the corresponding positives, and then
compare them bits after bits.

The optimization I used here was to use 31 bits integers provided
in the Coq standard library, instead of positives, but it was not as
easy as it seemed. Indeed, these integers are internally represented
as 31 bits OCaml integers when using the vm_compute
tactic, achieving good performance. But they are extracted to OCaml
to a bit by bit representation, which have not much better
performance than positives. So I had to teach Coq to
extract int31 to OCaml's int (which is
either 31 bits long or 63 bits long, depending on the
architecture). A final care was necessary to make sure that the
int31 literals in the Coq source code were really converted to
a constant after the whole compilation process (that was not
obvious!).

Storing the automaton tables

After these optimization, the performance are about only 2.5 to 3
times slower than the non-certified parser. Based on my
benchmarks, one of the bottleneck is the representation of the
tables of the automaton: I currently use big match
constructs, which does not seem to be as efficient as static
arrays, which I do not have in Coq.

However, I don' know a satisfactory solution here (maybe you do?):
I think I will let it as is, as the performances are good
enough.