Work in Progress

A description of the Haddock comment support in GHC

Haddock comment support was added to GHC by David Waern as part of a Google Summer of Code project. The aim of the project was to port the existing Haddock program to use the GHC API. As a result, GHC now understands Haddock comments (here called doc comments) and makes them available through the GHC API. This is a very rough overview of the implementation.

Usage

To turn this extension on, you supply the -haddock flag on the command line. Then doc comments are lexed, parsed and renamed. They end up both in the parsed and renamed abstract syntax trees. Without the -haddock flag, doc comments are disregarded just like normal comments.

With the flag on, any doc comments appearing where they're not expected will result in parse errors. Any parse errors inside the comments themselves will also result in normal parse errors. No warnings are generated by this extension.

GHC can parse, and the GHC API can return, essentially all the information that the original Haddock obtains from doc comments: module headers and every type of doc comment and option.

Lexer details

In the lexer, a doc comment is recognized as a token. Without the -haddock flag, the lexer won't recognize the doc tokens, and this is what effectively turns off the entire extension.

There are four types of doc comments at this level, each having its own token. Each token contains the entire comment string.
Just like the original Haddock, we support "next" and "previous"-type comments, "named" comments and section headings. The options token is used for specifying Haddock options. Options are specified using a pragma, like this: {-# DOCOPTIONS prune, ignore-exports #-}. You can no longer specify them using dash comments (e.g. -- # prune).
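As an illustration of the comment forms named above, here is a small source fragment that uses each of them. The declarations (Shape, area) and the chunk name "units" are made up for this sketch. The first comment is a section heading (-- *), the second a named comment (-- $), and the declarations carry "next" (-- |) and "previous" (-- ^) comments. Without -haddock these are all ordinary comments, so the fragment compiles either way.

```haskell
-- * Shapes

-- $units
-- All lengths are in metres.

-- | A "next"-style comment: it documents the declaration that follows.
data Shape
  = Circle Double       -- ^ A "previous"-style comment: radius of the circle.
  | Rect Double Double  -- ^ Width and height of the rectangle.

-- | Compute the area of a shape.
area :: Shape -> Double
area (Circle r) = pi * r * r
area (Rect w h) = w * h
```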

Parser details

The doc tokens appear in many places in the grammar, and looking at compiler/parser/Parser.y.pp is probably the best way to get an overview of them.

When a doc token is encountered by the parser, it tries to parse the content of the token. This is done by invoking a special Alex lexer (compiler/parser/HaddockLex.x) and Happy parser (compiler/parser/HaddockParse.y), taken directly from the original Haddock sources. This process turns the token into a value of type HsDoc RdrName, representing the (internal structure of the) comment. It can then be stored in the Haskell AST by the parser at the appropriate place. A lot of places (constructors) in the AST definition (compiler/hsSyn) allow HsDocs, and more can be added.
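To make the idea of turning a comment string into a structured value concrete, here is a toy sketch. It is not the HaddockLex/HaddockParse code, and the Piece type and parsePieces function are hypothetical. It handles only one piece of Haddock markup, an identifier reference written in single quotes, and leaves everything else as plain text.

```haskell
-- A hypothetical, much-reduced stand-in for the structure of a doc comment,
-- parameterised over the identifier type in the same spirit as HsDoc.
data Piece id
  = Text String  -- plain text
  | Ident id     -- an identifier reference, written 'like this' in the comment
  deriving (Eq, Show)

-- Split a comment string into text and quoted identifier references.
parsePieces :: String -> [Piece String]
parsePieces [] = []
parsePieces ('\'' : rest) =
  case break (== '\'') rest of
    -- A closing quote with a non-empty name in between: an identifier.
    (name, '\'' : rest') | not (null name) -> Ident name : parsePieces rest'
    -- Otherwise the quote is ordinary text (for example an apostrophe).
    _ -> prependText "'" (parsePieces rest)
parsePieces s =
  let (txt, rest) = break (== '\'') s
  in prependText txt (parsePieces rest)

-- Merge adjacent text so the result never holds two Text pieces in a row.
prependText :: String -> [Piece id] -> [Piece id]
prependText t (Text u : ps) = Text (t ++ u) : ps
prependText t ps            = Text t : ps
```

For example, parsePieces "See 'area' for details" yields a text piece, an identifier piece for area, and a trailing text piece.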

Binding groups

Before the renaming phase, GHC restructures function definitions into binding groups. This is done by going through the list of HsDecls representing the top-level declarations of the source file and grouping declarations of the same type together.

We do this with the top-level doc comments as well. There's a problem though: an external program must be able to use the GHC API to associate multiple "next" and "prev" style comments with the right Haskell binding. This can be done by looking at the parsed syntax tree, where the file structure is preserved. But by going through this restructuring, the renamed syntax loses this structure. We want to be able to use the renamed syntax, so instead of just grouping the comments together, we let the grouping process return a list of DocEntity values.

An external program can now figure out which doc comment belongs to which "entity", i.e. which Haskell binding. This solution is also used for method declarations in classes.
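The following sketch shows how an external program might do that association. It assumes a simplified, hypothetical Entity type in place of GHC's DocEntity, with bindings represented by their names and comments by plain strings, and it assumes the list preserves source order. "Next" comments attach to the binding that follows them and "prev" comments to the binding that precedes them.

```haskell
-- Hypothetical simplification of an entity list: not GHC's actual definition.
data Entity
  = DeclEntity String  -- a binding, identified by name
  | DocNext String     -- a "next"-style comment
  | DocPrev String     -- a "prev"-style comment
  deriving (Eq, Show)

-- Pair each binding with the comments that belong to it, in source order.
attachDocs :: [Entity] -> [(String, [String])]
attachDocs = reverse . go [] []
  where
    -- pending: "next" comments waiting for a binding
    -- acc:     bindings seen so far, most recent first
    go _ acc [] = acc
    go pending acc (DocNext d : es) = go (pending ++ [d]) acc es
    go pending acc (DeclEntity n : es) = go [] ((n, pending) : acc) es
    go pending ((n, ds) : acc) (DocPrev d : es) = go pending ((n, ds ++ [d]) : acc) es
    -- A "prev" comment with no preceding binding is dropped in this sketch.
    go pending [] (DocPrev _ : es) = go pending [] es
```

Dropping an orphaned "prev" comment is a choice made for this sketch only; trailing "next" comments with no binding after them are likewise discarded.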

The renamer

The doc comments go through the renamer because an HsDoc can contain references to identifiers. Users of the GHC API may need comments in which those references carry the original names (HsDoc Name).
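A minimal sketch of what renaming a comment amounts to, using a reduced model rather than GHC's own types: the Doc type below has only a few constructors, and Name, Env and renameDoc are hypothetical stand-ins. The comment is traversed and every identifier reference is looked up in an environment; unresolved references are kept as they were written.

```haskell
-- Hypothetical, reduced model of a doc comment parameterised over identifiers.
data Doc id
  = DocEmpty
  | DocString String
  | DocIdentifier id
  | DocAppend (Doc id) (Doc id)
  deriving (Eq, Show)

-- Stand-in for a resolved name: defining module plus occurrence name.
data Name = Name { nameModule :: String, nameOcc :: String }
  deriving (Eq, Show)

-- Stand-in for the renamer's environment: names in scope and what they resolve to.
type Env = [(String, Name)]

-- Resolve each identifier reference; Left keeps the text of unresolved ones.
renameDoc :: Env -> Doc String -> Doc (Either String Name)
renameDoc _   DocEmpty          = DocEmpty
renameDoc _   (DocString s)     = DocString s
renameDoc env (DocIdentifier x) = DocIdentifier (maybe (Left x) Right (lookup x env))
renameDoc env (DocAppend a b)   = DocAppend (renameDoc env a) (renameDoc env b)
```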

The GHC API

The doc comments are present in ParsedSource as well as in RenamedSource.

Besides the ordinary comments themselves, three other pieces of information may be of interest: the doc options, the module-specific doc comment and the Haddock module header information. All of them are available in the HsModule data type in the ParsedSource. The last two may contain names of identifiers, so they are also part of the renamed syntax. They can be obtained from the last two elements of the RenamedSource 5-tuple, as seen below.