Archive for the ‘decoupling parsing from GHC’ Category

Status summary: Patches are pending to Haddock and GHC [update: they have been included, and released in GHC 6.12.1 and Haddock 2.6.0]. Processing the guts of Haddock comments gets moved out of GHC to Haddock, although recognizing “-- |” etc. is still done by GHC.

In fact, at top-level they’re only recognized stand-alone by GHC (data DocDecl: DocCommentNext, DocCommentNamed, etc.), and Haddock gets to match them up with their definitions (Haddock.Interface.Create.collectDocs). Inside function types, data constructors and record fields, though, they still have to be parsed to a more precise attachment by GHC, which occurs in compiler/parser/Parser.y.pp. But the interiors of the comments can now be parsed by Haddock, at least.

Technically, “parsed” here means lexing, parsing and renaming RdrName->Name. That is not to be confused with Haddock’s later renaming Name->DocName, which tries to figure out where the documentation links for each rendered thing (like Int or Monad or concatMap) should point. Renaming RdrName->Name instead starts with something like “concatMap” or “SomethingSauce.Exts.Int16” and figures out what the original defining module is, given the context in which the RdrName appears. For Haddock, this is determined solely by the module’s GlobalRdrEnv, which contains information like “there was an ‘import GHC.Exts as SomethingSauce.Exts hiding (seq)’” and “this module defines a function or value called ‘foomatic’”. Full renaming needs to be more sophisticated and resolve the right-hand-side ‘x’ in ‘f x = x’ correctly, by looking in more places than the top level, but Haddock comments don’t have any local scope like that. At least not currently.
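To make the RdrName->Name step a bit more concrete, here is a toy model of it. This is not GHC’s actual GlobalRdrEnv or its API: ToyRdrName, ToyName, ToyEnv, resolve and exampleEnv are all names made up for illustration, and the defining modules in the example environment are only plausible guesses.

```haskell
import qualified Data.Map as Map

-- A name as written in a doc comment: an optional qualifier plus an identifier.
-- (Hypothetical stand-in for GHC's RdrName.)
data ToyRdrName = ToyRdrName (Maybe String) String
  deriving (Eq, Ord, Show)

-- A resolved name: the original defining module plus the identifier.
-- (Hypothetical stand-in for GHC's Name.)
data ToyName = ToyName String String
  deriving (Eq, Show)

-- Toy stand-in for a module's top-level environment: everything that is
-- in scope at top level, and where it was originally defined.
type ToyEnv = Map.Map ToyRdrName ToyName

-- Renaming a doc-comment identifier is a single top-level lookup;
-- there is no local scope to search.
resolve :: ToyEnv -> ToyRdrName -> Maybe ToyName
resolve env rdr = Map.lookup rdr env

-- What the environment might hold after
--   import GHC.Exts as SomethingSauce.Exts hiding (seq)
-- plus a local definition of 'foomatic'. The defining modules are
-- illustrative assumptions. 'seq' is absent because it was hidden.
exampleEnv :: ToyEnv
exampleEnv = Map.fromList
  [ (ToyRdrName (Just "SomethingSauce.Exts") "Int16", ToyName "GHC.Int" "Int16")
  , (ToyRdrName Nothing "concatMap", ToyName "GHC.List" "concatMap")
  , (ToyRdrName Nothing "foomatic", ToyName "Main" "foomatic")
  ]
```

In this toy, resolve exampleEnv (ToyRdrName (Just "SomethingSauce.Exts") "Int16") gives Just (ToyName "GHC.Int" "Int16"), while looking up the hidden “SomethingSauce.Exts.seq” gives Nothing.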

General important concepts in Haddock:

There is a data Interface (Haddock.Types) for each module processed. This is computed by a sequence of steps in Haddock.Interface, most of which are in Haddock.Interface.Create. This data Interface is used to render the HTML docs for the module (Haddock.Backends.Html, which uses an old, locally kept version of an HTML combinator library). Instead of or in addition to HTML, it can also make Hoogle info.

This would be simple if every module were self-contained, but they aren’t. Haddock needs to find out about other modules, in order to link to them and re-export things from them. In order to find out about other modules in the group currently being processed (typically a package), Haddock uses a fold over the group’s dependency graph and passes the depended-upon modules’ Interfaces to Haddock.Interface.Create.createInterface. Note that this has consequences for modules that import each other, although I think it might work acceptably/imperfectly in the presence of .hs-boot files. Haddock also loads, in sequence, each of this group of modules using GHC (the GHC API, -package ghc), so GHC can tell us all about them.
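The shape of that fold can be sketched in a few lines. This is a toy under stated assumptions, not Haddock’s code: ToyInterface, toyCreateInterface and buildInterfaces are invented names, the input list is assumed to be already sorted in dependency order, and the “interface” only records which imports had an Interface available when it was built.

```haskell
import qualified Data.Map as Map
import Data.List (foldl')

type ModuleName = String

-- Hypothetical stand-in for Haddock's Interface: the module's name plus
-- the imports whose interfaces were already available when it was built.
data ToyInterface = ToyInterface
  { ifaceModule  :: ModuleName
  , ifaceSawDeps :: [ModuleName]
  } deriving (Eq, Show)

type IfaceMap = Map.Map ModuleName ToyInterface

-- Hypothetical stand-in for createInterface: it is handed the interfaces
-- built so far and can only use the ones that exist.
toyCreateInterface :: IfaceMap -> (ModuleName, [ModuleName]) -> ToyInterface
toyCreateInterface known (name, imports) =
  ToyInterface name [ d | d <- imports, Map.member d known ]

-- Fold over the modules (assumed to be in dependency order), threading
-- the growing map of finished interfaces through each step.
buildInterfaces :: [(ModuleName, [ModuleName])] -> IfaceMap
buildInterfaces = foldl' step Map.empty
  where
    step known m@(name, _) = Map.insert name (toyCreateInterface known m) known
```

With [("A", []), ("B", ["A"]), ("C", ["A", "B"])] the interface for C sees both A and B. With a cycle such as [("X", ["Y"]), ("Y", ["X"])] no ordering works for both: in this toy, X is built without seeing Y, which is the kind of consequence meant above for modules that import each other.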

Cross-package, the situation is more complicated. For tedious reasons having to do with space/time efficiency or ease of implementation or nondeterminism or something, we don’t just save all .hs files and Interfaces and stuff to disk. GHC saves a “.hi” file to disk for each module, which tells it about all exported information that’s relevant to a compiler, and a bit more. (For example, it doesn’t include doc comments or remember whether a data type was declared GADT-style. Probably. There are some weird things it does remember, like whether a constructor was declared infix.) These .hi files are what makes it possible for GHC to compile the modules that you’re haddocking now. Incidentally, they also let us look up the declarations and types in them (with GHC.lookupName), though it’s a conversion effort to turn those into HsDecls (see Haddock.Convert, a new module added by my patches).

Haddock, likewise, has to record some information about a module in parallel to the .hi file. Haddock.Types.InstalledInterface contains this information for each module: it’s a subset of Interface that mostly contains docs, since GHC’s .hi doesn’t save any information about them. (And we’re lazy/stingy so we still depend on GHC for type information, despite its imperfections for our purpose.) When a module is being processed, its Interface is created, and then the InstalledInterface subset is saved to disk. Actually it’s more complicated, because Haddock, unlike GHC, creates a single .haddock interface file for each *group* of modules it processes (see Haddock.InterfaceFile). Then when you haddock a dependent module, Haddock loads those .haddock files and looks for info in them, much as it would look for information in a locally imported module’s Interface (though always by a different, if nearby, code path). At least, hopefully Haddock loads those .haddock files; it has to be told where they are on the command line. Cabal will helpfully do so for you, as long as the depended-upon packages have any Haddock documentation installed!
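As a rough picture of the Interface / InstalledInterface / .haddock-file relationship, here is a toy model. The record fields and function names (ToyInterface, ToyInstalledInterface, ToyInterfaceFile, toInstalled, lookupDoc) are assumptions made up for this sketch; the real types in Haddock.Types and Haddock.InterfaceFile carry much more, and nothing here touches the disk.

```haskell
import qualified Data.Map as Map

-- Hypothetical full interface, only alive while the module is processed.
data ToyInterface = ToyInterface
  { ifaceMod     :: String
  , ifaceExports :: [String]
  , ifaceDocMap  :: Map.Map String String  -- exported name -> doc text
  , ifaceDecls   :: [String]               -- only needed for rendering
  }

-- Hypothetical subset that survives on disk: mostly the docs.
data ToyInstalledInterface = ToyInstalledInterface
  { instMod     :: String
  , instExports :: [String]
  , instDocMap  :: Map.Map String String
  } deriving (Eq, Show)

-- Project the full interface down to the part that gets saved.
toInstalled :: ToyInterface -> ToyInstalledInterface
toInstalled i = ToyInstalledInterface (ifaceMod i) (ifaceExports i) (ifaceDocMap i)

-- One file per *group* of modules (typically a package), not per module.
newtype ToyInterfaceFile = ToyInterfaceFile [ToyInstalledInterface]

-- When documenting a dependent module, search the loaded files for the
-- module in question and then for the doc of the given name.
lookupDoc :: [ToyInterfaceFile] -> String -> String -> Maybe String
lookupDoc files modName name =
  case [ i | ToyInterfaceFile ifaces <- files, i <- ifaces, instMod i == modName ] of
    (i : _) -> Map.lookup name (instDocMap i)
    []      -> Nothing
```

If no interface file was loaded for a package, lookupDoc simply returns Nothing for everything in it, which mirrors why the .haddock locations have to be passed on the command line.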

Okay… I think that’s a general overview for now. Questions? Was I confusing or clear?

I’ve done the first step now! I made a patch that turns the representation of HsDoc in GHC into a FastString rather than a parsed entity, and deleted the parsing code and made it compile. (HaddockModInfo will need to be FastString-ized also.)

(By the way, this means parsing the interiors of the comments. GHC will still be the one to recognize “-- |”, “-- ^” and so forth, for this phase of the project, and to attach them to the parsed declarations.)
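The attachment rule itself is easy to state: a “-- |” comment documents the thing that follows it, and a “-- ^” comment documents the thing that precedes it. A toy version over a flat list of items might look like the following. Item and attachDocs are invented for this sketch; GHC’s parser and Haddock’s collectDocs work on real syntax trees and handle many more cases.

```haskell
-- A flattened view of a module body: doc comments and declaration names.
-- (Hypothetical type; the comment text stays an unparsed String here.)
data Item
  = DocNext String   -- "-- | ..." documents the following declaration
  | DocPrev String   -- "-- ^ ..." documents the preceding declaration
  | Decl String
  deriving (Eq, Show)

-- Pair every declaration with the doc strings attached to it.
attachDocs :: [Item] -> [(String, [String])]
attachDocs = reverse . go [] []
  where
    -- pending: "-- |" comments still waiting for their declaration
    -- acc:     finished declarations, most recent first
    go _pending acc [] = acc
    go pending acc (DocNext d : rest) = go (pending ++ [d]) acc rest
    go pending acc (Decl n : rest) = go [] ((n, pending) : acc) rest
    go pending ((n, ds) : acc) (DocPrev d : rest) =
      go pending ((n, ds ++ [d]) : acc) rest
    -- A "-- ^" with nothing before it has no target; this toy drops it.
    go pending [] (DocPrev _ : rest) = go pending [] rest
```

For example, attachDocs [DocNext "The answer.", Decl "answer", Decl "other", DocPrev "Something else."] yields [("answer", ["The answer."]), ("other", ["Something else."])].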

Next comes the presumably harder part: add support in Haddock! (At least, the final product will need to be full of #ifdefs in order to keep supporting GHC < 6.11 as well.)
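For what such a conditional might look like: GHC’s CPP support defines __GLASGOW_HASKELL__ as an integer (611 for the 6.11 development series, 612 for 6.12.x), so version splits can be written against it. The docSource value below is a made-up placeholder, not something from Haddock; it only marks where the two code paths would diverge.

```haskell
{-# LANGUAGE CPP #-}

-- Hypothetical placeholder: in real code, the two branches would hold the
-- two different ways of getting at a doc comment.
docSource :: String
#if __GLASGOW_HASKELL__ >= 611
-- Newer GHC: the comment arrives as raw text and Haddock parses it.
docSource = "raw comment text, parsed by Haddock"
#else
-- Older GHC: the comment arrives already parsed by GHC.
docSource = "comment already parsed by GHC"
#endif
```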