Noweb 3: What and Why

Norman Ramsey

Don Knuth coined the term "literate programming" to describe the art
of programming primarily for the human reader, and only secondarily
for the machine.
Literate programming is supported by many tools, all of which provide
some way for authors to interleave program source code with
well-typeset documentation.
Most tools also support automatic or semi-automatic cross-referencing
of source code.
Only four or five literate-programming tools are widely used, and
noweb may
be the most widely used of all.
It is certainly the most widely used literate-programming tool that is independent of
the target programming language, and it was the first such tool.

Noweb emphasizes simplicity, extensibility, and
language-independence.
Noweb has the simplest markup of any literate-programming
tool, making it easy for authors to understand the tool and to create
literate programs.
Noweb uses a pipelined architecture, which makes it possible
for expert users to extend the system, in the programming language of
their choice, without recompiling anything.
Users write extensions as Unix programs and
use command-line options to insert them into the noweb pipeline.
Users of noweb have written extensions for prettyprinting,
conditional compilation, language-dependent cross-reference, etc.
The pipelined architecture also makes it easy to support multiple
styles of documentation; noweb is unique in supporting
plain TeX, LaTeX, HTML, and troff.
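
To make the markup concrete, here is a small noweb source file;
documentation chunks begin with "@" and code chunks begin with
"<<chunk name>>=" (the file and chunk names are illustrative only):

    @ The program prints a classic greeting.
    <<hello.c>>=
    #include <stdio.h>
    int main(void) {
        <<print the greeting>>
        return 0;
    }
    <<print the greeting>>=
    printf("hello, world\n");

With noweb 2's tools, typical commands extract the code, typeset the
documentation with different back ends, or splice a user-written stage
into the pipeline ("myfilter" stands for any such program):

    notangle -Rhello.c hello.nw > hello.c        # extract (tangle) the code
    noweave -latex hello.nw > hello.tex          # typeset (weave) for LaTeX
    noweave -html hello.nw > hello.html          # or for HTML
    noweave -filter myfilter -latex hello.nw > hello.tex   # run an extension as a pipeline stage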

Noweb is structured as a collection of C programs, shell
scripts, awk scripts, and Icon programs, connected by Unix
pipelines.
Noweb can be difficult to install; installers may have to
work around bugs in vendors' implementations of awk, and they
must get Icon (available free from the University of Arizona) to
exploit all of the system's capabilities.
Porting noweb to the DOS or Windows platform requires either some
effort to replace shell scripts or the purchase of a commercial shell.
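
For those writing extensions, the stages in these pipelines communicate
through a simple, line-oriented textual representation of the source.
Roughly (the keyword spellings below follow noweb 2's pipeline
representation and may not be exact), the front end markup turns the
hello example above into a stream beginning

    @file hello.nw
    @begin docs 0
    @text The program prints a classic greeting.
    @nl
    @end docs 0
    @begin code 1
    @defn hello.c
    @nl
    @text #include <stdio.h>
    @nl
    ...

and back ends such as nt (tangle) or totex (weave) consume the stream.
An extension is simply a program that reads this representation on
standard input and writes a transformed version on standard output,
which is why nothing need be recompiled.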

Noweb's main competitor in the market for
language-independent literate-programming tools is nuweb,
whose design was inspired by noweb, but which is structured
as a monolithic C program.
As a result, nuweb is not extensible, but it is easy to port,
and it runs quickly.
Noweb can run slowly when it is necessary to fork many
pipeline stages, some of which run in interpreted languages.
Noweb can process nuweb files, but nuweb
users continue to prefer nuweb because of its speed and its ease of
installation.

Noweb's cross-referencing capability extends to HTML; a
reader of a literate program can use a Web browser to click on an
identifier and jump to the identifier's definition (and
documentation).
This capability has proven very useful, but it is limited to single
documents.
When large programs are composed of many separately compiled modules,
it is awkward, to say the least, to process the entire program as a
single document. (Such documents may run to hundreds of pages, even
for a program of modest size, say 10,000 lines.)
Users would much prefer to browse one document per module, and to be
able to follow references between documents, but noweb does
not currently support this model.
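
Today, for a single file, a reader gets those links by weaving with
cross-referencing enabled, for example (the file name is illustrative):

    noweave -index -html pgm.nw > pgm.html   # uses link to definitions, within this page only

There is no corresponding way to make such a link point into a page
woven from a different source file.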

In sum, the three improvements that noweb's users would most like to
see implemented are:

1. Ability to make cross-references between documents.

2. Easier porting and installation.

3. Improved performance.

I would like to make these improvements, and I see three
possible paths.

1. Simple programming improvements.
Rewrite the elements of noweb as components of a monolithic
C program, solving the portability and performance problems.
One would need a little language to control the pipeline and to enable
the insertion of external stages, to retain the ability to extend the
system without recompiling anything.
This path has no research content.
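
Purely as a hypothetical illustration (no such language exists; every
name and keyword below is invented), the little language might amount
to a small configuration file that names each command's pipeline and
marks which stages are external programs:

    # hypothetical pipeline-control file for a monolithic noweb
    tangle = markup | nt
    weave  = markup | extern "/usr/local/lib/noweb/myfilter" | totex
    # "extern <program>" forks an external stage, so the system can
    # still be extended without recompiling the C code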

2. Case study of embedded languages.
There are already a slew of embedded languages on the market,
including Tcl, Perl, Python, Lua, S-Lang, Visual Basic, and several
flavors of Scheme.
I'm not aware of any comparative studies among these languages.
I would love to use the modifications to noweb as a vehicle
for undertaking such a study.
The study would address such questions as:

- How big is the embedded language relative to the application?
  (Is the tail wagging the dog?)

- What's the effect on portability?

- How hard is it to integrate the same functionality in different
  languages?

- Which languages can support a new native type to represent the
  information transmitted down the noweb pipeline?

- What are the bug rates in different implementations?

I'm not exactly sure how to do a good job with such a study, but I
think the results would be interesting to a broad segment of the
research community.

3. Approximate programming environments.
Neither of the paths above addresses the issue of better
cross-reference.
Doing a good job with cross-reference would involve, among other
problems, something like smart recompilation for documents.
More interesting is to see how to build a system that makes a
smooth transition from approximate to complete cross-reference
information.
Noweb version 2 can provide language-dependent cross-reference
information without giving up language-independence by using one of
two mechanisms:

- Have users mark definitions by hand, and find uses with a
  variant of an Aho-Corasick recognizer.
  The variant uses a language-independent algorithm to recognize
  identifiers.

- Write a language-dependent pipeline stage that approximately
  identifies definitions, and use the same recognizer to find uses
  (the use-finding step is sketched just after this list).
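
In noweb 2, the first mechanism is driven by "@ %def" lines that the
author writes after a code chunk, and the second by a command-line
option such as "noweave -autodefs c -index"; both rely on the same
language-independent use finder.
The awk fragment below is only a crude sketch of that use-finding
stage, written against the pipeline representation sketched earlier:
it does naive substring matching rather than the Aho-Corasick
recognition (with proper identifier boundaries) that noweb actually
uses, and the "@index" keyword spellings are assumptions.

    # Pass 1: buffer the stream and collect identifiers marked "@index defn".
    # Pass 2: re-emit the stream, adding an "@index use" line whenever a
    # defined identifier occurs in the text of a code chunk.
    # Naive substring test only; no identifier-boundary check.
    {
        lines[NR] = $0
        if ($1 == "@index" && $2 == "defn") defs[$3] = 1
    }
    END {
        incode = 0
        for (n = 1; n <= NR; n++) {
            line = lines[n]
            print line
            if (line ~ /^@begin code/)    incode = 1
            else if (line ~ /^@end code/) incode = 0
            else if (incode && line ~ /^@text /) {
                text = substr(line, 7)
                for (id in defs)
                    if (index(text, id) > 0)
                        print "@index use " id
            }
        }
    }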

Both these methods are approximate. A third method would be to use
proper language-dependent analysis to compute exact def-use
information, but this would require essentially a compiler front end,
which is about two orders of magnitude more work than an
approximate tool that identifies definitions.
This third method has the additional attraction that it can recognize
declarations, so uses can be connected to declarations, and from there
to definitions.

To follow this path I would:

- Develop a structure that can support all three of these
  cross-reference methods.

- Compare the accuracy of the three methods.
  We could get access to the source code for the Fraser-Hanson book to
  get 5,000-7,000 lines of code that have definitions carefully marked
  by hand.

It should also be possible to find an industrial partner that would
help us discover which cross-reference links are actually used in
practice, so we can find out how consequential the failures of
approximate cross-reference are.
Knowing the value of approximate cross-reference, and developing
better techniques for it, would be helpful for building code browsers
and other tools that go beyond simple literate-programming tools.