Writing nndoc digest definitions

Introduction

One of the neatest features of the Gnus mail and newsreading package
for Emacs is its ability to expand digests into individual messages
that can be read with the full power of the newsreader. What's really
cool about the feature is that it's
extensible: you can write rules to describe new digest formats.
That's especially handy in the modern world, where too many publishers
think that RFC 1153
shouldn't apply to them.

The downside is that it's not at all easy to write these rules. The
documentation is terse, to say the least, and when you screw up, the
error messages are monumentally unhelpful. This Web page is an
attempt to rectify that situation by teaching you how to write and
debug digest rules. I also provide links to all the rulesets I've
written.

Existing Rulesets

If you want to undigestify something, the easiest approach is to use
somebody else's work. :-) The second easiest is to adapt something
that already exists. If one of the following rulesets matches your
needs, just slap it into your .gnus file and you're
done. I like to put my rulesets into an nndoc
subdirectory in my search path and then load them from my .gnus file.
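
A minimal sketch of such a loader, assuming the rulesets live in a hypothetical ~/elisp/nndoc directory and follow the nndoc-xxx.el naming used on this page (the directory name is made up; adjust it to your own layout):

```elisp
;; Assumption: rulesets are stored as ~/elisp/nndoc/nndoc-*.el.
;; The directory name is hypothetical.
(require 'nndoc)

(let ((ruleset-dir (expand-file-name "~/elisp/nndoc")))
  (when (file-directory-p ruleset-dir)
    ;; Load every ruleset file found in the directory, quietly.
    (dolist (file (directory-files ruleset-dir t "\\`nndoc-.*\\.el\\'"))
      (load file nil t))))
```

Loading by full file name avoids having to add the directory to load-path, at the cost of loading every matching file on startup.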

If no ruleset fits, try adapting something that's close before you
start from scratch. (Note that some of the following rulesets are
early efforts that don't do as much as some later ones. See the
rulesets for Yahoo! Groups and Crypto-Gram for some useful
techniques.)

Some rulesets are no longer maintained. I apologize if they don't
work; I've stopped receiving those lists and so I'm not able to fix
them.

Two common SANS digests:
SANS NewsBites and SANS PrivacyBits,
which thankfully have a common format.

SANS security
digests. SANS digests are a PITA to parse, by the way.
Every one of them has a slightly different format, for no
reason whatsoever. The parser doesn't work perfectly, but it
comes fairly close.

Yahoo! Groups. All
mailing lists from Yahoo! Groups use the same format, making
them easy to parse.
(No longer maintained.)

Anyone who wishes to contribute additional rulesets is welcome to
e-mail them to me. Please name them
nndoc-xxx.el and include only one
ruleset per file.

Writing a Digest Ruleset

This section is intended as a supplement to the Gnus documentation.
Before you read this, you should familiarize yourself with what the
Texinfo files have to say about adding new digests. If there's
something you don't understand there, I suggest you don't try to
puzzle it out, because it may become clearer here. It may be useful
to reread the documentation after reading this page.

The basic idea of a new ruleset is that you must describe to
nndoc how to find the beginning and ending of each
article in the digest. Ideally, this is done with a few regular
expressions. Sometimes (all too often, it seems) you will also have
to write code that converts a badly formatted article into a more
mail-like layout.
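
To make that concrete, here is a minimal sketch of a complete ruleset. The digest format is invented for illustration: I'm assuming each article is introduced by a line of ten equal signs, the digest carries an "X-Example-Digest:" marker line, and it ends with "End of Example Digest". The type name example-digest is hypothetical; the only real requirements are the nndoc-add-type call and a recognizer named nndoc-TYPE-type-p.

```elisp
(require 'nndoc)

;; Hypothetical recognizer for a made-up digest format.
(defun nndoc-example-digest-type-p ()
  "Return t if the current buffer looks like the made-up format."
  (save-excursion
    (goto-char (point-min))
    (and (re-search-forward "^X-Example-Digest: " nil t) t)))

;; The patterns describe where articles begin and where the digest ends.
;; Because article-begin ends in a newline, the search leaves point at
;; the start of the line after the separator, i.e. the first header line.
(nndoc-add-type
 '(example-digest
   (article-begin . "^==========\n")
   (file-end . "^End of Example Digest")))
```

Everything not mentioned in the definition (head-end, body-begin, and so on) falls back to the defaults described later on this page.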

How nndoc Parses a Digest

The most important part of writing a ruleset is understanding the
exact process by which Gnus (i.e., the nndoc package) turns a digest
into individual messages. This process is very complex because it has
tons of options. You need to know about all of the options, though,
because they are the key to getting your ruleset to work correctly.

Digest processing is divided into two parts: dissection and display.
During dissection, nndoc figures out exactly where each
message starts and ends in the digest. The output of this process is
an association list ("alist") that describes each individual message
as a set of offsets. See the comments about
nndoc-dissection-alist in the nndoc.el code
for more information. This step is usually the killer; it's very hard
to get it exactly right.

The second processing step happens during display. Here, the message
is extracted from the digest (which is easy because of the offsets
generated in step 1) and then reformatted for display. This is where
you can make things look nice.
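
As a rough illustration of how the two steps fit together, the sketch below pulls the header and body text for one dissection entry out of a buffer. It assumes each entry has the shape (NUMBER HEAD-BEGIN HEAD-END BODY-BEGIN BODY-END LINES); that layout is my assumption, so check the comments in your copy of nndoc.el. The helper name my-nndoc-entry-strings is hypothetical.

```elisp
;; Assumption: ENTRY looks like
;;   (NUMBER HEAD-BEGIN HEAD-END BODY-BEGIN BODY-END LINES)
(defun my-nndoc-entry-strings (entry buffer)
  "Return (HEADER . BODY) strings for dissection ENTRY taken from BUFFER.
Hypothetical helper, not part of nndoc."
  (with-current-buffer buffer
    (cons (buffer-substring-no-properties (nth 1 entry) (nth 2 entry))
          (buffer-substring-no-properties (nth 3 entry) (nth 4 entry)))))
```

In a live session, something like (my-nndoc-entry-strings (car nndoc-dissection-alist) nndoc-current-buffer) should show what nndoc thinks the first article is.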

Dissection

Dissection is performed by the function
nndoc-dissect-buffer. Understanding this function is key
to writing correct rulesets. If you have problems, this is also the
function to step through in the debugger. The output of
nndoc-dissect-buffer is the alist mentioned above.

The steps performed by nndoc-dissect-buffer are as follows:

Preparation is performed once per digest:

Remove blank lines from the beginning of the buffer.

If a dissection-function is defined, call it and
return the result, skipping all the other steps listed below.

If the file-begin pattern is defined, search for
it.

Dissection is performed in a loop, until there are no more messages
(articles) in the digest. In all cases, the term
"bol-search" means "Search for the given regular
expression, and set point to the beginning of the line containing it.
If the regular expression is not found, set point to the beginning of
the current line." The dissection loop is:

Find the beginning of the article. This is a complex step:

If this is the first time through the loop and first-article is
defined, bol-search for first-article.

If article-begin-function is defined, call it.
Note that there is no first-article-function.
However, the free
variable first is available to
article-begin-function and is
t for the first article, so the effect of
a first-article-function can be achieved
by testing first.

Otherwise, bol-search for article-begin.

All of these functions should leave point unchanged or at the
beginning of the article header. If they don't, you can fix
it up in the next step.

If there is a head-begin-function, call it.
Otherwise, if head-begin is defined, bol-search
for it.

If we are now at the end of the buffer, or if
file-end is defined and we are looking at
file-end, terminate the loop. (Note that this
means file-end must always match from the
beginning of a line, no matter how the digest is formatted.)

Otherwise, assume that there is a new article. Save the
current position as the beginning of the article header.

Bol-search for head-end (default is "^$", i.e., a
blank line). Save this as the end of the article header.

If body-begin-function is defined, call it
to find the beginning of the body. Otherwise, bol-search for
body-begin (default "^\n"). Save the result as
the beginning of the article body. Note that this step can
potentially cause information to be ignored between the
article header and body. Also note that because the pattern
includes a newline instead of a dollar sign, the position
saved is after the blank line rather than at it.

Find the end of the article body and save its position. This
step is complex:

If body-end-function is defined, call it and
use the resulting value of point.
body-end-function must return a
non-nil value or the following steps will
be executed.

If body-end is defined, bol-search for it.

Otherwise (including search failures for
body-end), search for the beginning of
the following article using the same procedure as at the top
of the loop: call article-begin-function if it is defined,
and otherwise bol-search for article-begin.

If the beginning of the following article can't be found,
go to the end of the file. If file-end
is defined, search backwards for it and go to the
beginning of that line.

Add the article number and the saved positions to the
dissection alist.

If generate-head-function is defined, call it to
generate fake headers for the article. Otherwise, simply grab
the lines between the beginning and end of the article header
and call them the headers. In either case, add a "Lines:"
header with a calculated line count. (Note: the important
header material depends on what you show in your summary
buffer. Typically, "Subject:", "From:", and maybe "Date:" are
useful things to generate.)

Whew! That's a complicated mess. Fortunately, you often don't need
to understand it in detail. It's documented above in case you need to
debug something. But the general summary is:

Always prefer a -function over a pattern.

Find the header, using first-article as
the pattern for article #1.

Find the end of the header.

Find the start of the body.

Find the end of the body, or the end of the file.

Save the headers, either real or generated.

That makes it much simpler, right?
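
The summary can be sketched in a few lines of Lisp. This is a simplified stand-in, not the real nndoc-dissect-buffer: it handles only an article-begin pattern and an optional file-end pattern, uses the default head-end ("^$") and body-begin ("^\n"), and ignores every -function option. The names my-bol-search and my-dissect-sketch are hypothetical.

```elisp
(defun my-bol-search (regexp)
  "Search forward for REGEXP, then move to the beginning of the line.
Return non-nil if REGEXP was found."
  (prog1 (re-search-forward regexp nil t)
    (beginning-of-line)))

(defun my-dissect-sketch (article-begin &optional file-end)
  "Dissect the current buffer using only ARTICLE-BEGIN and FILE-END.
Return a list of (NUMBER HEAD-BEGIN HEAD-END BODY-BEGIN BODY-END)."
  (let ((n 0) (result nil) (more t) next-begin)
    (goto-char (point-min))
    (while (and more
                ;; Reuse the article start found while looking for the
                ;; previous body end, if there was one.
                (if next-begin
                    (progn (goto-char next-begin) t)
                  (my-bol-search article-begin)))
      (setq next-begin nil)
      (if (or (eobp) (and file-end (looking-at file-end)))
          (setq more nil)
        (let ((head-begin (point)) head-end body-begin)
          (my-bol-search "^$")          ; end of the header
          (setq head-end (point))
          (my-bol-search "^\n")         ; start of the body
          (setq body-begin (point))
          ;; The body ends where the next article starts; failing that,
          ;; at the file-end line or the end of the buffer.
          (if (my-bol-search article-begin)
              (setq next-begin (point))
            (goto-char (point-max))
            (when (and file-end (re-search-backward file-end nil t))
              (beginning-of-line)))
          (setq n (1+ n))
          (push (list n head-begin head-end body-begin (point)) result))))
    (nreverse result)))
```

Note that with no body-end pattern, the body of each article runs up to the start of the next header, so it includes the separator line. That is one reason real rulesets so often define body-end as well.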

Display

The second layer of processing comes when it's time to display the
article. This is much simpler:

In an empty buffer, insert the header of the article (as
noted during dissection).

Insert a blank line.

Insert the body.

Go to the beginning of the body.

If prepare-body-function is defined, call it.

If article-transform-function is defined, call it.

Process the result like a normal mail message. In
particular, this means highlighting certain header fields,
"washing" the body according to your preferences, etc.

I've found that the most important detail is that
article-transform-function needs to produce "proper"
headers. For example, the subject should be preceded by "Subject: "
(including the blank). I also find it very useful to create
"From:", "Cc:", and "Reply-To:" lines designed so that I can just use
the "reply" and "wide reply" features to reply to article authors or
the entire mailing list. Thus, for example, when I recognize Yahoo!
group digests I save the group name in
nndoc-yahoo-groups-cc, and the
nndoc-transform-yahoo-groups-article function inserts a
CC: line to that group. The result is that I can reply to an
individual or wide-reply to the entire group, as needed.
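
Here is a minimal sketch of that kind of transform function. It is not the Yahoo! Groups code; the variable and function names are hypothetical, the list address is made up, and I'm assuming a format whose first header line is always a bare subject. As far as I know, nndoc passes the article number as the single argument, which this sketch ignores.

```elisp
;; Hypothetical variable: the list address to put in the Cc: header.
(defvar my-example-digest-cc "example-list@example.org"
  "Address of the mailing list the digest came from (made-up value).")

(defun my-example-digest-transform (_article)
  "Give the article in the current buffer usable headers.
Assumes the first line is a bare subject and that the header block
is everything before the first blank line."
  (save-excursion
    (goto-char (point-min))
    ;; Turn the bare first line into a proper Subject: header.
    (insert "Subject: ")
    ;; Append a Cc: line just before the blank line that ends the header.
    (when (re-search-forward "^$" nil t)
      (insert "Cc: " my-example-digest-cc "\n"))))
```

It would be hooked up with an (article-transform-function . my-example-digest-transform) entry in the type definition.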

Summary of nndoc Variables

Here's a summary of all the options you can set for an
nndoc digest type. All "find" functions can leave point
anywhere in the line found; nndoc will move to the
beginning of that line before proceeding. Unless otherwise specified,
all options are "if defined"; the default is to simply do nothing.
Also, all patterns and functions are used during dissection, with the
exception of article-transform-function and
prepare-body-function.

article-begin-function

Called to find the beginning of each article. Must
return t if an article is found, nil otherwise. If there
are no more articles, should leave point at the end of the
buffer or at a line matching nndoc-file-end.

article-begin

Pattern that matches the beginning of an article. After
this pattern matches, point should be somewhere
on the first meaningful line of the article.
NOTE: it may be necessary for this pattern to
also match nndoc-file-end, so that the end-of-digest check
(the file-end test in the dissection loop above) can work.

article-transform-function

During display, arbitrarily transforms the article.
Often used to generate RFC-compliant header lines
(nonblank characters followed by colon) at the beginning
of nonconforming articles. See also
prepare-body-function.
Note that if necessary, you can extract information from
the original unparsed article; see the
Google Groups
code for an example.

body-begin-function

Called to find the beginning of an article body.

body-begin

Pattern that matches the beginning of an article body.
Default is "^\n".

body-end-function

Called to find the end of an article body.
Must return t if another article follows this one, nil
otherwise.

body-end

Pattern that matches the end of an article body. Default is
the beginning of the next article, or the end of the file.

file-begin

Pattern that matches the beginning of the digest. This
feature is almost the same as first-article.
The difference is that first-article can
stand entirely alone, while file-begin is followed by
a search for either first-article (if
defined) or article-begin.

file-end

Pattern that matches the first meaningful line that marks
the end of the digest. Note that file-end
will only work properly if either (a)
body-end-function and body-end
are undefined, or (b) the body-end functions
leave point on the file-end line.

first-article

Pattern that matches the beginning of the first article
in the digest, in case the first article is distinguished
differently. Often, this is a multi-line pattern (with
embedded newlines). For many digest formats, however, it
is better to leave first-article unset and
use file-begin to skip past the garbage at
the front of the file.

generate-head-function

Called after dissection to generate (possibly fake)
headers that will be used to build the group summary
buffer. Must switch to nndoc-current-buffer
to extract relevant information, then return to the
original buffer and insert generated headers there.
This function must modify the article buffer.
Use an existing one as a guide for writing your own.

head-begin-function

Called to position point at the beginning of the article
header. If there
are no more articles, should leave point at the end of the
buffer or at a line matching nndoc-file-end.

head-begin

Pattern that matches the first line of the article header.
NOTE: it may be necessary for this pattern to
also match nndoc-file-end, so that the end-of-digest check
(the file-end test in the dissection loop above) can work.

head-end

Pattern that matches the last line of the article
header. Default is "^$".

prepare-body-function

During display, arbitrarily prepares the article body for
display. Most commonly used to remove quoting in embedded
articles (e.g., MIME digests), but can do whatever it
wants. Called with point at the beginning of the body,
but can go to point-min if it wants to muck
with the article headers as well; in this sense it
duplicates article-transform-function (q.v.).
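
Of the options above, generate-head-function is probably the hardest to picture, so here is a minimal sketch of one. It assumes a made-up format in which the first line of each dissected header region is a bare subject, that the function is called with the article number, and that each dissection entry starts with the header's beginning offset after the article number. The function name and the From: address are hypothetical.

```elisp
(require 'nndoc)

(defun my-example-digest-generate-head (article)
  "Insert fake headers for ARTICLE into the current buffer.
Hypothetical sketch: assumes the first line of each dissected header
region is a bare subject line."
  (let ((entry (cdr (assq article nndoc-dissection-alist)))
        subject)
    ;; Look at the original digest text without disturbing the
    ;; buffer we were called in.
    (with-current-buffer nndoc-current-buffer
      (save-excursion
        (goto-char (car entry))
        (setq subject (buffer-substring-no-properties
                       (point) (line-end-position)))))
    ;; Back in the original buffer: insert the generated headers.
    (insert "Subject: " (if (string= subject "") "(no subject)" subject) "\n"
            "From: digest@example.invalid\n")))
```

The point of the with-current-buffer dance is exactly what the description says: read from nndoc-current-buffer, write into the buffer you were called in.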

Debugging a Digest Ruleset

Rulesets are hard to write correctly. No matter how hard you try,
you'll make mistakes, and then you're stuck with figuring out what
went wrong.

One thing to remember is that nndoc caches some
information for speed. Whenever you change your rulesets, go to a
different article than the one you're working on, and type "C-d" to
enter it. It doesn't matter if it's a digest or not; the point is to
get nndoc to clear its cache. Then return to the article in question
and try it again.

Common Mistakes

Some mistakes happen over and over again. Here are some common
problems and suggested solutions:

I just get a bell when I try to enter the
digest. This is the most common symptom of a failed
pattern set. Unfortunately, it's very hard to debug; you may
have to step through the code (see below). First, though, go
into the *Article* buffer and check your
patterns. Every option listed above is saved in
nndoc-option-name. For example, the
head-begin pattern is in
nndoc-head-begin. You can use ESC :
to execute an Elisp expression that experiments with those
patterns. For example, use ESC : (re-search-forward
nndoc-first-article) RET to see if you're correctly
finding the first article in the digest. Remember that
point must wind up on the first line of the
article header (unless head-begin-function is
going to correct it).

Every second message is missing. Perhaps you have a
head-begin pattern that skips past the article
beginning found by article-begin. Usually,
head-begin should be unset.

Every second message is missing. You have an
article-begin pattern that matches multiple
lines, but no body-end pattern. The result is
that the end of the body extends into the beginning of the
following article, so that a subsequent
article-begin search won't find the beginning of
that article. The solution is to define a
body-end pattern that matches only the first line
of the article-begin pattern, or to define a
body-end-function that finds the beginning of the
proper area. I often use a small body-end-function
for this purpose.
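
A minimal sketch of such a function, assuming the ruleset defines an article-begin pattern (the name my-example-digest-body-end is hypothetical). It stops at the first character of the next article-begin match, so the following article-begin search can still see the whole multi-line pattern:

```elisp
(require 'nndoc)

(defun my-example-digest-body-end ()
  "Move to the start of the next article-begin match, if there is one.
Return non-nil when a following article was found, nil otherwise."
  (and nndoc-article-begin
       (re-search-forward nndoc-article-begin nil t)
       (goto-char (match-beginning 0))))
```

When it returns nil, nndoc falls through to its other ways of finding the end of the body, so the last article in the digest is still handled.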

Sometimes the first line of an article is
missing. The article doesn't use RFC-compliant
headers, and you didn't write an
article-transform-function. In the absence of
proper headers, Gnus guesses that the first line
of the article is a subject. But if the subject has a colon
in it, Gnus gets confused. The solution is
simple: insert "Subject: " (with the blank) in front of the
first line.

Serious Debugging

If the above hints don't get you going, you're kind of up a creek. It
would be nice if there were some special functions to help debugging.
For example, it would be really cool to be able to go into an article
buffer, type M-x nndoc-show-markers RET, and see
colorization that describes how nndoc parsed the buffer.
Maybe someday.

Until then, you have two tools: experimenting with individual
parameters, and stepping through the relevant code.

The very first thing to do is to verify that your
nndoc-foo-type-p function works.
Go to *Article* and type ESC : (nndoc-foo-type-p)
RET where foo is the name of your added type (e.g.,
technews-summary). It should return t. If
not, fix that function so that it correctly recognizes your digest.
Be as selective as possible; you don't want your TechNews recognizer
to try to parse RFC 1153-compliant digests.

If your type-recognition function seems to work, double-check it by
looking at the contents of nndoc-article-type. If that's
wrong, some other type may have beaten you to the punch. Use the
second argument of nndoc-add-type to control this
problem. Also, remember that if the type-recognition function returns
a number, it's taken as a priority, so be sure it returns t
if it's certain it's found the correct type.
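
As a sketch of both points, here is a selective recognizer together with an nndoc-add-type call that uses the second argument. The type name example-news, the marker line, and the article-begin pattern are all invented for illustration.

```elisp
(require 'nndoc)

;; Hypothetical type name and recognizer, for illustration only.
(defun nndoc-example-news-type-p ()
  "Return t only when the buffer is clearly the made-up format."
  (save-excursion
    (goto-char (point-min))
    ;; Be selective: require a marker line other digests will not have.
    (and (re-search-forward "^Example News Summary for " nil t) t)))

;; The second argument `first' asks nndoc to try this type before
;; the types that are already defined.
(nndoc-add-type
 '(example-news
   (article-begin . "^\\* "))
 'first)
```

Putting a type first only makes sense when its recognizer is strict; a sloppy recognizer at the head of the list will steal digests from every other type.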

The next step is to check all your patterns. In
*Article*, search for each pattern you defined. If the
type recognizer succeeded, each pattern will be saved in a variable
with the same name, preceded by nndoc-. So, for example,
start with ESC : (re-search-forward nndoc-first-article)
RET. Make sure each pattern matches what it's supposed to, and
that it leaves point somewhere in the line that's at the
beginning or end of the header or body, as appropriate.
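
If you get tired of typing those searches by hand, a small helper can do the search and report where nndoc-style searching would leave point. This is a hypothetical convenience function of my own devising, not part of nndoc:

```elisp
(defun my-nndoc-try-pattern (regexp)
  "Search for REGEXP from the start of the current buffer.
Return the text of the line on which a search followed by a move to
the beginning of the line would leave point, or nil if REGEXP does
not match.  Hypothetical helper, not part of nndoc."
  (save-excursion
    (goto-char (point-min))
    (when (re-search-forward regexp nil t)
      (beginning-of-line)
      (buffer-substring-no-properties (point) (line-end-position)))))
```

In *Article*, ESC : (my-nndoc-try-pattern nndoc-first-article) RET then shows the line the pattern lands on without moving point.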

Using the Debugger

If none of this helps, you need the debugger. Before you start
debugging, make sure you have non-compiled code by explicitly loading
the file "nndoc.el" (use the locate command
to find it). In the group summary buffer, select the digest and use
C-u g to get the "raw" version that nndoc
looks at. Then use M-x debug-on-entry RET nndoc-dissect-buffer
RET to set a breakpoint. Type C-d to enter the
digest, hit "d", then "c". At this point
the buffer should have been dissected, and the results are available
in the variable nndoc-dissection-alist. You can look at
the values with ESC : nndoc-dissection-alist RET or
(better) go into the *scratch* buffer to look at it. The
alist will be too long to see all of it, but you can check some of the
values to see if they look reasonable. Copy those values into another
window (I like to copy and paste into "cat >/dev/null" in a shell
window to record this sort of information). You can then go into the
*Article* buffer and use M-x goto-char RET
to go to the various places in the buffer and see if they seem
reasonable.
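
The setup steps can also be run as Lisp, for example from *scratch*. This is only a sketch of the preparation described above; it assumes the nndoc.el source file is installed somewhere Emacs can find it.

```elisp
(require 'nndoc)

;; Reload nndoc from its source file so the debugger steps through
;; readable Lisp rather than byte code.
(let ((source (locate-library "nndoc.el")))
  (if source
      (load source nil t)
    (message "nndoc.el source not found; stepping will show byte code")))

;; Break whenever dissection starts.
(debug-on-entry 'nndoc-dissect-buffer)

;; Later, to remove the breakpoint again:
;; (cancel-debug-on-entry 'nndoc-dissect-buffer)
```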

If you have trouble generating the alist, or if it looks very wrong,
you can step through your dissection functions (if any) or
nndoc-dissect-buffer itself. While stepping, the command
ESC : (switch-to-buffer nndoc-current-buffer) RET will
put you into the buffer that is being dissected, so you can look at
what the functions are seeing.

If the alist looks OK and you can get a group summary, but can't see
an individual article correctly, you probably have display-related
problems. Use M-x cancel-debug-on-entry RET
nndoc-dissect-buffer RET to turn off debugging, then M-x
debug-on-entry RET nndoc-request-article RET to set a new
breakpoint. This time, use only d to step through the
function. After the second time insert-buffer-substring
is called, you can use ESC : (switch-to-buffer buffer)
RET to temporarily get into the scratch buffer where the
article is being built. This will let you see what the transformation
functions are about to work on. Use C-x b RET to return
to the debugger buffer, and step through your own code with
d. At any time, you can see the current state of the
buffer (including point) by repeating the ESC :
(switch-to-buffer buffer) RET command. (A handy shortcut is
C-x ESC ESC, which repeats the last command—often,
that's the switch-to-buffer command, or at least you can
get there with a few M-p keystrokes.)

Finally, if you are getting inexplicable behavior (i.e., the changes
you make don't seem to take effect, or you breakpoint on a function
that you know is being called and the debugger isn't entered), try
exiting Gnus and reentering it. Sometimes, stuff gets cached in weird
places.

It's fair to say that the debugging process is sometimes painful.
However, the end result is well worth it: you type C-d on
a big digest with tons of messages, and they're nicely broken up (and
even threaded) for your reading convenience.

This text is explicitly placed in the public domain. Feel free to use
it, extend it, modify it, abuse it, or destroy it as you wish.