The Weaver Daemon is a daemon that threads news articles and serves
the resulting threads out. A web-based front end exists -- Loom.

Normal news readers fetch the headers for the articles you haven't
seen, and then thread the headers and display them in a pleasing
manner. This is fine, but it isn't really an ideal way for a web-based
threading interface to work. Threading is a (relatively speaking)
time-consuming task, and if you wish to design a web-based news reader
that's able to serve out, say, tens of pages per second, you either
throw a 64-CPU Sun Enterprise server at the problem, or you do
something non-traditional.

weaverd (i.e., The Weaver Daemon) is an attempt at doing something
clever.

The main idea in weaverd is to keep all the threads in memory.
Scratch-buffer computation shows that this means that every 10M
articles will require about 1GB of RAM (roughly 100 bytes per
article), which is OK.

weaverd will keep its structures mirrored to disk, so that it can
start up in a pretty speedy fashion, but won't really read anything
from disk once it has bootstrapped itself. (Well, it'll read new
articles as they arrive to thread them, but won't touch the data
structures otherwise.)
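
How the mirroring is done isn't specified here; one obvious approach
(and this is just a guess, not necessarily what weaverd actually does)
is to mmap the node arrays straight from backing files, so that the
in-memory structures and the on-disk copies are literally the same
bytes and the kernel handles the write-back:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Map a node file into memory.  The file name and growth policy
       are illustrative only. */
    void *map_node_file(const char *file_name, size_t length)
    {
      int fd = open(file_name, O_RDWR | O_CREAT, 0644);
      void *mem;

      if (fd < 0)
        return NULL;
      /* Make sure the backing file is at least LENGTH bytes long. */
      if (ftruncate(fd, (off_t) length) < 0) {
        close(fd);
        return NULL;
      }
      mem = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      close(fd);       /* the mapping stays valid after closing the fd */
      return mem == MAP_FAILED ? NULL : mem;
    }

With MAP_SHARED the dirty pages get written back by the kernel on its
own schedule, and starting up again is just re-mapping the files.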

weaverd has three basic structures (sketched in code after the list):

String storage. Many strings (author names, subjects) repeat
themselves, so only one copy of each unique string is kept, and all
the instances refer to this string. A hash structure is used to keep
track of the strings.

The group array. Each group is represented by a structure that
has the group name, group id, and most importantly -- the list of
articles (or nodes) in two orders -- article order and thread order.
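
The third structure -- the per-article node that the "42 bytes" below
refers to -- isn't spelled out above, but its fields more or less
follow from the algorithm described later. Here's a rough sketch in C
of how the structures might look; the field names, types and layout
are guesses for illustration, not the actual weaverd layout:

    #include <time.h>

    /* String storage: every unique string is stored once; a hash table
       maps string contents to a numeric string id. */
    typedef struct {
      char *block;             /* all unique strings, concatenated     */
      unsigned int *hash;      /* open-addressed table of offsets      */
      unsigned int hash_size;
    } string_storage;

    /* A node: one article.  Strings are referred to by string ids;
       threading is parent/first-child/next-sibling.  How this adds up
       to the 42 bytes mentioned below is a guess. */
    typedef struct {
      unsigned int number;     /* article number within the group      */
      unsigned int message_id; /* string id                            */
      unsigned int author;     /* string id (From)                     */
      unsigned int subject;    /* string id                            */
      time_t date;
      int parent;              /* node index, -1 for a thread root     */
      int first_child;         /* node index, -1 if none               */
      int next_sibling;        /* node index, -1 if none               */
    } node;

    /* A group array entry: the group name, group id, and the nodes in
       both article order and thread (display) order. */
    typedef struct {
      unsigned int group_id;
      unsigned int group_name; /* string id                            */
      node *nodes;             /* the nodes, in article order          */
      int *thread_order;       /* node indexes, in thread order        */
      int node_count;
    } group;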

So each node takes 42 bytes, plus the strings from the string storage.
Running over the spool shows that each article adds, on average, 58
bytes of string data that weaverd has to store, so that's exactly 100
bytes per article. Scarily enough, that's what I guesstimated before I
started.

I hadn't really counted up the overhead that all the hash tables
entail, though. That's about 20 bytes more per article.

This is the basic algorithm when receiving a new message:

Pick out the Date, From, Message-ID, Newsgroups and the last
References entry, and look them up in the string storage. (Well, the
Date is converted to a time_t instead.)

Enter the resulting node into the numeric node array for the group.

Find out where in the thread structure the new node fits, and
(possibly) alter the parent and the previous sibling. Then re-compute
the thread-order node array for the group, either by being clever
about it or by just brute-forcing it, depending on how fast
brute-forcing turns out to be.
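
A minimal sketch of the "hook the node into the thread" part of that
step, using the illustrative structs above. The parent is assumed to
have been found already (e.g. via the References look-up), and new
nodes are assumed to start out with first_child and next_sibling set
to -1:

    /* Hook a freshly entered node into the thread structure as the
       last child of its parent.  PARENT_INDEX is -1 when no References
       match was found, in which case the node starts a new thread. */
    void thread_node(group *g, int new_index, int parent_index)
    {
      node *n = &g->nodes[new_index];

      n->parent = parent_index;
      if (parent_index < 0)
        return;

      node *parent = &g->nodes[parent_index];
      if (parent->first_child < 0) {
        parent->first_child = new_index;
      } else {
        /* Walk to the last existing sibling and hang the new node
           off it. */
        node *sibling = &g->nodes[parent->first_child];
        while (sibling->next_sibling >= 0)
          sibling = &g->nodes[sibling->next_sibling];
        sibling->next_sibling = new_index;
      }
    }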

Now, expiry adds a twist. It would probably be convenient to be
able to go from Message-ID to node, so perhaps a separate Message-ID
hash table would be nice, in addition to the general string hash
table.
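
A sketch of what such a Message-ID table could look like -- a simple
open-addressed hash from Message-ID string to node index. The size and
names are made up, and deletion (which expiry and cancel would need,
via tombstones or similar) is left out:

    #include <string.h>

    #define MSGID_TABLE_SIZE (1 << 20)  /* a power of two; arbitrary   */

    typedef struct {
      const char *message_id;           /* NULL means an empty slot    */
      int node_index;
    } msgid_entry;

    static msgid_entry msgid_table[MSGID_TABLE_SIZE];

    static unsigned int hash_string(const char *s)
    {
      unsigned int h = 5381;

      while (*s)
        h = h * 33 + (unsigned char) *s++;
      return h;
    }

    void msgid_insert(const char *message_id, int node_index)
    {
      unsigned int i = hash_string(message_id) & (MSGID_TABLE_SIZE - 1);

      while (msgid_table[i].message_id != NULL)
        i = (i + 1) & (MSGID_TABLE_SIZE - 1);
      msgid_table[i].message_id = message_id;
      msgid_table[i].node_index = node_index;
    }

    int msgid_lookup(const char *message_id)
    {
      unsigned int i = hash_string(message_id) & (MSGID_TABLE_SIZE - 1);

      while (msgid_table[i].message_id != NULL) {
        if (strcmp(msgid_table[i].message_id, message_id) == 0)
          return msgid_table[i].node_index;
        i = (i + 1) & (MSGID_TABLE_SIZE - 1);
      }
      return -1;                        /* not found */
    }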

The daemon will listen on a socket and take the following commands:

enter [path to message]: Parse the message and enter it into the
structures as described above.

cancel [message-id]: Remove the message in question from the
structures.

group-thread [group] [page] [page-size] [last]: This is the main
command used by the web interface, and returns the articles on page
PAGE, counted in thread order (threadly speaking). That is, if there
are 1000 articles in the group, asking for positions 990-1000 will
return the ten last displayed articles in the group, if the group was
already threaded. So they won't necessarily be article numbers
990-1000.
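
Since the thread-order array is kept up to date as articles arrive,
serving such a page boils down to slicing that array -- no threading
work happens at request time. A rough sketch, using the illustrative
structs from above (the [last] argument and the actual reply format
are ignored here):

    #include <stdio.h>

    /* Return (here: just print) the articles on page PAGE, in thread
       order.  All the threading work has already been done when the
       articles arrived; this is plain array slicing. */
    void group_thread(group *g, int page, int page_size)
    {
      int start = page * page_size;
      int end = start + page_size;
      int i;

      if (end > g->node_count)
        end = g->node_count;
      for (i = start; i < end; i++) {
        node *n = &g->nodes[g->thread_order[i]];
        /* A real reply would include subject, author and date looked
           up from the string storage; the article number will do. */
        printf("%d %u\n", i, n->number);
      }
    }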