What is ThreadFind?

ThreadFind is an open-source, web-based interface for discovering
the message and thread URLs for messages on a mailing list, given the
"Message-ID" header or other metadata. Secondarily, ThreadFind serves
as a subsystem for SummaryDesk, which uses ThreadFind to gather thread
information for writing summaries.

ThreadFind development is sponsored by:

Why is ThreadFind necessary?

Often, mailing list management software is not integrated well
enough with list archivers to know a message's archive URL at the
time that message goes out to recipients. This means there is no
header on the message saying what its archive URL is, so anyone
who wants to refer to that message has to browse the archives, find
the message, and copy and paste the URL. This is a time-consuming
process.

ThreadFind solves this problem by keeping a database mapping
"Message-ID" headers (which nearly all emails have) to message archive URLs.
Thus if you have the message itself in your hand, you can instantly
get its archive URL. Furthermore, when you query the database, by
default ThreadFind returns the results in a way that is friendly to
cut and paste.

This is just the default output; ThreadFind can be configured (via
the request URL) to give other kinds of output.

Other functionality: because ThreadFind has to constantly scan the
mailing list to keep itself up-to-date anyway, it grabs other
important mail headers as well. This allows users to map quickly
between, say, "Subject" and URL. Thus ThreadFind duplicates some of
the search capabilities found in most archivers; but because this
functionality came nearly for free, and allows people to add new kinds
of search interfaces, we decided to implement it anyway.
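
The mapping described above can be pictured with a small sketch. It is
an illustration only, not ThreadFind's actual schema or code: the table
layout, the function names, and the example URLs are all assumptions,
and sqlite3 stands in for MySQL so that the example runs without a
database server.

```python
import sqlite3

# Hypothetical schema: one row per archived message. ThreadFind itself uses
# MySQL; an in-memory sqlite3 database is used here only for the sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    " message_id TEXT PRIMARY KEY,"
    " subject TEXT,"
    " url TEXT NOT NULL)"
)


def record_message(message_id, subject, url):
    """Store the headers scraped from one archive page."""
    conn.execute(
        "INSERT OR REPLACE INTO messages (message_id, subject, url)"
        " VALUES (?, ?, ?)",
        (message_id, subject, url),
    )


def url_for_message_id(message_id):
    """Return the archive URL for a Message-ID, or None if it is unknown."""
    row = conn.execute(
        "SELECT url FROM messages WHERE message_id = ?", (message_id,)
    ).fetchone()
    return row[0] if row else None


def urls_for_subject(fragment):
    """Return URLs of messages whose Subject contains the given text."""
    rows = conn.execute(
        "SELECT url FROM messages"
        " WHERE instr(lower(subject), lower(?)) > 0 ORDER BY url",
        (fragment,),
    ).fetchall()
    return [row[0] for row in rows]


# Made-up sample data.
record_message("<abc@example.org>", "Release plan",
               "https://lists.example.org/dev/12.html")
record_message("<def@example.org>", "Re: Release plan",
               "https://lists.example.org/dev/13.html")
```

With these rows in place, url_for_message_id("<abc@example.org>")
returns the first URL, and urls_for_subject("release") returns both.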

Overview of how it works

You configure ThreadFind to watch a set of mailing lists, via the
base URLs of their archives. For each mailing list, it starts at
message number 1 and polls upward. Each time it finds a message, it
pulls it in via HTTP and parses the page to get the header data. Then
it stores the header data and the URL in a database record.
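
As a rough sketch of that polling loop (not the real implementation):
the URL layout, the page format, and the names poll_list, fetch, and
store are assumptions made for illustration. The fetch function is
passed in so that the example runs against a fake archive instead of
the network.

```python
import re

# Assumed page format: headers appear as "Name: value" lines in the page text.
HEADER_RE = re.compile(
    r"^(Message-ID|Subject):[ \t]*(.+)$", re.IGNORECASE | re.MULTILINE
)


def parse_headers(page_text):
    """Pull the headers of interest out of one archive page."""
    return {
        name.lower(): value.strip()
        for name, value in HEADER_RE.findall(page_text)
    }


def poll_list(base_url, fetch, store, start=1):
    """Walk message numbers upward from `start` until a page is missing.

    `fetch(url)` returns the page text, or None if the page does not exist.
    `store(url, headers)` saves one record. The return value is the next
    message number to try, so a later run can resume from there.
    """
    number = start
    while True:
        url = f"{base_url}/{number}.html"  # hypothetical URL layout
        page = fetch(url)
        if page is None:
            return number
        store(url, parse_headers(page))
        number += 1


# A fake two-message archive standing in for real HTTP requests.
pages = {
    "https://lists.example.org/dev/1.html":
        "Message-ID: <a@example.org>\nSubject: hello\n\nbody",
    "https://lists.example.org/dev/2.html":
        "Message-ID: <b@example.org>\nSubject: Re: hello\n\nbody",
}
records = []
next_number = poll_list(
    "https://lists.example.org/dev",
    pages.get,
    lambda url, headers: records.append((url, headers)),
)
```

Returning the next number to try is one simple way to let a later run
continue from where the previous one stopped.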

ThreadFind is a self-updating system: no manual update process is
required when ThreadFind comes back online after having been offline
for a while. It just looks at its config file and the mailing list
archives, and brings itself up-to-date automatically.

Starting and stopping are done through the threadfind-ctl.py
program, using the commands
"threadfind-ctl.py -c live.cfg start" and
"threadfind-ctl.py -c live.cfg stop", respectively.
You can also use
"threadfind-ctl.py -c live.cfg restart" to get it to
reread its configuration file, which is where you add new mailing
lists, or tweak the configurations for existing lists. Run
"threadfind-ctl.py -c live.cfg" with no other arguments to
get a status printout of the instance.

Todo list

ThreadFind is alpha software. Some of the remaining work, in no
particular order, is:

Develop all the queries that SummaryDesk and others might need.

Document those queries.

Improve threadfind-ctl.py to detect hangs better.

Learn to probe automatically upwards a bit when a gap is found in
the message sequence, so that the "lower_bound" option in a list's
configuration does not need to be manually tweaked for this case.

Add instructions to hide config files from web browsers.

Need to start using the bug tracker, instead of todo lists, to keep
tabs on open issues :-).
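
The gap-probing item in the list above could be approached along the
following lines. This is only a sketch of one possible design: the
function name, the exists callback, and the max_gap limit are all
hypothetical and are not part of ThreadFind.

```python
def next_existing(start, exists, max_gap=5):
    """Return the first message number at or after `start` that exists.

    Looks at `start` and at most `max_gap` numbers beyond it. Returns None
    if nothing is found in that window, which is treated as the end of the
    archive rather than as a gap.
    """
    for number in range(start, start + max_gap + 1):
        if exists(number):
            return number
    return None


# Made-up archive in which messages 4 and 5 are missing.
present = {1, 2, 3, 6, 7}
resume_at = next_existing(4, lambda n: n in present)
end_of_list = next_existing(8, lambda n: n in present, max_gap=3)
```

Here resume_at is 6, because the probe skips the two missing numbers,
and end_of_list is None, because nothing exists within three numbers
of 8.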

How to get ThreadFind working from scratch

Create the database user.

Make sure the MySQL users 'threadfindrw' and 'threadfindro' exist,
that the first has read/write access to an existing database named
threadfind, and that the second has read-only access to it.

In the web server configuration, add '.cgi' to the 'AddHandler cgi-script' directive so that the default ThreadFind page can be viewed.

Copy "example.cfg" to a new file (say, "live.cfg") and edit the new
file to match your setup.

Take a deep breath, then run
"threadfind-ctl.py -c live.cfg start". ThreadFind will
now start populating the database with messages. Use
threadfind-ctl.py to start, restart, and stop ThreadFind as necessary.
When started again, ThreadFind will always pick up where it left off,
no matter how long it's been stopped for.