Idea

The OSBF-Lua filter written by Fidelis
Assis is an amazing spam filter and general-purpose text classifier.
However, some of its "magic" is hidden in spam-filtering-specific code,
so using it efficiently for tasks other than spam filtering requires some
experience and know-how.

Moonfilter is a wrapper around OSBF-Lua that aims at making this process
easier. It picks up Fidelis Assis' "best practice" experience from spam
filtering (regarding issues such as thick threshold/reinforcement training,
recommended number of buckets, etc.) and offers a comfortable interface for
training and classifying any other text classes in the same way.

The Moonfilter API includes:

a function for setting the classes to classify among (which will be used
for all subsequent operations);

functions for creating and destroying the database files;

a function for classifying, which will not only report the predicted
class, but also whether the user should reinforcement train, and

an intelligent function for training, which will only train if necessary
(either due to misclassification or for reinforcement).
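
The "train only if necessary" rule above can be sketched in a few lines of
Python. This is an illustration of the decision only, not Moonfilter's
actual code; the function name and its parameters are hypothetical.

```python
def needs_training(predicted, actual, reinforce):
    """Return True when a training step is warranted.

    predicted: class reported by the classifier
    actual:    class the text really belongs to
    reinforce: flag reported by classify, saying that the text was
               classified correctly but should be reinforcement trained
    (All three names are assumptions made for this sketch.)
    """
    # Train on a misclassification, or when reinforcement is requested.
    return predicted != actual or reinforce
```

A correctly classified text with no reinforcement request is therefore
left alone, which keeps the database from being trained on every text.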

The API also exports some public variables so the user can modify the
training threshold or the number of buckets or other important settings.
The default settings should already be reasonable for normal usage, leaving
such parameter fiddling for those who really want to do it.

In addition to the Lua API, which can be invoked flexibly from Lua scripts,
there is also an easy-to-use script which reads lines from standard input,
executes each of them as a command, and writes a line containing the result
to standard output. This allows using the filter for standard purposes
without having to write any code, and it also makes it easy to
remote-control the filter from other languages such as Java.

API and Scripting Interface

The executable Lua script maps the
Lua API to a command-response syntax in a straightforward way. Since the
external scripting interface is merely a generic wrapper around the
internal API, it is nearly identical to the Lua API. Each
command corresponds to one exported Lua function which expects its
parameters (if any) as Lua strings or numbers.

Each line of standard input is passed as a command to the wrapped
moonfilter module. Each line consists of a command name followed by any
number of parameters, separated by whitespace. Parameters containing
whitespace or starting with a double quote must be enclosed in "double
quotes"; double quotes and backslashes in such quoted strings must be
backslash-escaped.
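
As a rough illustration of these quoting rules, here is a minimal Python
sketch that splits one command line into its command name and parameters.
It is written from the description above, not taken from the Moonfilter
sources; the function name split_command and its error handling are
assumptions.

```python
def split_command(line):
    """Split a command line into tokens, following the quoting rules
    described above (a sketch, not the original parser)."""
    tokens = []
    i, n = 0, len(line)
    while i < n:
        if line[i].isspace():
            i += 1  # skip separating whitespace
            continue
        if line[i] == '"':
            # Quoted parameter: read up to the closing quote, resolving
            # backslash escapes along the way.
            i += 1
            chars = []
            while True:
                if i >= n:
                    raise ValueError("unterminated quoted parameter")
                ch = line[i]
                if ch == "\\":
                    if i + 1 >= n:
                        raise ValueError("dangling backslash")
                    chars.append(line[i + 1])
                    i += 2
                elif ch == '"':
                    i += 1
                    break
                else:
                    chars.append(ch)
                    i += 1
            tokens.append("".join(chars))
        else:
            # Bare parameter: runs until the next whitespace character.
            start = i
            while i < n and not line[i].isspace():
                i += 1
            tokens.append(line[start:i])
    return tokens
```

The first token is the command name and the remaining tokens are its
parameters, so split_command('train spam "my file.txt"') yields the three
tokens train, spam and my file.txt.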

If the command is a function in the wrapped module, the function is executed
with the specified parameters. Otherwise it should be a public variable of
the wrapped module, which will be set to the value(s) specified as
parameters; if there are no parameters, the variable is simply returned
without changing it.
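
The function-or-variable dispatch just described might look roughly like
the following Python sketch. The real runner is written in Lua; the name
dispatch, the stand-in module object, and the choice to store several
parameters as a list are all assumptions made for illustration.

```python
def dispatch(module, name, params):
    """Run `name` as a function of `module`, or treat it as a variable.

    Sketch only: an unknown name raises AttributeError here, and several
    parameters are stored as a list (an assumption, not documented
    behaviour of the original script).
    """
    target = getattr(module, name)
    if callable(target):
        # Commands that name a function are called with the parameters.
        return target(*params)
    if params:
        # Commands that name a variable set it when parameters are given.
        value = params[0] if len(params) == 1 else list(params)
        setattr(module, name, value)
    # With or without parameters, the current value is reported back.
    return getattr(module, name)
```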

If execution of the command is successful, the program will print the name
of the command followed by "ok" and the return value of the function or the
(new) value of the variable (if any – the "nil" value is omitted).
Key/value pairs are separated by an equals sign: key=value; booleans are
serialized as true or false. In case of an error, the program will
print the command name followed by "failed" and an error message.
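
A client that remote-controls the filter has to read these response lines
back. The Python sketch below shows one way to do that, assuming that the
fields of a response are separated by whitespace and that values contain no
whitespace themselves; neither detail is confirmed by the description above,
and the function names are hypothetical.

```python
def to_value(text):
    """Map the serialized booleans back to Python; keep the rest as text."""
    if text == "true":
        return True
    if text == "false":
        return False
    return text


def parse_response(line):
    """Parse '<command> ok [values...]' or '<command> failed <message>'."""
    parts = line.strip().split(None, 2)
    if len(parts) < 2 or parts[1] not in ("ok", "failed"):
        raise ValueError("not a response line: %r" % line)
    command, status = parts[0], parts[1]
    rest = parts[2] if len(parts) > 2 else ""
    if status == "failed":
        # Everything after "failed" is the error message.
        return {"command": command, "ok": False, "error": rest}
    values, pairs = [], {}
    for token in rest.split():
        key, sep, value = token.partition("=")
        if sep:
            pairs[key] = to_value(value)    # key=value pair
        else:
            values.append(to_value(token))  # plain return value
    return {"command": command, "ok": True, "values": values, "pairs": pairs}
```

For a made-up response such as "classify ok class=spam reinforce=true" (the
key names are invented for this example), the result holds the pairs
class=spam and reinforce=True.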

The special command "exit" terminates the program. Alternatively, the
program terminates when it reaches the end of standard input. In the latter
case, it will terminate immediately without printing a response.

Commands can also be passed in on the command line. Each command line
argument is considered a full command call, so if a command call contains
parameters, it must be quoted so the operating system will treat it as a
single argument. Commands from the command line are read and executed prior
to executing commands from standard input.
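
When the script is started from another program, the commands can be
assembled as an argument list, which avoids shell quoting altogether. The
Python sketch below formats each command according to the quoting rules and
builds such a list. The interpreter name "lua", the script path passed in,
and all function names are assumptions; quoting an empty parameter is also
a guess, since the rules above do not cover that case.

```python
def quote_param(param):
    """Quote a parameter if it contains whitespace or starts with a
    double quote; escape backslashes and double quotes inside quotes."""
    param = str(param)
    if param == "" or param.startswith('"') or any(c.isspace() for c in param):
        escaped = param.replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'
    return param


def format_command(name, *params):
    """Join a command name and its parameters into one command call."""
    return " ".join([name] + [quote_param(p) for p in params])


def build_argv(script, commands):
    """Build an argument list with one argument per full command call.

    script:   path to the runner script (hypothetical, supplied by caller)
    commands: sequence of (name, param, ...) tuples
    """
    return ["lua", script] + [format_command(*c) for c in commands]
```

The resulting list can be handed to a process-spawning call that takes
separate arguments (in Python, subprocess.run), so each command call
arrives at the script as a single argument.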

Moonrunner Usage Examples

For all simple purposes, two lines should be sufficient (one for selecting
the classes and the other for doing the job):

Create database files:

classes nonspam spam
create

Classify a file:

classes nonspam spam
classify FILENAME

Train a file (e.g. as spam):

classes nonspam spam
train spam FILENAME

If you want to classify and then train, it should be sufficient to give the
filename once (train will use the same file):

classes nonspam spam
classify FILENAME
train spam

If the text to classify isn't already stored in a file, having to create a
temp file would be inefficient. Hence you can use "-" as a special filename
that tells the program to read from standard input until the end of input.
This will only work as a parameter for the very last command, since it will
consume all the rest of stdin.
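
Seen from a calling program, the input stream then consists of the command
lines followed directly by the text itself. A small Python sketch of how
such a stream could be put together (the function name is hypothetical):

```python
def build_stdin(commands, text):
    """Return the full standard input for one run: the command lines,
    then the raw text that the final "-" parameter will consume."""
    if not commands or not commands[-1].split()[-1:] == ["-"]:
        # Only the very last command may use "-" as its filename.
        raise ValueError('the last command must end with "-"')
    return "\n".join(commands) + "\n" + text
```

Calling build_stdin(["classes nonspam spam", "classify -"], text) gives a
stream whose first two lines are commands and whose remainder is the text
to classify.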

For purposes where this won't do, you can use the "readuntil" command, which
is equivalent to Perl's here-documents and allows you to write things
like: