The basic format of a command to access the
CGI is the above URL followed by a "?", which is
followed by a command. The command has the form
Name=value&Name=value&...&Name=value. The Name=value
pairs are arguments to the CGI script. They are processed left to
right. I'll explain why this is important in the section marked
IMPORTANT TIPS below. The CGI program returns a web page that
contains a header (everything before the BODY tag), the results, and
a footer (everything after the HR tag). The header and footer can be
ignored: they are the same for every command, and their only purpose
is to identify the software.
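Since the header and footer carry no results, a script can discard them before parsing. The following is a minimal Python sketch, assuming the page contains one opening BODY tag and that the first HR tag marks the start of the footer. The function name extract_results and the sample page contents are hypothetical and are not taken from the actual CGI output.

```python
import re

def extract_results(page: str) -> str:
    """Return the text between the opening BODY tag and the first HR tag."""
    # Header: everything up to and including the opening BODY tag.
    body = re.search(r"<body\b[^>]*>", page, flags=re.IGNORECASE)
    start = body.end() if body else 0
    # Footer: everything from the first HR tag onward.
    hr = re.search(r"<hr\b[^>]*>", page[start:], flags=re.IGNORECASE)
    end = start + hr.start() if hr else len(page)
    return page[start:end].strip()

# Made-up page used only to exercise the function.
sample = "<html><head><title>Lemur</title></head><BODY>\n12 7\n<HR>footer</body></html>"
print(extract_results(sample))  # prints "12 7"
```

If a tag is missing, the sketch falls back to the start or end of the page instead of raising an error.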

Word Statistics Only. To obtain corpus
statistics for a word such as "stars", use the "c=<TERM>"
command preceded by a database command "d=<DATABASE NUMBER>":
http://fiji4.ccs.neu.edu/~zerg/lemurcgi/lemur.cgi?d=0&c=stars
This command retrieves the ctf (collection term frequency) and df (document frequency) of
the term. Collection term frequency is the number of times the term appears in the collection.
Document frequency is the number of documents in the collection that contain the term.
These statistics are inputs to the retrieval models. For stemmed databases,
Lemur stems the terms automatically, so you do not need to stem them yourself.
You should remove stop words from the query before you use Lemur.
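Because the arguments are processed left to right, a script should build the command from an ordered list of pairs so that the database command stays in front of the term command. Here is a small Python sketch. The helper names build_command_url and term_statistics_url are hypothetical, and it is an assumption that the CGI accepts standard URL encoding for unusual characters.

```python
from urllib.parse import urlencode

BASE_URL = "http://fiji4.ccs.neu.edu/~zerg/lemurcgi/lemur.cgi"

def build_command_url(pairs, base_url=BASE_URL):
    """Join (name, value) pairs into a command, preserving their order."""
    # urlencode keeps the order of a list of tuples, so d=... stays first.
    return f"{base_url}?{urlencode(pairs)}"

def term_statistics_url(term, database=0):
    """URL for the ctf and df of a term in the given database."""
    return build_command_url([("d", database), ("c", term)])

print(term_statistics_url("stars"))
# prints http://fiji4.ccs.neu.edu/~zerg/lemurcgi/lemur.cgi?d=0&c=stars
```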

Inverted Lists and Word Statistics. To obtain the inverted list for a term, use
the "v=<TERM>" command. Prefix this command with
the database command "d=<DATABASE NUMBER>":
http://fiji4.ccs.neu.edu/~zerg/lemurcgi/lemur.cgi?d=0&v=star
This command retrieves the ctf (collection term frequency) and df (document frequency) of
the term in addition to the inverted list. For stemmed databases you do not need to stem
the provided terms. However, you need to convert all the words to lowercase. Remove stop words from the query before using Lemur.
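A script can apply the lowercase rule when it builds the inverted list request. The Python sketch below is one way to do that. The function names inverted_list_url and fetch_page are hypothetical, and decoding the page as UTF-8 (with undecodable bytes replaced) is an assumption about the server's output.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://fiji4.ccs.neu.edu/~zerg/lemurcgi/lemur.cgi"

def inverted_list_url(term, database=0):
    """URL for the inverted list of a term; the term is lowercased first."""
    params = [("d", database), ("v", term.lower())]
    return f"{BASE_URL}?{urlencode(params)}"

def fetch_page(url, timeout=30):
    """Download a page and return it as text (assumed to be UTF-8)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")

print(inverted_list_url("Star"))
# prints http://fiji4.ccs.neu.edu/~zerg/lemurcgi/lemur.cgi?d=0&v=star
```

Calling fetch_page(inverted_list_url("star")) would then return the full page, header and footer included.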

External Ids. Documents have both internal and
external ids. External ids like WSJ890803-0148 are typically a
combination of source and date information. To retrieve a document
by its external ID, use the "e=" command. For example:
http://fiji4.ccs.neu.edu/~zerg/lemurcgi/lemur.cgi?e=AP890101-0001
The above command will return the document AP890101-0001 in SGML
format.

Internal Ids. Documents have both internal and
external ids. Internal ids are integers. They are used by Lemur
for efficiency. To retrieve a document in SGML format from the internal ID,
use the "i=" command. For example:
http://fiji4.ccs.neu.edu/~zerg/lemurcgi/lemur.cgi?i=1

Process the query file manually. Remove punctuation. Remove stop words. Convert
the remaining words to lowercase. Check words with
hyphens, as they may not be handled properly by Lemur. Also check abbreviations with periods, such as U.S.A.
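The steps above could be sketched in Python as follows. The stop list shown is a tiny hypothetical placeholder. Splitting hyphenated words into parts and collapsing "U.S.A." into "usa" are assumptions about what suits the index, so compare the resulting terms against Lemur's statistics before relying on them.

```python
import re

# Hypothetical, deliberately tiny stop list; substitute the list your project uses.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def preprocess_query(text, stop_words=STOP_WORDS):
    """Return the lowercase query terms with punctuation and stop words removed."""
    # Collapse dotted abbreviations such as "U.S.A." into "USA" (assumed policy).
    text = re.sub(r"\b(?:[A-Za-z]\.){2,}",
                  lambda m: m.group(0).replace(".", ""), text)
    # Split hyphenated words into their parts (assumed policy).
    text = text.replace("-", " ")
    # Replace any remaining punctuation with spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    return [word for word in text.lower().split() if word not in stop_words]

print(preprocess_query("The U.S.A. and state-of-the-art telescopes."))
# prints ['usa', 'state', 'art', 'telescopes']
```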

We recommend the use of a scripting language like Perl, Python or Ruby to implement the project.

Use regular expressions for parsing where possible. Parsing will be
the most code-intensive part. It is possible to implement the project
in fewer than 160 lines (600 words, 6000 characters).
Here is an example of how Ruby can handle parsing of the inverted list page:

Make sure the retrieval formulas are implemented correctly.
A common error occurs in the language model. Under the language model,
a word that appears in the query but not in a retrieved document
still contributes a nonzero score. In traditional IR a document
is considered for scoring even if it contains only one of several
query words. Therefore the missing words must still be scored.
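To illustrate the point about missing words, here is a Python sketch of a query-likelihood score. It assumes Jelinek-Mercer smoothing with a weight of 0.8, which is a choice made for this example and not necessarily the formula the project specifies. The function name, parameter names, and numbers are all hypothetical.

```python
import math

def jm_log_score(query_terms, doc_tf, doc_len, ctf, collection_len, lam=0.8):
    """Log query likelihood of one document under Jelinek-Mercer smoothing."""
    score = 0.0
    for term in query_terms:
        # Probability of the term in the document; 0 when the term is missing.
        p_doc = doc_tf.get(term, 0) / doc_len if doc_len else 0.0
        # Probability of the term in the whole collection, from its ctf.
        p_coll = ctf.get(term, 0) / collection_len
        p = lam * p_doc + (1 - lam) * p_coll
        if p > 0:  # a term absent from the entire collection is skipped
            score += math.log(p)
    return score

# Made-up statistics: the document contains "star" but not "galaxy".
doc_tf = {"star": 3}
ctf = {"star": 1000, "galaxy": 500}
with_missing = jm_log_score(["star", "galaxy"], doc_tf, 100, ctf, 1_000_000)
star_only = jm_log_score(["star"], doc_tf, 100, ctf, 1_000_000)
print(with_missing < star_only)  # prints True
```

The missing word "galaxy" lowers the score through its collection probability. Skipping it would make the document look better than it should.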