A text indexer using B-trees

Customers have asked about text-indexing software. We now have a
text-indexer which stores associatons between words and numbers. As we
show in this note, one example of the use of this tool is to index all
words (3 characters or longer, and excluding a designated list of
words to ignore) in a collection of files. The indexing uses
AllegroCache and B-trees.

The example uses four files with are normally put in a directory
named text-index/. A zipped tar file named
text-index.tar.gz
can be downloaded (click on the filename). Extract that file into a
convenient directory, such as the Allegro 8.1 directory (where Allegro
CL 8.1 was installed). A subdirectory named text-index/ will be
created with the files listed below.

The file included are:

text-index.cl: the text-indexing software.

string-hash.cl: an auxiliary file needed by text-index.cl.

text-index.txt: documentation for the text-indexing
functionality.

tutorial.cl: forms to evaluate to illustrate the text-indexing
functionality.

The first step in creating a text index is to create an index
file. This is done with the function open-text-index and it returns an
text-index instance. Then you add to this index strings and
locations. The following transcript illustrates this:

;; First we set the value of user::*test-index-dir*. Where
;; we are running Lisp, it is the text-index/ subdirectory of the
;; current directory:
cl-user(3): (setf user::*text-index-dir* (pathname "text-index/"))
#P"text-index/"
;; Now we compile and load the file test-index.cl. The file
;; string-hash.cl should also be compiled and loaded:
cl-user(4): :cload text-index/text-index.cl
;;; Compiling file text-index/text-index.cl
;;; ...
;;; Compiling file text-index/string-hash.cl
;;; ...
; Fast loading text-index/text-index.fasl
;; We use the text-index package so we do not have to qualify
;; variable and operator names:
cl-user(10): (use-package :text-index)
t
;; We defined a filename for the index and we create (or open, if
;; it exists) the index:
cl-user(14): (defparameter testfname "text-index/tutorial.ti")
testfname
cl-user(15): (defparameter ind (open-text-index testfname :if-does-not-exist :create))
ind
cl-user(16):
;; Now things are set up, we add items to the index:
cl-user(16): (index-word ind "jans" 1)
nil
cl-user(17): (index-word ind "ahmon" 1)
nil
cl-user(18): (index-word ind "fritz" 2)
nil
cl-user(19): (index-word ind "kevin" 2)
nil
cl-user(20):

The index associates strings with a value. This value should be an
integer, one that can be represented by a (signed-byte 32). We have
indexed four names, jans, ahmon, fritz, and kevin, and placed two in
location 1 and two in location 2.

So far, this may seem simple and unimpressive. The power of the text indexer
is that it can place very, very many words in an index very quickly
and extract index information very quickly. So far we haven't done
anything that could not as easily be done with hashtables (the key is
the string and the value the location). But hashtables are hard to
make persistent (they do not have a easy representation storable in
files) and they tend to work less well when you have tens or hundreds
of millions of entries.

We are not in this example going to have millions of entries because
there is no convenient suitable collection of files which all users
reading this note will have access to. All we can do it show how it
can be used to index a collection of files and then you can try it out
on some collection of your own.

Anyway, we have a small and simple index. Let us demonstrate a few
operations on it:

;; You find index entries with LOOKUP-WORD:
cl-user(20): (lookup-word ind "jans")
(1)
;; Wildcards like * and ? are supported:
cl-user(24): (lookup-word ind "jan*" :wild t)
(1)
cl-user(25): (lookup-word ind "j?ns" :wild t)
(1)
;; LOOKUP-TEXT looks up several words and returns objects that
;; contains all:
cl-user(26): (lookup-text ind "jans ahmon")
(1)
cl-user(27): (lookup-text ind "jans kevin")
nil
;; Even though both jans and kevin are in the index, jans is indexed
;; at location 1 and kevin at location 2 so no location contains
;; them both.
;; You can insert multiple words with INDEX-STRING:
cl-user(28): (index-string ind "jans is away for a week" 1)
nil
;; Each word is extracted and added to the index at the
;; specified location (1 in the example):
cl-user(36): (lookup-word ind "away")
(1)
cl-user(37): (lookup-text ind "week for")
(1)
;; But "is" is not there:
cl-user(38): (lookup-word ind "is")
nil
;; The indexer (as implemented) does not index works less than
;; 3 characters nor more than *max-word* characters, nor any of
;; a defined list of stopwords (the value of *stop-words*) which
;; includes words like "was", "their", "and", "not", and so on.
;; We are done with the simple example. Let us close the index and
;; then delete it:
cl-user(39): (close-text-index ind)
nil
cl-user(40): (delete-text-index testfname)
t

Now let us do a more complicated example. We will index all
appropriate files in a directory. (By "appropriate", we mean text
files of interest.) We need a way to associate integers (which will
be used as locations) with filenames. We use a hashtable whose keys
are integers are whose values are filenames.

The work of do-one-dir is the (index-file ind name c)
form. Everything else is just setup to get at the files to be
indexed. index-file finds all words in a file and then adds
them to the index. Each file is thus a
location. See text-index.txt for a description of other
indexing options.

We load do-one-dir into Lisp and use it on a directory. We use
it on the directory text-index/ (which presumably users trying
out the code in this note will have):

We indexed the three .cl files and the one .txt file. Now let us look
for some words:

cl-user(56): (lookup-word ind "defun")
(2 1 0)
cl-user(57): (lookup-word ind "jans")
(2)
;; Well, defun appears in the Lisp source files but not in the txt file
;; and jans appears in the tutorial file (2) because it contains the
;; examples we have been running. But the numeric indexes are not
;; that useful. Here is a function that converts the numeric designations
;; to filenames:
cl-user(58): (defun get-files (lis)
(let (res)
(dolist (i lis)
(push (gethash i filetable) res))
(nreverse res)))
get-files
cl-user(59): (get-files (lookup-word ind "defun"))
("text-index/tutorial.cl"
"text-index/text-index.cl"
"text-index/text-index.txt")
cl-user(60):
;; Suppose we wanted to find the definition of LOOKUP-WORD.
;; The file must contain "defun lookup-word". LOOKUP-TEXT is not
;; quite right, because it finds the files that contains both words
;; but not necessarily together:
cl-user(60): (get-files (lookup-text ind "defun lookup-word"))
("text-index/text-index.txt"
"text-index/text-index.cl"
"text-index/tutorial.cl")
cl-user(61):
;; No real help there. But the function LOOKUP-PHRASE does what
;; we want. It takes an index and a phrase (just like LOOKUP-TEXT)
;; but it also take a function that takes a index location and returns
;; a string to be searched for the desired phrase (using the Lisp
;; SEARCH function). Now we will find the definition. The function we
;; we use is (lambda (a) (excl:file-contents (gethash a filetable))).
;; excl:file-contents returns a file's contents as a single string.
cl-user(61): (lookup-phrase ind "defun lookup-word"
(lambda (a)
(file-contents (gethash a filetable))))
(1)
cl-user(62): (get-files *)
("text-index/text-index.cl")
cl-user(63):
;; The definition of LOOKUP-WORD is in text-index.cl.

lookup-phrase works by applying lookup-text to the
phrase so that the files that contain all the words of the phrase are
found (but the words are just present, not necessarily forming the
phrase). Then the argument function is applied so the strings to search
are extracted and searched for the phrase.

In the file tutorial.cl, near the end, there is some
elaborate code for indexing Lisp files. It starts:

The function build-my-lisp-index indexes a directory of Lisp
and related source files. If you have a directory of Lisp source
files, you can test it out.

We have only given a short introduction to the text indexer. See
text-index.txt for documentation of all the defined functions.

The text indexing code is free to use for any purpose (we ask only
that you acknowledge Franz Inc. as the original source and require
others to do so as well). No text indexer is perfect and ours can
certainly be improved. If any user makes improvements or comes up with
different approaches, and wishes these publicized, please tell
us. Again click on the filename to download the text-indexing
files: text-index.tar.gz.