Searching for Text (Part I)

Do you deal a lot with reading or writing text? Do you often use search
tools? Do you have a pile of data sitting on your web and file
server(s)? Many of us do. How do you organise your collection of text
data? Do you use a directory, an index, or a database? In case you
haven't decided yet, let me suggest a few options.

Documents and Dealing with Text

I will focus on organising, indexing, and searching text data. This is
sufficient for a start, since a lot of search queries can be reduced to
text. In addition, processing text is harder than it seems, so it's good
to have a focus. You may note that I make a distinction between documents
and text data; the reason is the sheer variety of document formats. Some
of them are well-defined, some aren't. Some have open specifications
readily available to developers. Proprietary document formats are always
a barrier for data processing. Unfortunately, such formats often cannot
be avoided.

The first thing you have to do is to organise your data in some way. It
doesn't matter whether you populate a file server with a directory
structure and start copying data, or whether you keep a list of bookmarks
in your browser. The most important aspect is to have some kind of unique
identifier or reference for every single document. Uniform Resource
Locators (URLs) work well; a path to a file along with its name is also
perfectly fine. It's best if you manage to group your documents into a
list of categories. The next thing to consider is the document formats.
Most indexing and search tools can only handle plain text, so a document
format that can be converted to text is suitable for processing. Here are
some example conversions done with shell commands:

PDF: pdftotext -q -eol unix -enc UTF-8 "$IN" - > "$OUT"

Postscript: pstotext "$IN" | iconv -f ISO-8859-1 -t UTF-8 -o "$OUT" -

MS Word: antiword "$IN" > "$OUT"

HTML: html2text -nobs -o "$OUT" "$IN"

RTF: unrtf --nopict --text "$IN" > "$OUT"

MS Excel: py_xls2txt "$IN" > "$OUT"

any OpenOffice document: ooo_as_text "$IN" > "$OUT"

The variable $IN denotes the source document and $OUT is the name and
location of the converted content in plain text. (The variables are quoted
so that paths containing spaces don't break the commands.) In order to
cope with all possible character encodings, it is always useful to convert
to a suitable Unicode encoding; I usually use UTF-8 for this purpose.
Converting to UTF-8 from any other encoding works well; converting from
UTF-8 to an encoding with a smaller character repertoire is "lossy", since
characters without a representation in the target encoding get dropped or
replaced.
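The individual commands can be tied together in a small dispatcher that picks a converter by file extension and always emits UTF-8. This is only a sketch; the function name and the per-extension tool choices are my assumptions and should be adjusted to the converters you actually have installed:

```shell
# convert_to_text IN OUT - convert one document to UTF-8 plain text.
# The choice of converter per extension is an assumption; adjust as needed.
convert_to_text () {
    IN=$1
    OUT=$2
    case "$IN" in
        *.pdf)  pdftotext -q -eol unix -enc UTF-8 "$IN" - > "$OUT" ;;
        *.ps)   pstotext "$IN" | iconv -f ISO-8859-1 -t UTF-8 -o "$OUT" - ;;
        *.doc)  antiword "$IN" > "$OUT" ;;
        *.html) html2text -nobs -o "$OUT" "$IN" ;;
        *.rtf)  unrtf --nopict --text "$IN" > "$OUT" ;;
        *.txt)  iconv -f ISO-8859-1 -t UTF-8 "$IN" > "$OUT" ;;  # plain text: just recode
        *)      echo "no converter for $IN" >&2; return 1 ;;
    esac
}
```

Called in a loop over the output of find, this produces the plain text copies that the indexers described below can work on.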

Keep in mind that although some converters can deal with MS Office
documents, these are not good formats for storing information. The formats
are still proprietary, and Microsoft's "free" document specification may
not be used for every purpose (commercial use is explicitly excluded, so
the specs are not free to use). Storing information in these formats will
cause a lot of trouble - especially if the vendor disables old versions of
the format through software updates (this has already happened). That's a
clear and obvious warning: if you have any say in how document collections
are organised, you can avoid a lot of trouble right at the beginning.

Having thought about organising the data, we can now consider how best to
index it. This doesn't mean you are done thinking about the organisation
of the data - that really is the most important step.

MySQL Natural Language Full-Text Searches

MySQL offers the creation of full text indices; this is described in the
manual in the "Natural Language Full-Text Searches" section. It is an
easy way of indexing text data. Let's say you have the following table:
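A minimal table definition along these lines could look as follows. This is a sketch: the table and column names are my assumptions, and note that in the MySQL versions current at the time of writing, full text indices require the MyISAM storage engine.

```sql
-- Sketch of a possible schema; table and column names are assumptions.
CREATE TABLE documents (
    id       INT UNSIGNED NOT NULL AUTO_INCREMENT,
    filename VARCHAR(255) NOT NULL,
    path     VARCHAR(255) NOT NULL,
    type     VARCHAR(32),
    content  TEXT,
    PRIMARY KEY (id),
    FULLTEXT (filename, content)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
```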

We'll store the filename, the path information, the file type, and the
converted content in a database table. The VARCHAR data type might be too
small if you have big directory trees, but it's more than enough for a
simple example. Every document gets a unique ID, stored in the field
id. The FULLTEXT() option advises MySQL to create a full
text search index over the columns filename and content.
You can add more columns if you like, but you need to be careful not to
index everything. Adding the type column might also be a
reasonable option.
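Searching is then done with the MATCH ... AGAINST construct. A sketch, assuming the table is named documents as above (the columns listed in MATCH() must be exactly the columns of one FULLTEXT index):

```sql
SELECT id, filename,
       MATCH (filename, content) AGAINST ('lorem') AS score
  FROM documents;
```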

I added a few more sample rows and ran a natural-language search for the
word lorem. The rightmost column of the result displays the ranking:
only rows whose text actually contains lorem receive a score greater
than 0. Now you can add more texts and see what their rating is like.
Note that MySQL uses a specific strategy when performing full text
indexing:

It does not index words shorter than four characters (this minimum is
the default and can be changed).

It does not index any word that appears in more than 50% of the rows.

Hyphenated words count as two words.

Partial words are ignored.

The database engine also maintains a list of common English words (stop words); they are ignored,
too.

Be careful - if your search query consists solely of stop words, you'll
never get any results. If you need a full text search in languages other
than English you can provide your own set of stop words. The documentation
will tell you how to do this.

It is also possible to search for more than one word by listing multiple
words, separated by commas or spaces, in the search string.
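For example (again assuming the hypothetical documents table), a two-word query restricted to matching rows could look like this:

```sql
SELECT id, filename,
       MATCH (filename, content) AGAINST ('lorem, ipsum') AS score
  FROM documents
 WHERE MATCH (filename, content) AGAINST ('lorem, ipsum');
```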

Full Text Search with PostgreSQL

Of course PostgreSQL can also deal with full text searches - a plugin
called Tsearch2 is available for PostgreSQL database servers prior
to version 8.3.0 (from 8.3.0 on it is integrated). Just as with the MySQL
functions, you can fine-tune the search according to the language your
texts are written in. The content has to be transformed into tokens, and
PostgreSQL offers new database objects that deal with these operations.
The Tsearch2 engine provides text parsers for tokenisation, dictionaries
for normalisation of tokens (and for lists of stop words), templates for
switching between parsers or dictionaries, and configurations for
whatever language you need. Creating new database objects of these kinds
requires knowledge of C programming.

Let's recreate the example table in PostgreSQL (I use version 8.3.0; if
you have an older version, please install Tsearch2):
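A sketch of the PostgreSQL version of the schema (table, column, and index names are again my assumptions):

```sql
CREATE TABLE documents (
    id       SERIAL PRIMARY KEY,
    filename VARCHAR(255) NOT NULL,
    path     VARCHAR(255) NOT NULL,
    type     VARCHAR(32),
    content  TEXT
);

-- GIN index over the lexemes of the converted content
CREATE INDEX idx_documents_content
    ON documents
 USING gin (to_tsvector('english', content));
```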

First we create the table, then the GIN (Generalized Inverted Index) over
the text; this type of index stores distinct lexemes. The function
to_tsvector() converts the text stored in the content
column into these normalised words, using the English parser and
dictionaries. Search queries look like this:
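A minimal query, assuming the hypothetical documents table sketched above:

```sql
SELECT id, filename
  FROM documents
 WHERE to_tsvector('english', content) @@ to_tsquery('english', 'lorem');
```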

You use a normal SELECT with the @@ text match operator. This operator
compares the document and the search string after both have been converted
to lexemes, by the to_tsvector() and to_tsquery() functions respectively.
The matching rows are returned by the SELECT statement.
You can also use ranking in order to sort the results.
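Ranking can be added with the ts_rank() function; a sketch with the same assumed table (the & in the query string means AND):

```sql
SELECT id, filename,
       ts_rank(to_tsvector('english', content),
               to_tsquery('english', 'lorem & ipsum')) AS rank
  FROM documents
 WHERE to_tsvector('english', content) @@ to_tsquery('english', 'lorem & ipsum')
 ORDER BY rank DESC;
```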

The parser also recognises structured input such as URLs: it splits them
into their parts and labels every token it finds. These token types allow
for better search query processing, and this is why the query string gets
converted as well - the text search compares tokens, not the raw strings
themselves.
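The classification can be inspected with the ts_debug() function; fed a URL (the example address is made up), it reports token types such as protocol, host, and url_path:

```sql
SELECT alias, token
  FROM ts_debug('english', 'http://example.org/docs/intro.html');
```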

Conclusion

I presented only two ways to index text data. This is really only the tip
of the iceberg - there's a lot more to learn about full text searches. Both
MySQL and PostgreSQL have convenient algorithms ready for use that
facilitate finding documents. You can use a simple Perl script with either
one of these database engines, feed them your browser bookmarks and build
an index with the content of the web pages, ready for search queries. There
are many other tools available, and I'll present another way of indexing in
the next part of this series. If you use something different or interesting
to accomplish these tasks, please write and let us know about it!

René was born in the year of Atari's founding and the release of the game Pong.
From his early youth on he took things apart to see how they work, and he
couldn't even pass construction sites without looking for electrical wires
that might seem interesting. His interest in computing began when his
grandfather bought him a 4-bit microcontroller with 256 bytes of RAM and a
4096-byte operating system, forcing him to learn assembler before any other
language.

After finishing school he went to university to study physics. Along the way
he gained experience with a C64, a C128, two Amigas, DEC's Ultrix, OpenVMS,
and finally GNU/Linux on a PC in 1997. He has been using Linux ever since and
still likes to take things apart and put them together again. Freedom of
tinkering brought him close to the Free Software movement, where he puts some
effort into the right to understand how things work. He is also involved with
civil liberty groups focusing on digital rights.

Since 1999 he has been offering his skills as a freelancer. His main
activities include system/network administration, scripting, and consulting.
In 2001 he started to give lectures on computer security at the Technikum
Wien. Apart from staring into computer monitors, inspecting hardware, and
talking to network equipment, he is fond of scuba diving, writing, and
photographing with his digital camera. He would like to have a go at
storytelling and roleplaying again as soon as he finds some more spare time
on his backup devices.