Automated Text Processing

There are many instances where automating a text processing chore
is worthwhile, because it can save time and money compared with
processing many large text files by hand.

Some of the most common examples of text that can be processed by
computer are mailing or address lists, web site content, source code,
web data entry forms, emails, stock tickers, books and literature,
scientific data such as genome data, or numerical data in the form of
tables of text (tab or comma separated text is a common method of
transferring tabular data such as spreadsheets).

There are many reasons to process text, and many different types
of processing. Search engines, for example, process text to create an
index so that words and phrases may be found quickly. Other text
processing tasks create or format data using a raw text stream as
input. A real-time news feed from a news service like Dow Jones, for
instance, requires significant processing to add line breaks, strip
off header information, and tidy up the layout before it becomes a
human-readable news story to be put on Yahoo News or another web
site.

In automated text processing I think of two different "paradigms."
One is transformational programming where the goal is to take
text A and turn it into text B in a systematic way. The other is
data extraction where the goal is to extract information
about the text for use in a database or search engine or other similar
system. This distinction is useful when considering what sorts of
tasks and tools are necessary for your project.
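As a rough illustration of the two paradigms, here is a minimal Python sketch. The function names and the sample string are made up for this example: `transform` turns text A into text B, while `extract` returns facts about the text that could be stored in a database or index.

```python
import re

def transform(text: str) -> str:
    """Transformational: text in, text out (collapse runs of spaces and tabs)."""
    return re.sub(r"[ \t]+", " ", text)

def extract(text: str) -> dict:
    """Data extraction: information about the text, not a new version of it."""
    words = re.findall(r"[A-Za-z]+", text)
    return {"lines": len(text.splitlines()), "words": len(words)}

# Hypothetical sample input.
sample = "Dow   Jones\tnews\nsecond  line"

print(transform(sample))
print(extract(sample))
```

The first call produces a cleaned-up copy of the text; the second produces a small record describing it.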

Tools suitable for transformational programming include macro
languages, text editors, UNIX shell utilities, and text processing
languages like awk and Perl. To an extent, other high-level languages
such as Python and the Lisp family are also suitable. I don't
classify C or other lower-level languages as suitable, because
development in them takes too long and they don't provide
text-specific tools within the languages themselves. However, there
is a family of little languages, some of which compile into C
programs, that are extremely powerful. I am thinking of compiler
toolkits such as Lex and YACC.

What are these tools? And how can you take advantage of them?

Transformational Programming

If your goal is to create text output from text input, then you
are doing transformational programming. Many programmers would
immediately reach for Perl and begin hacking away until they had
something that seemed to work. This is generally a bad idea. Perl may
turn out to be the language of choice, but before reflexively
grabbing it, it pays to consider other options. Shell tools like
textutils, macro languages such as M4, and text editors such as sed
are some of the first transformational programming tools one should
consider. Why? Because they are simple, and much more easily debugged
interactively.

A macro language is a system specifically for transforming one text
into another. The C preprocessor is a macro language, albeit a clumsy
one. In Lisp, you can create macros using Lisp itself. This is one of
the most powerful features of the language, and lets you create small
programs to write your larger program for you. In SAS, a programming
language I personally hate, macros are the only means to get a
full-featured, Turing-complete language; the base SAS system without
the macro language is actually terribly limited.

M4 is a general purpose macro language which is designed to be used
on any sort of text. Some examples of what you can do with M4 are to
create web pages (all of the pages on my site were created with the
aid of M4 and "make"), to extract specific portions of a text (such as
all the links from an HTML file, or a list of dependency files for
your C programs), to insert information into text files (such as
dates, copyright information, and so forth), and to restructure a file
(for example, putting one data record on each line).
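To make the idea of macro expansion concrete, here is a minimal sketch in Python rather than in M4 itself. The macro names `LINK` and `COPYRIGHT`, their templates, and the `expand` function are all hypothetical. Unlike real m4, this toy expander does not rescan its output or handle nested calls; it only shows the basic substitution of arguments into a template.

```python
import re

# Hypothetical macro table: each name maps to a template,
# where $1, $2, ... stand for the call's arguments.
MACROS = {
    "LINK": '<a href="$1">$2</a>',
    "COPYRIGHT": "Copyright $1 Example Author",
}

# Matches an upper-case name followed by a flat argument list.
CALL = re.compile(r"\b([A-Z_]+)\(([^()]*)\)")

def expand(text: str, macros: dict = MACROS) -> str:
    """Replace each known macro call with its expanded template."""
    def replace(match):
        name, raw_args = match.group(1), match.group(2)
        if name not in macros:
            return match.group(0)  # leave unknown names untouched
        args = [arg.strip() for arg in raw_args.split(",")]
        body = macros[name]
        for position, arg in enumerate(args, start=1):
            body = body.replace(f"${position}", arg)
        return body
    return CALL.sub(replace, text)

page = "See LINK(https://example.com, the site). COPYRIGHT(2024)"
print(expand(page))
```

Changing a template in the table changes every page that uses the macro, which is the main attraction of generating web pages this way.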

The most effective way to harness m4 is to use it when writing more
complicated programs. By creating macros that expand into often
repeated or complicated expressions, you can automate the task of
writing other code, such as web pages. Combined with sed, the m4
macro language can be used for general manipulation of text files.
While sed itself is a very powerful language, it becomes unwieldy for
anything beyond simple substitutions and deletions. However, by using
sed to substitute macro calls into a file and passing the new file on
to m4, you can harness the power of m4 to make more sophisticated
changes.

This illustrates a common principle of transformational
programming: stepwise refinement. If you can transform a text from A
to B, then B to C, and C to D, you can make complicated changes
through several simple steps.
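Stepwise refinement can be sketched as a chain of small functions, each taking text and returning text. This is a Python illustration under my own assumptions: the step names, the `pipeline` helper, and the `CHAPTER(...)` marker are invented for the example.

```python
import re
from functools import reduce

def strip_trailing_space(text: str) -> str:
    """A to B: remove spaces and tabs at the end of every line."""
    return re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)

def collapse_blank_lines(text: str) -> str:
    """B to C: squeeze runs of blank lines down to a single blank line."""
    return re.sub(r"\n{3,}", "\n\n", text)

def mark_headings(text: str) -> str:
    """C to D: turn 'Chapter N' lines into a hypothetical macro call."""
    return re.sub(r"^Chapter (\d+)$", r"CHAPTER(\1)", text, flags=re.MULTILINE)

def pipeline(text: str, steps) -> str:
    """Feed the output of each step into the next one."""
    return reduce(lambda current, step: step(current), steps, text)

raw = "Chapter 1  \n\n\n\nIt begins.\t\n"
steps = [strip_trailing_space, collapse_blank_lines, mark_headings]
print(pipeline(raw, steps))
```

Each step is trivial to test on its own, and the order matters: `mark_headings` only recognizes the heading because the trailing spaces were removed first.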

To rearrange the chapters of your book, for example, you might
edit all the chapter headings with sed to insert m4 macros, and then
use m4's "diversions" to reorder the chapters or split them into
separate files. Or you might edit your address list to insert
macros that expand into SQL commands, then pass them to an SQL
interpreter to insert the names of all your clients into your
relational database. You might even use sed to extract certain
statements from your source code, change them into calls to an m4
macro, and have the m4 macro expand into Prolog statements. The Prolog
statements could provide raw data for a program that deduces which
portions of your data processing system will fail if one of your data
vendors doesn't meet its deadlines. In other words, the technique is
very powerful.
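The address list example might look roughly like this in Python. Note the swap: instead of expanding macros into SQL text, the sketch passes each record to SQLite through a parameterized INSERT, which avoids quoting problems. The table name `clients`, its columns, and the tab-separated sample data are assumptions made for the illustration.

```python
import csv
import io
import sqlite3

# Hypothetical address list: one client per line, name and city separated by a tab.
ADDRESS_LIST = "Ada Lovelace\tLondon\nGrace Hopper\tArlington\n"

def load_clients(text: str, conn: sqlite3.Connection) -> None:
    """Turn each line of the address list into one row of the clients table."""
    conn.execute("CREATE TABLE IF NOT EXISTS clients (name TEXT, city TEXT)")
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    conn.executemany("INSERT INTO clients (name, city) VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
load_clients(ADDRESS_LIST, conn)
print(conn.execute("SELECT name, city FROM clients ORDER BY name").fetchall())
```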

Data Extraction

The boundary between data extraction and transformational
programming is fuzzy (the address list example given above could fall
on either side), but there are some tasks for which macros and
editors do not fit well. An example would be making a table of word
frequencies, or trying to detect common grammatical errors. Actually,
the word frequency table could be made by transforming all
non-letters into line breaks, deleting blank lines, sorting the
resulting word list, and counting how often each word repeats. This
example shows that transformational programming may be more powerful
than you realize. Still, there are times when you want to write a
program in a command-oriented language, and high-level programming
languages are an excellent tool for that. Lexical analyzers, parsers,
and programming languages like Python, Perl, Scheme, Emacs Lisp, or
any other of your favorites come in handy when you need to parse,
extract, compare, and quantify text contents.
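The word frequency recipe can be followed almost literally. Here is a short Python sketch; the function name is my own, and lower-casing the text first is an added assumption so that "The" and "the" count as one word.

```python
import re
from itertools import groupby

def word_frequencies(text: str) -> list:
    """Build a word frequency table through a series of simple transformations."""
    # Step 1: turn every run of non-letters into a line break.
    one_per_line = re.sub(r"[^A-Za-z]+", "\n", text.lower())
    # Step 2: delete the blank lines.
    words = [line for line in one_per_line.split("\n") if line]
    # Step 3: sort, so identical words end up next to each other.
    words.sort()
    # Step 4: count each run of identical words.
    return [(word, len(list(run))) for word, run in groupby(words)]

print(word_frequencies("The cat saw the other cat."))
# -> [('cat', 2), ('other', 1), ('saw', 1), ('the', 2)]
```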

There are some especially powerful techniques created by computer
scientists who study computer language theory. Parser generators and
lexical analyzer generators can write programs for you, a technique
called code generation. Combining these with a high-level language
allows you to process computer languages and summarize your own code,
helping you understand it better.
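As a small taste of the idea, here is a table-driven lexical analyzer sketched in Python. It is hand-written with the standard `re` module rather than generated by Lex, but the table of token patterns plays the same role as a Lex specification. The token names and the toy expression language are assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical token table: (token name, regular expression), tried in order.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[-+*/=()]"),
    ("SKIP", r"\s+"),
    ("ERROR", r"."),
]

# Combine the table into one pattern with a named group per token kind.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source: str):
    """Yield (kind, text) pairs for each token in the source string."""
    for match in MASTER.finditer(source):
        kind = match.lastgroup
        if kind == "SKIP":
            continue  # whitespace carries no information here
        if kind == "ERROR":
            raise ValueError(f"unexpected character {match.group()!r}")
        yield kind, match.group()

tokens = list(tokenize("total = price * 2.5"))
print(tokens)
# Summarize the code: how many tokens of each kind does it contain?
print(Counter(kind for kind, _ in tokens))
```

Adding a new kind of token means adding one row to the table, not rewriting the scanning loop.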