Chapter 11: Text Processing

Document processing is rapidly becoming one of the dominant functions
of computers. Computers are used to edit documents, to search
documents, to transport documents over the Internet, and to display
documents on printers and computer screens. Web `surfing' and Web
searching are becoming significant and important computer
applications, and many of the key computations in all of this document
processing involve character strings and string pattern matching.
For example, the Internet document
formats HTML and XML are primarily text formats,
with added tags for multimedia
content. Making sense of the many terabytes of information
on the Internet requires a considerable amount of text processing.

In this chapter, we
study several of the fundamental
text processing
algorithms for quickly performing important string operations. We pay
particular attention to algorithms for string searching and pattern
matching, since these can often be computational bottlenecks in many
document-processing applications.
We also study some fundamental data structure and algorithmic issues involved
in text processing, as well.

The progression of topics studied in this chapter continues
to follow our abstract data type approach.
The terminology and notation for the string ADT, which is used
in this chapter, is defined early.
It turns out that representing a string as an array of characters is quite
simple and efficient, so we don't spend a lot of attention on string
representations.
Nevertheless, the string ADT includes an interesting
method for string pattern matching, and
we study pattern matching algorithms in this chapter.
We also study the trie data structure,
which is a tree-based structure that allows
for fast searching in a collection of strings.
We also study an important text processing problem,
namely, the problem of compressing a document of
text so that it fits more efficiently in storage or can be transmitted more
efficiently over a network.
We study another text processing problem, as well,
which deals with how we can measure the similarity between two documents.
All of these problems are topics that arise often in Internet computations,
such as Web crawlers, search engines, document distribution,
and information retrieval.

In addition to having interesting applications, the topics of this
chapter also highlight some important algorithmic design patterns.
In particular, in the section on pattern matching, we discuss the
brute-force method,
which is often inefficient but has wide applicability.
For text compression we study the greedy method,
which often allows us to
approximate solutions to hard problems, and for some problems
(such as in text compression) actually gives rise to optimal algorithms.
Finally, in discussing text similarity, we introduce the
dynamic programming
design pattern, which can be applied in some special instances to solve a
problem in polynomial time that appears at first to require exponential time
to solve.