Text Processing in Python

David Mertz

Text Processing in Python is not for the casual scripter who wants
solutions to immediate problems. It has plenty of concrete examples,
but it's not a cookbook; it describes the contents of standard modules
and libraries, but it's not a reference. Rather, it uses examples and
explanations to explore the fundamental ideas behind, and features of,
both text processing and Python.

Misleadingly titled "Python Basics", the opening chapter is likely to
scare off many readers. Mertz begins with an introduction to functional
programming in Python, followed by a tutorial on polymorphism and class
construction. He then covers the mechanics of actually running Python
and some of the standard modules for filesystem access and interfaces
with operating systems.

The core of Text Processing in Python is in three chapters, on string
handling, regular expressions, and parsing. "Basic String Operations"
works through some examples of common tasks: sorting, reformatting,
counting, encoding binary data as ASCII, and more. It then goes through
the contents of the string module and modules for memory-mapped files
(mmap) and StringIO, binary/ASCII conversions, cryptography, compression,
and Unicode handling.
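
To give a flavour of the kind of task this chapter covers, here is a
minimal sketch of two of them, counting words and encoding binary data
as ASCII. It is my own illustration in current Python, not code from
the book; the function names are invented, and base64 is just one of
several possible encodings.

```python
import base64
from collections import Counter


def word_counts(text):
    """Count whitespace-separated words, ignoring case."""
    return Counter(text.lower().split())


def to_ascii(data):
    """Encode arbitrary bytes as an ASCII-only string using base64."""
    return base64.b64encode(data).decode("ascii")


def from_ascii(encoded):
    """Reverse to_ascii, recovering the original bytes."""
    return base64.b64decode(encoded)


if __name__ == "__main__":
    print(word_counts("the cat and the hat").most_common(1))
    print(to_ascii(b"\x00\xff\x10"))
```

The round trip through to_ascii and from_ascii should return the
original bytes unchanged, which is the property that matters when
binary data has to pass through a text-only channel.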

A brief regular expression tutorial is followed by a look at some common
tasks, which are used to illustrate progressively more sophisticated
regular expressions. This is followed by detailed exposition of the
standard re module.
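
As a rough illustration of what "progressively more sophisticated"
can mean here, the following sketch refines one pattern in three steps
using the standard re module. The sample text and variable names are
my own inventions, not examples from the book.

```python
import re

# Invented sample text for illustration only.
text = "Invoices dated 2003-06-12 and 2004-01-05 are overdue."

# Step 1: a crude pattern that grabs any run of four digits.
years = re.findall(r"\d{4}", text)

# Step 2: a stricter pattern that matches whole ISO-style dates.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# Step 3: named groups, so each part of a date can be pulled out by name.
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
parts = [m.groupdict() for m in DATE.finditer(text)]

if __name__ == "__main__":
    print(years)
    print(dates)
    print(parts)
```

Each step matches less by accident and exposes more structure, at the
cost of a pattern that is harder to read.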

Mertz warns that parsing is often overkill and suggests that other options
be tried first. He then explains EBNF grammars and state machines,
before working through the mx.TextTools and PLY libraries and assorted other
tools. Readers with no previous exposure to language theory may find
this difficult.
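
For readers who have not met the idea, a state machine for text can be
very small. The sketch below is my own, not taken from the book: it
assumes a made-up input format in which blocks of interest sit between
lines reading BEGIN and END, and it uses two states to collect them.

```python
def extract_blocks(lines):
    """Collect the lines found between BEGIN and END marker lines.

    The BEGIN/END markers are a hypothetical format chosen for
    illustration. The machine has two states, "outside" and "inside".
    """
    state = "outside"
    blocks = []
    current = []
    for line in lines:
        marker = line.strip()
        if state == "outside":
            if marker == "BEGIN":
                # Start a fresh block and switch state.
                current = []
                state = "inside"
        else:  # state == "inside"
            if marker == "END":
                # Close the block and return to the outside state.
                blocks.append(current)
                state = "outside"
            else:
                current.append(line)
    # A block left open at end of input is discarded.
    return blocks


if __name__ == "__main__":
    sample = ["junk", "BEGIN", "a", "b", "END", "more junk", "BEGIN", "c", "END"]
    print(extract_blocks(sample))
```

The appeal is that each line is handled according to the current state
alone, with no grammar or parser library needed, which fits Mertz's
advice to try simpler options first.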

A final chapter looks at tools for email, web and other protocols for
passing text around the Internet. Appendix A provides "a selective and
impressionistic short review of Python" (enough for an experienced
programmer without previous acquaintance with Python), and other
appendices provide background on compression and Unicode.

Text Processing in Python offers a nice combination of foundational
material and practical applications. Its approach means there is
little overlap with other Python books: even when going through standard
libraries, Mertz largely avoids repeating generic material, and there's
none of the padding that's used to flesh out many computing books.
The approach will appeal most to those with a computer science background,
or an inclination that way.

Perhaps most importantly, Text Processing in Python is, at least if
you have the right background, a good read. I found it entertaining
as well as informative: it refreshed some basic computer science for
me and taught me new Python details and approaches, and it has inspired
me to rework the scripts used to format these reviews for the web.

Note: Text Processing in Python is available in full on Mertz's web
site, along with its code examples.