Slicing and dicing data with regular expressions

Strings and Things

Martin Streicher

Regular expressions help you filter through the data to find the information you need.

Most computer systems have an assortment of tools for filtering and processing data. A virus scanner, a spam fighter, a web search engine, a spell checker – each is a filter that sifts through data to isolate the information you really need. Your shell provides a filter, too. For example, ls *.jpg lists only JPEG images.

Because so much of Linux depends on interpreting and processing plain text files, an entire shorthand exists for creating filters. The shorthand is called regular expressions, or regex. A regex applied to text can find, dissect, and extract virtually any pattern you seek. Table 1 shows some common regex operators, which you can string together and use in combination to build arbitrarily complex filters.

Table 1: Common Regular Expression Operators

Operator      Purpose
. (period)    Match any single character.
^             Match the empty string that occurs at the beginning of a line or string.
$             Match the empty string that occurs at the end of a line or string.
A             Match an uppercase letter A.
a             Match a lowercase a.
\d            Match any single digit.
\D            Match any single non-digit character.
\w            Match any single word character (a letter, a digit, or the underscore); in grep, a synonym is [_[:alnum:]].
[A-E]         Match any of uppercase A, B, C, D, or E.
[^A-E]        Match any character except uppercase A, B, C, D, or E. Here, the caret (^) inverts the range operator to exclude any of the characters that appear in the range.
X?            Match zero or one capital X.
X*            Match zero or more capital Xs.
X+            Match one or more capital Xs.
X{n}          Match exactly n capital Xs.
X{n,m}        Match at least n and no more than m capital Xs. If you omit m (X{n,}), the expression matches at least n Xs.
(abc|def)+    Match a string that contains one or more occurrences of the substring abc or the substring def. abc and def would match, as would abcdef and abcabcdefabc.
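As a quick check of the repetition operators in Table 1, the following sketch counts how many test lines each pattern matches in full. The file name xs.txt and the test strings are made up for illustration; -x (match the whole line) and -c (count matching lines) are standard grep options.

```shell
# Hypothetical test data: an empty line, then one to four capital Xs.
printf '%s\n' '' X XX XXX XXXX > xs.txt

grep -c -x -E 'X?' xs.txt       # zero or one X     -> 2
grep -c -x -E 'X*' xs.txt       # zero or more Xs   -> 5
grep -c -x -E 'X+' xs.txt       # one or more Xs    -> 4
grep -c -x -E 'X{2}' xs.txt     # exactly two Xs    -> 1
grep -c -x -E 'X{2,3}' xs.txt   # two or three Xs   -> 2
grep -c -x -E 'X{2,}' xs.txt    # at least two Xs   -> 3

# Grouping with alternation: whole lines built only from abc and def.
printf '%s\n' abc abcdef abcabcdefabc abd | grep -c -x -E '(abc|def)+'   # -> 3
```

One caveat: \d and \D are Perl-style shorthands. With grep -E, the portable spellings are [0-9] (or [[:digit:]]) and [^0-9].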

The origin of regex dates back some 60 years to research in theoretical computer science, a branch of study that includes the design and analysis of algorithms and the semantics of programming languages. That early work described models of computation in a shorthand notation called a "regular expression." The shorthand was first co-opted for use in the QED editor, carried over into the ed editor of the original Unix operating system, and has since expanded into a POSIX standard for pattern matching. Today, one of the most popular implementations of regex is the Perl-Compatible Regular Expressions library, or PCRE. You will find PCRE or Perl-style regex syntax in Perl, Apache, Ruby, PHP, and many other languages and tools.

Introducing Regular Expressions

To understand the purpose of a regular expression, consider a situation in which you need to find all the words in a file that contain a predefined string of characters. One common tool for this task is the Linux grep utility, which scans input line by line looking for a string.

In its simplest operation, grep readily finds a given word and prints the lines that contain the word. Suppose you have a file called heroes.txt that lists the names of familiar caped crusaders (Listing 1), and you want to find all the names that contain man. The command

grep -i man heroes.txt

does the job.

Here, grep scans each line in the file, looking for an m, followed by an a, followed by an n. The letters must appear together and in that order with no intervening characters, but otherwise, they can appear anywhere on the line, even embedded in a larger word. Catwoman, Batman, Spider-Man, Ant-Man, and the others each contain the string man. (The -i option tells grep to ignore letter case.)

Grep can also exclude, rather than include, the lines that match. The -v option omits lines that match a specified pattern. For example,

grep -v -i spider heroes.txt

prints every line except those that contain the string "spider." Batgirl and Batman are printed (among others); Spider-Man and Spider-Woman are not.
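To try these commands without the original listing, the sketch below builds a small stand-in file. The name heroes-sample.txt and its seven entries are assumptions drawn from the names mentioned in this article; the real heroes.txt may contain more.

```shell
# Stand-in for heroes.txt (hypothetical contents; Listing 1 may differ).
cat > heroes-sample.txt <<'EOF'
Catwoman
Batman
Spider-Man
Ant-Man
Batgirl
Spider-Woman
Black Cat
EOF

grep man heroes-sample.txt            # case-sensitive: Catwoman, Batman, Spider-Woman
grep -i man heroes-sample.txt         # ignoring case: also Spider-Man and Ant-Man
grep -v -i spider heroes-sample.txt   # everything except the two Spider- names
```

Note that Spider-Woman matches even the case-sensitive search, because Woman contains a lowercase man.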

What if you only want names of superheroes that begin with Bat or with any of bat, Bat, cat, or Cat? Or perhaps you want to find how many avenger names end with man. In these cases, a simple string search doesn't suffice; you need to seek matches on the basis of content and position.

A regex can specify position – such as the start or end of a line, or the beginning and end of a word. A regex can also describe alternates (i.e., occurrences of this or that pattern); fixed, variable, or indefinite repetition (zero, one, two, or more of any stretch); ranges (e.g., any of the letters between a and m, inclusive); and classes (kinds) of characters (e.g., printable characters or punctuation).

In the rest of this article, I explore some examples of regular expressions that work with grep. Many other Unix tools, including interactive editors Vi and Emacs, stream editors sed and awk, and all modern programming languages also support regex operations.

For more information on regex theory and practice, see the Perl man pages (or see perl.org [1]) and books by Jeffrey Friedl [2] and Nathan Good [3].

Match a Position

To find names that begin with Bat, use:

grep -E '^Bat'

The -E option tells grep to treat the pattern as an extended regular expression. The ^ (caret) character matches the beginning of a line or a string – an imaginary character that appears before the first character of each line or string. The letters B, a, and t are literals and only match those characters. Filtering the contents of heroes.txt, the command

grep -E '^Bat' heroes.txt

produces Batman and Batgirl.

Many regex operators are also used by the shell (some with different semantics), so it's a good habit to surround each regex on the command line with single quotes to protect the regex operators from interpretation by the shell. For example, both * and $ are regex operators, but they also have special meaning to the shell. The shell's asterisk differs from its regex look-alike: in the shell, it matches any portion of a file name, whereas the regex * is a quantifier that matches zero or more occurrences of the preceding element. The dollar sign indicates a variable in the shell but marks the end of a line or string in a regular expression.
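The effect of leaving off the quotes is easy to reproduce. In this sketch, the scratch directory and the file names Bat and words.txt are invented for the demonstration.

```shell
# Work in a scratch directory so the glob only sees the files created here.
cd "$(mktemp -d)"
touch Bat                       # a file whose name happens to match the glob Ba*t
printf 'Baaat\n' > words.txt

# Unquoted: the shell expands Ba*t to the file name Bat before grep runs,
# so grep searches for the literal string Bat and finds nothing.
grep -c -E Ba*t words.txt || true    # prints 0 (grep exits with status 1 on no match)

# Quoted: grep receives the regex itself: B, zero or more a, then t.
grep -c -E 'Ba*t' words.txt          # prints 1
```

Single quotes are the safer choice because, unlike double quotes, they also stop the shell from expanding $.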

To find names that end with man, you might use the regex man$ to match the sequence m, a, and n, followed immediately by the end of the line or string ($). Given the purpose of ^ and $, you can find a blank line with ^$ – essentially, this regex specifies a line that ends immediately after it begins.
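A short sketch shows the two anchors side by side. The file anchors-sample.txt and its contents, including the deliberate blank line, are hypothetical.

```shell
# Hypothetical sample with one blank line in the middle.
printf '%s\n' Batman Catwoman '' Spider-Man 'Black Cat' Manhunter > anchors-sample.txt

grep -E '^Bat' anchors-sample.txt          # starts with Bat          -> Batman
grep -E 'man$' anchors-sample.txt          # ends with man            -> Batman, Catwoman
grep -c -i -E 'man$' anchors-sample.txt    # ends with man, any case  -> 3
grep -c -E '^$' anchors-sample.txt         # blank lines              -> 1
grep -c -i man anchors-sample.txt          # unanchored comparison    -> 4
```

The last command also counts Manhunter, which contains man but does not end with it.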

To find words that begin with bat, Bat, cat, or Cat, you can use one of two techniques. The first is alternation, which yields a match if any of the patterns match. For example, the command

grep -E '^(bat|Bat|cat|Cat)' heroes.txt

does the trick. The vertical bar regex operator (|) specifies alternation, so this|that matches either the string this or the string that. Hence ^(bat|Bat|cat|Cat) specifies the beginning of a line, followed immediately by one of bat, Bat, cat, or Cat. Of course, you could simplify the regex with grep -i, which ignores case, reducing the command to:

grep -i -E '^(bat|cat)' heroes.txt
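To see what the anchor contributes to the alternation, compare the anchored and unanchored forms. The sample file below is a hypothetical stand-in for heroes.txt, built from names mentioned in this article.

```shell
# Stand-in for heroes.txt (hypothetical contents).
printf '%s\n' Catwoman Batman Spider-Man Ant-Man Batgirl Spider-Woman 'Black Cat' > heroes-sample.txt

grep -E '^(bat|Bat|cat|Cat)' heroes-sample.txt   # Catwoman, Batman, Batgirl
grep -i -E '^(bat|cat)' heroes-sample.txt        # the same three names
grep -i -E '(bat|cat)' heroes-sample.txt         # without ^: Black Cat matches, too
```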

The second approach uses the set operator ([ ]). If you place a list of characters in a set, any of those characters can match. (Think of a set as shorthand for alternation of characters.) For example,

grep -E '^(bat|Bat|cat|Cat)' heroes.txt

and

grep -E '^[bBcC]at' heroes.txt

both produce the same results. To simplify again, you can ignore case with -i to reduce the regex to ^[bc]at.
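One way to confirm that a set and the equivalent alternation select the same lines is to save both results and compare them. The sample file and the two output file names here are hypothetical.

```shell
# Stand-in for heroes.txt (hypothetical contents).
printf '%s\n' Catwoman Batman Spider-Man Ant-Man Batgirl Spider-Woman 'Black Cat' > heroes-sample.txt

grep -E '^(bat|Bat|cat|Cat)' heroes-sample.txt > by-alternation.txt
grep -E '^[bBcC]at' heroes-sample.txt > by-set.txt

# cmp is silent and succeeds when the two files are identical.
cmp by-alternation.txt by-set.txt && echo 'same results'

grep -i -E '^[bc]at' heroes-sample.txt   # Catwoman, Batman, Batgirl
```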

To specify an inclusive range of characters in a set, use the hyphen (-) operator. For example, usernames typically begin with a letter. To validate one in a web form submitted to your server, you might use ^[A-Za-z]. This regex reads: "Find the start of a string, followed immediately by any uppercase letter (A-Z) or any lowercase letter (a-z)." By the way, [A-Za-z] is the same as [a-zA-Z].
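The username check can be tried from the command line. The candidate names and the file usernames.txt are invented, and the stricter second pattern is an assumed rule added for illustration, not one taken from this article.

```shell
# Hypothetical candidate usernames.
printf '%s\n' alice Bob42 9lives _root 'x y' > usernames.txt

# First character must be a letter.
grep -E '^[A-Za-z]' usernames.txt                  # alice, Bob42, x y

# Assumed stricter rule: a letter, then only letters, digits, or underscores.
grep -E '^[A-Za-z][A-Za-z0-9_]*$' usernames.txt    # alice, Bob42
```

The first pattern only constrains the first character, so a name with an embedded space still passes; anchoring both ends closes that gap.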

You can mix ranges and individual characters in a set. The regex [A-MXYZ] matches any of uppercase A through M, X, Y, and Z. If you want the inverse of a set – that is, any character except what's in the set – use the special set [^ ] and include the range or characters to exclude. To find all superheroes with at in the name, excluding Batman, type:

grep -i -E '[^b]at' heroes.txt

The command produces Catwoman and Black Cat.
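Two details of the inverted set are worth checking by hand: -i makes the exclusion apply to both b and B, and the set must still match one character. The sample file is a hypothetical stand-in for heroes.txt.

```shell
# Stand-in for heroes.txt (hypothetical contents).
printf '%s\n' Catwoman Batman Spider-Man Ant-Man Batgirl Spider-Woman 'Black Cat' > heroes-sample.txt

grep -i -E '[^b]at' heroes-sample.txt   # Catwoman, Black Cat
grep -E '[^b]at' heroes-sample.txt      # without -i, B is not b: Batman and Batgirl match, too

# [^b] must consume a character, so at at the very start of a line does not match.
printf 'atom\n' | grep -c -E '[^b]at' || true   # prints 0
```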

Certain sets are required so frequently that they are represented with a shorthand notation. For instance, the set [A-Za-z0-9_] is so common, it can be abbreviated \w. Likewise, the operator \W is a convenience for the set [^A-Za-z0-9_]. Also, you can use the notation [_[:alnum:]] instead of \w and [^_[:alnum:]] for \W. See the "Locales" box.

Locales

\w (and its synonym [_[:alnum:]]) are locale specific, whereas [A-Za-z0-9_] is literally the letters A to Z and a to z, the digits 0 to 9, and the underscore. If you're developing international applications, use the locale-specific forms to make your code portable among many locales.
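The difference between word characters and plain alphanumerics shows up as soon as the input contains an underscore. This sketch assumes GNU grep, where \w and the -o option (print only the matched text) are available; neither is required by POSIX. The test string is made up.

```shell
# Assumes GNU grep (\w and -o are common extensions, not POSIX requirements).
printf 'user_name-42\n' | grep -o -E '\w+'             # user_name, 42
printf 'user_name-42\n' | grep -o -E '[[:alnum:]]+'    # user, name, 42
printf 'user_name-42\n' | grep -o -E '[_[:alnum:]]+'   # user_name, 42
```

The underscore counts as a word character but not as an alphanumeric, which is why the second command splits user_name in two.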
