Using grep

New Linux users unfamiliar with this standard Unix tool may not realize how useful it is. In this tutorial for the novice user, Eric demonstrates grep techniques.

Special Characters

Many Unix utilities use regular expressions to specify
patterns. Before we go into actual examples of regular expressions,
let's define a few terms and explain a few conventions that I will
use in the exercises.

Character: any printable symbol, such as a letter, number, or
punctuation mark.

String: a sequence of characters, such as cat or segment (sometimes
referred to as a literal).

Expression: also a sequence of characters. The difference between a
string and an expression is that while strings are to be taken
literally, expressions must be evaluated before their actual value
can be determined. (The manual page for GNU grep compares regular
expressions to mathematical expressions.) An expression can usually
stand for more than one thing; for example, the regular expression
th[ae]n can stand for then or than. The shell also has its own type
of expression, called globbing, which is usually used to specify file
names. For example, *.c matches any file name ending in the
characters .c.

Metacharacters: the characters whose presence turns a string into an
expression. Metacharacters can be thought of as the operators that
determine how expressions are evaluated. This will become clearer as
we work through the examples below.
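
To make the difference between a string and an expression concrete,
here is a minimal sketch. The word list and the temporary file are
invented for illustration; they are not part of the kernel sources.

```shell
# Sample data: four made-up words, one per line (an assumption for this sketch).
sample=$(mktemp)
printf '%s\n' then than thin thorn > "$sample"

# A plain string is taken literally, so only the line "then" matches.
literal_hits=$(grep -c 'then' "$sample")

# The expression th[ae]n is evaluated: it stands for "then" or "than".
expression_hits=$(grep -c 'th[ae]n' "$sample")

echo "string: $literal_hits  expression: $expression_hits"
rm -f "$sample"
```

With this word list the string matches one line and the expression
matches two.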

Interference

You have probably entered a shell command like

$ ls -l *.c

at some time. The shell “knows” that it is supposed to
replace *.c with a list of all the files in the
current directory whose names end in the characters
.c.

This gets in the way if we want to pass a literal * (or ?, |, $,
etc.) character to grep. Enclosing the regular expression in single
quotes ('like this') prevents the shell from evaluating any of its
own metacharacters. When in doubt, enclose your regular expression in
single quotes.
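
The effect of the quotes can be seen in a small experiment. This is
only a sketch: the scratch directory, the file words.txt, and the
empty file named thanks are all made up to provoke the problem.

```shell
# Work in a scratch directory so the glob has something to match.
start_dir=$PWD
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' the that thanks > words.txt
touch thanks   # an unrelated file whose name happens to match the glob tha*

# Unquoted: the shell replaces tha* with the file name "thanks" before
# grep runs, so grep searches words.txt for the string "thanks".
unquoted=$(grep -c tha* words.txt)

# Quoted: grep receives tha* itself and evaluates it as a regular expression.
quoted=$(grep -c 'tha*' words.txt)

echo "unquoted: $unquoted  quoted: $quoted"
cd "$start_dir" && rm -rf "$workdir"
```

The unquoted command matches only one line, because the pattern grep
actually saw was thanks; the quoted command matches all three.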

Basic Searches

The most basic regular expression is simply a string.
Therefore a string such as foo is a regular
expression that has only one match: foo.

We'll continue our examples with another file in the same
directory, so make sure you are still in the /usr/src/linux
directory:

$ grep Linus CREDITS

This quite naturally gives the four lines that have Linus
Torvalds' name in them.

As I said earlier, the Unix shells have different
metacharacters, and use different kinds of expressions. The
metacharacters . and * cause
the most confusion for people learning regular expression syntax
after they have been using shells (and DOS, for that
matter).

In regular expressions, the character .
acts very much like the ? at the shell prompt:
it matches any single character. The *, by
contrast, has quite a different meaning: it matches
zero or more instances of the
previous character.

If we type

$ grep tha. CREDITS

we get this (partial listing only):

S: Northampton
E: Hein@Informatik.TU-Clausthal.de

As you can see, grep printed every instance of
tha followed by any character. Now try

$ grep 'tha*' CREDITS

We get a much larger response with *. Since * matches zero or more
instances of the previous character (in this case the letter a), a
bare th is now a legal match, which greatly increases the number of
matching lines.
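
If you do not have the kernel sources at hand, the same contrast can
be reproduced on a few invented lines. The sample file below is an
assumption made for this sketch, not an excerpt from CREDITS.

```shell
# Five made-up lines; only the first two contain "tha".
sample=$(mktemp)
printf '%s\n' 'S: Northampton' 'E: Clausthal' 'the kernel' 'path' 'Linus' > "$sample"

# . matches exactly one character, so "tha" must be followed by something.
dot_hits=$(grep -c 'tha.' "$sample")

# * means zero or more of the preceding a, so a bare "th" is enough.
star_hits=$(grep -c 'tha*' "$sample")

echo "tha. matched $dot_hits lines, tha* matched $star_hits lines"
rm -f "$sample"
```

Here tha. matches two lines, while tha* matches every line that
contains th, which is four of the five.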

Character Classes

One of the most powerful constructs available in regular
expression syntax is the character class. A
character class specifies a range or set of characters to be
matched. The characters in a class are delineated by the
[ and ] symbols. The class
[a-z] matches the lowercase letters
a through z, the class
[a-zA-Z] matches all letters, uppercase or
lowercase, and [Lh] matches an uppercase L or a lowercase h.

Typing

$ grep '[a-z]' CREDITS

gives us most of the file. If you look at the file closely, you'll
see that a few lines have no lowercase letters; these are the only
lines that grep does not print.
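
The following sketch tries the three classes on a handful of invented
lines. LC_ALL=C is set as a precaution, on the assumption that some
locales order letters in a way that lets [a-z] match uppercase
letters too.

```shell
# Made-up sample: two lines with lowercase letters, two without.
sample=$(mktemp)
printf '%s\n' 'N: Linus Torvalds' '----------' 'ABC 123' 'hello' > "$sample"

# Lines containing at least one lowercase letter.
lower_hits=$(LC_ALL=C grep -c '[a-z]' "$sample")

# Lines containing at least one letter of either case.
letter_hits=$(LC_ALL=C grep -c '[a-zA-Z]' "$sample")

# Lines containing an uppercase L or a lowercase h.
lh_hits=$(LC_ALL=C grep -c '[Lh]' "$sample")

echo "[a-z]: $lower_hits  [a-zA-Z]: $letter_hits  [Lh]: $lh_hits"
rm -f "$sample"
```

The row of dashes and the all-uppercase line are skipped by [a-z],
just as the lines without lowercase letters are skipped in CREDITS.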

Now since we can match a set of characters, why not exclude
them instead? The circumflex, ^, when included
as the first member of a character class,
matches any character except the characters
specified in the class.
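
A point that is easy to miss: a negated class still has to match a
character. The sketch below, using invented sample lines, shows that
'[^a-z]' selects lines containing at least one character that is not
a lowercase letter, not lines that lack lowercase letters.

```shell
# Made-up sample lines for illustration.
sample=$(mktemp)
printf '%s\n' 'linus' 'Linus' 'linus torvalds' '12345' > "$sample"

# Matches any line with a character outside a-z: the capital L,
# the space, and the digits each qualify.
negated_hits=$(LC_ALL=C grep -c '[^a-z]' "$sample")

# Inverting the search with -v leaves the lines made up only of
# lowercase letters.
only_lower=$(LC_ALL=C grep -vc '[^a-z]' "$sample")

echo "negated: $negated_hits  only lowercase: $only_lower"
rm -f "$sample"
```

Only the line linus consists entirely of lowercase letters, so it is
the one line the negated class does not match.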

To search for a class of characters including a literal ^ character,
don't place it first in the class. To search for a class including a
literal -, place it as the very last character of the class. To
search for a class including the literal character ], place it as the
first character of the class.
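
These placement rules can be checked with a short sketch. The sample
lines and the filler letter x in the classes are invented for the
purpose.

```shell
# One made-up line per special character, plus a line with none of them.
sample=$(mktemp)
printf '%s\n' 'a^b' 'c-d' 'e]f' 'plain' > "$sample"

caret_hits=$(grep -c '[x^]' "$sample")     # ^ is literal when not first
dash_hits=$(grep -c '[x-]' "$sample")      # - is literal when last
bracket_hits=$(grep -c '[]x]' "$sample")   # ] is literal when first

# All three rules combined in one class: ] first, ^ in the middle, - last.
all_hits=$(grep -c '[]^-]' "$sample")

echo "caret: $caret_hits dash: $dash_hits bracket: $bracket_hits all: $all_hits"
rm -f "$sample"
```

Each single-purpose class matches one line, and the combined class
matches all three lines that contain a special character.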

Often it is convenient to base searches on the position of
the characters on a line. The ^ character
matches the beginning of a line (outside of a character class, of
course) and the $ matches the end. (Users of vi
may recognize these metacharacters as commands.) Earlier, searching
for Linus gave us four lines. Let's change that
to:

$ grep 'Linus$' CREDITS

which gives us

Linus
D: Personal information about Linus

two lines, since we specified that Linus must be
the last five characters of the line. Similarly,

$ grep - CREDITS

produces 99 lines, while

$ grep '^-' CREDITS

produces only one line:

----------
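
The same anchored searches can be tried on a small stand-in file. The
lines below are invented to mimic the layout of CREDITS, and the line
counts apply only to this sample.

```shell
# Made-up sample in the style of CREDITS.
sample=$(mktemp)
printf '%s\n' 'Linus' 'N: Linus Torvalds' \
    'D: Personal information about Linus' \
    '----------' 'E: some-user@example.org' > "$sample"

any_linus=$(grep -c 'Linus' "$sample")    # Linus anywhere on the line
end_linus=$(grep -c 'Linus$' "$sample")   # Linus at the end of the line

# -e marks the next argument as the pattern, which avoids any doubt
# about a pattern that starts with a dash.
any_dash=$(grep -c -e '-' "$sample")      # a dash anywhere on the line
start_dash=$(grep -c '^-' "$sample")      # a dash at the start of the line

echo "Linus: $any_linus  Linus\$: $end_linus  -: $any_dash  ^-: $start_dash"
rm -f "$sample"
```

Anchoring narrows both searches: three lines contain Linus but only
two end with it, and two lines contain a dash but only one begins
with it.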

In some circumstances you may need to match a metacharacter.
Inside a character class, all characters are taken as literals
(except ^, -, and
], as shown above). However, outside of classes
we need a way to turn a metacharacter into a literal character to
match.
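
The behavior inside classes can be seen in a short sketch, again
using invented sample lines: a metacharacter placed in a
one-character class is matched as itself.

```shell
# Three made-up lines that differ only in the middle character.
sample=$(mktemp)
printf '%s\n' 'a.b' 'axb' 'a*b' > "$sample"

# Outside a class, . is a metacharacter and matches any single character.
dot_meta=$(grep -c 'a.b' "$sample")

# Inside a class, . and * are ordinary characters.
dot_literal=$(grep -c 'a[.]b' "$sample")
star_literal=$(grep -c 'a[*]b' "$sample")

echo "a.b: $dot_meta  a[.]b: $dot_literal  a[*]b: $star_literal"
rm -f "$sample"
```

The bare a.b matches all three lines, while each class matches only
the line containing that exact character.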