Counting with uniq

Shell experts make the best of simple combinations of standard utilities. Learn one of the most common examples of using two common commands together.

One of the truly great qualities of UNIX-like operating systems
is their ability to combine multiple commands. By combining
commands, you can perform a wide array of tasks, limited only by
your cleverness and imagination.

Although the number of potential command combinations is
huge, my experience has shown that certain combinations come in
handy more often than others.
One I turn to frequently is combining the sort and uniq commands to
count occurrences of arbitrary strings in a file. This is a great
trick for new Linux users and one you never will regret adding to
your skill set.

A Simple Example

Let's look at a simple example first to highlight the fundamental
concepts. Given a file called fruit with the following contents:

apples
oranges
apples

you can discover how many times each word appears, as follows:

% sort fruit | uniq -c
2 apples
1 oranges

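What's happening here? First, sort fruit sorts the
file. The result ordinarily would go to the standard
output (in this case, your terminal), but note the |
(pipe) that follows. That pipe directs the output
of sort fruit to the input of the next command,
uniq -c. The uniq command collapses each run of identical
adjacent lines into a single line, and the -c option precedes
that line with the number of times it occurred. Because uniq
compares only adjacent lines, the input must be sorted first
so that duplicates end up next to each other.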
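
The effect of the sort step is easiest to see by running uniq -c with and without it. The following is a minimal sketch, not part of the original example: the temporary file created with mktemp and the variable names fruit, unsorted and sorted are assumptions made for illustration.

```shell
# Create a scratch copy of the example file (hypothetical temporary path).
fruit=$(mktemp)
printf 'apples\noranges\napples\n' > "$fruit"

# Without sort, the two "apples" lines are not adjacent,
# so uniq reports three separate lines, each with a count of 1.
unsorted=$(uniq -c "$fruit")

# With sort, the duplicates become adjacent and collapse into one counted line.
sorted=$(sort "$fruit" | uniq -c)

printf 'unsorted:\n%s\n\nsorted:\n%s\n' "$unsorted" "$sorted"
rm -f "$fruit"
```

The exact width of the count column varies between uniq implementations, so scripts that consume this output should split on whitespace rather than rely on fixed columns.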

A More-Advanced Example

It's not obvious from the simple example why this is so powerful.
However, it becomes clearer when the file at hand is, for instance,
an Apache Web server access log with hundreds of thousands of lines.
The access log contains a wealth of valuable information. By using
sort and uniq, you can do a surprising amount of simple data analysis
on the fly from the command line.

Imagine a coworker desperately needs to know the ten IP addresses
that requested a PHP script called foo.php most often in January.
Moments later, you have the information she needs. How did you
derive this information so fast? Let's look at the solution step
by step.

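For the sake of this exercise, assume your server logs one request
per line, with the client's IP address as the first field and the
date written so that it contains a string such as Jan/2004.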

The log contains data from many months, not only January 2004, so
the first order of business is to use grep to limit our data set:

% grep Jan/2004 access.log

We then look for foo.php in the output:

% grep Jan/2004 access.log | grep foo.php
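
One detail worth knowing: grep treats its pattern as a regular expression, in which a dot matches any single character. For a log search this rarely matters, but the -F option makes grep match the pattern as a fixed string. A small sketch, using made-up request paths that are assumptions for illustration only:

```shell
# Hypothetical request paths, one per line.
paths=$(printf '%s\n' '/foo.php' '/fooXphp' '/bar.php')

# As a regular expression, the dot in foo.php matches any single character,
# so both /foo.php and /fooXphp are counted.
regex_hits=$(printf '%s\n' "$paths" | grep -c 'foo.php')

# With -F the pattern is a fixed string, so only a literal dot matches.
fixed_hits=$(printf '%s\n' "$paths" | grep -c -F 'foo.php')

echo "regex: $regex_hits, fixed: $fixed_hits"
```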

If we are to count occurrences of IP addresses, we had better limit
our output to only that one field, like so:

% grep Jan/2004 access.log | grep foo.php | awk '{ print $1 }'

A discussion of awk is beyond the scope of this
article. For now, you need to understand only that awk
'{ print $1 }' prints the first whitespace-delimited
field of each line. In this case, it's the IP address.
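
To see the field extraction in isolation, it can help to feed awk a single line. The sketch below assumes a log line in Apache's Common Log Format; the address, timestamp and request are invented for illustration and do not come from a real log.

```shell
# Hypothetical access-log line in Common Log Format; every value is made up.
line='192.0.2.10 - - [15/Jan/2004:10:12:01 -0800] "GET /foo.php HTTP/1.1" 200 512'

# awk splits each line on runs of whitespace; $1 is the first field,
# which in this format is the client address.
ip=$(printf '%s\n' "$line" | awk '{ print $1 }')

echo "$ip"
```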

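Now, at last, we can apply sort and uniq. Here's the final command
pipeline:

% grep Jan/2004 access.log | grep foo.php | \
awk '{ print $1 }' | sort -n | uniq -c | \
sort -rn | head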

The backslash (\) indicates the command is
continued on the next line. You can type the command
as one long line without the backslashes or use them
to break up a long pipeline into multiple lines on
the screen.

You may have noticed that, unlike in our simple example, the first
sort is a numeric sort (sort -n). This is appropriate because we
are, after all, dealing with numbers, although uniq needs only
that identical lines be adjacent, so a plain sort would serve
as well.

The other difference is the inclusion of | sort -rn | head. The
sort -rn command sorts the output of uniq -c in reverse numeric
order, so the largest counts come first. The head command prints
only the first ten lines of output, which is perfect for the task
at hand because we want only the top ten.

You can change the functionality of this pipeline by
making changes to any of the component commands. For
instance, if you want to print the bottom ten
instead of the top ten, you need only change head
to tail.
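
To experiment with the whole pipeline without a real server log, a small invented log can stand in for access.log. This is a sketch under stated assumptions: the log lines, the addresses and the temporary file are all hypothetical, and the line format is assumed to be Apache's Common Log Format.

```shell
# Build a tiny stand-in for access.log (all entries are invented for illustration).
log=$(mktemp)
cat > "$log" <<'EOF'
192.0.2.10 - - [03/Jan/2004:09:00:01 -0800] "GET /foo.php HTTP/1.1" 200 512
192.0.2.20 - - [03/Jan/2004:09:00:05 -0800] "GET /foo.php HTTP/1.1" 200 512
192.0.2.10 - - [04/Jan/2004:11:30:00 -0800] "GET /foo.php HTTP/1.1" 200 512
192.0.2.10 - - [05/Jan/2004:12:00:00 -0800] "GET /index.html HTTP/1.1" 200 1024
192.0.2.30 - - [02/Feb/2004:08:15:00 -0800] "GET /foo.php HTTP/1.1" 200 512
192.0.2.20 - - [09/Jan/2004:14:45:00 -0800] "GET /foo.php HTTP/1.1" 200 512
192.0.2.10 - - [10/Jan/2004:16:20:00 -0800] "GET /foo.php HTTP/1.1" 200 512
EOF

# Keep January requests for foo.php, extract the address field,
# count each address, and rank the counts from highest to lowest.
top=$(grep Jan/2004 "$log" | grep foo.php | \
  awk '{ print $1 }' | sort -n | uniq -c | \
  sort -rn | head)

echo "$top"
rm -f "$log"
```

With this sample, the February request and the request for index.html are filtered out before counting, so only two addresses remain in the ranking. Replacing head with tail in the last stage shows the lowest counts instead, still listed in descending order.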

Conclusion

Piping data through sort and uniq is exceedingly
handy, and I hope reading about it whets your
appetite for learning more about pipelines. For more
information about any of the commands used in these
examples, refer to the corresponding man pages.

Brian Tanaka has been a UNIX system administrator since 1994 and
has worked for companies such as The Well, SGI, Intuit and
RealNetworks. He can be reached at btanaka@well.com.