So far we've seen how to use cut, grep and wc to select and count records with certain qualities. But each set of records we'd like to count requires a separate command, as with counting the numbers of male and female names in the most recent example. Combining the uniq and sort commands allows us to count many groups at once.

The uniq command squashes out contiguous duplicate lines. That is, it copies from its standard input to its standard output, but if a line is identical to the immediately preceding line, the duplicate line is not written. For example:

$ cat foo
a
a
a
b
b
a
a
a
c
$ uniq foo
a
b
a
c

Note that 'a' is written twice because uniq compares only to the immediately preceding line. If the data is sorted first, we get each distinct record just once:
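As a minimal sketch, assuming the same file foo shown above (recreated here in a scratch directory so the snippet runs on its own), sorting first makes all duplicates contiguous, so uniq can collapse every one of them:

```shell
# Recreate the sample file foo from the example above in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' a a a b b a a a c > "$workdir/foo"

# Sorting brings identical lines together, so uniq emits each distinct line once.
distinct=$(sort "$workdir/foo" | uniq)
printf '%s\n' "$distinct"

rm -rf "$workdir"
```

This prints a, b and c on separate lines; the second run of 'a' lines no longer survives because it is now adjacent to the first.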

The combination of sort and uniq -c is extremely powerful (the -c option prefixes each output line with the number of times that line occurred). It allows one to create histograms from virtually any record-oriented text data. Returning to the name-to-gender mapping of the previous chapter, we could have gotten the count of male and female names in one command like this:
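A minimal sketch of such a pipeline, assuming a hypothetical file names.csv with one name,gender record per line. The file name, the field layout and the sample rows are invented for illustration and may not match the file used in the previous chapter:

```shell
workdir=$(mktemp -d)
# Hypothetical name-to-gender file: one "name,gender" record per line.
# These rows are made up for illustration only.
printf '%s\n' 'alice,F' 'bob,M' 'carol,F' 'dave,M' 'erin,F' > "$workdir/names.csv"

# Keep only the gender field, group equal values together, then count each group.
gender_counts=$(cut -d, -f2 "$workdir/names.csv" | sort | uniq -c)
printf '%s\n' "$gender_counts"

rm -rf "$workdir"
```

Each output line holds a count followed by the gender code, so both groups are tallied in a single pass instead of one grep per group.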

This is a good opportunity to point out a big benefit of being able to play with data in this fashion. It allows you to quickly spot potential problems in a dataset. In the above example, why are there 1,796 households with 0 occupants? As another example of quickly verifying the integrity of data, let's make sure that household id is truly a unique identifier:

This grep invocation will print only lines that do not (because of the -v flag) begin with a series of spaces followed by a 1 (the count from uniq -c) followed by a tab (entered using the control-v trick). Since the number of lines written is zero, we know that each household id occurs once and only once in the file.
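The check described above can be sketched as follows. The file households.csv, its layout (id in the first comma-separated field) and the helper name find_repeated_ids are assumptions made for illustration. The sketch also matches the separator after the count with [[:space:]] rather than a literal tab, since GNU and BSD versions of uniq write a space there:

```shell
# Hypothetical helper: print the uniq -c lines for ids that occur more than once.
# Assumes the id is the first comma-separated field of the file given as $1.
find_repeated_ids() {
    # grep -v exits non-zero when it prints nothing; "|| true" keeps that
    # from being treated as a failure when every id is unique.
    cut -d, -f1 "$1" | sort | uniq -c | grep -v '^ *1[[:space:]]' || true
}

workdir=$(mktemp -d)
# Invented sample data: four households, each with a distinct id.
printf '%s\n' '101,3' '102,0' '103,2' '104,1' > "$workdir/households.csv"

repeated_ids=$(find_repeated_ids "$workdir/households.csv")
if [ -z "$repeated_ids" ]; then
    echo "every household id is unique"
else
    printf '%s\n' "$repeated_ids"
fi

rm -rf "$workdir"
```

An empty result means no id was counted more than once; any line that does appear shows the offending id together with its count.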

The technique of grepping uniq's output for lines with a certain count is generally useful. One other common application is finding the set of overlapping (duplicated) keys in a pair of files by grepping the output of uniq -c for lines that begin with a 2.
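A minimal sketch of the overlapping-keys idea, assuming two hypothetical files keys_a.txt and keys_b.txt that each hold one key per line. It relies on each key appearing at most once within a file; otherwise a count of 2 could come from a single file:

```shell
workdir=$(mktemp -d)
# Invented key files; each key is assumed to appear at most once per file.
printf '%s\n' k1 k2 k3 k4 > "$workdir/keys_a.txt"
printf '%s\n' k3 k4 k5 > "$workdir/keys_b.txt"

# sort reads both files as one stream, so a key present in both gets a count of 2.
# awk then drops the count and prints only the key itself.
shared_keys=$(sort "$workdir/keys_a.txt" "$workdir/keys_b.txt" \
    | uniq -c | grep '^ *2[[:space:]]' | awk '{print $2}')
printf '%s\n' "$shared_keys"

rm -rf "$workdir"
```

With this sample data the pipeline prints k3 and k4, the two keys found in both files.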

Throwing an extra sort on the end of the pipeline will sort the histogram so that the most common class is at the top (or bottom). This is useful when the data is categorical and has no natural order. You'll want to give sort the -n option so that it sorts the counts numerically instead of lexically, and I like to give the -r option to reverse the sort so that the output is in descending order, but this is just a stylistic issue. For example, here is the distribution of household heating fuel from most common to least common:
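A minimal sketch of a sorted histogram, assuming a hypothetical one-column file fuel.txt of heating fuel categories. The values are invented for illustration and are not the real distribution:

```shell
workdir=$(mktemp -d)
# Invented one-column extract of heating fuel categories.
printf '%s\n' gas electric gas oil gas electric wood gas oil electric > "$workdir/fuel.txt"

# First sort groups equal categories, uniq -c counts them, and the final
# sort -rn orders the histogram by count, largest first.
fuel_histogram=$(sort "$workdir/fuel.txt" | uniq -c | sort -rn)
printf '%s\n' "$fuel_histogram"

rm -rf "$workdir"
```

Without -n the final sort would compare the counts as text, so a count of 10 would land before a count of 9.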

The output of uniq -c is not in proper CSV form, which makes it necessary to convert the output if further operations on it are wanted. Here we use a bit of inline Perl to rewrite the lines and reverse the order of the fields.