wc: Word Count

The wc
(word count) command is a very simple utility found in all Unix
variants. Its purpose is counting the number of lines, words and
characters of text files. If multiple files are specified,
wc produces a count for each file,
plus totals for all files.
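As a small illustration of the per-file counts and the totals line, here is a sketch that builds two throwaway files (a.txt and b.txt are made-up names) in a scratch directory and runs wc on both. The exact column spacing of the report differs between wc implementations, so the sketch only picks out the numbers.

```shell
# Work in a scratch directory so nothing else is touched.
dir=$(mktemp -d)
printf 'one two\nthree\n' > "$dir/a.txt"   # 2 lines, 3 words, 14 bytes
printf 'four five six\n'  > "$dir/b.txt"   # 1 line,  3 words, 14 bytes

# One report line per file, followed by a final line with the totals.
report=$(cd "$dir" && wc a.txt b.txt)
printf '%s\n' "$report"

# The last line carries the sums: lines, words and bytes across both files.
totals=$(printf '%s\n' "$report" | tail -n 1 | awk '{print $1, $2, $3}')
echo "totals: $totals"

rm -r "$dir"
```

With a single file argument there is no totals line; wc prints just the one report line.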

When used without options, wc
prints the number of lines, words and characters, in that order. A
word is a sequence of one or more characters delimited by
whitespace. If we want fewer than the three counts, we use options
to select what is to be printed: -l to print
lines, -w to print words and
-c to print characters (bytes). The GNU version of
wc found in Linux systems also
supports the long option format: --bytes,
--words and --lines. In current GNU versions,
--chars belongs to the -m option described further on, not to -c.
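To see the options side by side, here is a short sketch that feeds the same three lines of made-up text to wc through a pipe. The variable names are only for illustration.

```shell
# Three lines of sample text; %b makes printf expand the \n escapes.
text='the quick brown fox\njumps over\nthe lazy dog\n'

lines=$(printf '%b' "$text" | wc -l)   # number of newline characters
words=$(printf '%b' "$text" | wc -w)   # whitespace-delimited words
bytes=$(printf '%b' "$text" | wc -c)   # bytes, newlines included

echo "lines=$lines words=$words bytes=$bytes"

# Options can be combined; the counts still come out in the fixed order
# lines, words, bytes, whatever order the options are given in.
both=$(printf '%b' "$text" | wc -w -l | awk '{print $1, $2}')
echo "lines and words: $both"
```

When wc reads from a pipe there is no file name to print, which makes the output easy to capture in a variable.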

When I applied wc to an
earlier version of the LaTeX source file with this text, I received
the following information from
wc:

wc wc.tex
98 760 4269 wc.tex

This line means that the file had 98 lines, 760 words and
4269 characters (bytes). Actually, I seldom use
wc alone. Due to its simplicity,
wc is mostly useful when used in
combination with other Linux commands.

If the file comes from a system other than Linux (or Unix), namely
DOS, there is an ambiguity because a line break there is a combination
of a carriage return and a line feed. Should -c
count a line break as two characters or only one? The POSIX.2
standard dictates that -c actually counts bytes,
not characters, and it provides the -m option to
count characters. POSIX does not allow this option to be combined with
-c. Older versions of GNU
wc did not support
-m; current versions do, but they count the carriage
return as a character of its own, so -m does not settle the
question either. If we want each line break counted once, we can always
subtract the line count from the byte count to obtain the char
count of a DOS file. Here are two different ways to achieve
this:

The first solution uses awk to subtract
the first field (the line count) from the third field (the byte
count). The second solution uses tr to delete
the carriage returns (char 15 in octal) from the input before
feeding it to wc.
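A minimal sketch of the two solutions just described, assuming a DOS-format file called dos.txt (a made-up name, created here in a scratch directory so the example is self-contained):

```shell
dir=$(mktemp -d)
# Three short lines with DOS line endings (CR LF): 5 letters + 3 * 2 = 11 bytes.
printf 'ab\r\ncd\r\ne\r\n' > "$dir/dos.txt"

# First solution: let awk subtract the line count (field 1)
# from the byte count (field 3) of the wc report.
chars_awk=$(wc "$dir/dos.txt" | awk '{print $3 - $1}')

# Second solution: delete the carriage returns (octal 015)
# before the data reaches wc, then count bytes.
chars_tr=$(tr -d '\015' < "$dir/dos.txt" | wc -c)

echo "awk: $chars_awk  tr: $chars_tr"
rm -r "$dir"
```

Both give 8 for this file. The awk version assumes every line ends in CR LF; the tr version removes every carriage return, including any stray one in the middle of a line.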

Recently I used a CD-ROM writer that was connected to a
machine that was slightly sick. Now and then a block of 32
consecutive bytes got corrupted while copying amongst different
hard disk partitions. This caused quite a few CD-ROM backups to be
damaged. Sometimes the damage affected a large file, and in this
case, it was cheaper to keep the bad file and add a small patch
file to the next backup. To decide whether we should make a new
full backup of a corrupted file or just make a differential patch,
we used the cmp command to detect the
differences, followed by wc to
count them:

cmp -l /original/foo /cdrom/foo | wc -l

The -l option to cmp
provides a full listing of the differences, one per line, instead
of stopping on the first difference. Thus, the above command
outputs the number of bytes that are wrong.
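To try this without a sick machine, here is a sketch with two small made-up files of equal length, one of which has three bytes altered on purpose:

```shell
dir=$(mktemp -d)
printf 'abcdefghij' > "$dir/original"
printf 'abXdefgYZj' > "$dir/copy"      # bytes 3, 8 and 9 differ

# cmp -l prints one line per differing byte: offset, then both byte values.
cmp -l "$dir/original" "$dir/copy"

# Counting those lines gives the number of damaged bytes.
# cmp itself exits non-zero when the files differ; that is expected here.
bad=$(cmp -l "$dir/original" "$dir/copy" | wc -l)
echo "damaged bytes: $bad"

rm -r "$dir"
```

This count is meaningful only when bytes were overwritten in place, as in the story above. If bytes were inserted or lost, everything after that point shifts and cmp reports almost every later byte as different.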

If we want to count how many words are in line 70 of file
foo.txt then we use:

head -70 foo.txt | tail -1 | wc -w

Here, the command head -70 outputs the
first 70 lines of the file, the command tail -1
(i.e., the number 1) outputs the last line of its input, which
happens to be line 70 of foo.txt, and
wc counts how many words are in
that line.
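The same count can be had with sed doing the line selection. The sketch below builds a made-up 100-line foo.txt in a scratch directory and compares the two pipelines; it uses the -n spelling of the head and tail options, which is equivalent to the shorter form used above.

```shell
dir=$(mktemp -d)
# 100 lines: line 70 has seven words, every other line has one.
awk 'BEGIN {
  for (i = 1; i <= 100; i++)
    print (i == 70 ? "this is line seventy with seven words" : "filler")
}' > "$dir/foo.txt"

# The pipeline from the text: first 70 lines, then the last of those.
via_head=$(head -n 70 "$dir/foo.txt" | tail -n 1 | wc -w)

# Alternative: let sed print only line 70.
via_sed=$(sed -n '70p' "$dir/foo.txt" | wc -w)

echo "head/tail: $via_head  sed: $via_sed"
rm -r "$dir"
```

The two differ when the file has fewer than 70 lines: head and tail then report on whatever the last line happens to be, while the sed version prints nothing and wc reports 0.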

If our boss presses us to include in our monthly project
report a count of the number of lines of code produced, then we can
do it like this:

wc -l */*.[ch] | tail -1 | awk '{print $1}'

This assumes that all our code is in files with extension
.h or .c, and that these
files live in subdirectories one level deep from our current
directory. If file depth is arbitrary, we use the following:

wc -l `find . -name "*.[ch]" -print` | \
tail -1 | awk '{print $1}'

Notice the use of back quotes in the find
command line, and ordinary single quotes in the
awk command. The command find . -name
"*.[ch]" -print outputs the *.c and
*.h files located below the current directory,
one per line. The back quotes cause that command to be executed,
and then replace each newline in the command's output with a blank,
and pass that output to the wc
command line.
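Because the back quotes split the output of find on blanks, a file or directory name that contains a space breaks the command above. One way around this, sketched here on a made-up source tree, is to let find hand the files to cat and count the combined stream:

```shell
dir=$(mktemp -d)
mkdir -p "$dir/src/deep dir"                              # note the space
printf 'int a;\nint b;\n' > "$dir/main.c"                 # 2 lines
printf 'void f(void);\n'  > "$dir/src/api.h"              # 1 line
printf 'x\ny\nz\n'        > "$dir/src/deep dir/util.c"    # 3 lines
printf 'not code\n'       > "$dir/notes.txt"              # not counted

# find passes the matching files to cat as arguments, so names with blanks
# survive intact; wc then sees a single stream and prints a single number.
total=$(find "$dir" -type f -name '*.[ch]' -exec cat {} + | wc -l)
echo "lines of code: $total"

rm -r "$dir"
```

Since wc reads one stream here, there is no per-file report and no totals line to pick out with tail and awk.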

If in good GNU style you mark all current bugs and dirty
hacks in your source code with the word FIXME,
then you can see how much urgent work is pending by typing:

grep FIXME *.c | wc -l

The grep outputs all lines that have a
FIXME, and then we just have to count them.
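Here is a sketch on two made-up source files, which also shows what grep's own -c option gives instead:

```shell
dir=$(mktemp -d)
printf 'int a; /* FIXME: rename */\nint b;\n'    > "$dir/a.c"   # 1 marked line
printf '/* FIXME */\nint c; /* FIXME again */\n' > "$dir/b.c"   # 2 marked lines

# The pipeline from the text: one grand total across all files.
pending=$(cd "$dir" && grep FIXME *.c | wc -l)
echo "pending: $pending"

# grep -c counts too, but reports each file separately.
per_file=$(cd "$dir" && grep -c FIXME *.c)
printf '%s\n' "$per_file"

rm -r "$dir"
```

Both variants count matching lines, not occurrences: a line with two FIXME marks is counted once.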

As you can see, there is nothing special about the
wc command; however, half of my
shell scripts would stop working if that command were not
available.

Alexandre (avs@daimi.aau.dk) is from Porto,
Portugal, but has been in Aarhus, Denmark, for his PhD, just
delivered: something to do with literate programming and stuff. He
is ashamed to confess that his first Linux was 1.02, but he is
playing catch up. He claims to have brainwashed his significant
other, Renata, and now she is even more sanguine about Linux. Now
they are threatening to capture the mind and soul of their innocent
9-year-old daughter Maria. She has a Mac but with the release of
MkLinux she is no longer safe. Root password at 9? Cool.
