Another language that Linux provides, and that is standard on many (if not most) UNIX systems, is awk. The name awk is an acronym formed from the first letters of the last names of its developers: Alfred Aho, Peter Weinberger, and Brian Kernighan. Like sed, awk is an interpreted pattern-matching language. In addition, awk, like sed, can read stdin. It can also be passed the name of a file containing its instructions.

The most useful aspect of awk (at least for me and the many Linux scripts that use it) is its idea of a field. Like sed, awk reads whole lines, but unlike sed, awk can immediately break each line into segments (fields) based on some criterion. Each field is separated from the next by a field separator. By default, this separator is whitespace. By using the -F option on the command line or the FS variable within an awk program, you can specify a new field separator. For example, if you specified a colon (:) as the field separator, you could read in the lines from the /etc/passwd file and immediately break each one into fields.
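To see this in action, here is a small sketch using a single line in /etc/passwd format, so the result is the same on any system:

```shell
# One line in /etc/passwd format (login:password:UID:GID:GECOS:home:shell)
line="root:x:0:0:root:/root:/bin/bash"

# The -F option sets the field separator on the command line:
echo "$line" | awk -F: '{ print $1, $7 }'
# prints: root /bin/bash

# Setting FS within the program does the same thing; it goes in a BEGIN
# block so that it is set before the first line is read:
echo "$line" | awk 'BEGIN { FS = ":" } { print $1, $7 }'
```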

A programming language in its own right, awk has become a staple of
UNIX
systems. The basic purposes of the language are manipulating and processing text
files. However, awk is also a useful tool when combined with output from other
commands, and allows you to format
that output in ways that might be easier to process further. One major advantage of awk is that it can accomplish in a few lines what would normally require dozens of lines in an sh or csh shell script, or might even require writing something in a lower-level language, such as C.

The basic layout of an awk command is

pattern { action }

where the action to be performed is included within the curly braces ({}).
Like sed, awk reads its input one line at a time, but awk sees each line as a record broken up into fields. Fields are separated by an input field separator (FS), which by default is whitespace (any run of spaces or tabs). The FS can be changed to something else, for example a semicolon (;), with FS = ";". This is useful when you want to process text that contains blanks; for example, data of the following form:

Here we have name, address, city, state, zip code, and age. Without using ; as the field separator, "Blinn," and "David;42" would be two separate fields. Here, we want to treat each name, address, city, and so on as a single unit, rather than as multiple fields.
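The original sample listing did not survive the conversion of this page. The following is a plausible reconstruction based on the names, cities, and ages that appear in the examples later in this section: only the Giberson line is verbatim from the text, the street addresses on the other lines are invented, and the seventh line the text mentions cannot be recovered, so it is omitted.

```shell
# Reconstructed sample data (name;address;city;state;zip;age).  Only the
# Giberson line is verbatim; the other addresses are hypothetical, and the
# seventh line mentioned in the text is unknown.
cat > awk.data <<'EOF'
Blinn, David;123 Any Street;Boston;Massachusetts;02134;33
Dickson, Tillman;456 Some Road;Beaverton;Oregon;97005;34
Giberson, Suzanne;102 Truck Stop Road;Ben Lomond;California;96221;26
Holder, Wyliam;789 Another Avenue;Boston;Massachusetts;02134;42
Nathanson, Robert;1 Main Street;Beaverton;Oregon;97005;33
Richards, John;2 Elm Street;Boston;Massachusetts;02134;36
EOF

# With the default separator, the blank inside the name splits it in two:
head -1 awk.data | awk '{ print $2 }'
# prints: David;123

# With ; as the separator, the whole name is a single field:
head -1 awk.data | awk -F';' '{ print $1 }'
# prints: Blinn, David
```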

The basic format of an awk program, or awk script as it is sometimes called, is a pattern followed by a particular action. Like sed, awk checks each line of the input to see whether it matches that particular pattern. Both sed and awk do well when comparing string values. However, whereas checking numeric values is difficult with sed, this functionality is an integral part of awk.

If we wanted, we could use the data previously listed and output only the names and cities of those people under 30. First, we need an awk script, called awk.scr, that looks like this:

BEGIN { FS = ";" }
$6 < 30 { print $1, $3 }

(The field separator is assigned in a BEGIN block; a bare FS=; would be a syntax error.)

Next, assume that we have a data file containing the seven lines of data above, called awk.data. We could process the data file in one of two ways. The first is

awk -f awk.scr awk.data

The -f option tells awk that it should read its instructions from the file that follows, in this case awk.scr. At the end, we have the file from which awk needs to read its data. The second way is to give the instructions directly on the command line, enclosed in single quotes:

awk 'BEGIN { FS = ";" } $6 < 30 { print $1, $3 }' awk.data
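Put together, a runnable sketch looks like this. The files are re-created here so the example is self-contained; of the two data lines, only Giberson's is verbatim from the text, and the other is hypothetical.

```shell
# The instructions go in awk.scr; FS is set in a BEGIN block so that it
# takes effect before the first record is read.
cat > awk.scr <<'EOF'
BEGIN { FS = ";" }
$6 < 30 { print $1, $3 }
EOF

# Two records in the sample format (only the first is verbatim).
cat > awk.data <<'EOF'
Giberson, Suzanne;102 Truck Stop Road;Ben Lomond;California;96221;26
Blinn, David;123 Any Street;Boston;Massachusetts;02134;33
EOF

awk -f awk.scr awk.data
# prints: Giberson, Suzanne Ben Lomond
```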

Although it may make little sense, we could make string comparisons on what
would normally be numeric values, as in

$6 == "33" { print $1, $3 }

This prints out fields 1 and 3 from only those lines in which the sixth field
equals the string 33.

Not to be outdone by sed, awk will also allow you to use regular expressions
in your search criteria.
A very simple example is one where we want to print every line containing the
characters "on." (Note: The characters
must be adjacent and in the appropriate case.) This line would look like this:

/on/ {print $0}

However, the regular expressions that awk uses can be as complicated as those
used in sed. One example would be

/[^s]on[^;]/ {print $0}

This says to print every line containing the pattern on, but only if it is not preceded by an s and not followed by a semicolon (;). The trailing-semicolon check eliminates the two town names ending in "on" (Boston and Beaverton), and the leading-s check eliminates all the names ending in "son." When we run awk with this line, our output is

Giberson, Suzanne;102 Truck Stop Road;Ben Lomond;California;96221;26

But doesn't the name "Giberson" end in "son"? Shouldn't it be ignored along with the others? Well, yes, the name does end in "son." However, the line was printed anyway because of the "on" in Ben Lomond, the city in which Giberson resides.
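You can verify this with the one verbatim line plus a second, hedged record (the Nathanson address here is invented for illustration):

```shell
# Only lines with "on" not preceded by "s" and not followed by ";" survive.
printf '%s\n' \
  'Giberson, Suzanne;102 Truck Stop Road;Ben Lomond;California;96221;26' \
  'Nathanson, Robert;1 Main Street;Beaverton;Oregon;97005;33' |
awk '/[^s]on[^;]/ { print $0 }'
# prints only the Giberson line, because of the "on" in "Ben Lomond"
```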

We can also use addresses as part of the search criteria. For example, assume that we need to print out only those lines in which the first field (i.e., the person's last name) is in the first half of the alphabet. Because this list is sorted, we could look for all the lines between the one starting with "A" and the one starting with "M." Therefore, we could use a line like this:

/^A/,/^M/ {print $0}

When we run it, we get no output at all.

What happened? There are certainly several names in the first half of the
alphabet. Why didn't this print anything? Well,
it printed exactly what we told it to print. Like the addresses in both
vi and
sed, awk searches for a line that matches the
criteria we specified. So, what we really said was "Find the first line that
starts with an A and then print all the lines up to and
including the last one starting with an M." Because there was no line starting
with an "A," the start address
didn't exist. Instead, the code to get what we really want would look like this:

/^[A-M]/ {print $0}

This says to print all the lines whose first character is in the range A-M. Because this checks every line and isn't looking for starting and ending addresses, we could even have used an unsorted file and still gotten all the lines we wanted.

If we wanted to use a starting and ending address,
we would have to specify starting and ending letters that actually exist in our file. For example:

/^B/,/^H/ {print $0}
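A quick sketch with just the last names (taken from the examples in this section) shows the difference between the range and the character class:

```shell
# Sorted last names; note that none of them starts with "A".
names='Blinn
Dickson
Giberson
Holder
Nathanson
Richards'

# The range never turns on, because /^A/ never matches:
echo "$names" | awk '/^A/,/^M/ { print $0 }'
# (no output)

# The character class tests every line on its own:
echo "$names" | awk '/^[A-M]/ { print $0 }'
# prints Blinn, Dickson, Giberson, and Holder

# A range whose endpoints really occur works as expected:
echo "$names" | awk '/^B/,/^H/ { print $0 }'
# prints Blinn through Holder
```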

Because printing is a very useful aspect of awk, it's nice to know that there are actually two ways of printing with awk.
The first we just mentioned. However, if you use printf instead of print, you
can specify the format of the output in
greater detail. If you are familiar with the C programming language, you already
have a head start, as the format of
printf is essentially the same as in C. However, there are a couple of
differences that you will see immediately if you are a
C programmer.

For example, if we wanted to print both the name and age with this line

$6 > 30 { printf "%20s %5d\n", $1, $6 }

the output would look like this:

        Blinn, David    33
    Dickson, Tillman    34
      Holder, Wyliam    42
   Nathanson, Robert    33
      Richards, John    36

Each name is printed right-justified in a field 20 characters wide, followed by a space and the age right-justified in a field five characters wide.
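You can check the formatting with a single record. Remember that the field separator must be a semicolon so that $1 is the whole name (the address in this line is invented):

```shell
echo 'Blinn, David;123 Any Street;Boston;Massachusetts;02134;33' |
awk -F';' '$6 > 30 { printf "%20s %5d\n", $1, $6 }'
# the name occupies 20 characters, then a space and the age in 5
```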

Because awk reads each line as a single record and the blocks of text in each record as fields, it needs to keep track of how many records and how many fields there are. These are held in the NR and NF variables, respectively.

Another way of using awk is at the end of a pipe. For example, you may have multiple-line output from one command or another but want only one or two fields from each line. To be more specific, you may want only the permissions and file names from an ls -l output. You would then pipe it through awk, like this:

ls -l | awk '{ print $1, $9 }'

This brings up the concept of variables. Like other languages, awk enables
you to define variables. A couple
are already predefined and come in handy. For example, what if we didn't know
off the tops of our heads that there
were nine fields in the ls -l output? Because we know that we wanted the first
and the last field, we can use the
variable
that specifies the number of fields. The line would then look like this:

ls -l | awk '{ print $1" "$NF }'

In this example, the space enclosed in quotes is necessary; otherwise, awk would print $1 and $NF right next to each other. (A comma between them, as in print $1, $NF, would also produce a space.)
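Because ls -l output varies from system to system, here is the same idea on a canned line (the owner and file name are invented), which makes the result predictable:

```shell
# A line shaped like ls -l output.
line='-rw-r--r-- 1 jimmo users 1024 Oct 23 12:00 notes.txt'

echo "$line" | awk '{ print $1 " " $NF }'
# prints: -rw-r--r-- notes.txt

# A comma between the items also produces a space (awk's output field
# separator), so this is equivalent:
echo "$line" | awk '{ print $1, $NF }'
```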

Another variable
that awk uses to keep track of the number of records read so far is NR. This can
be useful, for example, if you only want to
see a particular part of the text.
Remember our example at the beginning of this section where we wanted to see
lines 5-10 of a file (to look for an address
in the header)? In the last section, I showed you how to do it with sed, and now
I'll show you with awk.

We can use the fact that the NR variable
keeps track of the number of records, and because each line is a record, the NR
variable also keeps track of the
number of lines. So, we'll tell awk that we want to print out each line between
5 and 10, like this:

cat datafile | awk 'NR >= 5 && NR <= 10'

This brings up four new issues. The first is the NR
variable
itself. The second is the use of the double ampersand (&&). As in C,
this means a logical AND. Both the right
and the left sides of the expression must be true for the entire expression to
be true. In this example, if we read a line and the
value of NR is greater than or equal to 5 (i.e., we have read in at least five
lines) and the number of lines read
is not more than 10, the expression meets the logical AND criteria. The third
issue is that there is no print statement. The
default action of awk, when it doesn't have any additional instructions, is to
print out each line that matches the pattern.
(You can find a list of other built-in variables in the table below.)

The last issue is the use of the variable NR. Note that here there is no dollar sign ($) in front of the variable, because we are looking for the value of NR itself, not the field it would point to. We do not need to prefix a variable with $ unless it is a field variable. Confused? Let's look at another example.
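A short sketch makes the distinction clear:

```shell
# NF is the number of fields; $NF is the field that number refers to.
echo "one two three" | awk '{ print NF, $NF }'
# prints: 3 three

# NR needs no $ either; here the condition alone selects lines 5 through 10,
# and the default action prints them:
seq 1 20 | awk 'NR >= 5 && NR <= 10'
```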

Let's say we wanted to print out only the lines with more than nine fields. We could do it like this:

cat datafile | awk 'NF > 9'

Note that the condition stands alone as the pattern; wrapped in curly braces it would become an action whose result is simply discarded.

Compare this to

cat datafile | awk '{ print $NF }'

which prints out the last field in every line.

Up to now, we've been talking about one-line awk commands. These have all performed a single action on each line.
However, awk has the ability to do multiple tasks on each line as well as a task
before it begins reading and after it has
finished reading.

We use the BEGIN and END pair as markers. These are treated like any other
pattern. Therefore, anything appearing
after the BEGIN pattern is done before the first line is read. Anything after
the END pattern is done after the last line is read.
Let's look at this script:

Following the BEGIN pattern is a definition of the field
separator.
This is therefore done before the first line is read. Each line is then processed four times, and a different set of fields is printed each time.
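The script itself was lost when this page was converted, but the description pins down its shape. The following is a hedged reconstruction: FS is defined after BEGIN, and four separate rules each print a different set of fields from every line. The exact fields and labels are assumptions.

```shell
# Hypothetical reconstruction of the lost script: one BEGIN block plus
# four rules, each of which runs once per input line.
cat > awk.scr <<'EOF'
BEGIN { FS = ";" }
{ print "Name:  " $1 }
{ print "City:  " $3 }
{ print "State: " $4 }
{ print "Age:   " $6 }
EOF

echo 'Giberson, Suzanne;102 Truck Stop Road;Ben Lomond;California;96221;26' |
awk -f awk.scr
# prints four lines for the one input record
```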

Aside from having a predefined set of variables to use, awk allows us to define variables ourselves. If, in the last awk script, we had wanted to print out, let's say, the average age, we could add a line in the middle of the script that looked like this:

{total = total + $6 }

Because $6 denotes the age of each person, every time we run through the
loop, it is added to
the variable
total. Unlike other languages, such as C, we don't have to initialize the
variables; awk will do that for us.
Strings are initialized to the null string and numeric variables are initialized
to 0.

After the END, we can include another line to print out our sum, like
this:
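The closing line did not survive; a plausible version prints the total and, since the goal was the average age, the total divided by NR. In this self-contained sketch, the data is reduced to names and ages, with x standing in for the unused fields:

```shell
printf '%s\n' \
  'Blinn, David;x;x;x;x;33' \
  'Giberson, Suzanne;x;x;x;x;26' |
awk 'BEGIN { FS = ";" }
     { total = total + $6 }
     END { print "Total age:", total
           print "Average age:", total / NR }'
# prints:
# Total age: 59
# Average age: 29.5
```

Note that total needs no initialization; awk starts numeric variables at 0.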

Is that all there is to it? No. In fact, we have barely scratched the surface. awk is a full programming language, and there are dozens more topics we could have addressed. Built into the language are mathematical functions, if statements and while loops, the ability to create your own functions, string and array manipulation, and much more.

Unfortunately, this is not a book on UNIX
programming languages. Some readers may be disappointed that I do not have the
space to cover awk in more detail. I
am also disappointed. However, I have given you a basic introduction to the
constructs of the language to enable you to
better understand the more than 100 scripts on your system that use awk in some
way.


Copyright 2002-2009 by James Mohr. Licensed under modified GNU Free Documentation License (Portions of this material originally published by Prentice Hall, Pearson Education, Inc). See here for details. All rights reserved.
