3. Invoking AWK programs

AWK (named after its creators
Al Aho,
Peter Weinberger and
Brian Kernighan)
is a very powerful text processing language.
It features automatic splitting of each input line
into fields, associative arrays (arrays indexed by strings),
and built-in string-oriented functions.

Brian Kernighan said about AWK:

It was originally for writing these one and two line
programs. It really was.
I think it's very seductive because it does so many things
automatically. It handles strings and numbers smoothly. It is an interpreter and there's
no baggage, no derived object files.
People start to write
a one and two line program that just grows and grows;
some of them grow unbelievably large: tens of
thousands of lines -- which is nonsense.

Brian Kernighan, cited by
Peter H. Salus (who in turn cites Peter Collinson from the
".EXE" magazine).
From
A Quarter Century of UNIX, pp. 103-104.

This article describes three ways to interface
AWK programs with shell scripts and how
to import shell variables into AWK programs.

This text assumes a good understanding of AWK
and shell scripting. If you want to learn
how to program using AWK, you should read an
AWK introduction, e.g. one of the documents
in the bibliography.

If appropriate we will differentiate between
oawk, nawk, and
awk.

oawk (old AWK)
is the first AWK version, and is still around on many
UNIX systems. If you have an oawk on your
system, you probably have nawk, too.
There is no need to prefer oawk to
nawk, except for older AWK scripts that
require the older AWK version.

nawk (new AWK)
is an extension of oawk that is now the standard AWK
version. Any references made here to this version apply to
the GNU AWK gawk as well.

If we refer to any AWK version we will just
write AWK. If you are interested in other AWK programs
for different operating systems, you should have
a look at the
AWK FAQ.

For now we'll ignore the fact that the script
is not working correctly and describe how it
should have worked.

The script assigns the first command line
parameter ("main" in the command above)
to the script variable SearchString
and then calls awk to search this given
string in all c files ("*.c") specified on the command line.
The special shell variable $@ will be expanded
to the file name list.

At this time, however, it only searches for
the constant SearchString instead of
the value of the script variable SearchString. The script will find
all occurrences of the string "SearchString"
in all files specified, no matter what search
string we specify on the command line.
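
A minimal sketch of the faulty behaviour, assuming the script
passes its AWK program in single quotes. The function name
textsearch_broken and the demo file are illustrative only and
do not come from the original script.

```shell
# Hypothetical reconstruction of the faulty search, wrapped in a function.
textsearch_broken() {
    SearchString=$1     # first argument: the string to search for
    shift               # the remaining arguments are the file names
    # Inside single quotes the shell expands nothing, so AWK sees the
    # literal regular expression /SearchString/.
    awk '/SearchString/ {print}' "$@"
}

# Demo on a temporary file with one line per candidate match
demo=$(mktemp)
printf '%s\n' 'int main(void)' 'char *SearchString;' > "$demo"
textsearch_broken main "$demo"   # prints only the line containing "SearchString"
rm -f "$demo"
```

Although we ask for "main", only the line containing the
literal text "SearchString" is printed.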

But how do we get the contents of the shell script
variable into the AWK program?

Using "pseudo files";
specifying variable=value pairs
on the command line.

The second method has the disadvantage of not being
portable to older versions of awk (and
even different versions of nawk).
The third method has some disadvantages we will
describe later.
Therefore we will explain the first, preferred method
in detail.

The first part consists only of the character "/" that
introduces a search pattern.

The second part consists of the contents of
the shell variable SearchString.

The third part consists of the character "/" that
ends the search pattern, and the AWK action "{print}"
that prints the line matching the pattern (we could
omit the "{print}", because it is the default
action).

It is essential that all three parts are written
together without any whitespace, because
AWK only takes one program on the command line and
will complain about any further program found.

What happens if we call this script textsearch
with "hello" as an argument?

$ textsearch hello *.doc

Inside of the script the first argument "hello" will
be assigned to the shell variable SearchString,
and AWK will be called the following way:

awk '/hello/ {print}' file1.doc file2.doc

We now have exactly the solution for our problem:
this is a way to import a shell environment
variable into AWK.

There's still one problem left. Consider the following
invocation of our script:

$ textsearch "our house" *.doc

Now the variable SearchString gets the
value "our house", which results in the following AWK
invocation:

awk '/our' 'house/ {print}' file1.doc file2.doc

Now our AWK program
is split into two parts, resulting in AWK error messages.
The first part '/our' is taken to be
the (invalid) program code, and 'house/ {print}'
to be an (invalid) file name.
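
The splitting can be observed without calling AWK at all.
The following sketch uses a hypothetical helper, show_args,
that prints each argument it receives on its own line; it
assumes a POSIX shell with the default IFS.

```shell
# show_args is an illustrative helper: one line of output per argument.
show_args() {
    printf '[%s]\n' "$@"
}

SearchString="our house"

# Unquoted expansion: the shell splits the value at the space, so the
# intended single program text arrives as two separate arguments.
show_args '/'$SearchString'/ {print}'
# prints:
# [/our]
# [house/ {print}]
```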

The solution to this problem is simple: the shell
variable should be enclosed in double quotes, which
prevent the shell from splitting its value into words:

awk '/'"$SearchString"'/ {print}' "$@"
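
Put together, the whole script might look like the following
sketch. It is written as a shell function so that it can be
tried interactively; the demo file and its contents are
made up for illustration.

```shell
# Sketch of textsearch using shell script embedding.
textsearch() {
    SearchString=$1     # first argument: the search string
    shift               # the remaining arguments are the file names
    # Three parts, concatenated without whitespace:
    #   '/'  +  "$SearchString"  +  '/ {print}'
    awk '/'"$SearchString"'/ {print}' "$@"
}

demo=$(mktemp)
printf '%s\n' 'welcome to our house' 'our car' > "$demo"
textsearch "our house" "$demo"   # prints: welcome to our house
rm -f "$demo"
```

Note that the search string becomes part of the AWK program
text. It is therefore interpreted as a regular expression,
and a "/" inside the string would end the pattern early.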

Now you are able to write large AWK programs
that may use shell script variables.
The embedding of AWK programs in shell scripts is
easy to use, portable, and allows the usage of
arbitrarily complex shell script commands for input
pre- or postprocessing.

The following example uses the technique
described above to transfer the name of a file
into the AWK script.

The script substitute substitutes
arbitrary words in the input with other words
specified in the file substitute.tab
in the current directory. The file contains lines
in the format

The script assigns the contents of the shell script variable
SearchString to the AWK variable
Search. This variable is then used
inside the AWK script to match
a line.

Note that we changed the search command from
"/SearchString/" to
"$0 ~ Search", because AWK variables cannot
be used between the slashes of a regular
expression constant /.../.
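
A minimal sketch of this variant, assuming an AWK that
supports the -v option. The function name textsearch_v and
the demo file are illustrative and not taken from the
original script.

```shell
# Sketch: pass the shell variable to AWK with -v instead of embedding it.
textsearch_v() {
    SearchString=$1     # first argument: the search string
    shift               # the remaining arguments are the file names
    # -v assigns the AWK variable Search before the program starts;
    # "$0 ~ Search" matches the current line against its value.
    awk -v Search="$SearchString" '$0 ~ Search {print}' "$@"
}

demo=$(mktemp)
printf '%s\n' 'welcome to our house' 'our car' > "$demo"
textsearch_v "our house" "$demo"   # prints: welcome to our house
rm -f "$demo"
```

Here the AWK program is one constant string in single quotes,
so spaces in the search string cannot split it. The value is
still used as a regular expression.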

Portability:

The -v option is available in POSIX-compliant awk
implementations.
The major disadvantage of this method is that it is not
portable to older versions: gawk supports it, but oawk
does not. Some of the nawk programs support it, some
(e.g. on SunOS 4.1.3) do not.

AWK knows another way to assign values
to AWK variables, like in the following example:

$ awk '{ print "var is", var }' var=TEST file1 file2

This statement assigns the value "TEST" to the AWK
variable "var", and then reads the files "file1"
and "file2". The assignment works, because AWK
interprets each file name containing an equal sign ("=")
as an assignment.

This example is very portable (even
oawk understands this syntax), and easy
to use. So why don't we use this syntax exclusively?

This syntax has two drawbacks. The first is that AWK
performs a variable assignment only at the moment it would
otherwise have opened the argument as a file. Since
the BEGIN action is performed before the
first file is read, the variable is not yet available
in the BEGIN action.
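
The effect is easy to reproduce. The following sketch uses a
temporary one-line input file (an assumption made for the
demonstration) and prints the variable both in the BEGIN
action and in the main action.

```shell
demo=$(mktemp)
echo "one line" > "$demo"

# var=TEST is processed after BEGIN has run, just before the file is read.
result=$(awk 'BEGIN { print "BEGIN: var is [" var "]" }
                    { print "main: var is [" var "]" }' var=TEST "$demo")
printf '%s\n' "$result"
# prints:
# BEGIN: var is []
# main: var is [TEST]

rm -f "$demo"
```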

The second problem is that the order of the
variable assignments and of the files is important.
In the following example

$ awk '{ print "var is", var }' file1 var=TEST file2

the variable var is still undefined
while file1 is read, and is set only while
file2 is read. This may cause bugs that are hard to track
down.
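
A small sketch makes the ordering visible. The two input
files are created in a temporary directory only for this
demonstration; FILENAME is the built-in AWK variable holding
the name of the current input file.

```shell
tmpdir=$(mktemp -d)
echo "line from file1" > "$tmpdir/file1"
echo "line from file2" > "$tmpdir/file2"

# The assignment sits between the two file names, so it takes
# effect only after file1 has been processed completely.
result=$(cd "$tmpdir" &&
    awk '{ print FILENAME ": var is [" var "]" }' file1 var=TEST file2)
printf '%s\n' "$result"
# prints:
# file1: var is []
# file2: var is [TEST]

rm -rf "$tmpdir"
```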

An equally portable way to achieve the same result
is shell script embedding, the preferred method.

Portability:

Assigning variables on the command line
is very portable, because even the first
versions of AWK support it. The internal
handling of AWK may cause subtle bugs, however,
and other methods should be preferred.

An interpreter line is the first line of
an executable text (non-binary) file. If the first two characters
of the file are "#!", the remainder of the line is taken
to be the name of an interpreter (a binary executable file),
optionally followed by an argument.
This program is then started with that argument and the
path name of the script file on its command line, so it
reads the script from the file and not from standard input.

This way any script may call its own interpreter, e.g.

#! /bin/awk -f
BEGIN {
    print "this script is read by AWK"
}

This is a comfortable way to call AWK scripts, because
in contrast to the "awk -f" solution the user does not
have to remember the whole path of the script (provided the
PATH environment variable is set correctly).

The programmer, however, still does not have a way
to pre- or postprocess the input/output of the AWK script.
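
The following sketch creates such a script and runs it
directly. Because the location of awk differs between
systems, the sketch looks it up with "command -v" instead of
hard-coding /bin/awk; that lookup and the temporary file are
assumptions made for this demonstration.

```shell
awk_path=$(command -v awk)      # e.g. /usr/bin/awk; varies by system
script=$(mktemp)

# Write a two-line AWK script whose first line names its interpreter.
printf '#!%s -f\nBEGIN { print "this script is read by AWK" }\n' \
    "$awk_path" > "$script"
chmod +x "$script"              # the file must be executable

# Running the file makes the system start: <awk_path> -f <script>
result=$("$script")
printf '%s\n' "$result"         # prints: this script is read by AWK

rm -f "$script"
```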

Portability:

Interpreter lines are a relatively new
UNIX feature that is now widely available.
It's available on
System V Release 4 based systems
(e.g. Solaris),
but not on older System V UNIXes.