* The Awk programming language was designed to be simple but powerful. It
allows a user to perform relatively sophisticated text-manipulation
operations through Awk programs written on the command line. For example,
suppose we want to turn a document with single-spacing into a document with
double-spacing. We could easily do that with the following Awk program:

awk '{print ; print ""}' infile > outfile

Notice how single-quotes (' ') are used to allow using double-quotes (" ")
within the Awk expression. This "hides" special characters from the shell.
We could also do this as follows:

awk "{print ; print \"\"}" infile > outfile

-- but the single-quote method is simpler.

This program does what it is supposed to, but it also doubles every blank line
in the input file, which leaves a lot of empty space in the output. That's
easy to fix -- just tell Awk to print an extra blank line if the current line
is not blank:

awk '{print ; if (NF != 0) print ""}' infile > outfile

* One of the problems with Awk is that it is ingenious enough to make a user
want to tinker with it, and want to use it for tasks for which it isn't
really appropriate. For example, we could use Awk to count the number of
lines in a file:

awk 'END {print NR}' infile

-- but this is dumb, because the "wc" (word count) utility gives the same
answer with less bother when invoked as "wc -l". Use the right tool for the
job.

Awk is the right tool for slightly more complicated tasks. Suppose we have a
file containing an email distribution list, with the email addresses of
various groups placed on consecutive lines in the file, and the different
groups separated by blank lines. If we want to quickly and reliably
determine how many people are on the distribution list, we can't use "wc",
since it counts blank lines, but Awk handles it easily:

awk 'NF != 0 {++count} END {print count}' list

* Awk is useful for performing simple iterative computations for which a more
sophisticated language like C might prove overkill. Consider the Fibonacci
sequence:

1 1 2 3 5 8 13 21 34 ...

Each element in the sequence is constructed by adding the two previous
elements together, with the first two elements defined as both "1". It's a
discrete formula for exponential growth. It is very easy to use Awk to
generate this sequence:
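
A minimal sketch of such a generator follows. It assumes we only want the
first ten elements; the shell function name "fib" and the Awk variable names
are purely illustrative:

```shell
# Print the first ten Fibonacci numbers, one per line.
fib() {
    awk 'BEGIN {
        a = 1; b = 1
        for (i = 1; i <= 10; i++) {
            print a
            t = a; a = b; b = t + b   # slide the pair one step forward
        }
    }'
}

fib
```

Since the whole program lives in a BEGIN clause, Awk never reads any input
and simply exits when the loop is done.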

* Sometimes an Awk program needs to be used repeatedly. In that case, it's
simple to execute the Awk program from a shell script. For example, consider
an Awk script to print each word in a file on a separate line. This could be
done with a script named "words" containing:

awk '{c=split($0, s); for(n=1; n<=c; ++n) print s[n] }' $1

"Words" could then be made executable (using "chmod +x words") and the
resulting shell "program" invoked just like any other command. For example,
"words" could be invoked from the "vi" text editor as follows:

:%!words

This would turn all the text into a list of single words.

* For another example, consider the double-spacing program mentioned
previously. This could be slightly changed to accept standard input, using a
"-" as described earlier, then copied into a file named "double":

awk '{print; if (NF != 0) print ""}' -

-- and then could be invoked from "vi" to double-space all the text in the
editor.

* The next step would be to also allow "double" to perform the reverse
operation: to take a double-spaced file and return it to single-spaced,
using the option:

undouble

The first part of the task is, of course, to design a way of stripping out
the extra blank lines, without destroying the spacing of the original
single-spaced file by taking out all the blank lines. The simplest
approach would be to delete every other blank line in a continuous block of
such blank lines. This won't necessarily preserve the original spacing, but
it will preserve spacing in some form.

The method for achieving this is also simple, and involves using a variable
named "skip". This variable is set to "1" every time a blank line is
skipped, to tell the Awk program not to skip the next one. The scheme is
as follows:

   BEGIN {set skip to 0}
   scan the input:
      if skip == 0    if line is blank
                         skip = 1
                      else
                         print the line
                      get next line of input
      if skip == 1    print the line
                      skip = 0
                      get next line of input
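
The scheme translates into Awk almost line for line. The following is one
possible rendering, not necessarily the one the "double" script would
finally use; the shell function name "undouble" is an assumption made only
so the example can be run on its own:

```shell
# Delete every other blank line in each run of blank lines.
undouble() {
    awk 'BEGIN { skip = 0 }
    skip == 0 {
        if (NF == 0)
            skip = 1     # drop this blank line, but keep the next line
        else
            print
        next
    }
    skip == 1 {
        print
        skip = 0
        next
    }' "$@"
}

printf 'a\n\nb\n\n\n\nc\n' | undouble
```

In the sample input, the single blank line after "a" disappears, and the run
of three blank lines after "b" is reduced to one.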

Remember that when "\" is used to embed an Awk program in a script file, the
program appears as one line to Awk. A semicolon must be used to
separate commands.

For a more sophisticated example, suppose we're writing an extended document
and find out that we somehow end up with the same word typed in twice: "And
the result was also also that ... " Such duplicate words are hard to spot on
proofreading, but it is straightforward to write an Awk program to do the
job: scan through a text file to find duplicates, print each duplicate word
and the line it is found on, and print "no duplicates found" if there are
none.

The "w" variable stores each word in the file, comparing it to the next word
in the file; w is initialized to "xy-zzy" since that is unlikely to be a word
in the file. The "dup" variable is initialized to 0 and set to 1 if a
duplicate is found; if it's still 0 at the end of the input, the program
prints the "no duplicates found" message. As with the previous example, we
could put this into a separate file or embed it into a script file.
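
A sketch of a program along these lines is shown below. The "w" and "dup"
variables are the ones described above; the shell function name "finddups"
and the exact wording of the report line are assumptions:

```shell
# Report any word that is immediately followed by an identical word.
finddups() {
    awk 'BEGIN { dup = 0; w = "xy-zzy" }
    {
        for (i = 1; i <= NF; i++) {
            if ($i == w) {
                printf("duplicate \"%s\" on line %d\n", $i, NR)
                dup = 1
            }
            w = $i       # remember this word for the next comparison
        }
    }
    END { if (dup == 0) print "no duplicates found" }' "$@"
}

printf 'And the result was also also that\n' | finddups
```

Because "w" is not reset between lines, a word repeated across a line break
is also reported, on the line where the second copy appears. The comparison
is case-sensitive, so "The the" would not be caught by this sketch.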

* These last examples use variables to allow an Awk program to keep track of
what it has been doing. Awk, as repeatedly mentioned, operates in a cycle:
get a line, process it, get the next line, process it, and so on; to have an
Awk program remember things between cycles, it needs to leave a little
message for itself in a variable.

For example, say we want to match on a line whose first field has the value
1,000 -- but then print the next line. We could do that as follows:

This program sets a variable named "flag" when it finds a line starting with
1,000, and then goes and gets the next line of input. The next line of input
is printed, and then "flag" is cleared so the line after that won't be
printed.
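
One way to write the program just described, assuming the value is stored
in the file as plain "1000"; the shell function name "print_next" is
hypothetical:

```shell
# Print only the line that follows a line whose first field is 1000.
print_next() {
    awk 'BEGIN { flag = 0 }
    $1 == 1000 { flag = 1; next }
    flag == 1  { print; flag = 0 }' "$@"
}

printf '1000 a\nfirst\nsecond\n' | print_next
```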

If we wanted to print the next five lines, we could do that in much the
same way using a variable named, say, "counter":

This program initializes a variable named "counter" to 5 when it finds a line
starting with 1,000; for each of the following 5 lines of input, it prints
them and decrements "counter" until it is zero.
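
A corresponding sketch, again assuming the matching value appears as plain
"1000" and using a hypothetical shell function name, "print_next5":

```shell
# Print the five lines that follow a line whose first field is 1000.
print_next5() {
    awk 'BEGIN { counter = 0 }
    $1 == 1000  { counter = 5; next }
    counter > 0 { print; counter-- }' "$@"
}

printf '1000\nl1\nl2\nl3\nl4\nl5\nl6\nl7\n' | print_next5
```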

This approach can be taken to as great a level of elaboration as needed.
Suppose we have a list of, say, five different actions on five lines of
input, to be taken after matching a line of input; we can then create a
variable named, say, "state", that stores which item in the list to perform
next. The scheme is generally as follows:

   BEGIN {set state to 0}
   scan the input:
      if match         set state to 1
                       get next line of input
      if state == 1    do the first thing in the list
                       state = 2
                       get next line of input
      if state == 2    do the second thing in the list
                       state = 3
                       get next line of input
      if state == 3    do the third thing in the list
                       state = 4
                       get next line of input
      if state == 4    do the fourth thing in the list
                       state = 5
                       get next line of input
      if state == 5    do the fifth (and last) thing in the list
                       state = 0
                       get next line of input

This is called a "state machine". In this case, it's performing a simple
list of actions, but the same approach could also be used to perform a more
complicated branching sequence of actions, such as we might have in a
flowchart instead of a simple list.

We could assign state numbers to the blocks in the flowchart and then use
if-then tests for the decision-making blocks to set the state variable to
indicate which of the alternate actions should be performed next. However,
few Awk programs require such complexities, and going into more elaborate
examples here would probably be more confusing than it's worth. The
essential thing to remember is that an Awk program can leave messages for
itself in a variable on one line-scan cycle to tell it what to do on later
line-scan cycles.
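
As a small illustration of the state-variable scheme, here is a sketch with
three states rather than five. Everything specific in it is invented for the
example: the "HEADER" marker line, the three labels, and the shell function
name "label_record":

```shell
# After a HEADER line, label the next three lines in turn.
label_record() {
    awk 'BEGIN { state = 0 }
    /^HEADER/  { state = 1; next }
    state == 1 { print "name:   " $0; state = 2; next }
    state == 2 { print "street: " $0; state = 3; next }
    state == 3 { print "city:   " $0; state = 0; next }' "$@"
}

printf 'junk\nHEADER\nAnn\n1 Main St\nSpringfield\nmore junk\n' | label_record
```

Lines seen while "state" is 0 fall through every clause and are ignored.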

* Awk is an excellent tool for building Linux-style shell scripts, but there
are potential pitfalls. Say we have a scriptfile named "testscript", and it
takes two filenames as parameters:

testscript myfile1 myfile2

If we're executing Awk commands from a file, handling the two filenames isn't
very difficult. We can initialize variables on the command line as follows:

cat $1 $2 | awk -f testscript.awk f1=$1 f2=$2 > tmpfile

The Awk program will use two variables, "f1" and "f2", that are initialized
from the script command line variables "$1" and "$2".

Where this measure gets obnoxious is when we are specifying Awk commands
directly, which is preferable if possible since it reduces the number of
files needed to implement a script. The problem is that "$1" and "$2" have
different meanings to the scriptfile and to Awk. To the scriptfile, they are
command-line parameters, but to Awk they indicate text fields in the input.

The handling of these variables depends on how Awk print fields are defined
-- either enclosed in double-quotes (" ") or in single-quotes (' '). If we
invoke Awk as follows:

awk "{ print \"This is a test: \" $1 }" $1

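-- we won't get anything printed for the "$1" variable. Inside double-quotes,
the shell replaces "$1" with the script's first parameter before Awk ever
sees the program, so in this example Awk ends up printing an uninitialized
variable named "myfile1", which is empty. If we instead use single-quotes to
ensure that the scriptfile leaves the Awk positional variables alone, we can
insert scriptfile variables by initializing them to variables on the command
line: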

awk '{ print "This is a test: " $1 " / parm2 = " f }' f=$2 < $1

This provides the first field in "myfile1" as the first parameter and the
name of "myfile2" as the second parameter.

Remember that Awk is relatively slow and clumsy and should not be regarded as
the default tool for all scriptfile jobs. We can use "cat" to concatenate or
append files, "head" and "tail" to extract a given number of lines of text
from the front or back of a file, "grep" or "fgrep" to find lines in a
particular file, and "sed" to do search-and-replace operations on a stream
of text.

* The original version of Awk was developed in 1977. It was optimized for
throwing together "one-liners" or short, quick-and-dirty programs. However,
some users liked Awk so much that they used it for much more complicated
tasks. To quote the language's authors: "Our first reaction to a program
that didn't fit on one page was shock and amazement." Some users regarded
Awk as their primary programming tool, and many had in fact learned
programming using Awk.

After the authors got over their initial consternation, they decided to
accept the fact, and enhance Awk to make it a better general-purpose
programming tool. The new version of Awk was released in 1985. The new
version is often, if not always, known as Nawk ("New Awk") to distinguish it
from the old one.

* Nawk incorporates several major improvements. The most important is that
users can define their own functions. For example, the following Nawk
program implements the "signum" function:

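Function declarations can be placed in a program wherever a match-action
clause can. All parameters are local to the function. Awk has no separate
declaration for local variables: any other variable used inside a function
is global, so the usual idiom is to define local variables by listing them
as extra parameters that the caller does not supply.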
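
A sketch of what such a "signum" function might look like is given below,
applied to the first field of each input line. The shell wrapper name
"signum_demo" is hypothetical:

```shell
# Print each input number followed by its sign: -1, 0, or 1.
signum_demo() {
    awk 'function signum(n) {
        if (n < 0) return -1
        else if (n == 0) return 0
        else return 1
    }
    { print $1, signum($1) }' "$@"
}

printf '%s\n' -7 0 3.5 | signum_demo
```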

* A second improvement is a new function, "getline", that allows input from
files other than those specified in the command line at invocation (as well
as input from pipes). "Getline" can be used in a number of ways:
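
The common forms are plain "getline" (read the next record into $0),
"getline var" (read it into a variable), "getline < file" and
"getline var < file" (read from a named file), and "command | getline"
(read from the output of a command). Each returns 1 on success, 0 at end of
input, and -1 on error. The sketch below exercises two of these forms; the
shell function name "getline_demo" and the "echo hello" command are
assumptions, and the "-v" option is used only to pass the filename in before
the BEGIN clause runs:

```shell
# Count the lines of a side file, then read one line from a command.
getline_demo() {
    awk -v file="$1" 'BEGIN {
        n = 0
        while ((getline line < file) > 0)   # form: getline var < file
            n++
        close(file)
        print n " lines in lookup file"

        cmd = "echo hello"
        cmd | getline greeting              # form: command | getline var
        close(cmd)
        print greeting
    }'
}

printf 'x\ny\nz\n' > /tmp/lookup.$$
getline_demo /tmp/lookup.$$
rm -f /tmp/lookup.$$
```

Testing the return value with "> 0" matters: a bare "while (getline line <
file)" would loop forever on a missing file, since -1 counts as true.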

* A related function, "close", allows a file to be closed so it can be read
from the beginning again:

close("myfile")

* A new function, "system", allows Awk programs to invoke system commands:

system("rm myfile")

* Command-line parameters can be interpreted using two new predefined
variables, ARGC and ARGV, a mechanism instantly familiar to C programmers.
ARGC ("argument count") gives the number of command-line elements, and ARGV
("argument vector") is an array whose entries store the elements
individually.
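
For instance, the following sketch lists its own command-line elements. The
shell function name "show_args" is hypothetical, and the text printed for
ARGV[0] depends on which Awk implementation is installed:

```shell
# Print every command-line element with its index in ARGV.
show_args() {
    awk 'BEGIN {
        for (i = 0; i < ARGC; i++)
            print i, ARGV[i]
    }' "$@"
}

show_args alpha beta
```

ARGV[0] holds the name of the Awk command itself, so the user-supplied
elements start at index 1 and ARGC is one more than their number.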

* There is a new conditional-assignment expression, known as "?:", which is
used as follows:

status = (condition == "green")? "go" : "stop"

This translates to:

if (condition=="green") {status = "go"} else {status = "stop"}

This construct should also be familiar to C programmers.

* There are new math functions, such as trig and random-number functions:
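
Functions of this kind include "sin", "cos", "atan2", "rand", and "srand".
A brief sketch of their use follows; the shell function name "math_demo" and
the seed value are arbitrary choices:

```shell
# Exercise the trig and random-number functions.
math_demo() {
    awk 'BEGIN {
        pi = atan2(0, -1)
        printf("%.4f\n", pi)
        printf("%.4f\n", sin(pi / 2))
        printf("%.4f\n", cos(pi))

        srand(42)            # seed the generator
        r = rand()           # pseudo-random value, 0 <= r < 1
        ok = (r >= 0 && r < 1) ? "rand ok" : "rand out of range"
        print ok
    }'
}

math_demo
```

The trig functions work in radians, and "rand" produces the same sequence on
every run unless "srand" is called with a different seed.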

* There are new string functions, such as match and substitution functions:

   match(<target string>,<search string>)

      Search the target string for the search string, which is interpreted
      as a regular expression; return 0 if no match, return starting index
      of the matched text if match. Also sets built-in variable RSTART to
      the starting index, and sets built-in variable RLENGTH to the matched
      string's length.

   sub(<regular expression>,<replacement string>)

      Search for first match of regular expression in $0 and substitute
      replacement string. This function returns the number of substitutions
      made, as do the other substitution functions.

   sub(<regular expression>,<replacement string>,<target string>)

      Search for first match of regular expression in target string and
      substitute replacement string.

   gsub(<regular expression>,<replacement string>)

      Search for all matches of regular expression in $0 and substitute
      replacement string.

   gsub(<regular expression>,<replacement string>,<target string>)

      Search for all matches of regular expression in target string and
      substitute replacement string.
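
A short sketch of these functions in action, using made-up sample strings
and a hypothetical shell function name, "string_demo":

```shell
# Demonstrate match, sub, and gsub on sample strings.
string_demo() {
    awk 'BEGIN {
        s = "the quick brown fox"
        pos = match(s, /qu[a-z]+/)
        print pos, RSTART, RLENGTH     # 5 5 5

        n = sub(/o/, "0", s)           # first match only
        print n, s                     # 1 the quick br0wn fox

        t = "foo boo"
        n = gsub(/o/, "0", t)          # every match
        print n, t                     # 4 f00 b00
    }'
}

string_demo
```

Note that "sub" and "gsub" modify the target string in place, so the target
must be a variable or field rather than a constant.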

* There is a mechanism for handling multidimensional arrays. For example,
the following program creates and prints a matrix, and then prints the
transposition of the matrix:
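
One possible version of such a program is sketched here. The matrix size,
its contents, and the shell function name "matrix_demo" are all arbitrary
choices made for the example:

```shell
# Build a 3x3 matrix, print it, then print its transposition.
matrix_demo() {
    awk 'BEGIN {
        n = 3
        for (i = 1; i <= n; i++)
            for (j = 1; j <= n; j++)
                m[i, j] = (i - 1) * n + j

        for (i = 1; i <= n; i++) {        # the matrix, row by row
            for (j = 1; j <= n; j++)
                printf("%3d", m[i, j])
            printf("\n")
        }
        print ""
        for (i = 1; i <= n; i++) {        # swap subscripts to transpose
            for (j = 1; j <= n; j++)
                printf("%3d", m[j, i])
            printf("\n")
        }
    }'
}

matrix_demo
```

Under the hood the array is still one-dimensional: a subscript such as
"i, j" is joined into a single string key using the built-in variable
SUBSEP as the separator.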

Nawk also includes a new "delete" statement, which deletes array elements:

delete array[count]

* Characters can be expressed as octal codes. "\033", for example, can be
used to define an "escape" character.

* A new built-in variable, FNR, keeps track of the record number within the
current input file, as opposed to NR, which keeps track of the total number
of records read so far -- regardless of how many files have contributed to
that input. FNR starts over at 1 with the first record of each input file;
its behavior is otherwise identical to that of NR.
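
The difference shows up as soon as more than one file is read. In this
sketch, the two small files and the shell function name "fnr_demo" are
invented for the demonstration, and "mktemp -d" is assumed to be available:

```shell
# Show NR and FNR side by side while reading two files.
fnr_demo() {
    dir=$(mktemp -d)
    printf 'a1\na2\n' > "$dir/a"
    printf 'b1\n'     > "$dir/b"
    awk '{ print NR, FNR, $0 }' "$dir/a" "$dir/b"
    rm -r "$dir"
}

fnr_demo
```

On the single line of the second file, NR is 3 but FNR is back to 1.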

* While Nawk does have useful refinements, they are generally intended to
support the development of complicated programs. It is debatable whether Awk
is a good tool for building complicated programs, but those who would like
to know more about Nawk are encouraged to read THE AWK PROGRAMMING LANGUAGE
by Aho / Weinberger / Kernighan. This short, terse, detailed book outlines
the capabilities of Nawk and provides sophisticated examples of its use.

The search can be for an entire range of lines, bounded by two strings:

/<string1>/,/<string2>/

The search can be for any condition, such as line number, and can use the
following comparison operators:

== != < > <= >=

Different conditions can be ORed with "||" or ANDed with "&&".
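
As a quick sketch of both kinds of search, the program below uses a range
bounded by two strings and a compound condition on the line number. The
"START" and "STOP" markers, the output labels, and the shell function name
"pattern_demo" are made up for the example:

```shell
# Combine a range search with a compound line-number condition.
pattern_demo() {
    awk '/START/,/STOP/      { print "range: " $0 }
         NR == 1 || NR >= 5  { print "lines: " $0 }' "$@"
}

printf 'one\nSTART\ntwo\nSTOP\nfive\n' | pattern_demo
```

The range includes both boundary lines, so "START" and "STOP" themselves are
printed along with everything between them.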

   [<charlist or range>]    Match on any character in list or range.
   [^<charlist or range>]   Match on any character not in list or range.
   .                        Match any single character.
   *                        Match 0 or more occurrences of preceding
                            character or group.
   ?                        Match 0 or 1 occurrences of preceding character
                            or group.
   +                        Match 1 or more occurrences of preceding
                            character or group.

If a metacharacter is part of the search string, it can be "escaped" by
preceding it with a "\".

An integer part that specifies the minimum output width. (A leading "0"
causes the output to be padded with zeroes.)

A fractional part that specifies either the maximum number of characters
to be printed (for a string), or the number of digits to be printed to the
right of the decimal point (for floating-point formats).

The format codes are:

   d    Prints a number in decimal format.
   o    Prints a number in octal format.
   x    Prints a number in hexadecimal format.
   c    Prints a character, given its numeric code.
   s    Prints a string.
   e    Prints a number in exponential format.
   f    Prints a number in floating-point format.
   g    Prints a number in exponential or floating-point format.
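
A few of these codes, combined with width and precision values, are sketched
below. The sample values and the shell function name "printf_demo" are
arbitrary; the square brackets are there only to make the padding visible:

```shell
# Show the effect of width, zero-padding, and precision in printf.
printf_demo() {
    awk 'BEGIN {
        printf("[%5d] [%05d]\n", 42, 42)
        printf("[%.2f] [%8.3f] [%.3s]\n", 3.14159, 3.14159, "abcdef")
        printf("[%x] [%o] [%c]\n", 255, 8, 65)
    }'
}

printf_demo
```

The first line shows a minimum width of 5 with and without zero-padding, the
second shows the fractional part limiting decimal places and string length,
and the third shows the hexadecimal, octal, and character codes.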

* Awk can perform output redirection (using ">" and ">>") and piping (using
"|") from both "print" and "printf".