Other Stuff

R. Loui (loui@ai.wustl.edu) is Associate Professor of Computer Science at Washington University in St. Louis. He has published in AI Journal, Computational Intelligence, ACM SIGART, AI Magazine, AI and Law, the ACM Computing Surveys Symposium on AI, Cognitive Science, Minds and Machines, and Journal of Philosophy.

Whenever Ronald Loui teaches GAWK, he gives the students the choice of learning PERL instead. Ninety percent will choose GAWK after looking at a few simple examples of each language (samples shown below). Those who choose PERL do so because someone told them to learn PERL.

After one laboratory, more than half of the GAWK students are confident with their GAWK skills and can begin designing. Almost no student can become confident in PERL that quickly.

After a week, 90% of those who have attempted GAWK have mastered it, compared to fewer than 50% of PERL students attaining similar facility with the language (it would be unfair to require one to `master' PERL).

By the end of the semester, over 90% who have attempted GAWK have succeeded, and about two-thirds of those who have attempted PERL have succeeded.

To be fair, within a year, half of the GAWK programmers have also studied PERL. Most are doing so in order to read PERL and will not switch to writing PERL. No one who learns PERL migrates to GAWK.

PERL and GAWK appear to have similar programming, development, and debugging cycle times.

Finally, after a year, there seems to be a small advantage for GAWK over PERL in the programmer's willingness to begin a new program. That is, both GAWK and PERL programmers tend to enjoy writing a lot of programs, but GAWK has the slight edge here.

Two magic patterns are BEGIN and END. These are true before any input is read and after all the input files have been read, respectively. Use END for end actions (e.g. final reports) and BEGIN for start-up actions such as initializing default variables, setting the field separator, or resetting the seed of the random number generator.
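A minimal sketch of the two patterns at work, assuming comma-separated input with a number in the first field. The function name sum_lines, the variable total, and the fixed seed 1 are illustrative choices, not part of the original notes.

```shell
# Hypothetical helper: sum the first field of every line.
sum_lines() {
  awk '
    BEGIN { FS = ","      # set the field separator before any input is read
            total = 0     # initialize a default variable
            srand(1) }    # reset the random seed (fixed here for repeatability)
    { total += $1 }       # runs once per input line
    END { print "lines:", NR, "total:", total }   # final report
  '
}

printf '1,a\n2,b\n3,c\n' | sum_lines
```

Everything in BEGIN runs before the first line arrives, so the field separator is already in place when the first record is split.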

Regular Expressions

Do you know what these mean?

/^[ \t\n]*/

/[ \t\n]*$/

/^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$/

Well, the first two match leading and trailing blank space on a line, and the last one defines a floating-point number (optional sign, digits, optional decimal point, optional exponent) as a regular expression. Once we know that, we can do a bunch of common tasks like trimming away white space around a string.
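One way such a trim function might look is sketched here. The names trim and trim_lines are hypothetical, and the sketch uses the [ \t] form of the patterns, so it strips only spaces and tabs.

```shell
trim_lines() {
  awk '
    # Hypothetical helper: strip leading and trailing spaces and tabs from s.
    function trim(s) {
      sub(/^[ \t]*/, "", s)   # leading whitespace
      sub(/[ \t]*$/, "", s)   # trailing whitespace
      return s
    }
    { print "[" trim($0) "]" }   # brackets make the trimmed edges visible
  '
}

printf '  hello world \t\n' | trim_lines
```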

c
matches the character c (assuming c is a character with no special meaning in regexps).

\c
matches the literal character c; e.g. tabs and newlines are \t and \n respectively.

.
matches any single character (in gawk, this includes newline).

^
matches the beginning of a line or a string.

$
matches the end of a line or a string.

[abc...]
matches any of the characters abc... (character class).

[^abc...]
matches any character except abc... (negated character class).

r*
matches zero or more r's.

And that's enough to understand our trim function shown above. The regular expression /[ \t]*$/ means trailing whitespace; i.e. zero-or-more spaces or tabs followed by the end of line.

More Syntax:

But that's only the start of regular expressions. There's lots more. For example:

r+
matches one or more r's.

r?
matches zero or one r's.

r1|r2
matches either r1 or r2 (alternation).

r1r2
matches r1, and then r2 (concatenation).

(r)
matches r (grouping).

Now we can read ^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$ like this:

^[+-]? ...
Numbers begin with zero or one plus or minus signs.

...[0-9]+...
Simple numbers are just one or more digits.

...[.]?[0-9]*...
which may be followed by a decimal point and zero or more digits.

...|[.][0-9]+...
Alternatively, a number can have no leading digits and just start with a decimal point.

...([eE]...)?$
Also, there may be an exponent added.

...[+-]?[0-9]+)?$
and that exponent is a positive or negative bunch of digits.
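To check that reading, here is a small sketch that runs the same regular expression over a few sample strings. The names is_number and classify_numbers, and the sample inputs, are made up for illustration.

```shell
classify_numbers() {
  awk '
    # Hypothetical helper: 1 if s matches the number pattern, else 0.
    function is_number(s) {
      return (s ~ /^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$/)
    }
    { print $0, (is_number($0) ? "number" : "not a number") }
  '
}

printf '42\n-3.14\n.5e-3\n1e\nabc\n' | classify_numbers
```

The input 1e is rejected because the exponent group, once started, needs at least one digit.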

Associative arrays

Gawk has arrays, but they are only indexed by strings. This can be very useful, but it can also be annoying. For example, we can count the frequency of words in a document (ignoring the icky part about printing them out):

gawk '{for(i=1;i<=NF;i++) freq[$i]++ }' filename

The array will hold an integer value for each word that occurred in the file. Unfortunately, this treats "foo", "Foo", and "foo," as different words. Oh well. How do we print out these frequencies? Gawk has a special "for" construct that loops over the indices of an array. This script is longer than most command lines, so it is better written as an executable script.
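A sketch of such a word-frequency script, wrapped in a shell function here so it can be pasted into a terminal. The name word_freq is hypothetical, and the output is piped through sort only because the order of a "for ... in" scan is undefined.

```shell
word_freq() {
  awk '
    { for (i = 1; i <= NF; i++) freq[$i]++ }            # count every field
    END { for (word in freq) print word, freq[word] }   # one line per distinct word
  ' "$@"
}

printf 'foo bar foo\nFoo foo,\n' | word_freq | sort
```

As the text warns, "foo", "Foo", and "foo," each get their own count.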

You can find out if an element exists in an array at a certain index with the expression:

index in array

This expression tests whether or not the particular index exists,
without the side effect of creating that element if it is not present.

You can remove an individual element of an array using the delete statement:

delete array[index]

It is not an error to delete an element which does not exist.
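The behaviour of "in" and delete can be checked with a short sketch like this one. The array name seen, the helper count, and the function name membership_demo are illustrative only.

```shell
membership_demo() {
  awk '
    # Hypothetical helper: number of elements in array a.
    function count(a,   k, n) { n = 0; for (k in a) n++; return n }
    BEGIN {
      seen["apple"] = 1
      has_apple = ("apple" in seen) ? "yes" : "no"
      has_pear  = ("pear" in seen) ? "yes" : "no"   # does not create seen["pear"]
      print "apple:", has_apple
      print "pear:", has_pear
      print "size after tests:", count(seen)
      delete seen["apple"]
      delete seen["pear"]                           # missing element: not an error
      print "size after deletes:", count(seen)
    }
  '
}

membership_demo
```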

Gawk has a special kind of for statement for scanning an array:

for (var in array)
body

This loop executes body once for each different value that your program has previously used as an index in array, with the variable var set to that index.

The order in which the array is scanned is not defined.

To scan an array in some numeric order, you need to use keys 1,2,3,... and store somewhere that the array is N long. Then you can loop over the keys from 1 to N.

Here are some useful array functions. We begin with the usual stack stuff. These stacks have items 1,2,3,... and position 0 is reserved for the size of the stack.
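The original stack functions are not reproduced here, so what follows is only a sketch of the convention just described: items at 1,2,3,... and the size kept at position 0. The names push, pop, lines, and stack_demo are assumptions.

```shell
stack_demo() {
  awk '
    # Hypothetical stack helpers: items live at 1..N, stack[0] holds N.
    function push(stack, item) { stack[++stack[0]] = item }
    function pop(stack,   item) {
      if (stack[0] < 1) return ""        # empty stack
      item = stack[stack[0]]
      delete stack[stack[0]]
      stack[0]--
      return item
    }
    { push(lines, $0) }                  # store each input line
    END {
      for (i = 1; i <= lines[0]; i++)    # scan in numeric order
        print i, lines[i]
      print "popped:", pop(lines)
      print "size:", lines[0]
    }
  '
}

printf 'a\nb\nc\n' | stack_demo
```

Because the size lives in the array itself, a stack can be passed to a function as a single argument.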

Note that the third argument of the split function can be any regular expression.
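For instance, a sketch that splits on a comma or semicolon together with any surrounding blanks. The separator pattern, the array name parts, and the function name split_demo are chosen here for illustration.

```shell
split_demo() {
  awk '
    {
      # The third argument is a regular expression, not just a single character.
      n = split($0, parts, /[ \t]*[,;][ \t]*/)
      for (i = 1; i <= n; i++) print i ": " parts[i]
    }
  '
}

printf 'red , green;blue\n' | split_demo
```

split returns the number of pieces, which gives the N needed to scan the result in numeric order.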

By the way, here's a nice trick with arrays. To print the lines of a file in a random order:

BEGIN {srand()}
{Array[rand()]=$0}
END {for(I in Array) print Array[I]}

Short, eh? This is not a perfect solution. Gawk can only generate
1,000,000 different random numbers, so the birthday paradox cautions
that there is a small chance that some lines will be lost when different
lines are written to the same randomly selected location. After some
experiments, I can report that you lose around one item after 1,000
inserts and 10 to 12 items after 10,000 random inserts. Nothing to write
home about really. But for larger item sets, the above three-liner is not
what you want to use. For example, 10,000 to 12,000 items (more than 10%)
are lost after 100,000 random inserts. Not good!
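For larger inputs, one alternative is sketched below. It is a different technique from the random-key trick: the lines are stored under keys 1..NR, so none can collide, and are then reordered with a Fisher-Yates shuffle. The function name shuffle_lines and the variable names are hypothetical.

```shell
shuffle_lines() {
  awk '
    BEGIN { srand() }
    { line[NR] = $0 }                    # keys 1..NR, so no two lines collide
    END {
      for (i = NR; i > 1; i--) {         # Fisher-Yates shuffle
        j = int(rand() * i) + 1          # random index in 1..i
        tmp = line[i]; line[i] = line[j]; line[j] = tmp
      }
      for (i = 1; i <= NR; i++) print line[i]
    }
  '
}

printf 'a\nb\nc\n' | shuffle_lines
```

This is longer than three lines, but every input line comes out exactly once however large the file is.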

USAGE

Most of my experience comes from a version of GNU awk (gawk) compiled for
Win32. Note in particular that DJGPP compilations permit the awk script
to follow Unix quoting syntax '/like/ {"this"}'. However, the user must
know that single quotes under DOS/Windows do not protect the redirection
arrows (<, >) nor do they protect pipes (|). Both are special symbols
for the DOS/CMD command shell and their special meaning is ignored only
if they are placed within "double quotes." Likewise, DOS/Win users must
remember that the percent sign (%) is used to mark DOS/Win environment
variables, so it must be doubled (%%) to yield a single percent sign
visible to awk.

If I am sure that a script will NOT need to be quoted in Unix, DOS, or
CMD, then I normally omit the quote marks. If an example is peculiar to
GNU awk, the command 'gawk' will be used. Please notify me if you find
errors or new commands to add to this list (total length under 65
characters). I usually try to put the shortest script first.

File Spacing

Double space a file

awk '1;{print ""}'
awk 'BEGIN{ORS="\n\n"};1'

Double space a file which already has blank lines in it. Output file
should contain no more than one blank line between lines of text.
NOTE: On Unix systems, DOS lines which have only CRLF (\r\n) are
often treated as non-blank, and thus 'NF' alone will return TRUE.

awk 'NF{print $0 "\n"}'
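If the input may have DOS line endings, one possible workaround for the NOTE above is to strip the trailing carriage return before testing NF. This is a sketch, not one of the original one-liners; the function name double_space_crlf is hypothetical, and the output no longer carries the \r characters.

```shell
double_space_crlf() {
  awk '{ sub(/\r$/, "") }   # drop a trailing carriage return, if any
       NF { print $0 "\n" }'
}

printf 'one\r\n\r\ntwo\r\n' | double_space_crlf
```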

Triple space a file

awk '1;{print "\n"}'

Numbering and Calculations

Precede each line by its line number FOR THAT FILE (left alignment).
Using a tab (\t) instead of space will preserve margins.

awk '{print FNR "\t" $0}' files*

Precede each line by its line number FOR ALL FILES TOGETHER, with tab.

awk '{print NR "\t" $0}' files*
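The difference between FNR and NR only shows up with more than one input file. The sketch below prints both counters side by side; the temporary files f1 and f2 and the function name number_demo are made up for the demonstration.

```shell
number_demo() {
  dir=$(mktemp -d) || return 1
  printf 'a\nb\n' > "$dir/f1"      # hypothetical two-line file
  printf 'c\n' > "$dir/f2"         # hypothetical one-line file
  # FNR restarts at 1 for each file; NR keeps counting across files.
  awk '{ print FNR "\t" NR "\t" $0 }' "$dir/f1" "$dir/f2"
  rm -rf "$dir"
}

number_demo
```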

Number each line of a file (number on left, right-aligned)
Double the percent signs if typing from the DOS command prompt.

awk '{printf("%5d : %s\n", NR,$0)}'

Number each line of file, but only print numbers if line is not blank
Remember caveats about Unix treatment of \r (mentioned above)
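The script for this entry is missing here, so the following is only a sketch of one way to do it. It assumes blank lines should pass through unnumbered and should not consume a number. The counter n and the function name number_nonblank are hypothetical, and the \r caveat still applies.

```shell
number_nonblank() {
  awk 'NF { printf("%5d : %s\n", ++n, $0); next }   # number non-blank lines only
       { print }                                     # blank lines pass through
  '
}

printf 'a\n\nb\n' | number_nonblank
```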

The manual ("man") pages on Unix systems may be helpful (try "man awk",
"man nawk", "man regexp", or the section on regular expressions in "man
ed"), but man pages are notoriously difficult. They are not written to
teach awk use or regexps to first-time users, but as a reference text
for those already acquainted with these tools.

USE OF '\t' IN awk SCRIPTS: For clarity in documentation, we have used
the expression '\t' to indicate a tab character (0x09) in the scripts.
All versions of awk, even the UNIX System 7 version, should recognize
the '\t' abbreviation.

Here are a few short programs that do the same thing in each language. When reading these examples, the question to ask is `how many language features do I need to understand in order to understand the syntax of these examples'.

Some of these are longer than they need to be, since they don't exploit (e.g.) a command-line trick to wrap the code in `for each line do X'. And that is the point: for teachability, the preferred language is the one you need to know LESS about before you can be useful in it.