The curly-bracket syntax allows for the shell's string operators.
String operators allow you to manipulate values of
variables in various useful ways without having to write full-blown
programs or resort to external UNIX utilities.
You can do a lot with string-handling operators even if
you haven't yet mastered the programming features
we'll see in later chapters.

In particular, string operators let you
do the following:

Ensure that variables exist (i.e., are defined and have non-null values)

The basic idea behind the syntax of string operators
is that special characters that denote operations are inserted
between the variable's name and the right curly bracket.
Any argument that the operator may need is inserted to the operator's right.

The first group of string-handling operators tests
for the existence of variables and allows substitutions of
default values under certain conditions. These
are listed in Table 4.1.[6]

[6] The colon (:) in each of these operators is actually optional.
If the colon is omitted, then change "exists and isn't null"
to "exists" in each definition, i.e., the
operator tests for existence only.

${varname:=word}

If varname exists and isn't null, return its value;
otherwise set it to word and then return its value.[7]

Purpose: Setting a variable to a default value if it is undefined.

Example: ${count:=0} sets count to 0 if it is undefined.

${varname:?message}

If varname exists and isn't null, return its value;
otherwise print varname: followed by message,
and abort the current command or script.
Omitting message produces the default message
parameter null or not set.

Purpose: Catching errors that result from variables being undefined.

Example: ${count:?"undefined!"} prints "count: undefined!"
and exits if count is undefined.

[7] Pascal, Modula, and Ada programmers may find it helpful to recognize the
similarity of this to the assignment operators in those languages.

The first two of these operators are ideal for setting defaults for
command-line arguments in case the user omits them. We'll use
the first one in our first programming task.
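Before turning to the task, here is a small sketch of the colon's effect
described in footnote 6, using the :- operator that the task relies on.
The variable names missing and empty are hypothetical, chosen only for
this illustration.

```shell
# Hypothetical variables used only for this illustration.
unset missing     # missing does not exist
empty=""          # empty exists but is null

colon_missing=${missing:-default}      # -> default
nocolon_missing=${missing-default}     # -> default
colon_empty=${empty:-default}          # -> default
nocolon_empty=${empty-default}         # -> null: empty exists, so its value is used

echo "missing: [$colon_missing] [$nocolon_missing]"
echo "empty:   [$colon_empty] [$nocolon_empty]"
```

Only the last case differs: without the colon, a variable that exists
but is null counts as "set", so no default is substituted.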

Task 4.1

You have a large album collection, and you want to write some
software to keep track of it. Assume that you have a file of data on
how many albums you have by each artist. Lines in the file look
like this:

14 Bach, J.S.
1 Balachander, S.
21 Beatles
6 Blakey, Art

Write a program that prints the N highest lines,
i.e., the N artists by whom you have the most albums.
The default for N should be 10.
The program should take one argument for the name of the input file
and an optional second argument for how many lines to print.

By far the best approach to this type of script is to
use built-in UNIX utilities, combining them with I/O redirectors
and pipes. This is the classic "building-block" philosophy
of UNIX that is another reason for its great popularity with
programmers. The building-block technique lets us write a first
version of the script that is only one line long:

sort -nr $1 | head -${2:-10}

Here is how this works:
the sort(1) program sorts the data in the file whose name
is given as the first argument ($1).
The -n option tells sort
to interpret the first word on each line as a number
(instead of as a character string);
the -r tells it to reverse the comparisons, so as to sort in
descending order.

The output of sort is piped into the head(1) utility, which, when
given the argument -N, prints the first N lines of its input on
the standard output. The expression -${2:-10} evaluates to a dash (-)
followed by the second argument if it is given, or to -10 if it's not;
notice that the variable in this expression is 2, which is
the second positional parameter.

Assume the script we want to write is called highest.
Then if the user types highest myfile, the line that actually runs is:

sort -nr myfile | head -10

Or if the user types highest myfile 22, the line that runs is:

sort -nr myfile | head -22

Make sure you understand how the :- string operator provides
a default value.
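To try this without creating a script file, the one-liner can be wrapped
in a shell function. The following is only a sketch: the function form,
the temporary data file, and the spelling head -n N (the POSIX form of
head -N) are our assumptions, not part of the task.

```shell
# highest as a function; $1 and $2 are the function's own arguments.
highest() {
    sort -nr "$1" | head -n "${2:-10}"
}

# Sample data in the format shown in the task, in a temporary file.
albums=$(mktemp)
cat > "$albums" <<'EOF'
14 Bach, J.S.
1 Balachander, S.
21 Beatles
6 Blakey, Art
EOF

top_default=$(highest "$albums")     # no second argument: up to 10 lines
top_two=$(highest "$albums" 2)       # second argument given: 2 lines

echo "$top_two"
rm -f "$albums"
```

With only four lines of data, the default of 10 simply prints all of
them in descending order.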

This is a perfectly good, runnable script, but it has a few
problems. First, its one line is a bit cryptic. While this
isn't much of a problem for such a tiny script, it's not
wise to write long, elaborate scripts in this manner. A few minor
changes will make the code more readable.

First, we can add
comments to the code; anything between # and the end of
a line is a comment. At a minimum,
the script should start with a few comment lines that indicate
what the script does and what arguments it accepts. Second, we
can improve the variable names by assigning the values of the
positional parameters to regular variables with mnemonic names.
Finally, we can add blank lines to space things out; blank lines,
like comments, are ignored. Here is a more readable version:
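A version along these lines might look like the sketch below. The two
assignments and the command line follow the surrounding text; the wording
of the comments is our assumption. So that it can be run with arguments,
the sketch saves the script to a temporary file and invokes it with sh
on some sample data.

```shell
# Save the sketch to a temporary file so that it can be run with arguments.
script=$(mktemp)
cat > "$script" <<'EOF'
#
#   highest filename [howmany]
#
#   Print the howmany highest-numbered lines of file filename.
#   The default for howmany is 10.
#

filename=$1
howmany=${2:-10}

sort -nr $filename | head -$howmany
EOF

# Sample data in the format shown in the task.
albums=$(mktemp)
printf '%s\n' '14 Bach, J.S.' '1 Balachander, S.' '21 Beatles' '6 Blakey, Art' > "$albums"

top_two=$(sh "$script" "$albums" 2)
echo "$top_two"
rm -f "$script" "$albums"
```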

The square brackets around howmany in the comments
adhere to the convention in UNIX documentation
that square brackets denote optional arguments.

The changes we just made improve the code's readability but not how it runs.
What if the user were to invoke the script without any arguments?
Remember that positional parameters default
to null if they aren't defined.
If there are no arguments, then $1 and $2 are both null.
The variable howmany ($2) is set up to default to 10, but there is
no default for filename ($1).
The result would be that this command runs:

sort -nr | head -10

As it happens, if sort is called without a filename argument,
it expects input to come from standard input, e.g.,
a pipe (|) or a user's terminal. Since it doesn't have the pipe,
it will expect the terminal. This means that the script will appear to hang!
Although you could always type [CTRL-D] or [CTRL-C]
to get out of the script, a naive user might not know this.

Therefore we need to make sure that the user supplies at least
one argument. There are a few ways of doing this; one of them
involves another string operator. We'll replace the line:

filename=$1

with:

filename=${1:?"filename missing."}

This will cause two things to happen if a user invokes the
script without any arguments: first the shell will print
the somewhat unfortunate message:

highest: 1: filename missing.

to the standard error output.
Second, the script will exit without running the remaining code.
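Both effects can be observed safely by running the assignment in a child
shell. In this sketch, sh -c stands in for the script, and the word
highest after the quoted command only supplies the name used in the
message; the exact wording of the message varies from shell to shell,
so that part is an assumption.

```shell
# Run the assignment in a child shell with no arguments, so that $1 is unset.
# Standard error is captured so that the message can be inspected.
message=$(sh -c 'filename=${1:?"filename missing."}; echo "not reached"' highest 2>&1)
status=$?

echo "exit status: $status"    # non-zero: the child shell aborted
echo "message: $message"       # ends in "filename missing."
```

The echo after the assignment never runs, which is the "exit without
running the remaining code" behavior described above.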

With a somewhat "kludgy" modification, we can
get a slightly better error message. Consider this code:

filename=$1
filename=${filename:?"missing."}

This results in the message:

highest: filename: missing.

(Make sure you understand why.) Of course, there are ways of printing
whatever message is desired; we'll find out how in Chapter 5.

Before we move on, we'll look more closely at the two remaining
operators in Table 4.1 and see how we can incorporate them into
our task solution.
The := operator does roughly the same thing as :-, except that it has
the "side effect" of setting the value of the variable to the given
word if the variable doesn't exist or is null.

Therefore we would like to use := in our script in place of :-,
but we can't; we'd be trying to set the
value of a positional parameter, which is not allowed. But
if we replaced:

howmany=${2:-10}

with just:

howmany=$2

and moved the substitution down to the actual command line (as we
did at the start), then we could use the := operator:

sort -nr $filename | head -${howmany:=10}

Using := has the added benefit of setting the value of howmany
to 10 in case we need it afterwards in later versions of the script.
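The side effect is easy to see in a small sketch. The function
show_count is hypothetical; it only mimics the two relevant lines of the
script so that the value of howmany can be checked afterwards.

```shell
# show_count: hypothetical stand-in for the script's argument handling.
show_count() {
    howmany=$2                           # null if no second argument is given
    echo "printing ${howmany:=10} lines" # substitutes 10 and assigns it
    echo "howmany is now $howmany"       # the assignment persists
}

result_default=$(show_count albums.txt)
result_given=$(show_count albums.txt 22)

echo "$result_default"
echo "$result_given"
```

With :- in place of :=, the first line of output would be the same, but
howmany would still be null on the second.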

The final substitution operator is :+. Here is how we can use it
in our example: Let's say we want to give the user the option of
adding a header line to the script's output. If he or she types
the option -h, then the output will be preceded by the line:

ALBUMS ARTIST

Assume further that this option ends up in the variable header,
i.e., $header is -h if the option is set or null if not.
(Later we will see how to do this without disturbing the other
positional parameters.)

The expression:

${header:+"ALBUMS ARTIST\n"}

yields null if the variable header is null,
or ALBUMS ARTIST\n if it is non-null.
This means that we can put the line:

print -n ${header:+"ALBUMS ARTIST\n"}

right before the command line that does the actual work.
The -n option to print causes it not
to print a LINEFEED after printing its
arguments. Therefore this print statement will print
nothing (not even a blank line) if header is null;
otherwise it will print the header line
and a LINEFEED (\n).
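The same idea can be tried in any POSIX shell with printf, which we
substitute here for the Korn shell's print -n; that swap, and the
function name print_header, are our assumptions for the sake of a
portable sketch.

```shell
# print_header: hypothetical helper; its argument plays the role of $header.
print_header() {
    header=$1
    # %b makes printf interpret the \n escape, much as print does.
    printf '%b' "${header:+ALBUMS ARTIST\n}"
}

with_header=$(print_header -h; echo END)
without_header=$(print_header ""; echo END)

echo "$with_header"
echo "$without_header"
```

The END marker is there only to show that nothing at all, not even a
blank line, is printed when header is null.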

We'll continue refining our solution to Task 4.1 later in this chapter.
The next type of string operator is used to match portions of a
variable's string value against patterns.
Patterns, as we saw in Chapter 1, are strings that can contain
wildcard characters (*, ?, and [] for character sets and ranges).

Wildcards have been standard features of all UNIX shells going
back (at least) to the Version 7 Bourne shell. But the Korn shell
is the first shell to add to their capabilities.
It adds a set of operators, called regular expression
(or regexp for short) operators,
that give it much of the string-matching power of advanced UNIX utilities
like awk(1), egrep(1) (extended grep(1)) and the emacs editor,
albeit with a different syntax. These capabilities go beyond those
that you may be used to in other UNIX utilities like grep,
sed(1) and vi(1).

Advanced UNIX users will find the Korn shell's regular expression
capabilities occasionally useful for script writing, although they
border on overkill. (Part of the problem is the inevitable
syntactic clash with the shell's myriad other special characters.)
Therefore we won't go into great detail about regular expressions here.
For more comprehensive information, the "last word"
on practical regular expressions in UNIX is sed & awk,
an O'Reilly Nutshell Handbook by Dale Dougherty.
If you are already comfortable with awk or egrep, you
may want to skip the following introductory section and go to
"Korn Shell Versus awk/egrep Regular Expressions" below,
where we explain the shell's regular expression mechanism by
comparing it with the syntax used in those two utilities.
Otherwise, read on.

Think of regular expressions as strings that match patterns
more powerfully than the standard shell wildcard schema.
Regular expressions began as an idea in theoretical computer
science, but they have found their way into many nooks and crannies of
everyday, practical computing. The syntax used to represent them
may vary, but the concepts are very much the same.

A shell regular expression can contain regular characters, standard
wildcard characters, and additional
operators that are more powerful than wildcards. Each such operator
has the form x(exp), where x is the particular
operator and exp is any regular expression (often simply
a regular string). The operator determines how many occurrences
of exp a string that matches the pattern can contain.
See Table 4.2 and Table 4.3.

Regular expressions are extremely useful when dealing with arbitrary
text, as you already know if you have used grep or the
regular-expression capabilities of any UNIX editor. They aren't
nearly as useful for matching filenames and other simple
types of information with which shell users typically work.
Furthermore, most things you can do with the shell's regular
expression operators can also be done (though possibly with more
keystrokes and less efficiency) by piping the output of a shell
command through grep or egrep.

Nevertheless, here are a few examples of how shell regular
expressions can solve filename-listing problems. Some of
these will come in handy in later chapters as pieces of solutions
to larger tasks.

In a directory of C source code, list all files that are not
necessary. Assume that "necessary" files end in .c or .h, or
are named Makefile or README.

Filenames in the VAX/VMS operating system
end in a semicolon followed by a version
number, e.g., fred.bob;23. List all VAX/VMS-style
filenames in the current directory.

Here are the solutions:

In the first of these, we are looking for files that end in .el
with an optional c. The expression that matches this is *.el?(c).

The second example depends on the four standard subexpressions
*.c, *.h, Makefile, and README.
The entire expression is !(*.c|*.h|Makefile|README), which matches
anything that does not match any of the four possibilities.

The solution to the third example starts with *\;: the shell
wildcard * followed by a backslash-escaped semicolon.
Then, we could use the regular expression +([0-9]),
which matches one or more
characters in the range [0-9], i.e., one or more digits.
This is almost correct (and probably close enough), but it doesn't
take into account that the first digit cannot be 0.
Therefore the correct expression is *\;[1-9]*([0-9]), which matches
anything that ends with a semicolon, a digit from 1 to 9, and
zero or more digits from 0 to 9.
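As noted earlier, much the same filtering can be done by piping a list
of names through grep or egrep. The sketch below does this for the
VAX/VMS example with grep -E (the current spelling of egrep); the list
of names is hypothetical and stands in for the output of ls.

```shell
# Hypothetical file names, one per line, standing in for the output of ls.
names='fred.bob;23
notes.txt
data.dat;07
main.c;1'

# A semicolon, a digit from 1 to 9, then zero or more digits, at the end
# of the name. The $ anchor does the job of matching "ends with".
vms_names=$(printf '%s\n' "$names" | grep -E ';[1-9][0-9]*$')

echo "$vms_names"
```

Here [0-9]* in the egrep syntax plays the part of *([0-9]) in the shell
expression; data.dat;07 is rejected because its version number starts
with 0.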

Regular expression operators are an interesting addition to the Korn
shell's features, but you can get along well without them, even
if you intend to do a substantial amount of shell programming.

In our opinion, the shell's authors missed an opportunity to build
into the wildcard mechanism the ability to match files by type
(regular, directory, executable, etc., as in some of the conditional
tests we will see in Chapter 5) as well as by name component.
We feel that shell programmers would have found this more useful than
arcane regular expression operators.

The following section compares Korn shell regular expressions to
analogous features in awk and egrep. If you aren't familiar
with these, skip to the section entitled "Pattern-matching Operators."

These equivalents are close but not quite exact.
Actually, an exp within any of the Korn shell operators can be a series of
exp1|exp2|... alternates. But because the shell would interpret
an expression like dave|fred|bob
as a pipeline of commands, you must use @(dave|fred|bob)
for alternates by themselves.

It is worth re-emphasizing that shell regular expressions can still
contain standard shell wildcards.
Thus, the shell wildcard ? (match any single character) is
equivalent to . in egrep or awk, and the shell's character set
operator [...] is the same as in those utilities.[9]
For example, the expression +([0-9]) matches a number, i.e.,
one or more digits. The shell wildcard character * is equivalent
to the shell regular expression *(?).

[9] And, for that matter, the same as in
grep, sed, ed, vi, etc.

A few egrep and awk regexp operators do not have equivalents
in the Korn shell. These include:

The beginning- and end-of-line operators ^ and $.

The beginning- and end-of-word operators \< and \>.

Repeat factors like \{N\} and \{M,N\}.

The first two pairs are hardly necessary, since the Korn shell doesn't
normally operate on text files and does parse strings into words itself.

${variable#pattern}

If the pattern matches the beginning of the variable's value,
delete the shortest part that matches and return the rest.

${variable##pattern}

If the pattern matches the beginning of the variable's value,
delete the longest part that matches and return the rest.

${variable%pattern}

If the pattern matches the end of the variable's value,
delete the shortest part that matches and return the rest.

${variable%%pattern}

If the pattern matches the end of the variable's value,
delete the longest part that matches and return the rest.

These can be hard to remember, so here's a handy mnemonic
device: # matches the front because number signs precede
numbers; % matches the rear because percent signs follow
numbers.

The classic use for pattern-matching operators is in stripping
off components of pathnames, such as directory prefixes and filename suffixes.
With that in mind,
here is an example that shows how all of the operators work.
Assume that the variable path has the value
/home/billr/mem/long.file.name; then:
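The following sketch applies each operator to that value; the particular
patterns (/*/ for the front and .* for the rear) are our choice for
illustration.

```shell
path=/home/billr/mem/long.file.name

front_longest=${path##/*/}     # long.file.name
front_shortest=${path#/*/}     # billr/mem/long.file.name
rear_shortest=${path%.*}       # /home/billr/mem/long.file
rear_longest=${path%%.*}       # /home/billr/mem/long

echo "$front_longest"
echo "$front_shortest"
echo "$path"
echo "$rear_shortest"
echo "$rear_longest"
```

The value of path itself is never changed; each expression only returns
a shortened copy.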

Task 4.2

Think of a C compiler as a pipeline of data processing
components. C source code is input to the beginning of the pipeline,
and object code comes out of the end; there are several steps in between.
The shell script's task, among many other things, is to control the
flow of data through the components and to designate output files.

You need to write the part of the script that takes the name of the
input C source file and creates from it the name of the output
object code file. That is,
you must take a filename ending in .c
and create a filename that is similar except that it ends in .o.

The task at hand is to strip the .c off the filename and
append .o. A single shell statement will do it:

objname=${filename%.c}.o

This tells the shell to look at the end of filename for .c.
If there is a match, return $filename with the match deleted.
So if filename had the value fred.c, the expression ${filename%.c}
would return fred. The .o is appended to make the desired fred.o,
which is stored in the variable objname.

If filename had an inappropriate value (without .c) such as fred.a,
the above expression would evaluate to fred.a.o: since there was
no match, nothing is deleted from the value of filename,
and .o is appended anyway.
And if filename contained more than one dot (e.g., if it were
the y.tab.c that is so infamous among compiler writers), the
expression would still produce the desired y.tab.o.
Because the pattern .c contains no wildcards, %% would give the same
result here. The two operators differ only when the pattern can match
strings of different lengths: with the pattern .* in place of .c,
% would delete the shortest match (.c) and still yield y.tab.o,
whereas %% would delete the longest match (.tab.c) and
evaluate to y.o rather than y.tab.o.
So the single % is the safer choice in this case.
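A quick check of how % and %% behave on these filenames, as a sketch.
With the literal pattern .c the two operators give the same result; they
part ways only when the pattern contains a wildcard, such as the .* used
below for comparison (our choice, not part of the task).

```shell
filename=y.tab.c

literal_short=${filename%.c}.o     # y.tab.o
literal_long=${filename%%.c}.o     # y.tab.o (no wildcard, so same match)
wild_short=${filename%.*}.o        # y.tab.o (shortest match is .c)
wild_long=${filename%%.*}.o        # y.o     (longest match is .tab.c)

filename=fred.a
no_match=${filename%.c}.o          # fred.a.o (nothing deleted)

echo "$literal_short $literal_long $wild_short $wild_long $no_match"
```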

A longest-match deletion would be preferable, however, in the following task.

Task 4.3

You are implementing a filter that prepares a text file for
printer output. You want to put the file's name (without
any directory prefix) on the "banner" page.
Assume that, in your script, you have the pathname of the file
to be printed stored in the variable pathname.

Clearly the objective is to remove the directory prefix from the pathname.
The following line will do it:

bannername=${pathname##*/}

This solution is similar to the first line in the examples shown before.
If pathname were just a filename, the pattern */ (anything
followed by a slash) would not match and the value of the expression
would be pathname untouched. If pathname were something like
fred/bob, the prefix fred/ would match the pattern and be deleted,
leaving just bob as the expression's value. The same thing would
happen if pathname were something like /dave/pete/fred/bob:
since the ## deletes the longest match, it deletes the
entire /dave/pete/fred/.

If we used #*/ instead of ##*/, the expression
would have the incorrect value dave/pete/fred/bob, because the
shortest instance of "anything followed by a slash" at the beginning
of the string is just a slash (/).

The construct ${variable##*/} is actually equivalent
to the UNIX utility basename(1). basename takes a pathname
as argument and returns the filename only; it is meant to be used
with the shell's command substitution mechanism (see below).
basename is less efficient than ${variable##*/}
because it runs in its own separate process rather than
within the shell.
Another utility, dirname(1), does essentially
the opposite of basename: it returns the directory prefix only.
It is equivalent to the Korn shell expression ${variable%/*}
and is less efficient for the same reason.
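The correspondence can be checked side by side, as in this sketch. It
uses the $(...) form of command substitution to capture each utility's
output; the sample pathname is the one from the discussion above.

```shell
pathname=/dave/pete/fred/bob

name_builtin=${pathname##*/}           # done within the shell
name_utility=$(basename "$pathname")   # runs a separate process
dir_builtin=${pathname%/*}
dir_utility=$(dirname "$pathname")

echo "$name_builtin $name_utility"     # bob bob
echo "$dir_builtin $dir_utility"       # /dave/pete/fred /dave/pete/fred
```

The match holds for ordinary pathnames like this one. Edge cases differ:
for a bare filename such as bob, ${pathname%/*} returns bob unchanged,
whereas dirname prints a dot for the current directory.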

There are two remaining operators on variables.
One is ${#varname}, which
returns the length of the value of the variable as a character
string. (In Chapter 6 we will see how to treat this
and similar values as actual numbers so they can be used
in arithmetic expressions.) For example,
if filename has the value fred.c, then
${#filename} would have the value 6.
The other operator (${#array[*]}) has to do with array variables,
which are also discussed in Chapter 6.
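As a brief sketch of the length operator, using the filename from the
example above (the variable name length is our own):

```shell
filename=fred.c
length=${#filename}    # number of characters in the value

echo "$filename has $length characters"
```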