In this chapter, we solve several relatively simple text processing jobs. What’s interesting about all the examples here is that they are scripts built from simple pipelines: chains of one command hooked into another. Yet each one accomplishes a significant task.

When you tackle a text processing problem in Unix, it is important to keep the Unix tool philosophy in mind: ask yourself how the problem can be broken down into simpler jobs, for each of which there is already an existing tool, or for which you can readily supply one with a few lines of a shell program or with a scripting language.

5.1 Extracting Data from Structured Text Files

Most administrative files in Unix are simple flat text files that you can edit, print, and read without any special file-specific tools. Many of them reside in the standard directory, /etc. Common examples are the password and group files (passwd and group), the filesystem mount table (fstab or vfstab), the hosts file (hosts), the default shell startup file (profile), and the system startup and shutdown shell scripts, stored in the subdirectory trees rc0.d, rc1.d, and so on, through rc6.d. (There may be other directories as well.)

File formats are traditionally documented in Section 5 of the Unix manual, so the command man 5 passwd provides information about the structure of /etc/passwd.*

Despite its name, the password file must always be publicly readable. Perhaps it should have been called the user file because it contains basic information about every user account on the system, packed together in one line per account, with fields separated by colons. We described the file’s format in “Text File Conventions” [3.3.1]. Here are some typical entries:

The username

The encrypted password, or an indicator that the password is stored in a separate file

The numeric user ID

The numeric group ID

The user’s personal name, and possibly other relevant data (office number, telephone number, and so on)

The home directory

The login shell

All but one of these fields have significance to various Unix programs. The one that does not is the fifth, which conventionally holds user information that is relevant only to local humans. Historically, it was called the gecos field, because it was added in the 1970s at Bell Labs when Unix systems needed to communicate with other computers running the General Electric Comprehensive Operating System, and some extra information about the Unix user was required for that system. Today, most sites use it just to record the personal name, so we simply call it the name field.

For the purposes of this example, we assume that the local site records extra information in the name field: a building and office number identifier (OSD211 in the first sample entry), and a telephone number (555-0123), separated from the personal name by slashes.

One obvious useful thing that we can do with such a file is to write some software to create an office directory. That way, only a single file, /etc/passwd, needs to be kept up-to-date, and derived files can be created when the master file is changed, or more sensibly, by a cron job that runs at suitable intervals. (We will discuss cron in “crontab: Rerun at Specified Times” [13.6.4].)

For our first attempt, we make the office directory a simple text file, with entries like this:

Franklin, Ben        •OSD212•555-0022
Gale, Dorothy        •KNS321•555-0044
…

where • represents an ASCII tab character. We put the personal name in conventional directory order (family name first), padding the name field with spaces to a convenient fixed length. We prefix the office number and telephone with tab characters to preserve some useful structure that other tools can exploit.

Scripting languages, such as awk, were designed to make such tasks easy because they provide automated input processing and splitting of input records into fields, so we could write the conversion job entirely in such a language. However, we want to show how to achieve the same thing with other Unix tools.

For each password file line, we need to extract field five, split it into three subfields, rearrange the names in the first subfield, and then write an office directory line to a sorting process.

awk and cut are convenient tools for field extraction:

… | awk -F: '{ print $5 }' | …

… | cut -d: -f5 | …

There is a slight complication in that we have two field-processing tasks that we want to keep separate for simplicity, but we need to combine their output to make a directory entry. The join command is just what we need: it expects two input files, each ordered by a common unique key value, and joins lines sharing a common key into a single output line, with user control over which fields are output.

Since our directory entries contain three fields, to use join we need to create three intermediate files containing the colon-separated pairs key:person, key:office, and key:telephone, one pair per line. These can all be temporary files, since they are derived automatically from the password file.

What key do we use? It just needs to be unique, so it could be the record number in the original password file, but in this case it can also be the username, since we know that usernames are unique in the password file and they make more sense to humans than numbers do. Later, if we decide to augment our directory with additional information, such as job title, we can create another nontemporary file with the pair key:jobtitle and add it to the processing stages.

Instead of hardcoding input and output filenames into our program, it is more flexible to write the program as a filter so that it reads standard input and writes standard output. For commands that are used infrequently, it is advisable to give them descriptive, rather than short and cryptic, names, so we start our shell program like this:
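One possibility for the opening lines, with an invented descriptive name recorded in the comment header:

```shell
#! /bin/sh
# Filter an input stream formatted like /etc/passwd,
# and output an office directory derived from that data.
#
# Usage:
#       passwd-to-directory < /etc/passwd > office-directory-file
```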

Since the password file is publicly readable, any data derived from it is public as well, so there is no real need to restrict access to our program’s intermediate files. However, because all of us at times have to deal with sensitive data, it is good to develop the programming habit of allowing file access only to those users or processes that need it. We therefore reset the umask (see “Default permissions” in Appendix B) as the first action in our program:

umask 077                # Restrict temporary file access to just us

For accountability and debugging, it is helpful to have some commonality in temporary filenames, and to avoid cluttering the current directory with them: we name them with the prefix /tmp/pd.. To guard against name collisions if multiple instances of our program are running at the same time, we also need the names to be unique: the process number, available in the shell variable $$, provides a distinguishing suffix. (This use of $$ is described in more detail in Chapter 10.) We therefore define these shell variables to represent our temporary files:
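Putting those conventions together, the setup might look like this sketch (the variable names match the $PERSON, $OFFICE, $TELEPHONE, and $USER files used in the rest of this section):

```shell
PERSON=/tmp/pd.key.person.$$          # key:person pairs
OFFICE=/tmp/pd.key.office.$$          # key:office pairs
TELEPHONE=/tmp/pd.key.telephone.$$    # key:telephone pairs
USER=/tmp/pd.key.user.$$              # key:name-field pairs extracted from input

# On a signal, exit with failure; on any exit, clean up the temporary files
trap "exit 1"                                 HUP INT PIPE QUIT TERM
trap "rm -f $PERSON $OFFICE $TELEPHONE $USER" EXIT
```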

During development, we can just comment out the second trap, preserving temporary files for subsequent examination. (The trap command is described in “Trapping Process Signals” [13.3.2]. For now, it’s enough to understand that when the script exits, the trap command arranges to automatically run rm with the given arguments.)

We need fields one and five repeatedly, and once we have them, we don’t require the input stream from standard input again, so we begin by extracting them into a temporary file:

awk -F: '{ print $1 ":" $5 }' > $USER    # This reads standard input

We make the key:person pair file first, with a two-step sed program followed by a simple line sort; the sort command is discussed in detail in “Sorting Text” [4.1].
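That program can be sketched as follows; to keep the fragment runnable on its own, a sample line stands in for the $USER file built earlier (the two sed edits are explained next):

```shell
USER=/tmp/pd.key.user.$$
PERSON=/tmp/pd.key.person.$$
printf 'jones:Adrian W. Jones/OSD211/555-0123\n' > $USER    # stands in for the real file

# First edit: strip from the first slash to end of line.
# Second edit: move the last (family) name in front of the rest.
sed -e 's=/.*==' \
    -e 's=^\([^:]*\):\(.*\) \([^ ]*\)=\1:\3, \2=' < $USER | sort > $PERSON

cat $PERSON        # prints: jones:Jones, Adrian W.
```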

The script uses = as the separator character for sed’s s command, since both slashes and colons appear in the data. The first edit strips everything from the first slash to the end of the line, reducing a line like this:

jones:Adrian W. Jones/OSD211/555-0123                Input line

to this:

jones:Adrian W. Jones                                Result of first edit

The second edit is more complex, matching three subpatterns in the record. The first part, ^\([^:]*\), matches the username field (e.g., jones). The second part, \(.*\)❒, matches text up to a space (e.g., Adrian❒W.❒; the ❒ stands for a space character). The last part, \([^❒]*\), matches the remaining nonspace text in the record (e.g., Jones). The replacement text, \1:\3,❒\2, reorders the matches, producing something like Jones,❒Adrian W. The result of this single sed command is the desired reordering:

jones:Jones, Adrian W.                               Printed result of second edit

Next, we make the key:office pair file:

sed -e 's=^\([^:]*\):[^/]*/\([^/]*\)/.*$=\1:\2=' < $USER | sort > $OFFICE

The result is a list of users and offices:

jones:OSD211

The key:telephone pair file creation is similar: we just need to adjust the match pattern:
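One way to write it, again with a sample line standing in for $USER; the final \([^/]*\)$ now captures the text after the second slash:

```shell
USER=/tmp/pd.key.user.$$
TELEPHONE=/tmp/pd.key.telephone.$$
printf 'jones:Adrian W. Jones/OSD211/555-0123\n' > $USER    # stands in for the real file

# Keep the username key and the text after the second slash
sed -e 's=^\([^:]*\):[^/]*/[^/]*/\([^/]*\)$=\1:\2=' < $USER | sort > $TELEPHONE

cat $TELEPHONE     # prints: jones:555-0123
```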

At this stage, we have three separate files, each of which is sorted. Each file consists of the key (the username), a colon, and the particular data (personal name, office, telephone number). The $PERSON file’s contents look like this:

ben:Franklin, Ben
betsy:Ross, Betsy
…

The $OFFICE file has username and office data:

ben:OSD212
betsy:BMD17
…

The $TELEPHONE file records usernames and telephone numbers:

ben:555-0022
betsy:555-0033
…

By default, join outputs the common key, then the remaining fields of the line from the first file, followed by the remaining fields of the line from the second file. The common key defaults to the first field, but that can be changed by a command-line option: we don’t need that feature here. Normally, whitespace separates fields for join, but we can change the separator with its -t option: we use it as -t:.

The join operations are done with a five-stage pipeline, as follows:

Combine the personal information and the office location:

join -t: $PERSON $OFFICE | …

The results of this operation, which become the input to the next stage, look like this:

ben:Franklin, Ben:OSD212
betsy:Ross, Betsy:BMD17
…

Add the telephone number:

… | join -t: - $TELEPHONE | …

The results of this operation, which become the input to the next stage, look like this:

ben:Franklin, Ben:OSD212:555-0022
betsy:Ross, Betsy:BMD17:555-0033
…

Remove the key (which is the first field), since it’s no longer needed. This is most easily done with cut and a range that says “use fields two through the end,” like so:

… | cut -d: -f 2- | …

The results of this operation, which become the input to the next stage, look like this:

Franklin, Ben:OSD212:555-0022
Ross, Betsy:BMD17:555-0033
…

Re-sort the data. The data was previously sorted by login name, but now things need to be sorted by personal last name. This is done with sort:

… | sort -t: -k1,1 -k2,2 -k3,3 | …

This command uses a colon to separate fields, sorting on fields 1, 2, and 3, in order. The results of this operation, which become the input to the next stage, look like this:

Franklin, Ben:OSD212:555-0022
Gale, Dorothy:KNS321:555-0044
…

Finally, reformat the output, using awk’s printf statement to separate each field with tab characters. The command to do this is:

… | awk -F: '{ printf("%-39s\t%s\t%s\n", $1, $2, $3) }'

For flexibility and ease of maintenance, formatting should always be left until the end. Up to that point, everything is just text strings of arbitrary length.

The awk printf statement used here is similar enough to the shell printf command that its meaning should be clear: print the first colon-separated field left-adjusted in a 39-character field, followed by a tab, the second field, another tab, and the third field. Here are the full results:

Franklin, Ben                          •OSD212•555-0022
Gale, Dorothy                          •KNS321•555-0044
…

The real power of shell scripting shows itself when we want to modify the script to do a slightly different job, such as insertion of the job title from a separately maintained key:jobtitle file. All that we need to do is modify the final pipeline to look something like this:
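A sketch of such a modified pipeline follows; hypothetical sample pair files (including the invented $JOBTITLE file and job titles) make the fragment run standalone:

```shell
PERSON=/tmp/pd.key.person.$$
OFFICE=/tmp/pd.key.office.$$
TELEPHONE=/tmp/pd.key.telephone.$$
JOBTITLE=/tmp/pd.key.jobtitle.$$

# Hypothetical sample pair files, each sorted by the username key
printf 'ben:Franklin, Ben\nbetsy:Ross, Betsy\n' > $PERSON
printf 'ben:OSD212\nbetsy:BMD17\n'              > $OFFICE
printf 'ben:555-0022\nbetsy:555-0033\n'         > $TELEPHONE
printf 'ben:Printer\nbetsy:Seamstress\n'        > $JOBTITLE

# Same pipeline as before, with one more join stage and a fourth output column
join -t: $PERSON $OFFICE |
    join -t: - $TELEPHONE |
    join -t: - $JOBTITLE |
    cut -d: -f 2- |
    sort -t: -k1,1 -k2,2 -k3,3 -k4,4 |
    awk -F: '{ printf("%-39s\t%s\t%s\t%s\n", $1, $2, $3, $4) }' > /tmp/pd.dir.$$

cat /tmp/pd.dir.$$
```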

A critical assumption of our program is that there is a unique key for each data record. With that unique key, separate views of the data can be maintained in files as key:value pairs. Here, the key was a Unix username, but in larger contexts, it could be a book number (ISBN), credit card number, employee number, national retirement system number, part number, student number, and so on. Now you know why we get so many numbers assigned to us! You can also see that those handles need not be numbers: they just need to be unique text strings.

5.2 Structured Data for the Web

The immense popularity of the World Wide Web makes it desirable to be able to present data like the office directory developed in the last section in a form that is a bit fancier than our simple text file.

Web files are mostly written in a markup language called HyperText Markup Language (HTML). This is a family of languages that are specific instances of the Standard Generalized Markup Language (SGML), which has been defined in several ISO standards since 1986. The manuscript for this book was written in DocBook/XML, which is also a specific instance of SGML. You can find a full description of HTML in HTML & XHTML: The Definitive Guide (O’Reilly).*

A Digression on Databases

Most commercial databases today are constructed as relational databases: data is accessible as key:value pairs, and join operations are used to construct multicolumn tables that provide views of selected subsets of the data. Relational databases were first proposed in 1970 by E. F. Codd, who actively promoted them despite initial opposition from the database industry, which held that they could not be implemented efficiently. Fortunately, clever programmers soon figured out how to solve the efficiency problem. Codd’s work is so important that, in 1981, he was given the prestigious ACM Turing Award, the closest thing in computer science to the Nobel Prize.

Today, there are several ISO standards for the Structured Query Language (SQL), making vendor-independent database access possible, and one of the most important SQL operations is join. Hundreds of books have been published about SQL; to learn more, pick a general one like SQL in a Nutshell. Our simple office-directory task thus carries an important lesson about the central concept of modern relational databases, and Unix software tools can be extremely valuable in preparing input for databases and in processing their output.

For the purposes of this section, we need only a tiny subset of HTML, which we present here in a small tutorial. If you are already familiar with HTML, just skim the next page or two.

Here is a minimal standards-conformant HTML file produced by a useful tool written by one of us:*
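A minimal file of that kind looks something like the following sketch (the title, body text, and mail address are placeholders):

```html
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<HTML>
  <HEAD>
    <TITLE>
      HTML Sample File
    </TITLE>
    <LINK REV="made" HREF="mailto:webmaster@example.com">
  </HEAD>
  <BODY>
    This is a sample to show the layout of a minimal HTML file.
  </BODY>
</HTML>
```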

Special processor commands are enclosed in <! and >: here, the DOCTYPE command tells an SGML parser what the document type is and where to find its grammar file.

Markup is supplied by angle-bracketed words, called tags. In HTML, lettercase is not significant in tag names: html-pretty normally uppercases tag names for better visibility.

Markup environments consist of a begin tag, <NAME>, and an end tag, </NAME>, and for many tags, environments can be nested within each other according to rules defined in the HTML grammars.

An HTML document is structured as an HTML object containing one HEAD and one BODY object.

Inside the HEAD, a TITLE object defines the document title that web browsers display in the window titlebar and in bookmark lists. Also inside the HEAD, the LINK object generally carries information about the web-page maintainer.

The visible part of the document that browsers show is the contents of the BODY.

Whitespace is not significant outside of quoted strings, so we can use horizontal and vertical spacing liberally to emphasize the structure, as the HTML prettyprinter does.

Everything else is just printable ASCII text, with three exceptions. Literal angle brackets must be represented by special encodings, called entities, that consist of an ampersand, an identifier, and a semicolon: &lt; and &gt;. Since ampersand starts entities, it has its own literal entity name: &amp;. HTML supports a modest repertoire of entities for accented characters that cover most of the languages of Western Europe so that we can write, for example, caf&eacute; du bon go&ucirc;t to get café du bon goût.

To convert our office directory to proper HTML, we need only one more bit of information: how to format a table, since that is what our directory really is and we don’t want to force the use of typewriter fonts to get everything to line up in the browser display.

In HTML 3.0 and later, a table consists of a TABLE environment, inside of which are rows, each of them a table row (TR) environment. Inside each row are cells, called table data, each a TD environment. Notice that columns of data receive no special markup: a data column is simply the set of cells taken from the same row position in all of the rows of the table. Happily, we don’t need to declare the number of rows and columns in advance. The job of the browser or formatter is to collect all of the cells, determine the widest cell in each column, and then format the table with columns just wide enough to hold those widest cells.

For our office directory example, we need just three columns, so our sample entry could be marked up like this:
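Using the Franklin entry from our directory, one row of such a table can be written like this (the indentation is free-form):

```html
<TABLE>
  <TR>
    <TD>Franklin, Ben</TD>
    <TD>OSD212</TD>
    <TD>555-0022</TD>
  </TR>
</TABLE>
```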

Because we chose to preserve special field separators in the text version of the office directory, we have sufficient information to identify the cells in each row. Also, because whitespace is mostly not significant in HTML files (except to humans), we need not be particularly careful about getting tags nicely lined up: if that is needed later, html-pretty can do it perfectly. Our conversion filter then has three steps:

Output the leading boilerplate down to the beginning of the document body.

Wrap each directory row in table markup.

Output the trailing boilerplate.

We have to make one small change from our minimal example: the DOCTYPE command has to be updated to a later grammar level so that it looks like this:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN//3.0">

You don’t have to memorize this: html-pretty has options to produce output in any of the standard HTML grammar levels, so you can just copy a suitable DOCTYPE command from its output.

Clearly, most of the work is just writing boilerplate, but that is simple since we can just copy text from the minimal HTML example. The only programmatic step required is the middle one, which we could do with only a couple of lines in awk. However, we can achieve it with even less work using a sed stream-editor substitution with two edit commands: one to substitute the embedded tab delimiters with </TD><TD>, and a following one to wrap the entire line in <TR><TD>…</TD></TR>. We temporarily assume that no accented characters are required in the directory, but we can easily allow for angle brackets and ampersands in the input stream by adding three initial sed steps. We collect the complete program in Example 5-2.
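As a sketch of that program, here is the filter written as a shell function; we assume GNU sed, which understands \t in a pattern (a strictly POSIX sed needs a literal tab character there instead):

```shell
tsv_to_html() {
  cat << EOFILE
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN//3.0">
<HTML>
  <HEAD>
    <TITLE>
      Office directory
    </TITLE>
  </HEAD>
  <BODY>
    <TABLE>
EOFILE

  # Protect literal &, <, and > as entities first, then turn tabs into
  # cell boundaries and wrap each line as a complete table row
  sed -e 's=&=\&amp;=g' \
      -e 's=<=\&lt;=g' \
      -e 's=>=\&gt;=g' \
      -e 's=\t=</TD><TD>=g' \
      -e 's=^.*$=      <TR><TD>&</TD></TR>='

  cat << EOFILE
    </TABLE>
  </BODY>
</HTML>
EOFILE
}
```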

The << notation is called a here document. It is explained in more detail in “Additional Redirection Operators” [7.3.1]. Briefly, the shell reads all lines up to the delimiter following the << (EOFILE in this case), does variable and command substitution on the contained lines, and feeds the results as standard input to the command.
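A tiny illustration of the mechanism (the names are invented):

```shell
planet=World
greeting=$(cat << EOFILE
Hello, $planet!
EOFILE
)
echo "$greeting"        # prints: Hello, World!
```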

There is an important point about the script in Example 5-2: it is independent of the number of columns in the table! This means that it can be used to convert any tab-separated value file to HTML. Spreadsheet programs can usually save data in such a format, so our simple tool can produce correct HTML from spreadsheet data.

We were careful in tsv-to-html to maintain the spacing structure of the original office directory, because that makes it easy to apply further filters downstream. Indeed, html-pretty was written precisely for that reason: standardization of HTML markup layout radically simplifies other HTML tools.

How would we handle conversion of accented characters to HTML entities? We could augment the sed command with extra edit steps like -e 's=é=\&eacute;=g' (the backslash protects the ampersand, which is otherwise special in a sed replacement), but there are 100 or so entities to cater for, and we are likely to need similar substitutions as we convert other kinds of text files to HTML.

It therefore makes sense to delegate that task to a separate program that we can reuse, either as a pipeline stage following the sed command in Example 5-2, or as a filter applied later. (This is the “detour to build specialized tools” principle in action.) Such a program is just a tedious tabulation of substitution commands, and we need one for each of the local text encodings, such as the various ISO 8859-n code pages mentioned in “How Are Files Named?” in Appendix B. We don’t show such a filter completely here, but a fragment of one in Example 5-3 gives the general flavor. For readers who need it, we include the complete program for handling the common case of Western European characters in the ISO 8859-1 encoding with this book’s sample programs. HTML’s entity repertoire isn’t sufficient for other accented characters, but since the World Wide Web is moving in the direction of Unicode and XML in place of ASCII and HTML, this problem is being solved in a different way: by getting rid of character set limitations.
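In that spirit, a tiny fragment of such a filter might look like this; only a handful of the 100-odd Latin-1 entities are shown, and we assume the script and its input share one encoding (e.g., both UTF-8):

```shell
# Fragment of an ISO 8859-1 (Latin-1) to HTML entity filter; the real
# filter tabulates one edit command per accented character
latin1_to_html() {
  sed \
      -e 's=à=\&agrave;=g' \
      -e 's=ç=\&ccedil;=g' \
      -e 's=è=\&egrave;=g' \
      -e 's=é=\&eacute;=g' \
      -e 's=û=\&ucirc;=g' \
      -e 's=ü=\&uuml;=g'
}

printf 'café du bon goût\n' | latin1_to_html    # prints: caf&eacute; du bon go&ucirc;t
```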