11.12. One-Pointed Mind

As a student of Zen, I like the idea of a one-pointed mind:
Do one thing at a time, and do it well.

This, indeed, is very much how UNIX® works as well. While
a typical Windows® application is attempting to do everything
imaginable (and is, therefore, riddled with bugs), a
typical UNIX® program does only one thing, and it does it
well.

The typical UNIX® user then essentially assembles his own
applications by writing a shell script which combines the
various existing programs by piping the output of one
program to the input of another.
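
For example, counting the unique words in a text file takes no
new program at all, just a (hypothetical) pipeline of existing
ones:

% tr -cs '[:alpha:]' '\n' < file.txt | sort -u | wc -l

Here tr splits the text into one word per line, sort -u throws
away the duplicates, and wc -l counts what is left.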

When writing your own UNIX® software, it is generally a
good idea to see which parts of the problem can be handled
by existing programs, and to write your own code only for
the part that has no existing solution.

11.12.1. CSV

I will illustrate this principle with a specific real-life
example I was faced with recently:

I needed to extract the 11th field of each record from a
database I downloaded from a web site. The database was a
CSV file, i.e., a list of
comma-separated values. That is quite
a standard format for sharing data among people who may be
using different database software.

The first line of the file contains the list of various fields
separated by commas. The rest of the file contains the data
listed line by line, with values separated by commas.
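
For example, a tiny (hypothetical) three-field database of that
kind might look like this:

NAME,EMAIL,PHONE
Smith,sue@example.com,555-9999
"Doe, John",john@example.com,555-1234

Note the quoted comma in the last record: it is part of the
data, not a field separator.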

I tried awk, using the comma as a separator.
But because several lines contained a quoted comma,
awk was extracting the wrong field
from those lines.
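
With the hypothetical file above, that naive approach would be:

% awk -F, '{print $3}' data.csv

which prints 555-9999 for the Smith record but
john@example.com for the Doe record, because the quoted comma
in "Doe, John" shifts all the remaining fields by one.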

Therefore, I needed to write my own software to extract the 11th
field from the CSV file. However, going with the UNIX®
spirit, I only needed to write a simple filter that would do the
following:

Remove the first line from the file;

Change all unquoted commas to a different character;

Remove all quotation marks.
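
With a semicolon chosen as the replacement character, the last
two steps turn the hypothetical Doe record from above,

"Doe, John",john@example.com,555-1234

into

Doe, John;john@example.com;555-1234

which awk can then split safely on the semicolon.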

Strictly speaking, I could use sed to remove
the first line from the file, but doing so in my own program
was very easy, so I decided to do it and reduce the size of
the pipeline.

At any rate, writing a program like this took me about
20 minutes. Writing a program that extracts the 11th field
from the CSV file would take a lot longer,
and I could not reuse it to extract some other field from some
other database.

This time I decided to let it do a little more work than
a typical tutorial program would:

It parses its command line for options;

It displays proper usage if it finds wrong arguments;

It produces meaningful error messages.

Here is its usage message:

Usage: csv [-t<delim>] [-c<comma>] [-p] [-o <outfile>] [-i <infile>]

All parameters are optional, and can appear in any order.

The -t parameter declares what to replace
the commas with. The tab is the default here.
For example, -t; will replace all unquoted
commas with semicolons.

I did not need the -c option, but it may
come in handy in the future. It lets me declare that I want a
character other than a comma replaced with something else.
For example, -c@ will replace all at signs
(useful if you want to split a list of email addresses
to their user names and domains).
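
For example (a hypothetical session, assuming the finished
csv is installed in the path):

% echo 'john@example.com' | csv -c@ -p
john	example.com

The -p is needed here because the input is only one line long,
and by default the first line would be thrown away; the user
name and the domain come out separated by the default tab.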

The -p option preserves the first line, i.e.,
it does not delete it. By default, we delete the first
line because in a CSV file it contains the field
names rather than data.

The -i and -o
options let me specify the input and the output files. Defaults
are stdin and stdout,
so this is a regular UNIX® filter.

I made sure that both -i filename and
-ifilename are accepted. I also made
sure that only one input and one output file may be
specified.
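
That means all of these (hypothetical) invocations do the same
thing:

% csv < data.csv > data.tab
% csv -i data.csv -o data.tab
% csv -idata.csv -odata.tab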

To get the 11th field of each record, I can now do:

% csv '-t;' data.csv | awk '-F;' '{print $11}'

The code stores the options (except for the file descriptors)
in EDX: the comma in DH, the new
separator in DL, and the flag for
the -p option in the highest bit of
EDX, so a check of its sign gives us a
quick decision on what to do.
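
A minimal sketch of that layout (the label name is my own, not
taken from the actual program):

	mov	dh, ','		; DH = the character to be replaced
	mov	dl, 9		; DL = the new separator, a tab by default
	; ...and if the option parser sees -p:
	or	edx, 80000000h	; set the highest bit of EDX

	; later, when deciding what to do with the first line:
	or	edx, edx	; copies the highest bit into the sign flag
	js	.keepfirst	; -p was given, so preserve the first line
	; otherwise fall through and skip up to the first line feed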

Much of it is taken from hex.asm above. But there
is one important difference: I no longer call write
whenever I am outputting a line feed. Yet the code can still be
used interactively.

I have found a better solution for the interactive problem
since I first started writing this chapter. I wanted to
make sure each line is printed out separately only when needed.
After all, there is no need to flush out every line when used
non-interactively.

The new solution I use now is to call write every
time I find the input buffer empty. That way, when running in
the interactive mode, the program reads one line from the user's
keyboard, processes it, and sees its input buffer is empty. It
flushes its output and reads the next line.
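
Sketched in the style of the getchar routine from the earlier
filters in this chapter (the buffer bookkeeping in EBX and ESI
is assumed from those examples):

getchar:
	or	ebx, ebx	; any bytes left in the input buffer?
	jne	.fetch		; yes, no system call needed
	call	read		; no: refill the buffer; read itself
				; flushes the output first (see below)

.fetch:
	lodsb			; AL = the next byte from the buffer
	dec	ebx		; one byte fewer remains
	ret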

11.12.1.1. The Dark Side of Buffering

This change prevents a mysterious lockup
in a very specific case. I refer to it as the
dark side of buffering, mostly
because it presents a danger that is not
quite obvious.

It is unlikely to happen with a program like the
csv above, so let us consider yet
another filter: In this case we expect our input
to be raw data representing color values, such as
the red, green, and
blue intensities of a pixel. Our
output will be the negative of our input.

Such a filter would be very simple to write.
Most of it would look just like all the other
filters we have written so far, so I am only
going to show you its inner loop:
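
A sketch of that loop, assuming the getchar and putchar
routines used by the other filters in this chapter (where
getchar exits the program at the end of input):

.loop:
	call	getchar		; AL = the next byte of raw data
	not	al		; invert it: this is the negative
	call	putchar		; queue it in the output buffer
	jmp	short .loop	; and go get the next byte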

Because this filter works with raw data,
it is unlikely to be used interactively.

But it could be called by image manipulation software.
And, unless it calls write before each call
to read, chances are it will lock up.

Here is what might happen:

The image editor will load our filter using the
C function popen().

It will read the first row of pixels from
a bitmap or pixmap.

It will write the first row of pixels to
the pipe leading to
the fd.in of our filter.

Our filter will read each pixel
from its input, turn it to a negative,
and write it to its output buffer.

Our filter will call getchar
to fetch the next pixel.

getchar will find an empty
input buffer, so it will call
read.

read will call the
SYS_read system call.

The kernel will suspend
our filter until the image editor
sends more data to the pipe.

The image editor will read from the
other pipe, connected to the
fd.out of our filter so it can set the first row of the
output image before
it sends us the second row of the input.

The kernel suspends
the image editor until our filter
produces some output that it can
pass on.

At this point our filter waits for the image
editor to send it more data to process, while
the image editor is waiting for our filter
to send it the result of the processing
of the first row. But the result sits in
our output buffer.

The filter and the image editor will continue
waiting for each other forever (or, at least,
until they are killed). Our software has just
entered a deadlock.

This problem does not exist if our filter flushes
its output buffer before asking the
kernel for more input data.
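
Concretely, the read routine can start by flushing the output
buffer, roughly like this (a sketch in the style of the other
filters in this chapter, using the sys.read macro from
system.inc; the end-of-file and error checks are omitted):

read:
	call	write		; flush our output before we may block
	push	dword BUFSIZE	; how much room we have...
	push	dword ibuffer	; ...in the input buffer...
	push	dword [fd.in]	; ...reading from the input descriptor
	sys.read		; ask the kernel for more data
	add	esp, byte 12	; drop the three arguments
	mov	ebx, eax	; EBX = bytes now waiting in the buffer
	mov	esi, ibuffer	; ESI = where getchar will fetch them
	ret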