One would hope that a simple task like sorting would be relatively
unambiguous. Unfortunately, it isn't. The behavior of sort can
be very puzzling. I'll try to straighten out some of the
confusion - at the same time, I'll be leaving myself open to abuse by
the real sort experts. I hope you appreciate this! Seriously,
though: if we find any new wrinkles to the story, we'll add them in
the next edition.

The trouble with sort is figuring out where one field ends and
another begins. It's simplest if you can specify an explicit field
delimiter (36.3). This makes it easy to tell where fields end and
begin. But by default, sort uses white space characters (tabs and
spaces) to separate fields, and the rules for interpreting white space
field delimiters are unfortunately complicated. As I see them, they are:

The first white space character you encounter is a "field delimiter";
it marks the end of the old field and the beginning of the next field.

Any white space character following a field delimiter is part of
the new field. That is - if you have two or more white space
characters in a row, the first one is used as a field delimiter, and
isn't sorted. The remainder are sorted, as part of the next field.

Every field has at least one non-whitespace character, unless you're
at the end of the line. (That is: null fields only occur when you've
reached the end of a line.)

All white space is not equal. Sorting is done according to the
ASCII (51.3) collating sequence. Therefore, TABs are sorted before
spaces.
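
The last rule is easy to check by hand. What follows is only a minimal sketch: it assumes a POSIX-style sort and forces the C locale with LC_ALL=C so that lines are compared by byte value (other locales may treat white space differently). The two input lines are invented for illustration.

```shell
# Two hypothetical lines: one starts with a space, the other with a TAB.
# LC_ALL=C makes sort compare raw byte values (TAB is 9, space is 32).
sorted=$(printf ' x\n\tx\n' | LC_ALL=C sort)

# Make the white space visible: cat -t -v shows each TAB as ^I.
printf '%s\n' "$sorted" | cat -t -v
```

Under those assumptions, the line that begins with a TAB should come out ahead of the line that begins with a space.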

Here is a silly but instructive example that demonstrates most of the
hard cases. We'll sort the file sortme, which is:

apple Fruit shipment
20 beta beta test sites
5 Something or other

All is not as it seems; cat -t -v (25.6, 25.7) shows that the file
really looks like this:

^I indicates a tab character. Before showing you what sort does with
this file, let's break it into fields, being very careful to apply the
rules above. In the table, we use quotes to show exactly where each
field begins and ends:

Line   Field 0     Field 1         Field 2      Field 3
1      "^Iapple"   "Fruit"         "shipment"   null (no more data)
2      "20"        "beta"          "beta"       "test"
3      " 5"        "^ISomething"   "or"         "other"

OK, now let's try some sort commands; I've added annotations on the
right, showing what character the sort was based on. First, we'll
sort on field zero - that is, the first field in each line:

% sort sortme                 sort on field zero
apple Fruit shipment          field 0, first character: TAB
5 Something or other          field 0, first character: SPACE
20 beta beta test sites       field 0, first character: 2

As I noted earlier, a TAB precedes a space in the collating sequence.
Everything is as expected. Now let's try another, this time sorting
on field 1 (the second field):

The only surprise here is that the NULL field gets sorted first.
That's really no surprise, though: NULL has the ASCII value zero, so
we should expect it to come first.
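
The empty-field case can be tried in isolation. This is a hedged sketch, not the example from the text: it assumes a sort that accepts the POSIX -k option, which numbers fields from 1 (so -k2,2 selects what this article calls field 1), and it uses two made-up input lines.

```shell
# Hypothetical input: the second line has no second field at all.
# -k2,2 sorts on the second field only; LC_ALL=C compares byte values.
sorted=$(printf 'a b\nc\n' | LC_ALL=C sort -k2,2)
printf '%s\n' "$sorted"
```

With these assumptions, the line whose key is empty ("c") should be listed before the line whose key contains text.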

OK, this was a silly example. But it was a difficult one; a casual
understanding of what sort "ought to do" won't explain any of these
cases. Which leads to another point. If someone tells you to sort
some terrible mess of a data file, you could be heading for a
nightmare. But often, you're not just sorting; you're also designing
the data file you want to sort. If you get to design
the format for the input data, a little bit of care will save you lots
of headaches. If you have a choice, never allow TABs in the
file. And be careful of leading spaces; a word with an extra space
before it will be sorted before other words. Therefore, use an
explicit delimiter between fields (like a colon), or use the -b
option (and an explicit sort field), which tells sort to ignore
initial white space.
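
Both pieces of advice can be sketched in a few lines. This is an illustration under stated assumptions rather than a recipe: it assumes a sort with the POSIX -t and -k options (where -k counts fields from 1, unlike the zero-based numbering used above), it forces byte-wise comparison with LC_ALL=C, and all of the sample data is invented.

```shell
# Hypothetical colon-delimited data: -t: sets the delimiter and
# -k2,2 restricts the key to the second field.
by_colon=$(printf 'x:b\ny:a\n' | LC_ALL=C sort -t: -k2,2)
printf '%s\n' "$by_colon"

# Hypothetical space-separated data with one extra space before "b".
# Without -b the extra space is part of the key and sorts ahead of "a".
plain=$(printf 'x  b\nx a\n' | LC_ALL=C sort -k2,2)

# With the b modifier on the key, leading blanks are skipped.
ignore_blanks=$(printf 'x  b\nx a\n' | LC_ALL=C sort -k2b,2)
printf '%s\n---\n%s\n' "$plain" "$ignore_blanks"
```

In the first command the colon removes any doubt about where the key starts. In the last two, the only difference is the b modifier: without it the line with the extra space should sort first, and with it the lines should be ordered by the words themselves.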