More Technonecrophilia with Snobol One-Liners

I don't think I want to imply that
Snobol is dead: to me, it has more
vitality than Perl. But I thought I'd let
my title thank
Mike Swaine
for his blog posting
Adventures
in Technonecrophilia about my
Programs
That Transform Their Own Source Code; or: the Snobol Foot Joke.
Mike concludes that
figuring out how to solve a problem in Snobol is like
doing a puzzle, and discovering
the power hidden in this ancient language is like winning a prize.
So I'll pose you two more puzzles. Here
is the first, a one-liner that
sorts a string:

If you don't see coloured boxes in this one-liner,
you're probably using Internet Explorer. I don't know why
Microsoft's browser can't depict
a Unicode U+2588 full block character whereas Firefox can. Perhaps
I have an old version. But I'm going to use the
coloured boxes later on to explain Snobol syntax, so
I'll ask you to switch to Firefox, or some other browser
that will show you coloured boxes above. As a test, the
thing between these dashes —
█ — should
be a chartreuse box.

With that out of the way, I'll show you the
one-liner again. It's actually
a seven-liner, but only one line does the
sorting.
The others: initialise two global system variables;
read the string to be sorted; display it and the result
of sorting; and
end the program. Here
they all are:

The first statement assigns to a
global system variable named "&anchor".
Snobol calls these variables
keywords. This one
affects pattern-matching,
as described in Emmer's
4.11
ANCHORED AND UNANCHORED MATCHING.
It makes the matcher start
every match at the beginning of
the string being matched, rather
than allowing the pattern to slide
along this string until the match
succeeds. The string being matched, by
the way, is often called the subject string.
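
As a rough analogy only, and in Python rather than Snobol, the difference between anchored and unanchored matching resembles the difference between "re.match" and "re.search". The subject and pattern below are made up for illustration, and nothing here claims to show how Snobol's matcher is implemented.

```python
import re

# Hypothetical subject string and pattern, chosen only for illustration.
subject = "technonecrophilia"
pattern = re.compile("necro")

# Anchored (like &anchor = 1): the match must begin at the first
# character of the subject, so this attempt fails.
anchored = pattern.match(subject)

# Unanchored (like &anchor = 0): the pattern may slide along the
# subject until it fits somewhere.
unanchored = pattern.search(subject)

print(anchored)            # None
print(unanchored.start())  # 6
```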

The next statement also sets a
keyword. This one
puts the matcher into fullscan mode
instead of quickscan.
As Emmer notes in
7.3
QUICKSCAN AND FULLSCAN, quickscan mode
uses heuristics to eliminate some
possible matches. This speeds
up matching, and doesn't usually
eliminate matches that would actually
be useful, at least in sensible
programs acting on realistic data. But
it will hurt
"pathological" programs designed to test the
matcher, for example by deliberately
enumerating all possible
matches. Because of this, and because
I prefer not to have to remember
what the heuristics do — and anyway,
computers are much faster than when
Snobol was designed —
I always run in fullscan mode.

These are the next two statements again:

s = trim(input)
output = "Original: '" s "'"

These read the string to be sorted, and
display it. In
Programs
That Transform Their Own Source Code,
I said that "output" is an
output-associated variable
that displays anything you assign to it.
Similarly, "input" is an
input-associated variable. It
is connected to standard input,
unless you reassociate it by calling the
built-in "input" function. When
you use its value, Snobol
reads a line from it,
as explained in
INPUT/OUTPUT
AND KEYWORDS.

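The next statement is the one that does the sorting:

loop s arb $ pre len(1) $ c len(1) $ d *lgt(c,d) rem $ post = pre d c post :s(loop)

This is a pattern match statement. It's one I invented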
after seeing a similar program on the Web, then never being
able to find it again. But perhaps I should clarify the syntax.
If you've ever tried learning
Chinese, you'll
enjoy — while
wincing at shared pain
—
David
Moser's amusing paper
Why
Chinese Is So Damn Hard.
He mentions the obvious reasons:
the characters; the tones; lack of shared
word roots; lack of shared culture.
But another is that Chinese does
not separate words by spaces. So
you see
"漢語語調稀奇古怪". Do you
read
it as
"漢 語語 調稀 奇古 怪"
or as
"漢語 語調 稀奇 古怪"
or as
"漢語語 調 稀奇 古 怪"
or as something else?
The Snobol tyro faces an opposite yet
analogous problem. Snobol uses spaces
too often with too many meanings; and because
most programming languages
accustom us to ignore spaces, we ignore
them in Snobol too. Hence Snobol becomes a programmer's
Chinese, its token-separators vanished from
conscious view. But behold!
By using colour, I restore them
to view:

Punctuation. For example, a space separates
a label from the rest of the statement, or
a jump from the part of the statement before it.

█

Separates the subject of a pattern match
from the pattern following it.

█

Separates the arguments of a binary infix operator
from the operator. Unary operators,
such as the star in *lgt,
don't take spaces.

█

Concatenation, usually of strings or
patterns. This is actually a binary
operator, albeit one with no visible
name.

█

These spaces surround the equals
sign that separates the pattern
of a pattern match from the replacement
string.

With this parsing in mind, we can
summarise the statement as:

loop s PATTERN = REPLACEMENT :s(loop)

This matches the string in variable "s"
against PATTERN. It then replaces the matched part by
REPLACEMENT. REPLACEMENT is "pre d c post",
which I'll now make explicit, rewriting the
statement as:

loop s PATTERN = pre d c post :s(loop)

The next step in understanding is to
realise that "pre", "d", "c", and "post"
are string variables, which get set
during the pattern match
by the "$" operator. In
Programs
That Transform Their Own Source Code,
I explained the "." or
conditional assignment operator. I said that
"A . B" tells the matcher that if the entire match of
which pattern A is a part succeeds,
then the substring matched by A must be
assigned to variable B. The "$" operator
is similar, but assigns to B immediately, without
waiting for the match to succeed.
The assigned substring can then be used later on
in the same pattern. We call "$" the
immediate assignment operator. There
is more about these two in
4.7
CAPTURING MATCH RESULTS and
5.3
IMMEDIATE ASSIGNMENT.
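
To make the difference concrete, here is a small Python simulation, not Snobol, of an unanchored match for "any one character followed by x". The function name and the hard-wired pattern are hypothetical. It records every value the variable would receive: with "$" semantics the variable is assigned each time the one-character part matches, even on attempts that later fail; with "." semantics it is assigned once, and only if the whole match succeeds.

```python
def match_char_then_x(subject, immediate):
    """Simulate an unanchored match of: one character, then a literal 'x'.

    Returns (succeeded, assigned), where assigned lists every value the
    capture variable received, in order. This is a hand-written sketch
    of the two assignment behaviours, not a general pattern matcher.
    """
    assigned = []
    for start in range(len(subject)):      # unanchored: try each start position
        ch = subject[start]                # what a one-character match picks up here
        if immediate:
            assigned.append(ch)            # "$": assign as soon as this part matches
        rest_matches = start + 1 < len(subject) and subject[start + 1] == "x"
        if rest_matches:
            if not immediate:
                assigned.append(ch)        # ".": assign only after overall success
            return True, assigned
    return False, assigned


print(match_char_then_x("abx", immediate=True))
print(match_char_then_x("abx", immediate=False))
```

With the subject "abc", which contains no "x", the immediate version still records an assignment for every character tried, while the conditional version records none.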

Now we'll look at the pattern. Here it is
again:

arb $ pre len(1) $ c len(1) $ d *lgt(c,d) rem $ post

The way to read these patterns
is not to worry about
how pattern matching is implemented.
Instead, read them declaratively, as string specifications
built from pattern primitives
such as "arb" and "len".

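So I shall now note that
"arb" is a pattern primitive that
matches any string, including the null string. See
4.8.1 Primitive Patterns in
4.8
UNKNOWNS. Of the other parts,
"len(1)" matches exactly one character, and
"rem" matches whatever remains of the subject string.
The star in "*lgt(c,d)" defers the call until match time,
when c and d have been set by "$"; the "lgt" function then succeeds
only if c is lexically greater than d.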

The pattern matches the entire subject string, with
pre and post "taking up the slack" either side
of c and d. And the "=",
explained in 4.9
PATTERN MATCHING WITH REPLACEMENT,
tells the matcher to
replace the matched portion of
the subject string — namely all
of it — by an equivalent string but with c and d
transposed.

So the pattern match finds a pair of adjacent characters that are out of order
and swaps them. If the match finds such a pair,
it will succeed; and if it succeeds, the
":s(loop)" will jump back to the beginning of the
statement for another try. If the match doesn't
find an out-of-order pair, it will fail, and control
will drop through to the next statement. So the statement
keeps on inverting out-of-order pairs until none
remain.
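
The same loop can be sketched in Python, purely as an aid to reading the Snobol. The function names are my own, and Python's "<" and ">" on strings stand in for Snobol's lexical comparison. The names pre, c, d and post play the same roles as in the pattern.

```python
def swap_first_out_of_order_pair(s):
    """Return s with its leftmost out-of-order adjacent pair swapped,
    or None if no such pair exists (the "match fails" case)."""
    for i in range(len(s) - 1):
        pre, c, d, post = s[:i], s[i], s[i + 1], s[i + 2:]
        if c > d:                      # the role played by lgt(c,d)
            return pre + d + c + post  # the replacement: pre d c post
    return None


def sort_string(s):
    """Keep swapping until no out-of-order pair remains."""
    while True:
        swapped = swap_first_out_of_order_pair(s)
        if swapped is None:            # failure: drop through to the next statement
            return s
        s = swapped                    # success: go round the loop again


print(sort_string("snobol"))  # blnoos
```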

Here's another one-liner, inspired by a 1998
posting to org.perl.perl5-porters
by Mark-Jason Dominus. In
Pattern
matching in SNOBOL4,
Dominus starts by saying:

This note started out as an analysis of
SNOBOL4's `FAIL' pattern, and turned into a
huge ramble about SNOBOL in general. If it
has a point, the point is only that SNOBOL's pattern matching is *still* a
lot
better than Perl's, and that it is worth studying,
because we could learn a lot from it.

Dominus goes on to say that he
got sidetracked into thinking about how to transform

"abc"

into

('abc', 'ab', 'a', 'bc', 'b', 'c', '')

How about, he wonders, something related to
m//g in Perl? He continues:

m/.*/g wouldn't do it, of course.
But this reminded me of a feature in
SNOBOL that was useful for similar purposes,
and I dug out my SNOBOL book and got sucked in,
as I always do, and I came to the same conclusion
that I always come to, which is that SNOBOL4
was a remarkably usable language, especially for
1971, and that people should pay more attention to
it. It had associative arrays, recursive
functions with locally-scoped variables, pattern
matching that was better then than Perl's is now.
On the other hand, SNOBOL's control
flow and syntax are hopelessly 1971.

He then talks about how Snobol's
primitive "fail" pattern
causes the pattern matcher to
backtrack. There's
a little example under the entry
for FAIL
in
7.4
OTHER PRIMITIVE PATTERNS.
This demonstrates a pattern that,
in unanchored mode,
backtracks over the string
being matched, displaying
its characters one by one.
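
The effect can be imitated in Python, again only as a sketch with a hypothetical function name. Each start position picks up one character and records it as a side effect; the attempt is then rejected unconditionally, which is the part "fail" plays, so the loop moves on to the next position and the match as a whole never succeeds.

```python
def match_then_fail(subject):
    """Imitate an unanchored match of: one character (shown immediately),
    then an unconditional failure. Returns (succeeded, shown)."""
    shown = []
    for start in range(len(subject)):  # unanchored: try every start position
        ch = subject[start]            # the one-character match
        shown.append(ch)               # side effect that survives the failure
        # Forced failure: nothing after this point can succeed, so the
        # matcher gives up on this position and tries the next one.
    return False, shown


print(match_then_fail("abc"))  # (False, ['a', 'b', 'c'])
```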

So how could one use backtracking to
enumerate all substrings of a string?
Here's my answer. Actually, it
doesn't generate the null substring, because a backtracking match that
leaves it out was easier to code:

To understand this,
the only extra
thing you need to know about Snobol is the built-in
"differ" function, and that's only for the trivial
purpose of avoiding a trailing comma in
the concatenated substrings. It is
described in 2.3.1 Conditional Functions of
2.3
BUILT-IN FUNCTIONS.

Here is the program's output. From what I've written this
week and last, and from the other links I've referenced,
can you see how it and the previous program work?