Linear Logic and Permutation Stacks--The Forth Shall Be First

This material is based upon work supported by the National Science
Foundation under Grant No. III-9261682.

Girard's linear logic can be used to model programming languages in
which each bound variable name has exactly one "occurrence"--i.e., no
variable can have implicit "fan-out"; multiple uses require explicit
duplication. Among other nice properties, "linear" languages need no
garbage collector, yet have no dangling reference problems. We show a
natural equivalence between a "linear" programming language and a
stack machine in which the top items can undergo arbitrary
permutations. Such permutation stack machines can be considered
combinator abstractions of Moore's Forth programming
language.

THE "FORTRAN FALLACY"

For 40 years, programmers have tried to utilize mathematical
expressions to program computers. Indeed, "FORTRAN" is a contraction
of "FORmula TRANslator", because Fortran's selling point was its close
relationship to mathematical formulae. For example, one solution of a
quadratic equation x=(-b+sqrt(b^2-4ac))/2a in Fortran is
X=(-B+SQRT(B**2-4*A*C))/(2*A).

Although this one-dimensional form is not as pretty as the
mathematician's two-dimensional form, the relationship is clear, and
users can look forward to Fortran-0X, which we expect will utilize the
2-D input notation pioneered (on punched cards!) by NASA's HAL/S
language [Ryer78].

Unfortunately, after 40 years of hard work, computer scientists
have not been that successful at efficiently exploiting this
mathematical expression metaphor. Furthermore, the single-minded
pursuit of this goal has blinded us to the reason for pursuing it in
the first place--to find a simple, easy-to-understand metaphor that
can be used to efficiently program a variety of applications on a
variety of computers.

Fortran has been quite naive about the nature of mathematical
expressions, and Fortran novitiates learn to "put that mathematical
metaphor up on the shelf next to the Easter Bunny", and use a more
implementation-oriented model which includes "storage maps",
"by-reference parameters", and the like.

Functional programming languages [Backus78] [Hudak89]
[Hughes89] have gone the furthest with the mathematical expression
metaphor, and have been extended to cover many programming tasks.
Unfortunately, functional programs have been considered "inefficient",
and their higher-order functions, "lazy evaluation" and
"multiple-return continuations" leave many cold.

The "standard", non-functional programming languages like Fortran,
Ada, and C are the bastard progeny of the coupling between a
pseudo-mathematical notation and a von Neumann-style random access
memory (RAM). There has never been a simple or particularly
effective mathematical model to aid in the deep understanding or
compiling of programs in these languages. A modern optimizing
Fortran compiler is still a baroque technical tour de force,
yet it has difficulty understanding simple Fortran programs. The real
Achilles' heel for these languages, however, is their innate inability
to deal with parallel or distributed computation. Neither the
mathematical expression metaphor nor the von Neumann RAM is up to this
task.

"The Emperor has no clothes", you say. "Cut the Gordion knot", you
say. "This many smart computer scientists can't be wrong," we say, so
the problem--as posed--must be essentially insoluble. In classical
logical fashion, computer science has achieved a proof by
contradiction. It has proposed the equivalence "programs =
mathematical expressions", and has derived a contradiction. Perhaps
it is time to move on to the next theorem.

Simula [Dahl66] started a revolution in computer languages
which is not yet finished. It proposed the equivalence "programs =
physical objects with behavior", but forgot to throw away the previous
equivalence. Smalltalk [Goldberg83] picked up the ball, but
the mathematical expressionists then converted the metaphor of
physical objects into polymorphic mathematical type systems--i.e., C++
[Stroustrup86]. The elegance of the mechanical metaphor was thus
buried underneath mathematical mysticism.

Linear logic [Girard87] can be viewed as the latest
attempt to bring back the physical object metaphor, but stripped of
its polymorphic pretensions. For the first time in 50 years of
computer science, a metaphor of programming has been proposed that
most people can relate to--objects have true identity, and objects
are conserved. As in the real world, an object cannot be
copied or destroyed without first filling out a lot of forms, but
on the other hand, the transmission of objects is relatively
painless. An object is localized in space, and can move from
place to place. (Only computer-literate people must be told that the
transmission of these objects does not create copies.)
Linear logic finally makes precise the high school notion that a
function is like a box into which one puts argument values and
receives result values, and that a truly "functional" box does not
remember its previous arguments or results.

The mathematical expression metaphor must be sacrificed to make way
for this linear/conservative object-oriented metaphor, which--as we
have seen--is not a great loss, as computer programs today deal with
non-mathematical objects most of the time.
[footnote 1]
The few remaining mathematical expressions--e.g., the quadratic
formula--can be computed, but require a more process-oriented
description. In such a description, we must first make copies of b
and a, since they are both used twice, and then we can compute the
result. The requirement for making explicit copies of b and a is
obvious to any 12-year-old, but computer scientists have spent 40
years eliminating this one trivial task while greatly increasing the
costs for everything else.

A linear logic language achieves its elegance through a starkly
simple rule--a bound variable name can be "used" only once. Thus,
variable reads are destructive and hence variables are
"read-once". Any attempt to use a name twice (or not at all) is
flagged by the compiler. A use of the variable name as an argument in
a function call means that the object referred to by the name has been
given to the function. Unless the function returns the object as one
of its results, the object is gone, and cannot be referenced by the
caller. Therefore, many operations on linear objects have the policy
of returning them when they are done--e.g., the length
function returns the length of a list as well as the list itself. A
function like "+", which accepts two values and returns their sum, can
be thought of as "consuming" its argument values and constructing a
result value. Not only is this metaphor appealing, but it can be very
efficient in practice--e.g., in multiple-precision arithmetic, the
storage utilized by the arguments to "+" can be reused to construct
the result.

A linear logic computer language avoids the need for garbage
collection by explicit construction and destruction of objects. The
beloved "constructors" and "destructors" of C++ can therefore be used,
as before. However, the "dangling reference problem" cannot happen in
a linear language because 1) the only name occurrence for an object is
used to invoke its destructor, and 2) the destructor doesn't return
the object. Furthermore, if an object does not wish to be copied, the
programmer cannot obtain a copy; thus, linearity achieves the goals
that "limited" types in the Ada programming language [AdaLRM83]
[Baker91SP] [Baker93Steal]
sought in vain.

A linear logic language avoids the need for most synchronization
because there is only one "path" to an object at any given time, and
hence only one operation can be performed on the object at a time.
The calling program cannot even talk about the object while an
operation on it is being performed, but must wait for the object to be
returned. On the other hand, complex subexpressions can always be
computed in parallel, because there is no possibility of read or write
conflict between subexpressions (linearity allows the environment to
be factored into disjoint factors that bind the names in each
subexpression). A linear language thus makes explicit any
"space-time" tradeoff: space can be minimized by always working with
only one copy of an object, and performing each operation serially,
while time can be minimized by duplicating the object at some cost in
space.

LINEAR LISP AND PERMUTATION STACK MACHINES

Linear Lisp is a "linear" style of Lisp--i.e., every bound name is
referenced exactly once. Thus, each function parameter occurs just
once, as do the names introduced via other binding constructs--e.g.,
let, let*, etc.

Linear Lisp requires the programmer to make explicit any
duplication and deletion, but he is repaid through better error
checking during compilation and better utilization of resources (time,
space) at run-time. Furthermore, distributed and/or parallel
execution becomes quite palatable in Linear Lisp.

We demonstrate the natural mapping of Linear Lisp onto a
permutation stack machine with some examples. The
identity function is already linear:

(defun identity (x) x)

Linear Lisp is implemented with a stack on top of which all
argument values are placed and all result values are returned. This
stack is "unframed"--i.e., there is no Algol-like frame pointer. The
elimination of the frame pointer simplifies the model and eliminates
the cost of building and disassembling stack frames. To call the
identity function, an argument is placed on the top of
the stack, the identity function is called, it
moves its argument to the top of the stack as its result, and
the function then returns. Of course, as the argument is already on
the top of the stack, no movement is required, so the code actually
needed for the identity function is empty: "[]".

We now consider a function which always returns the constant 5:

(defun five (x) x 5) ; x is explicitly killed.

five is called with its argument on the top of the
stack, but it must be destroyed before 5 can be pushed onto the stack.
The argument is destroyed by a drop operation, which pops
the top item and destroys it. The primitive constant 5 is then copied
and pushed to finish the five function: [drop
'5].

square requires two uses of its argument. A
second copy can be obtained by use of the dup operation,
which accepts one argument and returns two values--i.e., two
copies of its argument. The square function follows:
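
A sketch (we borrow the multiple-value binding style of the dlet* form
introduced below; the primed name for the second copy is merely our
convention):

(defun square (x)
  (dlet* (((x x') (dup x)))     ; explicitly duplicate the argument
    (* x x')))                  ; consume both copies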

Although the source code for square is a bit long, its
implementation on our stack machine is very efficient. One argument
is passed to square on the stack, dup
replaces this value by two copies on the top of the stack, and * then
multiplies them together.
[footnote 2]
The code for square is thus "[dup *]".
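
In Linear Lisp source, the explicit copying of a and b is written with
dup. A sketch consistent with the compiled code analyzed below (the
surface syntax, which again borrows the dlet* form introduced later,
is an assumption):

(defun quadratic (a b c)
  (dlet* (((a a') (dup a))          ; a is used twice
          ((b b') (dup b)))         ; b is used twice
    (/ (+ (- b') (sqrt (- (square b) (* 4 (* a c)))))
       (* 2 a'))))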

quadratic requires variable movement on the
stack. We must duplicate a, but a is not on top of the stack.
Therefore, we first permute the top three items of the stack to get a
on top, and then duplicate a. The stack then looks like ...bcaa'. We
must then duplicate b, but b is not on top of the stack. Therefore,
we permute the top four items to get b on top, and then
duplicate b. The stack then looks like ...caa'bb'. The next
operation negates b', which we arranged to be on the top of the stack.
The next operation after that permutes b to the top, so that
square can be called. We continue in this fashion,
alternately permuting the stack and calling functions. The code for
the quadratic function is therefore: [roll3 dup roll4 dup neg
roll2 square '4 roll5 roll6 * * - sqrt + '2 roll3 * /].
[footnote 3]
We can
see in operation the basic variable accessing mechanism
roll<n>, which permutes the top n items of the
stack by moving the n'th item to the top and pushing the other n-1
items down by one. A compiler can easily keep track of where
everything is.

(The reader may feel that permuting an item to the top of the stack
simply for the purpose of copying it is wasteful, as nearly all stack
machines have the ability to copy an item directly to the
top. We believe this intuition to be incorrect. First, while it may
be easy to copy objects like small integers from deep in the stack, it
is unlikely that the duplication operation for an arbitrary precision
integer or an abstract data type is a primitive operation, and when a
non-primitive duplication function must be called, the argument and
results should be on top of the stack. Second, a key assumption of
linear languages is that permuting items to the top of the stack is
relatively inexpensive (relative to copying or destroying them), and
hence the percolation of the second copy down after the
dup will happen automatically, as a side-effect
of rolling the other items up to the top of the stack.)

Every iterative/recursive function requires some sort of
conditional execution. Conditional expressions--e.g.,
if-expressions--in Linear Lisp require sophistication
beyond the simple linear rules we have considered so far. Since only
one "arm" of the conditional can be executed at any given time, we
relax the "one-occurrence" linearity condition to allow a reference in
both arms, so long as they are not executed concurrently or
speculatively. There is an occurrence in one arm if and only
if there is an occurrence in the other.

The proper treatment of the boolean expression part of an
if-expression requires more sophistication. Strict
linearity requires that any name used in the boolean part be counted
as an occurrence. However, many predicates are "shallow", in that
they examine only a small (i.e., shallow) portion of their arguments
(e.g., null, zerop), and therefore a relaxed
policy is required. We have not settled on the best syntax to solve
this problem, and currently use several if-like
expressions: if-atom, if-null,
if-zerop, etc. These if-like expressions
require that the boolean part be a simple name, which does not count
towards the "occur-once" linearity condition. This modified rule
allows a shallow condition to be tested, and the name to be reused
within the arms of the conditional.
[footnote 4]
The predicate could alternatively return two values: the
truth value (on top of the stack) and its unmodified argument(s).

To use Lisp-like cons cells with their car and cdr components, we
need a mechanism to linearly extract these components, since
any use of (car x) precludes the use of
(cdr x), and vice versa, due to the linearity of x. We
therefore utilize a function carcdr, which takes a cons
cell and returns both components. The carcdr function is
the stack machine inverse of cons. Using these
constructs, we can program the append function:
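
A sketch (the two-value binding syntax is an assumption; note that the
tested name x is legitimately used in both arms under the relaxed rule
above):

(defun append (x y)
  (if-null x
    (progn x y)                        ; x is (); kill it and return y
    (let* (((carx cdrx) (carcdr x)))   ; linearly split the first cons cell
      (cons carx (append cdrx y)))))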

Decomposing list structure is so common in Linear Lisp that we
provide a "destructuring let" form dlet* in which
patterns with variable names are matched/unified against a value.
Since dlet* is linear, dlet* patterns
consume the portions of the values that they match--i.e., the
parts that are not bound to new names in the process of matching. Of
course, linearity requires that a pattern bind a name only once. We
can also use a backquote list composing syntax
[Steele90].
A prettier version of append follows:
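
A sketch (the exact pattern syntax is an assumption):

(defun append (x y)
  (if-null x
    (progn x y)                        ; kill the empty list, return y
    (dlet* ((`(,carx . ,cdrx) x))      ; the pattern consumes the cons cell
      `(,carx ,@(append cdrx y)))))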

In these append functions, we have swept something
important under the rug--the code for append is itself
non-linear, because it references itself in its own body! This is a
problem with all iterative and recursive constructs, and we have the
same solution for this problem that the lambda calculus has--the Y
combinator (and various optimizations of it), which does "lazy"
duplication. In other words, we abstract the append call
out of its own code, and pass it to the kernel of append
as another argument (the funcall special form takes its
first argument from the top of the stack):
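
A sketch consistent with the compiled code given below (the argument
order and the #' notation are assumptions):

(defun append (x y)
  (append-kernel x y #'append-kernel))    ; hand the kernel a copy of itself

(defun append-kernel (x y f)
  (if-atom x
    (progn x f y)                         ; kill x and the unused function f
    (dlet* (((carx cdrx) (carcdr x))
            ((f f') (dup f)))             ; "lazy" duplication of the function
      (cons carx (funcall f' cdrx y f)))))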

The code for append-kernel is: [roll3 atom2
[drop drop] [carcdr roll4-2 dup funcall-3 check-1 cons]
ifelse]. The primitive roll4-2 rolls/rotates the
top 4 items of the stack by 2, and is equivalent to [roll4
roll4].

We note that the sequence [funcall-3 check-1] calls a
function with 3 arguments and 1 result. The information about the
number of arguments and results is used to check for the case of a
function which is called with the wrong number of arguments or
results. The implementation of this may be a separate implicitly
addressed register which keeps track of the number of arguments passed
on a call and the number of results on a return. Such numbers could
easily be passed on the stack, but this would be inefficient for the
vast majority of calls which have fixed numbers of arguments and
returned values. To check for 3 arguments to
append-kernel, use check-3.

We now examine an "iterative" factorial function. An
iterative factorial carries all of its "state" in its
arguments, and can therefore make all recursive calls
"tail-recursive". Tail-recursive calls are interesting, because they
do not require unbounded amounts of either argument stack or return
stack space. Furthermore, the complexity of a tail-recursive loop
function is independent of the number of its arguments. Below is a
linear iterative factorial function.
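
A sketch (the driver name and the exact form are assumptions; the
kernel's state consists of exactly the variables r, n, and f discussed
below):

(defun iterative-fact (n)
  (iterative-fact1 1 n #'iterative-fact1))  ; the product r starts at 1

(defun iterative-fact1 (r n f)
  (if-zerop n
    (progn n f r)                       ; kill n and f, return the product r
    (dlet* (((n n') (dup n))            ; n is needed twice below
            ((f f') (dup f)))           ; lazily copy the function
      (funcall f' (* r n) (1- n') f)))) ; the tail-recursive call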

This iterative factorial is interesting because 1) its
tail-recursive property can be trivially recognized by syntactic
means, and 2) it can be implemented in a particularly efficient
manner. In fact, since iterative-fact1 references no
"free variables", no closures must be allocated or destroyed during
its execution. The entire state of iterative-fact1 lives
on the stack, and consists of the values of the variables r, n, and f.
The computations have the effect of permuting these values, but the
stack as a whole neither expands nor contracts. Finally, the
permutation of these items necessary to execute the next iteration is
automatically achieved; no special tail-recursion optimization is
required [Steele78] [Hanson90]! In short, the execution of
tail-recursive loops on our permutation stack machine is not only just
as efficient as that achieved with a special iteration construct, it
is exactly the same as that achieved with the special
iteration construct, but without "syntactic sugar".

We eventually get tired of programming each recursive function with
its own driver function and its own function duplication code, in
which case we would like to program a true Y combinator [Gabriel88].
This Y combinator incorporates all of the lazy duplication machinery
necessary to implement iteration and recursion in a linear language.
In order to utilize Y, we will use a slightly different form of
recursive kernel, in which the function is abstracted separately.
Consider again factorial:
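
A sketch (y1 is the Y combinator for single-argument kernels; the
exact form is an assumption):

(defun fact (n)
  (funcall (y1 #'fact-kernel) n))       ; y1 ties the recursive knot

(defun fact-kernel (f)                  ; f will be the recursive closure
  #'(lambda (n)                         ; f occurs free in this function
      (if-zerop n
        (progn n f 1)                   ; kill n and f, return 1
        (dlet* (((n n') (dup n)))
          (* n (funcall f (1- n')))))))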

A closure must be compiled within
fact-kernel. The anonymous inner function of
fact-kernel is the function which will receive the actual
numerical argument. However, this inner function has a free variable
f which must be carried along with the anonymous function so that when
that function is called, the value of this free variable can be
obtained. The structure which carries this free value is a
closure, and consists of a 3-element vector whose last
element is the indicator 'funarg, whose second element is
the anonymous inner function code, and whose first element is the
value of the variable f. When the closure is invoked, the anonymous
code will be called with a stack which looks like ...nf. In other
words, the entire closure vector is "spread" onto the stack (about the
same effort as reading an entire cache line), the indicator 'funarg is
dropped, and the code sequence is loaded into the instruction buffer,
leaving only the free variables themselves on the stack. The effect
of this strategy is to allow the anonymous inner function to be
compiled as if its argument list were (n f).
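
Schematically (our notation):

;; closure vector:  #( <value of f>  <inner code>  funarg )
;;
;; invoking the closure on the argument n:
;;   ... n #(f code funarg)      ; stack before the call
;;   ... n f code funarg         ; the vector is "spread" onto the stack
;;   ... n f code                ; the 'funarg indicator is dropped
;;   ... n f                     ; the code is loaded into the instruction buffer
;; The inner code now executes as if its argument list were (n f).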

Interestingly, while the source for y1 handles only
kernels of one argument, the compiled code handles kernels of
any number of arguments, by simply removing the setting and
checking of the register keeping track of the number of arguments!
This is because y1 doesn't really deal with the arguments
to the kernel itself, but only with the function closure which is
lazily copied during the recursion.

With additional effort, a general Y combinator can be programmed
which handles mutually recursive kernels
[Baker92MC].
If the duplication of short closure vectors and vectors of mutually
recursive functions can be done quickly--i.e., these vectors are
"copied" by-reference (and thus reference-counted [Collins60]
[Baker92LLL]
), rather than truly copied, then recursion and iteration can be as
efficient (O(1)) as in environments having cyclic code. Furthermore,
it is well-known how to optimize closure creation and destructuring
[Steele78], so our machine need not pay this cost for linearity.

Since a generic Y operator for a mutually-recursive system of
functions must constantly recirculate and lazily copy a vector of
routines in a strictly linear system, the accessing of the elements in
these vectors is an excellent place for implicit copying, if such
copying is going to be used anywhere. That efficient mutual recursion
requires copying is not surprising, but the explicit copying makes
more obvious the problem of implementing mutual recursion in a
distributed system. Mutual recursion can be performed as a
distributed computation, but the expense of explicit copying may make
it unpalatable. Nevertheless, linearity will keep precise the meaning
of this distributed computation, and guarantee that a system with
similar recursions taking place simultaneously on different machines
will have the same result as if all of the computations took place on
a single machine.

FORTH IS A SYSTEM OF LINEAR COMBINATORS

Combinatory logic [Curry58] is a logical structure which
is closely related to the lambda calculus [Church41]. The
lambda calculus talks about names and substitutions in expression
trees, while combinatory logic achieves the same "computations", but
without needing any names. Backus's speech on the advantages of
functional programming [Backus78] considers the ability of combinatory
logic to eliminate names to be one of its major advantages. Most APL
operators are combinators on array-type objects, and the absence of
names from APL "one-liners" is quite characteristic of
combinators.

When translating from lambda calculus expressions into
combinators--as the primitives of combinatory logic are
called--one must replace random access to a value by means of
a name with steering logic to propagate values to the
locations in an expression tree where they will be used. One of the
simplest translations involves distributing two copies of a
value down both branches of a binary tree (the S
combinator), followed by the killing of any copies at leaves
where they are not used (the K combinator). Obvious
optimizations involve sending values only down branches where they
will be used by means of the non-copying B and
C combinators, which simply steer the values down
either the left or right branch of the binary tree, respectively.

A linear version of the lambda calculus normally translates into
only B and C-type combinators which
neither copy nor kill values. Since any copying and killing is done
explicitly by the programmer, these operations require
"interpretation"--i.e., they are not built into any control
structures, but require type-specific code for their implementation.
An obvious optimization for a set of linear combinators is to provide
more permutation combinators.

Most Forth operators take their operands from the top of the stack
and return their values to the top of the stack. A perusal of this
Forth code reveals the absence of variable names which is
characteristic of combinators. The programming of Forth operators can
therefore be seen as the construction of larger combinators from
smaller ones. A Forth which incorporates only stack permutation
operations like swap, rotate, and
roll must be linear, because it has no copying or killing
operators.

STACK MACHINE IMPLEMENTATION

The usual implementation of Forth utilizes two stacks--an
operand stack and a return stack. Many Forth implementations allow
the programmer to make temporary use of the return stack to perform
more complex permutations of the operand stack. If a relatively
complete set of permutation operations for the operand stack is
provided, then the user will rarely need to "roll his own".

Since Forth is usually implemented on a traditional von Neumann
machine, one assumes that "rolling a stack" is an expensive operation.
The von Neumann bottleneck limits the speed with which the information
can be rearranged in the stack, because only one word can be moved at
a time. However, the RISC revolution makes the basic instruction
cycle as fast as possible--i.e., one instruction per clock, and then
uses pipelining and compiler technology to remove any impediments to
achieving this speed. Unfortunately, a RISC clock cycle is limited by
the need to access both a register bank and an instruction cache once
per cycle. While these can be done in parallel and overlapped, the
basic fact remains that the cycle time of these small RAM's is the
limiting factor in RISC performance. The limiting factor in RAM
access time is the log(n) gate delays that are used to address these
RAM's. Therefore, the only way to make the fastest RISC architecture
is to limit the size of these RAM's, which necessarily limits the
amount of data that can be stored in the registers or the number of
instructions that can be stored in the instruction cache.

If we examine Forth instruction streams, however, we notice that
they are punctuated into two kinds of operations--a number of
permutation operations followed by a number of computing operations.
These permutation operations are considered "non-productive", while
computing operations are "productive". However, stack permutation
operations are no less productive than register loads, stores and
transfers in a RISC architecture, and due to their ability to move
more than one word at a time, there is evidence that permutations are
more productive than register loads and stores. Furthermore,
although permutations are probably more time-consuming than
"computing" instructions, a true stack architecture could execute
"computing" instructions blindingly fast. For example, AND'ing the
top two items of the stack should take only 4-5 gate delays including
reading from the top two items and storing back to one of them.
Accessing a dual-ported register bank of 64-128 registers should take
significantly longer, not counting the time to store the result. If
one considers "round-trip time", the result may not be usable in the
next clock cycle. Thus, a stack architecture could utilize a clock
period for non-permutation instructions which would be a small
fraction of that for a register machine, and a permutation unit should
take about the same time for its operation as a RAM of the same
size.

Most people describe the top several positions of the Forth stack
as "locations", but it is more productive to think of them as
"busses", since no addressing is required to read from them at
all--the ALU is directly connected to these busses. The permutation
operations necessary to get operands onto these busses can now be seen
as the same sorts of operations that are used to extract values from
registers. If one generalizes this idea, then one can conceive of
multiple arithmetic operations being performed simultaneously on a
number of the top items of the "stack". For example, the top 4 (or 8)
items of the stack may be busses which are directly connected to an
FFT butterfly network (itself a generalized permutation generator), in
which case a radix-4 (or radix-8) FFT could be computed in parallel.
Similarly, other operations on independent busses (top items of the
stack) could be performed in parallel--e.g., computing an element-wise
addition or multiplication of the top 2n items of the stack,
considered in pairs. Of course, parallel operations of this sort must
conserve the stack size, but any machine with parallel units of this
sort would also have a general permutation network capable of
"squeezing out" in parallel empty positions from the stack.

A traditional stack cache utilizes its space on the chip and memory
bandwidth better than a register bank of the same capacity. The stack
cache for a linear stack machine should be even more efficient, since
1) there are no stack frames or frame pointers; 2) all temporaries are
handled the same--both named and unnamed temporaries; and 3) the space
for each temporary is reclaimed immediately upon the use of its value
(i.e., the cache is "self-cleaning"). This last characteristic
guarantees that all of the data held in the stack cache is
live data and is not just tying up space.

Since Forth is usually implemented on a traditional von Neumann
machine, one thinks of the return stack as holding "return addresses".
However, in these days of large instruction caches, in which entire
cache lines are read from the main memory in one transaction, this
view should be updated. It is well-known that non-scientific programs
have a very high rate of conditional branches, with the mean number of
instructions between branches being on the order of 10 or less. Forth
programs are also very short, with "straight-line" (non-branching)
sequences averaging 10 items or less. In these environments, it makes
more sense to view the return stack itself as the instruction buffer
cache! In other words, the return stack doesn't hold "return
addresses" at all, but the instructions themselves! When a routine is
entered, the entire routine is dumped onto the top of the return
stack, and execution proceeds with the top item of this stack. Since
routines are generally very short, the transfer of an entire routine
is about the same amount of work as transferring a complete cache line
in present architectures. Furthermore, an instruction
stack-cache-buffer is normally accessed sequentially, and therefore
can be implemented using shift register technology. Since a shift
register can be shifted faster than a RAM can be accessed, the "access
time" of this instruction stack-cache-buffer will not be a limiting
factor in a machine's speed. Executing a loop in an instruction
stack-cache-buffer is essentially the making of connections necessary
to create a cyclic shift register which literally cycles the
instructions of the loop around the cyclic shift register.

CONCLUSIONS

We have shown a Lisp called Linear Lisp which implements a
"linear style" of programming suggested by linear logic
[Girard87]. We have shown this linear style to be competitive with
traditional Lisp styles for sparse polynomials
[Baker94SP]
and Quicksort
[Baker94QS].
Furthermore, the mapping of Linear Lisp onto a stack architecture
[footnote 5]
utilizes a frameless Forth-style stack, in which the basic stack
operations are permutations, rather than loads and stores. A
specific kind of permutation--roll<n>--cycles the
top n items of the stack so that the n-th item becomes the top item.
These roll instructions are perfectly matched to the
needs of linear logic to conserve values on the stack until they are
utilized exactly once. Furthermore, any duplication or deletion of
values is an explicit operation that takes place at the top of the
stack. Interestingly, the consistent use of these roll
(move-to-front) and dup operators makes the stack into an
LRU buffer exactly like those used in "stack algorithms" for
simulating LRU caches [Mattson70]!

The idea of permuting the top several items of a stack by means of
a "rolling" operation seems to have occurred independently to a number
of people as a generalization of the swap instruction
from the Burroughs stack machines [Lonergan61] [Hauck68]. Although
the swap instruction was apparently introduced to
optimize the code for non-commutative operators (subtraction,
division) in arithmetic expressions, general rolling operators were
first used in HP calculators [Osborne93] and Forth implementations
[Moore93] to maximize the accessibility of operands while minimizing
real and conceptual space requirements. Although stack architectures
have been extensively written about, we have found no
references to the more general use of stack permutations to access
local variables
[footnote 6]
without the use of frame pointers. The literature on compiler
optimizations for stack machines is particularly sparse--probably
because the proponents of stack architectures touted the triviality
of translation as a virtue. With the modern notion of an optimizing
compiler as a
"partial evaluator", it is time to revisit the stack v. register
controversy.

[Footnote 1]
Even mathematical APL dispensed with traditional "operator
precedences" when Iverson discovered that 1) most programs dealt with
very few numerical expressions, and 2) people couldn't remember the
precedence table.

[Footnote 2]
We utilize the Postscript(tm) convention that the first argument is
pushed first, and is therefore deepest in the stack--i.e., the stack
grows to the right.

[Footnote 3]
This heavy use of explicit copying cries for ' as a
non-quoting character.

[Footnote 4] Typestates [Strom83] can check linearity in complex control
structures.