Translating Shell to Oil

2017-02-05

In Success with ASDL, I mentioned that a top project priority
is to automatically translate shell programs to the oil
language. The ability to express real programs is a test of the
language's design, especially when they're written by others.

I've done perhaps 25% of the work, but the translations are starting to look
accurate. Language features apparently follow a Pareto or "long tail"
distribution, so handling the most common ones already covers most real code.

In this post, I'll show a translated program and explain the oil
language features it uses.

Why a New Language?

Before looking at code, let's remind ourselves of the motivation. At first
glance, this project seems similar to CoffeeScript. We want a better
syntax for shell, in order to reveal its powerful semantics, e.g.
Bernstein chaining and pipelines.

But more important is that shell syntax leaves no room for extension. New
features will necessarily have a tortured syntax, such as bash's case-changing
expansions ${var^}, ${var^^}, ${var,}, and ${var,,}. I plan to
justify this further in a post called Declaring Syntax Bankruptcy on Shell.
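As a small illustration of those case operators, here is a sketch that assumes bash 4 or later (they are not POSIX). The variable names and values are made up for the example.

```shell
# Assumes bash 4+; these expansions are bash extensions, not POSIX.
name='hello world'
shout='HELLO'

first_upper="${name^}"    # uppercase the first character
all_upper="${name^^}"     # uppercase every character
first_lower="${shout,}"   # lowercase the first character
all_lower="${shout,,}"    # lowercase every character

echo "$first_upper"   # Hello world
echo "$all_upper"     # HELLO WORLD
echo "$first_lower"   # hELLO
echo "$all_lower"     # hello
```

The operators work, but nothing about ^ or a comma suggests "change case", which is the kind of syntax that a crowded grammar forces.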

So the bigger motivation for the oil language is to add
features to shell, in particular borrowing some from awk and make,
as well as providing a dialect for config files. I'm excited about these
goals, but they require getting past some tedious work.

Make sure to widen the window so that the two code panes appear side-by-side.

Notice that whitespace and comments are intentionally preserved. That is, if
your style is to put then on its own line, the opening { in oil will also
be on its own line. I'll describe the algorithm for style-preserving
translation in a future post.

They look similar from a distance, which is good. But notice the following
changes:

(1) The proc keyword. Oil will have both "procs" and functions, denoted
with keywords proc and func.

Procs are what we call shell "functions": they accept an argv array of
strings, return an integer status, and have file descriptors. They resemble
both processes and procedures.
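In today's shell, that process-like interface looks roughly like the sketch below. The function count_matches is a hypothetical example, not something from the translated program.

```shell
# Hypothetical shell "function": string arguments in, integer status out,
# with stdin/stdout/stderr inherited just like an external process.
count_matches() {
  local pattern=$1            # every argument is a string
  if [ -z "$pattern" ]; then
    echo "usage: count_matches PATTERN" >&2   # diagnostics go to stderr
    return 2                                  # integer status, not a value
  fi
  grep -c -- "$pattern"       # reads inherited stdin; its status is returned
}

printf 'a\nb\na\n' | count_matches a    # prints the number of matching lines
echo "status: $?"
```

Note that the count comes back on stdout, while the only thing actually "returned" is the exit status.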

Functions are like those in Python or JavaScript. They have typed arguments
and return values.

One important use case for functions is user-defined interactive
completion. Bash has a convention to mutate globals, e.g.
COMPREPLY, but proper return values are preferable.
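For reference, the bash convention looks roughly like this. It is a minimal sketch: mytool and its subcommands are hypothetical, and the block assumes bash with programmable completion enabled.

```shell
# Hypothetical completion function for a made-up command "mytool".
_mytool_complete() {
  local cur="${COMP_WORDS[COMP_CWORD]}"   # the word being completed
  # The "return value" is delivered by overwriting the global COMPREPLY array.
  COMPREPLY=( $(compgen -W "build test deploy" -- "$cur") )
}

# Register the function with bash's completion machinery.
complete -F _mytool_complete mytool
```

The caller and callee communicate only through globals (COMP_WORDS, COMP_CWORD, COMPREPLY), which is what a function with real arguments and a real return value would replace.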

Another use case is string manipulation, e.g. to escape HTML or SQL. You can
fake this by writing a "return value" to stdout and capturing it with a
subshell, but this requires forking for every function call.
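Here is a sketch of that workaround in plain shell. The helper escape_html is hypothetical and only handles three characters.

```shell
# Hypothetical helper: "returns" its result by printing it to stdout.
escape_html() {
  # Escape & first so the entities produced below aren't escaped again.
  printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# The caller captures stdout with command substitution, which forks a subshell.
escaped=$(escape_html '<b>Tom & Jerry</b>')
echo "$escaped"   # &lt;b&gt;Tom &amp; Jerry&lt;/b&gt;
```

Every call pays for at least one fork, which adds up quickly when the function is called in a loop.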

So it makes sense to have proper functions, but procs are important too because
they're isomorphic to an external process. I'll explain how they work together
in a future post.

(2) if uses curly braces as block delimiters instead of then and
fi. Reasons for this:

Consistency: In shell, function bodies are delimited by braces, while other
blocks are delimited by keywords like do and done. In oil, all blocks
use braces.

Huffman coding: Block delimiters are common, so they should be short, and
braces are shorter than keywords. Python-style indented blocks are even
shorter, but aren't suitable for a shell because the language is meant to be
typed interactively.

Note that { is an operator in oil, but confusingly it isn't in shell.
See discussion below.

(3) The conversion uses test instead of [. Oil will have C-style infix
boolean expressions, but legacy code may use test.

Not only is the [ command an ugly syntactic pun, but the [ character is
an operator in oil, so it requires quoting when in a command name.
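To see why [ is a pun rather than syntax, here is a small sketch in ordinary shell. The two helper names are made up for the example.

```shell
# "[" is an ordinary command name; the closing "]" is just its last argument.
is_dir_bracket() { [ -d "$1" ]; }

# "test" is the same command without the decorative brackets.
is_dir_test() { test -d "$1"; }

if is_dir_bracket / && is_dir_test /; then
  echo "both forms agree that / is a directory"
fi

# Ask the shell what "[" is; it is reported like any other command.
type [
```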

The fact that [ and { aren't operators prevents the shell language from
evolving. For example:

Roughly speaking, shell has a separate expression language for each type:
strings, integers, and booleans. Oil does away with this complexity with a
single expression language for all types, like C or Python.

As a result, it has just two sublanguages: commands and expressions.
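The three shell expression languages mentioned above can be seen side by side in this sketch. The variable names and values are hypothetical.

```shell
name='disk.img'
size=40

# String expression language: parameter expansion.
base=${name%.img}

# Integer expression language: arithmetic expansion.
total=$(( size * 2 + 1 ))

# Boolean expression language: the test / [ command.
if [ "$total" -gt 50 ] && [ "$base" = disk ]; then
  echo "big: $base $total"
fi
```

Each of the three has its own operators and quoting rules, which is the complexity a single expression language is meant to remove.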

The [] characters are used for arrays, and the () characters are used for
grouping expressions, as in most languages. So it makes sense for $[] to be
command substitution and $() to be expression substitution. Commands
are simply arrays of strings.

Keeping the two sublanguages in mind, notice:

(5) $(HDBMEGS) is a delimited variable substitution, in contrast to
${HDBMEGS}.

(6) $[which mke2fs] is command substitution, in contrast to
$(which mke2fs).

(7) Substitutions aren't quoted. Oil doesn't split words, because word
splitting is a misfeature designed to simulate arrays. (Most shell
implementations have arrays as an extension, but they're not in
POSIX.)

Splitting can be done explicitly with @split(HDB) or @[which mke2fs]. The
@ character is associated with arrays, e.g. for splitting and splicing.
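For comparison, this is the implicit splitting that shell does today, and the reason shell substitutions usually need quotes. It is a minimal sketch assuming the default IFS; count_args and the value of HDB are made up for the example.

```shell
# Hypothetical helper that reports how many arguments it received.
count_args() { echo "$#"; }

HDB='a b  c'          # one string containing spaces

count_args $HDB       # unquoted: the shell splits it into 3 words
count_args "$HDB"     # quoted: passed through as 1 word
```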

(8) In contrast, strings on the right-hand side of assignments must be
quoted. In expression mode, strings must be quoted; and everything to the
right of = is parsed in expression mode as opposed to command mode. This
will be implemented with the lexer modes technique (formerly
lexical state).

Examples:

    echo foo bar   # command mode: command and two literal words
    foo = bar      # expression mode: bar is a variable, as in C or Python
    foo = 'bar'    # bar is a string

    x = 1 + 2 * 3            # an integer expression
    s = myStr or 'default'   # a string expression

Also notice that = is a proper operator and may have spaces around it.