ASDL in Oil

2017-01-05

Yesterday I showed an example of an ASDL schema and data structure, and
recapped the project priorities. Today I will describe what Oil's ASDL
implementation does, and how it will help with each of the top three
priorities.

Oil's ASDL Implementation

The implementation is very short:

[show line count]

How does it work? As mentioned, asdl_.py was copied from Python. It uses a
simple lexer and recursive descent parser to turn the ASDL schema (example:
osh.asdl) into a Python data structure.
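
To make the lexer-plus-recursive-descent idea concrete, here is a minimal sketch for a tiny subset of ASDL (a single sum type with fields). It is not the code in asdl_.py; the names tokenize and Parser, the token pattern, and the dict-based output are all assumptions made for illustration.

```python
import re

# Hypothetical token pattern for a tiny ASDL subset: names and punctuation.
TOKEN_RE = re.compile(r"\s*([A-Za-z_][A-Za-z_0-9]*|[=|(),*?])")


def tokenize(text):
    """Split schema text into a flat list of tokens."""
    tokens = []
    pos = 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError("unexpected character at offset %d" % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    """Recursive descent: one method per grammar rule."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.i = 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else None

    def eat(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError("expected %r, got %r" % (expected, tok))
        self.i += 1
        return tok

    def parse_type(self):
        # type: NAME '=' constructor ('|' constructor)*
        name = self.eat()
        self.eat("=")
        constructors = [self.parse_constructor()]
        while self.peek() == "|":
            self.eat("|")
            constructors.append(self.parse_constructor())
        return {"type": name, "constructors": constructors}

    def parse_constructor(self):
        # constructor: NAME ['(' field (',' field)* ')']
        name = self.eat()
        fields = []
        if self.peek() == "(":
            self.eat("(")
            fields.append(self.parse_field())
            while self.peek() == ",":
                self.eat(",")
                fields.append(self.parse_field())
            self.eat(")")
        return {"name": name, "fields": fields}

    def parse_field(self):
        # field: TYPE ['*' | '?'] NAME
        type_name = self.eat()
        modifier = None
        if self.peek() in ("*", "?"):
            modifier = self.eat()
        return {"type": type_name, "modifier": modifier, "name": self.eat()}


schema = "expr = Const(int value) | Add(expr left, expr right) | Neg(expr* args)"
tree = Parser(tokenize(schema)).parse_type()
```

The result is plain Python data (dicts and lists) describing the schema, which later passes can walk. A real ASDL parser also handles modules, product types, attributes, and comments, all of which this sketch leaves out.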

Then we have two algorithms that walk this tree representation of the schema:

Relation to the Three Priorities Forward

For now, it's convenient. Compare with Python's ASDL, which generates 10K
lines of code, as big as all of oil!

It's easier to pass a huge string from Python to C++.

It saves parsing work when we're bootstrapping. A lot.

What I do NOT want: .pyc files or pycache files. This causes problems.
Most scripts won't need this.

Libraries will be precompiled, but scripts won't be.

We should be able to parse, say, 100,000 lines of code in negligible time, as
fast as wc -l.

This work took a while, around 6 weeks, but it unblocks the three top
priorities:

1) Converting shell scripts to oil
2) Testing the semantics by executing the shell in Python.
   The main reason for this is that I don't like iterating in C++. I am doing
   a lot of work to avoid writing C++ code :)
3) Writing a production quality version in C++

Using ASDL will help with all three.

1) We still need source location info.
2) ASDL found a lot of bugs. It really is the core of the interpreter.
Dynamic type checking with exhaustive tests is as good as static type checking.

In oil, I want optional typing like Dart.

3) My bespoke ASDL implementation writes things to the oheap format.

One important realization that led me to use ASDL:

Originally I thought that I could have a very simple architecture. A "pure
front-end replacement".

osh compiles to an AST.

oil compiles to an AST

Execute the AST

Convert osh to oil by doing osh -> AST -> oil.

But for a number of reasons this won't work. oil is going to be a SUPERSET of
the semantics of osh.

osh parses into the AST osh/osh.asdl

oil parses into the AST oil/oil.asdl

They both are further compiled down to core/ovm.asdl

important point: All ERRORS are handled in the first stage.

ovm is a smaller language. It's a "lowering". It's a VM, but for now it will
have a tree structure. It might be a little similar to NQP, although I don't
know much about it.

ovm can also be used for other tools. I am not writing an awk interpreter or
parser, but if someone wanted to write one, ovm might be a good target.
Instead I am folding it

osh to oil will preserve style and comments. We will take the osh/asdl
representation and PRINT text.

later: we can write an oil autoformatter. The invariant of osh/osh.asdl

A Peek of ASDL

I've been working on using Zephyr ASDL. How is it used in Python?

design of CPython compiler

Python.asdl is a big data structure -- a set of Algebraic Data Types. If you
are coming from a C background, you can think of them like classes and tagged
unions (and indeed Rust has made that connection, using the "enum" keyword for
variants).

This file is transformed via asdl_c.py into a header file and C code.

The generated parser creates C structs.

Then it creates Python/C bindings classes so you can do the following:

Eventually, oil won't depend on Python, so I'm not going to use Python/C
bindings of course. Here is the plan for oil's architecture.

osh.asdl

oil.asdl

Both compile to ovm.asdl

I was working on making the AST more homogeneous ([example commit]), and then I
realized that I really need two tree data structures:

Heterogeneous means: represent everything faithfully, more types.

Homogeneous: fewer types. Most shells seem to do this. In C this is the
natural thing because there is no subtyping.

heterogeneous AST for converting from osh to oil

homogeneous tree for execution
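
To illustrate the distinction, here is a small sketch with both representations and a function that lowers one into the other. The node names (SimpleCommand, Pipeline, Node) and the lower function are hypothetical and are not taken from osh.asdl or ovm.asdl.

```python
from collections import namedtuple

# Heterogeneous: one type per construct, with named, construct-specific fields.
SimpleCommand = namedtuple("SimpleCommand", ["words"])
Pipeline = namedtuple("Pipeline", ["commands", "negated"])

# Homogeneous: a single node type, distinguished only by a tag.
Node = namedtuple("Node", ["tag", "children"])


def lower(node):
    """Convert a heterogeneous node into the homogeneous form."""
    if isinstance(node, SimpleCommand):
        return Node("command", list(node.words))
    if isinstance(node, Pipeline):
        pipe = Node("pipeline", [lower(c) for c in node.commands])
        # Negation becomes a wrapper node instead of a field.
        return Node("not", [pipe]) if node.negated else pipe
    raise TypeError("unknown node type: %r" % (node,))


tree = Pipeline([SimpleCommand(["ls"]), SimpleCommand(["wc", "-l"])], negated=False)
flat = lower(tree)
```

The heterogeneous form keeps every detail in a named field, which is what a converter wants. The homogeneous form can be walked by one generic loop over tag and children, which is convenient for an executor.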

I will just say what I have now, and describe how it's used once code is actually committed:

Parse the .asdl file, creating its own tree. (And yes, ASDL can be
described with ASDL, another form of bootstrapping.)

Use Python metaprogramming to generate classes that the Python parser can
use.

Able to create and DYNAMICALLY type check instances of those generated
types. Example:
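
What follows is a hypothetical sketch, not Oil's actual implementation or API. It shows one way to generate a class at runtime from a list of fields and to check types dynamically on every assignment. The make_class helper and the Token type with its fields are invented for illustration.

```python
def make_class(name, fields):
    """Build a class from (field_name, field_type) pairs.

    Every assignment to an instance attribute is type checked at runtime.
    """
    field_types = dict(fields)

    def __init__(self, **kwargs):
        for field_name in field_types:
            if field_name not in kwargs:
                raise TypeError("%s: missing field %r" % (name, field_name))
        for key, value in kwargs.items():
            setattr(self, key, value)  # goes through __setattr__ below

    def __setattr__(self, key, value):
        if key not in field_types:
            raise AttributeError("%s has no field %r" % (name, key))
        expected = field_types[key]
        if not isinstance(value, expected):
            raise TypeError("%s.%s: expected %s, got %s" % (
                name, key, expected.__name__, type(value).__name__))
        object.__setattr__(self, key, value)

    # type(name, bases, namespace) creates the class object at runtime.
    return type(name, (object,), {"__init__": __init__,
                                  "__setattr__": __setattr__})


# Hypothetical type with two fields.
Token = make_class("Token", [("id", int), ("val", str)])
t = Token(id=3, val="echo")

try:
    t.val = 42  # wrong type: rejected at assignment time
except TypeError as e:
    error = str(e)
```

Because the check runs on every assignment, a test suite that exercises the parser thoroughly will surface type errors in the tree it builds, even though nothing is checked statically.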

Able to serialize them to binary.

Able to GENERATE C++ code. The API looks somewhat like protocol buffers.

In fact ASDL and protocol buffers are sort of the same. There is a long history of serializing types:

It is divided into coarse-grained "kinds", which help make parsing decisions.
I'm also using the "external visitor" pattern, i.e. the thing that Terence
Parr gave me permission to do. (TODO: maybe write a blog post on that?)

What does the AST Look like?

Clang AST. IPR for C++.

I'm appreciating how hard it is to write a programming language, or rather how
many choices there are. LLVM is WAY cleaner than gcc; yet it seems that they
are stuck with some decisions.

Recap

Another reason for not blogging: I also want to be more forward looking. If I
blog after doing some more coding, that will be the case. I could write about
Shell WTFs every day for 6 months, but it doesn't really move the project
forward. Everyone is convinced. Still, some things are too weird not to
mention, like what I found out yesterday (explained in the Addendum below).

(I need a way to link to test cases.)

Fun fact: literally nobody is arguing that shell is a good language. I've
thought that programming is big, the internet is big, and someone would come
out of the woodwork and claim we don't need to fix shell. I haven't gotten
that feedback yet, which is sort of surprising.

Crucially, I think people understand the point of auto-converting osh to
oil. The effort is quite large, but it's worth it, and it has actually
improved the design of the oil language in several cases, which I will write
about in a future post.

I still have a draft of the blog post listing posts which I will release.

I want a BALANCE of the AST format.
AST formats are non-trivial: Clang talk which you can't print. Receive this
question a lot. They are doing source edits.
Clang AST is fuggin' enormous.

TODO: implement stack traces.

Code is smaller now

Two clients of AST:
Shared osh/oil executor, written in C++
oil printer, written in Python

They have different requirements. Fleshing out a third one.

AST differences

comments and whitespace

for execution, you can throw away token values
The three bytes '<<-' become a single byte: DLessDash
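
As a rough illustration of that last point, the sketch below maps operator text to a small integer ID that fits in one byte. Only the name DLessDash comes from the text above; the numbering, the other ID names, and the encode_operator helper are assumptions.

```python
# Hypothetical numbering; only the name DLessDash comes from the post.
TOKEN_IDS = {"<<-": 1, "<<": 2, "<": 3}
ID_NAMES = {1: "DLessDash", 2: "DLess", 3: "Less"}


def encode_operator(text):
    """Return the one-byte encoding of a redirect operator's ID."""
    return bytes([TOKEN_IDS[text]])


encoded = encode_operator("<<-")   # three bytes of source text in
name = ID_NAMES[encoded[0]]        # one byte out, still identifiable
```

The executor only needs the ID to decide what to do, so the original spelling can be dropped from the execution tree. A printer that must reproduce the source text would keep it.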

I'm probably going to have two trees: PTree and ETree. PTree is closer to
parsing, and ETree is closer to execution.

(This is not the same thing as the AST vs. parse-tree distinction. That
difference is essentially whether you can automatically generate a tree from a
grammar, without semantic actions, like ANTLR 4 does.

My parser is hand-written and has "semantic actions" to prune the tree as it's
being created; hence I'm producing an AST.)

I don't like the ANTLR framework.

The typical compiler class is kind of a toy. I would like to see different
representations. The nano-pass compiler paper kind of talks about this.

Skipping over some posts. I'm skipping over the re2c post, waiting until I
publish the code. The current code doesn't use it, but the original C++
implementation did.
Python is missing abstractions for describing data.