It also taught me that comments are not lexical constructs in shell, as
they are in most languages. Correctly recognizing them depends on knowledge of
what words and operators are, which isn't dealt with in the lexer. Consider:

$ echo foo:#not-comment
> echo foo;#comment
foo:#not-comment
foo

The colon is not an operator, so it and the next # are part of a
single word. In contrast, the semi-colon is an operator. This means the
next token must begin a new word, and in this case it's a comment.
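
To make the rule concrete, here is a small sketch that runs a few more
variations. The function name show_comment_rules is made up for illustration,
and it assumes a non-interactive POSIX-style shell such as bash or dash, where
comments are always enabled.

```shell
# Hypothetical helper: '#' only starts a comment where a new word could begin.
show_comment_rules() {
    # '#' in the middle of a word is literal.
    echo foo#bar
    # ':' is not an operator, so the word continues through '#'.
    echo foo:#bar
    # A blank ends the word, so '#bar' is a comment.
    echo foo #bar
    # ';' is an operator, so '#bar' is a comment here too.
    echo foo;#bar
}

show_comment_rules
```

The first two commands print their argument with the # intact, and the last
two print only foo.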

The two remaining issues are:

Unicode encodings other than UTF-8

The parser uses Python 3 strings, so it has no problem with code in the UTF-8
encoding. But git has two test scripts with non-UTF-8 Unicode (t4201 and
t7831).

I'm torn on the issue of supporting other encodings, and the best way to
resolve that is by examining real world usage.

On the one hand, compatibility with bash is good. On the other hand, if
there are only two files with non-UTF-8 encodings out of dozens of projects
totalling a million lines of code, then I'll be tempted to follow Go's
approach: source code must be UTF-8.

This approach is simpler, uses less memory, and should reduce portability
problems stemming from libc. (Bash uses various libc functions to
support multiple encodings and locales.)

A Subtlety with Static Parsing

The second remaining issue that git uncovered relates to static
parsing. If you look at line 10 of git-gui.sh,
you'll see something odd:

Wait, that's not shell anymore! It turned into Tcl code. Even when
non-interactive, the shell is a REPL that parses and executes each top-level
command in sequence. When it hits exec, the REPL must stop, so nothing
else is parsed.
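
The effect is easy to reproduce with a sketch. The script below is a stand-in,
not the real git-gui.sh: it writes a temporary file whose tail is deliberately
not valid shell, then runs it with sh. The function name and the file contents
are invented for illustration, and mktemp is assumed to be available.

```shell
# Hypothetical demo: the lines after 'exec' are never parsed by sh.
demo_exec_stops_parsing() {
    # Assumption: mktemp exists on this system (it is not in POSIX).
    script=$(mktemp) || return 1
    cat > "$script" <<'EOF'
echo 'before exec'
exec echo 'replaced by exec'
set appvers {this is not shell (((
proc unbalanced { "
EOF
    # sh reads and runs one command at a time, so it never reaches the
    # invalid lines: exec replaces the process first.
    sh "$script"
    status=$?
    rm -f "$script"
    return "$status"
}

demo_exec_stops_parsing
```

If the shell parsed the whole file before running anything, this would fail
with a syntax error instead of printing the two lines.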

Although oil can't statically parse a whole file without breaking certain
shell scripts like git-gui.sh, it's not a problem in practice because there's
a difference between "executing" a function definition and calling the
function. When the shell "executes" a definition, it just puts the function's
parsed representation in a lookup table:

$ echo before
> f() { echo 'not called, but parsed and stored'; }
> echo after
before
after

So, as long as all code is in functions, and there is a single top-level main
"$@" call, oil will statically parse all of the code.

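A minimal sketch of that layout follows. The function names greet and main are
made up for illustration and don't come from any of the projects discussed.

```shell
#!/bin/sh
# Hypothetical script layout: every command lives inside a function.

greet() {
    echo "hello, $1"
}

main() {
    # Loop over the script's arguments, passed through unchanged.
    for name in "$@"; do
        greet "$name"
    done
}

# The only top-level command besides the definitions.
main "$@"
```

Because the top level contains nothing but function definitions and the final
call, a parser that reads the whole file up front should see the same code the
shell would eventually execute.
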
Recap

The parser is converging pretty quickly. Git is the eighth project I've
discussed, but there are now many more projects that it handles correctly.

The nice thing is that there have been no architectural changes for a while;
it's all been polish around the edges. I will write about this architecture in
detail later, but a core observation is that it's four interleaved parsers for
four sublanguages.

I've also consciously avoided any "clever" features of Python, so this parser
can be ported almost line-for-line to many languages, including C++.

I want to set it up with a few more blog posts, but otherwise there's no reason
not to release the code so people can play with it. I expect that to be this
month.