Declarative Rust static analysis

an interactive exploration

Rust's Macros 2.0
are intuitive: demonstrate a pattern, and the compiler can insert it into
your program wherever you invoke the macro. Inspired by this syntax, I wondered:
Could you “run a macro backwards”—use the same by-example
language to describe patterns to search for?
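Running forwards, the by-example idea looks like this (a standard macro_rules! definition; the macro itself is my own toy example):

```rust
// Macros by example: the left-hand side demonstrates a pattern with
// metavariables ($a, $b), and the compiler substitutes them wherever
// the macro is invoked.
macro_rules! swap_vals {
    ($a:ident, $b:ident) => {
        let tmp = $a;
        $a = $b;
        $b = tmp;
    };
}

fn main() {
    let mut x = 1;
    let mut y = 2;
    swap_vals!(x, y);
    assert_eq!((x, y), (2, 1));
}
```

Running it "backwards" would mean taking the same ($a:ident, $b:ident) pattern language and finding every place in a codebase where existing code already has that shape.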

One concrete goal: a mega-lint that searches for code that could be simplified
by using any utility function from the standard library.
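For instance (my own illustration of the kind of code such a lint could target, not an example from the library):

```rust
// Hand-rolled code that re-implements a standard-library utility;
// a by-example pattern could match this shape and suggest is_some().
fn has_value(x: Option<i32>) -> bool {
    match x {
        Some(_) => true,
        None => false,
    }
}

fn main() {
    // The lint would point at the stdlib equivalent:
    assert_eq!(has_value(Some(3)), Some(3).is_some());
    assert_eq!(has_value(None), None::<i32>.is_some());
}
```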

These goals present some challenging requirements: support searching for new
patterns at runtime, but also be fast enough to search a large codebase for
tens of thousands of patterns at once.

To work out the key design decisions, I built a prototype,
comacro. The library is still in
an early research phase, and lacks a lot of necessary functionality, like
optional arguments, argument types other than ident and
expr, and useful error messages. But the core functionality is
working and I thought it might be enough for a demo, so you can try it out
below!

Implementation details

The general idea is to compile patterns (and the code to be searched) into a
form suitable for regex-like matching.

Extend the AST, or slip metavariables under the parser's radar?

Pattern definitions need to be parsed as a Rust AST so that metavariables will
match the right AST types, optional nodes will work right, etc. Syntax that can
contain metavariables of various AST types is naturally a superset syntax of
Rust. Early on I faced a choice: do it “properly” by extending
a Rust parser with new node types, or use “the delta-value
hack”: replace metavariables in the token stream with placeholders,
twice—with different placeholders. Parse both versions as standard
syntax, linearize, and then compare: differences between the streams encode
metavars. (See why I call it a hack?)
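A minimal sketch of the hack, simulating the parse-and-linearize steps with plain whitespace tokenization (the helper names are mine; the real library parses both copies with syn and linearizes the resulting ASTs):

```rust
/// Replace each $metavariable with a placeholder identifier built from the
/// given tag, so the result is syntactically ordinary code.
fn substitute(pattern: &str, tag: &str) -> Vec<String> {
    pattern
        .split_whitespace()
        .map(|tok| {
            if let Some(name) = tok.strip_prefix('$') {
                format!("__mv_{}_{}", name, tag)
            } else {
                tok.to_string()
            }
        })
        .collect()
}

/// Do the substitution twice with different placeholders; positions where
/// the two linearized streams disagree are exactly the metavariables.
fn metavar_positions(pattern: &str) -> Vec<usize> {
    let a = substitute(pattern, "a");
    let b = substitute(pattern, "b");
    (0..a.len()).filter(|&i| a[i] != b[i]).collect()
}

fn main() {
    // $i and $iter are metavariables; diffing recovers their positions.
    assert_eq!(metavar_positions("for $i in $iter { }"), vec![1, 3]);
}
```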

So I thought, I'll do the work upfront to get it right and have less technical
debt later. I soon discovered, though, that extending a Rust parser is the
technical debt equivalent of taking out a private student loan: it requires
maintaining a substantially-modified version of a complex library; there's a
lot of room for bugs like precedence issues with the new expression types, and
it would require far-reaching changes to modify its handling of
Ident to be an enum since it uses
proc_macro2's Ident unencapsulated. And this is all
just working with syn—the easy middle-end. I don't even want
to think about what redoing all this for rustc::hir would involve!
As bad as “do it twice with magic values” may sound, encoding all
the metadata into syntactically-normal Rust preserves the abstraction boundary
that all the complexity of parsing Rust is on the other side of.

Trace: language of the trees

Once the parser determines the AST structure, a visitor serializes the tree to
a regex-like format—a compact linear encoding I'm calling Trace, with just
enough information to compare two trees and identify any subtrees that differ.
(In the demo above, the serialized forms are a readable representation of this
binary format.) Trace supports memcmp-like comparison without
building a stack; when a comparison fails, if the current position in the
pattern is a wildcard it resumes after consuming one subtree by counting
delimiters. To match against many patterns at once, both simply and quickly, it
would be possible to build an NFA from a collection of Trace patterns. The
initial comparison (or NFA run) identifies patterns that match the input
structurally; structural matches can then be checked by extracting subtrees
to verify that each metavariable binds a set of equivalent subtrees. (There's
a subtree-hashing/partition-refinement approach to doing these checks quickly,
but the naive approach is probably fine, since structural matching will filter
out the majority of the candidates.)
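A sketch of the idea with an illustrative token type (not the library's actual encoding): the flattened stream is compared position by position, and a wildcard consumes exactly one input subtree by tracking delimiter depth, so no stack of AST nodes is needed:

```rust
/// Illustrative Trace-like tokens: trees flattened to a linear stream.
#[derive(PartialEq)]
enum Tok {
    Open,               // start of a subtree
    Close,              // end of a subtree
    Leaf(&'static str), // atomic node (one whole subtree by itself)
    Wild,               // pattern-only: match any one subtree
}

/// memcmp-like walk: literal tokens must match exactly; a wildcard skips
/// one input subtree by counting delimiters.
fn matches(pattern: &[Tok], input: &[Tok]) -> bool {
    let mut i = 0; // input position
    for p in pattern {
        match p {
            Tok::Wild => {
                let mut depth = 0i32;
                loop {
                    match input.get(i) {
                        Some(Tok::Open) => depth += 1,
                        Some(Tok::Close) => depth -= 1,
                        Some(Tok::Leaf(_)) => {}
                        _ => return false, // ran out, or malformed input
                    }
                    i += 1;
                    if depth == 0 {
                        break; // consumed exactly one subtree
                    }
                }
            }
            p => {
                if input.get(i) != Some(p) {
                    return false;
                }
                i += 1;
            }
        }
    }
    i == input.len()
}

fn main() {
    use Tok::*;
    // Pattern for `add($a, $b)`; input is `add(x, f(y))`.
    let pattern = [Open, Leaf("call"), Leaf("add"), Wild, Wild, Close];
    let input = [
        Open, Leaf("call"), Leaf("add"), Leaf("x"),
        Open, Leaf("call"), Leaf("f"), Leaf("y"), Close,
        Close,
    ];
    assert!(matches(&pattern, &input));
    // A `mul` pattern fails structurally before any subtree extraction.
    let other = [Open, Leaf("call"), Leaf("mul"), Wild, Wild, Close];
    assert!(!matches(&other, &input));
}
```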

Differential execution: easier than reconstructing the eviscerated

Trace is specialized for matching; deserializing it back to an AST would be
technically possible, but complex. Instead, operations that need to compare two
trees use a differential-execution approach: visit the input's tree with a
retracer that compares the Trace that tree would emit against the
pattern's Trace. When the retracer finds a difference, the visitor still has
access to the AST objects that produced the differing Trace, so it can capture
them to bind metavariables, emit the debugging representations seen above, or
do other fun things with them.
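A toy version of the retracer over a two-variant expression tree (illustrative types, not comacro's): the pattern's Trace is consumed token by token while walking the real AST, so at a wildcard the visitor is still holding the live node and can capture it, with no deserialization step:

```rust
/// A tiny AST to stand in for syn's expression types.
#[derive(Debug, PartialEq)]
enum Expr {
    Num(i64),
    Add(Box<Expr>, Box<Expr>),
}

/// The pattern's pre-serialized Trace tokens.
enum PTok {
    Num(i64),
    Add,
    Wild, // matches any one subtree
}

/// Walk the input AST, generating its trace on the fly and comparing it
/// against the pattern's trace. On a wildcard, capture the AST node.
/// Returns None on a structural mismatch.
fn retrace<'a>(
    expr: &'a Expr,
    pattern: &mut std::slice::Iter<'_, PTok>,
    bindings: &mut Vec<&'a Expr>,
) -> Option<()> {
    match pattern.next()? {
        PTok::Wild => {
            // We still hold the AST node that would have produced the
            // divergent tokens, so bind it directly.
            bindings.push(expr);
            Some(())
        }
        PTok::Add => {
            if let Expr::Add(l, r) = expr {
                retrace(l, pattern, bindings)?;
                retrace(r, pattern, bindings)
            } else {
                None
            }
        }
        PTok::Num(n) => {
            if *expr == Expr::Num(*n) { Some(()) } else { None }
        }
    }
}

fn main() {
    // Pattern: `$a + 1`. Input: `(2 + 3) + 1`.
    let pattern = vec![PTok::Add, PTok::Wild, PTok::Num(1)];
    let input = Expr::Add(
        Box::new(Expr::Add(Box::new(Expr::Num(2)), Box::new(Expr::Num(3)))),
        Box::new(Expr::Num(1)),
    );
    let mut bindings = Vec::new();
    assert!(retrace(&input, &mut pattern.iter(), &mut bindings).is_some());
    // The wildcard bound the whole `(2 + 3)` subtree as a live AST node.
    let expected = Expr::Add(Box::new(Expr::Num(2)), Box::new(Expr::Num(3)));
    assert_eq!(bindings, vec![&expected]);
}
```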

Building the demo

This library isn't really intended to run in a webpage, but I've been looking
for an excuse to use WASM for something, and with Rust and
wasm-pack, Rusting once and
running anywhere is easier done than said. Put your next project in a web page,
just because you can!

I also had to remember how to JavaScript and HTML (everything is reasonably
standardized and the tooling is awesome now!? I suddenly feel like my childhood
was full of unnecessary suffering!). Once I figure out how to compose two
libraries that both want to be used via npm init (?), I'll
probably end up rewriting most of the JavaScript in OCaml with BuckleScript,
but I'll shave that yak later.

I think Clippy could benefit from an implementation of comacro with a
hir middle-end and some simple extensions beyond standard macro
syntax (definitely optional metavars, probably a capture syntax to match a node
but also bind it for further inspection)—keeping the pattern syntax
minimal, and using it to describe the overall structure simply. A pattern's
associated match handler can then query additional properties about its match
bindings.

I looked at 10 recently added lints to try to estimate how often this might be
useful: