Writing Python inside your Rust code — Part 1

About a year ago, I published a Rust crate called inline-python,
which allows you to easily mix some Python into your Rust code using a python!{ .. } macro.
In this series, I’ll go through the process of developing this crate from scratch.

Sneak preview

If you’re not familiar with the inline-python crate, this is what it allows you to do:

It allows you to embed Python code right between your lines of Rust code.
It even allows you to use your Rust variables inside the Python code.

We’ll start with a much simpler case, and slowly work our way up to this result (and more!).

Running Python code

First, let’s take a look at how we can run Python code from Rust.
Let’s try to make this first simple example work:

fn main() {
    println!("Hello ...");
    run_python("print(\"... World!\")");
}

We could implement run_python by using std::process::Command
to run the python executable and pass it the Python code,
but if we ever expect to be able to define and read back Python variables,
we’re probably better off if we start by using the PyO3 library instead.

PyO3 gives us Rust bindings for Python.
It nicely wraps the Python C API,
letting us interact with all kinds of Python objects directly from Rust.
(And even make Python libraries in Rust, but that’s a whole other topic.)

Its Python::run
function looks exactly like what we need. It takes the Python code as a &str,
and allows us to define any variables in scope using two optional PyDicts.
Let’s give it a try:

fn run_python(code: &str) {
    // Acquire the 'global interpreter lock', as Python is not thread-safe.
    let py = pyo3::Python::acquire_gil();
    py.python().run(code, None, None).unwrap(); // No globals, no locals.
}
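As a sketch of how those two dictionaries could later be used to expose a Rust variable to the Python code (run_python_with_x is a made-up name, and this assumes the same GILGuard-based PyO3 API as above):

```rust
use pyo3::types::PyDict;

// Hypothetical variant of run_python that makes one Rust variable
// visible to the Python code through the `locals` dictionary.
fn run_python_with_x(code: &str, x: i32) {
    let gil = pyo3::Python::acquire_gil();
    let py = gil.python();
    let locals = PyDict::new(py);
    locals.set_item("x", x).unwrap();
    py.run(code, None, Some(locals)).unwrap();
}

fn main() {
    run_python_with_x("print(x + 1)", 41);
}
```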

Macros defined by macro_rules! cannot execute any code at compile time; they only apply
replacement rules based on patterns.
That's great for things like vec![], and even lazy_static!{ .. },
but not powerful enough for things such as parsing and compiling regular expressions (e.g. regex!("a.*b")).

In the matching rules of a macro, we can match on things like expressions, identifiers, types, and many other things.
Since ‘valid Python code’ is not an option, we’ll just make our macro accept anything: raw tokens, as many as needed:

macro_rules! python {
    ($($code:tt)*) => {
        ...
    };
}

(See the resources linked above for details on how macro_rules! works.)

An invocation of our macro should result in run_python(".."),
with all the Python code wrapped in that string literal.
We're lucky: there's a built-in macro that puts things in a string for us,
called stringify!.
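To make this concrete, here's a sketch of the completed macro_rules! attempt, with run_python swapped for a stub that simply returns the code instead of running it (so the example doesn't need PyO3):

```rust
// Stub standing in for the PyO3-based run_python from earlier:
// it returns the code so we can inspect what stringify! produced.
fn run_python(code: &str) -> String {
    code.to_string()
}

macro_rules! python {
    ($($code:tt)*) => {
        run_python(stringify!($($code)*))
    };
}

fn main() {
    let code = python! {
        print("... World!")
    };
    // The original white-space (and any comments) is already gone here.
    println!("{}", code);
}
```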

Not only does it remove all unnecessary white-space, it even removes comments.
The reason is that we're working with tokens here, not with the original source code:
individual tokens such as a, 123, and b.

One of the first things rustc does, is to tokenize the source code.
This makes it easier to do the rest of the parsing,
not having to deal with individual characters like 1, 2, 3,
but only with tokens such as ‘integer literal 123’.
Also, white-space and comments are gone after tokenizing,
as they are meaningless for the compiler.

stringify!() is a way to convert a bunch of tokens back to a string,
but on a 'best effort' basis: it will convert the tokens back to text,
and only insert spaces between tokens when needed
(to avoid turning b, c into bc).

So this is a bit of a dead end.
Rustc has carelessly thrown our precious white-space away,
which is very significant in Python.

We could try to have some code guess which spaces have to be replaced back by newlines,
but indentation is definitely going to be a problem:

The two snippets of Python code have a different meaning, but stringify! gives us the same result for both.
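Any pair of snippets that differs only in indentation shows the problem; for example, whether g() belongs to the if body below changes the meaning in Python, yet the token streams are identical:

```rust
fn main() {
    // In Python, the first snippet runs g() unconditionally,
    // while the second runs it only when x is true.
    let first = stringify! {
        if x:
            f()
        g()
    };
    let second = stringify! {
        if x:
            f()
            g()
    };
    // Same tokens, so stringify! cannot tell the two apart.
    assert_eq!(first, second);
    println!("{}", first);
}
```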

Before giving up, let’s try the other type of macros.

Procedural macros

Rust's procedural macros
are another way to define macros.
Whereas macro_rules! can only define ‘function-style macros’ (those with an !),
procedural macros can also define custom derive macros (e.g. #[derive(Stuff)])
and attribute macros (e.g. #[stuff]).

Procedural macros are implemented as a compiler plugin.
You get to write a function that gets access to the token stream the compiler sees,
can do whatever it wants, and then needs to return a new token stream which
the compiler will use instead (or in addition, in the case of a custom derive):

#[proc_macro]
pub fn python(input: TokenStream) -> TokenStream {
    todo!()
}

That TokenStream there doesn't bode well.
We need the original source code, not just the tokens.
But let’s just continue anyway.
Maybe a procedural macro gives us more flexibility to hack our way around any problems.

Because procedural macros run Rust code as part of the compilation process,
they need to go in a separate proc-macro crate,
which is compiled before you can compile anything that uses it.

$ cargo new --lib python-macro
Created library `python-macro` package

In python-macro/Cargo.toml:

[lib]
proc-macro = true

In Cargo.toml:

[dependencies]
python-macro = { path = "./python-macro" }

Let’s start with an implementation that just panics (todo!()),
after printing the TokenStream:
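A sketch of that first version, in python-macro/src/lib.rs:

```rust
use proc_macro::TokenStream;

#[proc_macro]
pub fn python(input: TokenStream) -> TokenStream {
    // Print the tokens the compiler handed us, then give up.
    println!("{}", input);
    todo!()
}
```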

Rust complains that ‘procedural macros cannot be expanded to statements’, and something about enabling ‘hygienic macros’.
Macro hygiene is the wonderful feature of Rust macros to not accidentally ‘leak’ any names to the outside world (or the reverse).
If a macro expands to code that uses some temporary variable named x, it will be separate from any x that appears in any code outside of the macro.

However, this feature isn’t stable yet for procedural macros.
The result is that procedural macros are not (yet) allowed to appear in any place other than as an item by itself (e.g. at file scope, but not inside a function).

Our procedural macro panics as expected, after showing us the input it got as a string:

print("... World!") print("Bye.")

Again, as expected, with the white-space thrown away. :(

Time to give up.

Or.. Maybe there’s a way to work around this.

Reconstructing white-space

Although rustc only works with tokens while parsing and compiling,
it somehow still knows exactly where to point when it has errors to report.
There’s no newlines left in the tokens, but it still knows our error happened on lines 6 through 9. How?

It turns out that tokens contain quite a bit of information. They contain a Span, which
is basically the start and end location of the token in the original source file.
The Span can tell which file, line, and column number a token starts and ends at.

If we can get to this information, we can reconstruct the white-space by
putting spaces and newlines between tokens to match their line and column information.
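Sketched on plain data rather than real spans (the (text, line, column) tuples and the reconstruct helper are made up for illustration):

```rust
// Each token as (text, line, column), the way a span could report it.
// Re-insert newlines and spaces so every token lands back on its
// original line and column. Lines are 1-based, columns 0-based.
fn reconstruct(tokens: &[(&str, usize, usize)]) -> String {
    let mut out = String::new();
    let (mut line, mut col) = (1, 0);
    for &(text, tok_line, tok_col) in tokens {
        while line < tok_line {
            out.push('\n');
            line += 1;
            col = 0;
        }
        while col < tok_col {
            out.push(' ');
            col += 1;
        }
        out.push_str(text);
        col += text.chars().count();
    }
    out
}

fn main() {
    // Tokens of two print statements, as a tokenizer might report them.
    let tokens = [("print", 1, 0), ("(\"hi\")", 1, 5), ("print", 2, 0), ("(\"bye\")", 2, 5)];
    assert_eq!(reconstruct(&tokens), "print(\"hi\")\nprint(\"bye\")");
}
```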

Functions that give us this information are not yet stable and
gated behind #![feature(proc_macro_span)].
Let’s enable it, and see what we get:
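A sketch of that experiment in the proc-macro crate, assuming the unstable Span::start() API that reports a line/column pair:

```rust
#![feature(proc_macro_span)]

use proc_macro::TokenStream;

#[proc_macro]
pub fn python(input: TokenStream) -> TokenStream {
    // Print each top-level token together with its source location.
    for token in input {
        let start = token.span().start();
        println!("{}:{}: {}", start.line, start.column, token);
    }
    todo!()
}
```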

But there are only four tokens?
It turns out ("... World!") appears as one token here, and not as three ((, "... World!", and )).
If we look at the documentation of TokenStream,
we can see it doesn’t give us a stream of tokens, but of token trees.
Apparently Rust’s tokenizer already matches parentheses (and braces and brackets)
and doesn’t just give a linear list of tokens, but a tree of tokens.
Tokens inside parentheses will be children of a single Group token.

Let’s modify our procedural macro to recursively go over all the tokens inside groups as well (and improve the output a bit):
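A sketch of that version: a recursive helper (add_token, a made-up name) walks into each Group and uses the spans to put every token back at its reported line and column, much like before. (For simplicity, the closing delimiter of a group is not placed at its exact original column.)

```rust
#![feature(proc_macro_span)]

use proc_macro::{Delimiter, TokenStream, TokenTree};

#[proc_macro]
pub fn python(input: TokenStream) -> TokenStream {
    let mut source = String::new();
    let (mut line, mut col) = (1, 0);
    for token in input {
        add_token(&mut source, token, &mut line, &mut col);
    }
    println!("{}", source);
    todo!()
}

// Append one token tree to `source`, first pushing newlines and
// spaces until we reach the position its Span reports.
fn add_token(source: &mut String, token: TokenTree, line: &mut usize, col: &mut usize) {
    let start = token.span().start();
    while *line < start.line {
        source.push('\n');
        *line += 1;
        *col = 0;
    }
    while *col < start.column {
        source.push(' ');
        *col += 1;
    }
    match token {
        TokenTree::Group(group) => {
            // Re-emit the group's delimiters ourselves, and recurse.
            let (open, close) = match group.delimiter() {
                Delimiter::Parenthesis => ('(', ')'),
                Delimiter::Brace => ('{', '}'),
                Delimiter::Bracket => ('[', ']'),
                Delimiter::None => (' ', ' '),
            };
            source.push(open);
            *col += 1;
            for token in group.stream() {
                add_token(source, token, line, col);
            }
            source.push(close);
            *col += 1;
        }
        other => {
            let text = other.to_string();
            *col += text.chars().count();
            source.push_str(&text);
        }
    }
}
```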

Okay, that works, but what’s with all the extra newlines and spaces?
Oh right, the first token starts at line 7 column 8, so it correctly puts print on line 7 in column 8.
The location we’re looking at is the exact location in the .rs file.

The extra newlines at the start are not a problem (empty lines have no effect in Python).
It even has a nice side effect: When Python reports an error, the line number it reports
will match the line number in the .rs file.

However, the 8 spaces are a problem.
Although the Python code inside our python!{..} is properly indented with respect to our Rust code,
the Python code we extract should start at a ‘zero’ indentation level.
Otherwise Python will complain about invalid indentation.

Let’s subtract the column number of the first token from all column numbers:
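Sketched on plain data again, with the same made-up (text, line, column) token format: subtract the first token's column from every token's column, leaving the line numbers alone.

```rust
// Shift all tokens left by the column of the first token, so the
// reconstructed Python starts at indentation level zero.
fn dedent<'a>(tokens: &[(&'a str, usize, usize)]) -> Vec<(&'a str, usize, usize)> {
    let offset = tokens.first().map_or(0, |t| t.2);
    tokens
        .iter()
        .map(|&(text, line, col)| (text, line, col.saturating_sub(offset)))
        .collect()
}

fn main() {
    // The python!{..} body started at column 8; shift it back to column 0.
    let tokens = [("print", 7, 8), ("(\"hi\")", 7, 13)];
    assert_eq!(dedent(&tokens), [("print", 7, 0), ("(\"hi\")", 7, 5)]);
}
```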