rustgo: calling Rust from Go with near-zero overhead

Go has good support for calling into assembly, and a lot of the fast cryptographic code in the stdlib is carefully optimized assembly, bringing speedups of over 20 times.

However, writing assembly code is hard, reviewing it is possibly harder, and cryptography is unforgiving. Wouldn't it be nice if we could write these hot functions in a higher level language?

This post is the story of a slightly-less-than-sane experiment to call Rust code from Go fast enough to replace assembly. No need to know Rust, or compiler internals, but knowing what a linker is would help.

Why Rust

I'll be upfront: I don't know Rust, and don't feel compelled to do my day-to-day programming in it. However, I know Rust is a very tweakable and optimizable language, while still more readable than assembly. (After all, everything is more readable than assembly!)

Go strives to find defaults that are good for its core use cases, and only accepts features that are fast enough to be enabled by default, in a constant and successful fight against knobs. I love it for that. But for what we are doing today we need a language that won't flinch when asked to generate stack-only functions with manually hinted away safety checks.

So if there's a language that we might be able to constrain enough to behave like assembly, and to optimize enough to be as useful as assembly, it might be Rust.

Finally, Rust is safe, actively developed, and not least, there's already a good ecosystem of high-performance Rust cryptography code to tap into.

By using the C ABI as lingua franca of FFIs, we can call anything from anything: Rust can compile into a library exposing the C ABI, and cgo can use that. It's awkward, but it works.

We can even use reverse-cgo to build Go into a C library and call it from random languages, like I did with Python as a stunt. (It was a stunt folks, stop taking me seriously.)

But cgo does a lot of things to enable that bit of Go naturalness it provides: it will set up a whole stack for C to live in, and it makes defer calls to prepare for a panic in a Go callback... this could be a whole post of its own.

As a result, the performance cost of each cgo call is way too high for the use case we are thinking about—small hot functions.

Linking it together

So here's the idea: if we have Rust code that is as constrained as assembly, we should be able to use it just like assembly, and call straight into it. Maybe with a thin layer of glue.

We don't have to work at the IR level: since Go 1.3, the Go compiler converts both Go code and its high-level assembly into machine code before linking.

This is confirmed by the existence of "external linking", where the system linker is used to put together a Go program. It's how cgo works, too: it compiles C with the C compiler, Go with the Go compiler, and links it all together with clang or gcc. We can even pass flags to the linker with CGO_LDFLAGS.

After all, underneath all the safety features of cgo, we are sure to find a plain cross-language function call.

It would be nice if we could figure out how to do this without patching the compiler, though. First, let's figure out how to link a Go program with a Rust archive.

Thankfully go build is nothing but a frontend! Go offers a set of low-level tools to compile and link programs; go build just collects files and invokes those tools. We can follow what it does by using the -x flag.

I built this small Makefile by following a -x -ldflags "-v -linkmode=external '-extldflags=-v'" invocation of a cgo build.

That looks like an interesting pragma! //go:linkname just creates a symbol alias in the local scope (which can be used to call private functions!), and I'm pretty sure the byte trick is only cleverness to have something to take the address of, but //go:cgo_import_static... this imports an external symbol!

Armed with this new tool and the Makefile above, we have a chance to invoke this Rust function (hello.rs)
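The hello.rs listing isn't reproduced above; as a sketch of the shape such a function would take (the name hello comes from the filename, the return value and the main added to exercise it standalone are assumptions):

```rust
// Sketch of a hello.rs-style function: unmangled name, C ABI, no
// dependencies, so the symbol can be imported and jumped into from Go.
#[no_mangle]
pub extern "C" fn hello() -> u64 {
    42
}

// A main only so this sketch runs standalone; the real experiment
// linked the function into a Go binary instead.
fn main() {
    assert_eq!(hello(), 42);
    println!("hello() = {}", hello());
}
```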

Well, it crashes when it tries to return. Also that $2048 value is the whole stack size Rust is allowed (if it's even putting the stack in the right place), and don't ask me what happens if Rust tries to touch a heap... but hell, I'm surprised it works at all!

Calling conventions

Now, to make it return cleanly, and take some arguments, we need to look more closely at the Go and Rust calling conventions. A calling convention defines where arguments and return values sit across function calls.

The Go calling convention is described here and here. For Rust we'll look at the default for FFI, which is the standard C calling convention.

The caller, seen above, does very little: it places the arguments on the stack in reverse order, at the bottom of its own frame (rsp to 16(rsp), remember that the stack grows down) and executes CALL. The CALL will push the return pointer to the stack and jump. There's no caller cleanup, just a plain RET.

Then there's the rsp management: the function subtracts 0x108, making space for the entire 0x100-byte frame and the 8 bytes of frame pointer in one go. So rsp points to the bottom (the end) of the function frame, and is callee managed. Before returning, rsp is restored to where it was (just past the return pointer).

Then there's the frame pointer, which is effectively pushed to the stack just after the return pointer, with rbp updated to point at it. So rbp is also callee saved, and should point at where the caller's rbp is stored, to enable stack trace unwinding.

Finally, from the body itself we learn that return values go just above the arguments.

Virtual registers

The Go docs say that SP and FP are virtual registers, not just aliases of rsp and rbp.

Indeed, when accessing SP from Go assembly, the offsets are adjusted relative to the real rsp so that SP points to the top, not the bottom, of the frame. That's convenient because it means not having to change all offsets when changing the frame size, but it's just syntactic sugar. Naked access to the register (like MOVQ SP, DX) accesses rsp directly.

The FP virtual register is simply an adjusted offset over rsp, too. It points to the bottom of the caller frame, where arguments are, and there's no direct access.

Note: Go maintains rbp and frame pointers to help debugging, but then uses a fixed rsp and omit-stack-pointer-style rsp offsets for the virtual FP. You can learn more about frame pointers and not using them from this Adam Langley blog post.

We care little about this, since in Go all registers are caller-saved.

The stack must be aligned to 16 bytes.

(I think this is why JMP worked and CALL didn't: we failed to align the stack!)

Frame pointers work the same way (and are generated by rustc with -g).

Gluing them together

Building a simple trampoline between the two conventions won't be hard. We can also look at asmcgocall for inspiration, since it does approximately the same job, but for cgo.

We need to remember that we want the Rust function to use the stack space of our assembly function, since Go ensured for us that it's present. To do that, we have to roll back rsp from the end of the stack.

CALL on macOS

CALL didn't quite work on macOS. For some reason, there the function call was replaced with an intermediate call to _cgo_thread_start, which is not that incredible considering we are using something called cgo_import_static and that CALL is virtual in Go assembly.

callq 0x40a27cd ; x_cgo_thread_start + 29

We can bypass that "helper" by using the full //go:linkname incantation we found in the standard library to take a pointer to the function, and then calling the function pointer, like this.

Is it fast?

The point of this whole exercise is to be able to call Rust instead of assembly for cryptographic operations (and to have fun). So a rustgo call will have to be almost as fast as an assembly call to be useful.

Benchmark time!

We'll compare incrementing a uint64 inline, with a //go:noinline function, with the rustgo call above, and with a cgo call to the exact same Rust function.
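The Rust function under benchmark isn't shown above; an increment over the C ABI would be along these lines (a sketch with an assumed name and signature, plus a crude timing loop standing in for the Go testing.B benchmark):

```rust
use std::time::Instant;

// Assumed shape of the benchmarked function: increment a u64, exported
// over the C ABI so both the rustgo trampoline and cgo can call it.
#[no_mangle]
pub extern "C" fn increment(n: u64) -> u64 {
    n.wrapping_add(1)
}

fn main() {
    // Crude stand-in for the Go benchmark loop.
    let start = Instant::now();
    let mut x = 0u64;
    for _ in 0..1_000_000 {
        x = increment(x);
    }
    assert_eq!(x, 1_000_000);
    println!("1e6 calls took {:?}", start.elapsed());
}
```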

Rust was compiled with -g -O, and the benchmarks were run on macOS on a 2.9GHz Intel Core i5.

To build the .a we use cargo build --release with a Cargo.toml that defines the dependencies, enables frame pointers, and configures curve25519-dalek to use its most efficient math and no standard library.
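That Cargo.toml isn't reproduced here; hedged against the description above, it would contain something along these lines (the exact package name, versions, and feature names are assumptions, not the original file):

```toml
# Sketch of a Cargo.toml matching the description above.
[package]
name = "ed25519-dalek-rustgo"
version = "0.0.1"

[lib]
crate-type = ["staticlib"]  # produce a C-style .a instead of an rlib

[dependencies.curve25519-dalek]
version = "0.9"
default-features = false    # no standard library
features = ["nightly"]      # the crate's most efficient math backend

[profile.release]
debug = true                # keep debug info (and, with it, frame pointers)
```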

Packaging up

Now we know it actually works. That's exciting! But to be usable it will have to be an importable package, not forced into package main by a weird build process.

This is where //go:binary-only-package comes in! That annotation allows us to tell the compiler to ignore the source of the package, and to only use the pre-built .a library file in $GOPATH/pkg.

If we can manage to build a .a file that works with Go's native linker (cmd/link, also referred to as the internal linker), we can redistribute that, and it will let our users import the package as if it were a native one, including cross-compiling (provided we include a .a for that platform)!

The Go side is easy, and pairs with the assembly and Rust we already have. We can even include docs for go doc's benefit.

//go:binary-only-package

// Package edwards25519 implements operations on an Edwards curve that is
// isomorphic to curve25519.
//
// Crypto operations are implemented by calling directly into the Rust
// library curve25519-dalek, without cgo.
//
// You should not actually be using this.
package edwards25519

import _ "unsafe"

//go:cgo_import_static scalar_base_mult
//go:linkname scalar_base_mult scalar_base_mult
var scalar_base_mult uintptr
var _scalar_base_mult = &scalar_base_mult

// ScalarBaseMult multiplies the scalar in by the curve basepoint, and writes
// the compressed Edwards representation of the resulting point to dst.
func ScalarBaseMult(dst, in *[32]byte)

The Makefile will have to change quite a bit—since we aren't building a binary anymore we don't get to keep using go tool link.

A .a archive is just a pack of .o object files in an ancient format with a symbol table. If we could get the symbols from the Rust libed25519_dalek_rustgo.a library into the edwards25519.a archive that go tool compile made, we should be golden.

.a archives are managed by the ar UNIX tool, or by its Go internal counterpart, cmd/pack (as in go tool pack). The two formats are ever-so-subtly different, of course. We'll need to use the platform ar for libed25519_dalek_rustgo.a and the Go cmd/pack for edwards25519.a.

(For example, the platform ar on my macOS uses the BSD convention of calling files #1/LEN and then embedding the filename of length LEN at the beginning of the file, to exceed the 16-character maximum filename length. That was confusing.)

To bundle the two libraries I tried doing the simplest (read: hackish) thing: extract libed25519_dalek_rustgo.a into a temporary folder, and then pack the objects back into edwards25519.a.

Well, it almost worked. We cheated. The binary would not link unless we also linked it against libresolv. To be fair, the Rust compiler tried to tell us. (But who listens to everything the Rust compiler tells you anyway?)

note: link against the following native artifacts when linking against this static library
note: the order and any duplication can be significant on some platforms, and so may need to be preserved
note: library: System
note: library: resolv
note: library: c
note: library: m

Now, linking against system libraries would be a problem, because that will never work with internal linking and cross-compilation...

But hold on a minute, libresolv?! Why does our no_std, "should be like assembly", stack only Rust library want to resolve DNS names?

I really meant no_std

The problem is that the library is not actually no_std. Look at all that stuff in there! We want nothing to do with allocators!

So how do we actually make it no_std? This turned out to be an entire side-quest, but I'll give you a recap.

If any dependency is not no_std, your no_std flag is nullified. One of the curve25519-dalek dependencies had this problem; cargo update fixed that.

Actually making a no_std staticlib (that is, a library for external use, as opposed to for inclusion in a Rust program) is more like making a no_std executable, which is much harder since it must be self-contained.
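Concretely, being self-contained means the crate has to supply what the runtime normally provides, starting with a panic handler. On modern Rust, a minimal no_std staticlib skeleton looks roughly like this (the 2017-era original would have used unstable lang items such as panic_fmt instead; the exported function's signature is an assumption and its body is elided):

```rust
#![no_std]

use core::panic::PanicInfo;

// With no standard library there is no default panic machinery, so the
// crate must bring its own handler; here we just spin.
#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    loop {}
}

// The C-ABI entry point the Go side imports; signature assumed.
#[no_mangle]
pub extern "C" fn scalar_base_mult(dst: *mut u8, scalar: *const u8) {
    // ... stack-only curve arithmetic would go here ...
}
```

Built with crate-type = ["staticlib"], something of this shape should produce a .a with no dangling references to allocators, unwinders, or libresolv.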

A friend thankfully suggested making sure that I was using --gc-sections to strip dead code, which might reference things I don't actually need. And sure enough, this worked. (That's three layers of flag-passing right there.)

But umh, in the Makefile we aren't using a linker at all, so where do we put --gc-sections? The answer is to stop hacking .as together and actually read the linker man page.

We can build a .o containing a given symbol and all the symbols it references with ld -r --gc-sections -u $SYMBOL. -r makes the object reusable for a later link, and -u marks a symbol as needed, or everything would end up garbage collected. $SYMBOL is scalar_base_mult in our case.

Why wasn't this a problem on macOS? It would have been if we linked manually, but the macOS compiler apparently does dead symbol stripping by default.

The last missing piece was internal linking on Linux. In short, it was not linking the Rust code, even if the compilation seemed to succeed: the relocations were not happening, and the CALL instructions in our Rust function were left pointing at meaningless addresses.

At that point I felt like it had to be a silent linker bug, the final boss in implementing rustgo, and reached out to people much smarter than me. One of them was guiding me in debugging cmd/link (which was fascinating!) when Ian Lance Taylor, the author of cgo, helpfully pointed out that //go:cgo_import_static is not enough for internal linking, and that I also wanted //go:cgo_import_dynamic.

I still have no idea why leaving it out would result in that issue, but adding it finally made our rustgo package compile both with external and internal linking, on Linux and macOS, out of the box.

Redistributable

Now that we can build a .a, we can take the suggestion in the //go:binary-only-package spec, and build a tarball with .as for linux_amd64/darwin_amd64 and the package source, to untar into a GOPATH to install.

Once installed like that, the package will be usable just like a native one, cross-compilation included (as long as we packaged a .a for the target)!

The only thing we have to worry about is that if we build Rust with -Ctarget-cpu=native it might not run on older CPUs. Thankfully benchmarks (and the curve25519-dalek authors) tell us that the only real difference is between post and pre-Haswell processors, so we only have to make a universal build and a Haswell one.

As the cherry on top, I made the Makefile obey GOOS/GOARCH, converting them as needed into Rust target triples, so if you have Rust set up for cross-compilation you can even cross-compile the .a itself.

Turning it into a real thing

Well, this was fun.

But to be clear, rustgo is not a real thing that you should use in production. For example, I suspect I should be saving g before the jump, the stack size is completely arbitrary, and shrinking the trampoline frame like that will probably confuse the hell out of debuggers. Also, a panic in Rust might get weird.

To make it a real thing I'd start by calling morestack manually from a NOSPLIT assembly function to ensure we have enough goroutine stack space (instead of rolling back rsp) with a size obtained maybe from static analysis of the Rust function (instead of, well, made up).

It could all be analyzed, generated and built by some "rustgo" tool, instead of hardcoded in Makefiles and assembly files. cgo itself is little more than a code-generation tool after all. It might make sense as a go:generate thing, but I know someone who wants to make it a cargo command. (Finally some Rust-vs-Go fighting!) Also, a Rust-side collection of FFI types like, say, GoSlice would be nice.

#[repr(C)]
struct GoSlice {
    array: *mut u8,
    len: i32,
    cap: i32,
}
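To show how such a type might be used on the Rust side, here's a hedged sketch (the sum_bytes function and the way the slice header is built are hypothetical), repeating the struct so the example is self-contained:

```rust
// Mirror of Go's slice header: pointer, length, capacity.
#[repr(C)]
pub struct GoSlice {
    array: *mut u8,
    len: i32,
    cap: i32,
}

// Hypothetical consumer: sum the bytes of a Go []byte passed by value
// as its (pointer, len, cap) header.
#[no_mangle]
pub extern "C" fn sum_bytes(s: GoSlice) -> u64 {
    let bytes = unsafe { core::slice::from_raw_parts(s.array, s.len as usize) };
    bytes.iter().map(|&b| u64::from(b)).sum()
}

fn main() {
    // Simulate what Go would hand us: a pointer/len/cap triple.
    let mut v = vec![1u8, 2, 3, 4];
    let s = GoSlice {
        array: v.as_mut_ptr(),
        len: v.len() as i32,
        cap: v.capacity() as i32,
    };
    let total = sum_bytes(s);
    assert_eq!(total, 10);
    println!("sum = {}", total);
}
```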

Or maybe a Go or Rust adult will come and tell us to stop before we get hurt.

EDIT: It was pointed out to me that if we simply named the Rust object file libed25519_dalek_rustgo.syso, we could skip all the go tool invocations and simply use go build which automatically links .syso files found in the package. But what's the fun in that.

Thanks (in no particular order) to David, Ian, Henry, Isis, Manish, Zaki, Anna, George, Kaylyn, Bill, David, Jess, Tony and Daniel for making this possible. Don't blame them for the mistakes and horrors, those are mine.

P.S. Before anyone tries to compare this to cgo (which has many more safety features) or pure Go, it's not meant to replace either. It's meant to replace manually written assembly with something much safer and more readable, with comparable performance. Or better yet, it was meant to be a fun experiment.