Porting Our Software to ARM64

As we enable more ARM64[1] machines in our network, I want to give some technical insight into the process we went through to reach software parity in our multi-architecture environment.

To give some idea of the scale of this task, it’s necessary to describe the software stack we run on our servers. The foundation is the Linux kernel. Then, we use the Debian distribution as our base operating system. Finally, we install hundreds of packages that we build ourselves. Some packages are based on open-source software, often tailored to better meet our needs. Other packages were written from scratch within Cloudflare.

Industry support for ARM64 is very active, so a lot of open-source software has already been ported. This includes the Linux kernel. Additionally, Debian made ARM64 a first-class release architecture starting with Stretch in 2017. This meant that upon obtaining our ARM64 hardware, a few engineers were able to bring Debian up quickly and smoothly. Our attention then turned to getting all our in-house packages to build and run for ARM64.

Our stack uses a diverse range of programming languages, including C, C++, Go, Lua, Python, and Rust. Different languages have different porting requirements, with some being easier than others.

Porting Go Code

Cross-compiling Go code is relatively simple, since ARM64 is a first-class citizen. Go compiles and links static binaries using the system's crossbuild toolchain, meaning the only additional Debian package we had to install on top of build-essential was crossbuild-essential-arm64.

After installing the crossbuild toolchain, we replaced every go build invocation with a loop of GOARCH=<arch> CGO_ENABLED=1 go build, where <arch> iterates through amd64 and arm64. Forcing CGO_ENABLED=1 is required because cgo is disabled by default when cross-compiling. The generated binaries are then run through our testing framework.
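Concretely, the build loop looks something like the following sketch (the package path and output names are illustrative, not our real layout):

```shell
# Build the same package for both architectures. CGO_ENABLED=1 must be
# set explicitly, because Go turns cgo off whenever GOARCH differs from
# the host. CC points cgo at the cross-compiler that ships in
# crossbuild-essential-arm64.
for arch in amd64 arm64; do
    cc=gcc
    [ "$arch" = "arm64" ] && cc=aarch64-linux-gnu-gcc
    GOARCH="$arch" CGO_ENABLED=1 CC="$cc" \
        go build -o "bin/myservice-$arch" ./cmd/myservice
done
```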

Porting Rust Code

Rust also has mature support for ARM64. Porting starts with installing crossbuild-essential-arm64 and defining the --target triple for either rustc or cargo. Different targets are bucketed into different tiers of completeness. Full instructions are well documented at rust-cross.
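As a sketch, a cross-build with cargo might look like this, assuming a rustup-managed toolchain (the target triple is the standard aarch64-unknown-linux-gnu; the linker comes from crossbuild-essential-arm64; older cargo versions spell the config file .cargo/config):

```shell
# Fetch the precompiled ARM64 standard library:
rustup target add aarch64-unknown-linux-gnu

# Tell cargo which cross-linker to use for this target:
cat >> .cargo/config.toml <<'EOF'
[target.aarch64-unknown-linux-gnu]
linker = "aarch64-linux-gnu-gcc"
EOF

# Build the crate (and all of its dependencies) for ARM64:
cargo build --release --target aarch64-unknown-linux-gnu
```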

One thing to note, however, is that any crates pulled in by a package must also be cross-compiled. The more crates used, the higher the chance of running into one that does not cross-compile well.

Testing, Plus Porting Other Code

Other languages are less cooperative when it comes to cross-compilation. Fiddling with CC and LD values didn't seem like the best use of engineering resources. What we really wanted was an emulation layer, which would let us leverage all of our x86_64 machines, from our distributed compute behemoths to developers' laptops, for both building and testing code.

Enter QEMU.

QEMU is an emulator with multiple modes, including both full system emulation and user-space emulation. Our compute nodes are beefy enough to handle system-level emulation, but for developers’ laptops, user-space emulation provides most of the benefits, with less overhead.

For user-space emulation, our porting team did not want to intrude too much on our developers' normal workflow. Our internal build system already uses Docker as a backend, so ideally a developer could simply docker run into an ARM64 environment.
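That ideal workflow would look something like this (the image name is illustrative):

```shell
# Start an interactive ARM64 userland on an x86_64 host, with the
# emulation wired up invisibly inside the image:
docker run -it internal-registry/debian-arm64:stretch /bin/bash
```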

Fortunately, we were not the first to come up with this idea: the folks over at resin.io had already solved this problem! They also submitted a patch to qemu-user that prepends the emulator to every execve call, similar to how binfmt_misc is implemented[2]. By prepending the emulator, you're essentially forcing every new process to be emulated as well, resulting in a nice self-contained environment.

With the execve patch built into qemu-user, all we had to do was copy the emulator into an ARM64 container and set the appropriate entrypoint.
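A minimal sketch of that image (base image and binary names are illustrative; the qemu-aarch64 binary is the statically linked, execve-patched build):

```shell
# Build an ARM64 image whose entrypoint is the patched qemu-user
# emulator, so every process started in the container is emulated:
cat > Dockerfile <<'EOF'
FROM arm64v8/debian:stretch
COPY qemu-aarch64 /usr/bin/qemu-aarch64
ENTRYPOINT ["/usr/bin/qemu-aarch64"]
EOF
docker build -t debian-arm64-emulated .
```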

LD_LIBRARY_PATH and Friends

It turns out that LD_LIBRARY_PATH was not the only environment variable that failed to work correctly. All environment variables, whether set on the command line or via other means (e.g. export), failed to propagate into the qemu-user process.

Through bisection of known-good code, we found that it was the setcap call in our Dockerfile that prevented the environment variable passthrough, likely because file capabilities put the binary into secure-execution mode, in which the environment is sanitized. Unfortunately, this same setcap is what allows us to call sudo, so our developers face a trade-off: they can either run sudo inside their containers or have environment variables passed through, but not both.

Intermittent Go Failures

With a decent amount of Go code running through our CI system, it was easy to spot a trend of intermittent segfaults.

Acting on the hunch that non-deterministic failures are generally due to threading issues, we confirmed that hypothesis. Unfortunately, opinion on the issue tracker showed that Go / QEMU incompatibilities aren't a priority, so we were left without an upstream fix.

The workaround we came up with is simple: if the problem is threading-related, limit where the threads can run! When we package our internal Go binaries, we add a .deb post-install script that detects whether we're running under ARM64 emulation and, if so, reduces the number of CPUs the binary can run on to one. We lose performance by pinning to one CPU, but the slowdown is negligible when we're already running under emulation, and slow code is better than non-working code.
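One way such a postinst hook might look; the detection heuristic (checking for the copied-in emulator) and the binary name are our illustration, not the exact script:

```shell
#!/bin/sh
# postinst sketch: when the arm64 package lands in our emulated build
# environment, replace the binary with a wrapper that pins it to a
# single CPU via taskset, sidestepping qemu-user threading crashes.
set -e
if [ "$(dpkg --print-architecture)" = "arm64" ] && [ -x /usr/bin/qemu-aarch64 ]; then
    mv /usr/bin/myservice /usr/bin/myservice.real
    cat > /usr/bin/myservice <<'EOF'
#!/bin/sh
exec taskset -c 0 /usr/bin/myservice.real "$@"
EOF
    chmod +x /usr/bin/myservice
fi
```

On real ARM64 hardware the emulator binary is absent, so the wrapper is never installed and the binary runs unpinned.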

With the workaround in place, reports of intermittent crashes dropped to zero. On to the next problem!

Shared Library Mixups

We like to be at the forefront of technology, from suggesting improvements to what would become TLS 1.3, to partnering with Mozilla to make DNS queries more secure, and everything in between. To support these new technologies, our software has to be at the cutting edge.

On the other hand, we also need a reliable platform to build on. One of the reasons we chose Debian is its long-term support lifecycle, as opposed to rolling-release operating systems.

Balancing these two ideas, we opted not to overwrite system libraries in /usr/lib with our cutting-edge versions, but instead to supplement the defaults by installing into /usr/local/lib.

The same development team that reported the LD_LIBRARY_PATH issue also came to us saying that the ARM64 version of their software failed to load shared object symbols. A debugging session was launched, and we eventually isolated the problem to the ordering of /etc/ld.so.conf.d/ in Debian.

Because /etc/ld.so.conf.d/ is traversed in alphabetical order, shared libraries in /usr/local/lib are loaded before /usr/lib/$(uname -m)-linux-gnu on x86_64, while the opposite is true on arm64!
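The effect is easy to reproduce. On Debian, /usr/local/lib comes from libc.conf, while the multiarch path comes from a file named after the architecture triple, and a plain alphabetical sort puts those two names on opposite sides of libc.conf for the two architectures:

```shell
# Simulate the alphabetical traversal of /etc/ld.so.conf.d/ for each
# architecture. libc.conf adds /usr/local/lib; the triple-named file
# adds /usr/lib/<triple>-linux-gnu. Whichever sorts first is searched
# first by the runtime loader.
for arch in x86_64 aarch64; do
    echo "--- ${arch} ---"
    printf '%s\n' "libc.conf" "${arch}-linux-gnu.conf" | sort
done
# --- x86_64 ---
# libc.conf                 <- /usr/local/lib wins
# x86_64-linux-gnu.conf
# --- aarch64 ---
# aarch64-linux-gnu.conf    <- /usr/lib/aarch64-linux-gnu wins
# libc.conf
```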

Internal discussion resulted in us not changing the system's default search order, but instead using the linker flag --rpath to ask the runtime loader to explicitly search our /usr/local/lib location first.

This issue applies to both the emulated and physical ARM64 environments, which is a boon for the emulation framework we’ve just put together.

Native Builds and CI

Cross-compilation and emulated compilation brought over more than 99% of our edge codebase, but there were still a handful of packages that did not fit the models we defined. Some packages, e.g. llvm, parallelize their builds so well that the overhead of user-space emulation stretched build times past 6 hours. Others called more esoteric functions that QEMU was not prepared to handle.

Rather than devote resources to emulating the long tail, we allocated a few ARM64 machines for developer access, and one machine for a native CI agent. Maintainers of the long tail could iterate in peace, knowing their failing test cases were never due to the emulation layer. When ready, CI would pick up their changes and build an official package, post-review.

While native compilation is the least error-prone build method, the limited supply of machines made this option unattractive; every machine we allocate for development and CI is a machine we take away from our proving grounds.

Other Challenges

With the most glaring blockers out of the way, we have now given our developers an even footing to easily build for multiple architectures.

The rest of the time was spent coordinating over a hundred packages split between dozens of tech teams. At the beginning, responsibility for building ARM64 packages lay with the porting team, and working on a changing codebase required close collaboration between maintainer and porter.

Once we deemed our ARM64 platform production-ready, we created a self-guided procedure covering the build methods listed above, and sent a request to all of engineering to support ARM64 as a first-class citizen.

The end result of our stack is currently being tested, profiled, and optimized, with results coming soon!