Compiling Rust for GPUs

A couple of days back, I tweeted that I had just run code written in
Rust on the GPU. It’s about time I provided some more details. This is
a project I worked on with Milinda Pathirage, a fellow student at
IU. I should emphasize that this is very much in the proof of concept
stage. I doubt it will work well enough to do anything useful, but it
does work well enough to do something and it would certainly be
possible to extend this. That said, I will include links to our code
so the valiant hackers out there can try it out if they wish. For
posterity’s sake, here is, to my knowledge, the first fragment of Rust
code to ever execute on a GPU:

There are two main parts to this project. The first is compiling Rust
code into something suitable for running on the GPU. We do this using
the PTX backend that is part of LLVM. The second part is loading and
executing the kernel. For this, we use OpenCL and its
clCreateProgramWithBinary API. In this
post, I’ll focus on the issues encountered with generating PTX code.

The bulk of the work to generate PTX code was already done by the
NVPTX backend that was recently contributed to
LLVM by NVIDIA. We started out with a very manual process. First we
used the --emit-llvm flag for rustc to save the generated LLVM
bitcode. From there, we attempted to compile it to PTX using llc:

llc -march=nvptx -mcpu=sm_13 trivial-kernel.ll -o trivial-kernel.ptx

I wasn’t terribly surprised to see this fail with one of LLVM’s
typically opaque error messages. You can see it here if
you wish. Basically, Rust was generating code that the NVPTX backend
didn’t know how to handle. This makes sense; I expect NVIDIA primarily
tests the backend on code generated by CUDA, which looks different
from code that Rust generates. The next step was to pare down the generated LLVM to something a little more manageable:

Assertion failed: (!isLiteral() && "Literal structs never have names"), function getName, file /usr/local/src/llvm/lib/VMCore/Type.cpp, line 605.

It seems that NVPTX was having trouble with the anonymous struct in
the function arguments ({ i64, %tydesc*, i8*, i8*, i8 }). To test
this theory, I replaced that type with an i8 *. The argument was
ignored anyway, so this shouldn’t cause problems. With this change, we
ended up with a PTX file.

At this point, we could either hack the Rust compiler to avoid generating
code that the NVPTX backend couldn’t handle, or we could improve the
NVPTX backend. I opted for the latter, and ended up submitting my
first ever patch to LLVM.

After another minor fix or two, it became clear that we were going to
have to modify the way Rust generates code as well. For example, the
PTX code I linked to above does not include a .entry line, which is
required to indicate where a kernel function begins. One option would
be to add a new PTX target for Rust, and basically set it up as a
cross compiler. This isn’t quite what we want. We don’t want to run
all of Rust on the GPU, just a few portions of a program. Other than
the code generator, we want the PTX code to agree with the
architectural details of the host system. Instead, I added a -Zptx
flag to rustc and started making minor changes to the translation
pass. Functions that have the #[kernel] attribute get compiled to
use the ptx_kernel calling convention, which tells NVPTX to add the
.entry line. According to Patrick, we should probably use a new
ABI setting instead, as arbitrary attributes aren’t part of the
function’s type.

At any rate, we could now pretty reliably go from Rust to PTX without
any manual intervention. The next challenge was to execute the
kernel. When we first tried to load the PTX file, OpenCL complained
about an “invalid binary.” We had previously been able to load a PTX
file generated with OpenCL and extracted using
clGetProgramInfo, so we decided to compare the
Rust-generated code with the OpenCL-generated code. It turns out that
the parameters to the kernel were not being annotated with an address
space. We manually added .global to the parameters in the
Rust-generated code, and we were able to load and execute the
kernel. Furthermore, we could manually annotate the LLVM code with
addrspace(1) to get the same behavior.

For some types, Rust would have the addrspace(1) annotation, but for
others it wouldn’t. It turns out Rust was already using address spaces
for something related to garbage collection. Unfortunately, Rust and
NVPTX disagree on what these address spaces mean. To work around this,
I had Rust generate different address spaces when the -Zptx flag is
given. At the moment these changes only take effect for &
pointers. Others, such as @ pointers, will be more difficult to get
working.

The final missing piece on the code generation side of things is to
have threads be able to do different things. This means providing
equivalents of the blockIdx, blockDim and threadIdx
variables. These show up in LLVM as intrinsic functions, so all we
need to do is expose those as new Rust intrinsics. We expect to have
this part working soon.
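
To make the role of these variables concrete, here is a minimal CPU-side sketch, written in present-day Rust rather than the dialect our prototype compiles. It simulates how a kernel usually combines the three values into one global index. The names ThreadCtx, global_index, add_one_kernel and launch are hypothetical and exist only for this illustration; on a real GPU the values would come from the LLVM intrinsics, not from a struct handed in by the host.

```rust
/// Hypothetical stand-in for the per-thread values a GPU provides.
#[derive(Clone, Copy)]
struct ThreadCtx {
    block_idx: usize,  // which block this thread is in (blockIdx.x)
    block_dim: usize,  // threads per block (blockDim.x)
    thread_idx: usize, // position within the block (threadIdx.x)
}

impl ThreadCtx {
    /// Conventional one-dimensional global index.
    fn global_index(&self) -> usize {
        self.block_idx * self.block_dim + self.thread_idx
    }
}

/// Kernel body: each simulated thread increments one element.
fn add_one_kernel(ctx: ThreadCtx, data: &mut [f32]) {
    let i = ctx.global_index();
    // The grid may be larger than the data, so guard the access.
    if i < data.len() {
        data[i] += 1.0;
    }
}

/// Run the kernel sequentially for every (block, thread) pair.
fn launch(grid_dim: usize, block_dim: usize, data: &mut [f32]) {
    for block_idx in 0..grid_dim {
        for thread_idx in 0..block_dim {
            let ctx = ThreadCtx { block_idx, block_dim, thread_idx };
            add_one_kernel(ctx, data);
        }
    }
}

fn main() {
    let mut data = vec![0.0_f32; 10];
    // 3 blocks of 4 threads: 12 simulated threads cover 10 elements.
    launch(3, 4, &mut data);
    println!("{:?}", data);
}
```

Because every thread derives a distinct index from the same kernel body, the same code can touch a different element on each thread, which is all "doing different things" needs to mean for simple data-parallel kernels.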

Our work here shows it’s possible to compile Rust to run on the
GPU. We support an extremely limited subset of Rust at the
moment. Most of the remaining challenges have to do with the way data
is arranged in memory and how Rust provides safety at runtime. Rust
uses a lot of pointer structures, and moving these between host and
device memory can be difficult. Perhaps the best thing to do for now
is simply to be careful about what data types we use in GPU code. Even
if we use relatively flat types, however, we will still need to handle
a few more things. For example, Rust does array bounds checks at
runtime. If we want to allow arbitrary array indexing safely in GPU
code, we’ll need a way to do bounds checks and report failures from
kernel code. There are clearly a lot of design issues left, but the
initial results for compiling Rust to run on the GPU seem very
promising.
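
As one possible shape for the bounds-check problem, here is a small CPU-side sketch in present-day Rust. It is not something our prototype does. The kernel never fails in place; a failed check records the offending thread in a status word that the host inspects after the launch. The names KernelStatus, gather_kernel, run_gather and NO_ERROR are hypothetical, and a real device would need an atomic write to the shared status word, which this sequential simulation sidesteps.

```rust
/// Sentinel meaning "no thread reported a failure".
const NO_ERROR: u32 = u32::MAX;

/// Hypothetical status word shared between the kernel and the host.
struct KernelStatus {
    first_failed_thread: u32,
}

/// Kernel body: out[tid] = input[indices[tid]], with an explicit bounds check.
fn gather_kernel(
    tid: u32,
    input: &[f32],
    indices: &[u32],
    out: &mut [f32],
    status: &mut KernelStatus,
) {
    let t = tid as usize;
    if t >= indices.len() || t >= out.len() {
        return; // surplus thread, nothing to do
    }
    match input.get(indices[t] as usize) {
        Some(&value) => out[t] = value,
        None => {
            // Record only the first failure instead of aborting.
            if status.first_failed_thread == NO_ERROR {
                status.first_failed_thread = tid;
            }
        }
    }
}

/// Host side: run every simulated thread, then check the status word.
fn run_gather(input: &[f32], indices: &[u32]) -> Result<Vec<f32>, u32> {
    let mut out = vec![0.0_f32; indices.len()];
    let mut status = KernelStatus { first_failed_thread: NO_ERROR };
    for tid in 0..indices.len() as u32 {
        gather_kernel(tid, input, indices, &mut out, &mut status);
    }
    if status.first_failed_thread == NO_ERROR {
        Ok(out)
    } else {
        Err(status.first_failed_thread)
    }
}

fn main() {
    let input = [10.0_f32, 20.0, 30.0];
    println!("{:?}", run_gather(&input, &[2, 0, 1])); // all indices in range
    println!("{:?}", run_gather(&input, &[0, 7, 1])); // thread 1 is out of range
}
```

The sketch also sticks to flat slices of plain numbers, the kind of data that is easy to copy between host and device memory.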