Getting to Deterministic Builds on Windows

(Disclaimer: Some of this post discusses projects from my job. All opinions
and mistakes here are my own.)

This is a set of notes on getting to deterministic builds in C, C++ and Rust on
Windows.

The primary motivation for this is not the lofty goal of a Reproducible
Build, but simply improving our Bazel cache
hit rates.

A quick primer on Bazel caching

At Dropbox, much of our build is powered by Bazel and I was involved in making
that a reality. One of the core benefits of Bazel is that once you buy into the
model, you get remote caching for free. This means a local developer can
benefit from the thousands of hours that CI machines spend cranking on the
build, and just pull down those artifacts instead of waiting several minutes
for a full local rebuild.

The Bazel cache works at the action level, where an action is usually a unique
command run that produces some outputs from some inputs. Bazel calculates
checksums of all inputs and outputs and uses this to influence decisions about
when to use the cache. For a given action, if all your local input hashes match
the hashes in cache, Bazel can re-use the output from the cache instead of
rebuilding it.
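To make the idea concrete, here is a minimal sketch of an action cache key, assuming the key is just a hash over the command line and the content hashes of the inputs. The names `content_digest` and `action_key` are made up for illustration, and this is not Bazel's actual digest scheme.

```python
import hashlib


def content_digest(data: bytes) -> str:
    """Hash the contents of one input (illustrative; not Bazel's real scheme)."""
    return hashlib.sha256(data).hexdigest()


def action_key(command: list[str], inputs: dict[str, bytes]) -> str:
    """Combine the command line and every input's content hash into one key."""
    h = hashlib.sha256()
    for arg in command:
        h.update(arg.encode("utf-8") + b"\0")
    # Sort by path so the key does not depend on dictionary ordering.
    for path in sorted(inputs):
        h.update(path.encode("utf-8") + b"\0")
        h.update(content_digest(inputs[path]).encode("ascii") + b"\0")
    return h.hexdigest()


cmd = ["cl.exe", "/c", "foo.c"]
key_ci = action_key(cmd, {"foo.c": b"int foo(void) { return 1; }\n"})
key_dev = action_key(cmd, {"foo.c": b"int foo(void) { return 1; }\n"})
key_edit = action_key(cmd, {"foo.c": b"int foo(void) { return 2; }\n"})
```

Here `key_ci` and `key_dev` match, so the developer machine could reuse the CI output, while `key_edit` differs because the source changed. Changing a compiler flag in `cmd` changes the key in the same way.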

Correctness

If the build is going to re-use outputs when inputs are the same, we want to
make sure that our compilers and other tools actually produce the same outputs
for the same inputs. What “same” means here lies on a range, because not all
tools are designed for this. At the lowest level you at least want these to
be functionally identical. That is, say you have inputs A and B, and a tool
represented by a pure function F(inputs...) -> output, where output is some
executable. Suppose one run of F(A, B) produces an executable C and another run
produces D.

Imagine we execute C and D and they produce different results! Or, say C is a
debug mode executable and D is one with optimizations, and so one runs faster
than the other. This is incorrect! Our function F did not produce the same
outputs for the same inputs, which means F itself has some implicit
configuration or state that is changing behavior. In the context of build
systems, this usually means the build system didn’t treat the command line
and compiler flags used as “inputs”. Bazel goes to great lengths in the build
description to force you to very pedantically describe all these things, so
that it can track all of it as inputs to monitor. A combination of
toolchains, well defined inputs and outputs and sandboxing is used to enforce
this. This usually means we don’t have to deal with such egregious
differences of correctness.

What remains is a milder case: C and D behave identically, but their bytes,
and so their hashes, differ. As far as correctness goes, it is acceptable to
have the same inputs produce slightly different hashes. This will manifest in
two ways. One, Bazel uses the cache: it sees the inputs are the same and,
instead of producing D (hash: romeo), just gets C (hash: whiskey) from the
cache. This is usually OK, but we should try to minimize it.

The other way it can go is that Bazel decides to build locally and gets D, and
now everything that depends on D is affected too, slowing the local build.

Speed

In terms of build system classification, Bazel has a rebuilder
that uses constructive traces to track the build. Roughly this means for a
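given action, only its immediate inputs affect the caching. Take a common
two-step process for a C library: compile foo.c into foo.o, then link foo.o
into the library. The link action is cached against the hash of foo.o. Now
imagine a compiler that does not produce a byte-identical foo.o on every run:
the hash the link step sees then depends on where the compile ran.

That is a pickle! Say Bazel’s heuristics decide to compile locally instead of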
hitting the cache. We end up with foo.o hashing to november, which is not in
the cache from the CI machine, so Bazel is forced to also run the link step
locally. You can see how this can spread. In large builds, you could easily
have tens of C files per library, some built locally and some pulled from the
cache. Every time a hash mismatch happens, the build system is forced to
rebuild everything downstream of it in the build graph!
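The cascade can be simulated in a few lines. This is only a sketch under stated assumptions: `compile_step` is a pretend compiler that leaks some per-machine state into its output, `link_key` stands in for the cache key of the link action, and `foo.lib` is a hypothetical output name.

```python
import hashlib


def digest(data: bytes) -> str:
    """Short content hash, standing in for the hashes the cache uses."""
    return hashlib.sha256(data).hexdigest()[:8]


def compile_step(source: bytes, machine_stamp: bytes) -> bytes:
    """A hypothetical compiler that leaks per-machine state into its output."""
    return b"OBJ:" + source + b"|" + machine_stamp


def link_key(object_file: bytes) -> str:
    """The link action is keyed only on its immediate input, the object file."""
    return digest(b"link\0" + object_file)


source = b"int foo(void) { return 1; }"

# CI builds and populates the remote cache, keyed by the link action's input.
ci_obj = compile_step(source, b"ci-machine")
remote_cache = {link_key(ci_obj): b"foo.lib built on CI"}

# A developer compiles the same source locally and gets different bytes,
# so the link action misses the cache and has to run locally as well.
dev_obj = compile_step(source, b"dev-machine")
link_cache_hit = link_key(dev_obj) in remote_cache

# With a deterministic compiler (no leaked state), the same lookup succeeds.
deterministic_cache = {link_key(compile_step(source, b"")): b"foo.lib"}
deterministic_hit = link_key(compile_step(source, b"")) in deterministic_cache
```

The source never changed, yet `link_cache_hit` is false purely because the intermediate object file differs. Every action further down the graph inherits the same miss.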

Well, that hypothetical compiler exists and it is called the Microsoft Visual
Studio C/C++ compiler.

What we really want is for every command we run to produce truly identical
outputs, so we get maximum cache hits. That identical hashes also mean truly
reproducible builds is a nice side effect.

Now, everything I’ve talked about so far is well known in certain circles.
Let’s talk about how to actually address this on Windows!

Fix date and time macros

The C standard provides two macros, __DATE__ and __TIME__, that are set to
the date and time of compilation. If a source file actually uses them to set
variables or in strings, that is terrible for our purposes, because every
compiler invocation will lead to new values. We are forced to break these. I
don’t really know of any libraries that use them to affect behavior.

For MSVC, we can override the macro definitions:

cl.exe /D__DATE__=CONSTANT /D__TIME__=CONSTANT …

Fix dates and times in Portable Executables

In addition, the PE format used by Windows for executables and DLLs has a file
header with a prominent TimeDateStamp field. This is inserted by the linker.
There is an undocumented flag, /Brepro, that causes the linker to put a
deterministic value in this field instead of the current time.

link.exe /Brepro …
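To check whether the flag took effect, you can read the field straight out of the file. The sketch below relies on the documented PE layout: the 4-byte value at offset 0x3C points at the `PE\0\0` signature, and TimeDateStamp sits 8 bytes after the start of that signature. The helper names `pe_timestamp` and `make_fake_pe` are my own, and `make_fake_pe` only builds a toy header to exercise the reader.

```python
import struct


def pe_timestamp(image: bytes) -> int:
    """Return the COFF header TimeDateStamp of a PE image held in memory."""
    if image[:2] != b"MZ":
        raise ValueError("not a PE image: missing MZ signature")
    # e_lfanew at offset 0x3C points at the 'PE\0\0' signature.
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE image: missing PE signature")
    # The COFF header follows the signature: Machine (2 bytes),
    # NumberOfSections (2 bytes), then TimeDateStamp (4 bytes).
    (stamp,) = struct.unpack_from("<I", image, pe_offset + 8)
    return stamp


def make_fake_pe(stamp: int) -> bytes:
    """Build a tiny synthetic header, just enough to exercise pe_timestamp."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)
    coff = b"PE\0\0" + struct.pack("<HHI", 0x8664, 1, stamp)
    return bytes(dos) + coff
```

On real files you would call `pe_timestamp(open(path, "rb").read())` on the same target from two separate builds and compare the values.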

There are a few more places in a PE that have timestamps. These include the
IMAGE_EXPORT_DIRECTORY and IMAGE_RESOURCE_DIRECTORY structures.

Fortunately, there is a nice tool called
ducible that can be run on the file
after linking to rewrite these bits with constant values.

Integrating ducible with your build process can be a little involved. The
nicest way I know for Bazel is to tweak the toolchain definition. This has the
nice property of not requiring the rest of your build to know about ducible.
Even custom rules that leverage cc_common to create actions will
automatically benefit. If you are using a fixed toolchain configuration
inspired by this Bazel example, you should replace the
linker path with a custom target, instead of a direct path to
link.exe. This can be a batch script or similar that forwards all the linker
options to link.exe, then runs ducible transparently before exiting. This is
not the approach we use internally, since I only thought about it recently.
Instead, the few places that require this are aware of ducible and either add
it to their Bazel actions or stick it into tool wrappers. For example, our Rust
compiler isn’t a direct call to rustc, but a wrapper script that does a bunch
of things, one of which is running ducible.
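A forwarding wrapper of the kind described above could look roughly like this. It is a sketch, not what we run: it assumes link.exe and ducible.exe are on PATH, that ducible is invoked as `ducible IMAGE [PDB]` (check its README), and that /OUT and /PDB appear directly on the command line rather than inside a response file. The function names are hypothetical.

```python
import subprocess
from typing import Optional


def find_option(args: list[str], name: str) -> Optional[str]:
    """Return the value of a /NAME:value linker option (case-insensitive)."""
    prefix = name.lower() + ":"
    for arg in args:
        if arg[:1] in ("/", "-") and arg[1:].lower().startswith(prefix):
            return arg[1 + len(prefix):].strip('"')
    return None


def ducible_command(link_args: list[str],
                    ducible: str = "ducible.exe") -> Optional[list[str]]:
    """Build the ducible invocation for the image this link produces."""
    image = find_option(link_args, "OUT")
    if image is None:
        return None  # Output name was implicit or hidden in a response file.
    command = [ducible, image]
    pdb = find_option(link_args, "PDB")
    if pdb is not None:
        command.append(pdb)
    return command


def main(argv: list[str]) -> int:
    """Forward everything to link.exe, then post-process the outputs."""
    link_args = argv[1:]
    result = subprocess.run(["link.exe", *link_args])
    if result.returncode != 0:
        return result.returncode
    post = ducible_command(link_args)
    if post is None:
        return 0
    return subprocess.run(post).returncode

# A wrapper script would end with: sys.exit(main(sys.argv))
```

Keeping the command construction in `ducible_command` separate from the process launching makes the argument handling easy to test without a Windows toolchain installed. A real wrapper would also need to expand `@file` response files before looking for /OUT.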

Disable incremental builds

MSVC has an incremental build mode, where the linker adds extra information to
files to allow it to build faster on subsequent runs. This causes changes in
the hash. Since our build is powered by Bazel, incremental builds also don’t
really help us. Disable this by passing /INCREMENTAL:NO to the linker.

Deal with the PDBs

Program Database files (PDBs) are the real bane of our quest. We
need them to debug software without shipping debug information to users, so we
cannot simply disable the production of these files. Everything I’ve discussed
above is relatively well known, but this I had to discover for myself.

PDBs have several problems:

1. A PDB and a PE are linked so debuggers know exactly which PDB to use from a
   symbol server. PE files have a debug section that has a timestamp and a
   signature to identify the PDB. This changes every time. This is fixed by
   ducible.

2. PE files encode the absolute path to the PDB by default.

3. PDB files contain absolute paths to all the resources involved, such as
   object files. This means we need to ensure all paths are the same on every
   machine.

4. Since PEs and PDBs are linked, changes to the PDB always change the PE. This
   means we need identical PDBs.

5. PDB files have a build identifier, so even minor version updates in MSVC
   lead to hash differences.

Fix the paths

To fix #2, we can pass the /PDBALTPATH:%_PDB% flag to link.exe.
This makes the linker encode just the filename instead of an
absolute path.

To fix #3 and #4, we need to control where build trees are located on all
machines. The first problem is that developers will naturally have files in
different locations because the typical storage location is somewhere in
C:\Users\<user name> and <user name> is unique. If you are doing in-source
builds, this will be a problem. If you are doing out-of-source builds, this is
easier to fix by having your build system use a well-known location like
C:\build instead of C:\Users\nikhil\path\to\repo\build. This way all
absolute paths start as C:\build\... on any machine, fixing #3 to some
extent.

Fortunately, Bazel always does out-of-source builds, so we can control where to
place the build tree. External libraries and resources also need the same
treatment. To take care of both, we need to change Bazel’s output user root and
output base, both of which are usually calculated based on hashes of things and
stored in the user home directory. The full details are on their website. Both
are startup options, and something like --output_user_root=C:/bazel and
--output_base=C:/bazel/base is a good start.

The other source of changing paths is temporary files. Depending on your
specific build steps, it is very common to run tools that build something in a
temporary directory. You will need to change this to use a deterministic
location. Using the hash of some set of inputs is a good way to go. We often
use a hash of the repository relative location of files. There are some sources
of temporary files you cannot easily remove, such as this one I
found in rustc.
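As a rough illustration of the hashing approach, the sketch below maps a repository-relative path to a stable scratch directory. The `C:/build/tmp` root and the `scratch_dir` name are assumptions for the example, not what our build actually uses.

```python
import hashlib


def scratch_dir(repo_relative_path: str, root: str = "C:/build/tmp") -> str:
    """Map a repository-relative path to a stable scratch directory name."""
    # Normalise separators so the same file hashes identically however the
    # path was spelled by the calling tool.
    normalized = repo_relative_path.replace("\\", "/").strip("/")
    name = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
    return f"{root}/{name}"


ci_dir = scratch_dir("src/lib/foo.c")
dev_dir = scratch_dir("src\\lib\\foo.c")
other_dir = scratch_dir("src/lib/bar.c")
```

Because the name depends only on the repository-relative path, `ci_dir` and `dev_dir` are the same on every machine, while different files still get separate directories.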

With all these changes you should be getting deterministic PEs and PDBs as far
as input files go.

Pin the build version

The final bit I noticed was that PDBs have a major and minor build number in
the DBI stream header. This can change between different versions of the same
major toolchain, like Visual Studio 2017, so it is important to have CI
machines and developer machines on the same update cycle. Some automated
deployment strategy helps here, so every developer does not have to remember to
manually update their installation.

In my tests so far, this leads to bit-identical PEs and PDBs!

Useful tools

pdbdump - Shipped as part of ducible, pdbdump is great for seeing the
metadata in a PDB. This was what helped me see absolute paths in the PDB.

dumpbin - Part of MSVC. dumpbin /all run on an EXE or DLL is a good way
to see how headers and other metadata differ.

A binary diff tool such as vbindiff. After you’ve exhausted all
the metadata approaches, sometimes you just have to jump into hex and
look at individual bytes. This is a small and free tool. There is also a
limited web version of diffoscope, but that tool itself does
not work on Windows.
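If you just want the offsets where two outputs diverge before opening a hex viewer, a few lines of script are enough. This is a plain byte-by-byte comparison, not a replacement for the tools above, and `differing_ranges` is a name I made up.

```python
def differing_ranges(a: bytes, b: bytes) -> list[tuple[int, int]]:
    """Return [start, end) offset ranges where two byte strings differ."""
    ranges = []
    start = None
    common = min(len(a), len(b))
    for offset in range(common):
        if a[offset] != b[offset]:
            if start is None:
                start = offset  # A new run of differing bytes begins here.
        elif start is not None:
            ranges.append((start, offset))
            start = None
    if start is not None:
        ranges.append((start, common))
    # Any length difference is reported as one trailing range.
    if len(a) != len(b):
        ranges.append((common, max(len(a), len(b))))
    return ranges
```

Run it on the contents of the same target from two builds. A small number of short ranges usually points at a timestamp or identifier, while long ranges tend to mean embedded paths or genuinely different code.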