An Introduction to Accelerating Your Build with Clang

Building software is time-consuming, especially when you are working on large codebases such as Clang and LLVM (2.5M lines of C++ combined). As an LLVM developer, I spend a significant portion of my time building the software, because a clean build takes several minutes to complete. The work outlined in this series started out as an experiment to see how much we could speed up the build by using different tools or build settings rather than simply upgrading our hardware. The goal is to squeeze as much performance out of the toolchain as possible. The focus is on building Clang/LLVM, but most of the results and corresponding suggestions apply to other large C++ projects as well. This post expands on a lightning talk I recently gave at the EuroLLVM 2015 conference.

Taking a Look at the Options

So what can we actually do to speed up the build? We came up with a handful of ideas, each of which will be covered in this blog series. This will allow us to explore which options produce the biggest improvement:

Build less: e.g. build only the backends we are actually interested in rather than all of them

Note: The first four items above will be covered in this post; the remaining items will be covered in the next article in this series.

The Build Environment

The development machine is a desktop with an Intel Core i7-4770K CPU @ 3.50GHz, 16GB of RAM and a 1TB 7200RPM HDD. This is a fairly standard machine that should probably have more RAM and an SSD rather than an HDD, but it still does the job. It runs Fedora 21, which ships with GCC 4.9.2 and GNU gold 2.24.

LLVM has two build systems which are maintained in parallel: a CMake-based build and a traditional autoconf-style Makefile build. In the long term, the autoconf-style build system will be phased out, so this article focuses only on the CMake build. Since CMake is a meta build system, it can generate build files for various native build systems. While taking measurements, we always use the Ninja generator, i.e. CMake creates Ninja build files and Ninja drives the actual build.
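As a rough illustration, a configure step that uses the Ninja generator and also applies the "build less" idea from the list of options could be assembled along these lines. This is only a sketch: the helper name configure_args, the source path and the chosen target list are assumptions made for illustration, not the exact configuration used for the measurements. LLVM_TARGETS_TO_BUILD is LLVM's CMake option for restricting the set of backends and takes a semicolon-separated list.

```python
def configure_args(source_dir, build_type="Release", targets=("X86",)):
    """Return a CMake command line (as a list) for a Ninja-based LLVM build.

    Hypothetical helper: it only assembles the arguments and does not run CMake.
    """
    return [
        "cmake",
        "-G", "Ninja",                          # generate Ninja build files
        f"-DCMAKE_BUILD_TYPE={build_type}",     # Debug or Release
        # Build only the listed backends instead of all of them.
        "-DLLVM_TARGETS_TO_BUILD=" + ";".join(targets),
        source_dir,
    ]
```

The returned list can be passed to subprocess.run from an empty build directory. Keeping it as a list avoids shell quoting problems with the semicolon in the target list.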

We are using a fairly recent trunk snapshot for Clang (r234392 from April 8, 2015). Whenever Clang is mentioned in this blog post, it’s referring to this particular revision. In order to avoid any confusion, the term host compiler is used to denote the compiler being used for the actual build.

Since the machine has a hyper-threaded quad-core CPU, we carry out builds with eight parallel build jobs. The results are always the best of five runs (or more). The build system defaults to static builds (i.e. all binaries are statically linked) and, unless otherwise mentioned, the binaries are linked with GNU gold 2.24.
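The measurement procedure described above (eight parallel jobs, best of five runs) could be scripted roughly as follows. The function names are hypothetical and this is not the harness that produced the numbers in this post. It assumes an already configured Ninja build directory and uses Ninja's built-in clean tool between runs.

```python
import subprocess
import time


def time_command(cmd, cwd=None):
    """Run cmd once and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, cwd=cwd, check=True)
    return time.perf_counter() - start


def best_of(durations):
    """Return the fastest run, matching the 'best of five' rule."""
    if not durations:
        raise ValueError("need at least one measurement")
    return min(durations)


def measure_build(build_dir, runs=5, jobs=8):
    """Time `runs` rebuilds with `jobs` parallel jobs and keep the best one."""
    durations = []
    for _ in range(runs):
        # Remove all build outputs so every run starts from scratch.
        subprocess.run(["ninja", "-t", "clean"], cwd=build_dir, check=True)
        durations.append(time_command(["ninja", f"-j{jobs}"], cwd=build_dir))
    return best_of(durations)
```

Taking the minimum rather than the mean filters out runs that were disturbed by other activity on the machine.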

Measurement Results

Let’s give these options a try to see how each of them works out in practice.

Clang vs. GCC Compile Time

A core goal of Clang is to have fast compile times and low memory usage, with a particular focus on fast compilation of debug builds. Thus, one of the first things to try is to build Clang with a Clang host compiler rather than with GCC.
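Switching the host compiler is a configure-time choice. A minimal sketch, assuming clang and clang++ are on the PATH and using a hypothetical helper name, relies on the standard CMake variables CMAKE_C_COMPILER and CMAKE_CXX_COMPILER:

```python
def host_compiler_args(cc="clang", cxx="clang++"):
    """Return CMake arguments that select the host compiler.

    Hypothetical helper; pass cc="gcc", cxx="g++" to compare against GCC.
    """
    return [
        f"-DCMAKE_C_COMPILER={cc}",
        f"-DCMAKE_CXX_COMPILER={cxx}",
    ]
```

These arguments need to be given on the first CMake run in a fresh build directory, since CMake caches the compiler choice.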

In the following chart, the different compilers are all built with GCC 4.9.2 to make sure that the compiler binaries have similar code quality. The chart shows the compile time for a debug/release build of Clang with GCC 4.9.2, GCC 5.1.0 and Clang as a host compiler:

As you can see in the chart above, building with Clang is significantly faster than building with GCC. Sadly, the compile time for both debug and release builds has regressed quite a bit when comparing GCC 5.1.0 to GCC 4.9.2.

Faster Linker

On Linux we also have the option of using the GNU gold linker rather than GNU ld. The GNU gold linker has been around for a couple of years already and is optimized for fast ELF linking.

The following chart shows how the compile time improves when using GNU gold rather than GNU ld. Clang is built with GCC 4.9.2 as a host compiler and CMake produces Makefiles rather than Ninja build files.

Using GNU gold rather than GNU ld gives us a nice 17% speedup on debug builds. There isn’t much of a speedup for release builds, which is mostly due to the fact that release binaries are much smaller than debug binaries. There’s simply not that much time spent on linking for release builds.
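One common way to select the linker is the -fuse-ld= driver flag, which both GCC and Clang accept. The sketch below passes it through CMake's linker flag variables. Whether the builds measured here selected gold this way is an assumption, and the helper name is hypothetical.

```python
def linker_args(linker="gold"):
    """Return CMake arguments that make the compiler driver use `linker`.

    Use linker="bfd" to fall back to the traditional GNU ld.
    """
    flag = f"-fuse-ld={linker}"
    # Apply the flag to executables, shared libraries and loadable modules.
    return [
        f"-DCMAKE_{kind}_LINKER_FLAGS={flag}"
        for kind in ("EXE", "SHARED", "MODULE")
    ]
```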

Optimize the Host Compiler Aggressively

The next thing to try is optimizing the host compiler more aggressively; the goal here is to get a Clang binary that executes as efficiently as possible. The baseline is an -O3 build of Clang. By enabling link-time optimizations (-O3 -flto) we make it possible for the compiler to optimize across the whole program rather than just across a single compilation unit.

Profile-guided optimizations (-O3 -fprofile-use) help the compiler optimize for a particular input of the program (in our case the inputs are the Clang sources themselves). The profiling data gives the compiler the ability to focus on optimizing the frequently executed paths of the program rather than having to rely on heuristics to predict which paths are executed frequently. It is also possible to combine LTO and PGO (-O3 -flto -fprofile-use) in order to provide the optimizers with a maximum amount of optimization context. For the PGO builds we gather the profiling data on a release build of Clang, i.e. the debug builds on the following chart are not optimized with the profiling data of an actual debug build of Clang.
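The four host-compiler configurations described above can be summarized as a small table of flag sets. The sketch below is illustrative: the dictionary keys and the helper name are made up, and passing the flags through CMAKE_CXX_FLAGS is an assumption about how one might apply them. With GCC, PGO is a two-phase process: an instrumented build with -fprofile-generate is run on a training workload first, and the final build then reads the collected profile with -fprofile-use.

```python
# Flag sets for the optimized host-compiler builds named in the text.
OPT_FLAGS = {
    "baseline": ["-O3"],
    "lto": ["-O3", "-flto"],
    "pgo": ["-O3", "-fprofile-use"],
    "lto+pgo": ["-O3", "-flto", "-fprofile-use"],
}

# Phase one of a GCC PGO build: instrument, then run a training workload.
PGO_INSTRUMENT_FLAGS = ["-O3", "-fprofile-generate"]


def cxx_flags_arg(config):
    """Return a -DCMAKE_CXX_FLAGS=... argument for the named configuration."""
    return "-DCMAKE_CXX_FLAGS=" + " ".join(OPT_FLAGS[config])
```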

There are no Clang PGO builds because PGO support in Clang is still at a very early stage of development, although there are ongoing efforts to improve it.

Unfortunately there are no LTO results for GCC 5.1.0 because there’s a bug which causes GCC 5.1.0 to crash with an internal compiler error when trying to do an LTO build of Clang.

It’s not a big surprise that we get the best results when using PGO: with GCC 4.9.2 there’s a speedup of 1.16x for release builds and 1.13x for debug builds compared to the baseline -O3 build. For release builds we see a small performance regression with the GCC 5.1.0 PGO build. An interesting observation is that at -O3 Clang and GCC 4.9.2 are on par in terms of code quality. GCC 5.1.0 produces slightly better code than Clang at -O3.

Somewhat surprising is the fact that the GCC 4.9.2 LTO build actually regresses performance such that it’s comparable to the baseline -O3 build. The same performance regression is visible when combining LTO and PGO. It will be interesting to see whether GCC 5.1.0 yields better performance here as plenty of LTO improvements made it into the 5.1.0 release.

With the Clang LTO build we see a speedup of 1.03x for both debug and release builds compared to the baseline -O3 build. Finally, we also tried to tune the Clang binary for the microarchitecture of the CPU (-mtune=haswell -O3 -fprofile-use) but there was no noticeable performance improvement.
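To put the speedup factors quoted in this section into perspective, it helps to convert them into the fraction of build time saved. The sketch below assumes that a speedup is defined as baseline time divided by new time. The function names are hypothetical and no measured timings from this post are used.

```python
def speedup(baseline_seconds, new_seconds):
    """Speedup factor: how many times faster the new build is."""
    return baseline_seconds / new_seconds


def time_saved_fraction(speedup_factor):
    """Fraction of the baseline build time saved at a given speedup."""
    return 1.0 - 1.0 / speedup_factor
```

Under this definition, a 1.16x speedup corresponds to roughly 14% less build time, and a 1.03x speedup to roughly 3% less.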

Split DWARF

Compiling large C++ applications with debug information can lead to slow link times and possible out-of-memory conditions. For big applications it’s not uncommon for the debug information alone to take up 85% of the binary size. Split DWARF tries to address this by splitting the debug information and storing the majority of it in separate DWARF object files. This significantly reduces the size of the object files the linker needs to process and thus speeds up the linking process.

The following chart shows the speedup we get when building with -gsplit-dwarf (using a Clang host compiler compiled with Clang at -O3):

Intuitively, we would expect to see a higher speedup, since this is a static debug build with a significant amount of link time. It’s somewhat unclear why the results are not better. When looking at the linker invocation for the Clang binary in isolation, we can see that -gsplit-dwarf indeed reduces both the memory consumption and link time significantly:

During the linking stage of the debug build the I/O wait time peaks at 5%. That doesn’t seem to be an excessive amount of I/O wait time, but it might already be enough to cut into the speedup we gain by building with -gsplit-dwarf. To get to the bottom of this, a deeper performance analysis of the whole build process is needed.

Building with -gsplit-dwarf is still a good idea, as it also helps to reduce the memory consumption tremendously. Plus, there is a good chance that you will see better results on your own machine.
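One possible way to try this on a debug build is to add the flag to both the C and C++ compiler flags at configure time. This is a sketch with a hypothetical helper name, and it is an assumption that the measured builds passed the flag this way. With -gsplit-dwarf the compiler writes most of the debug information into separate .dwo files next to the object files.

```python
def split_dwarf_args():
    """Return CMake arguments for a debug build with split DWARF enabled."""
    flag = "-gsplit-dwarf"
    return [
        "-DCMAKE_BUILD_TYPE=Debug",    # debug info is only emitted for debug builds
        f"-DCMAKE_C_FLAGS={flag}",
        f"-DCMAKE_CXX_FLAGS={flag}",
    ]
```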

What’s Next?

It should be apparent from this first post in the series that we have been able to achieve some serious improvements to our build time. Part II of this series will include the remaining results from our experiments and will take a closer look at improving the overall speedup, in order to provide a comprehensive picture of how this can help you!