Posts tagged with 'speed'

There are two different schools of thought when it comes to embedded development such as working on Ubuntu phone. The first group feels that compiling on the device is better and the other group prefers doing cross-builds. Both have their advantages and disadvantages.

One of the biggest disadvantages for compiling on-device is the fact that a reflash wipes all your state and you need to compile your project from scratch again. This is unfortunate if a fresh build takes 20 minutes (which is not that uncommon).

The question then becomes whether you can avoid this. It turns out that you can by doing a ccache transplant. Here’s how to do it.

A common step in a software developer’s life is building packages. This happens both directly on you own machine and remotely when waiting for the CI server to test your merge requests.

As an example, let’s look at the libcolumbus package. It is a common small-to-medium sized C++ project with a couple of dependencies. Compiling the source takes around 10 seconds, whereas building the corresponding package takes around three minutes. All things considered this seems like a tolerable delay.

But can we make it faster?

The first step in any optimization task is measurement. To do this we simulated a package builder by building the source code in a chroot. It turns out that configuring the source takes one second, compiling it takes around 12 seconds and installing build dependencies takes 2m 29s. These tests were run on an Intel i7 with 16GB of RAM and an SSD disk. We used CMake’s Make backend with 4 parallel processes.

Clearly, reducing the last part brings the biggest benefits. One simple approach is to store a copy of the chroot after dependencies are installed but before package building has started. This is a one-liner:

sudo btrfs subvolume snapshot -r chroot depped-chroot

Now we can do anything with the chroot and we can always return back by deleting it and restoring the snapshot. Here we use -r so the backed up snapshot is read-only. This way we don’t accidentally change it.

With this setup, prepping the chroot is, effectively, a zero time operation. Thus we have cut down total build time from 162 seconds to 13, which is a 12-fold performance improvement.

But can we make it faster?

After this fix the longest single step is the compilation. One of the most efficient ways of cutting down compile times is CCache, so let’s use that. For greater separation of concerns, let’s put the CCache repository on its own subvolume.

The latter command gave an error about incorrect ioctls. The same effect can be achieved with bind mounts, though.

When doing this the compile time drops to 0.6 seconds. This means that we can compile projects over 100 times faster.

But can we make it faster?

At this point all individual steps take a second or so. Optimizing them further would yield negligible performance improvements. In actual package builds there are other steps that can’t be easily optimized, such as running the unit test suite, running Lintian, gathering and verifying the package and so on.

If we look a bit deeper we find that these are all, effectively, single process operations. (Some build systems, such as Meson, will run unit tests in parallel. They are in the minority, though.) This means that package builders are running processes which consume only one CPU most of the time. According to usually reliable sources package builders are almost always configured to work on only one package at a time.

Having a 24 core monster builder run single threaded executables consecutively does not make much sense. Fortunately this task parallelizes trivially: just build several packages at the same time. Since we could achieve 100 times better performance for a single build and we can run 24 of them at the same time, we find that with a bit of effort we can achieve the same results 2400 times faster. This is roughly equivalent to doing the job of an entire data center on one desktop machine.

The small print

The numbers on this page are slightly optimistic. However the main reduction in performance achieved with chroot snapshotting still stands.

In reality this approach would require some tuning, as an example you would not want to build LibreOffice with -j 1. Keeping the snapshotted chroots up to date requires some smartness, but these are all solvable engineering problems.

Some C++ code bases seem to compile much more slowly than others. It is hard to compare them directly because they very often have different sizes. Thus it is hard to encourage people to work on speed because there are no hard numbers to back up your claims.

To get around this I wrote a very simple compile time measurer. The code is available here. The basic idea is quite simple: provide a compiler wrapper that measures the duration of each compiler invocation and the amount of lines (including comments, empty lines etc) the source file had. Usage is quite simple. First you configure your code base.

We can see that C++ compiles quite a lot slower than plain C. The main interesting thing is that C++ compilation speed can change an order of magnitude between projects. The fastest is libcolumbus, which has been designed from the ground up to be fast to compile.

What we can deduce from this experiment is that C++ compilation speed is a feature of the code base, not so much of the language or compiler. It also means that if your code base is a slow one, it is possible to make it compile up to 10 times faster without any external help. The tools to do it are simple: minimizing interdependencies and external deps. This is one of those things that is easy to do when starting anew but hard to retrofit to code bases that resemble a bowl of ramen. The payoff, however, is undeniable.

Using libraries in C++ is simple. You just do #include<libname> and all necessary definitions appear in your source file.

But what is the cost of this single line of code?

Quite a lot, it turns out. Measuring the effect is straightforward. Gcc has a compiler flag -E, which only runs the preprocessor on the given source file. The cost of an include can be measured by writing a source file that has only one command: #include<filename>. The amount of lines in the resulting file tells how much code the compiler needs to parse in order to use the library.

Here is a table with measurements. They were run on a regular desktop PC with 4 GB of RAM and an SSD disk. The tests were run several times to insure that everything was in cache. The machine was running Ubuntu 12/10 64 bit and the compiler was gcc.

It should be noted that the elapsed time is only the amount it takes to run the code through the preprocessor. This is relatively simple compared to parsing the code and generating the corresponding machine instructions. I ran the test with Clang as well and the times were roughly similar.

Even the most common headers such as vector add almost 10k lines of code whenever they are included. This is quite a lot more than most source files that use them. On the other end of the spectrum is stuff like Boost.Python, which takes over three seconds to include. An interesting question is why it is so much slower than Nux, even though it has less code.

This is the main reason why spurious includes need to be eliminated. Simply having include directives causes massive loss of time, even if the features in question are never used. Placing a slow include in a much used header file can cause massive slowdowns. So if you could go ahead and not do that, that would be great.

There have been several posts in this blog about compile speed. However most have been about theory. This time it’s all about measurements.

I took the source code of Scribus, which is a largeish C++ application and looked at how much faster I could make it compile. There are three different configurations to test. The first one is building with default settings out of the box. The second one is about changes that can be done without changing any source code, meaning building with the Ninja backend instead of Make and using Gold instead of ld. The third configuration adds precompiled headers to the second configuration.

The measurements turned out to have lots of variance, which I could not really nail down. However it seemed to affect all configurations in the same way at the same time so the results should be comparable. All tests were run on a 4 core laptop with 4 GB of ram. Make was run with ‘-j 6′ as that is the default value of Ninja.

Default: 11-12 minutes
Ninja+Gold: ~9 minutes
PCH: 7 minutes

We can see that a bit of work the compile time can be cut almost in half. Enabling PCH does not require changing any existing source files (though you’ll get slightly better performance if you do). All in all it takes less than 100 lines of CMake code to enable precompiled headers, and half of that is duplicating some functionality that CMake should be exposing already. For further info, see this bug.

Is it upstream? Can I try it? Will it work on my project?

The patch is not upstreamed, because it is not yet clean enough. However you can check out most of it in this merge request to Unity. In Unity’s case the speedup was roughly 40%, though only one library build time was measured. The total build time impact is probably less.

Note that if you can’t just grab the code and expect magic speedups. You have to select which headers to precompile and so on.

Finally, for a well tuned code base, precompiled headers should only give around 10-20% speed boost. If you get more, it probably means that you have an #include maze in your header files. You should probably get that fixed sooner rather than later.

A relatively large portion of software development time is not spent on writing, running, debugging or even designing code, but waiting for it to finish compiling. This is usually seen as necessary evil and accepted as an unfortunate fact of life. This is a shame, because spending some time optimizing the build system can yield quite dramatic productivity gains.

Suppose a build system takes some thirty seconds to run for even trivial changes. This means that even in theory you can do at most two changes a minute. In practice the rate is a lot lower. If the build step takes only a few seconds, trying out new code becomes a lot faster. It is easier to stay in the zone when you don’t have to pause every so often to wait for your tools to finish doing their thing.

Making fundamental changes in the code often triggers a complete rebuild. If this takes an hour or more (there are code bases that take 10+ hours to build), people try to avoid fundamental changes as much as possible. This causes loss of flexibility. It becomes very tempting to just do a band-aid tweak rather than thoroughly fix the issue at hand. If the entire rebuild could be done in five to ten minutes, this issue would become moot.

In order to make things fast, we first have to understand what is happening when C/C++ software is compiled. The steps are roughly as follows:

Configuration

Build tool startup

Dependency checking

Compilation

Linking

We will now look at each step in more detail focusing on how they can be made faster.

Configuration

This is the first step when starting to build. Usually means running a configure script or CMake, Gyp, SCons or some other tool. This can take anything from one second to several minutes for very large Autotools-based configure scripts.

This step happens relatively rarely. It only needs to be run when changing configurations or changing the build configuration. Short of changing build systems, there is not much to be done to make this step faster.

Build tool startup

This is what happens when you run make or click on the build icon on an IDE (which is usually an alias for make). The build tool binary starts and reads its configuration files as well as the build configuration, which are usually the same thing.

Depending on build complexity and size, this can take anywhere from a fraction of a second to several seconds. By itself this would not be so bad. Unfortunately most make-based build systems cause make to be invocated tens to hundreds of times for every single build. Usually this is caused by recursive use of make (which is bad).

It should be noted that the reason Make is so slow is not an implementation bug. The syntax of Makefiles has some quirks that make a really fast implementation all but impossible. This problem is even more noticeable when combined with the next step.

Dependency checking

Once the build tool has read its configuration, it has to determine what files have changed and which ones need to be recompiled. The configuration files contain a directed acyclic graph describing the build dependencies. This graph is usually built during the configure step. Suppose we have a file called SomeClass.cc which contains this line of code:

#include "OtherClass.hh"

This means that whenever OtherClass.hh changes, the build system needs to rebuild SomeClass.cc. Usually this is done by comparing the timestamp of SomeClass.o against OtherClass.hh. If the object file is older than the source file or any header it includes, the source file is rebuilt.

Build tool startup time and the dependency scanner are run on every single build. Their combined runtime determines the lower bound on the edit-compile-debug cycle. For small projects this time is usually a few seconds or so. This is tolerable.

The problem is that Make scales terribly to large projects. As an example, running Make on the codebase of the Clang compiler with no changes takes over half a minute, even if everything is in cache. The sad truth is that in practice large projects can not be built fast with Make. They will be slow and there’s nothing that can be done about it.

There are alternatives to Make. The fastest of them is Ninja, which was built by Google engineers for Chromium. When run on the same Clang code as above it finishes in one second. The difference is even bigger when building Chromium. This is a massive boost in productivity, it’s one of those things that make the difference between tolerable and pleasant.

If you are using CMake or Gyp to build, just switch to their Ninja backends. You don’t have to change anything in the build files themselves, just enjoy the speed boost. Ninja is not packaged on most distributions, though, so you might have to install it yourself.

If you are using Autotools, you are forever married to Make. This is because the syntax of autotools is defined in terms of Make. There is no way to separate the two without a backwards compatibility breaking complete rewrite. What this means in practice is that Autotool build systems are slow by design, and can never be made fast.

Compilation

At this point we finally invoke the compiler. Cutting some corners, here are the approximate steps taken.

Merging includes

Parsing the code

Code generation/optimization

Let’s look at these one at a time. The explanations given below are not 100% accurate descriptions of what happens inside the compiler. They have been simplified to emphasize the facets important to this discussion. For a more thorough description, have a look at any compiler textbook.

The first step joins all source code in use into one clump. What happens is that whenever the compiler finds an include statement like #include “somefile.h”, it finds that particular source file and replaces the #include with the full contents of that file. If that file contained other #includes, they are inserted recursively. The end result is one big self-contained source file.

The next step is parsing. This means analyzing the source file, splitting it into tokens and building an abstract syntax tree. This step translates the human understandable source code into a computer understandable unambiguous format. It is what allows the compiler to understand what the user wants the code to do.

Code generation takes the syntax tree and transforms it into machine code sequences called object code. This code is almost ready to run on a CPU.

Each one of these steps can be slow. Let’s look at ways to make them faster.

Faster #includes

Including by itself is not slow, slowness comes from the cascade effect. Including even one other file causes everything included in it to be included as well. In the worst case every single source file depends on every header file. This means that touching any header file causes the recompilation of every source file whether they use that particular header’s contents or not.

Cutting down on interdependencies is straightforward. Only #include those headers that you actually use. In addition, header files must not include any other header files if at all possible. The main tool for this is called forward declaration. Basically what it means is that instead of having a header file that looks like this:

#include "SomeClass.hh"
class MyClass {
SomeClass s;
};

You have this:

class SomeClass;
class MyClass {
SomeClass *s;
}

Because the definition of SomeClass is not know, you have to use pointers or references to it in the header.

Remember that #including MyClass.hh would have caused SomeClass.hh and all its #includes to be added to the original source file. Now they aren’t, so the compiler’s work has been reduced. We also don’t have to recompile the users of MyClass if SomeClass changes. Cutting the dependency chain like this everywhere in the code base can have a major effect in build time, especially when combined with the next step. For a more detailed analysis including measurements and code, see here.

Faster parsing

The most popular C++ libraries, STL and Boost, are implemented as header only libraries. That is, they don’t have a dynamically linkable library but rather the code is generated anew into every binary file that uses them. Compared to most C++ code, STL and Boost are complex. Really, really complex. In fact they are most likely the hardest pieces of code a C++ compiler has to compile. Boost is often used as a stress test on C++ compilers, because it is so difficult to compile.

It is not an exaggeration to say that for most C++ code using STL, parsing the STL headers is up to 10 times slower than parsing all the rest. This leads to massively slow build times because of class headers like this:

As we learned in the previous chapter, this means that every single file that includes this header must parse STL’s vector definition, which is an internal implementation detail of SomeClass and even if they would not use vector themselves. Add some other class include that uses a map, one for unordered_map, a few Boost includes and what do you end up with? A code base where compiling any file requires parsing all of STL and possibly Boost. This is a factor of 3-10 slowdown on compile times.

Getting around this is relatively simple, though takes a bit of work. It is known as the pImpl idiom. One way of achieving it is this:

Now the dependency chain is cut and users of SomeClass don’t have to parse vector. As an added bonus the vector can be changed to a map or anything else without needing to recompile files that use SomeClass.

Faster code generation

Code generation is mostly an implementation detail of the compiler, and there’s not much that can be done about it. There are a few ways to make it faster, though.

Optimizing code is slow. In every day development all optimizations should be disabled. Most build systems do this by default, but Autotools builds optimized binaries by default. In addition to being slow, this makes debugging a massive pain, because most of the time trying to print the value of some variable just prints out “value optimised out”.

Making Autotools build non-optimised binaries is relatively straightforward. You just have to run configure like this: ./configure CFLAGS=’O0 -g’ CXXFLAGS=’-O0 -g’. Unfortunately many people mangle their autotools cflags in config files so the above command might not work. In this case the only fix is to inspect all autotools config files and fix them yourself.

The other trick is about reducing the amount of generated code. If two different source files use vector<int>, the compiler has to generate the complete vector code in both of them. During linking (discussed in the next chapter) one of them is just discarded. There is a way to tell the compiler not to generate the code in the other file using a technique that was introduced in C++0x called extern templates. They are used like this.

This instructs the compiler not to generate vector code when compiling file B. The linker makes it use the code generated in file A.

Build speedup tools

CCache is an application that stores compiled object code into a global cache. If the same code is compiled again with the same compiler flags, it grabs the object file from the cache rather than running the compiler. If you have to recompile the same code multiple times, CCache may offer noticeable speedups.

A tool often mentioned alongside CCache is DistCC, which increases parallelism by spreading the build to many different machines. If you have a monster machine it may be worth it. On regular laptop/desktop machines the speed gains are minor (it might even be slower).

Precompiled headers

Precompiled headers is a feature of some C++ compilers that basically serializes the in-memory representation of parsed code into a binary file. This can then be read back directly to memory instead of reparsing the header file when used again. This is a feature that can provide massive speedups.

Out of all the speedup tricks listed in this post, this has by far the biggest payoff. It turns the massively slow STL includes into, effectively, no-ops.

So why is it not used anywhere?

Mostly it comes down to poor toolchain support. Precompiled headers are fickle beasts. For example with GCC they only work between two different compilation units if the compiler switches are exactly the same. Most people don’t know that precompiled headers exist, and those that do don’t want to deal with getting all the details right.

CMake does not have direct support for them. There are a few modules floating around the Internet, but I have not tested them myself. Autotools is extremely precompiled header hostile, because its syntax allows for wacky and dynamic alterations of compiler flags.

Faster Linking

When the compiler compiles a file and comes to a function call that is somewhere outside the current file, such as in the standard library or some other source file, it effectively writes a placeholder saying “at this point jump to function X”. The linker takes all these different compiled files and connects the jump points to their actual locations. When linking is done, the binary is ready to use.

Linking is surprisingly slow. It can easily take minutes on relatively large applications. As an extreme case, linking the Chromium browser on ARM takes 3 gigs of RAM and takes 18 hours.

Yes, hours.

The main reason for this is that the standard GNU linker is quite slow. Fortunately there is a new, faster linker called Gold. It is not the default linker yet, but hopefully it will be soon. In the mean time you can install and use it manually.

A different way of making linking faster is to simply cut down on these symbols using a technique called symbol visibility. The gist of it is that you hide all non-public symbols from the list of exported symbols. This means less work and memory use for the linker, which makes it faster.

Conclusions

Contrary to popular belief, compiling C++ is not actually all that slow. The STL is slow and most build tools used to compile C++ are slow. However there are faster tools and ways to mitigate the slow parts of the language.

Using them takes a bit of elbow grease, but the benefits are undeniable. Faster build times lead to happier developers, more agility and, eventually, better code.

Fewer people know why, or how to make it faster. Other people do, for example the developers at Remedy made the engine of Alan Wake compile from scratch in five minutes. The payoff for this is increased productivity, because the edit-compile-run cycle gets dramatically faster.

There are several ways to speed up your compiles. This post looks at reworking your #includes.

Quite a bit of C++ compilation time is spent parsing headers for STL, Qt and whatever else you may be using. But how long does it actually take?

To find out, I wrote a script to generate C++ source. You can download it here. What it does is generate source files that have some includes and one dummy function. The point is to simulate two different use cases. In the first each source file includes a random subset of the includes. One file might use std::map and QtCore, another one might use Boost’s strings and so on. In the second case all possible includes are put in a common header which all source files include. This simulates “maximum developer convenience” where all functions are available in all files without any extra effort.

By default the script produces 100 source files. When the includes are kept in individual files, compiling takes roughly a minute. When they are in a common header, it takes three minutes.

Remember: the included STL/Boost/Qt4 functionality is not used in the code. This is just the time spent including and parsing their headers. What this example shows is that you can remove 2 minutes of your build time, just by including C++ headers smartly.

The delay scales linearly. For 300 files the build times are 2 minutes 40 seconds and 7 minutes 58 seconds. That’s over five minutes lost on, effectively, no-ops. The good news is that getting rid of this bloat is relatively easy, though it might take some sweat.

Never include any (internal) header in another header if you can use a forward declaration. Include the header in the implementation file.

Never include system headers (STL, etc) in your headers unless absolutely necessary, such as due to inheritance. If your class uses e.g. std::map internally, hide it with pImpl. If your class API requires these headers, change it so that it doesn’t or use something more lightweight (e.g. std::iterator instead of std::vector).

Never, never, ever include system stuff in your public headers. That slows down not just your own compilation time, but also every single user of your library. The only exception is when your library is a plugin or extension to an existing library and even then your includes need to be minimal.

Last week was the Bazaar sprint, which was fantastic and tiring. Somehow even the people who’d been at UDS just before made it through five packed days of fixing bugs, preparing releases, and debugging package imports. We were most hospitably hosted at the Canonical offices a long way up Millbank tower. But even those who couldn’t be there in person to enjoy the view were part of the experience. At home in the Ukraine Alexander wore his Bazaar shirt in support during the first day. On IRC larstiq and santagada ran the test suite on pypy and investigated incompatibilities. And all week we had a small robot John sitting in the middle of the table on the line from the Netherlands, working on performance bugs and offering helpful advice.

There were two new faces introduced. Max has been a stalwart maintaining the ~bzr PPAs and getting daily builds working. Jonathan is joining the Bazaar team on rotation from Kubuntu, which is very exciting for fans of qbzr. He started getting to know bzrlib by taking on some bugs tagged ‘easy’ and pair programming on harder ones. It was a bit tough to keep track of everything going on, but good progress was made on the Ubuntu Distributed Development front, the translation framework branches Naoki put together were landed, and lots of pet bugs were fixed. Download bzr 2.4b3 now to see the rest of the results for yourself.

After these long days in front of screens a nice meal out was a welcome treat. Over dinner we even managed to get on to topics other than code on occasion. On Thursday evening everyone went to As You Like It at the Globe as groundlings. Even with the language barrier to overcome for some of the sprinters, the comedy lived up to the categorisation. Trying to use the cycle hire scheme to travel there and back proved more of an obstacle. The bikes themselves were fine, provided you could get past the terrible computer interface and persuade the system to let you rent them. Now, if only they took patches for that…

The last of my patches is queued up to land, so I figured I’d post an update about the performance improvements I’ve been working on. I’m also just excited about how well it has all come together.

There were essentially 3 changes that mattered for performance on large trees.

Fixing iter_entries_by_dir() to preload the data in Repository- optimal ordering rather than by-request ordering. In large trees this was causing us to thrash and become pathologically slow. In the 70,000-file test tree, thrashing took about 3 minutes, the preloading version takes about 15s. This affected a lot of our commands, though I guess the next two fixes would actually reduce the number of commands affected by this.

Fixing several code paths to use optimized iter_changes() rather than the generic iter_changes(). The generic path walks both inventories iter_entries_by_dir() and compares them. Our 2a format Repository can do iter_changes without loading the whole tree. (It internally uses a hash_trie to store the inventory, and so nodes with matching sub-trees can be skipped for comparison.) This generally shows up as something that was taking 15s (to load the whole inventory) dropping to <2s for the improved comparison. (bzr revert and bzr pull were both directly impacted here)

Changing WT.set_parent_trees([one_tree]) to update itself using current_basis.iter_changes(one_tree), rather than setting the state from scratch. This basically adds another case where we can avoid reading the whole inventory state again, which is another 15s to <2s sort of change. This only showed up after fixing (2), because once the tree is loaded, the other actions are generally pretty quick. (bzr up, bzr pull)

This is the chart I put together for “whats-new-in-2.4.txt”. bzr-2.3.2 will have fix (1), but not (2) or (3), to give a feel for how much of an impact different fixes have had.

Ubuntu 9.10 (“Karmic Koala”) has some neat new features, and I want to bring your attention to a few in turn. I am not taking credit for any of them. I didn’t have much to do with any of these wonderful features, unless I mention it.

The booting software that ships in Ubuntu 9.10 is still safe, stable, and slow, relative to what else the Ubuntu folks have made. If you have a non-solid-state, regular old rotational disk-drive, then you can shave off maybe another 20 seconds by using a more specialized linux kernel and read-ahead implementation. How? It’s dead simple! From a terminal (Applications, Accessories) run these three commands:

Are you a Bazaar fan and need some help explaining to others why Bazaar is cool? I published a document last week called Why switch to Bazaar? that may help. I’ve tried hard to present the big picture together with some concrete examples, explaining what we stand for and what that means to users, teams and communities in reality. Furthermore, if you tried Bazaar 1.x but found it too slow or inefficient, I’m sure you’ll find the Bazaar 2.0 benchmarks included in the document great news.

I hope you find the document interesting and food for thought. If there are any mistakes or you’ll like to translate the document to another language, please let me know.