November 16th, 2018

As of last nightly (20181115100051), Firefox now supports Wayland on Linux, thanks to the work from Martin Stransky and Jan Horak, mostly.

Before that, it was possible to build your own Firefox with Wayland support (and Fedora does it), but now the downloads from mozilla.org come with Wayland support out of the box for the first time.

However, being experimental and all, the Wayland support is not enabled by default, meaning by default, you’ll still be using XWayland. To enable wayland support, first set the GDK_BACKEND environment variable to wayland.

To verify whether Wayland support is enabled, go to about:support, and check “WebGL 1 Driver WSI Info” and/or “WebGL 2 Driver WSI Info”. If they say something about GLX, Wayland support is not enabled. If they say something about EGL, it is. I filed a bug to make it more obvious what is being used.

It’s probably still a long way before Firefox enables Wayland support on Wayland by default, but we reached a major milestone here. Please test and report any bug you encounter.

Update: I should mention that should you build your own Firefox, as long as your Gtk+ headers come with Wayland support, you’ll end up with the same Wayland support as the one shipped by Mozilla.

As of next nightly (as of writing, obviously), all tier-1 platforms are now built with clang with LTO enabled. Yes, this means Linux, Mac and Android arm, aarch64 and x86. Linux builds also have PGO enabled.

Mac and Android builds were already using clang, so the only difference is LTO being enabled, which brought some performance improvements.

The most impressive difference, though, was on Linux, where we’re getting more than 5% performance improvements on most Talos tests (up to 18% (!) on some tests) compared to GCC 6.4 with PGO. I must say I wasn’t expecting switching from GCC to clang would make such a difference. And that is with clang 6. A quick test with upcoming clang 7 suggests we’d additionally get between 2 and 5% performance improvement from an upgrade, but our static analysis plugin doesn’t like it.

This doesn’t mean GCC is being unsupported. As a matter of fact, we still have automated jobs using GCC for some static analysis, and we also have jobs ensuring everything still builds with a baseline of GCC 6.x.

You might wonder if we tried LTO with GCC, or tried upgrading to GCC 8.x. As a matter of fact, I did. Enabling LTO turned up linker errors, and upgrading to GCC 7.x turned up breaking binary compatibility with older systems, and if I remember correctly had some problems with our test suite. GCC 8.1 was barely out when I was looking into this, and we all know to stay away from any new major GCC version until one or two minor updates. Considering the expected future advantages from using clang (cross-language inlining with Rust, consistency between platforms), it seemed a better deal to switch to clang than to try to address those issues.

Update: As there’s been some interest on reddit and HN, and I failed to mention it originally, it’s worth noting that comparing GCC+PGO vs. clang+LTO or GCC+PGO vs. clang+PGO was a win for clang overall in both cases, although GCC was winning on a few benchmarks. If I remember correctly, clang without PGO/LTO was also winning against GCC without PGO.

Anyways, what led me on this quest was a casual conversation at our last All Hands, where we were discussing possibly turning on LTO on Mac, and how that should roughly just be about turning a switch.

Famous last words.

At least, that’s a somehow reasonable assumption. But when you have a codebase the size of Firefox, you’re up for “interesting” discoveries.

This involved compiler bugs, linker bugs (with a special mention for a bug in ld64 that Apple has apparently fixed in Xcode 9 but hasn’t released the source of), build system problems, elfhack issues, crash report problems, clang plugin problems (would you have guessed that __attribute__((annotate("foo"))) can affect the generated machine code?), sccache issues, inline assembly bugs (getting inputs, outputs and clobbers correctly is hard), binutils bugs, and more.

I won’t bother you with all the details, but here we are, 3 months later with it all, finally, mostly done. Counting only the bugs assigned to me, there are 77 bugs on bugzilla (so, leaving out anything in other bug trackers, like LLVM’s). Some of them relied on work from other people (most notably, Nathan Froyd’s work to switch to clang and then non-NDK clang on Android). This spread over about 150 commits on mozilla-central, 20 of which were backouts. Not everything went according to plan, obviously, although some of those backouts were on purpose as a taskcluster trick.

Hopefully, this sticks, and Firefox 64 will ship built with clang with LTO on all tier-1 platforms as well as PGO on some. Downstreams are encouraged to do the same if they can. The build system will soon choose clang by default on all builds, but won’t enable PGO/LTO.

As a bonus, as of a few days ago, Linux builds are also finally using Position Independent Executables, which improves Address Space Layout Randomization for the few things that are in the executables instead of some library (most notably, mozglue and the allocator). This was actually necessary for LTO, because clang doesn’t build position independent code in executables that are not PIE (but GCC does), and that causes other problems.

Work is not entirely over, though, as more inline assembly bugs might be remaining only not causing visible problems by sheer luck, so I’m now working on a systematic analysis of inline assembly blocks with our clang plugin.

What’s new since 0.4.0?

git-cinnabar-helper is now mandatory. You can either download one with git cinnabar download on supported platforms or build one with make.

Performance and memory consumption improvements.

Metadata changes require to run git cinnabar upgrade.

Mercurial tags are consolidated in a separate (fake) repository. See the README file.

Updated git to 2.18.0 for the helper.

Improved memory consumption and performance.

Improved experimental support for pushing merges.

Support for clonebundles for faster clones when the server provides them.

Removed support for the .git/hgrc file for mercurial specific configuration.

Support any version of Git (was previously limited to 1.8.5 minimum)

Git packs created by git-cinnabar are now smaller.

Fixed incompatibilities with Mercurial 3.4 and >= 4.4.

Fixed tag cache, which could lead to missing tags.

The prebuilt helper for Linux now works across more distributions (as long as libcurl.so.4 is present, it should work)

Properly support the pack.packsizelimit setting.

Experimental support for initial clone from a git repository containing git-cinnabar metadata.

Now can successfully clone the pypy and GNU octave mercurial repositories.

More user-friendly errors.

Development process changes

It took about 6 months between version 0.3 and 0.4. It took more than 18 months to reach version 0.5 after that. That’s a long time to wait for a new version, considering all the improvements that have happened under the hood.

From now on, the release branch will point to the last tagged release, which is roughly the same as before, but won’t be the default branch when cloning anymore.

The default branch when cloning will now be master, which will receive changes that are acceptable for dot releases (0.5.x). These include:

Changes in behavior that are backwards compatible (e.g. adding new options which default to the current behavior).

Changes that improve error handling.

Changes to existing experimental features, and additions of new experimental features (that require knobs to be enabled).

Changes to Continuous Integration/Tests.

Git version upgrades for the helper.

The next branch will receive changes for the next “major” release, which as of writing is planned to be 0.6.0. These include:

April 2nd, 2017

Since version 0.4.0, git-cinnabar has a few hidden experimental features. Two of them are available in 0.4.0, and a third was recently added on the master branch.

The basic mechanism to enable experimental features is to set a preference in the git configuration with a comma-separated list of features to enable, or all, for all of them. That preference is cinnabar.experiments.

Any means to set a git configuration can be used. You can:

Add the following to .git/config:

[cinnabar]
experiments=feature

Or run the following command:

$ git config cinnabar.experiments feature

Or only enable the feature temporarily for a given command:

$ git -c cinnabar.experiments=featurecommand arguments

But what features are there?

wire

In order to talk to Mercurial repositories, git-cinnabar normally uses mercurial python modules. This experimental feature allows to access Mercurial repositories without using the mercurial python modules. It then relies on git-cinnabar-helper to connect to the repository through the mercurial wire protocol.

As of version 0.4.0, the feature is automatically enabled when Mercurial is not installed.

merge

Git-cinnabar currently doesn’t allow to push merge commits. The main reason for this is that generating the correct mercurial data for those merges is tricky, and needs to be gotten right.

In version 0.4.0, enabling this feature allows to push merge commits as long as the parent commits are available on the mercurial repository. If they aren’t, you need to first push them independently, and then push the merge.

On current master, that limitation doesn’t exist anymore ; you can just push everything in one go.

The main caveat with this experimental support for pushing merges is that it currently doesn’t handle the case where a file was moved on one of the branches the same way mercurial would (i.e. the information would be lost to mercurial users).

clonebundles

As of mercurial 3.6, Mercurial servers can opt-in to providing pre-generated bundles, which, when clients support it, takes CPU load off the server when a clone is performed. Good for servers, and usually good for clients too when they have a fast network connection, because downloading a pre-generated bundle is usually faster than waiting for the server to generate one.

As of a few days ago, the master branch of git-cinnabar supports cloning using those pre-generated bundles, provided the server advertizes them (mozilla-central does).

An interesting thing to note here is that the glibc allocator runaway memory use was, this time, more pronounced on 0.4.0 than on master. It was the opposite originally, but as I mentioned in the past ASLR is making it not happen exactly the same way each time.

While I’m here, one thing I failed to mention in the previous posts is that all these measurements were done by cloning a local mercurial clone of mozilla-central, served from localhost via HTTP to eliminate the download time from hg.mozilla.org. And while mozilla-central itself has received new changesets since the first post, the local clone has not been updated, such that all subsequent clone tests I did were cloning the exact same repository under the exact same circumstances.

After last blog post, I focused on the low hanging fruits identified so far:

The overall clone is now about 11 minutes faster than 0.4.0 (and about 50 minutes faster than master as of be75326!)

Non-shared memory use of the git-remote-hg process stays well under 2GB during the whole clone, with no spike at the end.

git-cinnabar-helper now uses more memory, but the sum of both processes is less than what it used to be, even when compensating for the glibc memory allocator issue. One thing to note is that while the git-cinnabar-helper memory use goes above 2GB at the end of the clone, a very large part is due to the pack window size being 1GB on 64-bits (vs. 32MB on 32-bits). Memory usage should stay well under the 2GB address space limit on a 32-bits system.

CPU usage is well above 100% for most of the clone.

On a more granular level:

The “Import manifests” phase is now 13 minutes faster than it was in 0.4.0.

The “Read and import files” phase is still almost 4 minutes slower than in 0.4.0.

The “Import changesets” phase is still almost 2 minutes slower than in 0.4.0.

But the “Finalization” phase is now 3 minutes faster than in 0.4.0.

What this means is that there’s still room for improvement. But at this point, I’d rather focus on other things.

Logging all the memory allocations with the python allocator disabled still resulted in a 6.5GB compressed log file, containing 2.6 billion calls to malloc, calloc, free and realloc (down from 2.7 billions in be75326). The number of allocator calls done by the git-remote-hg process is down to 2.25 billions (from 2.34 billion in be75326).

Surprisingly, while more things were moved to the helper, it still made less allocations than in be75326: 345 millions, down from 363 millions. Presumably, this is because the number of commands processed by the fast-import code was reduced.

Let’s now take a look at the various metrics we analyzed previously (the horizontal axis represents the number of allocator calls that happened before the measurement):

A few observations to make here:

The allocated memory (requested bytes) is well below what it was, and the spike at the end is entirely gone. It also more closely follows the amount of raw data we’re holding on to (which makes sense since most of the bookkeeping was moved to the helper)

The number of live allocations (allocated memory pointers that haven’t been free()d yet) has gone significantly down as well.

The cumulated[*] bytes are now in a much more reasonable range, with the lower bound close to the total amount of data we’re dealing with during the clone, and the upper bound slightly over twice that amount (the upper bound for the be75326 is not shown here, but it was around 45TB; less than 7TB is a big improvement).

There are less allocator calls during the first phases and the “Importing changesets” phase, but more during the “Reading and importing files” and “Importing manifests” phases.

[*] The upper bound is the sum of all sizes ever given to malloc, calloc, realloc etc. and the lower bound is the same, but removing the size of allocations passed as input to realloc (in practical words, this pretends reallocs never happened and that the final size for a given reallocated pointer is the one that counts)

So presumably, some of the changes led to more short-lived allocations. Considering python uses its own allocator for sizes smaller than 512 bytes, it’s probably not so much of a problem. But let’s look at the distribution of buffer sizes (including all sizes given to realloc).

(Bucket size is 16 bytes)

What is not completely obvious from the logarithmic scale is that, in fact, 98.4% of the allocations are less than 512 bytes with the current master (35c18e7), and they were 95.5% with be75326. Interestingly, though, in absolute numbers, there are less allocations smaller than 512 bytes in current master than in be75326 (1,194,268,071 vs 1,214,784,494). This suggests the extra allocations that happen during some phases are larger than that.

There are clearly less allocations across the board (apart from very few exceptions), and close to an order of magnitude less allocations larger than 1MiB. In fact, widening the bucket size to 32KiB shows an order of magnitude difference (or close) for most buckets:

An interesting thing to note is how some sizes are largely overrepresented in the data with buckets of 16 bytes, like 768, 1104, 2048, 4128, with other smaller bumps for e.g. 2144, 2464, 2832, 3232, 3696, 4208, 4786, 5424, 6144, 6992, 7920… While some of those are powers of 2, most aren’t, and some of them may actually represent objects sized with a power of 2, but that have an extra PyObject overhead.

While looking at allocation stats, I got to wonder what the lifetimes of those allocations looked like. So I scanned the allocator logs and measured the distance between when an allocation is made and when it is freed, ignoring reallocs.

To give a few examples of what I mean, the following allocation for p gets a lifetime of 0:

void *p = malloc(42);
free(p);

The following a lifetime of 1:

void *p = malloc(42);
void *other = malloc(42);
free(p);

And the following a lifetime of 1 as well:

void *p = malloc(42);
p = realloc(p, 84);
free(p);

(that is, it is not counted as two malloc/free pairs)

The further away the free is from the corresponding malloc, the larger the lifetime. And the largest the lifetime can ever be is the total number of allocator function calls minus two, in the hypothetical case the very first allocation is freed as the very last (minus two because we defined the lifetime as the distance).

What comes out of this data:

As expected, there are more short-lived allocations in 35c18e7.

Around 90% of allocations have a lifetime spanning 10% of the process life or less. This is a rather surprisingly large amount of allocations with a very large lifetime.

Around 80% of allocations have a lifetime spanning 0.01% of the process life or less.

The median lifetime is around 0.0000002% (2*10-7%) of the process life, which, in absolute terms is around 500 allocator function calls between a malloc and a free.

If we consider every imported changeset, manifest and file to require a similar number of allocations, and considering there are about 2.7M of them in total, each spans about 3.7*10-7%. About 53% of all allocations on be75326 and 57% on 35c18e7 have a lifetime below that. Whenever I get to look more closely to memory usage again, I’ll probably look at the data separately for each individual phase.

One surprising fact, that doesn’t appear on the graph because of the logarithmic scale not showing “0” on the horizontal axis, is that 9.3% on be75326 and 7.3% on 35c18e7 of all allocations have a lifetime of 0. That is, whatever the code using them is doing, it’s not allocating or freeing anything else, and not reallocating them either.

All in all, what the data shows is that we’re definitely in a better place now than we used to be a few days ago, and that there is still work to do on the memory front, but:

As mentioned in a previous post, there are bigger wins to be had from not keeping manifests data around in memory at all, and by importing it directly instead.

In time, a lot of the import code is meant to move to the helper, where the constraints are completely different, and it might not be worth spending time now on reducing the memory usage of python code that might go away soon(ish). The situation was bad and necessitated action rather quickly, but we’re now in a place where it’s not as bad anymore.

So at this point, I won’t look any deeper into the memory usage of the git-remote-hg python process, and will instead focus on the planned metadata storage changes. They will allow to share the metadata more easily (allowing faster and more straightforward gecko-dev graft), and will allow to import manifests earlier, which, as mentioned already, will help reduce memory use, but, more importantly, will allow to do more actual work while downloading the data. On slow networks, this is crucial to make clones and pulls faster.

One thing that was mentioned in the first followup is that reducing the amount of realloc and substring copies made the cloning more than 15 minutes faster on master. But the same code exists in 0.4.0, so this isn’t part of the difference.

So what’s going on? Looking at the CPU usage during the clone is enlightening.

On 0.4.0:

On master:

(Note: the data gathering is flawed in some ways, which explains why the git-remote-hg process goes above 100%, which is not possible for this python process. The data is however good enough for the high level analysis that follows, so I didn’t bother to get something more acurate)

On 0.4.0, the git-cinnabar-helper process was saturating one CPU core during the File import phase, and the git-remote-hg process was saturating one CPU core during the Manifest import phase. Overall, the sum of both processes usually used more than one and a half core.

On master, however, the total of both processes barely uses more than one CPU core.

Essentially, before those changes, git-remote-hg would send instructions to git-fast-import (technically, git-cinnabar-helper, but in this case it’s only used as a wrapper for git-fast-import), and use marks to track the git objects that git-fast-import created.

After those changes, git-remote-hg asks git-fast-import the git object SHA1 of objects it just asked to be created. In other words, those changes replaced something asynchronous with something synchronous: while it used to be possible for git-remote-hg to work on the next file/manifest/changeset while git-fast-import was working on the previous one, it now waits.

The changes helped simplify the python code, but made the overall clone process much slower.

If I’m not mistaken, the only real use for that information is for the mapping of mercurial to git SHA1s, which is actually rarely used during the clone, except at the end, when storing it. So what I’m planning to do is to move that mapping to the git-cinnabar-helper process, which, incidentally, will kill not 2, but 3 birds with 1 stone:

It will restore the asynchronicity, obviously (at least, that’s the expected main outcome).

Storing the mapping in the git-cinnabar-helper process is very likely to take less memory than what it currently takes in the git-remote-hg process. Even if it doesn’t (which I doubt), that should still help stay under the 2GB limit of 32-bit processes.

The whole thing that spikes memory usage during the finalization phase, as seen in previous post, will just go away, because the git-cinnabar-helper process will just have prepared the git notes-like tree on its own.

So expect git-cinnabar 0.5 to get moar faster, and to use moar less memory.