Startup: Backward Constructors

This post is a result of debugging bug 561842. Turns out one needs to go far beyond lumping libraries together to reap startup benefits.

I made a pdf to illustrate the cost centers of loading libxul.so (the essence of Firefox).

With Icegrind I demonstrated that better binary layout can significantly improve application startup. However, I still didn’t have a breakdown of why loading binaries is so damn inefficient. That’s what the above pdf is about.

Michael Meeks pointed me at the funny backwards IO pattern in his IO logs. I even made fun of how by default libxul.so is read mostly via backwards IO. Once I assigned userspace symbols to my pagefault log, it became clear that the backwards IO pattern was entirely due to library initializers. The C++ compiler generates code that runs on library initialization to initialize globals and run relevant C++ constructors. In C one can attach the GNU “constructor” attribute to a function to participate in this mayhem.
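To make the “constructor” attribute concrete, here is a minimal sketch in C. The names g_initialized, init_before_main and initializer_has_run are made up for illustration; only the attribute itself is the GCC/Clang extension mentioned above.

```c
/* Hypothetical flag, set by the initializer below. */
static int g_initialized = 0;

/* GCC/Clang extension: the loader runs this function when the
   binary (or shared library) is initialized, before main(). */
__attribute__((constructor))
static void init_before_main(void)
{
    g_initialized = 1;
}

/* Returns 1 once the initializer has run. */
int initializer_has_run(void)
{
    return g_initialized;
}
```

Calling initializer_has_run() from main() should already return 1, since the initializer ran during load. Every such function is one more piece of code (and one more page) touched before main() starts.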

Running Backwards?

Ian Lance Taylor clued me in on why these things run backwards. When one links the program, the object files are laid out sequentially. Static libraries are specified after the code that depends on them. Once an object is linked, the easiest way to make sure that libraries are initialized before their users is to invoke initializers backwards. The list of initializers is stored in the .ctors section and they are loaded by libgcc.
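The ordering argument can be modelled in a few lines of C. This is only a toy model, not the actual libgcc code: the array ctors stands in for the .ctors section, and the three init functions (all hypothetical) stand in for an application and two libraries it depends on, listed in link order.

```c
#include <stddef.h>

typedef void (*init_fn)(void);

/* Records the order in which the toy initializers ran. */
static char order_log[8];
static size_t order_len = 0;

static void record(char tag)
{
    if (order_len < sizeof order_log - 1)
        order_log[order_len++] = tag;
}

static void init_app(void)    { record('A'); }  /* depends on both libs */
static void init_libfoo(void) { record('F'); }  /* depends on libbar */
static void init_libbar(void) { record('B'); }  /* depends on nothing */

/* Link order: users first, the libraries they depend on afterwards. */
static const init_fn ctors[] = { init_app, init_libfoo, init_libbar };

/* Walk the list from the end so dependencies initialize before their users. */
void run_ctors_backwards(void)
{
    for (size_t i = sizeof ctors / sizeof ctors[0]; i-- > 0; )
        ctors[i]();
}

/* Returns the recorded order as a NUL-terminated string. */
const char *ctor_order(void)
{
    order_log[order_len] = '\0';
    return order_log;
}
```

After one call to run_ctors_backwards(), ctor_order() yields "BFA": the library with no dependencies first, the application last. The price is that the walk proceeds from the end of the file towards the beginning.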

In Mozilla (and likely other C++ codebases) these global initializers are more or less evenly scattered throughout the codebase. By the time main() is run, most of the program has been paged in, in an unfortunately inefficient manner.

Run Faster Please?

The most interesting part about all this is that the compiling toolchain can make a rather precise guess at how a large part of the initial program execution is going to go. To test this theory I wrote my best Mozilla patch ever.

One can place a function near the beginning of the library file and another one at the end (with a “constructor” attribute). The function at the end runs first and it can figure out the approximate range of memory that will need to be paged in and madvise() it. This results in a 5x reduction in libxul pagefaults. Unfortunately since constructors execute backwards and readahead forwards, the constructor execution stalls to wait for readahead, so the speedup is rather hard to detect.
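A rough sketch of the idea in C follows. It is not the actual Mozilla patch: the marker names are hypothetical, and getting one marker placed near the start of the library and the other near the end is a link-order matter that a single source file cannot guarantee, so the code simply takes whichever address is lower as the start of the range.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* expose madvise() on glibc in strict modes */
#endif
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round an address down or up to a multiple of page (a power of two). */
uintptr_t page_floor(uintptr_t addr, uintptr_t page)
{
    return addr & ~(page - 1);
}

uintptr_t page_ceil(uintptr_t addr, uintptr_t page)
{
    return (addr + page - 1) & ~(page - 1);
}

/* Hypothetical marker, assumed to be linked near the start of the library. */
void startup_marker_first(void) {}

/* Hypothetical marker, assumed to be linked near the end of the library.
   As a constructor it runs at load time and asks for readahead. */
__attribute__((constructor))
void startup_marker_last(void)
{
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t a = (uintptr_t)&startup_marker_first;
    uintptr_t b = (uintptr_t)&startup_marker_last;
    uintptr_t lo = page_floor(a < b ? a : b, page);
    uintptr_t hi = page_ceil(a < b ? b : a, page);

    /* Purely advisory, so a failure is ignored. */
    (void)madvise((void *)lo, hi - lo, MADV_WILLNEED);
}
```

madvise() requires a page-aligned start address, hence the rounding helpers. MADV_WILLNEED only hints that the range will be needed soon; the kernel is free to read it ahead at its own pace, which is why constructors running in the opposite direction can still stall on it.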

Run Forward Faster!

Depressed about my hack failing to make a dent in startup time, I patched gcc to run initializers in a forward order (and reversed the function-placement logic in the above patch). Now readahead happened in the same direction as library initialization and my Firefox started 30% faster! I wrapped this up into a standalone gcc patch (speed up any bloated C++ startup with a simple change to the compiler!). Note that this hack reverses the library initialization order discussed above; this happens to not be a problem for Mozilla.

Conclusion: Order Matters!

The linker can reverse the per-library initializers such that initializers run forward, but cross-library dependencies are honoured. That in itself isn’t enough to boost startup without cleverer readahead on the kernel side (or application-side hacks).

It’s weird to have initializers page in most of the binary. An interesting optimization would be to have the compiler transitively mark functions reached by library initialization and place those in a .text.initializers section. Then one could have the linker group the initializers together.
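Done by hand for a single function, the tagging might look like the sketch below. It assumes an ELF target with GCC or Clang, and the table and function names are invented for illustration. The compiler support proposed above would have to apply the same tag automatically to every generated initializer and everything it calls.

```c
/* Hypothetical table filled in at load time. */
static int g_table[4];

/* Place this initializer in its own named section so that a linker
   could group all such functions next to each other. */
__attribute__((section(".text.initializers"), constructor))
static void fill_table(void)
{
    for (int i = 0; i < 4; i++)
        g_table[i] = i * i;
}

/* Accessor used after startup. */
int table_at(int i)
{
    return g_table[i];
}
```

With a typical default linker script the .text.initializers input section is still folded into .text; the gain comes only if the tagged functions end up contiguous, so that initialization touches a handful of adjacent pages instead of pages scattered across the whole library.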

Plans

I haven’t made up my mind on how to proceed. This madvise() hack + a simple linker patch could be deployed more easily than icegrind. This hack also appears to be as performant as a static firefox build + icegrind (due to inadequate kernel readahead without madvise()). Icegrind + libxul.so isn’t quite as efficient. I have a feeling that we’ll end up with a combination of icegrind + some form of the initializer madvise() hack.

This entry was posted on Thursday, May 27th, 2010 at 4:35 pm and is filed under startup.

I did some little investigation which got me interesting data: while I didn’t check all these constructor functions, all that I checked are due to constructs like this in the source:
static PRLogModuleInfo *gWordCacheLog = PR_NewLogModule("wordCache");

So, it turns out only a few are PR_NewLogModules.
Most are actually due to cycle collection (a great number of which happen in SVG code, which is pretty much pointless at startup), some others are due to statically initialized instances of some classes that look like singletons, some others only contain a “ret” instruction (in which case one can wonder why gcc emits them), some others are due to the use of iostreams (in chromium code), and a few are html5 parser initialization of static data.

Most of these constructors actually don’t do much, only copying data around, not even calling functions. Some do call functions, though.

Anyways, unfortunately, the __attribute__((section)) thing can’t work with these static initializations…

I tried modding the compiler to do the equivalent of __attribute__ ((section (".text.initializers")));, but basically every constructor, generated func, etc must be tagged correctly. That’s a decent chunk of work to do right, so I decided to leave that as an exercise for the reader.

Most of the ones I checked were cyclecollector stuff. It would be an interesting project to make logging/cyclecollection initialize lazily. Thanks for identifying logging as a problem too.
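Lazy initialization of a log module could look roughly like this in C. LogModule and new_log_module() are hypothetical stand-ins for the real NSPR type and PR_NewLogModule(), and the sketch is not thread-safe; it only shows the shape of the change.

```c
#include <stddef.h>

/* Hypothetical stand-in for the real log module type. */
typedef struct {
    const char *name;
} LogModule;

/* Counts how often a module was actually created. */
static int g_creations = 0;

/* Hypothetical stand-in for the real module constructor. */
static LogModule *new_log_module(const char *name)
{
    static LogModule module;
    module.name = name;
    g_creations++;
    return &module;
}

/* Lazy accessor: nothing runs at load time, the module is
   created on first use. Not thread-safe as written. */
LogModule *word_cache_log(void)
{
    static LogModule *log = NULL;
    if (log == NULL)
        log = new_log_module("wordCache");
    return log;
}

/* Exposed so callers can observe when creation happened. */
int log_creations(void)
{
    return g_creations;
}
```

Callers would use word_cache_log() wherever they previously read the global. The static initializer disappears, so the code around it no longer has to be paged in before main().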

ld doesn’t do any linktime optimization. gcc 4.5 lto is too buggy to even bother with atm. It would be nice to engage llvm folks on this, perhaps someone there will want to make application startup competitive.

Not to be trolling, but after reading several posts in this blog I get the impression that this work, while certainly interesting and insightful, is attacking the wrong problems with firefox/mozilla.

For example firefox’s current caching implementation is so horrible that its performance impact on startup times is orders of magnitude greater than any library load ordering, assuming you’re reloading tabs. [disk cache is restricted to 8192 items and has a hashing algorithm with lots of collisions, evicting valid items => open 100 tabs with 10 files each, restart FF, see a slow reload-fest]