Squeezing Every Last Bit of Performance Out of The Linux Toolchain

Magic of GCC PGO

On Friday I finally got gold to produce a prelinkable static binary (bug). I also got around to trying out GCC profile-guided optimization with the debloatifying -freorder-blocks-and-partition option. This option splits every profiled function into cold and hot “functions”, then lumps all of the hot functions together.

PGO performance is amazingly close to that of binaries produced with icegrind (within 10%, based on page counts).

Build                    Startup Time   RSS (KB)
firefox.stock            2515 ms        49452
firefox.ordered          1919 ms        45344
firefox.static           2321 ms        49616
firefox.static.ordered   1577 ms        37072
firefox.static.pgo       1619 ms        38436

In the above table, ordered means icegrind was applied, and static means a fat firefox-bin. To generate the PGO profile I just started and closed Firefox, so it’s not too surprising that the results are similar to those of an explicitly ordered binary. RSS refers to how much memory is mmap()ed into the process (lower RSS usage means we are paging in less junk). I was not able to control the layout of the PGO builds; that will need some linker hackery to deal with split functions.

I think the numbers speak for themselves. Isn’t it scary how wasteful binaries are by default? It amazes me that Firefox can shrug off a significant amount of resource bloat without changing a single line of code. I think this is a good demonstration of why application developers should a) expect more out of their toolchain (Linux, GCC, Binutils) and b) contribute to their toolchain.

I think my next step is to tackle PGO Firefox builds (bug). From there I would like to teach icegrind to play well with PGO.

Pieces of The Startup Puzzle

It has been a long time since I first noticed how much library I/O sucks. After much digging, help from smart people on #gcc, and an LKML discussion, I think I finally have a pretty clear list of the remaining/in-progress things needed for Linux applications to start faster:

1. Wu Fengguang is making headway on smarter readahead that makes better use of available RAM and disk bandwidth. Firefox could be read in 512KB chunks instead of 128KB (a 4x page-fault reduction). Distributions should be aggressively testing this patch.

2. Better agreement on binary organization between the compile-time linker, the run-time linker, and the kernel (see the LKML discussion). This can shave off a handful of unneeded page faults per file.

3. A linker flag to specify how much of a particular library should be read in via madvise(). For example, any xulrunner app knows ahead of time that it needs large parts of libxul.so – might as well let the OS know.

4. Transparent read-only per-file ext4 compression (like OS X). Ted Ts'o indicated that this would be easy to add to ext4, but as far as I know nobody has jumped on it.

I think that with all of the above combined, we could load apps like Firefox at near-warm (0.5 second) speeds. Most of these are easy. #1 is hard, but it’s already being worked on. I’ll be happy to point someone at how to solve items 2-4 while I work on the Firefox side of things. The end result of all this should be more instant gratification for all Linux users.

This entry was posted on Monday, April 12th, 2010 at 3:16 pm and is filed under startup, Uncategorized.

@steve: Aren’t all libraries and executables mmap()ed in? But I assume that is hidden from the higher levels, so it will still work, just a little slower (assuming you can still efficiently find the right page to decompress and map in).

On a side note: PGO is the reason why I’m still not using FF 3.6 – that’s because there’s no PGO-based, CPU-tuned build of Firefox 3.6 available yet – aka Swiftweasel. FF 3.5 and 3.6 in the default vanilla binary flavor are huge resource hogs, which nearly brought me to converting back to FF 3.0 – till I found out about PGO-based builds like Swiftweasel. After that, I was happy again – works like a charm.

And seeing this data in here lets me hope this is going to be indeed a godawful improvement for FF, esp. for Linux / *x users like me

@Ben Bennet, skierpage: the disadvantage of jar/zip files is exactly that mmap()ing doesn’t work, and this wastes memory (the file ends up both in the cache and in the process’s memory). And the app has no control over how libraries are stored – that’s up to glibc’s ld.so. And since this idea is broken, glibc won’t implement it.
When you do it at the filesystem level, I guess the simplest way to make mmap() work is to keep the uncompressed file content in the page cache. At that point you don’t even need a perfect index of the file, just a good approximate one, so that what you uncompress will contain the right area. Actually, I guess compression is done in a more local, per-chunk way (since LZ77-based algorithms work on a sliding window anyway, you probably don’t lose much). Not sure about LZMA, though.