This time I didn’t bother to collect 50 startup times to get more accurate figures, mostly because as of writing, I don’t have scripts to gather these data automatically (especially on mobile devices).

| Platform | Time spent before XRE_main without relocations packing (ms) | With relocations packing (ms) |
|---|---|---|
| GNU/Linux x86 | 1,362 | 1,273 |
| GNU/Linux x86-64 | 1,643 | 1,318 |
| Maemo 5, n900 | 1,717 | 1,427 |
| Android 2.2, HTC Desire | 4,250 | 3,568 |

All the numbers above were taken after a fresh boot with a more or less recent nightly (the n900 build was from a week ago, the others are from today). The Android number with relocations packing was obtained from a build that miraculously started without crashing (relocations packing apparently unveils a dynamic linker problem); it might be wrong.

The timings that are currently reported through this API are the following:

process is when the Firefox process starts

main is when the XRE_main function is called (one of the first functions actively called)

firstPaint is when a web page has been displayed for the first time to the user

sessionRestored is pretty much self-explanatory

There are apparently still a few rough edges, but it is still quite valuable information. As such, I wrote a little (restart-less) extension that displays this information when you go to the about:startup URL. It doesn’t display the raw values, but instead the number of milliseconds elapsed between process startup and each of the later events above.
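The conversion from raw values to elapsed milliseconds can be sketched as follows. This is not the extension's code: it assumes the raw values are timestamps in milliseconds, and the function name and the sample numbers are made up for illustration.

```python
def elapsed_since_process(timings):
    """Return the milliseconds elapsed between 'process' and every later event."""
    start = timings["process"]
    return {name: value - start
            for name, value in timings.items()
            if name != "process"}


# Hypothetical raw timestamps, in milliseconds (not measured values).
raw = {
    "process": 1_290_000_000_000,
    "main": 1_290_000_000_120,
    "firstPaint": 1_290_000_001_850,
    "sessionRestored": 1_290_000_002_300,
}
relative = elapsed_since_process(raw)
for name, milliseconds in relative.items():
    print(f"{name}: {milliseconds} ms after process start")
```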

I would like to replace my current blog with a system that mostly generates static pages, with support for comments. I’d like it to take files as input for blog posts (I’d like to store them in git) instead of database tables, and to have a flexible markup language (flexible in that it would allow customizing the HTML output) and flexible templates.

Ikiwiki might come close to that, though I haven’t looked into details. Dear lazyweb, would you know other software that’d fulfill my needs, or come close?

Recent Iceweasel betas allow replacing the menu bar with an Iceweasel button. This is not enabled by default, but right-clicking on the menu bar lets you disable it, which enables the Iceweasel button.

The button is not exactly very appealing, and takes quite a lot of horizontal space on the tab bar. But with a few lines of CSS, this can fortunately be changed. Edit the chrome/userChrome.css file under your user profile, and add the following lines:

Recent Firefox betas replaced the menu bar with a Firefox button. Under Linux, this is not enabled by default, but right-clicking on the menu bar lets you disable it, which enables the Firefox button.

The button is not exactly very appealing, and takes quite a lot of horizontal space on the tab bar. But with a few lines of CSS, this can fortunately be changed. Edit the chrome/userChrome.css file under your user profile, and add the following lines:

I made some changes as to how packages from the Debian Mozilla team that can’t yet be distributed in the Debian archives are distributed to users. Please update your APT sources and now use the following for 4.0 beta packages:

deb http://mozilla.debian.net/ experimental iceweasel-4.0

You’ll also need the experimental repository in your sources, but the overall installation is much easier now:

# apt-get install -t experimental iceweasel

This should work for squeeze and unstable users.

I also added Iceweasel 3.6 backports for Debian Lenny users. For these, add the following APT source:

deb http://mozilla.debian.net/ lenny-backports iceweasel-3.6

You’ll also need the lenny-backports repository in your sources. As for the experimental packages above, installation should be as easy as:

# apt-get install -t lenny-backports iceweasel

If your APT complains about the archive key, please check the instructions to add the key to your APT keyring.

There are several ways a program can hit the disk, and it can be hard to know exactly what’s going on, especially when you want to take into account the kernel caches. This includes, but is not limited to:

any access to a file, which may lead to I/O to read its parent directories if they are not already in the inode or dentry caches

enumerating a directory with readdir(), which may lead to I/O on the directory for the same reason

read()/write() on a file, which may lead to I/O on the file if it is not in the page cache

accesses in a mmap()ed area of memory, which may lead to I/O on the underlying file if it is not in the page cache

etc.

There are various ways to trace system calls (e.g. strace) and so see what a program is doing with files and directories, but that doesn’t tell you whether you’re actually hitting the disk. There are also various ways to trace block I/O (e.g. blktrace) and so see the actual I/O on the block devices, but it is then hard to work out which parts of which files or directories that I/O relates to. To the best of my knowledge, there are unfortunately no tools to do such tracking easily.

Systemtap, however, lets you access the kernel’s internals and gather almost any kind of information from any place in a running kernel. The downside is that you need to know how the kernel works internally to gather the data you need; that will limit the focus of this post.

I had been playing, in the past, with Taras’ script, which he used a while ago to track I/O during startup. Unfortunately, it became clear something was missing in the picture, so I had to investigate what’s going on in the kernel.

Setting up systemtap

On Debian systems, you need to install the following packages:

systemtap

linux-image-2.6.x-y-$arch-dbg (where x, y, and $arch correspond to the kernel package you are using)

linux-headers-2.6.x-y-$arch (likewise)

make

That should be enough to pull all the required dependencies. You may want to add yourself to the stapdev and stapusr groups, if you don’t want to run systemtap as root. If, like in my case, you don’t have enough space left for all the files in /usr/lib/debug, you can trick dpkg into not unpacking files you don’t need:

The file in /etc/dpkg/dpkg.cfg.d can obviously be named as you like, and you can adjust the path-exclude pattern to what you (don’t) want. In the above case, kernel drivers debugging symbols will be ignored. Please note that this feature requires dpkg version 1.15.8 or greater.

Small digression

One of the first problems I had with Taras’ script is that systemtap would complain that it doesn’t know the kernel.function("ext4_get_block") probe. This is due to a very unfortunate misfeature of systemtap, where the kernel.* probes refer only to whatever is in the vmlinux image. Module probes have a separate namespace, namely module("name").*.

So for the ext4_get_block() function, this means you need to set a probe for either kernel.function("ext4_get_block") or module("ext4").function("ext4_get_block"), depending on how your kernel was compiled. And you can’t even use both in your script, because systemtap will complain about either being an unknown probe…
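One way to cope is to pick the probe point before generating the script. The sketch below does that in Python; the helper names are hypothetical, and it assumes that a module missing from /proc/modules was compiled into vmlinux, which is wrong if the module is simply not loaded yet.

```python
def read_loaded_modules(path="/proc/modules"):
    """Return the names of the currently loaded kernel modules."""
    with open(path) as modules:
        # The first whitespace-separated field of each line is the module name.
        return {line.split()[0] for line in modules if line.strip()}


def probe_point(function, module, loaded_modules):
    """Pick the systemtap probe namespace for `function`.

    Assumption: if `module` is not loaded, the code is built into vmlinux.
    """
    if module in loaded_modules:
        return f'module("{module}").function("{function}")'
    return f'kernel.function("{function}")'


# Example with a made-up module list, rather than the running system's.
print(probe_point("ext4_get_block", "ext4", {"ext4", "jbd2"}))
print(probe_point("ext4_get_block", "ext4", set()))
```

On a real system, the third argument would be read_loaded_modules().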

Tracking the right thing

This script attempts to track I/O by following read() and write() system calls. That is not tracking I/O; it is merely tracking some system calls (Taras’ script had the same kind of problem with read()/write() induced I/O). You could do the very same with existing tools like strace, and that wouldn’t even require any special system privileges.

To demonstrate the script doesn’t actually track the use of storage devices as the document claims, consider the following source code:

All it does is read some parts of the executable a lot of times (notice the trick to make the executable size at least 1MB). Not a lot of programs will actually do something as bold as reading the same data again and again (though we could probably be surprised), but this easily points out the problem. Here is what the output of the systemtap script looks like for this program (stripping other irrelevant processes):

process read KB tot write KB tot
test 400002 25600001 0 0

Now, do you really believe about 25GB (25,600,001 KB) have actually been read from the disk? (Note that the read() count seems odd as well, as there should only be around 200,000 calls.)
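The same effect can be reproduced with a few lines of Python. This is an illustrative stand-in, not the test program discussed above: it reads a temporary 1 MiB file rather than its own executable, and the names and iteration count are arbitrary. It counts bytes the way a system call tracer would, by summing what read() returns.

```python
import os
import tempfile

CHUNK = 64 * 1024          # 64 KiB per read()
HOLE_OFFSET = 512 * 1024   # second read starts at offset 524,288


def reread(path, iterations):
    """Read the same two 64 KiB chunks `iterations` times.

    Returns (number of read() calls, total bytes returned by read()).
    """
    calls = 0
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        for _ in range(iterations):
            for offset in (0, HOLE_OFFSET):
                os.lseek(fd, offset, os.SEEK_SET)
                total += len(os.read(fd, CHUNK))
                calls += 1
    finally:
        os.close(fd)
    return calls, total


# Demonstration on a throwaway 1 MiB file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (1024 * 1024))
    demo_path = tmp.name
file_size = os.path.getsize(demo_path)
calls, total = reread(demo_path, 100)
os.remove(demo_path)
print(f"{calls} read() calls returned {total} bytes from a {file_size}-byte file")
```

Counting at the system call level reports many times the size of the file, although at most 128 KiB of distinct data is ever requested.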

Read-ahead, page cache and mmap()

What the kernel actually does is that a read() on a file first checks the page cache. If there is nothing corresponding to the read() request in the page cache, it goes down to the disk and fills the page cache. But as loading only a few bytes or kilobytes from the disk could be wasteful, the kernel also reads a few more blocks ahead, apparently with some heuristic.

But read() and write() aren’t the sole way a program may hit the disk. On UNIX systems, a file can be mapped in memory with the mmap() system call, and all accesses in the memory range corresponding to this mapping will be reflected on the file. There are exceptions, depending on how the mapping is established, but let’s keep it simple. There is a lot of literature on the subject if you want to read up on mmap().

The way the kernel will read from the file, however, is quite similar to that of read(), and uses the page cache and read-ahead. The systemtap script debunked above doesn’t track these.
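To make the difference concrete, here is a small Python sketch that gets the same bytes from a file in both ways. The file is a temporary one created for the example and the function names are arbitrary. With the mapping, the data is fetched by plain memory accesses, so no read() call is issued for it.

```python
import mmap
import os
import tempfile


def read_via_read(path, offset, length):
    """Fetch bytes with an explicit seek and read."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def read_via_mmap(path, offset, length):
    """Fetch the same bytes through a read-only memory mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapping:
            # Slicing the mapping touches memory; the kernel faults pages in
            # from the page cache, or from disk if they are not cached.
            return mapping[offset:offset + length]


# Demonstration on a throwaway 16 KiB file with a repeating byte pattern.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 64)
    demo_path = tmp.name
via_read = read_via_read(demo_path, 4096, 16)
via_mmap = read_via_mmap(demo_path, 4096, 16)
os.remove(demo_path)
print(via_read == via_mmap)
```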

I’ll skip write accesses, because for now, they haven’t been in my scope.

Tracking some I/O with systemtap

What I’ve been trying to track so far has been limited to disk reads, which happen to be the only accesses occurring on shared library files. Programs and shared libraries are first being read() from, so that the dynamic linker gets the ELF headers and knows what to load, and then are mmap()ed following PT_LOAD entries in these headers.
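As an illustration of what the dynamic linker finds in those headers, the following Python sketch lists the PT_LOAD entries of an ELF file. It is deliberately limited to 64-bit little-endian files, and the function name is made up. The field layout follows the Elf64_Ehdr and Elf64_Phdr structures from the ELF specification.

```python
import struct

PT_LOAD = 1
ELF_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")   # Elf64_Ehdr, 64 bytes
PROGRAM_HEADER = struct.Struct("<IIQQQQQQ")       # Elf64_Phdr, 56 bytes


def load_segments(data):
    """Return (file offset, file size) for each PT_LOAD entry in `data`."""
    ident = data[:16]
    # ident[4] == 2 means ELFCLASS64, ident[5] == 1 means little-endian.
    if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
        raise ValueError("only 64-bit little-endian ELF is handled here")
    fields = ELF_HEADER.unpack_from(data, 0)
    e_phoff, e_phentsize, e_phnum = fields[5], fields[9], fields[10]
    segments = []
    for index in range(e_phnum):
        entry = PROGRAM_HEADER.unpack_from(data, e_phoff + index * e_phentsize)
        p_type, p_offset, p_filesz = entry[0], entry[2], entry[5]
        if p_type == PT_LOAD:
            segments.append((p_offset, p_filesz))
    return segments
```

It can be tried on any 64-bit library or executable, for instance load_segments(open(path, "rb").read()) with path pointing at a shared library on your system.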

As far as my investigation in the Linux kernel code goes, fortunately, both accesses, before they end up actually hitting the disk, go through the __do_page_cache_readahead kernel function (this function was tracked in Taras’ script). Unfortunately, while it is called with an offset and a number of pages to read for a given file, it turns out the last pages in that number are not necessarily being read from the disk. I don’t know for sure, because the latter had an effect on my observations, but some might even already be in the page cache.

Going further down, we reach the VFS layer, which ends up being file-system specific. But fortunately, a bunch of (common) file-systems actually share page mapping code, and commonly use the mpage_readpage and mpage_readpages functions, both calling do_mpage_readpage to do the actual work. And this function seems to be properly called only once for each page not in the page cache already.

If my reading of the kernel source is right, this however is not really where it ends, and do_mpage_readpage doesn’t actually hit the disk. It seems to only gather some information (basically, a mapping between storage blocks and memory) that is going to be submitted to the block I/O layer, which itself may do some fancy stuff with it, such as reordering the I/Os depending on other requests it got from other processes, the position of the disk heads, etc.

And when I say do_mpage_readpage doesn’t actually hit the disk, I’m again simplifying the issue, because it actually might, as there might be a need to read some metadata from the disk to know where some blocks are located. But tracking metadata reads is much harder, and I haven’t investigated it.

Anyways, skipping metadata, going further down after do_mpage_readpage is hard because it’s difficult to back-track which block I/O is related to which read-ahead, corresponding to what read at what position in which file. do_mpage_readpage already has part of this problem because it is not called with any reference to the corresponding file. But __do_page_cache_readahead is.

So knowing all the above, here is my script, the one I used to get the most recent Firefox startup data you can find in my last posts:

This script needs to be used with a command given to systemtap, with the -c option, such as in the following command line:

# stap readpage.stp -c firefox

Each line of output represents a page (i.e. 4,096 bytes) being read, and contains a timestamp, the name of the file being read, and the offset in the file. As discussed above, do_mpage_readpage is not really where the I/O actually occurs, so the timestamps are not entirely accurate, and the actual read order from disk might be slightly different. It is still a quite reliable view, though: the result should be reproducible with the same files even when they’re not located on the same blocks on disk, provided their page cache status is the same when starting the program.
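Since each line stands for one page, the amount of data read per file is a simple count. The Python sketch below assumes the three fields are separated by whitespace, in the order given above. The exact layout of the script's output is an assumption, and the sample lines are invented.

```python
from collections import Counter

PAGE_SIZE = 4096   # each output line stands for one page


def bytes_read_per_file(lines):
    """Sum the pages recorded for each file and convert them to bytes.

    Assumed line layout: "<timestamp> <file name> <offset>".
    """
    pages = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        _timestamp, rest = line.split(None, 1)
        # Split from the right so that file names containing spaces survive.
        name, _offset = rest.rsplit(None, 1)
        pages[name] += 1
    return {name: count * PAGE_SIZE for name, count in pages.items()}


# Invented sample output, not real measurements.
sample = """\
1000 /usr/lib/libfoo.so 0
1001 /usr/lib/libfoo.so 4096
1005 /usr/bin/test 0
"""
totals = bytes_read_per_file(sample.splitlines())
print(totals)
```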

This systemtap script ignores writes, as well as metadata accesses (including but not limited to inodes, dentries, bitmap blocks and indirect blocks). It also doesn’t account for accesses to files opened with the O_DIRECT flag or similar constructs (raw devices, etc.).

Read-ahead in action

Back to the small example program, my systemtap script records 102 page accesses, that is, 417,792 bytes, much less than the actual binary size on my system (1,055,747 bytes). We are still far from the 25GB figure from the other systemtap script. But we are also far from the 128KiB the program actively reads (64KiB twice, leaving a hole between the two blocks).

At this point, it is important to note that the ELF headers and program code all fit within a single page (4 KiB), and following the 1MiB section corresponding to the dummy variable, there are only 5,219 bytes of other ELF sections, including the .dynamic section. So even counting everything the dynamic linker needs to read, and the program code itself, we’re still far from what my systemtap script records.

Grouping consecutive blocks with consecutive timestamps, here is what can be observed:

| Offset | Length |
|---|---|
| 0 | 16,384 |
| 983,040 | 73,728 |
| 16,384 | 262,144 |
| 524,288 | 65,536 |

(By now, you should have guessed why I wanted that big hole between the read()s; if you want to reproduce this at home, I suggest you also use my page cache flush helper.)
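The grouping itself is easy to automate. Here is a minimal Python sketch that merges page offsets, taken in the order they were recorded, into (offset, length) runs. It only looks at offsets and ignores timestamps, which is a simplification of the grouping described above. The list of offsets is rebuilt from the table rather than taken from the raw trace.

```python
PAGE_SIZE = 4096


def group_runs(offsets, page_size=PAGE_SIZE):
    """Merge page offsets, in recorded order, into (offset, length) runs."""
    runs = []
    for offset in offsets:
        if runs and runs[-1][0] + runs[-1][1] == offset:
            # This page directly follows the previous run: extend it.
            runs[-1][1] += page_size
        else:
            runs.append([offset, page_size])
    return [tuple(run) for run in runs]


# Page offsets rebuilt from the table above, in the order they appear there.
recorded = (
    list(range(0, 16384, PAGE_SIZE))
    + list(range(983040, 983040 + 73728, PAGE_SIZE))
    + list(range(16384, 16384 + 262144, PAGE_SIZE))
    + list(range(524288, 524288 + 65536, PAGE_SIZE))
)
runs = group_runs(recorded)
for offset, length in runs:
    print(f"{offset:>9,} {length:>9,}")
```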

As earlier investigations showed, the first accesses by the dynamic loader are to read the ELF headers and .dynamic section. As mentioned above, these are really small. Yet, the kernel actually reads much more: 16KiB at the beginning of the file when reading the ELF headers, and 72KiB at the end when reading the .dynamic section. Subsequent accesses from the dynamic loader are obviously already covered by these accesses.

Next accesses are those due to the program itself. The program actively reads 64KiB at the beginning of the file, then 64KiB starting at offset 524,288. For the first read, the kernel already had 16KiB in the page cache, so it didn’t read them again, but instead of reading the remainder, it reads 256KiB. For the second read, however, it only reads the requested 64KiB.

As you can see, this is far from being something like “you wanted n KiB, I’ll read that fixed amount now”.

Further testing with different read patterns (e.g. changing the hole size, read size, or reading from the dummy variable directly instead of read()ing the binary) is left as an exercise to the reader.

Firefox is a quite featureful web browser, and fortunately, not all the code in the Firefox tree comes from Mozilla (though a large part does). Some parts of the code actually come from third party libraries, such as libpng for PNG support, libjpeg for (obviously) JPEG support, libsqlite3 for SQLite support (used, for example, for the awesomebar), or libcairo for part of the layout rendering.

Some of these libraries are also used in other software. For instance, Gtk+, the widget toolkit Firefox uses on UNIX systems, which is also used by the GNOME desktop environment, needs, directly or indirectly, libcairo for rendering and libpng for icons. But these other programs don’t embed their own copy of the libraries: they use shared, system libraries.

There are several advantages to doing so: they reduce memory footprint, because the library is loaded only once in memory (modulo relocations on data sections), they reduce maintenance overhead (you only need to update one copy for stability or security issues), and, you should have guessed where I’m going by now, they help reduce startup I/O: a system library that has been loaded by some other software doesn’t need to be read from disk again when you start another application using the same system library.

Unfortunately, Mozilla distributes binaries that can run on a large variety of systems, which are not necessarily using the right versions of the libraries, and sometimes even, Mozilla patches some libraries it embeds to fix bugs or add features. One such example, the worst case, actually, is libpng, which is patched to handle APNG files. Libcairo also comes with some bug fixes, most of which, after a while, end up in the original libcairo. Anyways, as Mozilla can’t really control which version of cairo, sqlite, etc. are installed on systems where Firefox is going to be used, the only way that works for us is to ship binaries that link against these internal (possibly modified) libraries.

Though the same logic could be applied to some other libraries which we do link against system versions of (such as Gtk+), they’re, in practice, not so much a problem as sqlite and cairo, which are the most problematic libraries at the moment.

On non GNU/Linux platforms (excluding some other uncommon unix variants for Firefox users), system libraries are unfortunately not a common practice, and even if they were, most of the libraries Firefox uses internally are not shipped with the system. On (more or less) controlled environments, however, the story is slightly different. When you know what version of which libraries is shipped, it’s easier to start using system libraries. GNU/Linux distributions are such environments, mobile OS (Android, Maemo/Meego) are, too.

I’m not interested in discussing here the support implications of GNU/Linux distributions shipping Firefox binaries linked against system libraries, though in practice, it creates fewer problems than some people may think. This post only aims at showing how using system libraries impacts startup times, how it could or could not be an interesting thing to investigate on mobile, and how GNU/Linux distributions’ practices are actually not hurting.

As a reminder, here are the startup times with a plain Firefox 4.0b8, as collected in previous posts:

| | Average startup time (ms) |
|---|---|
| x86 | 3,228.76 ± 0.57% |
| x86-64 | 3,382.0 ± 0.51% |

The first set of system libraries I enabled are those that the default GNOME desktop environment as it comes in Debian Squeeze uses, except libpng, because the system library doesn’t support APNG files (note that I installed libcairo from Debian experimental, as Squeeze doesn’t have 1.10):

It turns out there is not much gain, but at least, it’s not a regression. Please note that libbz2 is actually not used in Firefox itself. It’s only used in the upgrade program.

Let’s go further: the clock applet coming with the gnome-panel uses a system copy of libnspr4 and libnss3, through libedataserver. Provided that these libraries are recent enough on the system, Firefox can actually use them, so let’s add these:

ac_add_options --with-system-nspr
ac_add_options --with-system-nss

| | Average startup time (ms) |
|---|---|
| x86 | 3,086.74 ± 0.48% |
| x86-64 | 3,226.3 ± 0.59% |

This had a much more noticeable positive impact. This is most probably due to the size and number of libraries involved, as currently, nspr and nss are respectively 3 and 8 separate library files.

Let’s now go even further and enable all currently supported flags remaining, except libpng, as already explained:

Unsurprisingly, it has a negative impact, most probably due to the additional disk seeks that each of these libraries induces. The good news is that it doesn’t get slower than the original Firefox build, and is actually still slightly faster.

There are more libraries that Firefox embeds and could be using system libraries instead, but there aren’t flags for them yet. At the very least, and if I recall correctly, libogg is loaded by the GNOME desktop environment through libcanberra for event sounds, so there could be some more possible gain there.

However, it’s not entirely clear from the above figures whether that would be worth it on mobile OSes, and I’m not putting too much hope there. I’ll first need to collect some more data, such as which libraries are used after a system boot, and what versions are provided.

On desktops, though, we can go much further. In the 3.x days, and earlier, Firefox’s javascript engine was shipped as a separate library in the Firefox directory. In recent 4.0 betas, it is now statically linked into libxul.so. But the future GNOME desktop, based on GNOME shell, will be using Firefox’s javascript engine. Some distributions plan to use a separately packaged javascript library for that purpose. In Debian, I’m still planning to keep Firefox shipping the javascript engine as a separate, system library, so let’s see what that means for startup time:

ac_add_options --enable-shared-js

| | Average startup time (ms) |
|---|---|
| x86 | 2,887.48 ± 0.55% |
| x86-64 | 2,990.64 ± 0.65% |

(Disclaimer: this was not measured with an actual GNOME shell, but with the current GNOME desktop and a script reading the javascript library file to force it into the page cache. Arguably, an actual use of the library may load less of it into the page cache.)
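To put the averages in perspective, here is a short Python sketch computing the change of each build relative to the plain Firefox 4.0b8 figures. The numbers are the averages quoted in this post. The labels are mine, and the error margins are ignored, so small differences should not be over-interpreted.

```python
def relative_change(baseline_ms, new_ms):
    """Return the change from baseline to new, as a percentage of baseline."""
    return (new_ms - baseline_ms) / baseline_ms * 100.0


# Average startup times (ms) quoted above.
baseline = {"x86": 3228.76, "x86-64": 3382.0}
builds = {
    "system nspr/nss added": {"x86": 3086.74, "x86-64": 3226.3},
    "--enable-shared-js": {"x86": 2887.48, "x86-64": 2990.64},
}

changes = {
    label: {arch: relative_change(baseline[arch], value)
            for arch, value in averages.items()}
    for label, averages in builds.items()
}
for label, per_arch in changes.items():
    for arch, percent in per_arch.items():
        print(f"{label}, {arch}: {percent:+.2f}%")
```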