Better Crash Trending – A Test Pilot Proposal

Summary: (Total Crashes)/(Active Daily Users) is a low-resolution metric for crash trending. We can improve Firefox’s stability for more users if we understand how crashes are distributed across users.

Over the last few months, improving Firefox’s stability has become a top priority. And our early results are encouraging! Both our survey data and crash reporter data show a downward trend in crashes. We should be careful, however, not to read too much into this data.

Due to privacy concerns, we do not store user ids when collecting crash data. Accordingly, our crash data suffers from a number of limitations. For example, we can only determine the number of crashes per daily session, not per unit of session length. As a result, the number of crashes per user will appear to rise if users browse more each day. Changes to the crash reporter UI will similarly bias our data. A 5% relative increase in the reporter response rate, for instance, will produce a 5% increase in reported crashes even if the true number of crashes is unchanged.
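To make the reporting bias concrete, here is a minimal sketch. It assumes reported crashes are simply true crashes multiplied by the response rate; the function names and the numbers (1,000 crashes, 40% and 42% response rates) are made up for illustration.

```python
def reported_crashes(true_crashes, response_rate):
    """Crashes we see, assuming each crash is reported with the same probability."""
    return true_crashes * response_rate


def relative_change(before, after):
    """Relative change from before to after, e.g. 0.05 means +5%."""
    return (after - before) / before


# Hypothetical numbers: the true crash count stays fixed at 1,000.
before = reported_crashes(1000, 0.40)
after = reported_crashes(1000, 0.42)  # response rate up 5% in relative terms

print(f"Reported crashes changed by {relative_change(before, after):+.1%}")
```

Stability did not change in this scenario, yet the trend line moves by the same proportion as the response rate.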

Perhaps most importantly, the lack of a user id limits our ability to draw inferences about the distribution of our data. Why does this matter? If Firefox crashes are skewed, we may reduce overall crashes by 10%, but have 80% of users experience more crashes.
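A small worked example shows how both things can be true at once. The population below is entirely hypothetical (100 users, 20 of them crash-prone) and was chosen only so the arithmetic comes out to the figures mentioned above.

```python
def skew_scenario():
    """Return (relative change in total crashes, share of users who got worse)."""
    # Hypothetical crash counts per user: 20 crash-prone users, 80 typical users.
    before = [46] * 20 + [1] * 80   # 920 + 80  = 1000 crashes
    after = [37] * 20 + [2] * 80    # 740 + 160 =  900 crashes

    total_change = (sum(after) - sum(before)) / sum(before)
    share_worse = sum(a > b for a, b in zip(after, before)) / len(before)
    return total_change, share_worse


total_change, share_worse = skew_scenario()
print(f"Total crashes: {total_change:+.0%}")
print(f"Users with more crashes: {share_worse:.0%}")
```

Here the aggregate metric improves by 10% because a few heavy crashers improved a lot, while 80% of users saw their crash count double.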

A quick look at the “Week in the Life of a Browser” study suggests that crashes are indeed highly skewed. By examining start-up events without corresponding shut-down events, Jono calculated the number of unexplained session interruptions per user. While there are other causes of session interruptions, such as a computer losing its power, we can reasonably assume that session interruptions serve as a (perhaps highly overstated) proxy for crashes.
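The counting approach described above can be sketched roughly as follows. This is not the study's actual code: the event format, a chronological list of (user, kind) pairs, is an assumption made for illustration.

```python
from collections import Counter


def count_interruptions(events):
    """Count start-ups that were not preceded by a clean shut-down, per user.

    events: iterable of (user_id, kind) pairs in chronological order,
    where kind is "startup" or "shutdown" (hypothetical format).
    """
    open_sessions = set()      # users whose last event was a start-up
    interruptions = Counter()
    for user, kind in events:
        if kind == "startup":
            if user in open_sessions:
                # The previous session never logged a shut-down.
                interruptions[user] += 1
            open_sessions.add(user)
        elif kind == "shutdown":
            open_sessions.discard(user)
    return interruptions


events = [
    ("a", "startup"), ("a", "shutdown"),
    ("a", "startup"),                      # no shut-down follows
    ("a", "startup"), ("a", "shutdown"),
    ("b", "startup"), ("b", "shutdown"),
]
print(count_interruptions(events))
```

A session that is still open at the end of the log is not counted, since the browser may simply still be running. Power loss and forced reboots are counted, which is why the result overstates crashes.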

The distribution of session interruptions per user does not take on the bell-curve shape of normally distributed data. Rather, it follows a power law distribution. 49% of users did not experience a single session interruption, while 70% experienced one or fewer. The mean number of session interruptions, however, was 1.4. If our crash data follows a similar distribution, the average crashes-per-user metric tells us little about the experience of a typical Firefox user.
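The gap between the mean and the typical user is easy to reproduce. The sample below is synthetic: only the 49%, 70%, and 1.4 figures come from the study, and the shape of the tail (the users with 2, 5, and 34 interruptions) is invented so that the numbers add up.

```python
from statistics import mean, median


def summarize(counts):
    """Summary statistics for a list of per-user interruption counts."""
    avg = mean(counts)
    return {
        "mean": avg,
        "median": median(counts),
        "share_zero": sum(c == 0 for c in counts) / len(counts),
        "share_below_mean": sum(c < avg for c in counts) / len(counts),
    }


# Synthetic sample of 100 users; the tail values are assumptions.
counts = [0] * 49 + [1] * 21 + [2] * 20 + [5] * 9 + [34] * 1

for name, value in summarize(counts).items():
    print(f"{name}: {value}")
```

In this sample, 70% of users sit below the mean, and a single heavy user contributes 34 of the 140 interruptions. Reporting the median or a few percentiles alongside the mean would describe the typical experience more faithfully.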

Anecdotal evidence supports this hypothesis. While we all know people who swear by Firefox’s stability, we also know people who complain of frequent failures. I, for one, haven’t experienced any crashes since upgrading to the 3.6 beta a few weeks ago.

With this in mind, I suggest we use Test Pilot to run a longitudinal study of true Firefox crashes. Because Test Pilot is opt-in and allows users to review their data before submitting it, we’re able to consider data at a more granular level. As with previous TP experiments, we will go to great lengths to respect the privacy of participants.

In addition to crash events and session length, I would like to collect data we can correlate with crashes. Firefox version, operating system, and Add-ons installed immediately come to mind.
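As a rough sketch of what a per-session record might look like, and how session length lets us normalize crashes by hours of use, consider the following. The record layout and field names are hypothetical and do not describe an existing Test Pilot schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SessionRecord:
    """One browsing session as a study might record it (hypothetical fields)."""
    firefox_version: str
    operating_system: str
    session_hours: float
    crashes: int
    addons: list = field(default_factory=list)


def crashes_per_hour_by_version(records):
    """Total crashes divided by total session hours, grouped by Firefox version."""
    hours = defaultdict(float)
    crashes = defaultdict(int)
    for record in records:
        hours[record.firefox_version] += record.session_hours
        crashes[record.firefox_version] += record.crashes
    # Skip versions with no recorded usage time to avoid dividing by zero.
    return {v: crashes[v] / hours[v] for v in hours if hours[v] > 0}


records = [
    SessionRecord("3.6", "Windows XP", 10.0, 2, ["Example Add-on"]),
    SessionRecord("3.6", "Mac OS X", 10.0, 0),
    SessionRecord("3.5", "Windows XP", 5.0, 1),
]
print(crashes_per_hour_by_version(records))
```

The same grouping could be applied to operating system or to individual add-ons to look for correlations with crashes.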

Have suggestions for additional data that we should, or shouldn’t, collect? Please leave them in the comments.

6 responses

Plugins, with their names and, if it can be done without too much extra effort, their version data.
Almost more important than this: the last action before the crash (loading a page, starting a plugin), to determine whether the cause of the crash was external or internal. Power loss, system crashes, and system hangs are external, while loading a plugin, loading a page, building or parsing the DOM, and rendering are internal. If we leave out URLs, I don’t see privacy being violated. Maybe give the TP user the possibility to add comments to a crash before sending?

Good idea, but be prepared for skewed data. The people more likely to install Test Pilot would probably be more likely to make a point of troubleshooting their crashes. You’ll probably also get quite a few spikes from people who like to test crashes themselves, and who are also likely Test Pilot users. With this sort of test you’ll be hitting the base bias of the test user pool far more than with many other tests. Still sounds like it’s worth trying, however.

I’ve experienced numerous crashes of FF 3.6.3. I updated Shockwave for Director and updated Flashplayer, with install problems (getplusplusadobe 16263). Certain websites, like wrongdiagnosis.com, would crash FF and generate the MS error report. If you reopened FF, it would act as if the Link Support Protocol (LSP) was broken. Close it, then open Task Manager, and you would see an FF process still running, about 55800k in size. Kill the process and reopen FF: LSP still broken. Reboot and it opens, but certain sites are still guaranteed to crash.
Reinstalled FF 3.5.5. On opening, it immediately wanted to install Shockwave Flash 10.0.45.2. Once that was installed, the formerly conflicting websites no longer crashed.
The FF 3.6 series has two problems. The first is an issue with Flash; the second is the LSP. The link support layer (LSL) is undamaged, because you can ping the Internet root servers, and the separate MS error reporter transmits.