Vladan Djeric got all of the security reviews and should be able to land Nicholas Chaim’s fix for networked certificate validation: bug 726125

We spent most of the meeting discussing bug 784512. According to several data sources, Firefox 15 Beta loads pages slower than 14. Occasionally problems squeeze past our performance testing + telemetry infrastructure, and this looks like one of those times. Unfortunately, it’s quite hard to reduce a few noisy signals to a concrete performance problem. If you can reproduce a performance regression to do with loading webpages/games/etc in FF15 vs FF14, please leave a comment.

Thanks!

Thanks for the great comments on my previous snappy updates. Bug 783755 should take care of the new cache size pref not sticking. Bug 718910 on hiding Cache directory from Spotlight is making progress too.

A commenter, kumalos, reported a tab switching regression and posted a profile recorded with our profiler as evidence. This proved to be an example of bug 783748, and led us to identify a previously unknown issue in bug 784756. Constructive feedback like this is one of the main reasons I blog.

I highly encourage users interested in improving Firefox performance to use Nightly builds and report bugs with profiler traces attached.

Shutdown Times

I’ll end with our latest Telemetry data point. This one took a while to get right, but we finally track our shutdown speed.

I have been using dates to mark passage of time in the Snappy project. I think I’ll switch to a simple counter instead. We are ~36 updates into this project.

Blogging

Ludovic Hirlimann blogged about how Spotlight spends a lot of time indexing the Firefox network ‘Cache’ directory (known problem, bug 718910). If you experience this problem and would like to see it fixed, please comment in the bug on whether the suggested remedy helps.

Tim Taubert wrote about reducing new-tab jank. I mention Tim a lot in these updates. He takes on a lot of interesting bugs in Firefox frontend. Hopefully he’ll make a habit out of blogging about his work.

Networking

Nick Hurley landed a change to reduce our maximum cache size to 350 megabytes. In order to avoid excessive disk IO traffic, the old cache size of 1 gigabyte remains in effect until the cache is reset. See bug 709297 and Nick’s blog post for more details. Progress is also being made on bug 777328 so we can move towards not blowing away our cache 10-20% of the time.

Michal Novotny is proceeding with incrementally reducing cache-caused jank that’s due to the main thread waiting on a lock that is held while a background thread does IO. He is also removing a multitude of synchronous necko APIs, see bug 695399.
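The general shape of this kind of jank, and one common way around it, can be sketched in a few lines. This is an illustrative Python sketch, not Necko code: the `CacheIndex` class, its methods and the `write_fn` callback are all hypothetical names. The point is that the background thread copies what it needs while holding the lock and does the slow IO only after releasing it, so the main thread never waits on the disk.

```python
import threading


class CacheIndex:
    """Hypothetical cache index shared by a main thread and an IO thread."""

    def __init__(self, write_fn):
        self._lock = threading.Lock()
        self._entries = {}
        self._write_fn = write_fn  # stands in for slow disk IO

    def put(self, key, value):
        # Called from the "main thread": only ever holds the lock briefly.
        with self._lock:
            self._entries[key] = value

    def flush(self):
        # Called from the background thread. Copy under the lock...
        with self._lock:
            snapshot = dict(self._entries)
        # ...and do the slow write after the lock has been released.
        self._write_fn(snapshot)

    def lock_is_free(self):
        # Helper for the demo: True if nobody currently holds the lock.
        if self._lock.acquire(blocking=False):
            self._lock.release()
            return True
        return False


def demo():
    observed = {}

    def slow_write(snapshot):
        # Record whether the lock was free while the "IO" was running.
        observed["lock_free_during_io"] = index.lock_is_free()
        observed["written"] = snapshot

    index = CacheIndex(slow_write)
    index.put("http://example.com/", "entry-1")
    worker = threading.Thread(target=index.flush)
    worker.start()
    worker.join()
    return observed
```

If `flush` instead called `write_fn` inside the `with self._lock:` block, every `put` issued from the main thread during the write would stall for as long as the disk took.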

Patrick McManus is removing synchronous proxy-related code, see bug 766973 for the DNS-related bit. Turns out our proxy code also does all kinds of synchronous operations when detecting proxy configuration, etc. This is being worked on, but hasn’t been filed yet.

I usually try to highlight work that has already landed, but in this case it is important to point out that the Necko team is working hard on addressing significant problems in the networking code. These problems are tricky and will take a while to fix. The team is relatively new and is still discovering hidden surprises in their codebase.

Profiler

Benoit Girard posted a preview of view-source in the profiler. This will be handy for figuring out where performance problems lie, especially in JS files that have been preprocessed (our JS preprocessor does not try to keep the line numbers sane).

bug 777919 – Free LifoAlloc chunks on background thread, instead of as part of the final IGC slice. This isn’t a problem for most people, but for some people on OSX it can take anywhere from 50ms to 250ms or more.

bug 778993 – Separate runtime’s gcMallocBytes from compartment’s gcMallocBytes, so we trigger fewer non-incremental GCs with many tabs open.

See Bill’s comments in bug 767209 for some insight into the complex heuristics that go into minimizing GC interruptions: comment 1, comment 2.

Coming Soon

In the coming week I expect to see some good optimizations land for page rendering, tab-switching behavior, a more robust cache, etc.

Some Snappy people will be attending MozCamp.eu 2012 in Warsaw, Poland on September 8, 9. Expect to see lots of talk on profiling and other performance tools.

I hope to have about 15-20 Performance/Snappy people in Warsaw for the following week. This is not yet finalized. At the moment we are looking to see if there is a coworking space or a company in Warsaw that could host us.

SPS Gecko Profiler has gotten a lot of praise this week on #perf. If you ever wonder what the hell Firefox is doing with your CPU, give the profiler a try. For the past couple of weeks it has been able to label stacks with JS, URLs and even favicons. Mozilla may well have shipped the world’s first profiler to feature favicons.

Having JS support is nice; it led to the first 2 snappy addon bugs: 777266, 777397. I documented how to act on addon responsiveness issues in the snappy wiki.

Whether you develop web pages, addons or are a core gecko hacker, the profiler may make the performance-analysis part of your life much more pleasant. Update: Benoit Girard wrote about the new profiler features.

Things To Not Do On Startup

Blair McBride did some digging: there may be 15 million users with signed extensions, which can cause Firefox to do network IO (ie stall for a long time) on startup.

Brian Bondy landed a fix to lower the IO priority of nuking our cache: 773518. According to telemetry, 10-20% of startups feature cache nuking. It takes a while to blow away 1GB of files on startup. Brian used telemetry to investigate causes for cache purges in bug 774146. Based on this data, Brian will begin tackling what may be the oldest snappy bug so far: bug 105843. For more details on our cache see Nick Hurley’s blog post (also see his link to a similar blog post from a Chrome person).

More Responsive Tabs

Tim Taubert made our new tab animation more pleasant in bug 716108. Tim also landed a fix to halve jank caused by thumbnail capture in bug 774811; this should result in a better tab-switching experience. Stay tuned for more developer attention in this area.

For the in-progress work and minor changes that landed, see the non-meeting notes for this week.

Jeff Muizelaar wrote an interesting blog post on the work involved in a tab switch on Mac.

Windows Prefetch: Experimental Data vs Reality

I once discovered that Windows Prefetch can adversely affect application startup times, bug 627591. Certain machines were showing performance to be much better with Windows prefetch disabled and using my “manual” dll preload code to warm up the cache. Manual dll preload is a win for loading large applications because it causes xul.dll to be read in sequentially rather than randomly via page-in (see my blog posts from 2010 for details of startup IO ugliness). Unfortunately Windows Prefetch + my preload code measured as a net regression. I found a weird API that seemed to return 0 when prefetch was broken and guarded preload on that.
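The idea behind a manual preload is simple enough to sketch. The following Python snippet is only an illustration of the technique, not the preload code that shipped in Firefox: the function name `preload_sequentially` and the 1 MB chunk size are assumptions made for the example. It reads a file front to back in large chunks and throws the data away, so that the OS file cache is warm by the time the loader starts touching pages in a scattered order.

```python
import os
import tempfile


def preload_sequentially(path, chunk_size=1024 * 1024):
    """Read `path` front to back in large chunks and discard the data.

    Returns the number of bytes read. The only goal is the side effect of
    pulling the file into the OS file cache with sequential reads.
    """
    total = 0
    # buffering=0 gives raw reads, so each read() maps to one large request.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def demo():
    # Stand-in for a large library such as xul.dll: a few MB of zeros.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\0" * (3 * 1024 * 1024 + 17))
        path = tmp.name
    try:
        return preload_sequentially(path)
    finally:
        os.remove(path)
```

Whether this helps depends entirely on what else the OS is already doing to prefetch the file, which is the lesson of the rest of this section.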

We have recently backed out the above heuristic based on a telemetry study in bug 757215. Perhaps this is why our startup numbers have started getting better in Firefox 16?

Brian Bondy set up a telemetry startup trial to randomly delete prefetch and turn on dll preload. Last week Saptarshi Guha crunched some telemetry numbers, see this bugzilla comment. It turned out that Windows Prefetch is a huge win and dll preload is a tiny incremental improvement on top of that (rather than being a regression).

Moral of the story is: do not rely on manual performance testing for workloads involving large amounts of IO. Simulating a “typical” Windows machine is extremely hard without getting noisy numbers. Effort is better spent on analyzing noisy real-world numbers and running experiments in the wild.
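To make "analyzing noisy real-world numbers" a bit more concrete, here is a small Python sketch of one standard approach: compare medians (which shrug off outliers) and bootstrap a confidence interval for the difference. This is not the analysis used in the telemetry study; the function names, the synthetic data generator and every number in it are made up for illustration.

```python
import random
import statistics


def synthetic_startups(center_ms, n, seed):
    """Fake startup times: a noisy core plus roughly 10% extreme outliers."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        value = rng.gauss(center_ms, 500)
        if rng.random() < 0.1:
            value += 20000  # e.g. a cold disk or a busy machine
        samples.append(value)
    return samples


def bootstrap_median_diff(a, b, rounds=1000, seed=0):
    """Return (median(a) - median(b), low, high) with a ~95% bootstrap interval."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(rounds):
        # Resample each group with replacement and recompute the medians.
        resampled_a = [rng.choice(a) for _ in a]
        resampled_b = [rng.choice(b) for _ in b]
        diffs.append(statistics.median(resampled_a) - statistics.median(resampled_b))
    diffs.sort()
    low = diffs[int(0.025 * rounds)]
    high = diffs[int(0.975 * rounds) - 1]
    point = statistics.median(a) - statistics.median(b)
    return point, low, high


def demo():
    without_feature = synthetic_startups(5000, 150, seed=1)
    with_feature = synthetic_startups(4000, 150, seed=2)
    return bootstrap_median_diff(without_feature, with_feature)
```

If the interval returned by `demo()` sits entirely above zero, the second group really does start faster, even though individual samples in both groups are all over the place.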

We discussed setting up Eideticker for desktop Firefox responsiveness testing.

Andrew Halberstadt is making progress on a revised version of peptest. We are looking at loading the talos pageset into individual tabs and tracking tab-switching.

We also discussed how QA can help us confirm + narrow down regressions found by telemetry.

Necko

Necko guys are continuing to remove main thread DNS resolution and are integrating a custom DNS resolver. Last week they landed a bunch of telemetry to help them play cache-lock-whack-a-mole: bug 763342, 767275.

Jon Coppeard put up a patch to do incremental sweeping. The cleanup phase of the GC is a major remaining continuous GC operation. This should help reduce remaining significant GC pauses.

Perf Team

Nicholas Chaim is almost done with setting up a way to track main thread IO with XPerf in bug 770317. We would like to track main thread network IO via xperf, but it’s not clear if xperf can report what thread IO operations happen on.

Slow Startup

Turns out Firefox validates some signed extensions on startup: bug 726125. I think we finally have a good explanation for some of the ridiculously slow startups we’ve been looking at. Yuck.

Jared Wein discovered that our about:home was surprisingly expensive to load. He sped up the page by an estimated 30% in bug 765411. Similarly, Tim Taubert is fixing our new tab page performance in bug 753448.

Alex Crichton added the ability to profile JS in bug 761261. Benoit Girard is adding labels to the profiler to expose JS profiling info in bug 707308. The same functionality will also allow us to add URLs to the stacks. This means that in addition to seeing what Firefox is busy with, the profiler will now provide context on what caused the processing (screenshot). This is huge. Benoit also improved profiler timing data in 769989.

Slow Startup Research

As I mentioned before, Nicholas Chaim wrote an addon to track system IO usage while starting Firefox. He has since updated his addon to be hosted on AMO and to submit that data for analysis. If you suffer from slow Firefox startups, please help us identify common IO hogs by installing his addon. Please encourage friends with slow startups to do the same.

This addon lists names of processes and the amount of IO they did. This is somewhat private information, so we can’t gather this data via telemetry.

Ehsan Akhgari turned on frame pointers on the nightly channel: bug 764216. This means that one can now use the built-in profiler on nightly builds. The main purpose behind the change was to collect more chromehang data (long Firefox UI stalls). Vlad Djeric lowered the chromehang reporting threshold to 5 seconds: bug 763124. We are waiting on metrics to separate out chromehang reporting from telemetry pings: bug 763116.

Nathan Froyd is making heroic progress on teaching our events to queue so they can be prioritized: bug 715376.
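For readers unfamiliar with the idea, a prioritized event queue can be sketched as follows. This is a toy Python illustration and says nothing about the actual design in bug 715376: the class name, the `post`/`run_pending` methods and the convention that a lower number means higher priority are all assumptions for the example. Events with the same priority still run in the order they were posted.

```python
import heapq
import itertools


class PrioritizedEventQueue:
    """Toy event queue: lower priority number runs first, FIFO within a level."""

    def __init__(self):
        self._heap = []
        # Monotonic counter used as a tie-breaker so equal priorities stay FIFO
        # and callbacks themselves are never compared.
        self._counter = itertools.count()

    def post(self, callback, priority=1):
        heapq.heappush(self._heap, (priority, next(self._counter), callback))

    def run_pending(self):
        """Run all queued events in priority order and return their results."""
        results = []
        while self._heap:
            _, _, callback = heapq.heappop(self._heap)
            results.append(callback())
        return results


def demo():
    queue = PrioritizedEventQueue()
    queue.post(lambda: "background cleanup", priority=2)
    queue.post(lambda: "timer", priority=1)
    queue.post(lambda: "user input", priority=0)
    queue.post(lambda: "another timer", priority=1)
    return queue.run_pending()
```

In this sketch the user-input event jumps ahead of everything posted before it, which is the kind of behaviour that makes a UI feel more responsive under load.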

Tim Taubert is working to reproduce a tab animation regression in bug 752837. He is also taking over making Firefox themes less of a performance pig in bug 650968.

We had great success with eliciting data on slow startups in Nicholas Chaim’s blog post. We confirmed that external processes can affect Firefox startup (we had evidence for this) and that we can detect those situations (great work Nicholas!). It will be a hard slog before we can bolt a pretty UI to the extension + integrate system diagnostics into Firefox. In the meantime I recommend that SUMO people start suggesting this extension to diagnose slow Firefox installs. Nicholas is working on a revision of the extension that records slow-IO-caused-startup situations on a server so we prioritize + turn these into detectable fingerprints.

“The Performance of Open Source Applications” book is looking for contributors. Would be cool if someone snuck some Mozilla wisdom in there.

Sorry for skipping the snappy update last week. These posts take a lot more effort than is reasonable and I needed to direct it at my talk last week. You can see my Velocity slides here.

At least one person objected to the strong language used in the presentation (ie “dom local storage sucks”). I chose this language to emphasize the fact this isn’t a feature where one gets to weigh upsides vs downsides because the downsides are so severe. Most of the positive data on this is coming from what I believe to be unrepresentative benchmarks. I have not seen any other data points similar in quality to those reported by our telemetry.

Btw, Jan Varga is close to removing our IndexedDB prompt (bug 758357), opening up IndexedDB as an alternative to DOM Local Storage (which sucks).