Status

The project status is presented first in this document to make progress easy to track, but these items probably won't make much sense before reading the rest of the document. The specific work items will almost certainly change often during the project's lifetime.

Deploy (parallelizable)

(?) AWFY-style runs: regularly triggered off of m-c pushes, with results compared to other browsers' results from latest publicly available version/release

(?) Turn-key setup: developers make an IT request, an "Eideticker Box" appears at their door soon thereafter

(?) Tinderbox integration: run tests and report on every push, in order to spot regressions. Unclear whether this would subsume AWFY-style runs or vice versa, or whether they're orthogonal.

All done!

Background

Our performance tests currently rely on metrics accessible by web content (Date.now, events, setTimeout, etc.). Many of these sources of information are loosely defined or just lies, implemented that way to improve performance of real web pages. Performance tests built on these metrics are useful in their own right, but it's very hard to use them to measure what users actually perceive: pixels appearing on screen. The problem is made even harder by process separation, GPU rendering, async scrolling and animation, and async rerendering (fennec).

There are some additional issues with our current test harnesses that become especially problematic on mobile devices. The harnesses are mostly written to run on the same system as the one being tested, meaning that the tests and the harness can compete for system resources.

So, I think that for testing user-perceived performance, a new test harness is in order.

Motivating examples

Microsoft's Psychedelic Browsing demo works as intended on Windows desktops: if the browser uses the GPU to render content, the reported score is high; if it doesn't, the score is low. On X11 systems, the test does not work as intended. If Firefox runs on an X11 installation that uses the GPU to implement XRender, the test reports a high score and the user observes a high framerate. On X11 installations with a CPU-implemented XRender, the test still reports the same high score, but the user sees <1 frame per second drawn to screen.

Robert O'Callahan modified GUIMark3 to test our plugin rendering. The test counts paints performed by the plugin and equates them with frames drawn to screen. While Firefox on X11 scored lower as measured by the test than other browsers, the actual plugin framerate was qualitatively higher, easily distinguishable to the naked eye.

Fennec re-renders content asynchronously when the user pans, in order to mask the latency of the "actual" reflow/paint/etc (which can't be done in real-time, in general). To hide this "cheat", when fennec paints web content, it paints more than will be displayed on screen at any time. This creates a "prefetch" region which can be panned into view before needing new pixels from web content. If the user pans outside of this prefetch region, fennec has nothing to display, and instead draws a checkerboard pattern. How large the prefetch region should be, what shape it should be, and how long fennec should wait before asking web content to repaint, are heuristics that need to be tuned. Currently they're tuned based on developers panning around on popular web pages and trying to guess whether there's more or less "checkerboarding" with patches that change the heuristics.

Measuring page-load performance is Hard. A common approach is to time how long it takes for a document's onload event to fire. The problem with this is that Firefox might not have painted, or even reflowed(!), the document when Firefox fires onload. That is, onload doesn't necessarily correspond at all with what users see. The MozAfterPaint event is better, in that it's (usually) fired only when Firefox draws part of the document to screen, but that in itself gives no indication of "how much" of the page has loaded (e.g., if a particular image has been downloaded, decoded, and painted). I haven't yet seen an attempt to optimize how pages are painted during initial load according to quantitative objective functions, perhaps partly because rocket science is needed to gather the raw data.

Non-goals

Replace any existing testing infrastructure. This new harness is intended to complement talos et al.

Goals

Test infrastructure doesn't compete with tests for system resources. E.g., don't serve pages through http.js running on the test system.

Take the software paths used in the wild. E.g., load pages through ns*Http*Channel, not through data/file channel.

Go through the hardware used in the wild. E.g., load pages through NIC/WNIC, not from disk.

Ability to run tests on competitors' browsers

Test results include independent data that validates the calculated numerical data. E.g., results of page-load tests include a (highly compressed) video of the page load along with the data/graphs calculated

Tests can be run locally by all developers.

Test results are repeatable.

Test results are reported in a statistically sound way.

Test data can be reported at arbitrary granularity. E.g., data available down to the level of times of individual runs of a test within a trial.

Tests run as part of existing automation framework. E.g., run on each checkin to m-c, changes in results reported to tree-management

Many of these goals conflict.

Types of measurements to be made

Responsiveness: ping the browser in various ways, measure pong in the form of pixels appearing on screen

Perceived load time: not just how fast pixels appear on screen, but which pixels and according to what pattern.

Panning/zooming (for fennec): how fast can content be moved on screen, how long does it take for "checkerboard" regions to be filled in

Scrolling (non-fennec): similar to above

Framerate of animations (actual framerate!)

Ideal infrastructure (WIP)

Tests are driven by a robot that has

Control of all input devices on test system

A camera that records the value of each individual pixel drawn on every monitor vsync. (NB: this implies the quantum of measurable time is 1/vsync-rate.)

Test pages are served over a network that has

Precisely known latency and bandwidth

Arbitrarily configurable latency and bandwidth

Approximating the ideal

No one has time to build such a robot, and perfect networks don't exist, so we will need to approximate them, probably in platform-specific ways.

Network

A cross-platform approximation to the ideal network is to run DNS and HTTP servers on a (quiet) host machine over a dedicated (quiet) ethernet or wireless network (preferably wireless, for the sake of fennec). The minimum configurable latency is the intrinsic network latency, but bandwidth and latency could be arbitrarily throttled (approximately) by the HTTP server itself. Pages would need to be served from a ramfs.

The kindly Chromium folks have done some work that could help out here.
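As a rough illustration of throttling in the HTTP server itself, here is a minimal sketch of a paced write loop. The function name `throttled_send` and all of its parameters are hypothetical, invented for this example; a real server would also need to account for the intrinsic network latency mentioned above.

```python
import time


def throttled_send(write, payload, bytes_per_sec, latency_sec=0.0,
                   chunk_size=1024, sleep=time.sleep):
    """Write `payload` in chunks, sleeping to cap throughput.

    `write` is any callable that accepts bytes (for example a socket's
    sendall). `sleep` is injectable so pacing can be inspected without
    actually waiting. All names here are illustrative assumptions.
    """
    if bytes_per_sec <= 0:
        raise ValueError("bytes_per_sec must be positive")
    sleep(latency_sec)  # artificial delay before the first byte
    for offset in range(0, len(payload), chunk_size):
        chunk = payload[offset:offset + chunk_size]
        write(chunk)
        sleep(len(chunk) / bytes_per_sec)  # pace to the target bandwidth
```

Sleeping after each chunk only approximates the target rate, since the time spent inside `write` is not subtracted; that matches the "approximately" caveat above.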

Capturing the screen

Use a DVI capture card to record the output of any machine capable of outputting a DVI signal. This will cover

Tegra 2 boards already in infra

pretty much any modern desktop or laptop

This also allows us to run our perf tests on competitors' browsers.

The capture card allows us to record every pixel drawn, uncompressed, at the signal's sync rate (should be <= 60Hz). The capture itself is entirely platform-neutral, since it's performed on the raw DVI signal. These cards can get a bit pricey, so we need to decide on a lowest-common-denominator model. The Datapath VisionRGB E1 seems to suit our needs. BEWARE: there are cheaper capture cards, that are however NOT able to record an uncompressed signal and/or are NOT able to capture at 60Hz.

We want to be able to capture HD-resolution screens at 60Hz, for desktop. At 4 bytes per pixel, this will generate 1920*1080*4*60 bytes/sec ~= 474.6 MiB/sec. It will take a beefy recording machine, with lots of RAM for buffering, and software written to run close to the metal to keep up. We will probably need to know how the capture card's driver is implemented in order to transfer data into userspace efficiently enough. Luckily, Datapath claims to have Linux drivers available to OEMs. If we play our hand well, we might be able to get ahold of the source.
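The bandwidth arithmetic above can be checked with a few lines of Python. The helper names and the 8 GiB buffer size are assumptions chosen purely for illustration.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def capture_rate_bytes(width, height, bytes_per_pixel, hz):
    """Bytes per second produced by an uncompressed capture."""
    return width * height * bytes_per_pixel * hz


def seconds_bufferable(ram_bytes, rate_bytes_per_sec):
    """How many seconds of capture fit in a RAM buffer of the given size."""
    return ram_bytes / rate_bytes_per_sec


hd_rate = capture_rate_bytes(1920, 1080, 4, 60)
print(f"{hd_rate / MIB:.1f} MiB/sec")  # 474.6 MiB/sec
# Hypothetical 8 GiB buffer: roughly 17 seconds of uncompressed HD capture.
print(f"{seconds_bufferable(8 * GIB, hd_rate):.1f} s")
```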

This test infrastructure will be more useful initially for fennec than desktop FF (as of the FF4 architecture), and the tegra boards probably won't be able to drive 1920x1080, so we should probably initially focus on the easier problem of capturing output from the tegras.

Some analyses need access to the pixels of all recorded frames; that is, they can't be performed in a streaming style on a small number of recent frames as they arrive from the capture card. (We also might want to do multiple analyses of one test run, or analyze the recorded frames off-line, or ...) This means we need to save the screen recording somewhere, semi-permanently. The options seem to be

cap test runs to the number of seconds we can buffer frames in RAM

store data to disk, probably with a (lossless) compression scheme

Simple compression schemes should work well for the data we want to gather; row deduplication between successive frames and/or run-length-encoding are possibilities. Another possibility is farming out PNG compression to multiple threads. Failing simple schemes, we can talk to Tim Terriberry, who's a world expert in video compression.
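To make the two simple schemes concrete, here is a minimal sketch that combines row deduplication between successive frames with run-length encoding. It assumes a frame is a list of rows and a row is a list of integer pixel values; that representation and all function names are illustrative, not what a real recorder would use.

```python
def rle_encode(row):
    """Run-length encode one row as [(pixel, run_length), ...]."""
    runs = []
    for px in row:
        if runs and runs[-1][0] == px:
            runs[-1] = (px, runs[-1][1] + 1)
        else:
            runs.append((px, 1))
    return runs


def rle_decode(runs):
    """Invert rle_encode."""
    row = []
    for px, length in runs:
        row.extend([px] * length)
    return row


def compress_frame(frame, prev_frame=None):
    """Store None for rows identical to the previous frame, else RLE runs."""
    encoded = []
    for y, row in enumerate(frame):
        if prev_frame is not None and row == prev_frame[y]:
            encoded.append(None)  # row deduplicated against prev_frame
        else:
            encoded.append(rle_encode(row))
    return encoded


def decompress_frame(encoded, prev_frame=None):
    """Rebuild a frame; prev_frame is required if any row was deduplicated."""
    return [list(prev_frame[y]) if runs is None else rle_decode(runs)
            for y, runs in enumerate(encoded)]
```

Both schemes are lossless, so a decompressed recording can be fed to any of the analyses unchanged.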

Controlling the device

We will probably need platform-specific "drivers" to accomplish this. Ideally we would deliver OS-level input events directly to FF's widget. This is doable on at least X11, Android, and Windows. We can fall back on an add-on that runs in the FF instance being tested and delivers these events directly to Gecko. (It would need to deliver events from a separate thread.) The two main downsides of an add-on are (i) its events are less realistic than OS events, and the add-on itself disturbs the system being tested; and (ii) we can't run our tests on competing browsers, or, if a competing browser does support sufficiently powerful add-ons, we have to maintain an add-on per browser.

Early prototype

record: create a new screen recording, dumping the output files into the directory out/. NB: this only works on Gdk/Gtk platforms.

fps: compute the framerate of the recording in out/

loadhist: compute the "load histogram" (described below) of the recording in out/

These commands can be chained; e.g., one can run python eideticker.py record fps loadhist.

encodeframes: Script that uses ffmpeg to encode a set of frames into an OGG/Theora movie. E.g., ./encodeframes out/ capture.ogg encodes all the frames in the directory out/ into the movie capture.ogg

A goal of Eideticker is to show developers a visual representation of the test run that they can correlate with the reported numerical data. It also heads off at the pass the old "My patch did not drop that test's animation to 0fps ... something must be wrong on the infra!" If the developer can access a small video of the test run, the question can be settled immediately. At the end of each example below, a video of the test run is included.

NB: the above prototype is just a proof-of-concept and is a DEAD END. Its current approach is too slow to be made practical, and has a HUGE perf impact on the system being tested. Please don't try to optimize it ;).

The raw frame captures from the examples below are available if anyone wants them.

Example: Framerate analysis

An interesting perf-analysis problem we've hit on X11 systems shows up on Microsoft's Psychedelic Browsing demo. On systems with an XRender implementation that uses the GPU, the page reports a very high framerate ("RPM" in its weird terminology) and a very high framerate is actually drawn to screen. However, on systems with a CPU-only XRender implementation (i.e. a slow impl), the page still reports a high framerate, even though the rate of frames drawn to screen can be <1fps.

Example: Load histogram

Our existing pageload tests can tell when various events of note occur, say when the onload event is fired, but they can only give us an approximation of the user's experience of the page load. We can get richer data from a screen recording of the page load: with that, we can see which pixels loaded when, and according to what pattern.

Here's the (simple) histogram computed by this prototype while FF loads this ARM help page. [NB: The prototype recorder is so slow that I had to look around for a page that loaded slowly enough to get an interesting histogram. Thank heavens for JSP!]

This histogram shows how many pixels were the same between frame i and the final frame. The optimal load histogram would show the frame similarity immediately go to 100%. We would want to improve the load histogram according to these hand-wavy rules

definitely better for "everything" to appear ASAP

probably better for "something" to appear ASAP

in between, let UX people (or whoever) tell us what heuristics to tune for

I also refer to "green screen" and "red screen". These might be implemented by loading data URIs for HTML documents with green/red backgrounds.

Note too that individual frames are referred to as f_0 through f_n in a run that records n+1 frames. Each frame is a "screenshot" pulled from the capture card. Perhaps confusingly, f_i is also used to identify points in time. We can equate particular frames with points in time because the capture card (should!) reliably deliver frames at a known rate of X Hz. So, 1/X seconds always elapse between f_{i-1} and f_i, and f_i refers to some point in the interval of time between f_{i-1} and f_{i+1}.

Framerate: frames per second, computed as follows. f_s is a unique frame. Between f_s and f_n, if the pixels captured for frame f_i differ from those in frame f_{i-1}, count another unique frame. The final framerate is #uniqueFrames / (f_n - f_s), with f_n and f_s read as points in time. This is a simplistic definition that doesn't attempt to account for tearing, but ... see below.
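The unique-frame count described above can be sketched as follows. This assumes frames are directly comparable objects (here, lists of pixel rows) and that the capture rate is known; the function name and parameters are illustrative.

```python
def framerate(frames, capture_hz, start=0):
    """Unique frames per second from frames[start] to the last frame.

    The start frame itself counts as unique; every later frame that
    differs from its predecessor counts as one more.
    """
    intervals = len(frames) - 1 - start
    if intervals < 1:
        raise ValueError("need at least two frames from the start frame")
    unique = 1
    for i in range(start + 1, len(frames)):
        if frames[i] != frames[i - 1]:
            unique += 1
    # intervals / capture_hz seconds elapse between first and last frame.
    return unique * capture_hz / intervals


# Seven captured frames at 60Hz showing three distinct images.
a, b, c = [[0]], [[1]], [[2]]
print(framerate([a, a, b, b, c, c, c], capture_hz=60))  # 30.0
```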

where sizeof is the size of a frame in number of pixels, and pixelDiff counts the number of different pixels in two frames.

The idea here is that we want to reach the "final" rendering as quickly as possible. We probably also want to draw "something" to screen quickly, to appear more responsive. In between, UX folks can tell us what to tune using video from loads or whatever.
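A minimal sketch of the load histogram, reconstructed from the prose description (the fraction of pixels in each frame that already match the final frame). The names `pixel_diff` and `frame_size` mirror pixelDiff and sizeof; the exact formula is an assumption, and frames are assumed to be equal-sized lists of pixel rows.

```python
def pixel_diff(frame_a, frame_b):
    """Number of pixel positions at which two equal-sized frames differ."""
    return sum(pa != pb
               for row_a, row_b in zip(frame_a, frame_b)
               for pa, pb in zip(row_a, row_b))


def frame_size(frame):
    """Size of a frame in number of pixels."""
    return sum(len(row) for row in frame)


def load_histogram(frames):
    """For each frame, the fraction of pixels equal to the final frame's."""
    final = frames[-1]
    size = frame_size(final)
    return [(size - pixel_diff(frame, final)) / size for frame in frames]


# A 2x2 "page" that paints one pixel first, then everything.
blank = [[0, 0], [0, 0]]
partial = [[1, 0], [0, 0]]
loaded = [[1, 1], [1, 1]]
print(load_histogram([blank, partial, loaded]))  # [0.0, 0.25, 1.0]
```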

where countChanges counts the number of frames between fi-fj in which the value of pixel x,y changes.

The higher a pixel's "heat", the more frequently it changes during page load. Heat could indicate wasted work, so between two page loads with the same load histogram, we would probably prefer the colder heat map.
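A heat map along these lines could be computed as in the sketch below, which counts, for each pixel, the number of frame-to-frame transitions in which its value changes. The function name and the frame representation (equal-sized lists of pixel rows) are assumptions for illustration.

```python
def heat_map(frames):
    """Per-pixel count of successive-frame transitions that change it."""
    height = len(frames[0])
    width = len(frames[0][0])
    heat = [[0] * width for _ in range(height)]
    for prev, cur in zip(frames, frames[1:]):
        for y in range(height):
            for x in range(width):
                if cur[y][x] != prev[y][x]:
                    heat[y][x] += 1
    return heat


# The top-left pixel is repainted twice; the others at most once.
run = [[[0, 0], [0, 0]],
       [[5, 0], [0, 0]],
       [[1, 1], [0, 1]]]
print(heat_map(run))  # [[2, 1], [0, 1]]
```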

(Lots of other stuff is possible.)

[NB: The calculations above are presented as declaratively as possible for clarity. In reality, they could be calculated much more efficiently than would a naive, sequential implementation of the declarative formulas.]

Tearing

Tearing happens when we draw to screen but don't synchronize to monitor refreshes, and/or can't draw one entire frame during one cycle, and/or don't buffer properly, and/or try to draw frames at too high a rate. This causes parts of different frames to appear on the screen at the same time. It's an unpleasant phenomenon to view.

In general, it's impossible to decide whether a sequence of frames exhibits tearing without knowing what's supposed to be drawn. But, given a set of known images { I_i | I_i ∩ I_j = ∅ for i ≠ j } (that is, a set of images in which no two images have the same pixel value at any <x,y>), drawn to screen over a sequence of frames f_0 through f_n, we can decide whether there was tearing:

∃ f_{i-1}, f_i s.t. f_{i-1} ∩ f_i ≠ ∅ ∧ f_{i-1} ≠ f_i

(that is, two successive frames share pixels in common, but they aren't the same frame.) This system is free from false positives by definition, but false negatives are possible. We would probably want to implement tearing tests for DOM manipulation, canvas, videos, etc. and combinations thereof.
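The tearing criterion can be sketched directly: two successive frames that agree at some pixel position but are not identical. This assumes the test draws only pairwise-disjoint known images, as required above; the function names are illustrative.

```python
def shares_pixels(frame_a, frame_b):
    """True if the frames have the same value at any pixel position."""
    return any(pa == pb
               for row_a, row_b in zip(frame_a, frame_b)
               for pa, pb in zip(row_a, row_b))


def has_tearing(frames):
    """True if some pair of successive frames overlaps without being equal."""
    return any(prev != cur and shares_pixels(prev, cur)
               for prev, cur in zip(frames, frames[1:]))


# Two disjoint known images, and a torn frame mixing rows of both.
image_a = [[1, 1], [1, 1]]
image_b = [[2, 2], [2, 2]]
torn = [[2, 2], [1, 1]]
print(has_tearing([image_a, image_a, image_b]))  # False
print(has_tearing([image_a, torn, image_b]))     # True
```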

(Maybe) how many successive frames tear. Not sure if this would be useful.

Window open histogram

Same as the load histogram, but instead of launching a green screen + test page, open a bare FF/XUL window and have the window load synchronize with a red screen.

(WIP) Fennec panning/zooming

Details not worked out, but some ingredients are

load a page. Synchronize.

dispatch a bunch of events causing fennec to pan and zoom at max rate

extract framerate from recorded frames. Higher framerate is better.

extract "checkerboarding" info from recorded frames: what percentage of frames is not checkerboard, i.e. actual web content. Less checkerboard is better.

Here we could also use sequences of touch events that we record from human users, playing around with fennec in the real world. The idea is that different people exhibit different "input profiles", and we should tune panning heuristics over a weighted average of the known profiles.

(WIP) Scrolling

Details not worked out. Would be similar to fennec panning tests, except there's no checkerboarding to measure.

(WIP) Responsive tests

Details even less worked out. Dispatch events to the UI as a "ping", wait for expected pixels to appear on screen as "pong".
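One way the ping/pong measurement might look, as a sketch only: given the index of the frame at which the ping was dispatched and a predicate that recognizes the expected pixels, latency is the number of elapsed frames divided by the capture rate. All names and the predicate-based interface are assumptions.

```python
def pong_latency(frames, ping_frame, expected, capture_hz):
    """Seconds from the ping until expected(frame) first holds, else None.

    Resolution is limited to 1/capture_hz, the quantum of measurable time.
    """
    for i in range(ping_frame, len(frames)):
        if expected(frames[i]):
            return (i - ping_frame) / capture_hz
    return None


# Ping dispatched at frame 2; the expected pixel value 7 appears at frame 5.
recording = [[[0]], [[0]], [[0]], [[0]], [[0]], [[7]], [[7]]]
latency = pong_latency(recording, 2, lambda f: f[0][0] == 7, capture_hz=60)
print(latency)  # 0.05
```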

Addendum: Only recording the content area

In most of the analyses above, we only care about the pixels in the area in which web content is drawn ("content window"). For example, if we're doing a framerate analysis on a recording of the entire screen, and during the test the system clock on the test machine changes display from "1201" to "1202", we might record an extra "new" frame when we shouldn't have. The same goes for the text in Firefox's window title, etc. I suspect this won't be a big problem, but we can avoid it with an extra synchronization step. At the beginning of a test, we can force the browser to draw an agreed-on pattern (green screen, pink screen, green screen with pink corners, QR code, whatever), and have Eideticker analyze frames to determine when the full pattern is present. From then on, Eideticker can remember the bounds of the "interesting" content, as demarcated by the synchronization pattern, and discard uninteresting content. (This would also reduce the analysis and storage costs by a small amount, but nothing worth speaking of.)

This is a bit harder on fennec, where web content scrolling can move parts of the UI into/out of view. I don't foresee any hard problems though.
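The synchronization step could be sketched as finding the bounding box of the agreed-on pattern and cropping later frames to it. The sketch assumes the simplest pattern, a solid sync colour filling the content area, and uses hypothetical function names; a QR code or cornered pattern would need a real detector.

```python
def content_bounds(frame, sync_color):
    """Inclusive (left, top, right, bottom) box of pixels equal to sync_color."""
    xs, ys = [], []
    for y, row in enumerate(frame):
        for x, px in enumerate(row):
            if px == sync_color:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None  # the sync pattern is not on screen yet
    return min(xs), min(ys), max(xs), max(ys)


def crop(frame, bounds):
    """Keep only the pixels inside the inclusive bounds."""
    left, top, right, bottom = bounds
    return [row[left:right + 1] for row in frame[top:bottom + 1]]


# A 4x4 screen whose inner 2x2 content area shows sync colour 9.
GREEN = 9
screen = [[0, 0, 0, 0],
          [0, 9, 9, 0],
          [0, 9, 9, 0],
          [0, 0, 0, 0]]
bounds = content_bounds(screen, GREEN)
print(bounds)                # (1, 1, 2, 2)
print(crop(screen, bounds))  # [[9, 9], [9, 9]]
```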

Out-of-band information

Andreas requested that the test harness be able to provide out-of-band information during a test, such as when GCs happen during test runs. This is somewhat tricky; the more out-of-band data we record, the less black-box the testing becomes. We also won't be able to gather this data from other browsers. I think this is a secondary problem: if the results are reliable and repeatable, then developers can make whatever measurements they need off-line, with profilers etc., then run patches back through Eideticker to see if the problem went away.

That said, we have some ways to gather this kind of information. One option is to use a lightweight, xperf-like event-aggregation system (or xperf itself) and save the results after individual test runs, using real-time timestamps (in the CLOCK_REALTIME sense) to approximately correlate the events with rendered frames. A more exotic option is to reserve K pixels on the screen in which 16K or 32K bits of information would be encoded and drawn along with each rendered frame (think movie subtitles). This is attractive because the information would be encoded with a built-in timestamp, the frame in which it was drawn.
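As an illustration of the reserved-pixel idea, the sketch below packs a byte payload into K pixels at 32 bits per pixel and recovers it from a captured frame. The function names, the big-endian packing, and the zero padding are all assumptions; a real scheme would also have to survive any colour conversion between the browser and the capture card.

```python
def encode_payload(data, k):
    """Pack bytes into k 32-bit pixel values, zero-padded at the end."""
    if len(data) > 4 * k:
        raise ValueError("payload does not fit in k pixels")
    padded = data.ljust(4 * k, b"\x00")
    return [int.from_bytes(padded[i:i + 4], "big")
            for i in range(0, 4 * k, 4)]


def decode_payload(pixels):
    """Recover the (zero-padded) bytes from 32-bit pixel values."""
    return b"".join(px.to_bytes(4, "big") for px in pixels)


# Hypothetical event tag drawn into two reserved pixels of a frame.
reserved = encode_payload(b"GC@1201", 2)
print(decode_payload(reserved).rstrip(b"\x00"))  # b'GC@1201'
```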

This shouldn't be an issue with a sufficiently fast disk.

Also, perhaps a custom app could record only the pixels of interest, since most of each frame is black. There would probably be a CPU-disk speed tradeoff here.