Last summer I wrote a bit about using Eideticker to measure the relative performance of Firefox for Android versus other browsers (Chrome, stock, etc.). At the time I was pretty optimistic about Eideticker’s usefulness as a truly “objective” measure of user experience that would give us a more accurate view of how we compared against the competition than traditional benchmarking suites (which more often than not, measure things that a user will never see normally when browsing the web). Since then, there’s been some things that I’ve discovered, as well as some developments in terms of the “state of the art” in mobile browsing that have caused me to reconsider that view — while I haven’t given up entirely on this concept (and I’m still very much convinced of eideticker’s utility as an internal benchmarking tool), there’s definitely some limitations in terms of what we can do that I’m not sure how to overcome.

Essentially, there are currently three different types of Eideticker performance tests:

Animation tests: Measure the smoothness of an animation by comparing frames and seeing how many are different. Currently the only example of this is the canvas “clock” test, but many others are possible.

Startup tests: Measure the amount of time it takes from when the application is launched to when the browser is fully running/available. There are currently two variants of this test in the dashboard, both measure the amount of time taken to fully render Firefox’s home screen (the only difference between the two is whether the browser profile is fully initialized). The dirty profile benchmark probably most closely resembles what a user would usually experience.

Scrolling tests: Measure the amount of undrawn areas when the user is panning a website. Most of the current eideticker tests are of this kind. A good example of this is the taskjs benchmark.

In this blog post, I’m going to focus on startup and scrolling tests. Animation tests are interesting, but they are also generally the sorts of tests that are easiest to measure in synthetic ways (e.g. by putting a frame counter in your javascript code) and have thus far not been a huge focus for Eideticker development.

As it turns out, it’s unfortunately been rather difficult to create truly objective tests which measure the difference between browsers in these two categories. I’ll go over them in order.

Startup tests

There are essentially two types of startup tests: one where you measure the amount of time to get to the browser’s home screen when you explicitly launch the app (e.g. by pressing the Firefox icon in the app chooser), another where you load a web page in a browser from another app (e.g. by clicking on a link in the Twitter application).

The first is actually fairly easy to test across browsers, although we are not currently. There’s not really a good reason for that, it was just an oversight, so I filed bug 852744 to add something like this.

The second case (startup to the browser’s homescreen) is a bit more difficult. The problem here is that, in a nutshell, an apples to apples comparison is very difficult if not impossible simply because different browsers do different things when the user presses the application icon. Here’s what we see with Firefox:

And here’s what we see with Chrome:

And here’s what we see with the stock browser:

As you can see Chrome and the stock browser are totally different: they try to “restore” the browser back to its state from the last time (in Chrome’s case, I was last visiting taskjs.org, in Stock’s case, I was just on the homepage).

Personally I prefer Firefox’s behaviour (generally I want to browse somewhere new when I press the icon on my phone), but that’s really beside the point. It’s possible to hack around what chrome is doing by restoring the profile between sessions to some sort of clean “new tab” state, but at that point you’re not really reproducing a realistic user scenario. Sure, we can draw a comparison, but how valid is it really? It seems to me that the comparison is mostly only useful in a very broad “how quickly does the user see something useful” sense.

Panning tests

I had quite a bit of hope for these initially. They seemed like a place where Eideticker could do something that conventional benchmarking suites couldn’t, as things like panning a web page are not presently possible to do in JavaScript. The main measure I tried to compare against was something called “checkerboarding”, which essentially represents the amount of time that the user waits for the page to redraw when panning around.

At the time that I wrote these tests, most browsers displayed regions that were not yet drawn while panning using the page background. We figured that it would thus be possible to detect regions of the page which were not yet drawn by looking for the background color while initiating a panning action. I thus hacked up existing web pages to have a magenta background, then wrote some image analysis code to detect regions that were that color (under the assumption that magenta is only rarely seen in webpages). It worked pretty well.

The world’s moved on a bit since I wrote that: modern browsers like Chrome and Firefox use something like progressive drawing to display a lower resolution “tile” where possible in this case, so the user at least sees something resembling the actual page while panning on a slower device. To see what I mean, try visiting a slow-to-render site like taskjs.org and try panning down quickly. You should see something like this (click to expand):

Unfortunately, while this is certainly a better user experience, it is not so easy to detect and measure. For Firefox, we’ve disabled this behaviour so that we see the old checkerboard pattern. This is useful for our internal measurements (we can see both if our drawing code as well as our heuristics about when to draw are getting better or worse over time) but it only works for us.

If anyone has any suggestions on what to do here, let me know as I’m a bit stuck. There are other metrics we could still compare against (i.e. how smooth is the panning animation aka frames per second?) but these aren’t nearly as interesting.

Just wanted to give a quick heads up that as part of the ateam’s ongoing effort to improve the documentation of our automated testing infrastructure, we now have online documentation for mozdevice, the python library we use for interacting with Android- and FirefoxOS-based devices in automated testing.

Mozdevice is used in pretty much every one of our testing frameworks that has mobile support, including mochitest, reftest, talos, autophone, and eideticker. Additionally, mozdevice is used by release engineering to clean up, monitor, and otherwise manage our hundred-odd tegra and panda development boards that we use in tbpl. See sut_tools (old, buildbot-based, what we currently use) and mozpool (the new and shiny future).

Quick update on my last post about finding some kind of camera suitable for use in automated performance testing of fennec and b2g with eideticker. Shortly after I wrote that, I found out about a company called Point Grey Research which manufactures custom web cameras intended for exactly the sorts of things we’re doing. Initial results with the Flea3 camera I ordered from them are quite promising:

(the actual capture is even higher quality than that would suggest… some detail is lost in the conversion to webm)

There seems to be some sort of issue with the white balance in that capture which is causing a flashing effect (I suspect this just involves flipping some kind of software setting… still trying to grok their SDK), and we’ll need to create some sort of enclosure for the setup so ambient light doesn’t interfere with the capture… but overall I’m pretty optimistic about this baby. 60 frames per second, very high resolution (1280×960), no issues with HDMI (since it’s just a USB camera), relatively inexpensive.

[ For more information on the Eideticker software I'm referring to, see this entry ]

Ok, so as I mentioned last time I’ve been looking into making Eideticker work for devices without native HDMI output by capturing their output with some kind of camera. So far I’ve tried four different DSLRs for this task, which have all been inadequate for different reasons. I was originally just going to write an email about this to a few concerned parties, but then figured I may as well structure it into a blog post. Maybe someone will find it useful (or better yet, have some ideas.)

Elmo MO-1

This is the device I mentioned last time. Easy to set up, plays nicely with the Decklink capture card we’re using for Eideticker. It seemed almost perfect, except for that:

The image quality is really bad (beaten even by $200 standard digital camera). Tons of noise, picture quality really bad. Not *necessarily* a deal breaker, but it still sucks.

More importantly, there seems to be no way of turning off the auto white balance adjustment. This makes automated image analysis impossible if the picture changes, as is highlighted in this video:

Canon Rebel T4i

This is the first camera that was recommended to me at the camera shop I’ve been going to. It does have an HDMI output signal, but it’s not “clean”. This blog post explains the details better than I could. Next.

Nikon D600

Supposedly does native clean 720p output, but unfortunately the output is in a “box” so isn’t recognized by the Decklink cards that we’re using (which insist on a full 1280×720 HDMI signal to work). Next.

Nikon D800

It is possible to configure this one to not put a box around the output, so the Decklink card does recognize it. Except that the camera shuts off the HDMI signal whenever the input parameters change on the card or the signal input is turned on, which essentially makes it useless for Eideticker (this happens every time we start the Eideticker harness). Quite a shame, as the HDMI signal is quite nice otherwise.

–

To be clear, with the exception of the Elmo all the devices above seem like fine cameras, and should more than do for manual captures of B2G or Android phones (which is something we probably want to do anyway). But for Eideticker, we need something that works in automation, and none of the above fit the bill. I guess I could explore using a “real” video camera as opposed to a DSLR acting like one, though I suspect I might run into some of the same sorts of issues depending on how the HDMI output of those devices behaves.

Part of me wonders whether a custom solution wouldn’t work better. How complicated could it be to construct your own digital camera anyway? Hook up a fancy camera sensor to a pandaboard, get it to output through the HDMI port, and then we’re set? Or better yet, maybe just get a fancy webcam like the Playstation Eye and hook it up directly to a computer? That would eliminate the need for our expensive video capture card setup altogether.

[ For more information on the Eideticker software I'm referring to, see this entry ]

Here’s a long overdue update on where we’re at with Eideticker for FirefoxOS. While we’ve had a good amount of success getting useful, actionable data out of Eideticker for Android, so far we haven’t been able to replicate that success for FirefoxOS. This is not for lack of trying: first, Malini Das and then me have been working at it since summer 2012.

When it comes right down to it, instrumenting Eideticker for B2G is just a whole lot more complex. On Android, we could take the operating system (including support for all the things we needed, like HDMI capture) as a given. The only tricky part was instrumenting the capture so the right things happened at the right moment. With FirefoxOS, we need to run these tests on entire builds of a whole operating system which was constantly changing. Not nearly as simple. That said, I’m starting to see light at the end of the tunnel.

Platforms

We initially selected the pandaboard as the main device to use for eideticker testing, for two reasons. First, it’s the same hardware platform we’re targeting for other b2g testing in tbpl (mochitest, reftest, etc.), and is the platform we’re using for running Gaia UI tests. Second, unlike every other device that we’re prototyping FirefoxOS on (to my knowledge), it has HDMI-out capability, so we can directly interface it with the Eideticker video capture setup.

However, the panda also has some serious shortcomings. First, it’s obviously not a platform we’re shipping, so the performance we’re seeing from it is subject to different factors that we might not see with a phone actually shipped to users. For the same reason, we’ve had many problems getting B2G running reliably on it, as it’s not something most developers have been hacking on a day to day basis. Thanks to the heroic efforts of Thomas Zimmerman, we’ve mostly got things working ok now, but it was a fairly long road to get here (several months last fall).

More recently, we became aware of something called an Elmo which might let us combine
the best of both worlds. An elmo is really just a tiny mounted video camera with a bunch of outputs, and is I understand most commonly used to project documents in a classroom/presentation setting. However, it seems to do a great job of capturing mobile phones in action as well.

The nice thing about using an external camera for the video capture part of eideticker is that we are no longer limited to devices with HDMI out — we can run the standard set of automated tests on ANYTHING. We’ve already used this to some success in getting some videos of FirefoxOS startup times versus Android on the Unagi (a development phone that we’re using internally) for manual analysis. Automating this process may be trickier because of the fact that the video capture is no longer “perfect”, but we may be able to work around that (more discussion about this later).

FirefoxOS web page tests

These are the same tests we run on Android. They should give us an idea of roughly where our performance when browsing / panning web sites like CNN. So far, I’ve only run these tests on the Pandaboard and they are INCREDIBLY slow (like 1-3 frames per second when scrolling). So much so that I have to think there is something broken about our hardware acceleration on this platform.

FirefoxOS application tests

These are some new tests written in a framework that allows you to script arbitrary interactions in FirefoxOS, like launching applications or opening the task switcher.

I’m pretty happy with this. It seems to work well. The only problems I’m seeing with this is with the platform we’re running these tests on. With the pandaboard, applications look weird (since the screen resolution doesn’t remotely resemble the 320×480 resolution on our current devices) and performance is abysmal. Take, for example, this capture of application switching performance, which operates only at roughly 3-4 fps:

So what now?

I’m not 100% sure yet (partly it will depend on what others say as well as my own investigation), but I have a feeling that capturing video of real devices running FirefoxOS using the Elmo is the way forward. First, the hardware and driver situation will be much more representative of what we’ll actually be shipping to users. Second, we can flash new builds of FirefoxOS onto them automatically, unlike the pandaboards where you currently either need to manually flash and reset (a time consuming and error prone process) or set up an instance of mozpool (which I understand is quite complicated).

The main use case I see with eideticker-on-panda would be where we wanted to run a suite of tests on checkin (in tbpl-like fashion) and we’d need to scale to many devices. While cool, this sounds like an expensive project (both in terms of time and hardware) and I think we’d do better with getting something slightly smaller-scale running first.

So, the real question is whether or not the capture produced by the Elmo is amenable to the same analysis that we do on the raw HDMI output. At the very least, some of eideticker’s image analysis code will have to be adapted to handle a much “noisier” capture. As opposed to capturing the raw HDMI signal, we now have to deal with the real world and its irritating fluctuations in ambient light levels and all that the rest. I have no doubt it is *possible* to compensate for this (after all this is what the human eye/brain does all the time), but the question is how much work it will be. Can’t speak for anyone else at Mozilla, but I’m not sure if I really have the time to start a Ph.D-level research project in computational vision. I’m optimistic that won’t be necessary, but we’ll just have to wait and see.

To help with the business of making Android or FirefoxOS devices do our bidding, Mozilla Automation & Tools developed a Python library called mozdevice which allows you to control these devices either using the Android Debug Bridge protocol (which is actually not Android specific: FirefoxOS devices use it too) or the System Under Test protocol (a Mozilla-specific thing).

Anyone familiar with debugging these devices is doubtless familiar with adb, which provides a command line interface that allows you to push/pull files, run a shell, etc. To help test our python code (as well as expand the scope of what’s possible on the command line), I created a similar utility a few months ago called “dm” which provides a command-line interface to the aforementioned mozdevice code. It’s shipped as part of mozdevice, and testing it out is pretty simple if you have virtualenv installed:

Testing out mozdevice code. For example, today we discovered an (unfortunate) bug in devicemanagerADB’s killProcess routine. It was easy to verify both the problem and my fix did what I expected by starting my custom build of fennec and running this command:

Before you ask, yes, it’s technically possible to do much of this with the original adb utility. But (1) some of our internal stuff can’t use adb a variety of reasons and (2) some of the tasks above (e.g. launching an app, getting a screenshot) involve considerably more typing with adb than with dm. So it’s still a win.

Just a few quick notes on how to avoid a class of errors I’ve been seeing in Mozilla’s automation over the last year. Since python interprets code dynamically, it’s pretty easy for problems like undefined variables to slip through, especially if they’re in a codepath that isn’t frequently tested. The most recent example I found was in some cleanup-after-error code for remote mochitest/reftest, which tried to call a “self.cleanup” from a standalone method.

The end result of this is that the cleanup code will *never* be called, it will just silently fail to run. Probably not the end of the world in this case (there are other parts of our mobile automation which will perform the same cleanup later), but it’s easy to imagine another one where this would be a more serious problem.

There’s two very easy ways to help stop errors like this before they hit our code:

Try to avoid using a blanket try…except. In addition to catching legitimate problems which we want to ignore (in the remote case for example, devicemanager exceptions), it also catches (and thus obscures) things like syntax, name, or type errors. Instead, try just catching the exception you’re looking for. For example, we might rewrite the case above as:

pyflakes, pyflakes, pyflakes. Pyflakes is a fantastic tool for analyzing your python code for common problems. It’s kind of analagous to jslint, for those of you familiar with that. Here’s what happens when we run pyflakes against the offending code:

Frof pretty considerably reduces the amount of busywork you need to do to gather a profile. Instead of a rather complicated multi-step process to initialize fennec with the right parameters for profiling, downloading profiles, etc., you can just run the frof script like so:
frof org.mozilla.fennec http://wrla.ch mywonderfulprofile.zip

Assuming that frof was bootstrapped correctly (and your phone is connected to your computer in debugging mode), this should start up fennec automatically with that URL loaded. Now, just perform whatever action you want to profile on your phone, then press enter in the terminal when you’re done. Voila, instant profile trace which you can examine, post to bugs, etc. All the other details are automated.

Backstory: the inspiration for frof came from a personal itch of mine, the fact that leaflet.js maps seem to be causing out-of-memory errors on Fennec when zooming is enabled (bug 784580). I wanted to be able to capture some profiles to see what was going on, but the current instructions on MDN seem a bit unwieldly. I figured I’d get lots of mileage out of a tool to make this easier (especially if I was going to get into a profile, edit, debug cycle), so I spent a few hours dilligently copying the logic we put into eideticker to gather profiles into a standalone script.

Unfortunately in my case, the gecko profile didn’t tell me much, aside from the fact that Gecko didn’t seem to be the culprit (remember that on Android we also have lots of Java-based frontend code to contend with, which the profiler doesn’t measure). I’m going to stare more at the Java code and dig into the various high-level tools that Android provides for profiling performance and memory usage. My current hypothesis is that the problem is the screenshot code and the CSS transitions that leaflet generates when zooming. In the mean time, the only thing I have to show for my foray away from writing tools for Mozilla is … yet another tool.

Oh well, it could be worse. My fervent hope is that frof will be helpful for both Fennec developers and QA. Let me know if you wind up using it!

[ For more information on the Eideticker software I'm referring to, see this entry ]

Just wanted to give some updates on a few new Eideticker features which have landed in the past week.

Profiling support

While Eideticker is a great tool for observing the external behaviour of the mobile browser, it hasn’t been able to tell us much about what’s going on inside. If something’s slow, why is it slow? If it’s slower than it was the day before, what’s the cause? If it’s faster? What explains the deviations in test results from one run to the other?

Thanks to Benoit Girard (+ a little bit of integration work from me), we can now start providing answers to these questions. Eideticker now has a mode that allows us to capture a sampling profile of the application while the video capture is ongoing. From the dashboard, you can now get access said profile, just by clicking on a link.

Note that the profile is not yet synchronized precisely to the videocapture (the profile works over the entire run of the browser), but Benoit is busily making that happen. That should hopefully land soon, in the mean time we still have some pretty useful data.

To say I’m excited about this would be an understatement. I think it has the potential to open up a whole new world of understanding of why our mobile (and desktop, someday) browser performs the way it does.

Startup / pageload testing

Eideticker has had support for measuring startup and page load time for a few months now, but I hadn’t yet hooked it up to the dashboard. As of today, it now is. There’s a bunch of different angles that are interesting to measure here (new vs. old profiles, whether the browser has been launched since boot, launching web applications, loading about:home or loading a web page, …) which I’ll get to in due course. For now, we at least have a baseline of how long it takes to see the Firefox homescreen on a Galaxy Nexus:

Of course, this is hooked up to the profiling support previously mentioned. Here’s an example profile.

[ For more information on the Eideticker software I'm referring to, see this entry ]

More on this to come, but just a quick note that the client-side URL schema for the Eideticker dashboard has been changed, as we now gather benchmarks for more than one device (Samsung Galaxy Nexus benchmarks FTW). To get the new and improved dashboard, please just go to the root: