Monthly Archives: November 2017

I often hear this question about our Talos results: why are they so noisy? In this context, noise means a large standard deviation (stddev) in the results we track. Here is an example:

With the large spread of values posted regularly for this series, it is hard to track improvements or regressions unless they are larger or very obvious.

Knowing the definition of noise, there are a few questions that we often need to answer:

Developers working on new tests: what is the level of noise, how can it be reduced, and what is acceptable?

Over time noise changes: this causes false alerts, often not related to code changes or easily discovered via infra changes.

New hardware we are considering: is this hardware going to post reliable data for us?

What I care about is the last point: we are replacing the hardware we run performance tests on, moving from 7-year-old machines to new machines! Typically when running tests on a new configuration, we want to make sure it produces results reliably. For our system, we look for all green:

This is really promising. If we could have all our tests this “green”, developers would be happy. The catch is that these are performance tests: are the results we collect and post to graphs useful? Another way to ask this is: are the results noisy?

This is hard to answer. First we have to know how noisy things are prior to the test. As mentioned 2 weeks ago, Talos collects 624 metrics that we track for every push. That would be a lot of graphing and calculating. One method is to push to try with a single build and collect many data points for each test. You can see that in the image showing the all green results.

One way to see the noise is to look at compare view. This is the view we use to compare one push to another when we have multiple data points. It typically highlights the changes that are easy to detect with our t-test for alert generation. If we look at the above referenced push and compare it to itself, we have:
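To make the t-test idea concrete, here is a minimal sketch of a two-sample t statistic for the data points of two pushes. The text only says "t-test"; the choice of Welch's variant, the function name `welch_t`, and the sample values are assumptions for illustration, not the actual alerting code.

```python
import math
from statistics import mean, variance

def welch_t(base, new):
    """Welch's t statistic for two samples with possibly unequal variance.

    Assumption: each argument is a list of replicate values for one
    test on one push. A value near 0 means no detectable difference.
    """
    n1, n2 = len(base), len(new)
    if n1 < 2 or n2 < 2:
        raise ValueError("each push needs at least two data points")
    diff = mean(new) - mean(base)
    # statistics.variance is the sample variance (divides by n - 1).
    std_err = math.sqrt(variance(base) / n1 + variance(new) / n2)
    if std_err == 0:
        # No spread at all: identical means give 0, otherwise infinite.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / std_err

# Hypothetical replicates; comparing a push to itself gives t = 0.
same_push = [101.0, 99.5, 100.2, 100.8]
print(welch_t(same_push, same_push))
```

The noisier the data points, the larger the standard error in the denominator, and the larger a real change has to be before it stands out.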

Here you can see that for a11y, linux64 has a stddev of ± 5.27. Some metrics are higher and others are lower. What if we add up all the stddev numbers that exist? What would we have? In fact, if we treat this as a sum of squares to calculate the variance, we can generate a single number, in this case 64.48! That is the noise for that specific run.
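A minimal sketch of that calculation, assuming the per-test stddev values are available as a plain list. The function name `noise_metric` and the sample values are hypothetical, and since the text does not say whether the reported number is the raw sum of squares or its square root, the sketch returns both.

```python
import math

def noise_metric(stddevs):
    """Combine per-test standard deviations into one noise figure.

    Squaring a stddev gives that test's variance. Variances of
    independent measurements add, so the sum is a combined variance
    and its square root is a combined stddev.
    """
    total_variance = sum(s * s for s in stddevs)
    return total_variance, math.sqrt(total_variance)

# Hypothetical stddev values for a handful of tests on one platform.
sample = [5.27, 1.2, 0.8, 3.0]
total_variance, combined_stddev = noise_metric(sample)
print(f"sum of squares: {total_variance:.2f}")
print(f"combined stddev: {combined_stddev:.2f}")
```

Squaring before summing means a few very noisy tests dominate the result, which is useful when the goal is to spot the tests that make a configuration unreliable.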

Now if we are bringing up a new hardware platform, we can collect a series of data points on the old hardware and repeat this on the new hardware. Then we can compare data between the two:

What is interesting here is that we can see side by side the differences in noise as well as the improvements and regressions. What about the variance? I wanted to track that and did, but realized I needed to track the variance by platform, as each platform could be different. In bug 1416347, I set out to add a Noise Metric to the compare view. This is on treeherder staging and will probably be in production next week. Here is what you will see:

Here we see that the old hardware has a noise of 30.83 and the new hardware a noise of 64.48. While there are a lot of small details to iron out as we work on getting new hardware for linux64, windows7, and windows10, we now have a simpler method for measuring the stability of our data.
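The per-platform tracking described above can be sketched as follows. This is an illustration under stated assumptions, not the compare view implementation: the row layout, the function name `noise_by_platform`, the use of a square root on the summed squares, and the sample rows are all hypothetical.

```python
import math
from collections import defaultdict

def noise_by_platform(rows):
    """Return one noise figure per platform.

    rows: iterable of (platform, test_name, stddev) tuples.
    Squared stddevs are summed per platform so that a noisy platform
    does not hide a quiet one behind a single global number.
    """
    sums = defaultdict(float)
    for platform, _test_name, stddev in rows:
        sums[platform] += stddev ** 2
    return {platform: math.sqrt(total) for platform, total in sums.items()}

# Hypothetical compare-view rows for two platforms.
rows = [
    ("linux64", "a11y", 5.27),
    ("linux64", "tp5o", 2.10),
    ("windows7", "a11y", 1.50),
    ("windows7", "tp5o", 0.90),
]
for platform, noise in sorted(noise_by_platform(rows).items()):
    print(f"{platform}: {noise:.2f}")
```

Running this once on data from the old hardware and once on data from the new hardware gives two numbers per platform that can be compared directly.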

Over the last 6 months there has been a deep focus on performance in order to release Firefox 57. Hundreds of developers sought out performance improvements and after thousands of small adjustments we see massive improvements.

Last week I introduced Ionut, who has come in as a Performance Sheriff. What do we do on a regular basis when it comes to monitoring performance? In the past I focused on Talos and how many bugs per release we found, fixed, and closed. While that is fun and interesting, we have expanded the scope of sheriffing.

We continue to refine benchmarks and tests on each of these frameworks to ensure we are running on relevant configurations, measuring the right things, and not duplicating data unnecessarily.

Looking at the list of frameworks, we collect 1127 unique data points and alert on them, filing bugs for anything sustained and valid. While the number of unique metrics can change, here is the current number of metrics we track:

| Framework | Total Metrics |
| --- | --- |
| Talos | 624 |
| Autophone | 19 |
| Build Metrics | 172 |
| AWSY | 83 |
| Platform Microbenchmarks | 229 |
| Total | 1127 |

While we generate these metrics for every commit (or every few commits for load reasons), what happens is we detect a regression and generate an alert. In fact we have a sizable number of alerts in the last 6 weeks:

| Framework | Total Alerts |
| --- | --- |
| Talos | 429 |
| Autophone | 77 |
| Build Metrics | 264 |
| AWSY | 85 |
| Platform Microbenchmarks | 227 |
| Total | 1082 |

Alerts are not really what we file bugs on. Instead we have an alert summary, which can (and typically does) contain a set of alerts. Here is the total number of alert summaries (i.e. what a sheriff will look at):

| Framework | Total Summaries |
| --- | --- |
| Talos | 172 |
| Autophone | 54 |
| Build Metrics | 79 |
| AWSY | 29 |
| Platform Microbenchmarks | 136 |
| Total | 470 |
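To give a feel for how alerts roll up into summaries, the following sketch divides the alert counts by the summary counts quoted in this post. The dictionary names and the helper `alerts_per_summary` are made up for illustration; only the numbers come from the figures above.

```python
# Counts for the last 6 weeks, as quoted in this post.
alerts = {
    "Talos": 429,
    "Autophone": 77,
    "Build Metrics": 264,
    "AWSY": 85,
    "Platform Microbenchmarks": 227,
}
summaries = {
    "Talos": 172,
    "Autophone": 54,
    "Build Metrics": 79,
    "AWSY": 29,
    "Platform Microbenchmarks": 136,
}

def alerts_per_summary(alerts, summaries):
    """Average number of alerts grouped into each alert summary."""
    return {name: alerts[name] / summaries[name] for name in alerts}

ratios = alerts_per_summary(alerts, summaries)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} alerts per summary")

overall = sum(alerts.values()) / sum(summaries.values())
print(f"Overall: {overall:.2f} alerts per summary")
```

A higher ratio means a single change tends to move many metrics at once in that framework, so one summary saves a sheriff from triaging each alert separately.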

These alert summaries are then mapped into bugs (or downstream alerts to where the alerts started). Here is a breakdown of the bugs we have:

| Framework | Total Bugs |
| --- | --- |
| Talos | 41 |
| Autophone | 3 |
| Build Metrics | 17 |
| AWSY | 6 |
| Platform Microbenchmarks | 6 |
| Total | 73 |

This indicates there are 73 bugs associated with Performance Summaries. What is deceptive here is that many of those bugs are ‘improvements’ and not ‘regressions’. As you may have figured out, we do associate improvements with bugs and try to comment in the bugs to let you know of the impact your code has on a [set of] metric[s]. Counting only the regression bugs gives:

| Framework | Total Bugs |
| --- | --- |
| Talos | 23 |
| Autophone | 3 |
| Build Metrics | 14 |
| AWSY | 4 |
| Platform Microbenchmarks | 3 |
| Total | 47 |

This is a much smaller number of bugs. There are a few quirks here:

Some regressions show up across multiple frameworks (this reduces the total to 43).

Some bugs that are ‘downstream’ are marked against the root cause instead of just being downstream. Often this happens when we are sheriffing bugs and a downstream alert shows up a couple of days later.

Note that Firefox 58 has 28 bugs associated with it, but we have 43 bugs from the above query. Some of those bugs from the above query are related to Firefox 57, and some are starred against a duplicate bug or a root cause bug instead of the regression bug.

I hope you find this data useful and informative towards understanding what goes on with all the performance data.