Over the last year there has been a lot of research into reducing the noise in our talos performance numbers. For example, looking at tscroll, we have a fluctuation in the reported numbers of almost 400 (out of 14000). Jan Larres took a look at this problem in his master’s thesis and found a variety of factors that did and didn’t contribute to the noise. We actually have Bug 706912 filed to implement some of his suggestions on how to calculate the posted number. Last fall, Stephen Lewchuk looked at the raw data that was collected and found some inconsistencies in the way we were aggregating the data. In short, we have a lot of ground to cover if we want to reduce the noise in our numbers.

Over the last couple of months, we have been working on a project called Signal From Noise. This is an attempt to fix the way we collect some numbers and redo the way we aggregate numbers for reporting. We have done a lot of experimenting, with the primary focus on tp5. The way we run tp5 is to load each of the 100 pages once, then repeat the whole set until every page has been loaded 10 times. For each page, we drop the highest value and take the median of the remaining 9 numbers. This results in an array of 100 data points which get reported to the graph server. We take those 100 data points and average them to generate the single number for tp5. It is easy to imagine that the small samples and the median/average combination will produce a lot of noise.
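To make the aggregation above concrete, here is a minimal sketch in Python. It is not the actual talos code: the function names and the sample load times are made up for illustration, and only two pages are shown instead of 100.

```python
from statistics import mean, median

def page_score(samples):
    """Drop the single highest load time, then take the median of the rest."""
    trimmed = sorted(samples)[:-1]  # sorted ascending, so the last item is the highest
    return median(trimmed)

def tp5_score(samples_by_page):
    """Average the per-page medians into one reported number."""
    return mean(page_score(samples) for samples in samples_by_page.values())

# Hypothetical load times in ms, 10 loads per page (illustrative values only).
samples_by_page = {
    "page_a": [310, 205, 198, 202, 200, 199, 204, 201, 203, 197],
    "page_b": [520, 410, 405, 400, 415, 402, 408, 399, 411, 404],
}

per_page = {page: page_score(s) for page, s in samples_by_page.items()}
overall = tp5_score(samples_by_page)
```

With only 9 values left per page, a single odd load can shift the median, and the final average then hides which page moved.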

Going forward, we are looking to change from column major to row major and collect 30 samples instead of 10. This means we focus on one page and load it 30 times, then move to the next page and repeat until all 100 pages have been loaded. The downside is the runtime, as we move from an average of 17 minutes to an average of 39 minutes for the entire tp5 run. Collecting 30 samples will give us a much more meaningful number, but we also found that the first 5-10 iterations contain the most noise. So initially we are looking to throw away the first 10 numbers instead of throwing away the highest number as we originally did. Graphs of the raw numbers (not the aggregated number) highlight the difference.
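Separately from the graphs, the change in load order and the new filtering can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the `warmup` parameter, and the sample values are hypothetical and do not come from the talos harness.

```python
def column_major_order(pages, cycles):
    """Old order: load every page once, then repeat the whole set."""
    return [page for _ in range(cycles) for page in pages]

def row_major_order(pages, cycles):
    """New order: load one page `cycles` times, then move to the next page."""
    return [page for page in pages for _ in range(cycles)]

def stable_samples(samples, warmup=10):
    """Discard the first `warmup` loads of a page and keep the rest."""
    return samples[warmup:]

pages = ["page_a", "page_b"]
old_order = column_major_order(pages, 3)  # a, b, a, b, a, b
new_order = row_major_order(pages, 3)     # a, a, a, b, b, b

# Hypothetical 30 loads of one page: 10 noisy warm-up loads, then 20 steady ones.
samples = [300 - 10 * i for i in range(10)] + [200 + (i % 3) for i in range(20)]
kept = stable_samples(samples)
```

Here `kept` holds the last 20 loads, which is the set the new approach would summarize.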

This is only the first of many changes needed. After rolling this out, we need to evaluate the other test suites as well and ensure we are running enough cycles to get a valid sample size. We are also working on allowing the database to accept the raw values instead of the single median value per page. Likewise, we are looking to stop doing an average([median(page)]). All of this will allow us to find regressions per page more easily instead of having them washed out by the other numbers.
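A small worked example shows why a per-page regression gets washed out by average([median(page)]). The numbers are invented for illustration: 100 pages with a per-page median of 200 ms each, and one page that becomes 20% slower.

```python
from statistics import mean

def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100

# Hypothetical per-page medians in ms (illustrative values only).
baseline = [200.0] * 100
regressed = list(baseline)
regressed[0] = 240.0  # one page gets 20% slower

per_page_change = pct_change(baseline[0], regressed[0])
overall_change = pct_change(mean(baseline), mean(regressed))
```

In this made-up case the page itself moves by 20%, while the averaged number moves by only about 0.2%, which is well inside the kind of fluctuation described for the reported numbers.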

4 responses to “Reducing the Noise in Talos”

The downside of dropping the first data points is that the most common action a user takes, i.e. loading a page (or page element) for the first time, is thrown out of the results, and we optimize our performance for the completely synthetic case of reloading a page for the 10th to 30th time. Unfortunately we don’t have a good idea of how to measure page load in a way that is closer to what really affects users.

Awesome point of view, and a hard thing to solve. Our historical method of dropping the highest number would for the most part drop the first number. Our proposal is to drop more of the first numbers, so maybe that is something worth fleshing out before deploying it.

My thought here is that if we get all the raw values loaded into the database, we can report a stable number that shouldn’t move much at all by ignoring the first few iterations. We can also generate a second number which takes the first load into account if we want.

I guess the big difference is column order vs row order. That is something that once changed is hard to change back.

FWIW, the first time almost always also means building and updating the profile, not just loading a page after opening the browser. Some tests use partial or outdated profiles that must be updated on the first run, sometimes with costly operations. Optimizing that case would not be that useful, since it would measure the time talos takes to fix/create its own profile. At that point it would be better to introduce a Tprofile measure for that specific need.