Today is my first day back at Mozilla, after my internship this summer!! This time I’ll be working part-time as a contractor for the same team (Release Engineering), 10h/week for now, and hopefully lots more next semester when I won’t have any courses. Yay!! :)

I’ll be working remotely from Bucharest, Romania (GMT+2 or PST+10 ;)).

In my previous post End to End Times Report I started talking about E2E times, by defining what they are and then looking at some monthly E2E times averages of the past 3 months for mozilla-central and try.

I also kept mentioning that the normal E2E time for mozilla-central is a little under 4 hours, but varies greatly upwards with the system load. Now, how much exactly do the E2E times vary from the normal times, and how?

In order to have a better grasp on what the distribution of E2E times might be, I plotted the histogram of all E2E times for mozilla-central registered in October (more precisely October 1-20, 2010). And here’s how it looks after removing the outliers:

The histogram above represents the distribution of the E2E times among bins of 15 minutes.
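For anyone who wants to reproduce this kind of binning, here is a minimal sketch of how the 15-minute bins could be computed. The function name and the sample values are made up for illustration, and the 10h cutoff mirrors the outlier threshold used for the plot; it is not the code behind the actual report.

```python
from collections import Counter

BIN_MINUTES = 15        # bin width used by the histogram
OUTLIER_MINUTES = 600   # values above 10h are treated as outliers

def bin_e2e_times(durations_minutes, bin_minutes=BIN_MINUTES,
                  max_minutes=OUTLIER_MINUTES):
    """Count E2E times per bin, keyed by the bin's start (in minutes)."""
    counts = Counter()
    for duration in durations_minutes:
        if duration > max_minutes:
            continue  # skip outliers, as in the first plot
        bin_start = int(duration // bin_minutes) * bin_minutes
        counts[bin_start] += 1
    return dict(sorted(counts.items()))

# Hypothetical E2E times in minutes (not real report data).
sample = [170, 226, 230, 239, 250, 320, 900]
histogram = bin_e2e_times(sample)
for bin_start, count in histogram.items():
    print(f"{bin_start // 60}h {bin_start % 60:02d}m: {count}")
```

Keying each bin by its start minute keeps the result easy to sort and to label on the x axis.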

As it turns out the histogram looks pretty nice. Most values (66.38%) are located in the 3h – 4h 25m normal time interval, with a high peak in the 3h 45m – 4h time subinterval.

However, there is a long tail of values between 5h and 10h. Even though the number of values in each bin is small, summed together they represent around 15% of the Build Runs.

The values smaller than 3h (10.92%) are build failures and exceptions. The very large outliers (>10h) were excluded from the histogram. They represent 7.18% of all Build Runs, with 4.02% between 10-25h and 3.16% between 25-255h (see plot below with outliers included).

| Time Interval | Percentage (%) | Comments |
|---|---|---|
| 0 – 3h | 10.92 % | Failures |
| 3h – 4h 25m | 66.38 % | Normal times |
| 4h 25m – 10h | 15.52 % | Long tail of large values |
| >10h | 7.18 % | Outliers (10h-25h: 4.02%; >25h: 3.16%) |

| Property | Value |
|---|---|
| Branch | mozilla-central |
| Timeframe | ~October 1-20, 2010 |
| No. values | 348 |
| Max value | 255h 51m |
| Mean value | 7h 12m |
| Median value | 3h 42m |

Here’s the histogram re-plotted, but this time with all the outliers included:

The End to End Time measures how long it takes for a Build Run to complete. That is, the time difference between the timestamp of the change that triggered this Build Run and the timestamp of when the last of the generated Build Requests ends (in other words, when all builds and tests are completed). (see Build Run Life Cycle diagram below, also published in Mozilla’s Build System blog post)

The normal End to End Time for mozilla-central is a little under 4 hours, but greatly varies upwards with the system load.

Report Contents

Summary

As you can see from the snapshot above (snapshot of the End to End Times Report for try branch as seen on October 22, 2010, a little after 12:00 PM), the report starts with some general information, like the branch selected, number of Build Runs found in the specified timeframe (given by startime and endtime URL GET parameters) and another very important value, the Average build run duration (also called the Average End to End Time).

Build Runs Info

1. Push’s Timestamp

Initially, the table is sorted by the ‘Least When Timestamp‘ column, which is actually the push’s change timestamp. This means that at the top should be listed the most recent pushes to the repo (colored gray if still running/pending). Note: the table is sortable by all other columns too.

2. Result: success vs. warnings vs. failure

The rows have different colors depending on the Build Run’s result (‘Results‘ column): green for success, orange for warnings, red for exception and failure and gray for no result (“-“) (if all Build Requests are currently running or pending).

3. Complete? Still Running?

The ‘Complete‘ column tells whether all Build Requests are completed or not (values: yes/no).

4. End to End Time (Duration)

A very important column is ‘Duration‘, also known as the End to End Time. The duration is computed as follows:

Duration := Greatest Finish Time – Least When Timestamp

In other words, it is how long it took for all Build Requests in this Build Run to complete (or up until now, if not complete). The ‘Least When Timestamp‘ is the earliest timestamp of the Build Requests’ start times and ‘Greatest Finish Time‘ is the latest timestamp of the Build Requests’ finish times.
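To make the formula concrete, here is a small sketch of the computation. The dictionary keys `when_timestamp` and `finish_time` are names I picked for the example, not the report’s actual field names, and a finish time of `None` stands for a Build Request that is still running or pending.

```python
from datetime import datetime, timedelta

def build_run_duration(build_requests, now=None):
    """Duration := Greatest Finish Time - Least When Timestamp."""
    least_when = min(r["when_timestamp"] for r in build_requests)
    finish_times = [r["finish_time"] for r in build_requests]
    if any(t is None for t in finish_times):
        # Build Run not complete yet: measure up until now.
        greatest_finish = now if now is not None else datetime.now()
    else:
        greatest_finish = max(finish_times)
    return greatest_finish - least_when

# Hypothetical Build Run with two completed Build Requests.
push = datetime(2010, 10, 22, 8, 0)
requests = [
    {"when_timestamp": push, "finish_time": push + timedelta(hours=1)},
    {"when_timestamp": push, "finish_time": push + timedelta(hours=3, minutes=45)},
]
print(build_run_duration(requests))  # 3:45:00
```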

5. Build Request Numbers Broken Down by Status And Job Type

The number of Build Requests within a Build Run (differs per branch, for example in mozilla-central there should be 168 if everything was successful) is broken down once by status: Complete, Running, Pending, Cancelled, Interrupted and Misc, and again by job type: Builds, Unittests and Talos.
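As a rough illustration of the two breakdowns, the counting could look like the sketch below. The `status` and `job_type` keys and the sample rows are assumptions made for the example.

```python
from collections import Counter

def break_down(build_requests):
    """Count a Build Run's Build Requests once by status, once by job type."""
    by_status = Counter(r["status"] for r in build_requests)
    by_job_type = Counter(r["job_type"] for r in build_requests)
    return by_status, by_job_type

# Hypothetical Build Run with four Build Requests.
sample_run = [
    {"status": "Complete", "job_type": "Builds"},
    {"status": "Complete", "job_type": "Unittests"},
    {"status": "Running", "job_type": "Unittests"},
    {"status": "Pending", "job_type": "Talos"},
]
by_status, by_job_type = break_down(sample_run)
print(dict(by_status))
print(dict(by_job_type))
```

Both counters always sum to the same total, which is the number of Build Requests in the Build Run.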

6. Rebuilds And Forcebuilds

There are also counts on how many rebuilds and forcebuilds were done.

7. Further Information, Link to Build Run Report Page

To see more about the different parameters, check out the Build Run Report. The revision links in the ‘Revision‘ column point to such reports, where you can see the exact status of individual Build Requests.

The End to End Times Report contains all the Build Runs displayed by Tinderboxpushlog, but with accurate data (which does not lie! :) ). However the report was not intended as a real time monitoring tool, but rather as an analysis tool which provides a peek into how well the Build System is performing. Not so far anyways…

Average End to End Times (E2E)

Here are some E2E Averages computed per month, though E2E times tend to vary greatly from week to week or even from one Build Run to another.

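| Month | Branch | Mean | Median |
|---|---|---|---|
| Aug | m-c | 9h 22m | 4h 29m |
| Aug | try | 10h 25m | 7h 8m |
| Sep | m-c | 6h 12m | 4h 8m |
| Sep | try | 7h 6m | 4h 59m |
| Oct | m-c | 6h 41m | 3h 43m |
| Oct | try | 4h 20m | 3h 55m |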

The average is currently computed only as a simple arithmetic mean, which due to large outlier values might not be the best measurement. The median values were added to the table presented above as a comparison only, and aren’t currently calculated by the report.
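To see why a few large outliers pull the mean away from the typical value while the median barely moves, here is a tiny sketch using Python’s `statistics` module. The durations are invented for the example and are not taken from the report.

```python
from statistics import mean, median

# Hypothetical E2E times in minutes: four normal runs and one outlier.
e2e_minutes = [220, 225, 230, 240, 1500]

mean_minutes = mean(e2e_minutes)
median_minutes = median(e2e_minutes)

print(f"mean:   {mean_minutes // 60}h {mean_minutes % 60}m")
print(f"median: {median_minutes // 60}h {median_minutes % 60}m")
```

With these numbers the single 25h run more than doubles the mean, while the median stays with the normal runs.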

As you can see from the chart, the E2E times have decreased over the past 3 months for mozilla-central and try. For try the improvement is even more visible, mostly thanks to the new Try Chooser.

Problem / E2E Report Incomplete

There is one problem that prevents the E2E Times Report from being complete, and that is the nightly builds. The Build Requests generated for the nightly builds have no revision number attached, which means there is currently no exact way of regrouping the individual Build Requests back to the Build Run. To make things more complicated, the nightly’s tests do get revision numbers, that is the revision number of the most recent commit, thus making the nightly’s tests mix up with the previous Build Run’s Build Requests! (contaminating the E2E time of that Build Run too)

To solve this issue, the following bug has been filed in Bugzilla: Bug 594496 – Generate unique id for a push in schedulerdb/statusdb.

Fun Outliers

By sorting the table by the ‘Duration’ column you can run into many surprising findings, like:

One push to hg.mozilla.org triggers off the Build System to generate a certain number of Build Requests (depending on the branch). All these build requests make up a Build Run. In a previous post I have covered in more detail its flow through the Build System and the Build Run Life Cycle.

2. Details: Individual Build Requests

Next, the report presents information on the individual Build Requests making up the Build Run. If you are interested in how the Build Requests are fetched from the database and what the individual fields describing a Build Request mean, you might want to also read Build Requests Query.

The table displays a lot of information, and many of the parameters are internal and relevant only to how the Build System works.

The Average Time per Builder Report measures the average run time of each builder (e.g. ‘Linux mozilla-central build’, ‘Rev3 Fedora 12 mozilla-central opt test crashtest’) within a branch, computed over a timeframe. It also calculates the percentage of time spent by the system running jobs for each builder and the percentage of successful vs. warnings vs. failed jobs. In addition, all information mentioned above is aggregated and filterable by platform (fedora vs. fedora64 vs. leopard vs. linux…), build type (debug vs. opt) and job type (build vs. talos vs. unittest vs. repack).

First & last builders sorted by avg. run time (mozilla-central, Oct 1-20)

Features

1. Average Run Time

First and foremost, the report measures the average run time for each builder (detail_level=builder). This way you can see how long individual builds, unittests and talos take on average and compare them.

By setting different filters, it is possible to compare only the builders of a platform, build type or job type of interest. Just to take a couple of examples, it’s very easy to see:

The averages are simple arithmetic means so far, calculated over the number of Build Requests found for each builder within the specified timeframe. The number of Build Requests is displayed in the ‘No. breqs’ column and is different for each builder.

As a future improvement, the median could be used instead of the simple mean, or the outliers could be removed when computing the mean.
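Here is a rough sketch of what the per-builder averages look like, with the median computed alongside the mean for comparison. The input format (pairs of builder name and run time in seconds) and the sample values are assumptions for the example, not the report’s real data model.

```python
from collections import defaultdict
from statistics import mean, median

def per_builder_stats(build_requests):
    """Group run times by builder and compute count, mean and median."""
    run_times = defaultdict(list)
    for builder, run_seconds in build_requests:
        run_times[builder].append(run_seconds)
    return {
        builder: {
            "no_breqs": len(values),
            "mean": mean(values),
            "median": median(values),
        }
        for builder, values in run_times.items()
    }

# Hypothetical run times in seconds.
sample_breqs = [
    ("Linux mozilla-central build", 1800),
    ("Linux mozilla-central build", 2100),
    ("Linux mozilla-central build", 8100),  # one slow outlier
    ("Rev3 Fedora 12 mozilla-central opt test crashtest", 600),
]
stats = per_builder_stats(sample_breqs)
print(stats["Linux mozilla-central build"])
```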

2. Percentage of System Run Time

In addition to the average run times, the report measures the percentage of time spent by the system doing jobs of a certain type (‘PTG Run Time %’ column). This number is computed by summing the run times of all Build Requests of a certain builder (job type, build type or platform, depending on the chosen detail level) and dividing by the sum of the run times of all Build Requests displayed, after all filters have been applied (platforms, build types or job types).
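The calculation described above boils down to a few lines. This is only a sketch: the function name and the builder names are placeholders, and the input is assumed to be the already summed run time per group.

```python
def run_time_percentages(run_seconds_by_group):
    """Share of the total run time spent on each group, in percent."""
    total = sum(run_seconds_by_group.values())
    if total == 0:
        return {group: 0.0 for group in run_seconds_by_group}
    return {
        group: 100.0 * seconds / total
        for group, seconds in run_seconds_by_group.items()
    }

# Hypothetical summed run times (seconds) after filtering.
summed = {"builder A": 300, "builder B": 100}
percentages = run_time_percentages(summed)
print(percentages)
```

Because the numerator is a sum over Build Requests, a builder’s share grows with both how many requests it has and how long each one runs.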

Example – How much time is spent per each Linux builder?

Filters: platform=linux; build_type=opt,debug; job_type=build

Detail level: builder

As you can see from the table above, when looking only at Linux build builders, the system spends 34.78% of its run time doing ‘Android R7 mozilla-central build’ builds, based on 345 Build Requests having an average of 33m 21s. The percentage goes up both with the number of Build Requests and average run time.

The example looks at jobs registered between October 1-20, 2010 on mozilla-central. The same example can be accessed on the demo page at Linux Builders Demo.

3. Aggregation

It is possible to aggregate the results for the builders on upper levels, by setting the detail_level to job_type, build_type, platform or branch.

To make things clearer, let’s take an example: How much time is spent per Snowleopard optimized job type?

4. Filters

There are 3 types of filters that can be set: platforms, build types and job types. All of them have been used in one or more of the previous examples. For instance, in the Snowleopard optimized job types example (see 3. Aggregation), the filters are set as follows: platform=snowleopard; build_type=opt; job_type=build,repack,talos,unittest.
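Filters and aggregation fit together nicely, so here is one sketch covering both: keep only the Build Requests that match every filter, then sum the run times per value of the chosen detail level. The `platform`, `build_type` and `job_type` names follow the parameters mentioned above; the `run_seconds` key, the function name and the sample rows are assumptions.

```python
from collections import defaultdict

def aggregate_run_time(build_requests, detail_level, **filters):
    """Sum run times per detail level over requests matching all filters."""
    totals = defaultdict(int)
    for req in build_requests:
        if all(req[field] in allowed for field, allowed in filters.items()):
            totals[req[detail_level]] += req["run_seconds"]
    return dict(totals)

# Hypothetical Build Requests.
sample_requests = [
    {"platform": "snowleopard", "build_type": "opt", "job_type": "talos", "run_seconds": 600},
    {"platform": "snowleopard", "build_type": "opt", "job_type": "unittest", "run_seconds": 900},
    {"platform": "snowleopard", "build_type": "debug", "job_type": "unittest", "run_seconds": 1200},
    {"platform": "linux", "build_type": "opt", "job_type": "build", "run_seconds": 2000},
]
per_job_type = aggregate_run_time(
    sample_requests,
    "job_type",
    platform={"snowleopard"},
    build_type={"opt"},
    job_type={"build", "repack", "talos", "unittest"},
)
print(per_job_type)
```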

5. Percentage of Success vs. Warnings vs. Failure

Another interesting piece of information presented by the report is the percentage of success vs. warnings vs. failure of registered build requests. By sorting the results by these values, you can easily see which tests fail the most, always fail, or always pass.
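A minimal sketch of how such percentages could be derived per builder is shown below. The result labels come from the text, while the input format and the builder names are hypothetical.

```python
from collections import Counter, defaultdict

RESULTS = ("success", "warnings", "failure")

def result_percentages(build_requests):
    """Percentage of success / warnings / failure per builder."""
    counts = defaultdict(Counter)
    for builder, result in build_requests:
        counts[builder][result] += 1
    percentages = {}
    for builder, counter in counts.items():
        total = sum(counter.values())
        percentages[builder] = {
            result: 100.0 * counter[result] / total for result in RESULTS
        }
    return percentages

# Hypothetical results for two builders.
sample_results = [
    ("test A", "success"), ("test A", "success"),
    ("test A", "warnings"), ("test A", "failure"),
    ("test B", "failure"),
]
pct = result_percentages(sample_results)
# Sort builders by failure percentage, worst first.
worst_first = sorted(pct, key=lambda b: pct[b]["failure"], reverse=True)
print(worst_first)
```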

Mozilla’s Build System is a very cool distributed system run by Buildbot. The system automatically rebuilds and tests the tree every time something has changed.

The Build Infrastructure currently has around 1,000 machines grouped into 3 pools, each made up of several Build Masters and many Slaves:

Build Pool (handles builds triggered by all changes, except those going to Try):

* 4 Build Masters
* ~300 Slaves

Try Build Pool (handles Try builds):

* 1 Build Master
* ~200 Slaves

Test Pool (handles all tests, including Try):

* 7 Test Masters
* ~400 Slaves

How it works

The hg poller looks for new changes in the hg.mozilla.org repository every few minutes. The changes are picked up by the Build Scheduler Master, which creates Build Requests, one for each of the supported platforms. The Build Requests go into the Scheduler Database as pending. The Build Masters look for pending Build Requests and take them on only if there are free Slaves to assign them to.

As the builds complete, the Build Master updates their statuses in the Scheduler Database. Also, the Test Scheduler Master creates Test Build Requests for the corresponding tests.

Next, the Test Masters pick up the Test Build Requests and assign them to free Slaves. When the tests are complete, the Test Master updates their statuses in the Scheduler Database.

Each Build Master and Test Master controls its own set of Slaves.
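The key rule in this flow is that a master takes on a pending Build Request only when it has a free Slave to give it to. Here is a toy sketch of that rule; it is not Buildbot code, and the function and the names are invented for illustration.

```python
from collections import deque

def assign_pending(pending_requests, free_slaves):
    """Pair pending requests with free slaves, oldest request first.

    Returns the assignments made and the requests left pending.
    """
    pending = deque(pending_requests)
    free = deque(free_slaves)
    assignments = {}
    while pending and free:
        assignments[pending.popleft()] = free.popleft()
    return assignments, list(pending)

# Three pending requests but only two free slaves.
assigned, still_pending = assign_pending(
    ["build-1", "build-2", "build-3"], ["slave-a", "slave-b"]
)
print(assigned)
print(still_pending)
```

Whatever is left in the pending list keeps waiting in the queue, which is exactly where the wait times grow when the system is under load.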

Build Run Life Cycle

One push to mozilla-central, if successful, generates a total of 168 Build Requests (as of October 2010, but subject to change in the future), of which 10 are builds (one for each of the 10 supported platforms), 108 unittests and 50 talos tests. All these build requests make up a Build Run.

Each of the 10 platform builds comes with its own set of test requests. The tests are created only when the corresponding build completes, and only if it is successful. This means that if there are failed builds, some of the tests won’t be created, and the Build Run will have fewer than 168 Build Requests.
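A quick sketch of how the size of a Build Run shrinks when builds fail follows. The per-platform test counts below are invented for the example (the real split of the 158 tests across the 10 platforms is not given here), so treat the numbers as placeholders.

```python
def build_run_size(tests_per_platform, failed_platforms=()):
    """Number of Build Requests in a Build Run.

    Every platform gets one build request; its tests are only
    created if that build succeeds.
    """
    total = 0
    for platform, test_count in tests_per_platform.items():
        total += 1  # the build itself
        if platform not in failed_platforms:
            total += test_count
    return total

# Hypothetical per-platform test counts (not the real mozilla-central split).
tests_per_platform = {"linux": 16, "win32": 12, "macosx": 14}

print(build_run_size(tests_per_platform))
print(build_run_size(tests_per_platform, failed_platforms={"win32"}))
```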

Two very important measures in a Build Run’s life cycle are the Wait Time and End to End Time.

The Wait Time measures how long Build Requests wait in the queue before starting. More specifically, it measures the time difference between the timestamp of the change that generated that Build Request and the timestamp of when that Build Request is assigned to a free slave. (see Build Run Life Cycle diagram above)

The End to End Time measures how long it takes for a Build Run to complete. That is, the time difference between the timestamp of the change that triggered this Build Run and the timestamp of when the last of the generated Build Requests ends (in other words, when all builds and tests are completed). (see Build Run Life Cycle diagram above)
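Both measures are simple timestamp differences. Here is a small sketch of each, with function names and sample timestamps that are made up for illustration.

```python
from datetime import datetime, timedelta

def wait_time(change_time, assigned_time):
    """Time a Build Request waited between its change and slave assignment."""
    return assigned_time - change_time

def end_to_end_time(change_time, request_end_times):
    """Time from the change until the last Build Request ends."""
    return max(request_end_times) - change_time

# Hypothetical timestamps for one push.
change = datetime(2010, 10, 20, 9, 0)
assigned = change + timedelta(minutes=12)
end_times = [change + timedelta(hours=1), change + timedelta(hours=3, minutes=50)]

print(wait_time(change, assigned))         # 0:12:00
print(end_to_end_time(change, end_times))  # 3:50:00
```

Note that the Wait Time is per Build Request, while the End to End Time is a single value for the whole Build Run.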

The normal End to End Time for mozilla-central is a little under 4 hours, but greatly varies upwards with the system load.

The Great Wall of Mac minis

The builds are done on a mix of VMs, 1U servers, xserves and Mac minis, and all the testing is done on Mac minis.

The Great Wall of Mac minis is made up of a little over 400 of the Mac minis’ boxes, and is located by the Release Engineers’ desks in the Mountain View office. :D