I took some detailed notes to spread the word about how this work is enabling a great deal of important Q3 work like the Release Promotion project. Basically, the bridge allows us to separate out work that Buildbot currently runs in a somewhat monolithic way into TaskGraphs and Tasks that can be scheduled separately and independently. This decoupling is a powerful enabler for future work.

Of course, you might argue that we could perform this decoupling in Buildbot.

However, moving to TaskCluster means adopting a modern, distributed queue-based approach to managing incoming jobs. We will be freed of the performance tradeoffs and careful attention required when using relational databases for queue management (Buildbot uses MySQL for it’s queues, TaskCluster uses RabbitMQ). We also will be moving “decision tasks” in-tree, meaning that they will be closer to developer environments and likely easier to manage keeping developer and build system environments in sync.

Here are my notes:

Why have the bridge?

Allows a graceful transition

We’re in an annoying state where we can’t have dependencies between buildbot builds and taskcluster tasks. For example: we can’t move firefox linux builds into taskcluster without moving everything downstream of those also into taskcluster

It’s not practical and sometimes just not possible to move everything at the same time. This let’s us reimplement buildbot schedulers as task graphs. Buildbot builds are tasks on the task graphs enabling us to change each task to be implemented by a Docker worker, a generic worker or anything we want or need at that point.

One of the driving forces is the build promotion project – the funsize and anti-virus scanning and binary moving – this is going to be implemented in taskcluster tasks but the rest will be in Buildbot. We need to be able to bounce between the two.

What is the Buildbot Bridge (BBB)

BBB acts as a TC worker and provisioner and delegates all those things to BuildBot. As far as TC is concerned, BBB is doing all this work, not Buildbot itself. TC knows nothing about Buildbot.

Buildbot Listener

Claims a Task when builds start. Attaches BuildBot Properties to Tasks as artifacts. Has a buildslave name, information/metadata. It resolves those Tasks.

Buildbot and TC don’t have a 1:1 mapping of BB statuses and TC resolution. Also needs to coordinate with Treeherder color. A short discussion happened about implementing these colors in an artifact rather than inferring them from return codes or statuses inherent to BB or TC.

Reflector

Runs on a timer – every 60 seconds

Reclaims tasks: need to do this every 30-60 minutes

Cancels Tasks when a BuildRequest is cancelled on the BB side (have to troll through BB DB to detect this state if it is cancelled on the buildbot side)

Scenarios

A successful build!

Task is created. Task in TC is pending, nothnig in BB. TCListener picks up the event and creates a BuildRequest (pending).

TC deadline, what is it? Queue: a task past a deadline is marked as timeout/deadline exceeded

On TH, if someone requests a rebuild twice what happens? * There is no retry/rerun, we duplicate the subgraph — where ever we retrigger, you get everything below it. You’d end up with duplicates Retries and rebuilds are separate. Rebuilds are triggered by humans, retries are internal to BB. TC doesn’t have a concept of retries.

How do we avoid duplicate reporting? TC will be considered source of truth in the future. Unsure about interim. Maybe TH can ignore duplicates since the builder names will be the same.

The key tutorial information centered on how to set up jobs, test/run them locally and selecting appropriate worker types for jobs.

This past quarter Morgan has been working on Linux Docker images and TaskCluster workers for Firefox builds. Using that work as an example, Morgan showed how to set up new jobs with Docker images. She also touched on a couple issues that remain, like sharing sensitive or encrypted information on publicly available infrastructure.

A couple really nice things:

You can run the whole configuration locally by copy and pasting a shell script that’s output by the TaskCluster tools

There are a number of predefined workers you can use, so that you’re not creating everything from scratch

Dustin gave an overview of task graphs using a specific example. Looking through the docs, I think the best source of documentation other than this video is probably the API documentation. The docs could use a little more narrative for context, as Dustin’s short talk about it demonstrated.

Mozilla’s build and test infrastructure has relied on Buildbot as the backbone of our systems for many years. Asking around, I heard that we started using Buildbot around 2008. The time has come for a change!

The goal of this work is to shut down Buildbot and identify a timeline. Our first goal post is to eliminate the Buildbot Scheduler by moving build production entirely into TaskCluster, and scheduling tests in TaskCluster.

Today, most FirefoxOS builds and tests are in Taskcluster. Nearly everything else for Firefox is driven by Buildbot.

OS X cross-compliation – Ted Mielczarek will be building on the work of Michael Shal, Chris Cooper and others. This will enable us to build Firefox in Linux containers. It won’t let us get away from using Mac Minis for tests, but it will free up hardware that previously was dedicated to builds.

Buildbot Bridge – Ben Hearsum is working on this. It will enable us to schedule a job in TaskCluster, but run the job in Buildbot infrastructure. This is helpful for allowing us to decommission the scheduler portion of Buildbot and helps everyone work on migrating task scheduling and configuration while we work on cross-platform worker support in parallel.

Porting of Linux unit tests – Andrew Halberstadt will be focusing on this in the very near future.

We have quite a few things to figure out in the Windows and Mac OS X realm where we’re interacting with hardware, and some work is left to be done to support Windows in AWS. We’re planning to get more clarity on the work that needs to be done there next week.

The bugs identified seem tantalizingly close to describing most of the issues that remain in porting our builds. The plan is to have a timeline documented for builds to be fully migrated over by Whistler! We are also working on migrating tests, but for now believe the Buildbot Bridge will help us get tests out of the Buildbot scheduler, even if we continue to need Buildbot masters for a while. An interesting idea about using runner to manage hardware instead of the masters was raised during the meeting that we’ll be exploring further.

If you’re interested in learning more about TaskCluster and how to use it, Chris Cooper is running a training on Monday June 1 at 1:30pm PT.

One of the mysterious and amazing parts of Mozilla’s Release Engineering infrastructure is the Try server, or just “Try”. This is how Firefox and FirefoxOS developers can push changes to Mozilla’s large build and test system, made up of about 3000 servers at any time. There are a couple amazing things about this — one is that anyone can request access to push to try, not just core developers or Mozilla employees. It is a publicly-available system and tool. Second is that a remarkable amount of control is offered over which builds to produce and which tests are run. Each Try run could consume 300+ hours of machine time if every posible build and test option is selected.

This blog post is a brain dump from a few days of noodling and a quick review of of the pushlog for Try, which shows exactly the options developers are choosing for their Try runs.

To use Try, you need to include a string of configuration that looks something like this in your topmost hg commit:

You can include a Try configuration string in an empty commit, or as the last part of an existing commit message. What most developers tell me they do is have a an empty commit with the Try string in it, and they remove the extra commit before merging a patch. From all the feedback I’ve read and heard, I think that’s probably what we should document on the wiki page for Try, and maybe have a secondary page with “variants” for those that want to use more advanced tooling. KISS rule seems to apply here.

If you’re a regular user of Try, you might have heard of the high scores tracker. What you might not know is that there is a JSON file behind this page and it contains quite a bit of history that’s used to generate that page. You can find it if you just replace ‘.html’ with ‘.json’.

Something about the 8-bit ambiance of this page that made me think of “texts from last night”. But in reality, Try is most busy during typical Pacific Time working hours.

The high scores page also made me curious about the actual Try strings that people were using. I pulled them all out and had a look at what config options were most common.

Of the 1262 pushes documented today in that file:

760 used ‘-b do’ meaning both Debug and Opt builds are made. I wonder whether this should just be the default, or we should have some clear recomendations about what developers should do here.

366 used ‘-p all’ meaning build on 28 platforms, and produce 28 binaries. Some people might intend this, but I wonder if some other default might be more helpful.

456 used ‘-u all’ meaning that all the unit tests were run.

1024 used ‘-t none’ reflecting the waning use of Talos tests.

I’m still thinking about how to use this information. I have a few ideas:

Change the defaults for a minimal try run

Make some commonly-used aliases for things like “build on B2G platforms” or “tests that catch a lot of problems”

Create a dashboard that shows TopN try syntax strings

update the parser to include more of the options documented on various wiki pages as environment variables

If you’re a regular user of Try, what would you like to see changed? Ping me in #releng, email or tweet your thoughts!

And some background on me: I’ve been working with the Release Engineering team since April 1, 2015, and most of that time so far was spent on #buildduty, a topic I’m planning to write a few blog posts about. I’m also having a look at planning for the Task Cluster migration (away from BuildBot), monitoring and developer environments for releng tooling. I’m also working on a zine to share at Whistler of what is going on when you push to Try. Finally, I stood up a bot for reporting alerts through AWS SNS and to be able to file #buildduty related bugzilla bugs.

My goal right now is to find ways of communicating about and sharing the important work that Release Engineering is doing. Some of that is creating tracker bugs and having meetings to spread knowlege. Some of that is documenting our infrastructure and drawing pictures of how our systems interact. A lot of it is listening and learning from the engineers who build and ship Firefox to the world.

One of the things that I like to do is create architecture diagrams of complicated systems.

We had Release Engineering and Release Operations in the Portland Mozilla office this week, providing a perfect opportunity to pick everyone’s brains about what the current state of our release infrastructure is like.

Behold:

And here’s a version that includes some “tree closure reasons” in magenta:

A tree closure is defined as an hg hook that prevents people from committing to a tree (like mozilla-central). It looks up status at treestatus.mozilla.org to figure out whether or not the tree is closed, and this value is updated manually by “sheriffs” who track tree status.

And the an initial key to the tree closure reasons (the numbers on the magenta blobs), is documented on the Mozilla wiki.

The goal of this document was to take brain dump information from everyone in the meeting, and create a relationship diagram of all the systems that everyone here supports. As you can see, it is pretty complex.

What I took away from creating this was:

The cognitive load is very high for trying to diagnose the root cause for several kinds of tree closures.

People loved being able to look at how each systems related to the others.

No single person really had a model in their head of how everything represented in this diagram was related.

There’s a lot more work to do to link in documentation and create some related diagrams, which I’ll tackle next week. The kinds of questions I’d like to try to answer based on the information that I’ve gathered include:

How does my patch get a build created for it?

What single points of failure can we mitigate?

What kinds of resilience do we need for our typical transient failures?

I really enjoyed identifying sources of tree closure and the kinds of failures that cause it. These are the kinds of problems I love working on solving — complicated, often unpredictable and largely driven by the normal work that people need to do to get their jobs done. There’s rarely a simple solution to things like experimental patches taking down large portions of a build infrastructure, and how we solve, or at least mitigate, these problems is fascinating.