Planet Mozilla Automation

February 22, 2018

Lando is so close now that I can practically smell the tibanna.
Israel put together a quick demo of Phabricator/BMO/Lando/hg running on
his local system, which is only a few patches away from being a
deployed reality.

One caveat: this demo uses Phabricator’s web UI to post the patch. We
highly recommend using Arcanist, Phabricator’s command-line tool,
to submit patches instead, mainly because it preserves all the
relevant changeset metadata.

January 04, 2018

I would fire the "senior" engineers for being over levelled. Senior engineers jobs are to mentor and build up the engineers below them. If they find that a burden then they need to re-evaluate their seniority. https://t.co/f0G5TtumRE

The now-deleted quoted tweet went along the lines of "If you have 3 senior engineers earning $150k and a junior developer breaks the repository, is it worth the $60k to have a junior?". The original tweet, and the similar tweets that came after it, show that there is a disconnect between how some engineers, and even managers, believe a senior engineer should act and what it really takes to be a senior or higher engineer.

My belief, and luckily for me it's also the guide I am given by my employer, is that the seniority of an engineer has more to do with their interpersonal skills and less to do with their programming ability. So what do I mean by this? The answer is really simple.

A senior developer or senior engineer should be able to build up and mentor engineers who are "below" them. A senior engineer should be working to make the rest of their team senior engineers. This may mean that a senior engineer does less programming than the more junior members of a team. This is normally accounted for by engineering management and even by project management. The time when they are not programming is now filled with architectural discussions, code reviews, and general mentoring of newer members. These tasks might not be as tangible as producing code, but they are just as important.

Whether you are on a management track or an individual contributor track, the further up you go the more your role depends on doing less coding and more on making sure that you raise everyone up with you. This is just how it goes. After all, "a rising tide lifts all boats".

December 21, 2017

Over the years we have had great dreams of running our tests in many different ways. There was a dream of 'hyperchunking', where we would run everything in hundreds of chunks, finishing all the tests in just a couple of minutes. This idea is difficult to realize for many reasons, so we shifted to 'run-by-manifest'; while we sort of do this now for mochitest, we don't for web-platform-tests, reftest, or xpcshell. Both of these models require work on how we schedule and report data, which isn't too hard to solve but does require a lot of additional work and supporting two models in parallel for some time.

In recent times, there has been an ongoing conversation about ‘run-by-component’. Let me explain. We have all files in tree mapped to bugzilla components. In fact almost all manifests have a clean list of tests that map to the same component. Why not schedule, run, and report our tests on the same bugzilla component?

I got excited near the end of the Austin work week as I started working on this to see what would happen.

This is hand-crafted to show top-level products, and when we expand those products you can see all the components:

I just used the first 3 letters of each component until there was a conflict, then I hand edited exceptions.
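
As a rough illustration of that abbreviation step (not the actual patch- the component names and exception table here are invented), it boils down to something like:

    # Sketch of deriving short scheduling codes from Bugzilla component names.
    # The components and the hand-edited exception table are invented examples.
    EXCEPTIONS = {"Graphics: Layers": "lay"}

    def short_codes(components):
        codes = {}
        for name in sorted(components):
            code = name[:3].lower()
            if code in codes.values():   # conflict with an earlier code
                code = EXCEPTIONS[name]  # fall back to a hand-edited exception
            codes[name] = code
        return codes

    print(short_codes(["DOM", "Networking", "Graphics", "Graphics: Layers"]))
    # {'DOM': 'dom', 'Graphics': 'gra', 'Graphics: Layers': 'lay', 'Networking': 'net'}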

What is great here is we can easily schedule networking-only tests:

and what you would see is:

^ Keep in mind that in this example I am using the same push, just filtering- but I did test on a smaller scale for a bit with just Core-networking until I got it working.

What would we use this for:

collecting code coverage on components instead of random chunks which will give us the ability to recommend tests to run with more accuracy than we have now

developers can filter in treeherder on their specific components and see how green they are, etc.

easier backfilling of intermittents for sheriffs as tests are not moving around between chunks every time we add/remove a test

While I am excited about the reasons above, this is far from being production ready. There are a few things we would need to solve:

My current patch takes a list of manifests associated with bugzilla components and runs all manifests related to that component- we would need to sanitize all manifests so that each only contains tests related to one component, or solve this differently (a rough sketch of such a check follows this list)

My current patch iterates through all possible test types- this is grossly inefficient, but the best I could do with mozharness- I suspect with a bit of work I could have reftest/xpcshell working, and likewise web-platform-tests. Ideally we would run all tests from a source checkout and use |./mach test <component>| and it would find what needs to run

What do we do when we need to chunk certain components? Right now I hack on taskcluster to duplicate a 'component' test for each component in a .json file; we also cannot specify platform-specific features and we lose a lot of the functionality that we gain with taskcluster; I assume some simple thought and a feature or two would allow us to retain all the features of taskcluster with the simplicity of component-based scheduling

We would need a concrete method for defining the list of components (#2 solves this for the harnesses). Currently I add raw .json into the taskcluster decision task since it wouldn’t find the file I had checked into the tree when I pushed to try. In addition, finding the right code names and mappings would ideally be automatic, but might need to be a manual process.

When we run tests in parallel, they will have to be different 'platforms' such as linux64-qr, linux64-noe10s. This is much easier in the land of taskcluster, but a shift from how we currently do things.
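
Returning to the first point above, the kind of sanity check I have in mind could be sketched like this (the file-to-component mapping below is invented; the real data comes from the in-tree metadata):

    # Sketch: flag manifests whose tests span more than one Bugzilla component.
    # TEST_COMPONENTS is an invented stand-in for the real file-to-component map.
    TEST_COMPONENTS = {
        "dom/tests/mochitest/general/test_focus.html": "Core :: DOM",
        "dom/tests/mochitest/general/test_offsets.html": "Core :: Layout",
    }

    def mixed_manifests(manifests):
        """Return manifests whose tests map to more than one component."""
        mixed = {}
        for manifest, tests in manifests.items():
            components = {TEST_COMPONENTS[t] for t in tests}
            if len(components) > 1:
                mixed[manifest] = components
        return mixed

    manifests = {"dom/tests/mochitest/general/mochitest.ini": list(TEST_COMPONENTS)}
    print(mixed_manifests(manifests))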

This is something I wanted to bring visibility to- many see this as the next stage of how we test at Mozilla, and I am glad for tools like taskcluster, mozharness, and common mozbase libraries (especially manifestparser) which have made this a simple hack. There is still a lot to learn here; we see a lot of value in going this direction, but so far we have been looking for the value and not the dangers- what problems do you see with this approach?

December 18, 2017

The WebDriver working group
meeting at TPAC this year
marked the culmination of six years of hard work defining
WebDriver as a standard.
Up to this point, the work has largely been about
specifying the behaviour of existing implementations:
we have tightened up semantics where drivers have behaved differently,
corrected inconsistencies inherited from
the Selenium wire protocol,
and written a test suite.

We are quickly reaching a point where we can deliver
a consistent cross-browser mechanism
for instrumenting and automating web browsers.
Vendors are already reaping the benefits of this
by employing WebDriver in the testing of standards,
and web authors will soon see a range of new features added.

New windows

One long-awaited feature
is for WebDriver to be able to open new windows and tabs.
Today people use many different techniques,
such as injecting a window.open(…) script
to work around this deficiency.
Doing that is problematic because new windows
will be children of the current window,
potentially opening them up to leeching.
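
For reference, that workaround usually looks something like this with Selenium's Python bindings (a sketch; the URL is a placeholder, and picking the last window handle is a common but not guaranteed way to find the new window):

    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get("https://example.org/")

    # The workaround described above: inject a script that opens a new window,
    # which ends up being a child of the current browsing context.
    driver.execute_script("window.open()")

    # Switch to the newly opened window, assuming its handle is listed last.
    driver.switch_to.window(driver.window_handles[-1])
    driver.quit()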

Some users are exploiting the fact that certain drivers
let you perform key combinations that affect the OS or the browser.
For example, a ^T or ⌘T combination
will normally open a new tab in desktop browsers,
but WebDriver is restricted to web content
and is not meant to let you interact with the surrounding system.

There is currently no platform API
that lets you open a new, untainted top-level browsing context.
A WebDriver command to do exactly that is long overdue,
and it will complement the other window manipulation commands well.

Logging

In the Selenium project
many drivers implemented a rudimentary
logging API
that made it possible to get different logs,
such as console and performance logs,
the driver logs, and Selenium Grid logs.
We talked about logging several years ago
but decided to put it on hold
in order to narrow the scope of the first draft
and focus on getting the fundamentals right.

Simon came up with a new strawman proposal
that lets you request log services from arbitrary remote ends
between you and the final endpoint node.
What sets the new API apart from the existing Selenium logging API
is that it distinguishes logs from individual classes of nodes.
Your favourite WebDriver-in-the-cloud provider
might provide a service for taking screenshots for every failing test,
and with this new API it will be possible to request those in a uniform way.

Permissions

As WebDriver is a specification text,
other standards now have the ability to leverage its definitions
to meet their own demands.
We are seeing an example of this with
the Permissions API
that is extending WebDriver to instrument getUserMedia().
The ability to write automated tests for permissions
allows shared test suites like Web Platform Tests
to promote consistency amongst browsers,
but for example also means web authors will get the opportunity
to test geolocation for maps and other types of media.

WebVR

Can you imagine driving
a WebVR headset using WebDriver?
Well, the WebVR working group can.
It will be an unconventional use of WebDriver,
but it turns out that WebDriver’s API lends itself well
to the kind of spatial navigation that is needed to control headsets.

To go into virtual reality mode in your browser
you first need permission from the user,
and it’s therefore exciting to see that we are starting to build
an ecosystem of tools for browser instrumentation
with the addition of the Permissions API.

December 15, 2017

When you are using Selenium and geckodriver to automate your tests in Firefox, you might see a behavior change with Firefox 58 when using the commands Element Click or Element Send Keys. For both commands we have now enabled the interactability checks by default. That means that before such an operation is performed on any kind of element, we first check whether a click on it, or sending keys to it, would work from a normal user's perspective at all. If not, a not-interactable error is thrown.

If you are asking why this change was necessary, the answer is that we are now more conformant with the WebDriver specification.

While pushing this change out by default, we are aware of corner cases where we might accidentally throw such a not-interactable error, or falsely assume the element is interactable. If you hit such a condition, it would be fantastic to let us know about it, ideally by filing a geckodriver issue with all the information required to make it reproducible for us.

In case the problem causes issues for your test suites, but you still want to use Firefox 58, you can use the capability moz:webdriverClick to turn off those checks. Simply set it to false, and the former behavior will be restored. But please note that this workaround will only work for Firefox 58, and maybe Firefox 59, because after that the old, legacy behavior will be removed.
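
With Selenium's Python bindings, setting that capability could look roughly like this (a sketch only; depending on your Selenium version the capability may need to be passed differently):

    from selenium import webdriver

    # Sketch: restore the pre-Firefox-58 behavior by disabling the new
    # interactability checks via the moz:webdriverClick capability.
    capabilities = webdriver.DesiredCapabilities.FIREFOX.copy()
    capabilities["moz:webdriverClick"] = False

    driver = webdriver.Firefox(capabilities=capabilities)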

So please let us know about any misbehavior when using Firefox 58, so that we have enough time to get it fixed for Firefox 59, or even 58.

December 04, 2017

There is a strong argument in modern software engineering that a
sequence of smaller changes is preferable to a single large
change. This approach facilitates development (easier to debug,
quicker to land), testing (less functionality to verify at once),
reviews (less code to keep in the reviewer’s head), and archaeology
(annotations are easier to follow). Recommended limits are in the
realm of 300-400 changed lines of code per patch (see, for example,
the great article “How to Do Code Reviews Like a Human (Part Two)”).

400 lines can still be a fairly complex change. Microcommits is the
small-patch approach taken to its logical conclusion. The idea is to
make changes as small as possible but no smaller, resulting in a
series of atomic commits that incrementally implement a fix or new
feature. It’s not uncommon for such a series to contain ten or more
commits, many changing only 20 or 30 lines. It requires some
discipline to keep commits small and cohesive, but it is a skill that
improves over time and, in fact, changes how you think about building
software.

Former Mozillian Lucas Rocha has a great summary of some of the
benefits. Various other Mozillians have espoused their personal
beliefs that Firefox engineering would do well to more widely adopt
the microcommits approach. I don’t recall ever seeing an organized
push towards this philosophy, however; indeed, for better or for worse
Mozilla tends to shy away from this type of pronouncement. This left
me with a question: have many individual engineers started working with
microcommits? If we do not have a de jure decision to work this way, do
we have a de facto decision?

We designed MozReview to be repository-based to encourage the
microcommit philosophy. Pushing up a series of commits automatically
creates one review request per commit, and they are all tied together
(albeit through the “parent review request” hack which has
understandably caused some amount of confusion). Updating a series,
including adding and removing commits, just works. Although we never
got around to implementing support for confidential patches (a
difficult problem given that VCSs aren’t designed to have a mix of
permissions in the same repo), we were pretty proud of the fact that
MozReview was unique in its first-class support for publishing and
reviewing microcommit series.

While MozReview was never designated the Firefox review tool,
through organic growth it is now used to review (and generally land)
around 63% of total commits to mozilla-central, looking at stats for
bugs in the Core, Firefox, and Toolkit products:

To be honest, I was a little surprised at the numbers. Not only had
MozReview grown in popularity over the last year, but much of its
growth occurred right around the time its pending retirement was
announced. In fact, it continued to grow slightly over the rest of the
year.

However, we figured that, owing to MozReview’s support for
microcommits, this wasn’t quite a fair comparison. Bugzilla’s
attachment system discourages multiple patches per bug, even with
command-line tools like bzexport. So we figured that, generally, a fix
submitted to MozReview would have more parts than a corresponding fix
submitted as a traditional BMO patch. Thus the
bug-to-MozReview-request ratio would be lower than the bug-to-patch
ratio. We ran a query on the number of MozReview requests per bug in
about the last seven months. These results yielded further surprises:

About 75% of MozReview commit “series” contain only a single
commit. 12% contain only two commits, 5% contain three, and 2.7%
contain four. Series with five or more commits make up only 5.3%.

Still, it seems MozReview has perhaps encouraged the splitting up of
work per bug to some degree, given that 25% of series had more than
one commit. We decided to compare this to traditional patches attached
to bugs, which are both more annoying to create and to apply:

Well then. Over approximately the same time period, of bugs with
old-style attachments, 76% had a single patch. For bugs with two,
three, and four patches, the proportions were 13%, 7%, and 1.5%,
respectively. This is extremely close to the MozReview case. The mean
is almost equal in both cases, in fact, slightly higher in the
old-style-attachment case: 1.65 versus 1.61. The median in both cases
is 1.

Okay, maybe the growing popularity of MozReview in 2017 influenced the
way people now use BMO. Perhaps a good number of authors use both
systems, or the reviewers preferring MozReview are being vocal about
wanting at least two or three patches over a single one when reviewing
in BMO. So we looked back to the situation with BMO patches in early
2016:

For one more piece of evidence, this scatter plot shows that, on
average, we’ve been using both BMO and MozReview in about the same
way, in terms of discrete code changes per bug, over the last two
years:

There are a few other angles we could conceivably consider, but the
evidence strongly suggests that developers are (1) creating, in most
cases, “series” of only one or two commits in MozReview and (2)
working in approximately the same way in both BMO and MozReview, in
terms of splitting up work.

I strongly believe we would benefit a great deal from making more of
engineering’s assumptions and expectations clearer; this is a
foundation of driving effective decisions. We don’t have to be
right all the time, but we do have to be conscious, and we have to own
up to mistakes. The above data leads me to conclude that the
microcommit philosophy has not been widely adopted at Mozilla. We
don’t, as a whole, care about using series of carefully structured and
small commits to solve a problem. This is not an opinion on how we should work, but a conclusion on how we do work,
informed by data. This is, in effect, a decision that has already
been made, whether or not we realized it.

Although I am interested in this kind of thing from an academic
perspective, it also has a serious impact on my direct
responsibilities as an engineering manager. Recognizing such decisions
will allow us to better prioritize our improvements to tooling and
automation at Mozilla, even if it has to first precipitate a serious
discussion and perhaps a difficult, conscious decision.

I will have more thoughts on why we have neither organically nor
structurally adopted the microcommits approach in my next blog
post. Spoiler: it may have to do with prevailing trends in open-source
development, likely influenced by GitHub.

November 17, 2017

With work on Phabricator–BMO integration wrapping up, the development
team’s focus has switched to the new automatic-landing service that
will work with Phabricator. The new system is called “Lando” and
functions somewhat similarly to MozReview–Autoland, with the biggest
difference being that it is a standalone web application, not tightly
integrated with Phabricator. This gives us much more flexibility and
allows us to develop more quickly, since working within extension
systems is often painful for anything nontrivial.

Lando is split between two services: the landing engine, “lando-api”,
which transforms Phabricator revisions into a format suitable for the
existing autoland service (called the “transplant server”), and the
web interface, “lando-ui”, which displays information about the
revisions to land and kicks off jobs. We split these services partly
for security reasons and partly so that we could later have other
interfaces to lando, such as command-line tools.

When I last posted an update I included an early screenshot of
lando-ui. Since then, we have done some user testing of our prototypes
to get early feedback. Using a great article,
“Test Your Prototypes: How to Gather Feedback and Maximise
Learning”, as a guide, we took our prototype to some interested future
users. Refraining from explaining anything about the interface and
providing only some context on how a user would get to the
application, we encouraged them to think out loud, explaining what the
data means to them and what actions they imagine the buttons and
widgets would perform. After each session, we used the feedback to
update our prototypes.

These sessions proved immensely useful. The feedback on our third
prototype was much more positive than on our first prototype. We
started out with an interface that made sense to us but was confusing
to someone from outside the project, and we ended with one that was
clear and intuitive to our users.

For comparison, this is what we started with:

And here is where we ended:

A partial implementation of the third prototype, with a few more small
tweaks raised during the last feedback session, is currently on
http://lando.devsvcdev.mozaws.net/revisions/D6. There are currently
some duplicated elements there just to show the various states; this
redundant data will of course be removed as we start filling in the
template with real data from Phabricator.

Phabricator remains in a pre-release phase, though we have some people
now using it for mozilla-central reviews. Our team continues to use it
daily, as does the NSS team. Our implementation has been very stable,
but we are making a few changes to our original design to ensure it
stays rock solid. Lando was scheduled for delivery in October, but due
to a few different delays, including being one person down for a
while and not wanting to launch a new tool during the flurry of the
Firefox 57 launch, we’re now looking at a January launch date. We
should have a working minimal version ready for Austin, where we have
scheduled a training session for Phabricator and a Lando demo.

November 16, 2017

Often I hear about our Talos results: why are they so noisy? What is noise in this context? By noise we are referring to a larger stddev in the results we track; here is an example:

With the large spread of values posted regularly for this series, it is hard to track improvements or regressions unless they are larger or very obvious.

Knowing the definition of noise, there are a few questions that we often need to answer:

Developers working on new tests- what is the level of noise, how to reduce it, what is acceptable

Over time noise changes- this causes false alerts, often not related to code changes or easily discovered via infra changes

New hardware we are considering- is this hardware going to post reliable data for us?

What I care about is the last point: we are working on replacing the hardware we run performance tests on, from 7-year-old machines to new machines! Typically when running tests on a new configuration, we want to make sure it is reliably producing results. For our system, we look for all green:

This is really promising- if we could have all our tests this "green", developers would be happy. The catch here is that these are performance tests: are the results we collect and post to graphs useful? Another way to ask this is: are the results noisy?

Answering this is hard; first we have to know how noisy things are prior to the test. As mentioned 2 weeks ago, Talos collects 624 metrics that we track for every push. That would be a lot of graphs and calculating. One method is to push to try with a single build and collect many data points for each test. You can see that in the image showing the all-green results.

One method to see the noise is to look at compare view. This is the view that we use when comparing one push to another push when we have multiple data points. This typically highlights the changes that are easy to detect with our t-test for alert generation. If we look at the above referenced push and compare it to itself, we have:

Here you can see that for a11y, linux64 has a +- 5.27 stddev. You can see some metrics are higher and others are lower. What if we added up all the stddev numbers that exist- what would we have? If we treat this as a sum of squares to calculate the variance, we can generate a single number, in this case 64.48! That is the noise for that specific run.
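
As a rough sketch of that aggregation (the per-test numbers below are invented, and whether you report the summed variance or its square root is a matter of convention):

    import math

    # Per-test standard deviations pulled from compare view (invented values).
    stddevs = [5.27, 3.1, 0.8, 12.4]

    # Sum of squares, i.e. a combined variance, as described above.
    total_variance = sum(s ** 2 for s in stddevs)

    # Taking the square root gives a combined number back in stddev units.
    noise = math.sqrt(total_variance)

    print(round(total_variance, 2), round(noise, 2))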

Now if we are bringing up a new hardware platform, we can collect a series of data points on the old hardware and repeat this on the new hardware, now we can compare data between the two:

What is interesting here is that we can see, side by side, the differences in noise as well as the improvements and regressions. What about the variance? I wanted to track that and did, but realized I needed to track the variance by platform, as each platform could be different. In bug 1416347, I set out to add a Noise Metric to the compare view. This is on Treeherder staging and will probably be in production next week. Here is what you will see:

Here we see that the old hardware has a noise of 30.83 and the new hardware a noise of 64.48. While there are a lot of small details to iron out as we work on getting new hardware for linux64, windows7, and windows10, we now have a simpler method for measuring the stability of our data.

From there you can navigate to individual changesets. The diff viewer will only highlight added lines as having coverage or no coverage. There are also some added lines that will not be highlighted, since they are non-code lines.

November 03, 2017

Background story: I’ve been working with Mozilla full-time since 2009 (contributor in 2007 — intern in 2008). I’ve been working with the release engineering team, the automation team (A-team) and now within the Product Integrity organization. In all these years I’ve been blessed with great managers, smart and helpful co-workers, and enthusiastic support to explore career opportunities. It is an environment that has helped me flourish as a software engineer.

I will go straight to some of the benefits that I’ve enjoyed this year.

Parental leave

Three months at 100% of my salary. I did not earn bonus payouts during that time; however, it was worth it for the time I spent with my firstborn. We bonded very much during that time, I learned how to take care of my family while my wife worked, and I can proudly say that he’s a “daddy’s boy” :) (Not that I spoil him!).

Working from home 100% of the time

My favourite benefit. Period.

It really helps me as an employee, as I don’t enjoy commuting and I tend to talk a lot when I’m in the office. My family is very respectful of my work hours and I’m able to have deep-thought sessions in the comfort of my own home.

This is not a benefit that a lot of companies give, especially the bigger ones which expect you to relocate and come often to the office. I chuckle when I hear a company offer that their employees can work from home only a couple of days per week.

Wellness benefits

I appreciate that Mozilla allocates part of its budget to pay for anything related to employee wellness (mental, spiritual & physical). Knowing that if I don’t use it I will lose it causes me to think about ways to apply the money to help me stay in shape.

Learning support/budget

This year, after a re-org and many years of doing the same work, I found myself in need of a new adventure — I get bored if I don’t feel as though I’m learning. With my manager’s support (thanks jmaher!), I embarked on a journey to become a front-end developer. Mozilla also supported me by paying for me to complete a React Nanodegree as part of the company’s learning budget.

To my great surprise, React has become rather popular inside Mozilla, and there is great need for front-end work within my org. It was also a nice surprise to see that switching to Javascript from Python was not as difficult as I thought it would be.

November 01, 2017

Over the last 6 months there has been a deep focus on performance in order to release Firefox 57. Hundreds of developers sought out performance improvements and after thousands of small adjustments we see massive improvements.

Last week I introduced Ionut, who has come in as a Performance Sheriff. What do we do on a regular basis when it comes to monitoring performance? In the past I focused on Talos and how many bugs per release we found, fixed, and closed. While that is fun and interesting, we have expanded the scope of sheriffing.

We continue to refine benchmarks and tests on each of these frameworks to ensure we are running on relevant configurations, measuring the right things, and not duplicating data unnecessarily.

Looking at the list of frameworks, we collect 1127 unique data points, alert on them, and file bugs for anything sustained and valid. While the number of unique metrics can change, here is the current number of metrics we track:

Framework                   Total Metrics
Talos                       624
Autophone                   19
Build Metrics               172
AWSY                        83
Platform Microbenchmarks    229
Total                       1127

While we generate these metrics for every commit (or every few commits for load reasons), what happens is we detect a regression and generate an alert. In fact we have a sizable number of alerts in the last 6 weeks:

Framework                   Total Alerts
Talos                       429
Autophone                   77
Build Metrics               264
AWSY                        85
Platform Microbenchmarks    227
Total                       1082

Alerts are not really what we file bugs on; instead we have an alert summary, which can (and typically does) contain a set of alerts. Here is the total number of alert summaries (i.e. what a sheriff will look at):

Framework                   Total Summaries
Talos                       172
Autophone                   54
Build Metrics               79
AWSY                        29
Platform Microbenchmarks    136
Total                       470

These alert summaries are then mapped into bugs (or downstream alerts to where the alerts started). Here is a breakdown of the bugs we have:

Framework                   Total Bugs
Talos                       41
Autophone                   3
Build Metrics               17
AWSY                        6
Platform Microbenchmarks    6
Total                       73

This indicates there are 73 bugs associated with performance summaries. What is deceptive here is that many of those bugs are 'improvements' and not 'regressions'. In case you hadn't figured it out, we do associate improvements with bugs and try to comment in the bugs to let you know the impact your code has on a [set of] metric[s]. Filtering down to just the regressions, here is the breakdown:

Framework                   Total Bugs
Talos                       23
Autophone                   3
Build Metrics               14
AWSY                        4
Platform Microbenchmarks    3
Total                       47

This is a much smaller number of bugs. Now, there are a few quirks here:

some regressions show up across multiple frameworks (reduces to 43 total)

some bugs that are ‘downstream’ are marked against the root cause instead of just being downstream. Often this happens when we are sheriffing bugs and a downstream alert shows up a couple days later.

Note that Firefox 58 has 28 bugs associated with it, but we have 43 bugs from the above query. Some of those bugs from the above query are related to Firefox 57, and some are starred against a duplicate bug or a root cause bug instead of the regression bug.

I hope you find this data useful and informative towards understanding what goes on with all the performance data.

October 23, 2017

I gave an update 2 weeks ago on the current state of Stockwell (intermittent failures). I mentioned additional posts were coming, and this is the second post in the series.

First off, the tree sheriffs, who handle merges between branches, tree closures, backouts, hot fixes, and many other actions that keep us releasing code, perform one important task: they star failures against a corresponding bug.

Once we get bugs annotated, we work on triaging them. Our primary tool is Neglected Oranges, which gives us a view of all failures that meet our threshold and don't have a human comment in the last 7 days. Here is the next stage of the process:

As you can see this is very simple, and it should be simple. The ideal state is adding more information to the bug which helps make it easier for the person we NI? to prioritize the bug and make a decision:

While there is a lot more we can do, and much more that we have done, this seems to be the most effective approach when looking across the 1000+ bugs that we have triaged so far this year.

In some cases a bug fails very frequently and there are no development resources to spend fixing it- these bugs will sometimes cross our 200-failures-in-30-days policy and will get a [stockwell disabled-recommended] whiteboard tag; we monitor this and work to disable the affected tests on a regular basis:

This isn’t as cut and dried as disabling every test, but we do disable as quickly as possible and push hard on the bugs that are not as trivial to disable.

There are many new people working on Intermittent Triage and having a clear understanding of what they are doing will help you know how a random bug ended up with a ni? to you!

October 18, 2017

About 8 months ago we started looking for a full time performance sheriff to help out with our growing number of alerts and needs for keeping the Talos toolchain relevant.

We got really lucky and ended up finding Ionut (:igoldan on irc, #perf). Over the last 6 months, Ionut has done a fabulous job of learning how to understand Talos alerts, graphs, scheduling, and narrowing down root causes. In fact, he has not only been able to easily handle all of the Talos alerts, Ionut has picked up alerts from Autophone (Android devices), Build Metrics (build times, installer sizes, etc.), AWSY (memory metrics), and Platform Microbenchmarks (tests run inside of gtest written by a few developers on the graphics and stylo teams).

While I could probably write a list of Ionut’s accomplishments and some tricky bugs he has sorted out, I figured your enjoyment of reading this blog is better spent on getting to know Ionut, so I did a Q&A with him so we can all learn much more about him.

Tell us about where you live?

I live in Iasi. It is a gorgeous and colorful town, somewhere in the North-East of Romania. It is full of great places and enchanting sunsets. I love how a casual walk
leads me to new, beautiful and peaceful neighborhoods.

I have many things I very much appreciate about this town:
the people here, its continuous growth, its historical resonance, the fact that its streets once echoed the steps of the most important cultural figures of our country. It also resembles ancient Rome, as it is also built on 7 hills.

It’s pretty hard not to act like a poet around here.

What inspired you to be a computer programmer?

I wouldn’t say I was inspired to be a programmer.

During my last years in high school, I occasionally consulted with my close ones. Each time we concluded that IT is just the best domain to specialize in: it will improve continuously, there will be jobs available; things that are evident nowadays.

I found much inspiration in this domain after the first year in college, when I noticed the huge advances and how they’re conducted. I understood we’re living in a whole new era. Digital transformation is now the coined term for what’s going on.

Any interesting projects you have done in the past (school/work/fun)?

I had the great opportunity to work with brilliant teams on a full advertising platform, from almost scratch.

It got almost everything: it was distributed, highly scalable, completely written in
Python 3.X, the frontend adopted material design, NoSQL database in conjunction with SQL ones… It used some really cutting-edge libraries and it was a fantastic feeling.

Now it’s Firefox. The sound name speaks for itself and there are just so many cool things I can do here.

What hobbies do you have?

I like reading a lot. History and software technology are my favourite subjects.
I enjoy cooking, when I have the time. My favourite dish definitely is the Hungarian goulash.

Also, I enjoy listening to classical music.

If you could solve any massive problem, what would you solve?

Greed. Laziness. Selfishness. Pride.

We can resolve all problems we can possibly encounter by leveraging technology.

Keeping non-values like those mentioned above would ruin every possible achievement.

Where do you see yourself in 10 years?

In a peaceful home, being a happy and caring father, spending time and energy with
my loved ones. Always trying to be the best example for them. I envision becoming a top notch professional programmer, leading highly performant teams on
sound projects. Always familiar with cutting-edge tech and looking to fit it in our tool set.

Constantly inspiring values among my colleagues.

Do you have any advice or lessons learned for new students studying computer science?

Be passionate about IT technologies. Always be curious and willing to learn about new things. There are tons and tons of very good videos, articles, blogs, newsletters, books, docs…Look them out. Make use of them. Follow their guidance and advice.

Continuous learning is something very specific for IT. By persevering, this will become your second nature.

Treat every project as a fantastic opportunity to apply related knowledge you’ve acquired. You need tons of coding to properly solidify all that theory, to really understand why you need to stick to the Open/Closed principle and all other nitty-gritty little things like that.

I have really enjoyed getting to know Ionut and working with him. If you see him on IRC please ping him and say hi

October 17, 2017

I have done a poor job of communicating status on our performance tooling; this is something I am trying to rectify this quarter. Over the last 6 months many new Talos tests have come online, along with some differences in scheduling or measurement.

In this post I will highlight many of the test related changes and leave other changes for a future post.

October 09, 2017

It has been 6 months since the last Stockwell update. With new priorities for many months and reduced effort on Stockwell, I overlooked sending updates. While we have been spending a reasonable amount of time hacking on Stockwell, it has been less transparent.

I want to cover where we were a year ago, and where we are today.

1 year ago today I posted on my blog about defining intermittent. We were just starting to focus on learning about failures. We collected data, read bugs, interviewed many influential people across Mozilla, and came up with a plan, Stockwell, which we presented at the Hawaii all hands. Our plan was to do a few things:

Triage all failures >=30 instances/week

Build tools to make triage easier and collect more data

Adjust policy for triaging, disabling, and managing intermittents

Make our tests better with linting and test-verification

Invest time into auto-classification

Define test ownership and triage models that are scalable

While we haven’t focused 100% on intermittent failures in the last 52 weeks, we did about half the time, and have achieved a few things:

While that is a lot of changes, it is incremental yet effective. We started with an Orange Factor of 24+, and now we often see <12 (although last week it was closer to 14). While doing that we have added many tests, almost doubling our test load, and the Orange Factor has remained low. We still don't consider that a success: we often have 50+ bugs in a state of "needswork", and it would be ideal to have <20 in progress at any one time. We are also still ignoring half the problem- all the other failures that do not cross our threshold of 30 failures/week.

Some statistics about bugs over the last 9 months (Since January 1st):

Category      # Bugs
Fixed         511
Disabled      262
Infra         62
Needswork     49
Unknown       209
Total         1093

As you can see, that is a lot of disabled tests. Note that we usually only disable a test on a subset of the configurations, not 100% across the board. Another note: unknown bugs are ones that were failing frequently and, for some undocumented reason, have reduced in frequency.

One other interesting piece of data: we have tried to associate many of the fixed bugs with a root cause. We have done this for 265 bugs, and 90 of them are actual product fixes. The rest are harness, tooling, infra, or, more commonly, test case fixes.

I will be doing some followup posts on details of the changes we have made over the year including:

Triage process for component owners and others who want to participate

Test verification and the future

Workflow of an intermittent, from first failure to resolution

Future of Orange Factor and Autoclassification

Vision for the future in 6 months

Please note that the 511 fixed bugs were fixed by the many great developers we have at Mozilla. These were often randomized requests in a very busy schedule, so if you are reading this and you fixed an intermittent, thank you!

September 07, 2017

As the manager responsible for driving the decision and process behind
the move to Phabricator at Mozilla, I’ve been asked to write about my
recent experiences, including how this decision was implemented, what
worked well, and what I might have done differently. I also have a
few thoughts about decision making both generally and at Mozilla
specifically.

Please note that these thoughts reflect only my personal opinions.
They are not a pronouncement of how decision making is or will be done
at Mozilla, although I hope that my account and ideas will be useful
as we continue to define and shape processes, many of which are still
raw years after we became an organization with more than a thousand
employees, not to mention the vast number of volunteers.

Mozilla has used Bugzilla as both an issue tracker and a code-review
tool since its inception almost twenty years ago. Bugzilla was
arguably the first freely available web-powered issue tracker, but
since then, many new apps in that space have appeared, both
free/open-source and proprietary. A few years ago, Mozilla
experimented with a new code-review solution, named (boringly)
“MozReview”, which was built around Review Board, a third-party
application. However, Mozilla never fully adopted MozReview, leading
to code review being split between two tools, which is a confusing
situation for both seasoned and new contributors alike.

There were many reasons that MozReview didn’t completely catch on,
some of which I’ve mentioned in previous blog and newsgroup posts.
One major factor was the absence of a concrete, well-communicated,
and, dare I say, enforced decision. The project was started by a
small number of people, without a clearly defined scope, no
consultations, no real dedicated resources, and no backing from upper
management and leadership. In short, it was a recipe for failure,
particularly considering how difficult change is even in a perfect
world.

Having recognized this failure last year, and with the urging of some
people at the director level and above, my team and I embarked on a
process to replace both MozReview and the code-review functionality in
Bugzilla with a single tool and process. Our scope was made clear: we
wanted the tool that offered the best mechanics for code-review at
Mozilla specifically. Other bits of automation, such as
“push-to-review” support and automatic landings, while providing many
benefits, were to be considered separately. This division of concerns
helped us focus our efforts and make decisions clearer.

Our first step in the process was to hold a consultation. We
deliberately involved only a small number of senior engineers and
engineering directors. Past proposals for change have faltered on
wide public consultation: by their very nature, you will get virtually
every opinion imaginable on how a tool or process should be
implemented, which often leads to arguments that are rarely settled,
and even when “won” are still dominated by the loudest voices—indeed,
the quieter voices rarely even participate for fear of being shouted
down. Whereas some more proactive moderation may help, using a
representative sample of engineers and managers results in a more
civil, focussed, and productive discussion.

I would, however, change one aspect of this process: the people
involved in the consultation should be more clearly defined, and not
an ad-hoc group. Ideally we would have various advisory groups that
would provide input on engineering processes. Without such people
clearly identified, there will always be lingering questions as to the
authority of the decision makers. There is, however, still much value
in also having a public consultation, which I’ll get to shortly.

There is another aspect of this consultation process which was not
clearly considered early on: what is the honest range of solutions we
are considering? There has been a movement across Mozilla, which I
fully support, to maximize the impact of our work. For my team, and
many others, this means a careful tradeoff of custom, in-house
development and third-party applications. We can use entirely custom
solutions, we can integrate a few external apps with custom
infrastructure, or we can use a full third-party suite. Due to the
size and complexity of Firefox engineering, the latter is effectively
impossible (also the topic for a series of posts). Due to the size of
engineering-tools groups at Mozilla, the first is often ineffective.

Thus, we really already knew that code-review was a very likely
candidate for a third-party solution, integrated into our existing
processes and tools. Some thorough research into existing solutions
would have further tightened the project’s scope, especially given
Mozilla’s particular requirements, such as Mercurial support, which
are in part due to a combination of scale and history. In the end,
there are few realistic solutions. One is Review Board, which we used
in MozReview. Admittedly we introduced confusion into the app by
tying it too closely to some process-automation concepts, but it also
had some design choices that were too much of a departure from
traditional Firefox engineering processes.

The other obvious choice was Phabricator. We had considered it some
years ago, in fact as part of the MozReview project. MozReview was
developed as a monolithic solution with a review tool at its core, so
the fact that Phabricator is written in PHP, a language without much
presence at Mozilla today, was seen as a pretty big problem. Our new
approach, though, in which the code-review tool is seen as just one
component of a pipeline, means that we limit customizations largely to
integration with the rest of the system. Thus the choice of
technology is much less important.

The fact that Phabricator was virtually a default choice should have
been more clearly communicated both during the consultation process
and in the early announcements. Regardless, we believe it is in fact
a very solid choice, and that our efforts are truly best spent solving
the problems unique to Mozilla, of which code review is not.

To sum up, small-scale consultations are more effective than open
brainstorming, but it’s important to really pay attention to scope and
constraints to make the process as effective and empowering as
possible.

Lest the above seem otherwise, open consultation does provide an
important part of the process, not in conceiving the initial solution
but in vetting it. The decision makers cannot be “the community”, at
least, not without a very clear process. It certainly can’t be the
result of a discussion on a newsgroup. More on this later.

Identifying the decision maker is a problem that Mozilla has been
wrestling with for years. Mitchell has previously pointed out that we
have a dual system of authority: the module system and a management
hierarchy. Decisions around tooling are even less clear, given that
the relevant modules are either nonexistent or sweepingly defined.
Thus in the absence of other options, it seemed that this should be a
decision made by upper management, ultimately the Senior Director of
Engineering Operations, Laura Thomson. My role was to define the
scope of the change and drive it forward.

Of course since this decision affects every developer working on
Firefox, we needed the support of Firefox engineering management.
This has been another issue at Mozilla; the directorship was often
concerned with the technical aspects of the Firefox product, but there
was little input from them on the direction of the many supporting
areas, including build, version control, and tooling. Happily I found
out that this problem has been rectified. The current directors were
more than happy to engage with Laura and me, backing our decision as
well as providing some insights into how we could most effectively
communicate it.

One suggestion they had was to set up a small hosted test instance and
give accounts to a handful of senior engineers. The purpose of this
was to both give them a heads up before the general announcement and
to determine if there were any major problems with the tool that we
might have missed. We got a bit of feedback, but nothing we weren’t
already generally aware of.

At this point we were ready for our announcement. It’s worth pointing
out again that this decision had effectively already been made,
barring any major issues. That might seem disingenuous to some, but
it’s worth reiterating two major points: (a) a decision like this,
really, any nontrivial decision, can’t be effectively made by a large
group of people, and (b) we did have to be honestly open to the idea
that we might have missed some big ramification of this decision and
be prepared to rethink parts, or maybe even all, of the plan.

This last piece is worth a bit more discussion. Our preparation for
the general announcement included several things: a clear
understanding of why we believe this change to be necessary and
desirable, a list of concerns we anticipated but did not believe were
blockers, and a list of areas that we were less clear on that could
use some more input. By sorting out our thoughts in this way, we
could stay on message. We were able to address the anticipated
concerns but not get drawn into a long discussion. Again this can
seem dismissive, but if nothing new is brought into the discussion,
then there is no benefit to debating it. It is of course important to
show that we understand such concerns, but it is equally important to
demonstrate that we have considered them and do not see them as
critical problems. However, we must also admit when we do not yet
have a concrete answer to a problem, along with why we don’t think it
needs an answer at this point—for example, how we will archive past
reviews performed in MozReview. We were open to input on these issues,
but also did not want to get sidetracked at this time.

All of this was greatly aided by having some members of Firefox and
Mozilla leadership provide input into the exact wording of the
announcement. I was also lucky to have lots of great input from Mardi
Douglass, this area (internal communications) being her specialty.
Although no amount of wordsmithing will ensure a smooth process, the
end result was a much clearer explanation of the problem and the
reasons behind our specific solution.

Indeed, there were some negative reactions to this announcement,
although I have to admit that they were fewer than I had feared there
would be. We endeavoured to keep the discussion focussed, employing
the above approach. There were a few objections we hadn’t fully
considered, and we publicly admitted so and tweaked our plans
accordingly. None of the issues raised were deemed to be
show-stoppers.

There were also a very small number of messages that crossed a line of
civility. This line is difficult to determine, although we have often
been too lenient in the past, alienating employees and volunteers
alike. We drew the line in this discussion at posts that were
disrespectful, in particular those that brought little of value while
questioning our motives, abilities, and/or intentions. Mozilla has
been getting better at policing discussions for toxic behaviour, and I
was glad to see a couple people, notably Mike Hoye, step in when
things took a turn for the worse.

There is also a point in which a conversation can start to go in
circles, and in the discussion around Phabricator (in fact in response
to a progress update a few months after the initial announcement) this
ended up being around the authority of the decision makers, that is,
Laura and myself. At this point I requested that a Firefox
engineering director, in this case Joe Hildebrand, get involved and
explain his perspective and voice his support for the project. I wish
I didn’t have to, but I did feel it was necessary to establish a
certain amount of credibility by showing that Firefox leadership was
both involved with and behind this decision.

Although disheartening, it is also not surprising that the issue of
authority came up, since as I mentioned above, decision making has
been a very nebulous topic at Mozilla. There is a tendency to invoke
terms like “open” and “transparent” without in any way defining them,
evoking an expectation that everyone shares an understanding of how we
ought to make decisions, or even how we used to make decisions in some
long-ago time in Mozilla’s history. I strongly believe we need to lay
out a decision-making framework that values openness and transparency
but also sets clear expectations of how these concepts fit into the
overall process. The most egregious argument along these lines that
I’ve heard is that we are a “consensus-based organization”. Even if
consensus were possible in a decision that affects literally hundreds,
if not thousands, of people, we are demonstrably not consensus-driven
by having both module and management systems. We do ourselves a
disservice by invoking consensus when trying to drive change at
Mozilla.

On a final note, I thought it was quite interesting that the topic of
decision making, in the sense of product design, came up in the recent
CNET article on Firefox 57. To quote Chris Beard, “If you try to make
everyone happy, you’re not making anyone happy. Large organizations
with hundreds of millions of users get defensive and try to keep
everybody happy. Ultimately you end up with a mediocre product and
experience.” I would in fact extend that to trying to make all
Mozillians happy with our internal tools and processes. It’s a scary
responsibility to drive innovative change at Mozilla, to see where we
could have a greater impact and to know that there will be resistance,
but if Mozilla can do it in its consumer products, we have no excuse
for not also doing so internally.

August 02, 2017

It's no secret that I'm not a fan of try syntax; it's a topic I've blogged about on several occasions before. Today, I'm pleased to announce that there's a real alternative now landed on
mozilla-central. It works on all platforms with mercurial and git. For those who just like to dive in:

    $ ./mach try fuzzy

This will prompt you to install fzf. After bootstrapping is finished, you'll enter an interface
populated with a list of all possible taskcluster tasks. Start typing and the list will be filtered
down using a fuzzy matching algorithm. I won't go into details on how to use this tool in this blog
post, for that see:

Like the existing mach try command, this should work with mercurial via the push-to-try
extension or git via git-cinnabar. If you encounter any problems or bad UX, please file a bug
under Testing :: General.

Try Task Config

The following section is all about the implementation details, so if you're curious or want to write
your own tools for selecting tasks on try, read on!

This new try selector is not based on try syntax. Instead it's using a brand new scheduling
mechanism called try task config. Instead of encoding scheduling information in the commit
message, mach try fuzzy encodes it in a JSON file at the root of the tree called
try_task_config.json. Very simply (for now), the decision task knows to look for that file on try.
If found, it will read the JSON object and schedule every task label it finds. There are also hooks
to prevent this file from accidentally being landed on non-try branches.

What this means is that anything that can generate a list (or dict) of task labels can be a try
selector. This new JSON format is much easier for tools to write, and for taskgraph to read.
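
For illustration, a selector's output could be as simple as a handful of task labels written to that file (the labels below are made up, and the exact schema isn't covered here):

    import json

    # Invented task labels; a real selector would take these from the taskgraph.
    task_labels = [
        "test-linux64/opt-mochitest-e10s-1",
        "test-linux64/opt-xpcshell",
        "build-win64/debug",
    ]

    # Write the file the decision task looks for at the root of the tree.
    with open("try_task_config.json", "w") as fh:
        json.dump(task_labels, fh, indent=2)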

Creating a Try Selector

There are currently two ways to schedule tasks on try (syntax and fuzzy). But I envision 4-5 different
methods in the future. For example, we might implement a TestResolver based try selector which
given a path can determine all affected jobs. Or there could be one that uses globbing/regex to
filter down the task list which would be useful for saving "presets". Or there could be one that
uses a curses UI like the hg trychooser extension.

To manage all this, each try selector is implemented as an @SubCommand of mach try. The regular
syntax selector is implemented under mach try syntax now (though mach try without any
subcommand will dispatch to syntax to maintain backwards compatibility). All this lives in a newly
created tryselect module.

If you want to create a new try selector, you'll need two things:

A list of task labels as input.

The ability to write those labels to try_task_config.json and push it to try.

Luckily tryselect provides both of those things. The first can be obtained using the tasks.py
module. It basically does the equivalent of running mach taskgraph target, but will also
automatically cache the resulting task list so future invocations run much quicker.

The second can be achieved using the vcs.py module. This uses the same approach that the old
syntax selector has been using all along. It will commit try_task_config.json temporarily and
then remove all traces of the commit (and of try_task_config.json).

You can inspect the fuzzy implementation to see how all this ties together.
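
Stripped of all detail, a new selector ends up being a small function along these lines (load_task_labels and push_to_try below are stand-ins for what tasks.py and vcs.py actually provide, not real APIs):

    def load_task_labels():
        # Stand-in for tasks.py: the cached equivalent of `mach taskgraph target`.
        return ["test-linux64/opt-xpcshell", "test-win64/opt-mochitest-1"]

    def push_to_try(labels):
        # Stand-in for vcs.py: temporarily commit try_task_config.json and push.
        print("would push", len(labels), "tasks to try")

    def prefix_selector(prefix):
        """A toy selector: choose every task label starting with a prefix."""
        push_to_try([l for l in load_task_labels() if l.startswith(prefix)])

    prefix_selector("test-linux64")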

Future Considerations

Right now, the try_task_config.json method only allows specifying a list of task labels. This is
good enough to say what is running, but not how it should run. In the future, we could expand
this to be a dict where task labels make up the keys. The values would be extra task metadata that
the taskgraph module would know how to apply to the relevant tasks.
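
Purely as an illustration of that idea (nothing like this exists today, and the labels and keys are invented), such a file might look like:

```bash
$ cat try_task_config.json
{
  "test-linux64/opt-mochitest-e10s-1": {
    "env": {"MOZ_LOG": "example:5"}
  },
  "test-windows7-32/debug-reftest-2": {}
}
```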

With this scheme, we could do all sorts of crazy things like set prefs/env/arguments directly from
a try selector specialized to deal with those things. There are no current plans to implement any of
this, but it would definitely be a cool ability to have!

July 11, 2017

A development, read-only UI for Lando (the new automatic-landing
service) has been deployed.

Work is proceeding on matching viewing restrictions on Phabricator
revisions (review requests) to associated confidential bugs.

Work is proceeding on the internals of Lando to land Phabricator
revisions to the autoland Mercurial branch.

Pre-release of Phabricator, without Lando, targeted for mid-August.

General release of Phabricator and Lando targeted for late September
or early October.

MozReview and Splinter turned off in early December.

Work on Phabricator@Mozilla has been progressing well for the last
couple months. Work has been split into two areas:
Phabricator–Bugzilla integration and automatic landings.

Let me start with what’s live today:

Our Phabricator development instance is up at
https://mozphab.dev.mozaws.net/. We’ve completed and deployed a
Phabricator extension to use Bugzilla for authentication and identity;
on our mozphab.dev instance, this is tied to
bugzilla-dev.allizom.org. If you would like to poke around our
development instance, please be our guest! Note that it is a
development server, so we make no guarantees as to functionality, data
preservation, and such, as with bugzilla-dev. Also, if you
haven’t used bugzilla-dev in the last year or two (or ever), you’ll
either need to log in with GitHub or get an admin to reset your
password, since email is disabled on this server. Ping mcote or holler
in #bmo on IRC. I’ll have a follow-up post on exactly what’s involved in
using Bugzilla as an authentication and identity provider and how it
affects you.

The skeleton of our new automatic-landing service, called Lando, is
also deployed to development servers. While it doesn’t actually do any
landings yet, the UI has been fleshed out. It pulls the current status
of a “revision” (which is Phabricator’s term for a review request) and
displays relevant details. It is currently pulling live data from
mozphab.dev. This is what it looks like at the moment, although we
will continue to iterate on it:

What we’re working on now:

The other part of Bugzilla integration is ensuring that we can support
confidential revisions (review requests) in Phabricator tied to
confidential bugs in a seamless way. The goal is to have the set of
people who can view a confidential bug in Bugzilla be equal to the set
of people who can view any Phabricator revisions associated with that
bug. We knew that matching any third-party tool to Bugzilla’s
fine-grained authorization system would not be easy, but Phabricator
has proven even trickier to integrate than we anticipated. We have
implemented the code that sets the visibility appropriately for a new
revision, and we have the skeleton code for keeping it in sync, but
there are some holes in our implementation that we need to plug. We’re
continuing to dig into this and have set a goal to have a solid plan
within two weeks, with implementation to follow immediately.

In parallel, within Lando we are working on the logic to take a diff
from a Phabricator revision, verify the lander’s credentials and
permissions, and turn it into a commit on the autoland branch of
hg.mozilla.org. We have much of the first point done now, are
consulting with IT on the best solution for the second, and will be
starting work on the third shortly (which is actually the easiest,
since we’re leveraging pieces of MozReview’s Autoland service).

Launch plans:

At the point that we have completed the Bugzilla-integration work
described above, we’ll have what we need for a production Phabricator
environment integrated with Bugzilla. This is planned for
mid-August. We are calling this our pre-release launch, as Lando will
not be complete, but we will be inviting some teams to try out
Phabricator, to catch issues and frustrations before going to general
release. Lando and the general rollout of Phabricator to all Firefox
engineering will follow in late September or early October. We’ll have
some brownbags to introduce Phabricator and our integrations, and we
will ensure documentation is available and discoverable both for
general Phabricator usage and our customizations, including automatic
landings.

Due to the importance of the Firefox 57 release, Splinter and
MozReview will remain functional but will be considered
deprecated. New contributors should be directed to Phabricator to
avoid the frustration of having to switch processes. Splinter will be
turned off and MozReview will be moved to a read-only mode in early
December.

June 08, 2017

When a bug for an intermittent test failure needs attention, who should be contacted? Who is responsible for fixing that bug? For as long as I have been at Mozilla, I have heard people ask variations of this question, and I have never heard a clear answer.

There are at least two problematic approaches that are sometimes suggested:

The test author: Many test authors are no longer active contributors. Even if they are still active at Mozilla, they may not have modified the test or worked on the associated project for years. Also, making test authors responsible for their tests in perpetuity may dissuade many contributors from writing tests at all!

The last person to modify the test: Many failing tests have been modified recently, so the last person to modify the test may be well-informed about the test and may be in the best position to fix it. But recent changes may be trivial and tangential to the test. And if the test hasn’t been modified recently, this option may revert to the test author, or someone else who isn’t actively working in the area or is no longer familiar with the code.

There are at least two seemingly viable approaches:

“You broke it, you fix it”: The person who authored the changeset that initiated the intermittent test failure must fix the intermittent test failure, or back out their change.

The module owner for the module associated with the test is responsible for the test and must find someone to fix the intermittent test failure, or disable the test.

Let’s have a closer look at these options.

The “you broke it, you fix it” model is appealing because it is a continuation of a principle we accept whenever we check in code: If your change immediately breaks tests or is otherwise obviously faulty, you expect to have your change backed out unless you can provide an immediate fix. If your change causes an intermittent failure, why should it be treated differently? The sheriffs might not immediately associate the intermittent failure with your change, but with time, most frequent intermittent failures can be traced back to the associated changeset, by repeating the test on a range of changesets. Once this relationship between changeset and failure is determined, the changeset needs to be fixed or backed out.

A problem with “you broke it, you fix it” is that it is sometimes difficult and/or time-consuming to find the changeset that started the intermittent. The less frequent the intermittent, the more tests need to be backfilled and repeated before a statistically significant number of test passes can be accepted as evidence that the test is passing reliably. That takes time, test resources, etc.

Sometimes, even when that changeset is identified, it’s hard to see a connection between the change and the failing test. Was the test always faulty, but just happened to pass until a patch modified the timing or memory layout or something like that? That’s a possibility that always comes to mind when the connection between changeset and failing test is less than obvious.

Finally, if the changeset author is not invested in the test, or not familiar with the importance of the test, they may be more inclined to simply skip the test or mark it as failing.

The “module owner” approach is appealing because it reinforces the Mozilla module owner system: Tests are just code, and the code belongs to a module with a responsible owner. Practically, ‘mach file-info bugzilla-component <test-path>’ can quickly determine the bugzilla component, and nearly all bugzilla components now have triage owners (who are hopefully approved by the module owner and knowledgeable about the module).
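
For example (the path here is purely illustrative):

```bash
# Prints the Bugzilla product/component that the given file maps to
$ ./mach file-info bugzilla-component dom/media/test/test_example.html
```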

Module and triage owners ought to be more familiar with the failing test and the features under test than others, especially people who normally work on other modules. They may have a greater interest in properly fixing a test than someone who has only come to the test because their changeset triggered an intermittent failure.

Also, intermittent failures are often indicative of faulty tests: A “good” test passes when the feature under test is working, and it fails when the feature is broken. An intermittently failing test suggests the test is not reliable, so the test’s module owner should be ultimately responsible for improving the test. (But sometimes the feature under test is unreliable, or is made unreliable by a fault in another feature or module.)

A risk I see with the module owner approach is that it potentially shifts responsibility away from those who are introducing problems: If my patch is good enough to avoid immediate backout, any intermittent test failures I cause in other people’s modules is no longer my concern.

As part of the Stockwell project, :jmaher and I have been using a hybrid approach to find developers to work on frequent intermittent test failure bugs. We regularly triage, using tools like OrangeFactor to identify the most troublesome intermittent failures and then try to find someone to work on those bugs. I often use a procedure like this:

Does hg history show the test was modified just before it started failing? Ping the author of the patch that updated the test.

Can I retrigger the test a reasonable number of times to track down the changeset associated with the start of the failures? Ping the changeset author.

Does hg history indicate significant recent changes to the test by one person? Ask that person if they will look at the test, since they are familiar with it.

If all else fails, ping the triage owner.

This triage procedure has been a great learning experience for me, and I think it has helped move lots of bugs toward resolution sooner, reducing the number of intermittent failures we all need to deal with, but this doesn’t seem like a sustainable mode of operation. Retriggering to find the regression can be especially time consuming and is sometimes not successful. We sometimes have 50 or more frequent intermittent failure bugs to deal with, we have limited time for triage, and while we are bisecting, the test is failing.

I’d much prefer a simple way of determining an owner for problematic intermittents…but I wonder if that’s realistic. While I am frustrated by the times I’ve tracked down a regressing changeset only to find that the author feels they are not responsible, I have also been delighted to find changeset authors who seem to immediately see the problem with their patch. Test authors sometimes step up with genuine concern for “their” test. And triage owners sometimes know, for instance, that a feature is obsolete and the test should be disabled. So there seems to be some value in all these approaches to finding an owner for intermittent failures…and none of the options are perfect.

When a bug for an intermittent test failure needs attention, who should be contacted? Who is responsible for fixing that bug? Sorry, no clear answer here either! Do you have a better answer? Let me know!

May 23, 2017

Today, as I was working
on importing and building geckodriver
in mozilla-central,
I found myself head first down a rabbit hole
trying to figure out why the mach try command
we use for testing changesets in continuous integration
complained that I didn’t have git-cinnabar installed:

% ./mach try -b do -p all -u all -t none
mach try is under development, please file bugs blocking 1149670.
ERROR git-cinnabar is required to push from git to try with the autotry command.
More information can by found at https://github.com/glandium/git-cinnabar

As I’ve explained previously,
git-cinnabar
is a git extension
for interacting with remote Mercurial repositories.
It’s a godsend written by fellow Mozillian Mike Hommey
that lets me do my work
without getting my hands dirty with hg.

As one might suspect,
mach try uses whereis(1)
to look for the git-cinnabar binary.
However, as it is a helper program
that is not meant to be invoked directly,
but rather be despatched through git cinnabar (without hyphen),
it gets installed into /usr/local/libexec/git-core.
Since I had never heard about libexec before,
I decided to do some research.

libexec is meant
for system daemons and system utilities
executed by other programs.
That is, the binaries put in this namespaced directory
are meant for the consumption of other programs,
and are not intended to be executed directly by users.

/usr/libexec includes internal binaries
that are not intended to be executed directly by users or shell scripts.
Applications may use a single subdirectory
under /usr/libexec.

On my preferred Linux system, Debian,
there is apparently also /usr/local/libexec,
which as far as I understand
is meant to complement /usr/libexec
in the same way that /usr/local
complements /usr.
It provides a tertiary hierarchy for local data and programs
that may be shared amongst hosts
and that are safe from being overwritten
when the system is upgraded.
This is exactly what I want,
since I installed git-cinnabar from source.

It was somewhat surprising to me
to find that whereis(1)—or
at least the util-linux version of it—does
not provide a flag for searching auxiliary support programs
located in libexec,
when it is capable
of searching for manuals and sources,
in addition to executable binaries:
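
For comparison, these are the kinds of lookups the util-linux whereis does offer (a quick sketch; the binary search comes up empty precisely because libexec is not in its search path):

```bash
$ whereis -b git-cinnabar   # search binaries only
$ whereis -m git-cinnabar   # search manual pages
$ whereis -s git-cinnabar   # search sources
```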

To further complicate things,
git itself has no option for emitting where
it finds its internally-called programs from.
It does have an --exec-path flag
that tells you where internal git commands are kept,
but not where optionally installed git extensions are.

I think the fix to my problem
is to tell mach try to include
both /usr/libexec/git-core
and /usr/local/libexec/git-core
in the search path when it looks for git extensions,
but maybe there is a more elegant way
to check if git has a particular subcommand available?
Certainly it’s conceivable to just call git cinnabar --help or similar
and check its exit code.
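
A rough sketch of that idea in shell form, assuming the subcommand exits non-zero when the extension is missing:

```bash
if git cinnabar --help >/dev/null 2>&1; then
    echo "git-cinnabar is available"
else
    echo "git-cinnabar is not installed" >&2
fi
```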

April 24, 2017

On Friday I had the unfortunate pleasure of taking the brunt of an unhappy Selenium user's frustration. Their issue? My team said that a release of GeckoDriver would happen when we are confident in the code. They said that was not professional. They started by telling me that they contribute to Mozilla and this is not acceptable for them as a customer.

Below is a breakdown of why I took exception to this:

My team was being extremely professional. Software, by its very nature, has bugs, but we try to minimize the number of bugs we ship. To do this we don't set release dates; we set certain objectives. My team is relatively small compared to the user group it needs to service, both inside and outside of Mozilla, so we need to triage bugs and fix code. Saying we can only release when it is ready is the best we can do.

Please don't ever tell open source maintainers you are their customer unless you are paying for support and you have a contract with SLAs. So that there is no issue with the definition of customer, I suggest you look at Merriam-Webster's definition. It says "one that purchases a commodity or service". Mozilla, just like Google, Microsoft, and Apple, is working on WebDriver to help web developers. There is no monetary benefit from doing this. The same goes for the Selenium project. The work and products are given freely.

And finally, and this goes for any F/OSS project even if it comes from large corporations like Google or Facebook, never make demands. Ask how you can help instead. If you disagree with the direction of the project, fork it. Make your own project. They have given everything away for free. Take it, make it better for whatever better means for you.

Now, even after explaining this, the harassment continued. It has led to that user being blocked on social media by me and my team, as well as being blocked on GitHub. I really dislike blocking people, because I know that when they approach us they are frustrated, but taking that frustration out on my team doesn't help anyone. If you continue after being warned, you will be blocked. This is not a threat, this is a promise.

Next time you feel frustrated with open source ask the maintainers if you can donate time/money/resources to make their lives easier. Don't be the moron that people will instantly block.

April 11, 2017

I am 1 week late in posting the update for Project Stockwell. This wraps up a full quarter of work. After a lot of concerns raised by developers about a proposed new backout policy, we moved on and didn’t change too much although we did push a little harder and I believe we have disabled more than we fixed as a result.

Let's look at some numbers:

| Week Starting | 01/02/17 | 02/27/17 | 03/24/17 |
| --- | --- | --- | --- |
| Orange Factor | 13.76 | 9.06 | 10.08 |
| # P1 bugs | 42 | 32 | 55 |
| OF (P2) | 7.25 | 4.78 | 5.13 |

As you can see, all the numbers increased in March, but overall there has been a great decrease so far in 2017.

There have been a lot of failures which have lingered for a while and which are not specific to a test. For example:

* windows 8 talos has a lot of crashes (work is being done in bug 1345735)
* infrastructure issues and tier-3 jobs
* and a few other leaks/timeouts/crashes/harness issues unrelated to a specific test

While these are problematic, we see the overall failure rate going down. In all the other bugs, where the test is clearly the problem, we have seen many fixes and responses from test owners and developers. It is rare that we suggest disabling a test and it is not agreed upon; when there was concern, we had a reasonable solution to reduce or fix the failure.

Speaking of which, we have been tracking total bugs, fixed, disabled, etc. with whiteboard tags. While there was a request not to use “stockwell” in the whiteboard tags and to make them more descriptive, after discussing this with many people we couldn’t come to agreement on names, what to track, and what we would do with the data, so for now we have kept them the same. Here is some data:

|  | 03/07/17 | 04/11/17 |
| --- | --- | --- |
| total | 246 | 379 |
| fixed | 106 | 170 |
| disabled | 61 | 91 |
| infrastructure | 11 | 17 |
| unknown | 44 | 60 |
| needswork | 24 | 38 |
| % disabled | 36.53% | 34.87% |

What is interesting is that prior to March we had disabled 36.53% of the resolved bugs (disabled / (disabled + fixed)), but in March, when we were more “aggressive” about disabling tests, the overall percentage went down. In fact this is a cumulative number for the year; for the month of March alone we only disabled 31.91% of the resolved bugs (30 newly disabled vs. 64 newly fixed). Possibly, if we had disabled a few more tests, the overall numbers would have continued to go down instead of ticking slightly up.

A lot of changes took place on the tree in the last month, some interesting data on newer jobs:

taskcluster windows 7 tests are tier-2 for almost all windows VM tests

autophone is running all media tests which are not crashing or perma failing

Regarding our tests, we are working on tracking new tests added to the tree, what components they belong in, what harness they run in, and overall how many intermittents we have for each component and harness. Some preliminary work shows that we added 942 mochitest*/xpcshell tests in Q1 (609 were imported webgl tests, so we wrote 333 new tests, 208 of those are browser-chrome). Given the fact that we disabled 91 tests and added 942, we are not doing so bad!

Looking forward into April and Q2, I do not see immediate changes to the policy being needed; maybe in May we can finalize a policy and make it more formal. With the recent re-org, we are now in the Product Integrity org. This is a good fit, but dedicating full-time resources to sheriffing and tooling for the sake of Project Stockwell is not in the mission. Some of the original work will continue as it serves many purposes. We will be looking to formalize some of our practices and tools to make this a repeatable process, to ensure that progress can still be made towards reducing intermittents (we want <7.0), and to create a sustainable ecosystem for managing these failures and getting fixes in place.

March 21, 2017

As with MozReview, Conduit is being designed to operate on
changesets. Since the end result of work on a codebase is a changeset,
it makes sense to start the process with one, so all the necessary
metadata (author, message, repository, etc.) are provided from the
beginning. You can always get a plain diff from a changeset, but you
can’t get a changeset from a plain diff.

Similarly, we’re keeping the concept of a logical series of
changesets. This encourages splitting up a unit of work into
incremental changes, which are easier to review and to test than large
patches that do many things at the same time. For more on the benefits
of working with small changesets, a few random articles are
Ship Small Diffs, Micro Commits, and
Large Diffs Are Hurting Your Ability To Ship.

In MozReview, we used the term commit series to refer to a set of one
or more changesets that build up to a solution. This term is a bit
confusing, since the series itself can have multiple revisions, so you
end up with a series of revisions of a series of changesets. For
Conduit, we decided to use the term topic instead of commit series,
since the commits in a single series are generally related in some
way. We’re using the term iteration to refer to each update of a
topic. Hence, a solution ends up being one or more iterations on a
particular topic. Note that the number of changesets can vary from
iteration to iteration in a single topic, if the author decides to
either further split up work or to coalesce changesets that are
tightly related. Also note that naming is hard, and we’re not
completely satisfied with “topic” and “iteration”, so we may change
the terminology if we come up with anything better.

As I noted in my last post, we’re working on the push-to-review part
of Conduit, the entrance to what we sometimes call the commit
pipeline. However, technically “push-to-review” isn’t accurate, as the
first process after pushing might be sending changesets to Try for
testing, or static analysis to get quick automated feedback on
formatting, syntax, or other problems that don’t require a human to
look at the code. So instead of review repository, which we’ve used in
MozReview, we’re calling it a staging repository in the Conduit world.

Along with the staging repository is the first service we’re building,
the commit index. This service holds the metadata that binds
changesets in the staging repo to iterations of topics. Eventually, it
will also hold information about how changesets moved through the
pipeline: where and when they were landed, if and when they were
backed out, and when they were uplifted into release branches.

Unfortunately a simple “push” command, whether from Mercurial or from
Git, does not provide enough information to update the commit
index. The main problem is that not all of the changesets the author
specifies for pushing may actually be sent. For example, I have three
changesets, A, B, and C, and pushed them up previously. I then update
C to make C′ and push again. Despite all three being in the “draft”
phase (which is how we differentiate work in progress from changes
that have landed in the mainline repository), only C′ will actually be
sent to the staging repo, since A and B already exist there.

Thus, we need a Mercurial or Git client extension, or a separate
command-line tool, to tell the commit index exactly what changesets
are part of the iteration we’re pushing up—in this example, A, B, and
C′. When it receives this information, the commit index creates a new
topic, if necessary, and a new iteration in that topic, and records
the data in a data store. This data will then be used by the review
service, to post review requests and provide information on reviews,
and by the autoland service, to determine which changesets to
transplant.
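
As a rough illustration of what such a tool might do on the Mercurial side (a sketch only, not the actual Conduit client), it could enumerate the draft ancestors of the working copy and report them, oldest first, to the commit index:

```bash
# A, B and C′ in the example above would all be listed here,
# even though only C′ needs to be sent to the staging repo.
$ hg log -r 'sort(draft() and ancestors(.), rev)' -T '{node|short} {desc|firstline}\n'
```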

The biggest open question is how to associate a push with an existing
topic. For example, locally I might be working on two bugs at the same
time, using two different heads, which map to two different
topics. When I make some local changes and push one head up, how does
the commit index know which topic to update? Mercurial bookmarks,
which are roughly equivalent to Git branch names, are a possibility,
but as they are arbitrarily named by the author, collisions are too
great a possibility. We need to be sure that each topic is unique.

Another straightforward solution is to use the bug ID, since the vast
majority of commits to mozilla-central are associated with a bug in
BMO. However, that would restrict Conduit to one topic per bug,
requiring new bugs for all follow-up work or work in parallel by
multiple developers. In MozReview, we partially worked around this by
using an “ircnick” parameter and including that in the commit-series
identifiers, and by allowing arbitrary identifiers via the --reviewid
option to “hg push”. However this is unintuitive, and it still
requires each topic to be associated with a single bug, whereas we
would like the flexibility to associate multiple bugs with a single
topic. Although we’re still weighing options, likely an intuitive and
flexible solution will involve some combination of commit-message
annotations and/or inferences, command-line options, and interactive
prompts.

March 14, 2017

Autoland

We kicked off Conduit work in January starting with the new Autoland
service. Right now, much of the Autoland functionality is located in
the MozReview Review Board extension: the permissions model, the
rewriting of commit messages to reflect the reviewers, and the user
interface. The only part that is currently logically separate is the
“transplant service”, which actually takes commits from one repo
(e.g. reviewboard-hg) and applies them to another (e.g. try,
mozilla-central). Since the goal of Conduit is to decouple all the
automation from code-review tools, we have to take everything
that’s currently in Review Board and move it to new, separate
services.

The original plan was to switch Review Board over to the new Autoland
service when it was ready, stripping out all the old code from the
MozReview extension. This would mean little change for MozReview
users (basically just a new, separate UI), but would get people using
the new service right away. After Autoland, we’d work on the
push-to-review side, hooking that up to Review Board, and then extend
both systems to interface with BMO. This strategy of incrementally
replacing pieces of MozReview seemed like the best way to find bugs as
we went along, rather than a massive switchover all at once.

However, progress was a bit slower than we anticipated, largely due to
the fact that so many things were new about this project (see below).
We want Autoland to be fully hooked up to BMO by the end of June, and
integrating the new system with both Review Board and BMO as we went
along seemed increasingly like a bad idea. Instead, we decided to put
BMO integration first, and then follow with Review Board later (if
indeed we still want to use Review Board as our rich-code-review
solution).

This presented us with a problem: if we wouldn’t be hooking the new
Autoland service up to Review Board, then we’d have to wait until the
push service was also finished before we hooked them both up to BMO.
Not wanting to turn everything on at once, we pondered how we could
still launch new services as they were completed.

Moving to the other side of the pipeline

The answer is to table our work on Autoland for now and switch to the
push service, which is the entrance to the commit pipeline. Building
this piece first means that users will be able to push commits to BMO
for review. Even though they would not be able to Autoland them right
away, we could get feedback and make the service as easy to use as
possible. Think of it as a replacement for bzexport.

Thanks to our new Scrum process (see also below), this priority
adjustment was not very painful. We’ve been shipping Autoland code
each week, so, while it doesn’t do much yet, we’re not abandoning any
work in progress or leaving patches half finished. Plus, since this
new service is also being started from scratch (although involving lots
of code reuse from what’s currently in MozReview), we can apply the
lessons we learned from the last couple months, so we should be
moving pretty quickly.

Newness

As I mentioned above, although the essence of Conduit work right now
is decoupling existing functionality from Review Board, it involves a
lot of new stuff. Only recently did we realize exactly how much new
stuff there was to get used to!

New team members

We welcomed Israel Madueme to our team in January and threw him right
into the thick of things. He’s adapted tremendously well and started
contributing immediately. Of course a new team member means new team
dynamics, but he already feels like one of us.

Just recently, we’ve stolen dkl from the BMO team, where he’s been
working since joining Mozilla 6 years ago. I’m excited to have a
long-time A-Teamer join the Conduit team.

A new process

At the moment we have five developers working on the new Conduit
services. This is more people on a single project than we’re usually
able to pull together, so we needed a process to make sure we’re
working to our collective potential. Luckily one of us is a certified
ScrumMaster. I’ve never actually experienced Scrum-style development
before, but we decided to give it a try.

I’ll have a lot more to say about this in the future, as we’re only
just hitting our stride now, but it has felt really good to be working
with solid organizational principles. We’re spending more time in
meetings than usual, but it’s paying off with a new level of focus and
productivity.

A new architecture

Working within Review Board was pretty painful, and the MozReview
development environment, while amazing in its breadth and coverage,
was slow and too heavily focussed on lengthy end-to-end tests. Our new
design follows more of a microservice-based approach. The Autoland
verification system (which checks users permissions and ensures
that commits have been properly reviewed) is a separate service, as is
the UI and the transplant service (as noted above, this last part was
actually one of the few pieces of MozReview that was already
decoupled, so we’re one step ahead there). Similarly, on the other
side of the pipeline, the commit index is a separate service, and the
review service may eventually be split up as well.

We’re not yet going whole-hog on microservices—we don’t plan, for
starters at least, to have more than 4 or 5 separate services—but we’re
already benefitting from being able to work on features in parallel
and preventing runaway complexity. The book Building Microservices
has been instrumental to our new design, as well as pointing out
exactly why we had difficulties in our previous approach.

New operations

As the A-Team is now under Laura Thomson, we’re taking advantage of
our new, closer relationship to CloudOps to try a new deployment and
operations approach. This has freed us of some of the constraints of
working in the data centre while letting us take advantage of a proven
toolchain and process.

New technologies

We’re using Python 3.5 (and probably 3.6 at some point) for our new
services, which I believe is a first for an A-Team project. It’s new
for much of the team, but they’ve quickly adapted, and we’re now
insulated against the 2020 deadline for Python 2, as well as
benefitting from the niceties of Python 3 like better Unicode support.

We also used a few technologies for the Autoland service that are new
to most of the team: React and Tornado. While the team found
it interesting to learn them, in retrospect using them now was
probably a case of premature optimization. Both added complexity that
was unnecessary right now. React’s URL routing was difficult to get
working in a way that seamlessly supported a local, Docker-based
development environment and a production deployment scenario, and
Tornado’s asynchronous nature led to extra complexity in automated
tests. Although they are both fine technologies and provide scalable
solutions for complex apps, the individual Conduit services are
currently too small to really benefit.

We’ve learned from this, so we’re going to use Flask as the back
end for the push services (commit index and review-request generator),
for now at least, and, if we need a UI, we’ll probably use a
relatively simple template approach with JavaScript just for
enhancements.

Next

In my next post, I’m going to discuss our approach to the push
services and more on what we’ve learned from MozReview.

March 07, 2017

Over the last month we had a higher rate of commits, failures, and fixes. One large thing is that we turned on stylo specific tests and that was a slightly rocky road. Last month we suggested disabling tests after 2 weeks of seeing the failures. We ended up disabling many tests, but fixing many more.

In addition to more disabling of tests, we implemented a set bugzilla whiteboard entries to track our progress:
* [stockwell fixed] – a fix went in (even if it partially fixed the problem)
* in the last 2 months, we have 106
* [stockwell disabled] – we disabled the test in at least one config and no fix
* in the last 2 months, we have 61
* [stockwell infra] – Infra issues are usually externally driven
* in the last 2 months, we have 11
* [stockwell unknown] – this became less intermittent with no clear reason
* in the last 2 months, we have 44
* [stockwell needswork] – bugs in progress
* in the last 2 months, we have 24

We have also been tracking the orange factor and number of high frequency intermittents:

| Week starting | Jan 02, 2017 | Jan 30, 2017 | Feb 27, 2017 |
| --- | --- | --- | --- |
| Orange Factor (OF) | 13.76 | 10.75 | 9.06 |
| # priority intermittents | 42 | 61 | 32 |
| OF – priority intermittents | 7.25 | 5.78 | 4.78 |

I added a new row here, tracking the Orange Factor assuming all of the high frequency intermittent bugs didn’t exist. This is what the long tail looks like and I am really excited to see that number going down over time. For me a healthy spot would be OF <5.0 and the long tail <3.0.

We also looked at the number of unique bugs and repeat bugs/week. Most bugs have a lifecycle of 2 weeks and 2/3 of the bugs we see in a given week were high frequency (HF) the week prior. For example this past week we had 32 HF bugs and 21 of them were from the previous week (11 were still HF 2 weeks prior).

While it is nice to assume we should just disable all tests, we find that many developers are actively working on these issues, and the data shows that we have many more fixed bugs than disabled bugs. The main motivation for disabling tests is to reduce the confusion for developers on try and to reduce the work the sheriffs need to do. Taking this data into account, we are looking to adjust our policy for disabling slightly:

all high frequency bugs (>=30 times/week) will be triaged and expected to be resolved in 2 weeks, otherwise we will start the process of disabling the test that is causing the bug

if a bug occurs >75 times/week, it will be triaged but expectations are that it will be resolved in 1 week, otherwise we will start the process of disabling the test that is causing the bug

if a bug is reduced below a high frequency (< 30 times/week), we will be happy to make a note of that and keep an eye on it- but will not look at disabling the test.

The big change here is that we will be more serious about disabling tests, specifically when a test fails >= 75 times/week. We have had many tests failing at least 50% of the time for weeks; these show up on almost all try pushes that run these tests. Developers should not be seeing failures like these. Since we are tracking fixed vs disabled, if we determine that we are disabling too much, we can revisit this policy next month.

Outside of numbers and policy, our goal is to have a solid policy, process, and toolchain available for self triaging as the year goes on. We are refining the policy and process via manual triage. The toolchain is the other work we are doing, here are some updates:

adding BUG_COMPONENTS to all files in m-c (bug 1328351) – slow and steady progress, thanks for the reviews to date! We fell behind while getting SETA completed, but much of the heavy lifting is already done

retrigger an existing job with additional debugging arguments (bug 1322433) – main discussion is done, figuring out small details, we have a prototype working with little work remaining. Next steps would be to implement the top 3 or 4 use cases.

add a test-lint job to linux64/mochitest (bug 1323044) – no progress yet- this got put on the backburner as we worked on SETA and focused on triage, whiteboard tags, and BUG_COMPONENTS. We have landed code for using the ‘when’ clause for test jobs (bug 1342963) which is a small piece of this. Getting this initially working will move up in priority soon, and making this work on all harnesses/platforms will most likely be a Google Summer of Code project.

Are there items we should be working on or looking into? Please join our meetings.

February 28, 2017

Imagine this scenario. You've pushed a large series of commits to your favourite review tool
(because you are a believer in the glory of microcommits). The reviewer however has found several
problems, and worse, they are spread across all of the commits in your series. How do you fix all
the issues with minimal fuss while preserving the commit order?

If you were using the builtin histedit extension, you might make temporary "fixup" commits for
each commit that had issues. Then after running hg histedit you'd roll them up into their
respective parent. Or if you were using the evolve extension (which I definitely recommend),
you might do something like this:

```bash
$ hg update 1

# fix issues in commit 1

$ hg amend
$ hg evolve

# fix issues in commit 2

$ hg amend
$ hg evolve
# etc.
```

Both methods are serviceable, but involve jumping through some hoops to accomplish. Enter a
new extension from Facebook called absorb. The absorb extension will take each change in your
working directory, figure out which commits in your series modified that line, and automatically
amend the change to that commit. If there is any ambiguity (i.e. multiple commits modified the same
line), then absorb will simply ignore that change and leave it in your working directory to be
resolved manually. So instead of the rather convoluted processes above, you can do this:

```bash
# fix all issues across all commits

$ hg absorb
```

It's magic!

Installing Absorb

There's one big problem. The docs in the hg-experimental repo (where absorb lives) are
practically non-existent, and installation is a bit of a pain. So here are the steps I took to get
it working on Fedora. They won't hand hold you for other platforms, but they should at least point
you in the right direction.

First, clone the hg-experimental repo:

```bash
$ hg clone https://bitbucket.org/facebook/hg-experimental
```

Absorb depends on a compiled python module called linelog which also lives in hg-experimental.
In order to compile linelog, you'll need some dependencies:

```bash
$ sudo pip install cython
$ sudo dnf install python-devel
```

Edit: Previously I had lz4-devel and openssl-devel listed as dependencies, but as junw notes, that's only needed if you are compiling the whole hg-experimental repo (by omitting the --component flag below). Though it looks like lz4 might still be needed on OSX.

Make sure the cython dependency gets installed to the same python your mercurial install uses.
That may mean dropping the sudo from the pip command if you have mercurial running in user space.
Next, compile the hg-experimental repo by running:
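
The exact invocation may have changed since this was written, but based on the --component flag mentioned above it was along these lines (the component name is my assumption):

```bash
$ cd hg-experimental
$ sudo python setup.py install --component absorb
```

After that, the extension still needs to be enabled in the [extensions] section of your ~/.hgrc.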

February 23, 2017

I've previously blogged about why I believe try syntax is an antiquated development process
that should be replaced with something more modern and flexible. What follows is a series of ideas
that I'm trying to convert into a concrete plan of action to bring this about. This is not an
Intent to Implement or anything like that, but my hope is that this outline is detailed enough
that it could be used as a solid starting point by someone with enough time and motivation to
work on it.

This plan of action will operate on the rather large assumption that all tasks are scheduled with
taskcluster (either natively or over BuildbotBridge). I also want to be clear that I'm not talking
about removing try syntax completely. I simply think it should be parsed client side, before any
changes get pushed to try.

Brief Overview of How Try Syntax Currently Works in Taskcluster

In order to understand where we're going, I think it's important to be aware of where we're coming
from. This is a high level explanation of how a try syntax string currently gets turned into running
tasks:

A developer pushes a commit to try with a 'try:' token somewhere in the commit message.

The pushlog mercurial extension picks this up on the server, and publishes a JSON stream.

A node service (mozilla-taskcluster) downloads this stream and substitutes the commit message into the decision task's configuration as a template variable, which ultimately
ends up getting passed into mach taskgraph decision with the --message parameter.

The decision task kicks off the taskgraph generation process. When it comes time to
optimize, the try syntax is finally passed into the TryOptionSyntax parser, which
filters out tasks that don't match any of the try options.

The optimized task graph is then submitted to taskcluster, and the relevant tasks start running
on try.

An Improved Data Transport

A key thing to realize is that the decision task runs from within a mozilla-central clone. In other
words the try syntax string starts in version control, gets published to a webserver, gets
downloaded by a node module, gets substituted into a task configuration, only to be passed into a
process that had full access to the original version control all along. Steps 2-5 in the previous
section could be replaced with:

The decision task extracts the try syntax from the appropriate commit message.

If we stopped there, this change wouldn't be worth making. It might make some code a bit cleaner,
but would hardly make things faster or more efficient since mozilla-taskcluster would still need to
query the pushlog either way. But this method has another, more important benefit: it gives the
decision task access to the entire commit instead of limiting it to whatever the pushlog extension
decides to publish.

That means there would be no particular reason we'd need to store try syntax strings in the commit
message at all. We could instead stuff it into the commit as arbitrary metadata using the commit's
extra field. To get this working, we could use the push-to-try extension to stuff the try
syntax into the extra field like this. Then the decision task could extract that syntax out
of the commit metadata like this:

```bash
$ hg log -r $GECKO_HEAD_REV -T "{extras}"
```

An Improved Data Format

Again, these changes mostly amount to a refactoring and wouldn't be worth making just for the sake
of it. But once we are using arbitrary commit metadata to pass information to the decision task,
there's no reason for us to limit ourselves to a single line syntax string. We could use data
structures of arbitrary complexity.

One possibility (which I'll run with for the rest of the post), is simply to use a list of
taskcluster task labels as the data format. This has several advantages:

It's unambiguous (what is passed in, is what will be scheduled)

It's an easy target for tools to generate to

It provides flexibility in how we could potentially interact with try (via said tools)

The last two points are pretty big. Have you ever attempted to write a tool that tries to convert
inputs into try syntax? It's very hard, and involves lots of hard coding in the tool and
memorization on the part of the users.

What we've done to this point is transform the data transport from a human friendly format to a
machine friendly format on top of which human friendly tools can be built. Probably the first tool
that will need to be built, will be a legacy try syntax specifier for those of us who enjoy writing
out try syntax strings. But that's not very interesting. There are probably a hundred different ways
we could dream of specifying tasks, but because my imagination is limited, I'll just talk about one
potential idea.

Fuzzy Finding Tasks

I've recently discovered and become a huge fan of fuzzyfinder. Fuzzyfinder the project
consists of two parts:

A binary called fzf

A vast multitude of shell and editor integrations that utilize fzf

The integrations allow you to quickly find things like file paths, processes and shell history (both
on the terminal or within an editor) with an intelligent approximate matching algorithm at
blazing speeds. While the integrations are insanely useful, it's the binary itself that would come
in useful here.

The fzf binary is actually quite simple. It receives a list of strings through stdin, allows the
user to select one or more of them using the fuzzy finding algorithm and a text based gui, then
prints all selected strings to stdout. The input is completely arbitrary, for example, I could
fuzzy select running processes with:

```bash
$ ps -ef | fzf
```

Or lines in a file:

```bash
$ cat foo.txt | fzf
```

Or the numbers 1-100:

```bash
$ seq 100 | fzf
```

You get the idea. The other day I was thinking, what if we could pipe a list of every single task,
expanded over both chunks and platforms, into fzf? How useful would that be? Luckily, a list of
all taskcluster tasks can be generated with a mach command, so it was easy to test this out:
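
The commands were along these lines (the exact mach subcommand and file names here are a reconstruction):

```bash
$ ./mach taskgraph target --parameters parameters.yml > target_tasks.txt
$ cat target_tasks.txt | fzf -m
```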

The parameters.yml file can be downloaded from any decision task on treeherder. I piped the task list into a
file because the mach taskgraph command takes a bit of time to complete; it's not a penalty we'd
want to incur on subsequent runs unless it was necessary. The -m tells fzf to allow
multi-selection.

The results were wonderful. But rather than try to describe how awesome the new (potential) try
choosing experience was, I created a demo. In this scenario, pretend I want to select all linux32
opt mochitest-chrome tasks:

Now instead of printing the task labels to stdout, imagine if this theoretical try chooser stuffed
that output into the commit's metadata. This is the last piece of the puzzle, to what I believe is a
comprehensive outline towards a viable replacement for try syntax.

No Plan Survives Breakfast

As Mike Conley is fond of saying, no plan survives breakfast. I'm sure my outline here is full of
holes that will need to be patched, but I think (hope) that at least the overall direction is solid.
While I'd love to work on this, I won't have the time or mandate to do so until later in the year.
With this post I hope to accomplish three things:

Serve as a brain dump so when (or if) I do get back to it, I'll remember everything

Motivate others to push in this direction in the meantime (or better yet, implement the whole
thing!)

Provide an excuse to plug fuzzyfinder. It's been months and using it still makes me giddy.
Seriously, give it a try, you'll be glad you did!

Let me know if you have any feedback, and especially if you have any other crazy ideas for selecting
tasks on try!

February 19, 2017

Hello from Dublin! Yesterday I had the privilege of attending KatsConf2, a functional programming conference put on by the fun-loving, welcoming, and crazy-well-organized @FunctionalKats. It was a whirlwind of really exciting talks from some of the best speakers around. Here’s a glimpse into what I learned.

There’s no such thing as an objectively perfect programming language: all languages make tradeoffs. But it is possible to find/design a language that’s more perfect for you and your project’s needs.

I took a bunch of notes during the talks, in case you’re hungering for more details. But @jessitron took amazing graphical notes that I’ve linked to in the talks below, so just go read those!

And for the complete experience, check out this storify of the #KatsConf2 tweets, put together by Vicky Twomey-Lee, who led a great ally skills workshop the evening before the conference:

View the story “Kats Conf 2” on Storify: http://storify.com/whykay/kats-conf-2

Hopefully this gives you an idea of what was said and which brain-exploding things you should go look up now! Personally it opened up a bunch of cans of worms for me - definitely a lot of the material went over my head, but I have a ton of stuff to go find out more (i.e. the first thing) about.

Disclaimer: The (unedited!!!) notes below represent my initial impressions of the content of these talks, jotted down as I listened. They may or may not be totally accurate, or precisely/adequately represent what the speakers said or think, and the code examples are almost certainly mistake-ridden. Read at your own risk!

The origin story of FunctionalKats

FunctionalKatas => FunctionalKats => (as of today) FunctionalKubs

Meetups in Dublin & other locations

Katas for solving programming problems in different functional languages

“MS Excel!”
“Nobody wants to say ‘JavaScript’ as a joke?”
“Lisp!”
“I know there are Clojurians in the audience, they’re suspiciously silent…”

There’s no such thing as the perfect language; Languages are about compromise.

What the perfect language actually is is a personal thing.

I get paid to make whatever products I feel like to make life better for programmers. So I thought: I should design the perfect language.

What do I want in a language?

It should be hard to make mistakes

On that note let’s talk about JavaScript.
It was designed to be easy to get into, and not to place too many restrictions on what you can do.
But this means it’s easy to make mistakes & get unexpected results (cf. crazy stuff that happens when you add different things in JS).
By restricting the types of inputs/outputs (see TypeScript), we can throw errors for incorrect input types - error messages may look like the compiler yelling at you, but really they’re saving you a bunch of work later on by telling you up front.

There should be no ambiguity

Pony: 1 + (2 * 3) – have to use parentheses to make precedence explicit

It shouldn’t make you think

Joe made a language at Ericsson in the late 80’s called “Erlang”. This is a gif of Joe from the Erlang movie. He’s my favorite movie star.

Immutability: In Erlang, values and variable bindings never change. At all.

This takes away some cognitive overhead (because we don’t have to think about what value a variable has at the moment)

Erlang tends to essentially fold over state: the old state is an input to the function and the new state is an output.

The “abstraction ceiling”

This term has to do with being able to express abstractions in your language.

Those of you who don’t know C: you don’t know what you’re missing, and I urge you not to find out.
If garbage collection is a thing you don’t have to worry about in your language, that’s fantastic.

Elm doesn’t really let you abstract over the fact that e.g. map over array, list, set is somehow the same type of operation. So you have to provide 3 different variants of a function that can be mapped over any of the 3 types of collections.
This is a bit awkward, but Elm programmers tend not to mind, because there’s a tradeoff: the fact that you can’t do this makes the type system simple so that Elm programmers get succinct, helpful error messages from the compiler.

I was learning Rust recently and I wanted to be able to express this abstraction. If you have a Collection trait, you can express that you take in a Collection and return a Collection. But you can’t specify that the output Collection has to be the same type as the incoming one. Rust doesn’t have this ability to deal with this, but they’re trying to add it.

We can do this in Haskell, because we have functors. And that’s the last time I’m going to use a term from category theory, I promise.

On the other hand, in a language like Lisp you can use its metaprogramming capabilities to raise the abstraction ceiling in other ways.

Efficiency

I have a colleague and when I suggested using OCaml as an implementation language for our utopian language, she rejected it because it was 50% slower than C.

In slower languages like Python or Ruby you tend to have performance-critical code written in the lower-level language of C.

But my feeling is that in theory, we should be able to take a language like Haskell and build a smarter compiler that can be more efficient.

But the problem is that we’re designing languages that are built on the lambda calculus and so on, but the machines they’re implemented on are not built on that idea, but rather on the Von Neumann architecture. The computer has to do a lot of contortions to take the beautiful lambda calculus idea and convert it into something that can run on an architecture designed from very different principles. This obviously complicates writing a performant and high-level language.

Rust wanted to provide a language as high-level as possible, but with zero-cost abstractions. So instead of garbage collection, Rust has a type-system-assisted kind of clean up. This is easier to deal with than the C version.

If you want persistent data structures a la Erlang or Clojure, they can be pretty efficient, but simple mutation is always going to be more efficient. We couldn’t do PDSs natively.

Suppose you have a language that’s low-level enough to have zero-cost abstractions, but you can plug in something like garbage collection, currying, perhaps extend the type system, so that you can write high-level programs using that functionality, but it’s not actually part of the library. I have no idea how to do this but it would be really cool.

Summing up

You need to think about:

Ergonomics

Abstraction

Efficiency

Tooling (often forgotten at first, but very important!)

Community (Code sharing, Documentation, Education, Marketing)

Your language has to be open source. You can make a proprietary language, and you can make it succeed if you throw enough money at it, but even the successful historical examples of that were eventually open-sourced, which enabled their continued use. I could give a whole other talk about open source.

Functional programming & static typing for server-side web

Oskar Wickström @owickstrom

FP has been influencing JavaScript a lot in the last few years. You have ES6 functional features, libraries like Underscore, Rambda, etc, products like React with FP/FRP at their core, JS as a compile target for functional languages

But the focus is still client-side JS.

Single page applications: using the browser to write apps more like you wrote desktop apps before. Not the same model as perhaps the web browser was intended for at the beginning.

Lots of frameworks to choose from: Angular, Ember, Meteor, React et al. Without JS on the client, you get nothing.

There’s been talk recently of “isomorphic” applications: one framework which runs exactly the same way on the server and the client. The term is sort of stolen & not used in the same way as in category theory.

Static typing would be really useful for middleware, which is a common abstraction but very easy to mess up if dynamically typed. In Clojure, if you mess up the middleware you get the Stack Trace of Doom.

Let’s use extensible records in PureScript - shout out to Edwin’s talk related to this. That inspired me to implement this in PureScript, which started this project called Hyper which is what I’m working on right now in my free time.

Goals:

Safe HTTP middleware architecture

Make effects of middleware explicit

No magic

How?

Track middleware effects in type system

leverage extensible records in PureScript

Provide a common API for middleware

Write middleware that can work on multiple backends

Design

Conn: sort of like in Elixir; instead of passing a request and returning a response, pass them all together as a single unit

Middleware: a function that takes a connection c and returns another connection type c’ inside another type m

Indexed monads: similar to a state monad, but with two additional parameters: the type of the state before this action, and the type after. We can use this to prohibit effectful operations which aren’t correct.

Response state transitions: Hyper uses phantom types to track the state of response, guaranteeing correctness in response side effects

Induction: prove something for a base case and a first step, and you’ve proven it for all numbers

Induction hypothesis: if you are at step n, you must have been at step n-1 before that.

With these elements, we have a program! We just make an if/else: e.g. for sum(n), if n == 0: return 0; else return sum(n-1) + n
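Written out as runnable code (a direct restatement of the sum(n) example above, in Python notation):

def sum_to(n):
    # Base case: the specification holds for 0.
    if n == 0:
        return 0
    # Recursive call: the induction hypothesis gives us sum_to(n - 1).
    return sum_to(n - 1) + n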

It all comes down to writing the right specification: which is where we need to step away from the keyboard and think.

Induction is the basis of recursion.

We can use induction to create a specification for sorting lists from which we can derive the QuickSort algorithm.

But we get 2 sorting algorithms for the price of 1: if we place a restriction that we can only do one recursive call, we can tweak the specification to derive InsertionSort, thus proving that Insertion Sort is a special case of Quick Sort.
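As a rough sketch of the kind of program the derivation arrives at (not the speaker's exact derivation, just the familiar list-based formulation):

def quicksort(xs):
    # Split on a pivot, sort the smaller and larger elements recursively,
    # and concatenate.
    if not xs:
        return []
    pivot, rest = xs[0], xs[1:]
    smaller = [x for x in rest if x <= pivot]
    larger = [x for x in rest if x > pivot]
    return quicksort(smaller) + [pivot] + quicksort(larger)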

I stole this from a PhD dissertation (“Functional Program Derivation” by ). This is all based on program derivation work by Dijkstra.

Takeaways:

Programming == Math. Practicing some basic math is going to help you write code, even if you won’t be doing these kinds of exercises in your day-to-day

Calculations provide insight

Delay choices where possible. Say “let’s assume a solution to this part of the problem” and then go back and solve it later.

Is there a link between the specification and the complexity of the program? Yes, the specification has implications for implementation. The choices you make within the specification (e.g. caching values, splitting computation) affect the efficiency of the program.

What about proof assistants? Those are nice if you’re writing a dissertation or whatnot, but if you’re at the stage where you’re practicing this, the exercise is being precise, so I recommend doing this on paper. The second your fingers touch the keyboard, you can outsource your preciseness to the computer.

Once you’ve got your specification, how do you ensure that your program meets it? One thing you could do is write the spec in something like FsCheck, or convert the specification into tests. Testing and specification really enrich each other, and writing tests as a way to test your specification is also a good way to go. You should also have some cases for which you know, or have an intuition of, the behavior. But none of this is supposed to go in a machine; it’s supposed to be on paper.
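For instance, the sorting specification above translates naturally into a property-based test. A minimal sketch, assuming Python and the hypothesis library; the insertion sort here is just an illustrative implementation, not the speaker's:

from hypothesis import given, strategies as st

def insertion_sort(xs):
    # The one-recursive-call variant: insert the head into the sorted tail.
    if not xs:
        return []
    head, rest = xs[0], insertion_sort(xs[1:])
    i = 0
    while i < len(rest) and rest[i] <= head:
        i += 1
    return rest[:i] + [head] + rest[i:]

@given(st.lists(st.integers()))
def test_sort_meets_spec(xs):
    result = insertion_sort(xs)
    assert result == sorted(result)        # the output is ordered
    assert sorted(result) == sorted(xs)    # and is a permutation of the input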

The last talk was about abstraction without (performance) regret. This talk is about abstraction without the regret of making your code harder to read.

Elm is a particularly good language to modify automatically, because it’s got some boilerplate, but I love that boilerplate! No polymorphism, no type classes - I know exactly what that code is going to do! Reading it is great, but writing it can be a bit of a headache.

As a programmer I want to spend my time thinking about what the users need and what my program is supposed to do. I don’t want to spend my time going “Oh no, i forgot to put that thing there”.

Here’s a simple Elm program that prints “Hello world”. The goal is to write a program that modifies this existing Elm code and changes the greeting that we print.

We’re going to do this with Scala. The goal is to generate readable code that I can later go ahead and change. It’s more like a templating engine, but instead of starting with a templating file it starts from a cromulent Scala program.

Our goal is to parse an Elm file into a parse tree, which gives us the meaningful bits of that file.

The “parser” in parser combinators is actually a combination of lexer and parser.

Reuse is dangerous, dependencies are dangerous, because they create coupling. (Controlled, automated) Cut & Paste is a safer solution.

at which point @jessitron does some crazy fast live coding to write an Elm parser in Scala

Rug is the super-cool open-source project I get to work on as my day job now! It’s a framework for creating code rewriters

In conclusion: any time my job feels easy, I think “OMG I’m doing it wrong”. But I don’t want to introduce abstraction into my code, because someone else is going to have difficulty reading that. I want to be able to abstract without sacrificing code readability. I can make my job faster and harder by automating it.

Relational Programming

There are many programming paradigms that don’t get enough attention. The one I want to talk about today is Relational Programming. It’s somewhat representative of Logic Programming, like Prolog. I want to show you what can happen when you commit fully to the paradigm, and see where that leads us.

Functional Programming is a special case of Relational Programming, as we’re going to see in a minute.

What is functional programming about? There’s a hint in the name. It’s about functions, the idea that representing computation in the form of mathematical functions could be useful. Because you can compose functions, you don’t have to reason about mutable state, etc. - there are advantages to modeling computation as mathematical functions.

In relational programming, instead of representing computation as functions we represent it as relations. You can think of a relation in many ways: if you’re familiar with relational databases, you can think in terms of tuples where we want to reason over sets or collections of tuples, or we can think of it in terms of algebra - like high school algebra - where we have variables representing unknown quantities and we have to figure out their values. We’ll see that we can get FP as a special case - there’s a different set of tradeoffs - but we’ll see that when you commit fully to this paradigm you can get some very surprising behavior.

Let’s start in our functional world, we’re going to write a little program in Scheme or Racket, a little program to manipulate lists. We’ll just do something simple like append or concatenate. Let’s define append in Scheme:

We’re going to use a relational programming language called miniKanren, which is basically an extension that has been applied to lots of languages and which allows us to put in variables representing unknown values and ask miniKanren to fill in those values.

So I’m going to define appendo. (By convention we define our names ending in -o, it’s kind of a long story, happy to explain offline.)

Writes a bunch of Kanren that we don’t really understand

Now I can do:

> (run 1 (q) (appendo '(a b c) '(d e) q))
((a b c d e))

So far, not very interesting, if this is all it does then it’s no better than append.
But where it gets interesting is that I can run it backwards to find an input:

It will run forever. This is sort of like a database query, except where the tables are infinite.

One program we could write is an interpreter, an evaluator. We’re going to take an eval that’s written in MiniKanren, which is called evalo and takes two arguments: the expression to be evaluated, and the value of that expression.

> (run 1 (q) (evalo '(lambda (x) x) q))
((closure x x ()))

> (run 1 (q) (evalo '(list 'a) q))
((a))

A professor wrote a Valentine's Day post, "99 ways to say 'I love you' in Racket", to teach people Racket by showing 99 different Racket expressions that evaluate to the list `(I love you)`

> (run 99 (q) (evalo q '(I love you)))
...99 ways...

What about quines: a quine is a program that evaluates to itself. How could we find or generate a quine?

> (run 1 (q) (evalo q q))

And twines: two different programs p and q where p evaluates to q and q evaluates to p.

Now we’re starting to synthesize programs, based on specifications. When I gave this talk at PolyConf a couple of years ago, Jessitron trolled me about how long it took to run this; since then we’ve gotten quite a bit faster.

This is a tool called Barliman that I (and Greg Rosenblatt) have been working on, and it’s basically a frontend, a dumb GUI to the interpreter we were just playing with. It’s just a prototype. We can see a partially specified definition - a Scheme function that’s partially defined, with metavariables that are fill-in-the-blanks for some Scheme expressions that we don’t know what they are yet. Barliman’s going to guess what the definition is going to be.

(define ,A
(lambda ,B
,C))

Now we give Barliman a bunch of examples. Like (append '() '()) gives '(). It guesses what the missing expressions were based on those examples. The more test cases we give it, the better approximation of the program it guesses. With 3 examples, we can get it to correctly guess the definition of append.

Yes, you are going to lose your jobs. Well, some people are going to lose their jobs. This is actually something that concerns me, because this tool is going to get a lot better.

If you want to see the full dog & pony show, watch the ClojureConj talk I gave with Greg.

Writing the tests is indeed the harder part. But if you’re already doing TDD or property-based testing, you’re already writing the tests, why don’t you just let the computer figure out the code for you based on those tests?

Some people say this is too hard, the search space is too big. But that’s what they said about Go, and it turns out that if you use the right techniques plus a lot of computational power, Go isn’t as hard as we thought. I think in about 10-15 years program synthesis won’t be as hard as we think now. We’ll have much more powerful IDEs, much more powerful synthesis tools. It could even tell you as you’re writing your code whether it’s inconsistent with your tests.

What this will do for jobs, I don’t know. I don’t know, maybe it won’t pan out, but I can no longer tell you that this definitely won’t work. I think we’re at the point now where a lot of the academic researchers are looking at a bunch of different parts of synthesis, and no one’s really combining them, but when they do, there will be huge breakthroughs. I don’t know what it’s going to do, but it’s going to do something.

Working hard to keep things lazy

Without laziness, we waste a lot of space, because when we have recursion we have to keep allocating memory for each evaluated thing. Laziness allows us to get around that.

What is laziness, from a theoretical standpoint?

The first thing we want to talk about is different ways to evaluate expressions.

> f x y = x + y
> f (1 + 1) (2 + 2)

How do we evaluate this?

=> (1 + 1) + (2 + 2)
=> 2 + 4
=> 6

This evaluation used normal order (reduce the outermost application first, before evaluating the arguments).

Church-Rosser Theorem: the order of evaluation doesn’t matter, ultimately a lambda expression will evaluate to the same thing.

But! We have things like non-termination, and termination can only be determined after the fact.

Here’s a way we can think of types: Let’s think of a Boolean as something which has three possible values: True, False, and “bottom”, which represents not-yet-determined, a computation that hasn’t ended yet. True and False are more defined than bottom (e.g. _|_ <= True). Partial ordering.

Monotone functions: if we have a function f that takes a Bool and returns a Bool, and x and y are Bools where x <= y, then f x <= f y. We can now show that f _|_ = True together with f x = False doesn’t work out, because it would require True <= False, which doesn’t hold - and that’s a good thing, because if it did, we would have solved the halting problem. What’s nice here is that if we write a function and evaluate it in normal order, in the lazy way, then this naturally works out.

Laziness is basically non-strictness (this normal order thing I’ve been talking about the whole time), and sharing.

Laziness lets us reuse code and use combinators. This is something I miss from Haskell when I use any other language.

Honorable mention: Purely Functional Data Structures by Chris Okasaki. When you have Persistent Data Structures, you need laziness to have this whole amortization argument going on. This book introduces its own dialect of ML (lazy ML).

How do we do laziness in Haskell (in GHC)? At an intermediate stage of compilation called STG, GHC represents the program in a form where the laziness (thunks and when they get evaluated) is made explicit. (???)

Total Functional Programming

Idris is a pure functional language with dependent types. It’s a “total” language, which means you have program totality: a program either terminates, or keeps producing new results (it is productive).

Goals are:

Encourage type-driven development

Reduce the cost of writing correct software - giving you more tools to know up front that the program will do the correct thing.

People on the internet say, you can’t do X, you can’t do Y in a total language. I’m going to do X and Y in a total language.

Types become plans for a program. Define the type up front, and use it to guide writing the program.

You define the program interactively. The compiler should be less like a teacher, and more like a lab assistant. You say “let’s work on this” and it says “yes! let me help you”.

As you go, you need to refine the type and the program as necessary.

Test-driven development has “red, green, refactor”. We have “type, define, refine”.

If you care about types, you should also care about totality. You don’t have a type that completely describes your program unless your program is total.

Given f : T: if program f is total, we know that it will always give a result of type T. If it’s partial, we only know that if it gives a result, it will be type T, but it might crash, run forever, etc. and not give a result.

The difference between total and partial functions in this world: if it’s total, we can think of it as a Theorem.

Idris can tell us whether or not it thinks a program is total (though we can’t be sure, because we haven’t solved the halting problem “yet”, as a student once wrote in an assignment). If I write a program that type checks but Idris thinks it’s possibly not total, then I’ve probably done the wrong thing. So in my Idris code I can tell it that some function I’m defining should be total.

I can also tell Idris that if I can prove something that’s impossible, then I can basically deduce anything, e.g. an alt-fact about arithmetic. We have the absurd keyword.

We have Streams, where a Stream is sort of like a list without nil, so potentially infinite. As far as the runtime is concerned, this means it is lazy, even though Idris is otherwise strict.

Idris uses IO like Haskell to write interactive programs. IO is a description of actions that we expect the program to make(?). If you want to write interactive programs that loop, this stops it being total. But we can solve this by describing looping programs as a stream of IO actions. We know that the potentially-infinite loops are only going to get evaluated when we have a bit more information about what the program is going to do.

Turns out, you can use this to write servers, which run forever and accept requests, and which are total. (So the people on the internet are wrong.)

Check out David Turner’s paper “Elementary Strong Functional Programming”, where he argues that totality is more important than Turing-completeness, so if you have to give up one you should give up the latter.

February 13, 2017

Riding the tram you hear the word “Linux” pronounced in four different languages. Stepping out into the grey drizzle, you instantly smell fresh waffles and GitHub-sponsored coffee, and everywhere you look you see a FSF t-shirt. That’s right kids, it’s FOSDEM time again! The beer may not be free, but the software sure is.

Last year I got my first taste of this most epic of FLOSS conferences, back when I was an unemployed ex-grad-student with not even 5 pull requests to my name. This year, as a bona fide open source contributor, Mozillian, and full-time professional software engineer, I came back for more. Here are some things I learned:

Open source in general - and, anecdotally, FOSDEM in particular - has a diversity problem. (Yes, we already knew this, but it still needs mentioning.)

…But not for long, if organizations like Mozilla and projects like IncLudo have anything to say about it.

Disclaimer: The (unedited!!!) notes below represent my impressions of the content of these talks, jotted down as I listened. They may or may not be totally accurate, or precisely/adequately represent what the speakers said or think. If you want to get it from the horse’s mouth, follow the links to the FOSDEM schedule entry to find the video, slides, and/or other resources!

No, because there’s no true headless mode for Firefox, though you can make it effectively headless via e.g. running in Docker

Can it work with WebGL?

The problem is similar to the one with canvas - if we just have one element, we can’t look inside of it unless there’s some workaround to expose additional information about the state of the app specifically for testing

Executing async JS?

Selenium has functionality to handle this, and since FoxPuppet builds on Selenium, the base Selenium functionality is still available

Copyright reform is happening, but unfortunately it’s not the kind of reform we need

Doesn’t focus on the interests of users on the internet

Instead of protecting & encouraging innovation & creativity online, may in some cases undermine that

Mozilla wants to ensure that the internet remains a “global public resource open & accessible to all”

not trying to get rid of copyright

but rather encourage copyright laws that support all actors in the web ecosystem

Issues in the current copyright directive

Upload filters:

platforms that are holding large amounts of copyrighted content would need agreements with rights holders

ensuring that they uphold those agreements would require them to implement upload filters that may end up restricting users’ ability to post their own content

Neighboring rights - aka “snippet tax” or “google tax”

proposal to extend copyright to press publishers

press publications would get to charge aggregators for e.g. posting a snippet of their article, the headline, and a hyperlink

already been attempted in Germany and Spain, where it had negative effects on startup aggregators and entrenched the power of established aggregators (Google)

Text & Data Mining (TDM)

there would be restrictions on ingesting copyrighted data for the purposes of data mining

there would only be exceptions for research institutions

The fight right now is unfortunately quite binary: The big Silicon Valley companies/aggregators (Google etc.) vs. the Publishing/Music/Film industry

We need it to involve the full spectrum of stakeholders on the web, especially users, independent content creators

Get involved!

changecopyright.org

raegan@mozilla.com

Series of events across Europe

Q&A:

Since filtering requires monitoring, and monitoring is unconstitutional in the EU, are there plans to fight this if it passes?

Yes, there is absolutely a contradiction there, and we plan to fight it. We want to bring the proposal in line with existing law and channel activism against these filters.

Previous events/campaigns were focused on Freedom of Panorama (copyright exception that allows you to take photographs of e.g. buildings, art and post them online). Will new events be focused on the 4 areas you discussed?

Yes, this is sort of our 2nd wave of activism on this issue, and we’ll be organizing and encouraging more advocacy around these issues.

Do you coordinate with the media?

Yes. There are a number of organizations working on a modern version of copyright that looks forward, not backwards. The C4C (copyright for creativity) brings together a lot of players (e.g libraries, digital rights NGOs), and that serves as a sort of umbrella. A lot of folks have similar issues and we work together as much as possible to amplify & support certain voices.

What is the purpose of another wave? Are we starting over?

EU policy making is a very slow game. This reform has been under discussion for over 5 years, and the process of it going through negotiations to reach a final EU parliament agreement will be at least a year. If we want to have an impact & mobilize different voices, it has to be a sustained, long-term effort, which was not the case in the 1st wave because we didn’t have the proposal yet. Now that we have it, we have more focus on what to encourage people to speak out about, which is potentially game-changing.

It seems that the education exception excludes all informal sources of education

This exception applies to cases where licensed materials can’t be acquired. But that’s not really the problem; the problem is the cost. There’s now a campaign copyrightforeducation.org. It’s something we’re following closely, and we’re mostly relying on our partners who are experts in this area.

When will this be decided by parliament?

There will be votes on committee opinions next month, but the main opinion will be deliberated in March, and they want it to be voted by end of summer 2017. So the next 6 months could be game-changing, it’s an important time to contact your representatives.

Would the TDM exception implicate privacy concerns?

This doesn’t deal with privacy-protected content, but rather would allow people that have lawfully acquired works/texts to create e.g. a visualization. It doesn’t get into privacy issues about mining people’s metadata and all that - it’s a separate issue from privacy and wouldn’t override it.

Paper prototypes are great for experimenting, though clients don’t always accept them so easily

Having a diverse team helps

Q&A

You mentioned a game where you have to hide your bias from others

You pretend you’re management at a company, and you have to hire someone for a position. Everyone has a secret bias card (“don’t want to hire [women, muslims, …]”). Your goal is to fight for the candidate you (don’t) want, but without being so obvious about it as to reveal your secret bias to the other managers. There were some really funny conversations coming out of it.

How do you measure the impact?

That’s really hard. There are a few different things: you can try to measure what people learned from the game, which is difficult in itself. The other attempt is to see what the organizations actually do in real life - that’s what (our partner) ZMQ is going to do: see if the orgs actually change their practices.

How do your games relate to competitive vs. collaborative games

I wouldn’t agree that competition is bad by itself - it motivates us, as long as we understand that we’re competing in the game and not once it’s over. Our games are competitive, with the exception of Pirat Partage, which wasn’t competitive but then the players started asking us for a scoring system so that they could see who’s winning

Aren’t competition and diversity contradictory?

If you’re trying to bring diversity into an existing social structure made of companies that are competing, it makes sense to sell it to them that way.

February 10, 2017

Can you have an open and honest conversation with your peers and, this is the most important one, can you have an open and honest conversation with your manager?

Have a good think about this, don't answer straight away. Let's go through the following scenarios to find out if you can have open and honest conversations.

Can you...

Tell your manager when you are struggling with a task and not feel like you are going to be chastised?

For me, as a manager and a technical lead, it is super important to help grow people. We all have times where we don't know something and no amount of searching the internet can fix it. Being able to go to your "lead" and say, "I don't know what to do.." is a good thing for everyone!

Tell your manager when you are being harassed?

This should be a given, but if you were to ask a lot of your female colleagues, you would hear a resounding "NO!". This has to do with company culture or "not upsetting the 10x'er". Even though harassment can cost a company a lot of money, a lot of people just don't trust their manager enough to tell them about problems like this.

Tell your manager that they are wrong

Feedback is hard to give and to accept, especially in some cultures where it is seen as a weird thing. European culture is like that: you give a slight nod and that is it, and anything more makes people uncomfortable.

Now imagine getting critical feedback, it can be hard.

Now... imagine telling your manager that you think they are wrong and giving them feedback. This could be at a technical level or it could be about how they are as a manager. Expressing that feedback can be hard. Now... how does your manager take it? Do they get all defensive? Do you get defensive?

If you answered No to any of the above, you really need to take the initiative
and speak to your manager and tell them that you don't feel there is a good opportunity for dialogue and that you want to fix this. If they don't want to meet you half way to solve this, then you don't need to feel bad about wanting a new manager. This could be in a new company or within your company.

Honest and open conversations between your peers and your managers will create an amazing work environment and will allow everyone to succeed. It all starts from trust.

February 07, 2017

Last month we focused on triaging all bugs that met our criteria of >=30 failures/week. Every day there are many new bugs to triage and we started with a large list. In the end we have commented on all the bugs and have a small list every day to revisit or investigate.

One thing we focus on is only requesting assistance at most once per week - to that end we have a “Neglected Oranges” dashboard that we use daily.

What is changing this month: we will be recommending resolution on priority bugs (>=30 failures/week) within 2 weeks. Resolution means active debugging, landing changes to the test to reduce, debug, or fix the intermittent, or, when there is no time or no easy fix, disabling the test. If this goes well, we will reduce that down to 7 days in March.

So how are we doing?

Week starting:             Jan 02, 2017    Jan 30, 2017
Orange Factor:             13.76           10.75
# priority intermittent:   42              61

We have fewer overall failures, but more bugs spread out. Some interesting bugs:

January 23, 2017

Interacting with insecure SSL pages (e.g. self-signed) in an automated test written for Selenium is an important feature, especially when tests are run against locally served test pages. Under those circumstances you might never get fully secured websites served to the browser instance under test. To still allow your tests to pass, Selenium can instruct the browser to ignore the validity check, which will simply navigate to the specified site without bringing up the SSL error page.

Since the default driver for Firefox was switched to Marionette in Selenium 3.0, this feature was broken for a while, unless you explicitly opted out of using it. The reason is that Marionette, the automation driver for Mozilla’s Gecko engine, hadn’t implemented it yet. But now, with bug 1103196 fixed, the feature is available starting with the upcoming Firefox 52.0 release, which will soon be available as a Beta build.

Given that a couple of people have had problems getting it to work correctly, I wrote a basic Selenium test for Firefox using Python’s unittest framework. I hope it helps you figure out the remaining issues. But please keep in mind that you need at least a Firefox 52.0 build.

By using DesiredCapabilities.FIREFOX, the default capabilities for Firefox will be retrieved and used. Those also include “marionette: True“, which is necessary to enable webdriver for Firefox in Selenium 3. If not present, the old FirefoxDriver will be used.

To actually enable accepting insecure SSL pages, the capabilities have to be updated with “acceptInsecureCerts: True“ and then passed into Firefox’s WebDriver constructor.
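A minimal sketch of such a test, assuming the Selenium 3 Python bindings, geckodriver on the PATH, and a Firefox 52+ build; the URL below is just a stand-in for a locally served page with a self-signed certificate:

import unittest

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


class InsecureCertificateTestCase(unittest.TestCase):

    def setUp(self):
        # Start from the default Firefox capabilities (includes marionette: True)
        # and opt in to accepting insecure certificates.
        capabilities = DesiredCapabilities.FIREFOX.copy()
        capabilities["acceptInsecureCerts"] = True
        self.driver = webdriver.Firefox(capabilities=capabilities)

    def tearDown(self):
        self.driver.quit()

    def test_self_signed_page(self):
        # Placeholder URL; point this at your own self-signed test page.
        self.driver.get("https://localhost:8443/")
        self.assertTrue(self.driver.current_url.startswith("https://"))


if __name__ == "__main__":
    unittest.main()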

That’s all. So enjoy!

Update: The capability for acceptInsecureCerts is set automatically when DesiredCapabilities.FIREFOX is used.

January 13, 2017

In 2017, Engineering Productivity is starting on a new project that
we’re calling “Conduit”, which will improve the automation around
submitting, testing, reviewing, and landing commits. In many ways,
Conduit is an evolution and course correction of the work on MozReview
we’ve done in the last couple years.

Before I get into what Conduit is exactly, I want to first clarify
that the MozReview team has not really been working on a review tool
per se, aside from some customizations requested by users (like
support for inline diff comments). Rather, most of our work was
building a whole pipeline of automation related to getting code
landed. This is where we’ve had the most success: allowing
developers to push commits up to a review tool and to easily land them
on try or mozilla-central. Unfortunately, by naming the project
“MozReview” we put the emphasis on the review tool (Review Board)
instead of the other pieces of automation, which are the parts unique
to Firefox’s engineering processes. In fact, the review tool should
be a replaceable part of our whole system, which I’ll get to shortly.

We originally selected Review Board as our new review tool for a few
reasons:

The back end is Python/Django, and our team has a lot of experience
working with both.

The diff viewer has a number of fancy features, like clearly
indicating moved code blocks and indentation changes.

A few people at Mozilla had previously contributed to the Review
Board project and thus knew its internals fairly well.

However, we’ve since realized that Review Board has some big
downsides, at least with regards to Mozilla’s engineering needs:

The UI can be confusing, particularly in how it separates the Diff
and the Reviews views. The Reviews view in particular has some
usability issues.

Loading large diffs is slow, but also conversely it’s unable to
depaginate, so long diffs are always split across pages. This
restricts the ability to search within diffs. Also, it’s impossible
to view diffs file by file.

Bugs in interdiffs and even occasionally in the diffs themselves.

No offline support.

In addition, the direction that the upstream project is taking is not
really in line with what our users are looking for in a review tool.

So, we’re taking a step back and evaluating our review-tool
requirements, and whether they would be best met with another tool or
by a small set of focussed improvements to Review Board. Meanwhile,
we need to decouple some pieces of MozReview so that we can accelerate
improvements to our productivity services, like Autoland, and ensure
that they will be useful no matter what review tool we go with.
Project Conduit is all about building a flexible set of services that
will let us focus on improving the overall experience of submitting
code to Firefox (and some other projects) and unburden us from the
restrictions of working within Review Board’s extension system.

In order to prove that our system can be independent of review tool,
and to give developers who aren’t happy with Review Board access to
Autoland, our first milestone will be hooking the commit repo (the
push-to-review feature) and Autoland up to BMO. Developers will be
able to push a series of one or more commits to the review repo, and
reviewers will be able to choose to review them either in BMO or
Review Board. The Autoland UI will be split off into its own service
and linked to from both BMO and Review Board.

(There’s one caveat: if there are multiple reviewers, the first one
gets to choose, in order to limit complexity. Not ideal, but the
problem quickly gets much more difficult if we fork the reviews out to
several tools.)

As with MozReview, the push-to-BMO feature won’t support confidential
bugs right away, but we have been working on a design to support
them. Implementing that will be a priority right after we finish
BMO integration.

Week of Jan 02 -> 09, 2017

Turning on leak checking (bug 1325148) – note: we did this Dec 29th and whitelisted a lot; much still remains, and many great fixes have taken place

some infrastructure issues, other timeouts, and general failures

I am excited for the coming weeks as we reduce the orange factor back down <7 and get the high frequency bugs <20.

Outside of these tracking stats there are a few active projects we are working on:

adding BUG_COMPONENTS to all files in m-c (bug 1328351) – this will allow us to then match up triage contacts for each component so test case ownership has a path to a live person

retrigger an existing job with additional debugging arguments (bug 1322433) – easier to get debug information, possibly extend to special runs like ‘rr-chaos’

add |mach test-info| support (bug 1324470) – allows us to get historical timing/run/pass data for a given test file

add a test-lint job to linux64/mochitest (bug 1323044) – ensure a test runs reliably by itself and in --repeat mode

While these seem small, we are currently actively triaging all bugs that are high frequency (>=30 times/week). In January triage means letting people know this is high frequency and trying to add more data to the bugs.

October 31, 2016

I wrote earlier about my initial experience with triaging frequent intermittent test failures. I was happy to find that most of the most-frequent test failures were under active investigation, but that also meant that finding important bugs in need of triage was a frustrating and time consuming process.

Thankfully, :ekyle provided me with a script to identify “neglected oranges”: Frequent intermittent test failure bugs with no recent comments. The neglected oranges script provides search results not unlike the default search on Orange Factor, but filters out bugs with recent comments from non-robots. It also shows the bug age and how long it has been since the last comment:

This has provided a treasure trove of bugs for triage.

So, now that I can find bugs for frequent intermittent failures that don’t have anyone actively working on them, can I instigate action? Does this type of triage lead to bug resolution and a reduction in Orange Factor (average number of failures per push)? Here’s one way of looking at it: If I look at the bugs I’ve recently triaged and look at the time those bugs were open before I commented on them, I find that, on average, those bugs were open for 65 days before my triage comment. Typically I tried to find someone familiar with the bug and pointed out that it was a frequently failing test; sometimes I offered some insight, or suggested some action (“this is a timeout in a long-running test; if it cannot be optimized or split up, requestLongerTimeout() should avoid the timeout”). On average, those bugs were resolved within 3 days of my triage comment. Wow!

I offer this evidence that triage of neglected oranges makes a difference, but also caution not to expect that much of a difference over time: I’ve chosen bugs that were open for months and with continued triage, we may quickly eliminate these long-neglected bugs (let’s hope!). I’ve also likely chosen “easy” bugs – bugs with an obvious, or at least apparent, resolution. There will also be intractable bugs, surely, and bugs without any apparent owner, or where interested parties cannot agree on a solution.

It is similarly difficult to draw conclusions from Orange Factor failure rates, but let’s look at those anyway, roughly for the time period I have been triaging:

That’s encouraging, isn’t it? I don’t know how much of that improvement was instigated by my triage comments, but I like to think I have contributed to the improvement, and that this type of action can continue to drive down failure rates. I’ll keep spending at least a few hours each week on neglected oranges, and see how that goes for the next couple of months. Can we bring Orange Factor under 10? Under 5?

October 28, 2016

Many of our frequent intermittent test failures are timeouts. There are a lot of ways that a test – or a test job – can time out. Some popular bug titles demonstrate the range of failure messages:

This test exceeded the timeout threshold. It should be rewritten or split up. If that’s not possible, use requestLongerTimeout(N), but only as a last resort.

Test timed out.

TEST-UNEXPECTED-TIMEOUT

TimeoutException: Timed out after … seconds

application ran for longer than allowed maximum time

application timed out after … seconds with no output

Task timeout after 3600 seconds. Force killing container.

We have tried re-wording some of these messages with the aim of clarifying the cause of the timeout and possible remedies, but I still see lots of confusion in bugs. In some cases, I think a complete explanation is much more involved than we can hope to express in an error message. I think we should write up a wiki page or MDN article with detailed explanations of messages like this, and point to that page from error messages in the test log.

One of the first things I do when I see a test failure due to timeout is look for a successful run of the same test on the same platform, and then compare the timing between the success and failure cases. If a test takes 4 seconds to run in the success case but times out after 45 seconds, perhaps there is an intermittent hang; but if the test takes 40 seconds to run successfully and intermittently times out after 45 seconds, it’s probably just a long running test with normal variation in run time.

This suggests some nice-to-have tools:

push a new test to try, get a report of how long your test runs on each platform, perhaps with a warning if run-time approaches known time-outs, or perhaps some arbitrary threshold;

same for longest duration without output (avoid “no output timeout”);

use custom code or a special test harness mode to identify existing long-running tests, for proactive follow-up to prevent timeouts in the future.

October 18, 2016

In the recent past most of my work has involved writing code in Python. For the most part, I haven’t been working on code bases large enough to worry about writing a full-blown test harness to manage them. But recently, while working on a network simulator, my partner @clouisa (https://github.com/clouisa) and I were faced with a problem: how to debug Python code that is non-deterministic in nature.

Following the project guidelines, the code was supposed to produce a random, uniformly distributed sequence of events; depending on their values, different code branches would be taken from run to run. For example, consider the following code:
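The original code sample didn’t survive into these notes; here is a minimal sketch of the kind of code meant, with made-up event names and thresholds:

import random

def next_event():
    # A uniformly distributed draw decides which branch runs, so two runs of
    # the same simulation can take completely different paths.
    value = random.uniform(0.0, 1.0)
    if value < 0.2:
        return "packet dropped"
    elif value < 0.9:
        return "packet delivered"
    else:
        return "packet delayed"

if __name__ == "__main__":
    print([next_event() for _ in range(5)])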

What’s Mozharness?

Mozharness is a Python test harness that serves a two-fold purpose. It is basically a set of scripts that can be used by both the author of the code and the tester alike. This eliminates the need for a separate test-specific team, as the developers themselves are capable of running full tests locally on their dev machines.

Currently mozharness is used in the Mozilla community to test the nightly releases of Firefox. It is language agnostic: as long as the tests have defined pass/fail conditions, it will work just fine regardless of what language the codebase being tested was written in.

BaseLogger

The BaseLogger component is responsible for logging all the runs of the mozharness test harness and storing them under logs/. It is useful to note that the logs are stored by stream, so it's easier to just look at the respective stream (e.g. ERROR or INFO) when needed.

BaseConfig

Config files define the set of tests that are required to run. These config files can then be expanded and built on top of the standard, so-called sanity tests, and BaseConfig does just that: it provides a set of known, commonly used test configurations.

It also provides methods to parse various parameters that can be used to pass additional arguments into scripts.

October 17, 2016

Recently, I have been trying to spend a little time each day looking over the most frequent intermittent test failures in search of neglected bugs. I use Orange Factor to identify the most frequent failures, then scan the associated bugs in bugzilla to see if there is someone actively working on the bug.

I have had some encouraging successes. For example, in bug 1307388, I found a frequent intermittent with no one assigned and no sign of activity. The test had started failing recently – a few days earlier – with no sign of failures before that. A quick check of the mercurial logs showed that the test had been modified the day that it started failing, and a needinfo of the patch author led to immediate action.

In bug 1244707, the bug had been triaged several months ago and assigned to backlog, but the failure frequency had since increased dramatically. Pinging someone familiar with the test quickly led to discussion and resolution.

My experience in each of these cases was really rewarding: It took me just a few minutes to review the bug and bring it to the attention of someone who was interested and understood the failure.

Finding neglected bugs is more onerous. Orange Factor can be used to identify frequent test failures; the default view on https://brasstacks.mozilla.com/orangefactor/ provides a list, ordered by frequency, but most of those are not neglected — someone is already working on them and they just need time to investigate and land a fix. I think the sheriffs do a good job of finding owners for frequent intermittents, so it seems like 90% of the top intermittents have owners, and they are usually actively working on resolving those issues. I don’t think there’s any way to see that activity on Orange Factor:

So I end up opening lots of bugs each day before I find one that “needs help”. Broadly speaking, I’m looking for a search for bugs matching something like:

OrangeFactor does a good job of identifying the frequent failures, but I don’t think it has any data on bug activity…and this notion of bug activity is hazy anyway. Ping me if you have a better intermittent orange triage procedure, or thoughts on how to do this more efficiently.

** Update – I’ve been getting lots of ideas from folks on irc for better triaging:

ryanvm

look to aurora/beta for bugs that have been around for longer

would be nice if a dashboard would show trends for a bug (now happening more frequently, etc) – like socorro

bugzilla data fed to presto, so marrying it to treeherder with redash may be possible (mdoglio may know more)

wlach

might be able to use redash for change detection/trends once treeherder’s db is hooked up to it

October 10, 2016

Intermittent Oranges (tests which fail sometimes and pass other times) are an ever increasing problem with test automation at Mozilla.

While there are many common causes for failures (bad tests, the environment/infrastructure we run on, and bugs in the product)
we still do not have a clear definition of what we view as intermittent. Some common statements I have heard:

“It’s obvious, if it failed last year, the test is intermittent“

“If it failed 3 years ago, I don’t care, but if it failed 2 months ago, the test is intermittent“

“I fixed the test to not be intermittent, I verified by retriggering the job 20 times on try server“

These imply very different definitions of what is intermittent. A definition will need to:

determine if we should take action on a test (programmatically or manually)

define policy sheriffs and developers can use to guide work

guide developers to know when a new/fixed test is ready for production

provide useful data to release and Firefox product management about the quality of a release

Given the fact that I wanted to have a clear definition of what we are working with, I looked over 6 months (2016-04-01 to 2016-10-01) of OrangeFactor data (7330 bugs, 250,000 failures) to find patterns and trends. I was surprised at how many bugs had <10 instances reported (3310 bugs, 45.1%). Likewise, I was surprised at how such a small number (1236) of bugs account for >80% of the failures. It made sense to look at things daily, weekly, monthly, and every 6 weeks (our typical release cycle). After much slicing and dicing, I have come up with 4 buckets:

Random Orange: this test has failed, even multiple times in history, but in a given 6 week window we see <10 failures (45.2% of bugs)

Low Frequency Orange: this test might fail up to 4 times in a given day, typically <=1 failure per day. In a 6 week window we see <60 failures (26.4% of bugs)

Intermittent Orange: fails up to 10 times/day or <120 times in 6 weeks. (11.5% of bugs)

High Frequency Orange: fails >10 times/day many times and are often seen in try pushes. (16.9% of bugs or 1236 bugs)
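As a rough sketch of those thresholds in code (keyed only on the 6-week counts; the real analysis also looked at daily rates):

def classify_orange(failures_in_6_weeks):
    # Buckets as described above, based on failures seen in a 6-week window.
    if failures_in_6_weeks < 10:
        return "random orange"
    elif failures_in_6_weeks < 60:
        return "low frequency orange"
    elif failures_in_6_weeks < 120:
        return "intermittent orange"
    else:
        return "high frequency orange"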

Alternatively, we could simplify our definitions and use:

low priority or not actionable (buckets 1 + 2)

high priority or actionable (buckets 3 + 4)

Does defining these buckets about the number of failures in a given time window help us with what we are trying to solve with the definition?

Determine if we should take action on a test (programmatically or manually):

ideally buckets 1/2 can be detected programmatically with autostar and removed from our view, possibly rerunning to validate it isn’t a new failure

buckets 3/4 have the best chance of reproducing, we can run in debuggers (like ‘rr’), or triage to the appropriate developer when we have enough information

Define policy sheriffs and developers can use to guide work

sheriffs can know when to file bugs (either buckets 2 or 3 as a starting point)

developers understand the severity based on the bucket. Ideally we will need a lot of context, but understanding severity is important.

Guide developers to know when a new/fixed test is ready for production

If we fix a test, we want to ensure it is stable before we make it tier-1. A developer can use math of 300 commits/day and ensure we pass.

NOTE: SETA and coalescing ensures we don’t run every test for every push, so we see more likely 100 test runs/day
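As a back-of-envelope illustration of that math (the inputs come straight from the figures above; framing it as an acceptable failure rate is my own addition, not an official threshold):

runs_per_day = 100        # effective runs/day after SETA and coalescing
window_days = 42          # one 6-week release cycle
threshold = 10            # "Random Orange" upper bound per 6-week window

runs_per_window = runs_per_day * window_days    # 4200 runs
max_failure_rate = threshold / runs_per_window  # ~0.0024, i.e. roughly 0.24%
print(runs_per_window, max_failure_rate)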

Provide useful data to release and Firefox product management about the quality of a release

Release Management can take the OrangeFactor into account

new features might be required to keep their test failure volume at or below the Random Orange level

One other way to look at this is what gets put in bugs (the War on Orange bugzilla robot). There are simple rules:

15+ times/day – post a daily summary (bucket #4)

5+ times/week – post a weekly summary (bucket #3/4 – about 40% of bucket 2 will show up here)

Lastly I would like to cover some exceptions and how some might see this as flawed:

missing or incorrect data in orange factor (human error)

some issues have many bugs, but a single root cause - we could miscategorize a fixable issue

I do not believe adjusting a definition will fix the above issues - possibly different tools or methods to run the tests would reduce the concerns there.

Earlier today we released geckodriver version 0.11.1.
geckodriver
is an HTTP proxy for using
W3C WebDriver-compatible clients
to interact with Gecko-based browsers.

The program provides the HTTP API
described by the WebDriver protocol
to communicate with Gecko browsers, such as Firefox.
It translates calls into
the Marionette automation protocol
by acting as a proxy between the local- and remote ends.

One backwards-incompatible change to note
is that the firefox_binary, firefox_args,
and firefox_profile capabilities
have all been removed in favour of
the moz:firefoxOptions dictionary.
Please consult the documentation
on how to use it.
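As a rough sketch with the Selenium Python bindings (the binary path and argument below are placeholders, and the exact set of keys supported by 0.11.1 may differ from later releases):

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

capabilities = DesiredCapabilities.FIREFOX.copy()
capabilities["moz:firefoxOptions"] = {
    # Replaces the removed firefox_binary capability; placeholder path.
    "binary": "/usr/local/bin/firefox",
    # Replaces the removed firefox_args capability; example argument only.
    "args": ["-foreground"],
}
driver = webdriver.Firefox(capabilities=capabilities)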

October 07, 2016

Our automated tests seem to fail a lot. Instead of a sea of green, a typical good push often looks more like:

I’ve been thinking about ways that we can improve on that: Ways that we can reduce those pesky intermittent oranges.

Here’s one idea: Be more aggressive about disabling (skipping) tests that fail intermittently.

For today anyway, let’s put aside those tests that fail infrequently. If a test fails only rarely, there’s less to be gained by skipping it. It may also be harder to reproduce such failures, and harder to fix them and get them running again.

Notice that the most frequent intermittent failure for this one-week period is bug 1157948, which failed 721 times (well, it was reported/starred 721 times — it probably failed more than that!). Guess what happened the week before that? Yeah, another 700 or so oranges. And the week before that and … This is definitely a persistent, frequent intermittent failure.

I am actually intimately familiar with bug 1157948. I’ve worked hard to resolve it, and lots of other people have too, and I’m hopeful that a fix is landing for it right now. Still, it took over 3 months to fix this. What did we gain by running the affected tests for those 3 months? Was it worth the 10000+ failures that sheriffs and developers saw, read, diagnosed, and starred?

Bug 1157948 affected all taskcluster-initiated Android tests, so skipping the affected tests would have meant losing a lot of coverage. But it is not difficult to find other bugs with over 100 failures per week that affect just one test (like bug 1305601, just to point out an example). It would be easy to disable (add a skip-if annotation to) this test while we work on it, and wouldn’t that be better? It won’t be fixed overnight, but it will continue to fail overnight — and there’s a cost to that.

There’s a trade-off here for sure. A skipped test means less coverage. If another change causes a spontaneous fix to this test, we won’t notice the change if it is skipped. And we won’t notice a change in the frequency of failures. How important are these considerations, and are they important enough that we can live with seeing, reporting, and tracking all these test failures?

I’m not yet sure about the particulars of when and how to skip intermittent failures, but it feels like we would profit by being more aggressive about skipping troublesome tests, particularly those that fail frequently and persistently.

Of course many more changes have happened
to the (internal) Marionette test harness
and there have been several more commits of janitorial nature,
but the intention here is to distil the bulk of information
into a format that is useful to
① people not actively engaged in development,
and ② for us to keep track of what we have
in the pipeline for the upcoming release.

I suspect it will also be useful as a reference
when triaging new issue reports.

XBL has the concept of anonymous nodes
that are not returned by the usual WebDriver element-finding methods.
However there are two Gecko-specific methods of finding them;
either by getting all the anonymous children of a reference element,
or getting a single anonymous child of a reference element
with specified attribute values.

This commit adds two endpoints corresponding to those methods:

/session/{sessionId}/moz/xbl/{elementId}/anonymous_children

Return all anonymous children.

/session/{sessionId}/moz/xbl/{elementId}/anonymous_by_attribute

Return an anonymous element with the given attribute value,
provided as a body of the form:

Using the capability firefoxOptions.log.level
(full usage described in README.md)
it’s possible to set the log level
of both Firefox and geckodriver itself.
This will become useful as most of our users
have trouble figuring out how to start geckodriver
with the -vv flag in order
to enable very verbose logging.
From 0.11 we can recommend users
to add this to their capabilities instead.
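With the Python client that might look roughly like the following; I have used the moz:firefoxOptions spelling from the 0.11.1 notes earlier on this page, but at the time of this post the key may have been firefoxOptions without the prefix, so treat the exact name as an assumption:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

capabilities = DesiredCapabilities.FIREFOX.copy()
# Equivalent to starting geckodriver with -vv (very verbose logging).
capabilities["moz:firefoxOptions"] = {"log": {"level": "trace"}}
driver = webdriver.Firefox(capabilities=capabilities)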

In the same vein,
there is an open PR
to replace the default log dependency with slog,
as env_logger
does not allow us to reinstantiate it at runtime.
This will cause issues when providing
a different log level in the capabilities
for subsequent sessions.

We also disabled the additional welcome URL in geckodriver
which has caused officially branded builds
to open two new tabs when started with a clean profile.
This doesn’t reproduce in Nightly builds
as it does not have it set by default.
As the additional welcome page uses a plugin that forks
and starts the plugin container process,
deleting a session whilst on this site
sometimes causes that process to crash.
However, we still haven’t pinned down the cause
of the underlying issue which is tracked in
issue 225.

APK Size

Here’s how the APK size changed over the quarter, for mozilla-central Android 4.0 API15+ opt builds:

As seen in the past, the APK size seems to gradually increase over time. But this quarter there is a pleasant surprise, with a recent very large improvement. That is :esawin’s change from bug 1291424. Nice!

Memory

Again, there is a tremendous improvement with bug 1291424. Thank you :esawin!

Autophone-Talos

This section tracks Perfherder graphs for mozilla-central builds of Firefox for Android, for Talos tests run on Autophone, on android-6-0-armv8-api15. The test names shown are those used on treeherder. See https://wiki.mozilla.org/Buildbot/Talos for background on Talos.

tsvgx

An svg-only number that measures SVG rendering performance. About half of the tests are animations or iterations of rendering. This ASAP test (tsvgx) iterates in unlimited frame-rate mode thus reflecting the maximum rendering throughput of each test. The reported value is the page load time, or, for animations/iterations – overall duration the sequence/animation took to complete. Lower values are better.

tp4m

Generic page load test. Lower values are better.

No significant improvements or regressions noted for tsvgx or tp4m.

Autophone

Throbber Start / Throbber Stop

Browser startup performance is measured on real phones (a variety of popular devices).

Here’s a quick summary for the local blank page test on various devices:

Again, there is an excellent performance improvement with bug 1291424. Yahoo!

August 22, 2016

Welcome back to my post series on the Marionette project! In Act I, we looked into Marionette’s automation framework for Gecko, the engine behind the Firefox browser. Here in Act II, we’ll take a look at a complementary side of the Marionette project: the testing framework that helps us run tests using our Marionette-animated browser, aka the Marionette test harness. If – like me at the start of my Outreachy internship – you’re clueless about test harnesses, or the Marionette harness in particular, and want to fix that, you’re in the right place!

Wait, what’s Marionette again?

Marionette refers to a suite of tools for automated testing of Mozilla browsers.

In that post, we saw how the Marionette automation framework lets us control the Gecko browser engine (our “puppet”), thanks to a server component built into Gecko (the puppet’s “strings”) and a client component (a “handle” for the puppeteer) that gives us a simple Python API to talk to the server and thus control the browser. But why do we need to automate the browser in the first place? What good does it do us?

Well, one thing it’s great for is testing. Indulge me in a brief return to my puppet metaphor from last time, won’t you? If the automation side of Marionette gives us strings and a handle that turn the browser into our puppet, the testing side of Marionette gives that puppet a reason for being, by letting it perform: it sets up a stage for the puppet to dance on, tells it to carry out a given performance, writes a review of that performance, and tears down the stage again.

OK, OK, metaphor-indulgence over; let’s get real.

Wait, why do we need automated browser testing again?

As Firefox1 contributors, we don’t want to have to manually open up Firefox, click around, and check that everything works every time we change a line of code. We’re developers, we’re lazy!

But we can’t not do it, because then we might not realize that we’ve broken the entire internet (or, you know, introduced a bug that makes Firefox crash, which is just as bad).

So instead of testing manually, we do the same thing we always do: make the computer do it for us!

The type of program that can magically do this stuff for us is called a test harness. And there’s even a special version specific to testing Gecko-based browsers, called – can you guess? – the Marionette test harness, also known as the Marionette test runner.

So, what exactly is this magical “test harness” thing? And what do we need to know about the Marionette-specific one?

What’s a test harness?

First of all, let’s not get hung up on the name “test harness” – the names people use to refer to these things can be a bit ambiguous and confusing, as we saw with other parts of the Marionette suite in Act I. So let’s set aside the name of the thing for now, and focus on what the thing does.

Assuming we have a framework like the Marionette client/server that lets us automatically control the browser, the other thing we need for automatically testing the browser is something that lets us:

Properly set up & launch the browser, and any other related components we might need

Define tests we want to perform and their expected results

Discover tests defined in a file or directory

Run those tests, using the automation framework to do the stuff we want to do in the browser

Keep track of what we actually saw, and how it compares to what we expected to see

Report the results in human- and/or machine-readable logs

Clean up all of that stuff we set up in the beginning

Take out the browser-specific parts, and you’ve got the basic outline of what a test harness for any kind of software should do.
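
To make that list concrete, here is a deliberately generic sketch of such a harness in Python. Every name in it is invented for illustration; it isn’t the Marionette code, just the skeleton that any test harness fleshes out:

import traceback

def run_harness(tests):
    # "Setup": launch whatever the tests need (a browser, fixtures, logging, ...)
    results = {"passed": 0, "failed": 0}
    try:
        # "Discover" and "run": exercise each test and track what we saw
        for test in tests:
            try:
                test()
                results["passed"] += 1
            except AssertionError:
                results["failed"] += 1
                traceback.print_exc()  # "report": log what went wrong
    finally:
        # "Teardown": clean up everything we set up at the start
        pass
    print("passed: {passed}, failed: {failed}".format(**results))
    return results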

Ever write tests using Python’s unittest, JavaScript’s mocha, Java’s JUnit, or a similar tool? If you’re like me, you might have been perfectly happy writing unit tests with one of these, thinking not:

Yeah, I know unittest! It’s a test harness.

but rather:

Yeah, I know unittest! It’s, you know, a, like, thing for writing tests that lets you make assertions and write setup/teardown methods and stuff and, like, print out stuff about the test results, or whatever.

Turns out, they’re the same thing; one is just shorter (and less, like, full of “like”s, and stuff).

So that’s the general idea of a test harness. But we’re not concerned with just any test harness; we want to know more about the Marionette test harness.

What’s special about the Marionette test harness?

Um, like, duh, it’s made for tests using Marionette!

What I mean is that unlike an all-purpose test harness, the Marionette harness already knows that you’re a Mozillian specifically interested in running Gecko-based browser tests using Marionette. So instead of making you write code for setup/teardown/logging/etc. that talks to Marionette and uses other features of the Mozilla ecosystem, it does that legwork for you.

You still have control, though; it makes it easy for you to make decisions about certain Mozilla-/Gecko-specific properties that could affect your tests, like:

Need to use a specific Firefox binary? Or a particular Firefox instance running on a device somewhere?

Got a special profile or set of preferences you want the browser to run with?

Need to run an individual test module? A directory full of tests? Tests listed in a manifest file?

Want the tests run multiple times? Or in chunks?

How and where should the results be logged?

Care to drop into a debugger if something goes wrong?

But how does it do all this? What does it look like on the inside? Let’s dive into the code to find out.

How does the Marionette harness work?

Inside Marionette, in the file harness/marionette/runtests.py, we find the MarionetteHarness class. MarionetteHarness itself is quite simple: it takes in a set of arguments that specify the desired preferences with respect to the type of decisions we just mentioned, uses an argument parser to parse and process those arguments, and then passes them along to a test runner, which runs the tests accordingly.

So actually, it’s the “test runner” that does the brunt of the work of a test harness here. Perhaps for that reason, the names “Marionette Test Harness” and “Marionette Test Runner” sometimes seem to be used interchangeably, which I for one found quite confusing at first.

Anyway, the test runner that MarionetteHarness makes use of is the MarionetteTestRunner class defined in runtests.py, but that’s really just a little wrapper around BaseMarionetteTestRunner from harness/marionette/runner/base.py, which is where the magic happens – and also where I’ve spent most of my time for my Outreachy internship, but more on that later. For now let’s check out the runner!

How does Marionette’s test runner work?

The beating heart of the Marionette test runner is the method run_tests. By combining some methods that take care of general test-harness functionality and some methods that let us set up and keep tabs on a Marionette client-server session, run_tests gives us the Marionette-centric test harness we never knew we always wanted. Thanks, run_tests!

To get an idea of how the test runner works, let’s take a walk through the run_tests method and see what it does.2

First of all, it simply initializes some things, e.g. timers and counters for passed/failed tests. So far, so boring.

Next, we get to the part that puts the “Marionette” in “Marionette test runner”. The run_tests method starts up Marionette, by creating a Marionette object – passing in the appropriate arguments based on the runner’s settings – which gives us the client-server session we need to automate the browser in the tests we’re about to run (we know how that all works from Act I).

Adding the tests we want to the runner’s to-run list (self.tests) is the next step. This means finding the appropriate tests from test modules, a directory containing test modules, or a manifest file listing tests and the conditions under which they should be run.

To actually run the tests, the runner calls run_test_sets, which runs the tests we added earlier, possibly dividing them into several sets (or chunks) that will be run separately (thus enabling parallelization). This in turn calls run_test_set, which basically just calls run_test, which is the final turtle.3

Glancing at run_test, we can see how the Marionette harness is based on Python’s unittest, which is why the tests we run with this harness basically look like unittest tests (we’ll say a bit more about that below). Using unittest to discover our test cases in the modules we provided, run_test runs each test using a MarionetteTextTestRunner and gets back a MarionetteTestResult. These are basically Marionette-specific versions of classes from moztest, which helps us store the test results in a format that’s compatible with other Mozilla automation tools, like Treeherder. Once we’ve got the test result, run_test simply adds it to the runner’s tally of test successes/failures.
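
Pieced together, the flow looks roughly like the sketch below. Only run_tests, run_test_sets, run_test_set and run_test correspond to real runner methods; everything else is simplified stand-in code, not the actual BaseMarionetteTestRunner:

class SketchTestRunner(object):
    # Illustrative outline only, not the real runner.

    def __init__(self, chunk_size=10):
        self.chunk_size = chunk_size
        self.passed = self.failed = 0
        self.tests = []
        self.marionette = None

    def run_tests(self, tests):
        # 1. Timers and pass/fail counters get initialized (here, in __init__).
        # 2. Start Marionette: the real runner creates a Marionette object,
        #    opening the client-server session the tests will use (stubbed here).
        self.marionette = object()
        # 3. Build the to-run list from modules, a directory or a manifest.
        self.tests.extend(tests)
        # 4. Run everything, possibly split into sets ("chunks").
        self.run_test_sets()

    def run_test_sets(self):
        for i in range(0, len(self.tests), self.chunk_size):
            self.run_test_set(self.tests[i:i + self.chunk_size])

    def run_test_set(self, test_set):
        for test in test_set:
            self.run_test(test)

    def run_test(self, test):
        # The real method discovers unittest test cases and runs them with
        # MarionetteTextTestRunner, tallying a MarionetteTestResult; here we
        # just call the test and count the outcome.
        try:
            test()
            self.passed += 1
        except AssertionError:
            self.failed += 1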

So there we have it: our very own Marionette-centric test-runner! It runs our tests with Marionette and Firefox set up however we want, and also gives us control over more general things like logging and test chunking. In the next section, we’ll take a look at how we can interact with and customize the runner, and tell it how we want our tests run.

What do the tests look like?

As for the tests themselves, since the Marionette harness is an extension of Python’s unittest, tests are mostly written as a custom flavor of unittest test cases. Tests extend MarionetteTestCase, which is an extension of unittest.TestCase. So if you need to write a new test using Marionette, it’s as simple as writing a new test module named test_super_awesome_things.py which extends that class with whatever test_* methods you want – just like with vanilla unittest.
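
For example, a minimal made-up test module along these lines would be picked up and run by the harness. The file name, the assertion and the exact import path are invented for illustration (the import location has moved around between in-tree versions):

# test_super_awesome_things.py -- a hypothetical Marionette test module
from marionette import MarionetteTestCase

class TestSuperAwesomeThings(MarionetteTestCase):

    def test_navigation_is_awesome(self):
        # self.marionette is the Marionette client the harness sets up for us
        self.marionette.navigate("https://www.mozilla.org")
        self.assertIn("Mozilla", self.marionette.title)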

Glancing at the long list of arguments the runner accepts, our first thought might be, “Wow, that’s a lot of arguments”. Indeed! This is how the runner knows how you want the tests to be run. For example, binary is the path to the specific Firefox application binary you want to use, and e10s conveys whether or not you want to run Firefox with multiple processes.

Where do all these arguments come from? They’re passed to the runner by MarionetteHarness, which gets them from the argument parser we mentioned earlier, MarionetteArguments.

Analogous to MarionetteTestRunner/BaseMarionetteTestRunner, MarionetteArguments is just a small wrapper around BaseMarionetteArguments from runner/base.py, which in turn is just an extension of Python’s argparse.ArgumentParser. BaseMarionetteArguments defines which arguments can be passed in to the harness’s command-line interface to configure its settings. It also verifies that whatever arguments the user passed in make sense and don’t contradict each other.
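
Underneath, this is plain argparse. A toy version of such an arguments class (the names and options below are invented, and there are far fewer of them than in the real thing) might look like:

import argparse

class SketchArguments(argparse.ArgumentParser):
    # Toy stand-in for BaseMarionetteArguments: define the CLI options,
    # then sanity-check the combination the user passed in.

    def __init__(self, **kwargs):
        argparse.ArgumentParser.__init__(self, **kwargs)
        self.add_argument('--binary', help='path to the browser binary')
        self.add_argument('--total-chunks', type=int, default=1)
        self.add_argument('--this-chunk', type=int, default=1)

    def verify_usage(self, args):
        if args.this_chunk > args.total_chunks:
            self.error('--this-chunk cannot exceed --total-chunks')
        return args

parser = SketchArguments()
args = parser.verify_usage(parser.parse_args(['--binary', '/path/to/firefox']))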

To actually use the harness, we can simply call the runtests.py script with: python runtests.py [whole bunch of awesome arguments]. Alternatively, we can use the Mach command marionette-test (which just calls runtests.py), as described here.

To see all of the available command-line options (there are a lot!), you can run python runtests.py --help or ./mach marionette-test --help, which just spits out the arguments and their descriptions as defined in BaseMarionetteArguments.

So, with the simple command mach marionette-test [super fancy arguments] test_super_fancy_things.py, you can get the harness to run your Marionette tests with whatever fancy options you desire to fit your specific fancy scenario.

But what if you’re extra fancy, and have testing needs that exceed the limits of what’s possible with the (copious) command-line options you can pass to the Marionette runner? Worry not! You can customize the runner even further by extending the base classes and making your own super-fancy harness. In the next section, we’ll see how and why you might do that.

How is the Marionette test harness used at Mozilla?

Other than enabling people to write and run their own tests using the Marionette client, what is the Marionette harness for? How does Mozilla use it internally?

Well, first and foremost, the harness is used to run the Marionette Python unit tests we described earlier, which check that Marionette is functioning as expected (e.g. if Marionette tells the browser to check that box, then by golly that box better get checked!). Those are the tests that will get run if you just run mach marionette-test without specifying any test(s) in particular.

But that’s not all! I mentioned above that there might be special cases where the runner’s functionality needs to be extended, and indeed Mozilla has already encountered this scenario a couple of times.

One example is the Firefox UI tests, and in particular the UI update tests. These test the functionality of e.g. clicking the “Update Firefox” button in the UI, which means they need to do things like compare the old version of the application to the updated one to make sure that the update worked. Since this involves binary-managing superpowers that the base Marionette harness doesn’t have, the UI tests have their own runner, FirefoxUITestRunner, which extends BaseMarionetteTestRunner with those superpowers.

Another test suite that makes use of a superpowered harness is the External Media Tests, which test video playback in Firefox and need some extra resources – namely a list of video URLs to make available to the tests. Since there’s no easy way to make such resources available to tests using the base Marionette harness, the external media tests have their own test harness which uses the custom MediaTestRunner and MediaTestArguments (extensions of BaseMarionetteTestRunner and BaseMarionetteArguments, respectively), to allow the user to e.g. specify the video resources to use via the command line.
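
The extension pattern itself is just subclassing. A hypothetical custom runner and arguments class in the same spirit (everything below, including the import path, is illustrative rather than the actual FirefoxUITestRunner or MediaTestRunner code) might look roughly like:

# Illustrative only; real import paths and hooks differ slightly.
from marionette.runner import BaseMarionetteArguments, BaseMarionetteTestRunner

class VideoURLArguments(BaseMarionetteArguments):
    def __init__(self, **kwargs):
        BaseMarionetteArguments.__init__(self, **kwargs)
        # extra command-line option for the resources these tests need
        self.add_argument('--urls', help='file listing video URLs to test')

class VideoTestRunner(BaseMarionetteTestRunner):
    def __init__(self, **kwargs):
        # pull our custom option out before handing the rest to the base runner
        self.video_urls = kwargs.pop('urls', None)
        BaseMarionetteTestRunner.__init__(self, **kwargs)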

So the Marionette harness is used in at least three test suites at Mozilla, and more surely can and will be added as the need arises! Since the harness is designed with automation in mind, suites like marionette-test and the Firefox UI tests can be (and are!) run automatically to make sure that developers aren’t breaking Firefox or Marionette as they make changes to the Mozilla codebase. This all makes the Marionette harness a rather indispensable development tool.

Which brings us to a final thought…

How do we know that the harness itself is running tests properly?

The Marionette harness, like any test harness, is just another piece of software. It was written by humans, which means that bugs and breakage are always a possibility. Since breakage or bugs in the test harness could prevent us from running tests properly, and we rely on those tests to keep Firefox and other Mozilla tools working, we need to make sure that any harness bugs get caught!

Do you see where I’m going with this? We need to… wait for it…

Test the thing that runs the tests

Yup, that’s right: Meta-testing. Test-ception. Tests all the way down.

And that’s what I’ve been doing this summer for my Outreachy project: working on the tests for the Marionette test harness, otherwise known as the Marionette harness (unit) tests. I wrote a bit about what I’ve been up to in my previous post, but in my next and final Outreachy post, I’ll explain in more detail what the harness tests do, how we run them in automation, and what improvements I’ve made to them during my time as a Mozilla contributor.

Notes

3 If you think distinguishing run_tests, run_test_sets, run_test_set, and run_test is confusing, I wholeheartedly agree with you! But best get used to it; working on the Marionette test harness involves developing an eagle-eye for plurals in method names (we’ve also got _add_tests and add_test). ↩

August 04, 2016

One of the most painful aspects of a developer's work cycle is trying to fix failures that show up
on try, but which can't be reproduced locally. When this happens, there have traditionally
been only two options (neither of them nice):

You could spam try with print debugging. But this isn't very powerful, and takes forever to get
feedback.

You could request a loaner from releng. But this is a heavy handed process, and once you have the
loaner it is very hard to get tests up and running.

I'm pleased to announce there is now a third option, which is easy, powerful and 100% self-serve.
Rather than trying to explain it in words, here is a ~5 minute demo:

Because I'm too lazy to re-record, I want to clarify that when I said |mach reftest| was "identical"
to the command line that mozharness would have run, I meant other than the test selection arguments.
You'll still need to either pass in a test path or use chunking arguments to select which tests to
run.

Caveats and Known Issues

Before getting too excited, here is the requisite list of caveats.

This only works with taskcluster jobs that have one-click-loaner enabled. This means there is no
Windows or OS X support. Getting support here is obviously the most important thing we can do to
improve this system, but it's a hard problem with lots of unsolved dependencies. The biggest
blocker is even just getting these platforms to work on AWS in the first place.

The test package mach environment doesn't support all test harnesses yet. I've got mochitest,
reftest and xpcshell working so far and plan to add support for more harnesses later on
in the quarter.

There are many rough edges. I consider this workflow to be "beta" quality. Though I think it is
already useful enough to advertise, it is a long shot from being perfect. Please file bugs or
ping me with any issues or annoyances you come across. See the next section for known issues.

Known Issues

No Windows or OS X support

Hard to find "One-Click Loaner" link in treeherder

Confusing error message if not logged into taskcluster-tools

Interactive tasks should be high priority so they're never "pending"

Scrolling in the interactive shell is broken

No support for harnesses other than mochitest, reftest and xpcshell

Android workflow needs to be ironed out

Cloning gecko using interactive wizard is slow

Thanks

Finally I want to thank the taskcluster team (especially jonasfj and garndt) for implementing the
One-Click Loaner, it is seriously cool. I also want to thank armenzg for helping with reviews and
dustin for helping me navigate taskcluster and docker.

As always, please let me know if you have any problems, questions or suggestions!

August 02, 2016

It feels like yesterday that I started my Outreachy internship, but it was actually over 2 months ago! For the last couple of weeks I’ve been on Outreachy hiatus because of EuroPython, moving from Saarbrücken to Berlin, and my mentor being on vacation. Now I’m back, with 6 weeks left in my internship! So it seems like a good moment to check in and reflect on how things have been going so far, and what’s in store for the rest of my time as an Outreachyee.

What have I been up to?

Learning how to do the work

In the rather involved application process for Outreachy, I already had to spend quite a bit of time figuring out the process for making even the tiniest contribution to the Mozilla codebase. But obviously one can’t learn everything there is to know about a project within a couple of weeks, so a good chunk of my Outreachy time so far was spent on getting better acquainted with:

The tools

I’ve already written about my learning experiences with Mercurial, but there were a lot of other components of the Mozilla development process that I had to learn about (and am still learning about), such as Bugzilla, MozReview, Treeherder, Try, Mach…
Then, since the project I’m working on focuses on testing, I had to grok things like Pytest and Mock. Since most everything I’m doing is in Python, I’ve also been picking up useful Python tidbits here and there.

The project

My internship project, “Test-driven refactoring of Marionette’s Python test runner”, relates to a component of the Marionette project, which encompasses a lot of moving parts. Even figuring out what Marionette is, what components it comprises, how these interrelate, and which of them I need to know about, was a non-trivial task. That’s why I’m writing a couple of posts about the project itself - one down, one to go - to crystallize what I’ve learned and hopefully make it a little easier for other people to get through the what-even-is-it steps that I’ve been going through. This post is a sort of “intermission”, so stay tuned for my upcoming post on the Marionette test runner and harness!

Doing the work

Of course, there’s a reason I’ve been trying to wrap my head around all the stuff I just mentioned: so that I can actually do this project! So what is the actual work I’ve been doing, i.e. my overall contribution to Mozilla as an Outreachy intern?

The “thing” is the Marionette test runner, a tool written in Python that allows us to run tests that make use of Marionette to automate the browser. It’s responsible for things like discovering which tests we need to run, setting up all the necessary prerequisites, running the tests one by one, and logging all of the results.

Since the test runner is essentially a program like any other, it can be broken just like any other! And since it’s used in automation to run the tests that let Firefox developers know if some new change they’ve introduced breaks something, if the test runner itself breaks, that could cause a lot of problems. So what do we do? We test it!

That’s where I come in. My mentor, Maja, had started writing some unit tests for the test runner before the internship began. My job is basically to add more tests. This involves:

Reading the code to identify things that could break and cause problems, and thus should be tested

Refactoring the code to make it easier to test, more readable, easier to extend/change, or otherwise better

Aside from the testing side of things, another aspect of the project (which I particularly enjoy) involves improving how the runner relates to the rest of the world. For example, improving the command line interface to the runner, or making sure the unit tests for the runner are producing logs that play nicely with the rest of the Mozilla ecosystem.

Writing stuff down

As you can see, I’ve also been spending a fair bit of time writing blog posts about what I’ve been learning and encountering over the course of my internship. Hopefully these have been or will be useful to others who might also be wrapping their heads around these things for the first time. But regardless, writing them has certainly been useful for me!

What’s been fun?

Learning all the things

While working with a new system or technology can often be frustrating, especially when you’re used to something similar-but-not-quite-the-same (ahem, git and hg), I’ve found that the frustration does subside (or at least lessen) eventually, and in its place you find not only the newfound ease of working with the new thing, but also the gratification that comes with the realization: “Hey! I learned the thing!” This makes the overall experience of grappling with the learning curve fun, in my experience.

Working remotely

This internship has been my first experience with a lifestyle that always attracted me: working remotely. I love the freedom of being able to work from home if I have things to take care of around the house, or at a cafe if I just really need to sip on a Chocolate Chai Soy Latte right now, or from the public library if I want some peace & quiet. I also loved being able to escape Germany for 2 weeks to visit my boyfriend’s family in Italy, or to work a day out of the Berlin office if I’m there for a long weekend looking at apartments. Now that I’ve moved to Berlin, I love the option of working out of the office here if I want to, or working from home or a cafe if I have things I need to take care of on the other side of the city. And because the team I’m working on is also completely distributed, there’s a great infrastructure already in place (IRC, video conferences, collaborative documents) to enable us to work in completely different countries/time zones and still feel connected.

Helping others get started contributing!

A couple of weeks ago I got to mentor my first bug on Bugzilla, and help someone else get started contributing to the thing that I had gotten started contributing to a few months ago for my Outreachy internship. Although it was a pretty simple & trivial thing, it felt great to help someone else get involved, and to realize I knew the answers to a couple of their questions, meaning that I’m actually already involved! That’s the kind of thing that really makes me want to continue working with FOSS projects after my internship ends, and makes me so appreciative of initiatives like Outreachy that help bring newcomers like me into this community.

What’s been hard?

Impostor Syndrome

The flip side of the learning-stuff fun is that, especially at the beginning, ye olde Impostor Syndrome gets to run amok. When I started my internship, I had the feeling that I had Absolutely No Idea what I was doing – over the past couple of months it has gotten gradually better, but I still have the feeling that I have a Shaky and Vague Idea of what I’m doing. From my communications with other current/former Outreachy interns, this seems to be par for the course, and I suppose it’s par for the course for anyone joining a new project or team for the first time. But even if it’s normal, it’s still there, and it’s still hard.

Working remotely

As I mentioned, overall I’ve been really enjoying the remote-work lifestyle, but it does have its drawbacks. When working from home, I find it incredibly difficult to separate my working time from my not-working time, which is most often manifested in my complete inability to stop working at the end of the day. Because I don’t have to physically walk away from my computer, at the end of the day I think “Oh, I’ll just do that one last thing,” and the next thing I know the Last Thing has led to 10 other Last Things and now it’s 11:00pm and I’ve been working for 13 hours straight. Not healthy, not fun. Also, while the flexibility and freedom of not having a fixed place of work is great, moving around (e.g. from Germany to Italy to the other side of Germany) can also be chaotic and stressful, and can make working (productively) more difficult – especially if you’re not sure where your next internet is going to come from. So the remote work thing is really a double-edged sword, and doing it in a way that preserves both flexibility and stability is clearly a balancing act that takes some practice. I’m working on it.

Measuring productivity

Speaking of working productively, how do you know when you’re doing it? Is spending a whole day reading about mock objects, or writing a blog post, or banging your head against a build failure, or [insert activity that is not writing 1000 lines of code] productive? The nature of the Outreachy system is that every project is different, and the target outcomes (or lack thereof) are determined by the project mentor, and whether or not your work is satisfactory is entirely a matter of their judgment. Luckily, my mentor is extremely fair, open, clear, and realistic about her goals for the project. She’s also been very reassuring when I’ve expressed uncertainty about productivity, and forthcoming about her satisfaction with my progress. But I feel like this is just my good luck having a mentor who a) is awesome and b) was an Outreachy intern herself once, and can thus empathize. I do wonder how my experience would be different, especially from the standpoint of knowing whether I’m measuring up to expectations, if I were on a different project with a different mentor. Which brings me to…

What’s been helpful?

Having a fantastic mentor

As I’ve just said, I feel really lucky to be working with my mentor, Maja. She’s been an incredible support throughout the internship, and has just made it a great experience. I’m really thankful to her for being so detailed & thorough in her initial conception of the project and instructions to me, and for being so consistently responsive and helpful with any of my questions or concerns. I can’t imagine a better mentor.

Being part of a team

“It takes a village,” or whatever they say, and my village is the Automation crew (who hang out in #automation on IRC) within the slightly-larger village of the Engineering Productivity team (A-Team) (#ateam). Just like my mentor, the rest of the crew and the team have also been really friendly and helpful to me so far. If Maja’s not there, if I’m working on some adjacent component, or if I have some general question, they’ve been there for me. And while having a fantastic mentor is fantastic, having a fantastic mentor within a fantastic team is double-fantastic, because it helps with the hard things like learning new tools or working remotely (especially when your mentor is in a different time zone, but other team members are in yours). So I’m also really grateful to the whole team for taking me in and treating me as one of their own.

Attending the All Hands

Apparently, at some point in the last couple of years, someone at Mozilla decided to start including Outreachy interns in the semi-annual All Hands meetings. Whoever made that decision: please accept my heartfelt thanks. Being included in the London All Hands made a real difference - not only because I understood a lot about various components of the Mozilla infrastructure that had previously been confusing or unclear to me, but also because the chance to meet and socialize with team members and other Outreachy interns face-to-face was a huge help in dealing with e.g. Impostor Syndrome and the challenges of working on a distributed team. I’m so glad I was able to join that meeting, because it really helped me feel more bonded to both Mozilla as a whole and to my specific team/project, and I hope for the sake of both Mozilla and Outreachy that Outreachyees continue to be invited to such gatherings.

Intern solidarity

Early on in the internship, one of the other Mozilla Outreachy interns started a channel just for us on Mozilla’s IRC. Having a “safe space” to check in with the other interns, ask “dumb” questions, express insecurities/frustrations, and just generally support each other is immensely helpful. On top of that, several of us got to hang out in person at the London All Hands meeting, which was fantastic. Having contact with a group of other people going through more or less the same exciting/bewildering/overwhelming/interesting experience you are is invaluable, especially if you suffer from Impostor Syndrome as so many of us do. So I’m so grateful to the other interns for their support and solidarity.

What’s up next?

In the remaining weeks of my internship, I’m going to be continuing the work I mentioned, but instead of from a library in a small German town or a random internet connection in a small Italian town, I’ll be working mainly out of the Berlin office, and hopefully getting to know more Mozillians here. I’ll also be participating in the TechSpeakers program, a training program from the Mozilla Reps to improve your public speaking skills so that you can go forth and spread the word about Mozilla’s awesome technologies. Finally, in the last week or two, I’ll be figuring out how to pass the baton, i.e. tie up loose ends, document what I’ve done and where I’m leaving off, and make it possible for someone else – whether existing team or community members, or perhaps the next intern – to continue making the Marionette test runner and its unit tests Super Awesome. And blogging all the while, of course. :) Looking forward to it!

I wrote about
the progress from our TPAC 2015 meeting previously,
and we appear to have made good progress since then.
The specification text is nearing completion,
although it is missing a few important chapters:
Some particularly obvious omissions
are the complete lack of input handling,
and a big, difficult void
where advanced user actions are meant to be.

Actions

James has been hard at work
drafting a proposal for action semantics,
which we went over in great detail.
I think it’s fair to say there had been conceptual agreement
in the working group on what the actions were meant to accomplish,
but that the details of how they were going to work
were extremely scarce.

WebDriver tries to innovate on the actions
as they appear in Selenium.
Actions in Selenium were originally meant
to provide a way to pipeline a sequence of interactions—such as
pressing down a mouse button,
moving the mouse, and releasing it—through
a complex data structure
to a single command endpoint.
The idea was that this would help address
some of the race conditions
that are intrinsically part of the one-directional design of the protocol,
and reduce latency which may be critical
when interacting with a document.

Unfortunately the pipelining design
to reduce the number of HTTP requests
was never quite implemented in Selenium,
and the API design suffered from over-specialisation
of different types of input devices and actions.
The specification attempts to rectify this
by generalising the range of input device classes,
and by associating the actions that can be performed with a certain class.
This means we are moving away from
a flat sequence of types, such as
[{type: "mouseDown"}, {type: "mouseMove"}, {type: "mouseUp"}]
to a model where each input device has its own “track”.
This limits the actions you can perform with each device,
which makes some conceptual sense
because it would be impossible to, say, type keys with a mouse
or press a mouse button with a stylus/pen input device.

The side-effect of this design is that it allows for
parallelisation of actions from one or more types of input devices.
This is an important development,
as it makes it possible to combine primitives
for input methods such as touch:
In reality, a device cannot determine whether two fingers
are “associated” with the same hand.
So instead of defining high-level actions
such as pinch and flick,
it gives you the right level of granularity
to combine actions from two or more touch “finger” devices
to synthesise more complex movements.
We believe this is a good approach
with the right level of granularity
that doesn’t try to over-specify
or shoehorn in primitives that might not make sense
in a cross-browser automation setting.
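
To give a feel for the shape of this, a two-finger pinch in the per-device model might be expressed along the lines of the Python literal below. The field names are purely illustrative; the exact syntax was still being drafted at the time:

# Illustrative action payload: one "track" per input device, run in parallel
pinch = [
    {
        "type": "pointer",
        "id": "finger1",
        "actions": [
            {"type": "pointerDown"},
            {"type": "pointerMove", "x": 200, "y": 300},
            {"type": "pointerUp"},
        ],
    },
    {
        "type": "pointer",
        "id": "finger2",
        "actions": [
            {"type": "pointerDown"},
            {"type": "pointerMove", "x": 400, "y": 300},
            {"type": "pointerUp"},
        ],
    },
]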

I’m looking forward to seeing James’ work
land in the specification text.
I think probably some explanatory notes and examples
are required to fully explain this concept
for both implementors and users.

Input locality

A known limitation of Selenium that we are not proud of
is that it does not have a good story
for input with alternative keyboard layouts.
We have explicitly phrased the specification
in such a way that it doesn’t make it impossible
to retrofit in support for multiple layouts in the future.
But right now we want to finish the baseline of the specification
before we try moving into this.

The current design ideas floating around
are to have some way of setting a keyboard layout
either through a command or a capability.
This would allow / to generate
key events for Shift and ? on an American layout,
and Shift and 7 on a Norwegian layout.
The biggest reason this is hard
is because we need to find the right key code conversion tables
for what would happen when typing
for example 茶.

Untrusted SSL certificates

We had a big discussion on invalid, self-signed,
and untrusted SSL certificates.
The general agreement in the WG
is that it would be good to have functionality
to allow a WebDriver session to bypass
the security checks associated with them,
as WebDriver may be run in an environment
where it is difficult or even impossible
to instrument the browser/environment
in such a way that they are accepted implicitly
(e.g. by modifying the root store).

Different browser vendors raised questions
over whether this would pass security review
as implementing such a feature increases the attack surface
in one of the most critical components in web browsers.
A counterargument is that by the point your browser has WebDriver enabled,
you probably have bigger things to worry about
than the fact that untrusted certificates are implicitly accepted.

We also found that this is highly inconsistently implemented in Selenium.
For the two drivers that support it,
FirefoxDriver (written and maintained by Selenium)
has an acceptSslCerts capability
that takes a boolean to switch off security checks,
and chromedriver (by Google) by contrast
accepts all certificates by default.
The remaining drivers have no support for it.

This leaves the working group free
to decide on a new and consistent approach.
One point of concern is that a boolean
to disable all security checks
seems like an overly coarse design.
A suggested alternative is to provide
a list of domains to disable the checks for,
where wildcards can be expanded
to cover every domain or every subdomain,
so that e.g. ["*"] would be
equivalent to setting acceptSslCerts to true
in today’s Firefox implementation,
but that ["*.sny.no"]
would only disable the certificate checks for that domain.

Navigation and URLs

Because WebDriver taps into the browser’s
navigation algorithm
at a much later point than when a user interacts with the address bar,
we decided that malformed URLs
should consistently return an error.
We have also changed the prose
to no longer mislead users to think that
navigating in effect
means the same as using the address bar;
the address bar is not a concept of the web platform.

There was a proposal from Mozilla
to allow navigation to relative URLs,
so that one could navigate to e.g. "/foo"
to go to the path on the current domain,
similar to how window.location = "/foo" works.
This was unfortunately voted down.
I feel it would be useful,
even just for consistency,
for the WebDriver navigation command
to mirror the platform API,
modulo security checks.

Desired vs. required capabilities

A big discussion during the meeting
was around the continuing confusion around capabilities:
Many feel they are an intermediary node concept
that is best left undefined in the core specification text itself,
because the specification explicitly
does not define any qualities or expectations
about local ends (client bindings)
or intermediary nodes
(Selenium server or proxy that gives you a session).

There was however consensus around the fact
that having a way to pick a browser configuration
from some matrix was a good idea.
The uncertainty, I think,
comes largely from driver implementors
who feel that once capabilities reach the driver
there is very little that can be done
about the sort of conflict resolution
that required- and desired capabilities warrant.

For example, what does it mean to desire a profile
and how do you know if the provided profile is valid?
We were unable to reach any agreement on this
and decided to punt the topic for our next meeting in Lisbon.

Test coverage

In order to push the specification to “Rec”
(short for Recommendation)
one must have at least two interoperable implementations
by two separate vendors.
To determine that they are interoperable, one needs a test suite.
I’ve written previously
about the test harness I wrote for the
Web Platform Tests
that integrates WebDriver spec tests
with wptrunner.

We have a few exhaustive tests for a couple of chapters,
but I hope to continue this work this quarter.

Next meeting

The working group is meeting again
for TPAC
that this year is in Lisbon (how civilised!)
in late September.
I’m enormously looking forward to visiting there
as I’ve never been.

We hope to resolve the outstanding capabilities discussion
and make final decisions on a few more minor outstanding issues then.

July 12, 2016

This piece is about too few names for too many things, as well as a kind of origin story for a web standard. For the past year or so, I’ve been contributing to a Mozilla project broadly named Marionette — a set of tools for automating and testing Gecko-based browsers like Firefox. Marionette is part of a larger browser automation universe that I’ve managed to mostly ignore so far, but the time has finally come to make sense of it.

The main challenge for me has been nailing down imprecise terms that have changed over time. From my perspective, “Marionette” may refer to any combination of two to four things, and it’s related to equally vague names like “Selenium” and “WebDriver”… and then there are things like “FirefoxDriver” and “geckodriver”. Blargh. Untangling needed.

Aside: integrating a new team member (like, say, a volunteer contributor or an intern) is the best! They ask big questions and you get to teach them things, which leads to filling in your own knowledge. Everyone wins.

The W3C WebDriver Specification

Okay, so let’s work our way backwards, starting from the future. (“The future is now.”) We want to remotely control browsers so that we can do things like write automated tests for the content they run or tests for the browser UI itself. It sucks to have to write the same test in a different way for each browser or each platform, so let’s have a common interface for testing all browsers on all platforms. (Yay, open web standards!) To this end, a group of people from several organizations is working on the WebDriver Specification.

The main idea in this specification is the WebDriver Protocol, which provides a platform- and browser- agnostic way to send commands to the browser you want to control, commands like “open a new window” or “execute some JavaScript.” It’s a communication protocol1 where the payload is some JSON data that is sent over HTTP. For example, to tell the browser to navigate to a url, a client sends a POST request to the endpoint /session/{session id of the browser instance you're talking to}/url with body {"url": "http://example.com/"}.
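
Driving that endpoint by hand from Python might look something like the sketch below, assuming the requests library, a WebDriver server listening on localhost:4444, and an already-open session whose id we know:

import requests

base_url = "http://localhost:4444"          # assumed server address
session_id = "some-existing-session-id"     # assumed open session

# Tell the browser behind that session to navigate to a URL
response = requests.post(
    "{}/session/{}/url".format(base_url, session_id),
    json={"url": "http://example.com/"},
)
print(response.status_code, response.json())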

The server side of the protocol, which might be implemented as a browser add-on or might be built into the browser itself, listens for commands and sends responses. The client side, such as a Python library for automating browsers, sends commands and processes the responses.

This broad idea is already implemented and in use: an open source project for browser automation, Selenium WebDriver, became widely adopted and is now the basis for an open web standard. Awesome! (On the other hand, oh no! The overlapping names begin!)

Selenium WebDriver

Where does this WebDriver concept come from? You may have noticed that lots of web apps are tested across different browsers with Selenium — that’s precisely what it was built for back in 2004-20092. One of its components today is Selenium WebDriver.

(Confusingly3, the terms “Selenium WebDriver”, “WebDriver”, “Selenium 2” and “Selenium” are often used interchangeably, as a consequence of the project’s history.)

Selenium WebDriver provides APIs so that you can write code in your favourite language to simulate user actions like this:
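
(The sketch below uses Selenium’s Python bindings; the page, the link text and the printed check are invented for illustration, and it assumes Firefox and its driver are installed.)

from selenium import webdriver

driver = webdriver.Firefox()                    # launch a Firefox instance
driver.get("http://example.com/")               # navigate, like typing a URL
link = driver.find_element_by_link_text("More information...")
link.click()                                    # simulate a user click
print(driver.title)                             # check something about the result
driver.quit()                                   # close the browser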

Underneath that API, commands are transmitted via JSON over HTTP, as described in the previous section. A fair name for the protocol currently implemented in Selenium is Selenium JSON Wire Protocol. We’ll come back to this distinction later.

As mentioned before, we need a server side that understands incoming commands and makes the browser do the right thing in response. The Selenium project provides this part too. For example, they wrote FirefoxDriver which is a Firefox add-on that takes care of interpreting WebDriver commands. There’s also InternetExplorerDriver, AndroidDriver and more. I imagine it takes a lot of effort to keep these browser-specific “drivers” up-to-date.

Then something cool happened

A while after Selenium 2 was released, browser vendors started implementing the Selenium JSON Wire Protocol themselves! Yay! This makes a lot of sense: they’re in the best position to maintain the server side and they can build the necessary behaviour directly into the browser.

Let’s Review

Selenium Webdriver (a.k.a. Selenium 2, WebDriver) provides a common API, protocol and browser-specific “drivers” to enable browser automation. Browser vendors started implementing the Selenium JSON Wire Protocol themselves, thus gradually replacing some of Selenium’s browser-specific drivers. Since WebDriver is already being implemented by all major browser vendors to some degree, it’s being turned into a rigorous web standard, and some day all browsers will implement it in a perfectly compatible way and we’ll all live happily ever after.

Is the Selenium JSON Wire Protocol the same as the W3C WebDriver protocol? Technically, no. The W3C spec is describing the future of WebDriver5, but it’s based on what Selenium WebDriver and browser vendors are already doing. The goal of the spec is to coordinate the browser automation effort and make sure we’re all implementing the same interface; each command in the protocol should mean the same thing across all browsers.

A Fresh Look at the Marionette Family

Now that I understand the context, my view of Marionette’s components is much clearer.

Marionette Server together with geckodriver make up Mozilla’s implementation of the W3C WebDriver protocol.

Marionette Server is built directly into Firefox (into the Gecko rendering engine) and it speaks a slightly different protocol. To make Marionette truly WebDriver-compatible, we need to translate between Marionette’s custom protocol and the WebDriver protocol, which is exactly what geckodriver does. The Selenium client can talk to geckodriver, which in turn talks to Marionette Server.

As I mentioned earlier, the plan for Selenium 3 is to have geckodriver replace Selenium’s FirefoxDriver. This is an important change: since FirefoxDriver is a Firefox add-on, it has limitations and is going to stop working altogether with future releases.

Marionette Client is Mozilla’s official Python library for remote control of Gecko, but it’s not covered by the W3C WebDriver spec and it’s not compatible with WebDriver in general. Think of it as an alternative to Selenium’s Python client with Gecko-specific features. Selenium + geckodriver should eventually replace Marionette Client, including the Gecko-specific features.

The Marionette project also includes tools for integrating with Mozilla’s intricate test infrastructure: Marionette Test Runner, a.k.a. the Marionette test harness. This part of the project has nothing to do with WebDriver, really, except that it knows how to run tests that depend on Marionette Client. The runner collects the tests you ask for, takes care of starting a Marionette session with the right browser instance, runs the tests and reports the results.6

As you can see, “Marionette” may refer to many different things. I think this ambiguity will always make me a little nervous… Words are hard, especially as a loose collection of projects evolves and becomes unified. In a few years, the terms will firm up. For now, let’s be extra careful and specify which piece we’re talking about.

Acknowledgements

Thanks to David Burns for patiently answering my half-baked questions last week, and to James Graham and Andreas Tolfsen for providing detailed and delightful feedback on a draft of this article. Bonus high-five to Anjana Vakil for contributions to Marionette Test Runner this year and for inspiring me to write this post in the first place.

Terminology lesson: the WebDriver protocol is a wire protocol because it’s at the application level and requires several applications working together. ↩

I give a range of years because Selenium WebDriver is a merger of two projects that started at different times. ↩

Abbreviated Selenium history and roadmap: Selenium 1 used an old API and mechanism called SeleniumRC, Selenium 2 favours the WebDriver API and JSON Wire Protocol, Selenium 3 will officially designate SeleniumRC as deprecated (“LegRC”, harhar), and Selenium 4 will implement the authoritative W3C WebDriver spec. ↩

For example, until recently Selenium WebDriver only included commands that are common to all browsers, with no way to use features that are specific to one. In contrast, the W3C WebDriver spec allows the possibility of extension commands. Extension commands are being implemented in Selenium clients right now! The future is now! ↩

Fun fact: Marionette is not only used for “Marionette Tests” at Mozilla. The client/server are also used to instrument Firefox for other test automation like mochitests and Web Platform Tests. ↩

In this two-part series, I’d like to share a bit of what I’ve learned about the Marionette project, and how its various components help us test Firefox by allowing us to automatically control the browser from within. Today, in Act I, I’ll give an overview of how the Marionette server and client make the browser our puppet. Later on, in Act II, I’ll describe how the Marionette test harness and runner make automated testing a breeze, and let you in on the work I’m doing on the Marionette test runner for my internship.

And since we’re talking about puppets, you can bet there’s going to be a hell of a lot of Being John Malkovich references. Consider yourself warned!

How do you test a browser?

On the one hand, you probably want to make sure that the browser’s interface, or chrome, is working as expected; that users can, say, open and close tabs and windows, type into the search bar, change preferences and so on. But you probably also want to test how the browser displays the actual web pages, or content, that you’re trying to, well, browse; that users can do things like click on links, interact with forms, or play video clips from their favorite movies.

Because let's be honest, that's like 96% of the point of the internet right there, no?

These two parts, chrome and content, are the main things the browser has to get right. So how do you test them?

Well, you could launch the browser application, type “Being John Malkovich” into the search bar and hit enter, check that a page of search results appears, click on a YouTube link and check that it takes you to a new page and starts playing a video, type “I ❤️ Charlie Kaufman and Spike Jonze” into the comment box and press enter, check that it submits the text…

And when you’re done, you could write up a report about what you tested, what worked and what didn’t, and what philosophical discoveries about the nature of identity and autonomy you made along the way.

Now, while this would be fun, if you have to do it a million times it would be less fun, and your boss might think that it is not fun at all that it takes you an entire developer-hour to test one simple interaction and report back.

Wouldn’t it be better if we could magically automate the whole thing? Maybe something like:

What is Marionette?

Marionette refers to a suite of tools for automated testing of Mozilla browsers. One important part of the project is an automation framework for Gecko, the engine that powers Firefox. The automation side of Marionette consists of the Marionette server and client, which work together to make the browser your puppet: they give you a nice, simple little wooden handle with which you can pull a bunch of strings, which are tied to the browser internals, to make the browser and the content it’s displaying do whatever you want. The client.do_stuff code above isn’t even an oversimplification; that’s exactly how easy it becomes to control the browser using Marionette’s client and server. Pretty great right?

But Marionette doesn’t stop there! In addition to the client & server giving you this easy-peasy apparatus for automatic control of the browser, another facet of the Marionette project - the test harness and runner - provides a full-on framework for automatically running and reporting tests that utilize the automation framework. This makes it easy to set up your puppet and the stage you want it to perform on, pull the strings to make the browser dance, check whether the choreography looked right, and log a review of the performance. I can see the headline now:

Screenwriter Charlie Kaufman enthralls again with a brooding script for automated browser testing

You: GROAN. That's a terrible joke. Me: Sorry, it won't be the last. Hang in there, reader.
Being John Malkovich via the dissected frog (adapted with GIFMaker.me)

As I see it, the overall aim of the Marionette project is to make automated browser testing easy. This breaks down into two main tasks: automation and testing. Here in Act I, we’ll investigate how the Marionette server and client let us automate the browser. In the next post, Act II, we’ll take a closer look at how the Marionette test harness and runner make use of that automation to test the browser.

How do Marionette’s server and client automate the browser?

A real marionette is composed of three parts:

a puppet

strings

a handle

This is a great analogy for Marionette’s approach to automating the browser (I guess that’s why they named it that).

The puppet we want to move is the Firefox browser, or to be precise, the Gecko browser engine underlying it. We want to make all of its parts – windows, tabs, pages, page elements, scripts, and so forth – dance about as we please.

The handle we use to control it is the Marionette client, a Python library that gives us an API for accessing and manipulating the browser’s components and mimicking user interactions.

The strings, which connect handle to puppet and thus make the whole contraption work, are the Marionette server. The server comprises a set of components built in to Gecko (the bottom ends of the strings), and listens for commands coming in from the client (the top ends of the strings).

Photo adapted from "Marionettes from Being John Malkovich" by Alex Headrick, via Flickr

The puppet: the Gecko browser engine

So far, I’ve been talking about “the browser” as the thing we want to automate, and the browser I have in mind is (desktop) Firefox, which Marionette indeed lets us automate. But in fact, Marionette’s even more powerful than that; we can also use it to automate other products, like Firefox for Android (codenamed “Fennec”, so cute!) or FirefoxOS/Boot to Gecko (B2G). That’s because the puppet Marionette lets us control is actually not the Firefox desktop browser itself, but rather the Gecko browser engine on top of which Firefox (like Fennec, and B2G) is built. All of the above, and any other Gecko-based products, can in principle be automated with Marionette.1, 2

So what exactly is this Gecko thing we’re playing with? Well, I’ve already revealed that it’s a browser engine - but if you’re like me at the beginning of this internship, you’re wondering what a “browser engine” even is/does. MDN explains:

Gecko’s function is to read web content, such as HTML, CSS, XUL, JavaScript, and render it on the user’s screen or print it. In XUL-based applications Gecko is used to render the application’s user interface as well.

In other words, a browser engine like Gecko takes all that ugly raw HTML/CSS/JS code and turns it into a pretty picture on your screen (or, you know, not so pretty - but a picture, nonetheless), which explains why browser engines are also called “layout engines” or “rendering engines”.

And see that bit about “XUL”? Well, XUL (XML User interface Language) is a markup language Mozilla came up with that lets you write application interfaces almost as if they were web pages. This lets Mozilla use Gecko not only to render the websites that Firefox lets you navigate to, but also to render the interface of Firefox itself: the search bar, tabs, forward and back buttons, etc. So it’s safe to say that Gecko is the heart of Firefox. And other applications, like the aforementioned Fennec and FirefoxOS, as well as the Thunderbird email client.

But wait a minute; why do we have to go all the way down to Gecko to control the browser? It’s pretty easy to write add-ons to control Firefox’s chrome or content, so why can’t we just do that? Well, first of all, security issues abound in add-on territory, which is why add-ons typically run with limited privileges and/or require approval; so an add-on-based automation system would likely give under- or over-powered control over the browser. But in fact, the real reason Marionette isn’t an add-on is more historical. As browser automation expert and Mozillian David Burns explained at SeleniumConf 2013, Marionette was originally developed to test FirefoxOS, which had the goal of using Gecko to run the entire operating system of a smartphone (hence FirefoxOS being codenamed Boot to Gecko). FirefoxOS didn’t allow add-ons, so the Marionette team had to get creative and build an automation solution right into Gecko itself. This gave them the opportunity to make Marionette an implementation of the WebDriver specification, a W3C standard for a browser automation interface. The decision to build Marionette as part of Gecko rather than an add-on thus had at least two advantages: Marionette had native access to Gecko and didn’t have to deal with add-on security issues,3 and by making Marionette WebDriver-compliant, Mozilla helped advance the standardization of browser automation.

So that’s why it’s Gecko, not Firefox, that Marionette ties strings to. In the next section, we’ll see what those knots look like.

The strings: the Marionette server

As mentioned in the previous section, Marionette is built into Gecko itself. Specifically, the part of Marionette that’s built into Gecko, which gives us native access to the browser’s chrome and content, is called the Marionette server. The server acts as the strings of the contraption: the “top end” listens for commands from the handle (i.e. the Marionette client, which we’ll get to in the next section), and the “bottom end” actually manipulates the Gecko puppet as instructed by our commands.

The code that makes up these strings is written in JavaScript and lives within the Firefox codebase at mozilla-central/testing/marionette. Let’s take a little tour, shall we?

The strings are embedded into Gecko via a component called MarionetteComponent, which, when enabled,4 starts up a MarionetteServer, which is the object that most directly represents the strings themselves.
MarionetteComponent is defined in components/marionettecomponent.js (which, incidentally, includes the funniest line in the entire Marionette codebase), while
MarionetteServer is defined in the file server.js. As you can see in server.js, MarionetteServer is responsible for tying the whole contraption together: it sets up a socket on a given port where it can listen for commands, and uses the dispatcher defined in dispatcher.js to receive incoming commands and send data about command outcomes back to the client. Together, the server and dispatcher provide a point of connection to the “bottom end” of the strings: GeckoDriver, or the part of Marionette that actually talks to the browser’s chrome and content.

GeckoDriver, so named because it’s the part of the whole Marionette apparatus that can most accurately be said to be the part automatically driving Gecko, is defined in a file that is unsurprisingly named driver.js. The driver unites a bunch of other specialized modules which control various aspects of the automation, pulling them together and calling on them as needed according to the commands received by the server.
Some examples of the “specialized modules” I’m talking about are:

element.js, which maps out and gives us access to all the elements on the page

With the help of such modules, the driver allows us to grab on to whatever part of the browser’s chrome or content interests us and interact with it as we see fit.

The Marionette server thus lets us communicate automatically with the browser, acting as the strings that get tugged on by the handle and in turn pull on the various limbs of our Gecko puppet. The final part of the puzzle, and subject of the next section, is the handle we can use to tell Marionette what it is we want to do.

The handle: the Marionette client

For us puppeteers, the most relevant part of the whole marionette apparatus is the one we actually have contact with: the handle. In Marionette, that handle is called the Marionette client. The client gives us a convenient API for communicating with Gecko via the Marionette server and driver described in the previous section.

To start pulling Marionette’s strings, all we need to do is instantiate a Marionette object (a client), tell it to open up a “session” with the server, i.e. a unique connection that allows messages to be sent back and forth between the two, and give it some commands to send to the server, which (as we saw above) executes them in Gecko. And of course, this assumes that the instance of Firefox (or our Gecko-based product of choice) has Marionette enabled,4 i.e. that there’s a Marionette server ready and waiting for our commands; if the server’s disabled, the strings we’re pulling won’t actually be connected to anything.

The commands we give can either make changes to the browser’s state (e.g. navigate to a new page, click on an element, resize the window, etc.), or return information about that state (e.g. get the URL of the current page, get the value of a property of a certain element, etc.). When giving commands, we have to be mindful of which context we’re in, i.e. whether we’re trying to do something with the browser’s chrome or its content. Some commands are specific to one context or the other (e.g. navigating to a given URL is a content operation), while others work in both contexts (e.g. clicking on an element can pertain to either of the two). Luckily, the client API gives us an easy one-liner to switch from one context to another.

# Import browser automation magic
from marionette import Marionette
from marionette_driver.by import By

# Instantiate the Marionette (client) class
client = Marionette(host='localhost', port=2828)

# Start a session to talk to the server
# (otherwise, nothing will work because the strings aren't connected)
client.start_session()

# Give a command and check that it worked
client.navigate("https://www.mozilla.org")
assert "building a better Internet" in client.title

# Switch to the chrome context for a minute (content is default)
with client.using_context(client.CONTEXT_CHROME):
    urlbar = client.find_element(By.ID, "urlbar")
    urlbar.send_keys("about:robots")  # try this one out yourself!

# Close the window, which ends the session
client.close()

It’s as simple as that! We’ve got an easy-to-use, high-level API that gives us full control over the browser, in terms of both chrome and content. Given this simple handle provided by the client, we don’t really need to worry about the mechanics of the server strings or the Gecko puppet itself; instead, we can concern ourselves with the dance steps we want the browser to perform.

Arabesque? Pirouette? Pas de bourrée? You name it!

(Image: Being John Malkovich, via ScreenplayHowTo)

The browser is our plaything! Now what?

So now we’ve seen how the Marionette client and server give us the apparatus to control the browser like a puppet. The client gives us a simple handle we can manipulate, the server ties strings to that handle that transmit our wishes directly to the Gecko browser engine, which dances about as we please.

But what about checking that it’s got the choreography right? Well, as mentioned above, the client API not only lets us make changes to the browser’s chrome and content, but also gives us information about their current state. And since it’s just regular old Python code, we can use simple assert statements to perform quick checks, as in the example in the last section. But if we want to test the browser and user interactions with it more thoroughly, we could probably use a more developed and full-featured testing framework.

ANOTHER SPOILER ALERT (well OK I actually already spoiled this one): the Marionette project gives us a tool for that too!

In Act II of this article, we’ll explore the Marionette test harness and runner, which wrap the server-client automation apparatus described here in Act I in a testing framework that makes it easy to set the stage, perform a dance, and write a review of the performance. See you back here after the intermission!

Sources

The brains of Mozillians David Burns, Maja Frydrychowicz, Henrik Skupin, and Andreas Tolfsen. Many thanks to them for their feedback on this article.

Notes

1 Emphasis on the “in principle” part, because getting Marionette to play nicely with all of these products may not be trivial. For example, my internship mentor Maja has been hard at work recently on the Marionette runner and client to make them Fennec-friendly. ↩

2 If you’re interested in automating a browser that doesn’t run on Gecko, don’t fret! Marionette is Gecko-specific, but it’s an implementation of the engine-agnostic WebDriver standard, a W3C specification for a browser automation protocol. Given a WebDriver implementation for the engine in question, any browser can in principle be automated in the same way that Marionette automates Gecko browsers. ↩

3 In fact, Marionette’s built-in design makes it able to circumvent the add-on signing requirement mentioned earlier; this would be dangerous if exposed to end users (see 4), but comes in handy when Mozilla developers need to inject unsigned add-ons into development versions of Gecko browsers in automation. ↩

4 At this point (or some much-earlier point) you might be wondering:

Wait a minute - if the Marionette server is built into Gecko itself, and gives us full automatic control of the browser, isn’t that a security risk?

5 Prefer to talk to the Marionette server from another language? No problem! All you need to do is implement a client in your language of choice, which is pretty simple since the WebDriver specification that Marionette implements uses bog-standard JSON over HTTP for client-server communications. If you want to use JavaScript, your job is even easier: you can take advantage of a JS Marionette client developed for the B2G project. ↩

APK Size

Here’s how the APK size changed over the quarter, for mozilla-central Android 4.0 API15+ opt builds:

APK size generally grew, generally in small increments. Our APK is about 1.3 MB larger today than it was 3 months ago. The largest increase, of about 400 KB around May 4, was caused by and discussed in bug 1260208. The largest decrease, of about 200 KB around April 25, was caused by bug 1266102.

For the same period, libxul.so also generally grew gradually:

Memory

These memory measurements are fairly steady over the quarter, with a gradual increase over time.

Autophone-Talos

This section tracks Perfherder graphs for mozilla-central builds of Firefox for Android, for Talos tests run on Autophone, on android-6-0-armv8-api15. The test names shown are those used on treeherder. See https://wiki.mozilla.org/Buildbot/Talos for background on Talos.

tsvgx

An SVG-only number that measures SVG rendering performance. About half of the tests are animations or iterations of rendering. This ASAP test (tsvgx) iterates in unlimited frame-rate mode, thus reflecting the maximum rendering throughput of each test. The reported value is the page load time or, for animations/iterations, the overall duration the sequence/animation took to complete. Lower values are better.

tp4m

Generic page load test. Lower values are better.

No significant improvements or regressions noted for tsvgx or tp4m.

Autophone

Throbber Start / Throbber Stop

Browser startup performance is measured on real phones (a variety of popular devices).

For the first time on this blog, I’ve pulled this graph from Perfherder, rather than phonedash. A wealth of throbber start/throbber stop data is now available in Perfherder. Here’s a quick summary for the local blank page test on various devices:

June 28, 2016

(Photo: On my way to the venue)

Last week was the pytest development sprint located in the beautiful town of Freiburg, Germany. I had been really looking forward to the sprint, and being immediately after the Mozilla all-hands in London I was still buzzing with excitement when I started my journey to Freiburg.

On the first morning I wasn’t really sure how to get to our sprint venue via public transport, and it didn’t seem too far to walk from my hotel. It was a lovely sunny morning, and I arrived just in time for the introductions. Having been a pytest user for over five years, I was already familiar with Holger, Ronny, and a few others, but this was the first time meeting them in person. We then spent some time planning out our first day and coming up with things to work on throughout the week. My first activity was pairing up to work on pytest issues.

For my first task I paired with Daniel to work on an issue he had recently encountered, which I had also needed to work around in recent versions of pytest. It turned out to be quite a complex issue related to determining the root directory, which is used for relative references to test files as well as a number of other things. The fix seemed simple at first – we just needed to exclude arguments that are not paths from consideration when determining the root directory – however, there were a number of edge cases that needed resolving. The patch to fix this has not yet landed, but I’m feeling confident that it will be merged soon. When it does, I think we’ll be able to close at least three related issues!

(Photo: Ducks in the town centre)

Next, I worked with Bruno on moving a bunch of my plugins to the pytest-dev GitHub organisation. This allows any of the pytest core team to merge fixes to my plugins and means I’m not a blocker for any important bug fixes. I’m still intending on supporting the plugins, but it feels good to have a larger team looking after them if needed. The plugins I moved are pytest-selenium, pytest-html, pytest-variables, and pytest-base-url. Later in the week we also moved pytest-repeat with the approval of the author, who is happy for someone else to maintain the project.

If you’ve never used pytest, then you might expect to be able to simply run pytest on the command line to run your tests. Unfortunately, this isn’t the case, and the reason is that the tool used to be part of a collection of other tools, all with a common prefix of py., so you’d run your tests using the py.test command. I’m pleased to say that I worked with Oliver during the sprint to introduce pytest as the recommended command line entry point for pytest. Don’t worry – the old entry point will still work when 3.0 is released, but we should be able to skip a bunch of confusion for new users!

On Thursday we took a break, riding a cable car up a nearby peak and hiking around for a few hours. I also finally got an opportunity to order a slice of Schwarzwälder Kirschtorte (Black Forest gateau), which is named after the area and was a favourite of mine growing up. The break was needed after my brain had been working overtime processing the various talks, demonstrations, and discussions. We still talked a lot about the project, but being out in the beautiful scenery watching para-gliders gracefully circling made for a great day.

(Photo: Hefeweizen!)

When we returned to our sprint venue on Friday I headed straight into a bug triage with Tom, which ended up mostly focusing on one particular issue. The issue relates to hiding what is at first glance redundant information in the case of a failure, but on closer inspection there are actually many examples where this extra line in the explanation can be very useful.

Unfortunately I had to leave on Saturday morning, which meant I missed out on the final day of the sprint. I have to say that I can’t wait to attend the next one as I had so much fun getting to know everyone, learning some handy phrases in a number of different languages, and commiserating/laughing together in the wake of Brexit! I’m already missing my fellow sprinters!

June 20, 2016

Last week, all of Mozilla met in London for a whirlwind tour from TARDIS to TaskCluster, from BBC1 to e10s, from Regent Park to the release train, from Paddington to Positron. As an Outreachy intern, I felt incredibly lucky to be part of this event, which gave me a chance to get to know Mozilla, my team, and the other interns much better. It was a jam-packed work week of talks, meetings, team events, pubs, and parties, and it would be impossible to blog about all of the fun, fascinating, and foxy things I learned and did. But I can at least give you some of the highlights! Or, should I say, Who-lights? (Be warned, that is not the last pun you will encounter here today.)

Role models

While watching the plenary session that kicked off the week, it felt great to realize that of the 4 executives emerging from the TARDIS in the corner to take the stage (3 Mozillians and 1 guest star), a full 50% were women. As I had shared with my mentor (also a woman) before arriving in London, one of my goals for the week was to get inspired by finding some new role moz-els (ha!): Mozillians who I could aspire to be like one day, especially those of the female variety.

Why a female role model, specifically? What does gender have to do with it?

Well, to be a good role model for you, a person needs to not only have a life/career/lego-dragon you aspire to have one day, but also be someone who you can already identify with, and see yourself in, today. A role model serves as a bridge between the two. As I am a woman, and that is a fundamental part of my experience, a role model who shares that experience is that much easier for me to relate to. I wouldn’t turn down a half-Irish-half-Indian American living in Germany, either.

At any rate, in London I found no shortage of talented, experienced, and - perhaps most importantly - valued women at Mozilla. I don’t want to single anyone out here, but I can tell you that I met women at all levels of the organization, from intern to executive, who have done and are doing really exciting things to advance both the technology and culture of Mozilla and the web. Knowing that those people exist, and that what they do is possible, might be the most valuable thing I took home with me from London.

Electrolysis (e10s)

Electrolysis, or “e10s” for those who prefer integers to morphemes, is a massive and long-running initiative to separate the work Firefox does into multiple processes.

At the moment, the Firefox desktop program that the average user downloads and uses to explore the web runs in a single process. That means that one process has to do all the work of loading & displaying web pages (the browser “content”), as well as the work of displaying the user interface and its various tabs, search bars, sidebars, etc. (the browser “chrome”). So if something goes wrong with, say, the execution of a poorly-written script on a particular page, instead of only that page refusing to load, or its tab perhaps crashing, the entire browser itself may hang or crash.

That’s not cool. Especially if you often have lots of tabs open. Not that I ever do.

Of course not. Anyway, even less cool is the possibility that some jerk (not that there are any of those on the internet, though, right?) could make a page with a script that hijacks the entire browser process, and does super uncool stuff.

It would be much cooler if, instead of a single massive process, Firefox could use separate processes for content and chrome. Then, if a page crashes, at least the UI still works. And if we assign the content process(es) reduced permissions, we can keep possibly-jerkish content in a nice, safe sandbox so that it can’t do uncool things with our browser or computer.

It’s not perfect yet - for example, compatibility with right-to-left languages, accessibility (or “a11y”, if “e10s” needs a buddy), and add-ons is still an issue - but it’s getting there, and it’s rolling out real soon! Given that the project has been underway since 2008, that’s pretty exciting.

Rust, Servo, & Oxidation

I first heard about the increasingly popular language Rust when I was at the Recurse Center last fall, and all I knew about it was that it was being used at Mozilla to develop a new browser engine called Servo.

More recently, I heard talks from Mozillians like E. Dunham that revealed a bit more about why people are so excited about Rust: it’s a new language for low-level programming, and compared with the current mainstay C, it guarantees memory safety. As in, “No more segfaults, no more NULLs, no more dangling pointers’ dirty looks”. It’s also been designed with concurrency and thread safety in mind, so that programs can take better advantage of e.g. multi-core processors. (Do not ask me to get into details on this; the lowest level I have programmed at is probably sitting in a beanbag chair. But I believe them when they say that Rust does those things, and that those things are good.)

OK OK OK, so Rust is a super cool new language. What can you do with it?

Well, lots of stuff. For example, you could write a totally new browser engine, and call it Servo.

Wait, what’s a browser engine?

A browser engine (aka layout or rendering engine) is basically the part of a browser that allows it to show you the web pages you navigate to. That is, it takes the raw HTML and CSS content of the page, figures out what it means, and turns it into a pretty picture for you to look at.

Uh, I’m pretty sure I can see web pages in Firefox right now. Doesn’t it already have an engine?

Indeed it does. It’s called Gecko, and it’s written in C++. It lets Firefox make the web beautiful every day.

So why Servo, then? Is it going to replace Gecko?

No. Servo is an experimental engine developed by Mozilla Research; it’s just intended to serve(-o!) as a playground for new ideas that could improve a browser’s performance and security.

The beauty of having a research project like Servo and a real-world project like Gecko under the same roof at Mozilla is that when the Servo team’s research unveils some new and clever way of doing something faster or more awesomely than Gecko does, everybody wins! That’s thanks to the Oxidation project, which aims to integrate clever Rust components cooked up in the Servo lab into Gecko. Apparently, Firefox 45 already got (somewhat unexpectedly) an MP4 metadata parser in Rust, which has been running just fine so far. It’s just the tip of the iceberg, but the potential for cool ideas from Servo to make their way into Gecko via Oxidation is pretty exciting.

The Janitor

Another really exciting thing I heard about during the week is The Janitor, a tool that lets you contribute to FOSS projects like Firefox straight from your browser.

For me, one of the biggest hurdles to contributing to a new open-source project is getting the development environment all set up.

Ugh I hate that. I just want to change one line of code, do I really need to spend two days grappling with installation and configuration?!?

Powered by the very cool Cloud9 IDE, the Janitor gives you one-click access to a ready-to-go, cloud-based development environment for a given project. At the moment there are a handful of projects supported (including Firefox, Servo, and Google Chrome), and new ones can be added by simply writing a Dockerfile. I’m not sure that an easier point of entry for new FOSS contributors is physically possible. The ease of start-up is perfect for short-term contribution efforts like hackathons or workshops, and thanks to the collaborative features of Cloud9 it’s also perfect for remote pairing.

Awesome, I’m sold. How do I use it?

Unfortunately, the Janitor is still in alpha and invite-only, but you can go to janitor.technology and sign up to get on the waitlist. I’m still waiting to get my invite, but if it’s half as fantastic as it seems, it will be a huge step forward in making it easier for new contributors to get involved with FOSS projects. If it starts supporting offline work (apparently the Cloud9 editor is somewhat functional offline already, once you’ve loaded the page initially, but the terminal and VNC always need a connection to function), I think it’ll be unstoppable.

L20n

The last cool thing I heard about (literally, it was the last session on Friday) at this work week was L20n.

Wait, I thought “localization” was abbreviated “L10n”?

Yeah, um, that’s the whole pun. Way to be sharp, exhausted-from-a-week-of-talks-Anjana.

See, L20n is a next-generation framework for web and browser localization (l10n) and internationalization (i18n). It’s apparently a long-running project too, born out of the frustrations of the l10n status quo.

According to the L20n team, at the moment the localization system for Firefox is spread over multiple files with multiple syntaxes, which is no fun for localizers, and multiple APIs, which is no fun for developers. What we end up with is program logic intermingling with l10n/i18n decisions (say, determining the correct format for a date) such that developers, who probably aren’t also localizers, end up making decisions about language that should really be in the hands of the localizers. And if a localizer makes a syntax error when editing a certain localization file, the entire browser refuses to run. Not cool.

Pop quiz: what’s cool?

Um…

C’mon, we just went over this. Go on and scroll up.

Electrolyis?

Yeah, that’s cool, but thinking more generally…

Separation?

That’s right! Separation is super cool! And that’s what L20n does: separate l10n code from program source code. This way, developers aren’t pretending to be localizers, and localizers aren’t crashing browsers. Instead, developers are merely getting localized strings by calling a single L20n API, and localizers are providing localized strings in a single file format & syntax.

Wait but, isn’t unifying everything into a single API/file format the opposite of separation? Does that mean it’s not cool?

Shhh. Meaningful separation of concerns is cool. Arbitrary separation of a single concern (l10n) is not cool. L20n knows the difference.

OK, fine. But first “e10s” and “a11y”, now “l10n”/”l20n” and “i18n”… why does everything need a numbreviation?

June 16, 2016

It was nearly a year ago that Microsoft shipped their first implementation of WebDriver. I remember being so excited as I wrote a blog post about it.

This week, Apple have said that they are going to be shipping a version of WebDriver that will allow people to drive Safari 10 in macOS. In the release notes they have created safari driver that will be shipping with the OS.

In addition to new Web Inspector features in Safari 10, we are also bringing native WebDriver support to macOS. https://t.co/PfwmkRIBIV

If you have ever wondered why this is important, have a read of my last blog post. In Firefox 47, Selenium caused Firefox to crash on startup. The Mozilla implementation of WebDriver, made up of Marionette and GeckoDriver, would never have hit this problem, because test failures and crashes like this lead to patches being reverted and never shipped to end users.

This post brought to you from Mozilla’s London All Hands meeting - cheers!

When writing Python unit tests, sometimes you want to just test one specific aspect of a piece of code that does multiple things.

For example, maybe you’re wondering:

Does object X get created here?

Does method X get called here?

Assuming method X returns Y, does the right thing happen after that?

Finding the answers to such questions is super simple if you use mock: a library which “allows you to replace parts of your system under test with mock objects and make assertions about how they have been used.” Since Python 3.3 it’s available simply as unittest.mock, but if you’re using an earlier Python you can get it from PyPI with pip install mock.
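Getting hold of it is just a matter of importing the right name for your Python version; here’s a tiny shim (nothing Marionette-specific, just the standard dance):

# Prefer the standard-library version, fall back to the PyPI package
try:
    from unittest import mock  # Python 3.3+
except ImportError:
    import mock  # older Pythons: pip install mock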

So, what are mocks? How do you use them?

Well, in short I could tell you that a Mock is a sort of magical object that’s intended to be a doppelgänger for some object in your code that you want to test. Mocks have special attributes and methods you can use to find out how your test is using the object you’re mocking. For example, you can use Mock.called and .call_count to find out if and how many times a method has been called. You can also manipulate Mocks to simulate functionality that you’re not directly testing, but is necessary for the code you’re testing. For example, you can set Mock.return_value to pretend that a function gave you some particular output, and make sure that the right thing happens in your program.
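To make that concrete, here’s a tiny self-contained sketch (the names are made up) showing called, call_count, and return_value in action:

from mock import Mock

# A Mock stands in for some real object and records how it gets used
fake_fn = Mock(return_value=42)

result = fake_fn("hello")

assert fake_fn.called                # it was called at least once
assert fake_fn.call_count == 1       # exactly once, in fact
fake_fn.assert_called_with("hello")  # and with these arguments
assert result == 42                  # the return_value we configured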

But honestly, I don’t think I could give a better or more succinct overview of mocks than the Quick Guide, so for a real intro you should go read that. While you’re doing that, I’m going to watch this fantastic Michael Jackson video:

Oh you’re back? Hi! So, now that you have a basic idea of what makes Mocks super cool, let me share with you some of the tips/tips/trials/tribulations I discovered when starting to use them.

Patches and namespaces

When you import a helper module into a module you’re testing, the tested module gets its own namespace for the helper module. So if you want to mock a class from the helper module, you need to mock it within the tested module’s namespace.

For example, let’s say I have a Super Useful helper module, which defines a class HelperClass that is So Very Helpful:
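For the sake of illustration, imagine it looks something like this:

# helper.py (hypothetical)
class HelperClass(object):
    def help(self):
        return "I am So Very Helpful!"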

Now, let’s say that it is Incredibly Important that I make sure that a HelperClass object is actually getting created in tested, i.e. that HelperClass() is being called. I can write a test module that patches HelperClass, and check the resulting Mock object’s called property. But I have to be careful that I patch the right HelperClass! Consider test_tested.py:
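Suppose, again purely for illustration, that there are two modules using the helper: tested.py imports the class directly, while tested2.py imports the helper module and goes through it. Then test_tested.py might look like this:

# tested.py (hypothetical): binds the name HelperClass in its own namespace
from helper import HelperClass

def fn():
    return HelperClass()

# tested2.py (hypothetical): only ever looks the class up via the helper module
import helper

def fn():
    return helper.HelperClass()

# test_tested.py
import tested
from mock import patch

# This IS what I want: patch HelperClass where tested looks it up
@patch('tested.HelperClass')
def test_helper_right(mock_HelperClass):
    tested.fn()
    assert mock_HelperClass.called  # Passes! :)

# This is NOT what I want: tested already holds a reference to the real class
@patch('helper.HelperClass')
def test_helper_wrong(mock_HelperClass):
    tested.fn()
    assert mock_HelperClass.called  # Fails: the real HelperClass was used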

For tested2, on the other hand, I need to patch the class with patch('helper.HelperClass') rather than patch('tested2.HelperClass'), because tested2 never binds the name HelperClass in its own namespace. Consider test_tested2.py:

# test_tested2.py
import tested2
from mock import patch

# This time, this IS what I want:
@patch('helper.HelperClass')
def test_helper_2_right(mock_HelperClass):
    tested2.fn()
    assert mock_HelperClass.called  # Passes! I am not sad :)

# And this is NOT what I want!
# Mock will complain: "module 'tested2' does not have the attribute 'HelperClass'"
@patch('tested2.HelperClass')
def test_helper_2_wrong(mock_HelperClass):
    tested2.fn()
    assert mock_HelperClass.called

Wonderful!

In short: be careful of which namespace you’re patching in. If you patch whatever object you’re testing in the wrong namespace, the object that’s created will be the real object, not the mocked version. And that will make you confused and sad.

I was confused and sad when I was trying to mock the TestManifest.active_tests() function to test BaseMarionetteTestRunner.add_test, and I was trying to mock it in the place it was defined, i.e. patch('manifestparser.manifestparser.TestManifest.active_tests').

Instead, I had to patch TestManifest within the runner.base module, i.e. the place where it was actually being called by the add_test function, i.e. patch('marionette.runner.base.TestManifest.active_tests').

So don’t be confused or sad, mock the thing where it is used, not where it was defined!

Pretending to read files with mock_open

One thing I find particularly annoying is writing tests for modules that have to interact with files. Well, I guess I could, like, write code in my tests that creates dummy files and then deletes them, or (even worse) just put some dummy files next to my test module for it to use. But wouldn’t it be better if I could just skip all that and pretend the files exist, and have whatever content I need them to have?

It sure would! And that’s exactly the type of thing mock is really helpful with. In fact, there’s even a helper called mock_open that makes it super simple to pretend to read a file. All you have to do is patch the builtin open function, and pass in mock_open(read_data="my data") to the patch to make the open in the code you’re testing only pretend to open a file with that content, instead of actually doing it.

To see it in action, you can take a look at a (not necessarily great) little test I wrote that pretends to open a file and read some data from it:

import pytest
from mock import patch, mock_open

def test_nonJSON_file_throws_error(runner):
    with patch('os.path.exists') as exists:
        exists.return_value = True
        with patch('__builtin__.open', mock_open(read_data='[not {valid JSON]')):
            with pytest.raises(Exception) as json_exc:
                runner._load_testvars()  # This is the code I want to test, specifically to be sure it throws an exception
            assert 'not properly formatted' in json_exc.value.message

Gotchya: Mocking and debugging at the same time

See that patch('os.path.exists') in the test I just mentioned? Yeah, that’s probably not a great idea. At least, I found it problematic.

I was having some difficulty with a similar test, in which I was also patching os.path.exists to fake a file (though that wasn’t the part I was having problems with), so I decided to set a breakpoint with pytest.set_trace() to drop into the Python debugger and try to understand the problem. The debugger I use is pdb++, which just adds some helpful little features to the default pdb, like colors and sticky mode.

So there I am, merrily debugging away at my (Pdb++) prompt. But as soon as I entered the patch('os.path.exists') context, I started getting weird behavior in the debugger console: complaints about some ~/.fancycompleterrc.py file and certain commands not working properly.

It turns out that at least one module pdb++ was using (e.g. fancycompleter) was getting confused about file(s) it needs to function, because of checks for os.path.exists that were now all messed up thanks to my ill-advised patch. This had me scratching my head for longer than I’d like to admit.

What I still don’t understand (explanations welcome!) is why I still got this weird behavior when I tried to change the test to patch 'mymodule.os.path.exists' (where mymodule.py contains import os) instead of just 'os.path.exists'. Based on what we saw about namespaces, I figured this would restrict the mock to only mymodule, so that pdb++ and related modules would be safe - but it didn’t seem to have any effect whatsoever. But I’ll have to save that mystery for another day (and another post).

Still, lesson learned: if you’re patching a commonly used function, like, say, os.path.exists, don’t forget that once you’re inside that mocked context, you no longer have access to the real function at all! So keep an eye out, and mock responsibly!

Mock the night away

Those are just a few of the things I’ve learned in my first few weeks of mocking. If you need some bedtime reading, check out these resources that I found helpful:

I’m sure mock has all kinds of secrets, magic, and superpowers I’ve yet to discover, but that gives me something to look forward to! If you have mock-foo tips to share, just give me a shout on Twitter!

June 14, 2016

With the release of Firefox 47, the extension-based FirefoxDriver is no longer working. There was a change in Firefox that caused the browser to crash when Selenium started it. The change has been fixed, but there is a process for getting the fix released, which is slow (to make sure we don't break anything else), so hopefully that version is due for release in the next week or so.

This does not mean that your tests need to stop working entirely as there are options to keep them working.

Marionette

Firstly, you can use Marionette, the Mozilla version of FirefoxDriver, to drive Firefox. It has been in Firefox since about version 24, and we have slowly been working, around other Mozilla priorities, to bring it up to Selenium's level. Currently Marionette is passing ~85% of the Selenium test suite.

Firefox 45 ESR or Firefox 46

If you don't want to worry about Marionette, the other option is to downgrade to Firefox 45, preferably the ESR, as it won't update to 47; it will update to Firefox 52 in about 6-9 months' time, at which point you will need to use Marionette.

Marionette will be turned on by default from Selenium 3, which is currently being worked on by the Selenium community. Ideally, when Firefox 52 comes around you will just update to Selenium 3 and, fingers crossed, everything will work as planned.

June 03, 2016

When it comes to version control, I’m a Git girl. I had to use Subversion a little bit for a project in grad school (not distributed == not so fun). But I had never touched Mercurial until I decided to contribute to Mozilla’s Marionette, a testing tool for Firefox, for my Outreachy application. Mercurial is the main version control system for Firefox and Marionette development,1 so this gave me a great opportunity to start learning my way around the hg. Turns out it’s really close to Git, though there are some subtle differences that can be a little tricky. This post documents the basics and the trip-ups I discovered. Although there’s plenty of other info out there, I hope some of this might be helpful for others (especially other Gitters) using Mercurial or contributing to Mozilla code for the first time. Ready to heat things up? Let’s do this!

Getting my bearings on Planet Mercury

OK, so I’ve been working through the Firefox Onramp to install Mercurial (via the bootstrap script) and clone the mozilla-central repository, i.e. the source code for Firefox. This is just like Git; all I have to do is:
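Roughly speaking, and assuming the standard clone URL from the Onramp, something like:

$ hg clone https://hg.mozilla.org/mozilla-central/
$ cd mozilla-central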

Cool, I’ve set foot on a new planet! …But where am I? What’s going on?

Just like in Git, I can find out about the repo’s history with hg log. Adding some flags makes this even more readable: I like to --limit the number of changesets (change-whats? more on that later) displayed to a small number, and show the --graph to see how changes are related. For example:
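$ hg log --graph --limit 5   # any small limit will do; -G and -l are the short forms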

Some (confusing) terminology

Changesets/revisions and their identifiers

According to the official definition, a changeset is “an atomic collection of changes to files in a repository.” As far as I can tell, this is basically what I would call a commit in Gitese. For now, that’s how I’m going to think of a changeset, though I’m sure there’s some subtle difference that’s going to come back to bite me later. Looking forward to it!

Changesets are also called revisions (because two names are better than one?), and each one has (confusingly) two identifying numbers: a local revision number (a small integer), and a global changeset ID (a 40-digit hexadecimal, more like Git’s commit IDs). These are what you see in the output of hg log above in the format:

changeset: <revision-number>:<changeset-ID>

For example,

changeset: 300339:e27fe24a746f

is the changeset with revision number 300339 (its number in my copy of the repo) and changeset ID e27fe24a746f (its number everywhere).

Why the confusing double-numbering? Well, apparently because revision numbers are “shorter to type” when you want to refer to a certain changeset locally on the command line; but since revision numbers only apply to your local copy of the repo and will “very likely” be different in another contributor’s local copy, you should only use changeset IDs when discussing changes with others. But on the command line I usually just copy-paste the hash I want, so length doesn’t really matter, so… I’m just going to ignore revision numbers and always use changeset IDs, OK Mercurial? Cool.
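Either number works anywhere Mercurial expects a changeset; for instance, using the example changeset from above, these two commands show the same thing, but only the second one means anything outside my local copy of the repo:

$ hg log -r 300339          # local revision number: only valid in my clone
$ hg log -r e27fe24a746f    # changeset ID: the same changeset everywhere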

Branches, bookmarks, heads, and the tip

I know Git! I know what a “branch” is! - Anjana, learning Mercurial

Yeeeah, about that… Unfortunately, this term in Gitese is a false friend of its Mercurialian equivalent.

In the land of Gitania, when it’s time to start work on a new bug/feature, I make a new branch, giving it a feature-specific name; do a bunch of work on that branch, merging in master as needed; then merge the branch back into master whenever the feature is complete. I can make as many branches as I want, whenever I want, and give them whatever names I want.

This is because in Git, a “branch” is basically just a pointer (a reference or “ref”) to a certain commit, so I can add/delete/change that pointer whenever and however I want without altering the commit(s) at all. But on Mercury, a branch is simply a “diverged” series of changesets; it comes to exist simply by virtue of a given changeset having multiple children, and it doesn’t need to have a name. In the output of hg log --graph, you can see the branches on the left hand side: continuation of a branch looks like |, merging |\, and branching |/. Here are some examples of what that looks like.
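As a rough, made-up illustration, a stretch of hg log --graph output with one branch point and one merge might look something like this (with @ marking the changeset I’m currently on):

@    changeset 5 (a merge of 3 and 4)
|\
| o  changeset 4
| |
o |  changeset 3
|/
o    changeset 2
|
o    changeset 1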

Confusingly, Mercurial also has named branches, which are intended to be longer-lived than branches in Git, and actually become part of a commit’s information; when you make a commit on a certain named branch, that branch is part of that commit forever. This post has a pretty good explanation of this.

Luckily, Mercurial does have an equivalent to Git’s branches: they’re called bookmarks. Like Git branches, Mercurial bookmarks are just handy references to certain commits. I can create a new one thus:

$ hg bookmark my-awesome-bookmark

When I make it, it will point to the changeset I’m currently on, and if I commit more work, it will move forward to point to my most recent changeset. Once I’ve created a bookmark, I can use its name pretty much anywhere I can use a changeset ID, to refer to the changeset the bookmark is pointing to: e.g. to point to the bookmark I can do hg up my-awesome-bookmark. I can see all my bookmarks and the changesets they’re pointing to with the command:
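$ hg bookmarks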

When I’m on a bookmark, it’s “active”; the currently active bookmark is indicated with a *.

OK, maybe I was wrong about branches, but at least I know what the “HEAD” is! - Anjana, a bit later

Yeah, nope. I think of the “HEAD” in Git as the branch (or commit, if I’m in “detached HEAD” state) I’m currently on, i.e. a pointer to (the pointer to) the commit that would end up the parent of whatever I commit next. In Mercurial, this doesn’t seem to have a special name like “HEAD”, but it’s indicated in the output of hg log --graph by the symbol @. However, Mercurial documentation does talk about heads, which are just the most recent changesets on all branches (regardless of whether those branches have names or bookmarks pointing to them or not).2 You can see all those with the command hg heads.

The head which is the most recent changeset, period, gets a special name: the tip. This is another slight difference from Git, where we can talk about “the tip of a branch”, and therefore have several tips. In Mercurial, there is only one: the most recent changeset in the entire history (regardless of branch structure). It’s labeled in the output of hg log with tag: tip.

All the world’s a stage (but Mercury’s not the world)

Just like with Git, I can use hg status to see the changes I’m about to commit before committing with hg commit. However, what’s missing is the part where it tells me which changes are staged, i.e. “to be committed”. Turns out the concept of “staging” is unique to Git; Mercurial doesn’t have it. That means that when you type hg commit, any changes to any tracked files in the repo will be committed; you don’t have to manually stage them like you do with git add <file> (hg add <file> is only used to tell Mercurial to track a new file that it’s not tracking yet).

However, just like you can use git add --patch to stage individual changes to a certain file a la carte, you can use the now-standard record extension to commit only certain files or parts of files at a time with hg commit --interactive. I haven’t yet had occasion to use this myself, but I’m looking forward to it!

Turning back time

I can mess with my Mercurial history in almost exactly the same way as I would in Git, although whereas this functionality is built in to Git, in Mercurial it’s accomplished by means of extensions. I can use the rebase extension to rebase a series of changesets (say, the parents of the active bookmark location) onto a given changeset (say, the latest change I pulled from central) with hg rebase, and I can use the hg histedit command provided by the histedit extension to reorder, edit, and squash (or “fold”, to use the Mercurialian term) changesets like I would with git rebase --interactive.

My Mozilla workflow

In my recent work refactoring and adding unit tests for Marionette’s Python test runner, I use a workflow that goes something like this.

I’m gonna start work on a new bug/feature, so first I want to make a new bookmark for work that will branch off of central:

$ hg up central
$ hg bookmark my-feature

Now I go ahead and do some work, and when I’m ready to commit it I simply do:

$ hg commit

which opens my default editor so that I can write a super great commit message. It’s going to be informative and formatted properly for MozReview/Bugzilla, so it might look something like this:
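Purely as an illustration (the bug number and reviewer here are made up), the summary line follows the usual Bugzilla/MozReview convention:

Bug 1234567 - Refactor how the Marionette test runner collects its tests; r?reviewer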

After working for a while, it’s possible that some new changes have come in on central (this happens about daily), so I may need to rebase my work on top of them. I can do that with:

$ hg pull central

followed by:

$ hg rebase -d central

which rebases the commits in the branch that my bookmark points to onto the most recent changeset in central. Note that this assumes that the bookmark I want to rebase is currently active (I can check if it is with hg bookmarks).

Then maybe I commit some more work, so that now I have a series of commits on my bookmark. But perhaps I want to reorder them, squash some together, or edit commit messages; no problemo, I just do a quick:

$ hg histedit

which opens a history listing all the changesets on my bookmark. I can edit that file to pick, fold (squash), or edit changesets in pretty much the same way I would using git rebase --interactive.

My special Mozillian configuration of Mercurial, which a wizard helped me set up during installation, magically prepares everything for MozReview and then asks me if I want to

publish these review requests now (Yn)?

To which I of course say Y (or, you know, realize I made a horrible mistake, say n, go back and re-do everything, and then push to review again).

Then I just wait for review feedback from my mentor, and perhaps make some changes and amend my commits based on that feedback, and push those to review again.

Ultimately, once the review has passed, my changes get merged into mozilla-inbound, then eventually mozilla-central (more on what that all means in a future post), and I become an official contributor. Yay! :)

So is this goodbye Git?

Nah, I’ll still be using Git as my go-to version control system for my own projects, and another Mozilla project I’m contributing to, Perfherder, has its code on Github, so Git is the default for that.

But learning to use Mercurial, like learning any new tool, has been educational! Although my progress was (and still is) a bit slow as I get used to the differences in features/workflow (which, I should reiterate, are quite minor when coming from Git), I’ve learned a bit more about version control systems in general, and some of the design decisions that have gone into these two. Plus, I’ve been able to contribute to a great open-source project! I’d call that a win. Thanks Mercurial, you deserve a hg. :)

Notes

1 However, there is a small but ardent faction of Mozilla devs who refuse to stop using Git. Despite being a Gitter, I chose to forego this option and use Mercurial because a) it’s the default, so most of the documentation etc. assumes it’s what you’re using, and b) I figured it was a good chance to get to know a new tool. ↩

2 Git actually uses this term the same way; the tips of all branches are stored in .git/refs/heads. But in my experience the term “heads” doesn’t pop up as often in Git as in Mercurial. Maybe this is because in Git we can talk about “branches” instead? ↩

June 02, 2016

Hello from Platforms Operations! Once a month we highlight one of our projects to help the Mozilla community discover a useful tool or an interesting contribution opportunity.

This month’s project is firefox-ui-tests!

What are firefox-ui-tests?

Firefox UI tests are a suite of integration tests based on the Marionette automation framework, used mainly for user-interface-centric testing of Firefox. The difference from pure Marionette tests is that Firefox UI tests interact with the chrome scope (the browser interface) rather than the content scope (websites) by default. The tests also have access to a page object model called Firefox Puppeteer. It eases interaction with all UI elements under test, and in particular makes interacting with the browser possible across different localizations of Firefox. That is a feature no other existing automated test suite offers.

Where Firefox UI tests are used

As of today the Firefox UI functional tests are getting executed for each code check-in on integration and release branches, but limited to Linux64 debug builds due to current Taskcluster restrictions. Once more platforms are available the testing will be expanded appropriately.

But as mentioned earlier, we also want to test localized builds of Firefox. To get there, the developer and release builds have to be used, since those are the ones for which all locales exist. Those tests run in our own CI system, called mozmill-ci, which is driven by Jenkins. Due to the low capacity of test machines, only a handful of locales are tested. But this will change soon with the complete move to Taskcluster. With the CI system we also test updates of Firefox, to ensure that there is no breakage for our users after an update.

What are we working on?

The current work is fully dedicated to bring more visibility of our test results to developers. We want to get there with the following sub projects:

Bug 1272228 – Get test results out of the by default hidden Tier-3 level on Treeherder and make them reporting as Tier-2 or even Tier-1. This will drastically reduce the number of regressions introduced for our tests.

Bug 1272145 – Tests should be located close to the code which actually gets tested. So we want to move as many Firefox UI tests as possible from testing/firefox-ui-tests/tests to individual browser or toolkit components.

Bug 1272236 – To increase stability and coverage of Firefox builds including all various locales, we want to get all of our tests for nightly builds on Linux64 executed via TaskCluster.

How to run the tests

The tests are located in the Firefox development tree. That allows us to keep them up-to-date when changes in Firefox are introduced. But that also means that before the tests can be executed a full checkout of mozilla-central has to be made. Depending on the connection it might take a while… so take the chance to grab a coffee while waiting.

When the Firefox build is available, the tests can be run. A tool which allows a simple invocation of the tests is called mach, and it is located in the root of the repository. Call it with various arguments to run different sets of tests or a different binary. Here are some examples:

# Run integration tests with the Firefox you built
./mach firefox-ui-functional
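I believe the harness can also be pointed at a different binary; if I remember the argument correctly, that looks something like this (the path is just a placeholder):

# Run the integration tests against some other Firefox binary
./mach firefox-ui-functional --binary /path/to/firefox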

How to get involved

If the above sounds interesting to you, and you are willing to learn more about test automation, the firefox-ui-tests project is definitely a good place to get started. We have a couple of open mentored bugs, and can create even more, depending on individual requirements and knowledge in Python.

May 23, 2016

Today was my first day as an Outreachy intern with Mozilla! What does that even mean? Why is it super exciting? How did I swing such a sweet gig? How will I be spending my summer non-vacation? Read on to find out!

What is Outreachy?

Outreachy is a fantastic initiative to get more women and members of other underrepresented groups involved in Free & Open Source Software. Through Outreachy, organizations that create open-source software (e.g. Mozilla, GNOME, Wikimedia, to name a few) take on interns to work full-time on a specific project for 3 months. There are two internship rounds each year, May-August and December-March. Interns are paid for their time, and receive guidance/supervision from an assigned mentor, usually a full-time employee of the organization who leads the given project.

Oh yeah, and the whole thing is done remotely! For a lot of people (myself included) who don’t/can’t/won’t live in a major tech hub, the opportunity to work remotely removes one of the biggest barriers to jumping in to the professional tech community. But as FOSS developers tend to be pretty distributed anyway (I think my project’s team, for example, is on about 3 continents), it’s relatively easy for the intern to integrate with the team. It seems that most communication takes place over IRC and, to a lesser extent, videoconferencing.

What does an Outreachy intern do?

Anything and everything! Each project and organization is different. But in general, interns spend their time…

Coding (or not)

A lot of projects involve writing code, though what that actually entails (language, framework, writing vs. refactoring, etc.) varies from organization to organization and project to project. However, there are also projects that don’t involve code at all, and instead have the intern working on equally important things like design, documentation, or community management.

As for me specifically, I’ll be working on the project Test-driven Refactoring of Marionette’s Python Test Runner. You can click through to the project description for more details, but basically I’ll be spending most of the summer writing Python code (yay!) to test and refactor a component of Marionette, a tool that lets developers run automated Firefox tests. This means I’ll be learning a lot about testing in general, Python testing libraries, the huge ecosystem of internal Mozilla tools, and maybe a bit about browser automation. That’s a lot! Luckily, I have my mentor Maja (who happens to also be an alum of both Outreachy and RC!) to help me out along the way, as well as the other members of the Engineering Productivity team, all of whom have been really friendly & helpful so far.

Traveling

Interns receive a $500 stipend for travel related to Outreachy, which is fantastic. I intend, as I’m guessing most do, to use this to attend conference(s) related to open source. If I were doing a winter round I would totally use it to attend FOSDEM, but there are also a ton of conferences in the summer! Actually, you don’t even need to do the traveling during the actual 3 months of the internship; they give you a year-long window so that if there’s an annual conference you really want to attend but it’s not during your internship, you’re still golden.

At Mozilla in particular, interns are also invited to a week-long all-hands meet up! This is beyond awesome, because it gives us a chance to meet our mentors and other team members in person. (Actually, I doubly lucked out because I got to meet my mentor at RC during “Never Graduate Week” a couple of weeks ago!)

Blogging

One of the requirements of the internship is to blog regularly about how the internship and project are coming along. This is my first post! Though we’re required to write a post every 2 weeks, I’m aiming to write one per week, on both technical and non-technical aspects of the internship. Stay tuned!

How do you get in?

I’m sure every Outreachy participant has a different journey, but here’s a rough outline of mine.

Step 1: Realize it is a thing

Let’s not forget that the first step to applying for any program/job/whatever is realizing that it exists! Like most people, I think, I had never heard of Outreachy, and was totally unaware that a remote, paid internship working on FOSS was a thing that existed in the universe. But then, in the fall of 2015, I made one of my all-time best moves ever by attending the Recurse Center (RC), where I soon learned about Outreachy from various Recursers who had been involved with the program. I discovered it about 2 weeks before applications closed for the December-March 2015-16 round, which was pretty last-minute; but a couple of other Recursers were applying and encouraged me to do the same, so I decided to go for it!

Step 2: Frantically apply at last minute

Applying to Outreachy is a relatively involved process. A couple months before each round begins, the list of participating organizations/projects is released. Prospective applicants are supposed to find a project that interests them, get in touch with the project mentor, and make an initial contribution to that project (e.g. fix a small bug).

But each of those tasks is pretty intimidating!

First of all, the list of participating organizations is long and varied, and some organizations (like Mozilla) have tons of different projects available. So even reading through the project descriptions and choosing one that sounds interesting (most of them do, at least to me!) is no small task.

Then, there’s the matter of mustering up the courage to join the organization/project’s IRC channel, find the project mentor, and talk to them about the application. I didn’t even really know what IRC was, and had never used it before, so I found this pretty scary. Luckily, I was RC, and one of my batchmates sat me down and walked me through IRC basics.

However, the hardest and most important part is actually making a contribution to the project at hand. Depending on the project, this can be long & complicated, quick & easy, or anything in between. The level of guidance/instruction also varies widely from project to project: some are laid out clearly in small, hand-holdy steps, others are more along the lines of “find something to do and then do it”. Furthermore, prerequisites for making the contribution can be anything from “if you know how to edit text and send an email, you’re fine” to “make a GitHub account” to “learn a new programming language and install 8 million new tools on your system just to set up the development environment”. All in all, this means that making that initial contribution can often be a deceptively large amount of work.

Because of all these factors, for my application to the December-March round I decided to target the Mozilla project “Contribute to the HTML standard”. In addition to the fact that I thought it would be awesome to contribute to such a fundamental part of the web, I chose it because the contribution itself was really simple: just choose a GitHub issue with a beginner-friendly label, ask some questions via GitHub comments, edit the source markup file as needed, and make a pull request. I was already familiar with GitHub so it was pretty smooth sailing.

Once you’ve made your contribution, it’s time to write the actual Outreachy application. This is just a plain text file you fill out with lots of information about your experience with FOSS, your contribution to the project, etc. In case it’s useful to anyone, here’s my application for the December-March 2015-16 round. But before you use that as an example, make sure you read what happened next…

Step 3: Don’t get in

Unfortunately, I didn’t get in to the December-March round (although I was stoked to see some of my fellow Recursers get accepted!). Honestly, I wasn’t too surprised, since my contributions and application had been so hectic and last-minute. But even though it wasn’t successful, the application process was educational in and of itself: I learned how to use IRC, got 3 of my first 5 GitHub pull requests merged, and became a contributor to the HTML standard! Not bad for a failure!

Step 4: Decide to go for it again (at last minute, again)

Fast forward six months: after finishing my batch at RC, I had been looking & interview-prepping, but still hadn’t gotten a job. When the applications for the May-August round opened up, I took a glance at the projects and found some cool ones, but decided that I wouldn’t apply this round because a) I needed a Real Job, not an internship, and b) the last round’s application process was a pretty big time investment which hadn’t paid off (although it actually had, as I just mentioned!).

But as the weeks went by, and the application deadline drew closer, I kept thinking about it. I was no closer to finding a Real Job, and upheaval in my personal life made my whereabouts over the summer an uncertainty (I seem never to know what continent I live on), so a paid, remote internship was becoming more and more attractive. When I broached my hesitation over whether or not to apply to other Recursers, they unanimously encouraged me (again) to go for it (again). Then, I found out that one of the project mentors, Maja, was a Recurser, and since her project was one of the ones I had shortlisted, I decided to apply for it.

Of course, by this point it was once again two weeks until the deadline, so panic once again set in!

Step 5: Learn from past mistakes

This time, the process as a whole was easier, because I had already done it once. IRC was less scary, I already felt comfortable asking the project mentor questions, and having already been rejected in the previous round made it somehow lower-stakes emotionally (“What the hell, at least I’ll get a PR or two out of it!”). During my first application I had spent a considerable amount of time reading about all the different projects and fretting about which one to do, flipping back and forth mentally until the last minute. This time, I avoided that mistake and was laser-focused on a single project: Test-driven Refactoring of Marionette’s Python Test Runner.

From a technical standpoint, however, contributing to the Marionette project was more complicated than the HTML standard had been. Luckily, Maja had written detailed instructions for prospective applicants explaining how to set up the development environment etc., but there were still a lot of steps to work through. Then, because there were so many folks applying to the project, there was actually a shortage of “good-first-bugs” for Marionette! So I ended up making my first contributions to a different but related project, Perfherder, which meant setting up a different dev environment and working with a different mentor (who was equally friendly). By the time I was done with the Perfherder stuff (which turned out to be a fun little rabbit hole!), Maja had found me something Marionette-specific to do, so I ended up working on both projects as part of my application process.

When it came time to write the actual application, I also had the luxury of being able to use my failed December-March application as both a starting point and an example of what not to do. Some of the more generic parts (my background, etc.) were reusable, which saved time. But when it came to the parts about my contribution to the project and my proposed internship timeline, I knew I had to do a much better job than before. So I opted for over-communication, and basically wrote down everything I could think of about what I had already done and what I would need to do to complete the goals stated in the project description (which Maja had thankfully written quite clearly).

In the end, my May-August application was twice as long as my previous one had been. Much of that difference was the proposed timeline, which went from being one short paragraph to about 3 pages. Perhaps I was a bit more verbose than necessary, but I decided to err on the side of too many details, since I had done the opposite in my previous application.

Step 6: Get a bit lucky

Spoiler alert: this time I was accepted!

Although I knew I had made a much stronger application than in the previous round, I was still shocked to find out that I was chosen from what seemed to be a large, competitive applicant pool. I can’t be sure, but I think what made the difference the second time around must have been a) more substantial contributions to two different projects, b) better, more frequent communication with the project mentor and other team members, and c) a much more thorough and better thought-out application text.

But let’s not forget d) luck. I was lucky to have encouragement and support from the RC community throughout both my applications, lucky to have the time to work diligently on my application because I had no other full-time obligations, lucky to find a mentor who I had something in common with and therefore felt comfortable talking to and asking questions of, and lucky to ultimately be chosen from among what I’m sure were many strong applications. So while I certainly did work hard to get this internship, I have to acknowledge that I wouldn’t have gotten in without all of that luck.

Why am I doing this?

Last week I had the chance to attend OSCON 2016, where Mozilla’s E. Dunham gave a talk on How to learn Rust. A lot of the information applied to learning any language/new thing, though, including this great recommendation: When embarking on a new skill quest, record your motivation somewhere (I’m going to use this blog, but I suppose Twitter or a vision board or whatever would work too) before you begin.

The idea is that once you’re in the process of learning the new thing, you will probably have at least one moment where you’re stuck, frustrated, and asking yourself what the hell you were thinking when you began this crazy project. Writing it down beforehand is just doing your future self a favor, by saving up some motivation for a rainy day.

So, future self, let it be known that I’m doing Outreachy to…

Write code for an actual real-world project (as opposed to academic/toy projects that no one will ever use)

Get to know a great organization that I’ve respected and admired for years

Try out working remotely, to see if it suits me

Learn more about Python, testing, and automation

Gain confidence and feel more like a “real developer”

Launch my career in the software industry

I’m sure these goals will evolve as the internship goes along, but for now they’re the main things driving me. Now it’s just a matter of sitting back, relaxing, and working super hard all summer to achieve them! :D

Got any more questions?

Are you curious about Outreachy? Thinking of applying? Confused about the application process? Feel free to reach out to me! Go on, don’t be shy, just use one of those cute little contact buttons and drop me a line. :)

May 18, 2016

I recently got to spend a week back at the heart of an excellent, delightful, inspiring technical community: Recurse Center or RC. This friendly group consists mostly of programmers from around the world who have, at some point, participated in RC’s three-month “retreat” in New York City to work on whatever projects happen to interest them. The retreat’s motto is “never graduate”, and so participants continue to support each other’s technical growth and curiosity forever and ever.

I’m an RC alum from 2014! RC’s retreat is how I ended up contributing to open source software and eventually gathering the courage to join Mozilla. Before RC, despite already having thousands of hours of programming and fancy math under my belt, I held myself back with doubts about whether I’m a “real programmer”, whatever that stereotype means. That subconscious negativity hasn’t magically disappeared, but I’ve had a lot of good experiences in the past few years to help me manage it. Today, RC helps me stay excited about learning all the things for the sake of learning all the things.

A retreat at RC looks something like this: you put your life more-or-less on hold, move to NYC, and spend three months tinkering in a big, open office with around fifty fellow (thoughtful, kind, enthusiastic) programmers. During my 2014 retreat, I worked mostly on lowish-level networking things in Python, pair programmed on whatever else people happened to be working on, gave and received code review, chatted with wise “residents”, attended spontaneous workshops, presentations and so on.

Every May, alumni are invited to return to the RC space for a week, and this year I got to go! (Thanks, Mozilla!) It was awesome! Exclamation points! This past week felt like a tiny version of the 3-month retreat. After two years away, I felt right at home — that says a lot about the warm atmosphere RC manages to cultivate. My personal goal for the week was just to work in a language that’s relatively new to me - JavaScript - but I also happened to have really interesting conversations about things like:

How to implement a basic debugger?

How to improve the technical interview process?

What holds developers back or slows them down? What unnecessary assumptions do we have about our tools and their limitations?

RC’s retreat is a great environment for growing as a developer, but I don’t want to make it sound like it’s all effortless whimsy. Both the hardest and most wonderful part of RC (and many other groups) is being surrounded by extremely impressive, positive people who never seem to struggle with anything. It’s easy to slip into showing off our knowledge or to get distracted by measuring ourselves against our peers. Sometimes this is impostor syndrome. Sometimes it’s the myth of the 10x developer. RC puts a lot of effort into being a safe space where you can reveal your ignorance and ask questions, but insecurity can always be a challenge.

Similarly, the main benefit of RC is learning from your peers, but the usual ways of doing this seem to be geared toward people who are outgoing and think out loud. These are valuable skills, but when we focus on them exclusively we don’t hear from people who have different defaults. There is also little structure provided by RC so you are free to self-organize and exchange ideas as you deem appropriate. The risk is that quiet people are allowed to hide in their quiet corners, and then everyone misses out on their contributions. I think RC makes efforts to balance this out, but the overall lack of structure means you really have to take charge of how you learn from others. I’m definitely better at this than I used to be.

RC is an experiment and it’s always changing. Although at this point my involvement is mostly passive, I’m glad to be a part of it. I love that I’ve been able to work closely with vastly different people, getting an inside look at their work habits and ways of thinking. Now, long after my “never-graduation”, the RC community continues to expose me to a variety of ideas about technology and learning in a way that makes us all get better. Continuous improvement, yeah!

May 04, 2016

I have just released a new version of Marionette - well, of the executable that you need to download.

The main fix in this release is the ability to send over a custom profile that will be used. To use a custom profile you will need to set the marionette:true capability and pass in a profile when you instantiate your FirefoxDriver.
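
In the Python Selenium bindings, for example, this might look roughly like the following sketch; the profile path is a placeholder and the new executable needs to be somewhere Selenium can find it:

    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    # Ask Selenium to drive Firefox through Marionette.
    capabilities = DesiredCapabilities.FIREFOX.copy()
    capabilities['marionette'] = True

    # Point at an existing profile directory (placeholder path).
    profile = webdriver.FirefoxProfile('/path/to/your/profile')

    driver = webdriver.Firefox(capabilities=capabilities, firefox_profile=profile)
    driver.get('https://www.mozilla.org')
    driver.quit()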

We have also fixed a number of minor issues like IPv6 support and compiler warnings.

We have also moved the repository where our executable is developed to live under the Mozilla organization. It is now called GeckoDriver. We will be updating the naming of it in Selenium and documentation over the next few weeks.

Since you are awesome early adopters it would be great if you could raise bugs.

I am not expecting everything to work but below is a quick list that I know doesn't work.

No support for self-signed certificates

No support for actions

No support for the logging endpoint

I am sure there are other things we don't remember

Switching of Frames needs to be done with either a WebElement or an index. Windows can only be switched by window handles.

Triggered by file changes

All along I wanted to run some in-tree tests without having them wait around for a Firefox build or any other dependencies they don’t need. So I originally implemented this task as a “build” so that it would get scheduled for every incoming changeset in Mozilla’s repositories.

But forget “builds”, forget “tests” — now there’s a third category of tasks that we’ll call “generic” and it’s exactly what I need.

In base_jobs.yml I say, “hey, here’s a new task called marionette-harness — run it whenever there’s a change under (branch)/testing/marionette/harness”. Of course, I can also just trigger the task with try syntax like try: -p linux64_tc -j marionette-harness -u none -t none.

For Tasks that Make Sense in a gecko Source Checkout

As you can see, I made the build.sh script in the desktop-build docker image execute an arbitrary in-tree JOB_SCRIPT, and I created harness-test-linux.sh to run mozharness within a gecko source checkout.

Why not the desktop-test image?

“But we can also run arbitrary mozharness scripts thanks to the configuration in the desktop-test docker image!” Yes, and all of that configuration is geared toward testing a Firefox binary, which implies downloading tools that my task either doesn’t need or already has access to in the source tree. Now we have a lighter-weight option for executing tests that don’t exercise Firefox.

Why not mach?

In my lazy work-in-progress, I had originally executed the Marionette harness tests via a simple call to mach, yet now I have this crazy chain of shell scripts that leads all the way to mozharness. The mach command didn’t disappear — you can run Marionette harness tests with ./mach python-test .... However, mozharness provides clearer control of Python dependencies, appropriate handling of return codes to report test results to Treeherder, and I can write a job-specific script and configuration.

April 24, 2016

This conference was awesome: not too big, not too cramped of a schedule (long breaks between talk sessions), free drinks, snacks & meals (with vegan options!), unisex bathrooms (toiletries & tampons provided!), a code of conduct, and - most importantly, to me - a great diversity program that gave me and 16 others support to attend! The unconference format was really interesting, and worked better than I expected. It also enabled something I wasn’t planning on: I gave my first talk at a tech conference!

What’s an unconference?

There’s no pre-planned schedule; instead, at the beginning of each day, anyone who’s interested in giving a talk makes a short pitch of their topic, and for the next hour or so the rest of the attendees vote on which talks they want to attend. The highest-voted talks are selected, and begin shortly after that. It sounds like it would be chaos, but it works!

I gave my first tech talk! On 3 hours’ notice!

On day 2 of the conference, in a completely unexpected turn of events, I proposed, planned, and delivered a 30-minute talk within a period of about 3 hours. Am I crazy? Perhaps. But the good kind of crazy!

See, there had been some interest in functional programming in JS (as part of the unconference format, people can submit topics they’d like to hear a talk on as well), and some talks on specific topics related to functional languages/libraries, but no one had proposed a high-level general introduction about it. So, at literally the last minute of the talk-proposal session, I spontaneously got up and pitched “Learning Functional Programming with JS” (that’s how I learned FP, after all!).

Turns out people were indeed interested: my proposal actually got more votes than any other that day. Which meant that I would present in the main room, the only one out of the three tracks that was being recorded. So all I had to do was, you know, plan a talk and make slides from scratch and then speak for 30 minutes in front of a camera, all in the space of about 3 hours.

Yay! No, wait… panic!

Luckily my years of teaching experience and a few presentations at academic conferences came to the rescue. I had to skip a couple of the sessions before mine (luckily some talks were recorded), and get a little instant feedback from a few folks at the conference that I had gotten to know, but ultimately I was able to throw together a talk outline and some slides.

When it came to actually delivering the talk, it was actually less scary than I thought. I even had enough time to do an ad-hoc digression (on the chalkboard!!!) into persistent data structures, which are the topic of my first scheduled tech talk at !!Con 2016.

The whole thing was a great experience, and definitely gave me a huge confidence boost for speaking at more conferences in the future (look out, !!Con and EuroPython!). I would recommend it to anyone! Come to think of it, why aren’t you giving talks yet?

Single TCP connection, but multiple streams with requests running in parallel

Headers are compressed

Each browser can determine how to figure out/build the tree of dependencies

Firefox has the most efficient implementation at the moment

It sets up the dependency tree of requests before actually making any requests (?)

Sidenote: Huffman encoding

Take a string to compress

Count the frequencies of each character

Make a binary tree such that the leaf nodes are the characters arranged left-to-right from most frequent to least, and the leaves are connected through binary nodes from right to left, where each branch is labeled 0 on the left and 1 on the right

Use the path from the root of the tree to the character’s leaf node as the compression table

So the most frequent character will be 00, the least frequent will be e.g. 1111

This means more frequent characters have shorter compressions, so the overall compression will be as small as possible (see the sketch below)
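
As a rough Python sketch of the classic construction (repeatedly merge the two least frequent nodes, then read the codes off the tree), which gives the "frequent characters get short codes" property described in these notes:

    import heapq
    from collections import Counter

    def huffman_codes(text):
        # Count the frequency of each character.
        freq = Counter(text)
        # One leaf per character: (frequency, tie-breaker, node).
        heap = [(count, i, char) for i, (char, count) in enumerate(freq.items())]
        heapq.heapify(heap)
        next_id = len(heap)
        # Repeatedly merge the two least frequent nodes into a new internal node.
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, next_id, (left, right)))
            next_id += 1
        # Walk the tree: left edges are '0', right edges are '1'.
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):
                walk(node[0], prefix + '0')
                walk(node[1], prefix + '1')
            else:
                codes[node] = prefix or '0'
        walk(heap[0][2], '')
        return codes

    print(huffman_codes('this is an example of huffman encoding'))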

HTTP/2 is already in use (22% of sites(???)) - you should start using it now!

Customers using HTTP/1.1 will experience an increase in load times, but those using updated browsers will see a decrease

APK Size

Here’s how the APK size changed over the quarter, for mozilla-central Android 4.0 API15+ opt builds:

The dramatic decrease in February was caused by bug 1233799, which enabled the download content service and removed fonts from the APK.

For the same period, libxul.so generally increased in size:

The recent decrease in libxul was caused by bug 1259521, an upgrade of the Android NDK.

Memory

This quarter we began tracking some memory metrics, using test_awsy_lite.

These memory measurements are generally steady over the quarter, with some small improvements.

Autophone-Talos

This section tracks Perfherder graphs for mozilla-central builds of Firefox for Android, for Talos tests run on Autophone, on android-6-0-armv8-api15. The test names shown are those used on treeherder. See https://wiki.mozilla.org/Buildbot/Talos for background on Talos.

In previous quarters, these tests were running on Pandaboards; beginning this quarter, these tests run on actual phones via Autophone.

tsvgx

An SVG-only number that measures SVG rendering performance. About half of the tests are animations or iterations of rendering. This ASAP test (tsvgx) iterates in unlimited frame-rate mode, thus reflecting the maximum rendering throughput of each test. The reported value is the page load time, or, for animations/iterations, the overall duration the sequence/animation took to complete. Lower values are better.

tp4m

Generic page load test. Lower values are better.

No significant improvements or regressions noted for tsvgx or tp4m.

Autophone

Throbber Start / Throbber Stop

These graphs are taken from http://phonedash.mozilla.org. Browser startup performance is measured on real phones (a variety of popular devices).

There was a lot of work on Autophone this quarter, with new devices added and old devices retired or re-purposed. These graphs show devices running mozilla-central builds, of which none were in continuous use over the quarter.

Throbber Start/Stop test regressions are tracked by bug 953342; a recent regression in throbber start is under investigation in bug 1259479.

mozbench

mozbench has been retired.

Long live arewefastyet.com! I’ll check in on arewefastyet.com next quarter.

Today is the last day of Q1 2016, which means it is time to review what I have done during the last weeks. When I checked my status reports it was quite a lot, so I will shorten it a bit and only talk about the really important changes.

Build System / Mozharness

After digging into mozharness to get support for Firefox UI Tests last quarter, I saw that more work had to be done to fully support tests which utilize Nightly or Release builds of Firefox.

The most challenging work for me (because I had never done a build system patch before) was prefixing the test_packages.json file which gets uploaded next to any nightly build on archive.mozilla.org. This work was necessary because without the prefix the file was always overwritten by later build uploads, which meant that when trying to get the test archives for OS X and Linux, the Windows ones were always returned. Due to binary incompatibilities between those platforms this caused complete bustage. No-one had noticed that until now because every other test suite runs on a check-in basis and doesn’t have to rely on the nightly build folders on archive.mozilla.org. For Taskcluster this wasn’t a problem.

In regards to firefox-ui-tests, I was finally able to get a test task added to Taskcluster which will execute our firefox-ui-tests for each check-in, in both e10s and non-e10s mode. Due to current Taskcluster limitations this only runs for Linux64 debug, but that already helps a lot and I hope that we can increase platform coverage soon. If you are interested in the results you can have a look at Treeherder.

Other Mozharness specific changes are the following ones:

Fix to always copy the log files to the upload folder even in case of early aborts, e.g. failed downloads (bug 1230150)

Refactoring of download_unzip() method to allow support of ZipFile and TarFile instead of external commands (bug 1237706)

Removing the hard requirement for the --symbols-url parameter to let mozcrash analyze the crash. This was possible because the minidump_stackwalk binary can automatically detect the appropriate symbols for nightly and release builds (bug 1243684)

The move of the firefox-ui-tests themselves was easy, but keeping backward compatibility with mozmill-ci and other Firefox branches down to mozilla-esr38 was a lot of work. To achieve that I first had to convert all three different modules (harness, puppeteer, tests) to individual Python packages. Those got landed for Firefox 46.0 on mozilla-central and then backported to Firefox 45.0, which also became our new ESR release. Due to backport complexity for older branches I decided not to land packages for Firefox 44.0, 43.0, and 38ESR. Instead those branches got smaller updates for the harness so that they had full support for our latest mozharness script on mozilla-central. Yes, in case you wonder, all branches used mozharness from mozilla-central at this time. It was easier to do, and I finally switched to branch-specific mozharness scripts later in mozmill-ci once Firefox 45.0 and its ESR release were out.

Adding mach support for Firefox UI Tests on mozilla-central was the next step to assist in running our tests. Required arguments from before are now magically selected by mach, and that allowed me to remove the firefox-ui-test dependency on firefox_harness, which was always a thorn in our side. As a final result I was even able to completely remove the firefox-ui-test package, so that we are now free to move our tests to any place in the tree!

In case you want to know more about our tests please check out our new documentation on MDN which can be found here:

Mozmill CI

Lots of changes have been made to this project to adapt the Jenkins jobs to all the Firefox UI Tests modifications, especially since I needed a generic solution which works for all existing Firefox versions. The first real task was to no longer grab the tests from the firefox-ui-tests GitHub repository, but instead let mozharness download the appropriate test package as produced and uploaded with builds to archive.mozilla.org.

This worked immediately for en-US builds, given that the location of the test_packages.json file is distributed along with the Mozilla Pulse build notification, but that’s not the case for l10n builds and funsize update notifications. For those we have to utilize mozdownload to fetch the correct URL based on the version, platform, and build id. A special situation came up for update tests, which actually use two different Firefox builds: if we get the tests for the pre-update build, how can we magically switch to the tests for the target version? Given that there is no easy way, I decided to always use the tests from the target version, and in case of UI changes we have to keep backward compatibility code in our tests and Firefox Puppeteer. This is probably the best solution for us.

Another issue I had to solve with test packages was with release candidate builds. For those builds Release Engineering is neither creating nor uploading any test archive, so a connection had to be made between candidate builds and CI (tinderbox) builds. As it turned out, the two properties which helped here are the revision and the branch. With them I at least know the changeset of the mozilla-beta, mozilla-release, and mozilla-esr* branches used to trigger the release build process. But sadly that’s only a tag, and no builds or tests get created for it, which means something more is necessary. After some investigation I found out that Treeherder and its REST API can be of help. Using the known tag and walking back through the parents until Treeherder reports a successful build for the given platform allowed me to retrieve the nearest usable revision, which mozdownload can then use to retrieve the test_packages.json URL. I know it’s not perfect but it satisfies us enough for now.

Then the release promotion project worked on by the Release Engineering team was close to being activated. I heard only a couple of days beforehand that Firefox 46.0b1 would be the first candidate to test it on, which gave me basically no time for testing at all. Thanks to all the support from Rail Aliiev I was able to get the new Mozilla Pulse listener created to handle the appropriate release promotion build notifications. Given that with release promotion we create the candidates based on a signed-off CI build, we already have a valid revision to use with mozdownload to retrieve the test_packages.json file, so there is no need for the above mentioned Treeherder traversal code. \o/ Once everything had been implemented, Firefox 46.0b3 was the first beta release for which we were able to process the release promotion notifications.

Around the same time as the release promotion news, I was also informed by Robert Kaiser that the on-demand update jobs performed with Mozmill did not work anymore. As it turned out, a change in the JS engine caused the bustage for Firefox 46.0b1. Given that Mozmill is dead I was not going to update it again; instead I converted the on-demand update jobs to make use of Firefox UI Tests. This went pretty well, also because we had already been running those tests for a while on mozilla-central and mozilla-aurora for nightly builds. As a result we were able to run update jobs a day later for Firefox 46.0b1 and noticed that nearly all locales on Windows were busted, so in the end only en-US got shipped. I’m not sure whether that would have been as visible with Mozmill.

What’s next

I already have plans for what’s next, but given that I will be away from work for a full month now, I will have to revisit them once I’m back in May. I promise that I will also blog about them around that time.

March 10, 2016

As Firefox for Android drops support for ancient versions of Android, I find my collection of test phones becoming less and less relevant. For instance, I have a Galaxy S that works fine but only runs Android 2.2.1 (API 8), and I have a Galaxy Nexus that runs Android 4.0.1 (API 14). I cannot run current builds of Firefox for Android on either phone, and, perhaps because I rooted them or otherwise messed around with them in the distant past, neither phone will upgrade to a newer version of Android.

I have been letting these phones gather dust while I test on emulators, but I recently needed a real phone and managed to breathe new life into the Galaxy Nexus using an AOSP build. I wanted all the development bells and whistles and a root shell, so I made a full-eng build and I updated the Galaxy Nexus to Android 4.3 (api 18) — good enough for Firefox for Android, at least for a while!

Once make completed, I had binaries in <aosp>/out/… I put the phone in bootloader mode (hold down Volume Up + Volume Down + Power to boot the Galaxy Nexus), connected it by USB and executed “fastboot -w flashall”.

Actually, in my case, fastboot could not see the connected device unless I ran it as root. In the root account, I didn’t have the right settings, so I needed to do something like:

March 01, 2016

The thing that is at the core of every hyper effective team is trust. Without it, any of the pieces that make the team hyper effective can fall apart very quickly. This is something that I have always instinctively known. I always work hard with my reports to make sure they can trust me. If they trust me, and more importantly I trust them, then I can ask them to take on work and then just come back every so often to see if they are stuck.

The other week I was in Washington, D.C. to meet up with my manager peers. The plan was to see how we could interact with each other, build bridges and, more importantly, build trust.

How did we do this?

We did a few trust exercises which, I am not going to lie, were extremely uncomfortable. One that actually made me shake in my boots was one where I had to think of things I was proud of last year and things I could have done better. Then I needed to say what I was planning for this year that I would be proud of. Once my part was done, the rest of the group could make comments about me.

"They are my peers, they are open to me all the time..." is what my brain should have been saying. In actual fact it was saying, "They are about to crucify you...". The irony is that my peers are a lovely group who are amazingly supportive. My brain knows that but went into flight mode...

This exercise showed that people are allowed to say both positive and negative things about your work. Always assume the best in people (at first until they prove otherwise).

It showed that conflict is ok; in actual fact it is extremely healthy! Well, as long as it is constructive to the group and not destructive.

February 19, 2016

Quite a few weeks ago now, the Second official Quarter of Contribution wrapped up. We had advertised 4 projects and found awesome contributors for all 4. While all hackers gave a good effort, sometimes plans change and life gets in the way. In the end we had 2 projects with very active contributors.

First off, this 2nd round of QoC wouldn’t have been possible without the Mentors creating projects and mentoring, nor without the great contributors volunteering their time to build great tools and features.

I really like to look at what worked and what didn’t, so let me try to summarize some thoughts.

What worked well:

building up interest in others to propose and mentor projects

having the entire community in #ateam serve as an environment of encouragement and learning

specifying exact start/end dates

advertising on blogs/twitter/newsgroups to find great hackers

What I would like to see changed for QoC.3:

Be clear up front about what we expect. Many contributors waited until the start date before working - that doesn’t give people a chance to ensure mentors and projects are a good fit for them (especially over a couple of months)

Ensure each project has clear guidelines on code expectations. Linting, Tests, self review before PR, etc. These are all things which might be tough to define and tough to do at first, but it makes for better programmers and end products!

Keep a check every other week on the projects as mentors (just a simple irc chat or email chain)

Consider the timing of the project: either run it on demand as mentors want to do it, or continue in batches, but avoid overlap with common mentor time off (work weeks, holidays)

Encourage mentors to set weekly meetings and “office hours”

As it stands now, we are pushing on submitting Outreachy and GSoC project proposals; assuming that those programs pick up our projects, we will look at QoC.3 closer to September or November.

In this post, I want to talk about WPT Results Viewer. You can find the code on github, and still find the team on irc in #ateam. As this finished up, I reached out to :martianwars to learn what his experience was like; here are his own words:

What interested you in QoC?

So I’d been contributing to Mozilla for some time, fixing random bugs here and there. I was looking for something larger and more interesting. I think that was the major motivation behind QoC, besides Manishearth’s recommendation to work on the Web Platform Test Viewer. I guess I’m really happy that QoC came around at the right time!

What challenges did you encounter while working on your project? How did you solve them?

I guess the major issue while working on wptview was the lack of Javascript knowledge and the lack of help online when it came to Lovefield. But like every project, I doubt I would have enjoyed it much had I known everything required right from the start. I’m glad I got jgraham as a mentor, who made sure I worked my way up the learning curve as we made steady progress.

What are some things you learned?

So I definitely learnt some Javascript, code styling, the importance of code reviews, but there was a lot more to this project. I think the most important thing that I learnt was patience. I generally tend to search for StackOverflow answers when I need to perform a programming task I’m unaware of. With Lovefield being a relatively new project, I was compelled to patiently read and understand the documentation and sample programs. I also learnt a bit about how a large open source community functions, and I feel excited being a part of it! A bit irrelevant to the question, but I think I’ve made some friends in #ateam. The IRC is like my second home, and helps me escape life’s never ending stress, to a wonderland of ideas and excitement!

If you were to give advice to students looking at doing a QoC, what would you tell them?

Well, the first thing I would advise them is not to be afraid, especially of asking the so-called “stupid” questions on IRC. The second thing would be to make sure they give the project a decent amount of time, not with the aim of completing it or something, but to learn as much as they can. Showing enthusiasm is the best way to ensure one has a worthwhile QoC. Lastly, I’ve tried my level best to get a few newcomers into wptview. I think spreading the knowledge one learns is important, and one should try to motivate others to join open source.

If you were to give advice to mentors wanting to mentor a project, what would you tell them?

I think jgraham has set a great example of what an ideal mentor should be like. Like I mentioned earlier, James helped me learn while we made steady progress. I especially appreciate the way he had (has, rather) planned this project. Every feature was slowly built upon and in the right order, and he ensured the project continued to progress while I was away. He would give me sufficient insight into each feature, and leave the technical aspects to me, correcting my fallacies after the first commit. I think this is the right approach. Lastly, a quality every mentor MUST have is to be awake at 1am on a weekend night reviewing PRs.

Personally I have really enjoyed getting to know :martianwars and seeing the great progress he has made.

February 16, 2016

TaskCluster is a new-ish continuous integration system made at Mozilla. It manages the scheduling and execution of tasks based on a graph of their dependencies. It’s a general CI tool, and could be used for any kind of job, not just Mozilla things.

However, the example I describe here refers to a Mozilla-centric use case of TaskCluster[1]: tasks are run per check-in on the branches of Mozilla’s Mercurial repository and then results are posted to Treeherder. For now, the tasks can be configured to run in Docker images (Linux), but other platforms are in the works[2].

So, I want to schedule a task! I need to add a new task to the task graph that’s created for each revision submitted to hg.mozilla.org. (This is part of my work on deploying a suite of tests for the Marionette Python test runner, i.e. testing the test harness itself.)

There are builds and there are tests

mozilla-taskcluster operates based on the info under testing/taskcluster/tasks in Mozilla’s source tree, where there are yaml files that describe tasks. Specific tasks can inherit common configuration options from base yaml files.

The yaml files are organized into two main categories of tasks: builds and tests. This is just a convention in mozilla-taskcluster about how to group task configurations; TC itself doesn’t actually know or care whether a task is a build or a test.

The task I’m creating doesn’t quite fit into either category: it runs harness tests that just exercise the Python runner code in marionette_client, so I only need a source checkout, not a Firefox build. I’d like these tests to run quickly without having to wait around for a build. Another example of such a task is the recently-created ESLint task.

Scheduling a task

Just adding a yaml file that describes your new task under testing/taskcluster/tasks isn’t enough to get it scheduled: you must also add it to the list of tasks in base_jobs.yml, and define an identifier for your task in base_job_flags.yml. This identifier is used in base_jobs.yml, and also by people who want to run your task when pushing to try.

How does scheduling work? First a decision task generates a task graph, which describes all the tasks and their relationships. More precisely, it looks at base_jobs.yml and other yaml files in testing/taskcluster/tasks and spits out a json artifact, graph.json[3]. Then, graph.json gets sent to TC’s createTask endpoint, which takes care of the actual scheduling.

In the excerpt below, you can see a task definition with a requires field and you can recognize a lot of fields that are in common with the ‘task’ section of the yaml files under testing/taskcluster/tasks/.

{"tasks":[{"requires":[// id of a build task that this task depends on"fZ42HVdDQ-KFFycr9PxptA"],"task":{"taskId":"c2VD_eCgQyeUDVOjsmQZSg""extra":{"treeherder":{"groupName":"Reftest","groupSymbol":"tc-R",},},"metadata":{"description":"Reftest test run 1","name":"[TC] Reftest",//...]}

For now at least, a major assumption in the task-graph creation process seems to be that test tasks can depend on build tasks and build tasks don’t really[4] depend on anything. So:

If you want your tasks to run for every push to a Mozilla hg branch, add it to the list of builds in base_jobs.yml.

If you want your task to run after certain build tasks succeed, add it to the list of tests in base_jobs.yml and specify which build tasks it depends on.

Other than the above, I don’t see any way to specify a dependency between task A and task B in testing/taskcluster/tasks.

So, I added marionette-harness under builds. Recall, my task isn’t a build task, but it doesn’t depend on a build, so it’s not a test, so I’ll treat it like a build.

This will allow me to trigger my task with the following try syntax: try: -b o -p marionette-harness. Cool.

Make your task do stuff

Now I have to add some stuff to tasks/tests/harness_marionette.yml. Many of my choices here are based on the work done for the ESLint task. I created a base task called harness_test.yml by mostly copying bits and pieces from the basic build task, build.yml and making a few small changes. The actual task, harness_marionette.yml inherits from harness_test.yml and defines specifics like Treeherder symbols and the command to run.

The command

The heart of the task is in task.payload.command. You could chain a bunch of shell commands together directly in this field of the yaml file, but it’s better not to. Instead, it’s common to call a TaskCluster-friendly shell script that’s available in your task’s environment. For example, the desktop-test docker image has a script called test.sh through which you can call the mozharness script for your tests. There’s a similar build.sh script on desktop-build. Both of these scripts depend on environment variables set elsewhere in your task definition, or in the Docker image used by your task. The environment might also provide utilities like tc-vcs, which is used for checking out source code.

My task’s payload.command should be moved into a custom shell script, but for now it just chains together the source checkout and a call to mach. It’s not terrible of me to use mach in this case because I expect my task to work in a build environment, but most tests would likely call mozharness.

Configuring the task’s environment

Where should the task run? What resources should it have access to? This was probably the hardest piece for me to figure out.

docker-worker

My task will run in a docker image using a docker-worker[5]. The image, called desktop-build, is defined in-tree under testing/docker. There are many other images defined there, but I only considered desktop-build versus desktop-test. I opted for desktop-build because desktop-test seems to contain mozharness-related stuff that I don’t need for now.

The image is stored as an artifact of another TC task, which makes it a ‘task-image’. Which artifact? The default is public/image.tar. Which task do I find the image in? The magic incantation '{{#task_id_for_image}}desktop-build{{/task_id_for_image}}' somehow[6] obtains the correct ID, and if I look at a particular run of my task, the above snippet does indeed get populated with an actual taskId.

Other details that I mostly ignored

    # in harness_test.yml
    scopes:
      # Nearly all of our build tasks use tc-vcs
      - 'docker-worker:cache:level-{{level}}-{{project}}-tc-vcs'
    cache:
      # The taskcluster-vcs tooling stores the large clone caches in this
      # directory and will reuse them for new requests this saves about 20s~
      # and is the most generic cache possible.
      level-{{level}}-{{project}}-tc-vcs: '/home/worker/.tc-vcs'

Routes allow your task to be looked up in the task index. This isn’t necessary in my case so I just omitted routes altogether.

Scopes are permissions for your tasks, and I just copied the scope that is used for checking out source code.

workerType is a configuration for managing the workers that run tasks. To me, this was a choice between b2gtest and b2gbuild, which aren’t specific to b2g anyway. b2gtest is more lightweight, I hear, which suits my harness-test task fine.

I had to include a few dummy values under extra in harness_test.yml, like build_name, just because they are expected in build tasks. I don’t use these values for anything, but my task fails to run if I don’t include them.

Yay for trial and error

If you have syntax errors in your yaml, the Decision task will fail. If this happens during a try push, look under Job Details > Inspect Task to find useful error messages.

Iterating on your task is pretty easy. Aside from pushing to try, you can run tasks locally using vagrant and you can build a task graph locally as well with mach taskcluster-graph.

[3] To look at a graph.json artifact, go to Treeherder, click a green ‘D’ job, then Job details > Inspect Task, where you should find a list of artifacts.

[4] It’s not really true that build tasks don’t depend on anything. Any task that uses a task-image depends on the task that creates the image. I’m sorry for saying ‘task’ five times in every sentence, by the way.

February 12, 2016

Mach is the Mozilla developer's swiss army knife. It gathers all the important commands you'll ever
need to run, and puts them in one convenient place. Instead of hunting down documentation, or asking
for help on irc, often a simple |mach help| is all that's needed to get you started. Mach is great.
But lately, mach is becoming more like the Mozilla developer's toolbox. It still has everything you
need but it weighs a ton, and it takes a good deal of rummaging around to find anything.

Frankly, a good deal of the mach commands that exist now are either poorly written, confusing to use,
or even have no business being mach commands in the first place. Why is this important? What's wrong
with having a toolbox?

Here's a quote from an excellent article on engineering effectiveness from the Developer
Productivity lead at Twitter:

Finally there’s a psychological aspect to providing good tools to engineers that I have to
believe has a really (sic) impact on people’s overall effectiveness. On one hand, good tools are
just a pleasure to work with. On that basis alone, we should provide good tools for the same
reason so many companies provide awesome food to their employees: it just makes coming to work
every day that much more of a pleasure. But good tools play another important role: because the
tools we use are themselves software, and we all spend all day writing software, having to do so
with bad tools has this corrosive psychological effect of suggesting that maybe we don’t
actually know how to write good software. Intellectually we may know that there are different
groups working on internal tools than the main features of the product but if the tools you use
get in your way or are obviously poorly engineered, it’s hard not to doubt your company’s
overall competence.

Working with good tools is a pleasure. Rather than breaking mental focus, they keep you in the zone.
They do not deny you your zen. Mach is the frontline, it is the main interface to Mozilla for most
developers. For this reason, it's especially important that mach and all of its commands are an
absolute joy to use.

There is already good documentation for building a mach command, so I'm not going to go
over that. Instead, here are some practical tips to help keep your mach command simple, intuitive
and enjoyable to use.

Keep Logic out of It

As awesome as mach is, it doesn't sprinkle magic fairy dust on your messy jumble of code to make it
smell like a bunch of roses. So unless your mach command is trivial, don't stuff all your logic into
a single mach_commands.py. Instead, create a dedicated python package that contains all your functionality,
and turn your mach_commands.py into a dumb dispatcher. This python package will henceforth be
called the 'underlying library'.

Doing this makes your command more maintainable, more extensible and more re-useable. It's a
no-brainer!
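
As a rough sketch, a "dumb dispatcher" might look something like this; mytool and its runner module are made-up names standing in for your underlying library:

    # mach_commands.py - a thin wrapper; all real logic lives in the mytool package.
    from mach.decorators import CommandProvider, Command, CommandArgument

    @CommandProvider
    class MachCommands(object):
        @Command('mytool', category='misc',
                 description='Run mytool against the tree.')
        @CommandArgument('--verbose', action='store_true',
                         help='Print extra output.')
        def mytool(self, verbose=False):
            # Import lazily so nothing heavy happens at |mach help| time.
            from mytool import runner
            return runner.run(verbose=verbose)

All the interesting behaviour lives in the mytool package, which can then be tested, extended and reused on its own.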

No Global Imports

Other than things that live in the stdlib, mozbuild or mach itself, don't import anything in a
mach_commands.py's global scope. Doing this will evaluate the imported file any time the mach
binary is invoked. No one wants your module to load itself when running an unrelated command or
|mach help|.

It's easy to see how this can quickly add up to be a huge performance cost.

Re-use the Argument Parser

If your underlying library has a CLI itself, don't redefine all the arguments with
@CommandArgument decorators. Your redefined arguments will get out of date, and your users will
become frustrated. It also encourages a pattern of adding 'mach-only' features, which seem like a
good idea at first, but as I explain in the next section, leads down a bad path.

Instead, import the underlying library's ArgumentParser directly. You can do this by using the
parser argument to the @Command decorator. It'll even conveniently accept a callable so you
can avoid global imports. Here's an example:
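
A minimal sketch of that pattern, assuming the underlying library exposes its parser from a hypothetical mytool.cli module:

    from mach.decorators import CommandProvider, Command

    def mytool_parser():
        # Deferred import keeps the parser module out of mach's start-up path.
        from mytool.cli import get_parser
        return get_parser()

    @CommandProvider
    class MachCommands(object):
        @Command('mytool', category='misc', parser=mytool_parser,
                 description='Run mytool using its own argument parser.')
        def mytool(self, **kwargs):
            from mytool import runner
            return runner.run(**kwargs)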

If the underlying ArgumentParser has arguments you'd like to avoid exposing to your mach command,
you can use argparse.SUPPRESS to hide it from the help.

Don't Treat the Underlying Library Like a Black Box

Sometimes the underlying library is a huge mess. It can be very tempting to treat it like a black
box and use your mach command as a convenient little fantasy-land wrapper where you can put all the
nice things without having to worry about the darkness below.

This situation is temporary. You'll quickly make the situation way worse than before, as not only
will your mach command devolve into a similar state of darkness, but now changes to the underlying
library can potentially break your mach command. Just suck it up and pay a little technical debt
now, to avoid many times that debt in the future. Implement all new features and UX improvements
directly in the underlying library.

Keep the CLI Simple

The command line is a user interface, so put some thought into making your command useable and
intuitive. It should be easy to figure out how to use your command simply by looking at its help. If
you find your command's list of arguments growing to a size of epic proportions, consider breaking
your command up into subcommands with an @SubCommand decorator.
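
For example, a command with two subcommands might be sketched like this (again with made-up mytool names):

    from mach.decorators import CommandProvider, Command, SubCommand

    @CommandProvider
    class MachCommands(object):
        @Command('mytool', category='misc', description='Work with mytool.')
        def mytool(self):
            # Bare |mach mytool| just points people at the subcommands.
            print('Run |mach help mytool| to see the available subcommands.')

        @SubCommand('mytool', 'lint', description='Run the mytool linters.')
        def mytool_lint(self):
            from mytool import lint
            return lint.run()

        @SubCommand('mytool', 'test', description='Run the mytool test suite.')
        def mytool_test(self):
            from mytool import runner
            return runner.run()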

Rather than putting the onus on your user to choose every minor detail, make the experience more
magical than a Disney band.

Be Annoyingly Helpful When Something Goes Wrong

You want your mach command to be like one of those super helpful customer service reps. The ones
with the big fake smiles and reassuring voices. When something goes wrong, your command should calm
your users and tell them everything is ok, no matter what crazy environment they have.

Instead of printing an error message, print an error paragraph. Use natural language. Include all
relevant paths and details. Format it nicely. Create separate paragraphs for each possible failure.
But most importantly, only be annoying after something went wrong.

Use Conditions Liberally

A mach command will only be enabled if all of its condition functions return True. This keeps
the global |mach help| free of clutter, and makes it painfully obvious when your command is or isn't
supposed to work. A command that only works on Android, shouldn't show up for a Firefox desktop
developer. This only leads to confusion.
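
As an illustrative sketch (frob-fennec and mytool are invented names; mozbuild's MachCommandConditions provides ready-made checks such as is_android), a Fennec-only command could be declared like this:

    from mach.decorators import CommandProvider, Command
    from mozbuild.base import MachCommandConditions as conditions

    @CommandProvider
    class MachCommands(object):
        @Command('frob-fennec', category='post-build',
                 conditions=[conditions.is_android],
                 description='Do something that only makes sense for Fennec.')
        def frob_fennec(self):
            from mytool import fennec
            return fennec.frob()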

If the user does not have an active fennec objdir, the above command will not show up by default in
|mach help|, and trying to run it will display an appropriate error message.

Design Breadth First

Put another way, keep the big picture in mind. It's ok to implement a mach command with super
specific functionality, but try to think about how it will be extended in the future and build with
that in mind. We don't want a situation where we clone a command to do something only slightly
differently (e.g. |mach mochitest| and |mach mochitest-b2g-desktop| from back in the day) because the
original wasn't extensible enough.

It's good to improve a very specific use case that impacts a small number of people, but it's better
to create a base upon which other slightly different use cases can be improved as well.

Take a Breath

Congratulations, now you are a mach guru. Take a breath, smell the flowers and revel in the
satisfaction of designing a great user experience. But most importantly, enjoy coming into work and
getting to use kick-ass tools.

I first learned about pytest when I joined Mozilla in late 2010. Much of the browser based automation at that time was either using Selenium IDE or Python’s unittest. There was a need to simplify much of the Python code, and to standardise across the various suites. One important requirement was the generation of JUnit XML reports (considered essential for reporting results in Jenkins) without compromising the ability to run tests in parallel. Initially we looked into nose, but there was an issue with this exact requirement. Fortunately, pytest didn’t have a problem with this – JUnit XML was supported in core and was compatible with the pytest-xdist plugin for running tests in parallel.

Ever since the decision to use pytest was made, I have not seen a compelling reason to switch away. I’ve worked on various projects, some with overly complex suites based on unittest, and I’ve always been grateful when I’ve been able to return to pytest. The active development of pytest has meant we’ve never had to worry about the project becoming unsupported. I’ve also always found the core contributors to be extremely friendly and helpful on IRC (#pylib on irc.freenode.net) whenever I need help. I’ve also more recently been following the pytest-dev mailing list.

I’ve recently written about the various plugins that we’ve released, which have allowed us to considerably reduce the amount of duplication between our various automation suites. This is even more critical as the Web QA team shifts some of the responsibility and ownership of some of their suites to the developers. This means we can continue to enhance the plugins and benefit all of the users at once, and our users are not limited to teams at Mozilla. The pytest user base is large, and that means our plugins are discovered and used by many. I always love hearing from users, especially when they submit their own enhancements to our plugins!

There are a few features I particularly like in pytest. Highest on the list is probably fixtures, which can really simplify setup and teardown, whilst keeping the codebase very clean. I also like being able to mark tests and use this to influence the collection of tests. One I find myself using a lot is a ‘smoke’ or ‘sanity’ marker, which collects a subset of the tests for when you can’t afford to run the entire suite.
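
As a small, self-contained sketch of both ideas, a fixture can handle setup and teardown around each test (here with an in-memory SQLite database), and a marker can tag the smoke subset:

    import sqlite3

    import pytest

    @pytest.fixture
    def db():
        # Setup: create an in-memory database before the test runs...
        conn = sqlite3.connect(':memory:')
        conn.execute('CREATE TABLE results (name TEXT, passed INTEGER)')
        yield conn
        # ...teardown: always close it afterwards, even if the test failed.
        conn.close()

    @pytest.mark.smoke
    def test_insert_one(db):
        db.execute("INSERT INTO results VALUES ('smoke', 1)")
        assert db.execute('SELECT COUNT(*) FROM results').fetchone()[0] == 1

    def test_starts_empty(db):
        assert db.execute('SELECT COUNT(*) FROM results').fetchone()[0] == 0

Running pytest with -m smoke then collects only the tests carrying the smoke marker.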

During the sprint in June, I’d like to spend some time improving our plugins. In particular I hope to learn better ways to write tests for plugins. I’m not sure how much I’ll be able to help with the core pytest development, but I do have my own wishlist for improvements. This includes the following:

February 02, 2016

As promised in my last blog posts I don’t want to only blog about the goals from past quarters, but also about planned work and what’s currently in progress. So this post will be the first one to shed some light on my active work.

First, let’s get started with my goals for this quarter.

Execute firefox-ui-tests in TaskCluster

Now that our tests are located in mozilla-central, mozilla-aurora, and mozilla-beta we want to see them run on a check-in basis, including on try. Usually you would set up Buildbot jobs to get the desired tasks running. But given that the build system will be moved to Taskcluster in the next couple of months, we decided to start directly with the new CI infrastructure.

So how will this look, and how will mozmill-ci cope with that? For the latter I can say that we don’t want to run more tests than we do right now. This is mostly due to our limited infrastructure, which I have to maintain myself. Needing to run firefox-ui-tests for each check-in on all platforms, and even for try pushes, would mean that we totally exceed the machine capacity. Therefore we continue to use mozmill-ci for now to test nightly and release builds for en-US and also a couple of other locales. This might change later this year when mozmill-ci can be replaced by running all the tasks in Taskcluster.

Anyway, for now my job is to get the firefox-ui-tests running in Taskcluster once a build task has finished. Although this can only be done for Linux right now, it shouldn’t matter that much given that nothing in our firefox-puppeteer package is platform dependent so far. Expanding testing to other platforms should be trivial later on. For now the primary goal is to see the results of our tests in Treeherder and to let developers know what needs to be changed if, for example, UI changes are causing a regression for us.

Documentation of firefox-ui-tests and mozmill-ci

We have been submitting our test results to Treeherder for a while now and they are pretty stable. But the jobs are still listed as Tier-3 and are not taken care of by sheriffs. To reach Tier-2 level we definitely need proper documentation for our firefox-ui-tests, and especially for mozmill-ci. In case of test failures or build bustage the sheriffs have to know what needs to be done.

Now that the dust caused by all the refactoring and by moving the firefox-ui-tests to hg.mozilla.org has settled a bit, we want to start working more with contributors again. To make contributing easy I will create various pieces of project documentation which will show how to get started and how to submit patches. Ultimately I want to see a Quarter of Contribution project for our firefox-ui-tests around the middle of this year. Let’s see how this goes…

January 27, 2016

Bug 1233220 added a new Android-only mochitest-chrome test called test_awsy_lite.html. Inspired by https://www.areweslimyet.com/mobile/, test_awsy_lite runs similar code and takes similar measurements to areweslimyet.com, but runs as a simple mochitest and reports results to Perfherder.

There are some interesting trade-offs to this approach to performance testing, compared to running a custom harness like areweslimyet.com or Talos.

+ Tests can be run locally to reproduce and debug test failures or irregularities.

+ There’s no special hardware to maintain. This is a big win compared to ad-hoc systems that might fail because someone kicks the phone hanging off the laptop that’s been tucked under their desk, or because of network changes, or failing hardware. areweslimyet.com/mobile was plagued by problems like this and hasn’t produced results in over a year.

? Your new mochitest is automatically run on every push…unless the test job is coalesced or optimized away by SETA.

? Results are tracked in Perfherder. I am a big fan of Perfherder and think it has a solid UI that works for a variety of data (APK sizes, build times, Talos results). I expect Perfherder will accommodate test_awsy_lite data too, but some comparisons may be less convenient to view in Perfherder compared to a custom UI, like areweslimyet.com.

– For Android, mochitests are run only on Android emulators, running on AWS. That may not be representative of performance on real phones — but I’m hoping memory use is similar on emulators.

– Tests cannot run for too long. Some Talos and other performance tests run many iterations or pause for long periods of time, resulting in run-times of 20 minutes or more. Generally, a mochitest should not run for that long and will probably cause some sort of timeout if it does.

For test_awsy_lite.html, I took a few short-cuts, worth noting:

test_awsy_lite only reports “Resident memory” (RSS); other measurements like “Explicit memory” should be easy to add;

test_awsy_lite loads fewer pages than areweslimyet.com/mobile, to keep run-time manageable; it runs in about 10 minutes, using about 6.5 minutes for page loads.

Results are in Perfherder. Add data for “android-2-3-armv7-api9” or “android-4-3-armv7-api15” and you will see various tests named “Resident Memory …”, each corresponding to a traditional areweslimyet.com measurement.