Monday, September 21, 2015

Minimal job scheduling

As chmanchester and I were recently discussing, this is very important as we believe we can do much better on deciding what to schedule on try.

Currently, too many developers push to try with -p all -u all (which schedules all platforms and all tests). It is understandable, it is the easiest way to reduce the risk of your change being backed out when it lands on one of the integration trees (e.g. mozilla-inbound).

In-tree scheduling analysis

What if your changes would get analysed and we would determine the best educated guess set of platforms and test jobs required to test your changes in order to not be backed out on an integration tree?

For instance, when I push Mozharness changes to mozilla-inbound, I wish I could tell the system that I only need these set of platforms and not those other ones.

If everyone had the minimum amount of jobs added to their pushes, our systems would be able to return results faster (less load) and no one would need to take short cuts.

This would be the best approximation and we would need to fine tune the logic over time to get things as right as possible. We would need to find the right balance of some changes being backed out because we did not get the right scheduling on try and getting results faster for everyone.

Prioritized tests

There is already some code that chmanchester landed where we can tell the infrastructure to run a small set of tests based on files changed. In this case we hijack one of the jobs (e.g. mochitest-1) to run the most relevant tests to your changes which would can normally be tested on different chunks. Once the prioritized tests are run, we can run the remaining tests as we would normally do. Prioritized tests also applies to suites that are not chunked (run a subset of tests instead of all).

There are some UI problems in here that we would need to figure out with Treeherder and Buildbot.

Tiered testing

Soon, we will have all technological pieces to create a multi tiered job scheduling system.

For instance, we could run things in this order (just a suggestion):

Builds

Prioritized tests (run in 5 to 15 mins)

Remaining tests in normal mode

This has the advantage of using prioritized tests as a canary job which would prevent running the remaining tests if we fail the canary (shorter) job.

Post minimal run (automatic) precise scheduling (manual)

This is not specifically to scheduling the right thing automatically but to extending what gets scheduled automatically.

Imagine that you're not satisfied with what gets scheduled automatically and you would like to add more jobs (e.g. missing platforms or missing suites).

You will be able to add those missing jobs later directly from Treeherder by selecting which jobs are missing.