When working on an ecosystem of python packages where some packages
depend on others, the question arises of which versions of the
dependencies to require. There are three basic choices:

Unpegged: If foo depends on bar, allow any version of
bar to be used.

Exactly pegged: If foo depends on bar, require a specific
version of bar. This is done in python with the string
bar == 3.14 to require version 3.14 of bar.

Forward compatible: If foo depends on bar, require a
minimum version of bar. This is done in python with the string
bar >= 3.14 to require at least version 3.14 of bar.
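
For concreteness, here is how the three choices look in a setuptools
setup.py (foo and bar are of course placeholders, per the above):

    # setup.py for foo -- the three dependency strategies side by side
    from setuptools import setup

    setup(
        name='foo',
        version='1.0',
        install_requires=[
            'bar',                  # 1. unpegged: any version will do
            # 'bar == 3.14',        # 2. exactly pegged
            # 'bar >= 3.14',        # 3. forward compatible
        ],
    )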

There is no magic bullet: all of these strategies have advantages and
disadvantages. In general, the API of dependencies will change and
a consumer of a particular version will only work with a certain range
of versions of the dependency. Because it is in general unknown
whether the next version of a dependency will break the API for
consuming software, there is no blanket strategy whereby
compatibility can be guaranteed via a setup.py file.

Considering the cases, case 1 allows for the most flexibility: if
any version of the dependency (bar) is installed, the dependency is
satisfied. (Otherwise, the latest version of the dependency
will be downloaded and installed
from e.g. http://pypi.python.org/ .) However, case 1 is very
vulnerable to API changes in the dependency: it does nothing to
ensure that the dependency is compatible with the consuming software.
Assuming that the latest versions of a set of packages are internally
compatible, a fresh install will give an internally compatible set of
packages. However, if a package is updated there is nothing to
guarantee that the API is compatible.

Case 2 is the most strict: the consuming package demands a particular
version of a dependency. If this strategy is followed for all
dependencies, it is assured that a particular version of the
consuming software (foo) uses a compatible version of the
dependency (bar). However, this comes at the price of
losing forward compatibility: if a new version of the dependency
(bar) is available, it will not be used regardless of
compatibility.

Case 3 seeks to balance the alternatives: the consuming package
demands at least a given version of a dependency. This
protects against using an API that is too old for the package of
interest. This strategy also allows newer versions of the software to
be installed without complaint. If the API hasn't changed, then
this is good. However, this still does not protect against API changes.
If the newest version of bar has a different API from the minimum
version specified in foo's setup.py, while setup.py won't
complain, the software will not work. Ideally, one would be able to
note post facto that there was an API-breaking change in the new
version and that all software pegged to bar >= 0.1 should really
be pegged to bar >= 0.1, bar < 1.0. However, once a distribution of (e.g.)
foo is released, it cannot meaningfully be re-released.

I've had several conversations since starting the
signal from noise
project about enhancing the statistical fidelity of
Talos
numbers, all circling "Why is this hard?". From a developer point of view,
you look at http://graphs.mozilla.org/ for a particular test, you see
a nice number per changeset. The numbers might be a little rough (or
very rough), but things are good enough, right? We just need to make
the numbers a little better and turn TBPL orange on failure.

The truth of the matter is that those nice series of numbers hide a
whole story behind them. For complex software like
Firefox ,
performance testing is not an easy problem. Talos performance testing
has historically been done by engineers who wanted to have some
numbers to compare. While this is often how software starts --
throwing things together -- it should not be mistaken for rigorous or
extensible engineering.

Where we are now

I debated whether to start with how things currently look or how
things should look. While starting with how things should look gives
an unfettered view of Firefox performance testing, I've decided to
start with how things currently work, for those familiar with the
current system, and to emphasize the challenges getting from here to
where we need to go. I'm not justifying (or contesting, for that
matter) the decisions about why it's done the way it is. I'm
just trying to explain it.

To start off, we have two kinds of tests:
startup tests
and
page load tests .
In the interest of time and simplicity, let's pretend that startup
tests start the browser, load a URL, measure the time at an event
(onload or mozAfterPaint), and then shut down the browser; and that
page load tests start the browser, load a list of pages from a
manifest (each page N times), and likewise usually measure at some
event. There are many variations on this
theme: tests can compute their own metrics, you can load the pages in
different order, etc., but the above is the basic idea.

From this, we get a series of numbers. For startup tests, it is just
a list of (e.g.) times. For page load tests, you get a series of
(e.g.) N numbers per page.
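
To make those shapes concrete, a sketch (the values are invented, and
these are not Talos's actual internal structures):

    # Invented numbers illustrating the two result shapes:
    startup_results = [412.0, 398.5, 405.2]   # one time (ms) per browser start

    pageload_results = {                      # N numbers per page
        'gearflowers.svg': [110.2, 112.9, 109.8, 111.4, 590.1],
        'hixie-002.xml':   [5021.0, 4998.3, 5003.7, 5010.2, 5015.5],
    }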

Outside of a little streamlining and a lot of details I'm glossing
over, we mostly want to keep the above procedure. The disparity
begins with what we do with those numbers.

In order to send data to our
graphserver ,
we have to get the data for each test into a
format
that graphserver likes. Since the startup test results are just a
list of (e.g.) load times, these can be directly translated to
the graph server format, using NULL for page names. For page load
tests, on the other hand, we have N numbers for each page. So Talos
averages
the values for each page and sends a list of averages. Note that this
average may not be a straight mean. The
default
is to ignore the maximum value (per page) and take the mean of the
remaining iterations. But this is configurable per-test.
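
A minimal sketch of that default per-page average (my paraphrase of
the behavior described above, not the production code):

    def drop_max_then_mean(values):
        # ignore the single highest iteration, then mean the rest
        remaining = sorted(values)[:-1]
        return sum(remaining) / len(remaining)

    drop_max_then_mean([110.2, 112.9, 109.8, 111.4, 590.1])  # ~111.1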

The list of numbers and page names is uploaded to the graphserver.
When you look at graphserver, you see a single point for each test for
each changeset, not a list. This is because graphserver does
additional averaging across the page set ,
ignoring the maximum value and taking the mean of the remaining
numbers.

This is the crux of the problem. Every time an average (of any form)
is taken, you're reducing a spectrum to a single scalar. While having
a single number makes it easy to read and deal with, what that number
means is obscured. The graphserver averaging is particularly
hazardous. Since we average across pages, we are averaging numbers
that may be of very different scales. So pages that take longer to
load/render have more weight than pages that take less time to
load...EXCEPT we throw away the most expensive page. If you
think about it, that is strange: the most expensive page is likely to be
consistent from run to run (if it's not, then other strange things
could happen in this averaging). We run this page many times, upload
the results, and then ignore them.

The averaging on the Talos side is also problematic, though more
subtly. As documented in
Larres' thesis ,
load/render times for a particular page do not follow a bell curve
distribution. Multi-modal distributions are often seen in practice and
dropping the maximum value was (probably) done in order to nudge the
data towards the lowest mode. However, when this doesn't work, the
averaging is just misleading. While several hypotheses have been
proposed, no one ultimately knows what conditions cause the
multi-modality. This would be a worthy field of study in its own right.

So now we've reduced (in the page load test case) a 2d array of
numbers into a single number per test (or page set, depending on your
perspective) per changeset for display on http://graphs.mozilla.org/ .
Now how do we detect regressions?

According to our
documentation,
"to determine whether a given point is 'good' or 'bad', we take 20-30
points of historical data, and 5 points of future data." Of course, it
doesn't say how we use this data. Larres tells us more here:
https://wiki.mozilla.org/images/c/c0/Larres-thesis.pdf#page=74
Essentially, we make two windows: one before the data point, and one
after the data point. We use a
t-test
to see if there is statistical significance between the two series.
If a regression is detected, the
dev-tree-management
list is emailed. Then
mbrubeck actually looks at the data
and tries to figure out if it's an actual regression. I get fifty or
more of these emails per day. Most of them don't appear to be actual
regressions, at least to the naked eye given the amount of noise in
the various data sets.
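
The windowed comparison is simple enough to sketch; something like the
following, using scipy's independent-samples t-test (the window sizes
and significance threshold here are illustrative, not the production
values):

    from scipy.stats import ttest_ind

    def looks_like_regression(history, future, alpha=0.05):
        # history: ~20-30 points before the push; future: ~5 points after
        t_statistic, p_value = ttest_ind(history, future)
        return p_value < alpha  # flag statistically significant differences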

Using the "past" and "future" windows is intended as a "before" and
"after" picture. However, a big implicit assumption is that the
numbers are flat for each segment: that is to say, nothing pushed
in the range before alters performance and nothing pushed in the
range after alters performance. This is a pretty brazen
assumption and has certainly been wrong often enough in practice.

While looking at a single number on graphserver is very convenient, it
is also misleading. The statistics applied to Talos data to determine
whether a performance regression or improvement has occurred are a good
example of a very engineer-y metric: various tactics are tried until
something is found that is sorta stable and "looks right", but it isn't
clear at all what it measures or how rigorous it is.

We can do better.

Where we want to be

The most important part of solving a problem is to get people to care
about the problem. A critical part of getting software engineers to
care about a problem is to build a system that is easy (and maybe even
fun) to use. We need to be able to rigorously identify regressions.
This is a hard task. If a regression is seen, whatever UI we build for
it should clearly display it and clearly display why it is a
regression. One artefact of our current system where we reduce all
the data into a single number per changeset is that it is not at all
clear if the regression is real or noise. We have no ability to drill
down in the data and see which pages regressed. We have no particular
clue as to what happened. In fact, regressions aren't marked on the
chart at all.

Another critical part is sending the right signal to the
right people. This means getting rigorous regression information into
the hands of people that can understand (with the tools given) the
extent of the regression and can hopefully help determine why.
TBPL should go orange if a regression is pushed. This does not mean
that we can never have regressions -- that is unrealistic, as desired
features may require performance regressions to implement, and
trade-offs must sometimes be made between competing performance
metrics (the infamous example being speed vs memory).

https://wiki.mozilla.org/Auto-tools/Projects/Signal_From_Noise and
other
blog posts here have discussed in detail why we
want to keep the full spectrum of numbers that we get. We don't
measure our noise levels. We don't know how many samples are required
for convergence or if we've reached it or if that's even possible
(though Larres has done
some analysis ).
We need to know this. Some tests we might run for too many
iterations. Others we run for most assuredly far too few.

And, as
discussed elsewhere
,
I think it is extremely important that
people actually look at this data. Having a system that is easy to
use, rigorous, and that documents how it does its calculations will be
a huge help, as people would actually have a reason to want to use
it. But we also need someone that's really ready and willing to drill
down and mine the data for the knowledge it contains. Why do we get
multi-modal distributions? What are we testing? What aren't we
testing?

If you think this all sounds hard... it is! It's a lot of work and
there aren't many appreciable shortcuts. Much of our work thus far
has been ripping out hacks that were made for expediency in the past,
and replacing them with less hacky code. There are some things worth
doing right. Going without performance tests for Firefox is pretty
much unthinkable, so we're left with the alternative: actually making
a system that works.

Talos

This takes care of the buildbot information. For desktop talos, it is
then possible to call PerfConfigurator with the arguments from
mozilla-tests/config.py and generate a Talos configuration file.
remotePerfConfigurator
currently requires a device to be attached in order to work correctly,
so I punted on that problem for the time being.
Once the config file is generated, it can be read to introspect how
the tests are being run.

Hovering over a Talos letter on TBPL, you can see the full name of the
associated (TBPL) suite: e.g., hovering over T (n) shows Talos
nochrome opt was successful, took 12mins. If you click on the n, you
will see the name of the suite as reported by
buildbot: Rev4 MacOSX Lion 10.7 mozilla-central talos nochromer .
Note the nochromer from
http://hg.mozilla.org/build/buildbot-configs/file/68c191f31d39/mozilla-tests/config.py#l291
You can also see the name of the test as reported to graphserver, in
this case:

Graphserver

So we have buildbot, TBPL, and the talos sides of things figured out,
nicely lining us up to tackle graphserver. Graphserver details the
test mapping from short name to long name in the rather Kafka-esque
data.sql schema:
http://hg.mozilla.org/graphs/file/da54bac92c1b/sql/data.sql#l2568
I wanted to at least get the long graphserver names from the short
names, as these are the only strings displayed in the UI.
So I created an in-memory database using
SQLite , as there was no desire to persist
the data, just read it, and SQLite is built in to python and avoids
database-deployment woes.
The table definitions
were not SQLite-compatible, so I
created my own table definitions .
unix_timestamp() is not a SQLite function, so I removed lines
containing a reference to this function. Fortunately, this does not
affect any of the test lines I care about.
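
Roughly, the loading looks like this (a sketch; as noted, the real
script also swaps in my own table definitions):

    import sqlite3

    conn = sqlite3.connect(':memory:')  # throwaway database; nothing persisted
    with open('data.sql') as f:
        for statement in f.read().split(';'):
            statement = statement.strip()
            if not statement or 'unix_timestamp(' in statement:
                continue  # skip blanks and the non-SQLite function calls
            conn.execute(statement)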

Putting it all together you get a table following the information flow:

buildbot has test suites which contain arguments to PerfConfigurator

PerfConfigurator generates a YAML file which is used as
configuration to run one or more tests

the tests report results to graphserver and the resulting links are
displayed on TBPL

the buildbot suite is reported to TBPL

graphserver maps the Talos test names, plus an extension for the
page load test case, to a full name displayed in its UI

I called the script I wrote to parse all of this talosnames:
http://k0s.org/mozilla/hg/talosnames . It's one of the messiest scripts
I've ever written, though I suppose it's partially amazing, given that
no one ever thought about doing this before, that it was possible to
write at all.

Currently, talosnames outputs just a single page which I host here:
http://k0s.org/mozilla/talos/talosnames.html
It's not dynamic currently, but if it needs to be regenerated please
feel free to ping me and I can do this.

How this could be easier

In general, this was mostly an exercise in untangling a web that we
ourselves wove. If we had decided on and stuck with conventions up
front, there would be nothing to do here.

remotePerfConfigurator currently requires a device to be
attached to generate configuration:
https://bugzilla.mozilla.org/show_bug.cgi?id=775221 .
If remotePerfConfigurator could work sans a device, we could
generate and inspect this test information in talosnames.

I couldn't really figure out which buildbot command lines were for
desktop and which were for mobile. I probably could have eventually
tracked this down, or used a much easier hack whereby if
--fennecIDs was in the command line I'd call
remotePerfConfigurator, though the above bug prevented action on this anyway.

data.sql should mostly go away

up-to-date data structures: talosnames grabs the tip of TBPL's
Config.js
and
Data.js ,
buildbot-config tip's
config.py ,
and graphserver tip's
data.sql .
While this gets the latest information, it is unknown what the
deployment state of any of these files is.

TODO

While I am glad to be able to sort this out a bit, a lot more could be
done given the time.

a TBPL-like view that displays the TBPL abbreviations and maps to
buildbot suites and tests

list which buildbot suites are active or inactive

Talos counters: the talos test config lists some of the counters
(although not all) in http://k0s.org/mozilla/talos/talosnames.html .
Graphserver, on the other hand, has entries for each of these
counters on a per-test basis. Counters are mostly a mess in Talos.
It would be nice to consolidate them and display in
talosnames all of the counters associated with a Talos test.

Last week I pushed a fix to
bug 704654
that fixes a number of issues, conceptual and user-facing, with how
Talos handles configuration. I've had an idea of how I wanted to do
this for a few months now, but it has always been tabled. But with my
(joking, sorry) pledge to Bob Moss to fix all bugs in
Talos
by the end of the quarter, and a free weekend, instead of killing the
prerequisite bugs as I usually do I decided to tackle the problem in
one go.
My goals:

remove the need to edit several different configuration files to
change the configuration basis. Most .config edits needed to happen in
5 places (formerly 6). This is not only prone to human error (which
I and others have been guilty of many times), it is
a discouragement to changing the default configuration.

consistent and declarative serialization/deserialization. Serialization in
PerfConfigurator was mostly awful, scanning through line by line and
looking for particular strings in (basically) an if-else tree, often
depending on particular whitespace or other subtle (and
undocumented) formatting issues. While the .config files conform to
YAML, we don't make use of this for de/serialization. In addition,
while in run_tests.py we allow command line overrides for the YAML
items, we do not post-process them as we would in
PerfConfigurator.

consistent error checking. Currently some of our config-checking is
in PerfConfigurator and some is done in run_tests. This opens the
possibility that either may miss cases the other
would catch. If you call run_tests.py with a .yml file, you will
not get the checking done for the combination of command line items
and the .yml configuration that is done in PerfConfigurator. Since
we process a lot of command line items into resulting configuration,
this can lead to interesting results (e.g. while --activeTests is a
command line item for run_tests.py, it is not used, anywhere).
In general, configuration should be checked in one place before any
program logic takes place. While this patch doesn't completely
address this issue, it is a big step forward and should pave the way
for future improvement.

configuration should be declarative. You should get what you expect
from configuration, not inconsistent results. If you edit (e.g.) a
.yml file with the existing Talos, you have no real way to know
whether the keys you add or edit are going to be used by run_tests.py
(or what format they should be in, etc.). Having a basis for
configuration gives a single place to denote what is expected (and
thereby what isn't allowed) and the form that it is supposed to be
in. It is also nice to have all configuration in a single place
instead of having to look at a bunch of config files for the basis
as well as all over the code to see what is expected and how it is
processed.

allow running directly from run_tests.py . For particular
(e.g. production) systems, it may be advisable to use tuned
(.yml) configuration files to have highly customized runs (note that
we don't do this and use (remote)PerfConfigurator in all cases for
reasons that may be inferred from the above). However, for a typical
developer, there is little reason to run
PerfConfigurator -e `which firefox` -a ts --develop -o ts.yml && talos -n -d ts.yml
for a particular run. Instead, with this patch the entirety of this
may be invoked as
talos -n -d -e `which firefox` -a ts --develop -o ts.yml
in a one-step process. (Note that we're still dumping to ts.yml,
though one wouldn't have to if the result is intended as ephemeral.)

I hear people prefer blog posts with pictures, so for no reason here
is a bunch of cute foxes:

I've moved the basis of the Talos configuration to
PerfConfigurator.py
instead of some combination of .config files, PerfConfigurator.py, and
run_tests.py.
This gets rid of the duplication among the various config files as
well as with the command line options. In fact, there isn't much left
of the configuration files.

I don't like configuration to live in code, and so empathize with
those who look at this cautiously from that point of view. However,
PerfConfigurator following my rework isn't so much configuration, but
a configuration basis. Given the goals above, some piece of code has
to validate a given configuration, has to know what data is in a
configuration, and has to provide whatever command line options are
used to front-end the configuration. The previous incarnation of
Talos and PerfConfigurator had a significant amount of code to this
end, but it was both spread out and incomplete. So I don't think
putting it all in one place is a big conceptual change. Having a
piece of code that knows the allowable form of configuration gives
great power and having the code all in one place just makes it more
human-readable.

The unofficial history of Talos configuration, as I understand it,
goes something like this: Initially, there was one configuration
file. You copied it, edited it by hand, and ran your tests on it. At
some point, this became cumbersome, and PerfConfigurator was created
to automatically fill in values from a set of command-line choices,
and in addition allow the values to be marked up a bit. The road was
already paved for some part of configuration basis living in code
versus in the .config file. Then, as the need to run tests in
different configurations grew, .config files flourished to this
end. I'd like to think of the changes for
bug 704654 as
the next logical step in Talos's configuration evolution.

Longer term, we'd like to remove even more of Talos's configuration and
replace .yml files with command line options. The complexity of
configuration will be managed by
mozharness .

Currently, the canonical unit of Talos tests is a page set.
However, a page-centric point of view offers several intrinsic
advantages on top of being, in my opinion, more conceptually coherent.

A page-centric point of view allows easy adding and updating of
pages. Currently, making a new page set is a big deal. Since we
average over all pages in a page set to obtain a quality metric,
adding a new page (or removing a page) will change this number and
the entire baseline for comparison has to be recentered. If we
made the page the canonical unit of testing, then adding or
removing a page wouldn't involve a recentering, as each page would
have a quality metric associated with it.

Taking an average over all pages to get a quality metric, as we do,
gives a higher weight to pages that take (e.g.) longer to load. For
instance, consider the output for tsvg:

A performance loss (or gain) in e.g. gearflowers.svg is likely not
to be noticed in this pageset, as its numbers are several orders of
magnitude lower than (e.g.) hixie-002.xml's, so small percentage-wise
noise in the latter could easily hide a legitimate regression in the former.
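
A toy calculation (numbers invented) shows how lopsided the weighting is:

    small, big = 4.0, 5000.0              # two pages of very different cost (ms)
    baseline  = (small + big) / 2         # 2502.0
    regressed = (small * 1.5 + big) / 2   # 50% regression in small page: 2503.0
    noisy     = (small + big * 1.01) / 2  # 1% noise in the big page: 2527.0

A 50% regression in the small page moves the cross-page average less
than 1% noise in the big one.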

Having this additional data of what changes regress which pages allows
us to explore how these particular page modifications affect
performance. If we can isolate patterns, we can fix them.

One conceptual disadvantage to a page-centric approach is that
deciding whether a changeset is a net regression or not becomes
harder. Ideally a human (or other expert system) would evaluate all
of the data across pages and decide whether a change is a regression
or not. However, we have many pages and not enough people, so this is
harder to do than to craft a formula for a quality metric.
To obtain an overall quality metric for a push, some sort of averaging
over pages must be done. We currently throw away the highest value
and take the mean of the remaining page averages. If we continue with
this approach, we throw away the ability to easily add and remove
pages without futzing with the metric. Instead, a method should be
sought whereby adding a new page does not affect the metric.

While this is a small change in terms of how the code currently works,
it lays the groundwork for a window of possibilities in terms of Talos
statistics. Currently, pageloader calculates the "median" (ignoring
the high value), the mean, the max, and the min, and outputs these
along with the raw run data. Pageloader is for loading pages and
taking measurements, not really for doing statistics. So it would be
nice to move this upstream: first to Talos, then to graphserver proper.

Being able to specify data filters with --filter from the command
line and filter: in the .yml configuration file allows the
test-runner to change the "interesting number" by which we measure
performance metrics on the fly. While there are currently only a few
filters available, it is easy to add more metrics as we need them.
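
To give the flavor of what a filter is (the function names here are
illustrative, not necessarily the shipped set): a filter is just a
function from a list of raw values to a smaller list or a scalar, and
filters are applied in order:

    def ignore_first(values, n=1):
        # drop the first n iterations (e.g. warm-up runs)
        return values[n:]

    def median(values):
        values = sorted(values)
        middle = len(values) // 2
        if len(values) % 2:
            return values[middle]
        return (values[middle - 1] + values[middle]) / 2.0

    raw = [110.2, 590.1, 111.4, 109.8, 112.9]
    interesting_number = median(ignore_first(raw))  # 112.15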

In a parallel effort, the
JetPerf software
consumes Talos
filters .
This is a good example of the expansion of the Talos ecosystem:
building tests and frameworks on top of a critical piece of our
performance testing infrastructure. In general, the
A Team is moving towards a
testing ecosystem of reusable parts and sane APIs.

Data filters were added to talos
as an interim measure to make the "interesting number" calculations
more flexible. As we play with different types of statistics, we need
the ability to change configuration without having to jump through too
many hoops, and filters fulfill this immediate need.

However, in the longer term, Talos and pageloader shouldn't really be
doing statistics at all. They are in the "statistics gathering" camp,
while
graphserver is in the
"statistics processing" business. It would also be nice if there were
a piece of software that let you analyze Talos results locally,
ideally using the same statistics processing package that graphserver uses.
This is outlined in
https://bugzilla.mozilla.org/show_bug.cgi?id=721902 .

For the old data and the low value of the new data, we see times around
110-120ms. The high value of the new data is around 590ms. Are these
numbers what we'd expect?

Throwing away the high value and taking the median for both data sets
gives a number on the order of 100 or so (the old algorithm). Taking
the median filters the bifurcated results towards the majority
population. Since the low population is slightly in the majority,
dropping the highest number in the way that pageloader does further
biases towards it. It is not surprising we see no bifurcation in the
old data.
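
A toy bimodal series (values invented, in the ranges above) shows the bias:

    runs = [112.0, 590.0, 115.0, 592.0, 118.0]        # 3 low-mode, 2 high-mode
    trimmed = sorted(runs)[:-1]                       # drop highest: [112, 115, 118, 590]
    mid = len(trimmed) // 2
    result = (trimmed[mid - 1] + trimmed[mid]) / 2.0  # 116.5: squarely in the low mode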

For the new data, we drop the first run. Coincidentally or not, for the cases
studied the first run was part of the low population, so that tends
towards bifurcation. Taking the median of the remaining data points gives

Okay, mystery solved. We know why graphserver is reporting the data
it is reporting, and we also know that our algorithm is doing what we
think it is doing. However, this is the beginning instead of the end
of the problem.

By taking the average and discarding the high value of two data
points, we are doing something weird and wrong. We are effectively
only reporting one of the two pages. Note that for the high and the
low cases we are actually viewing data from different pages! This
is misleading and probably outright wrong. We essentially have two
pages just to throw one of them away, and then we have no confidence in
what we are looking at. I'm not sure if the code at
http://hg.mozilla.org/graphs/file/d93235e751c1/server/pyfomatic/collect.py#l208
would even work for a single page. Probably not. In general I grow
increasingly skeptical of our amalgamation of results. We
increasingly need to be able to get at and manipulate the raw data. We
certainly need a way of digging into the stats so we know what we're
looking at and can have confidence in it. In general, talos, pageloader,
and graphserver need to be made such that it is both easier to try new
filters and more transparent what is actually happening.

We have been trying to bias towards the low numbers. Looking at the
data for the four tests shows that there are 13 low-state numbers and
7 high-state numbers. While there are more numbers in the low state,
it is not an overwhelming majority.

This leaves the big elephant in the room: why are these runs
bifurcated? Are we seeing a code path, or is something else happening
on these builders that leads to bifurcated results? While this will
be challenging to investigate, IMHO we should know why this happens.
While our method of throwing out the highest data point, getting the
median, throwing the data to graphserver, then getting the average of
the whole pageset back, has a positive effect of minimizing noise
(which is important), it is also sweeping a lot under the rug. We
need to have confidence that what we're ignoring is okay to ignore. I
don't have that confidence yet.

Playing with Jetpack + Talos performance lets us explore statistics in
a bit more straightforward manner than the production Talos numbers.
As part of the
Signal from Noise
project, which I am also part of, staging even small changes in how we
process Talos data takes many steps, since the system
involved has many moving parts
(
Talos,
pageloader,
graphserver
). By contrast, since JetPerf is a new project, it is much more
flexible for exploring data that we have not hitherto explored.

Looking at raw numbers wasn't very interesting, so I made a
parser
for Talos's
data format .
It was pretty quick to get some
averages
out before and after the addon was installed, but I thought it would
be more useful to display the raw data along with the averages.

These really aren't fair numbers, as currently the stub jetpack I use
prints to a file, but it's at least the start of a methodology.

The reason I'm sharing this isn't just to make a progress report, but
more to present some ideas about thinking about what to do with Talos data.
While this was done for JetPerf, much of this also applies to Signal
from Noise. You run Talos and get some results. What do you do with
them? Currently we just shove them into http://graphs.mozilla.org/
and say that's where you process them, but I think looking at them
locally is not only important but necessary if you're doing
development work. I think a big part of any statistics-heavy project
is to make it easy for all of the stakeholders to explore data,
apply different filters and see how things fit together. While it
takes a statistician to be rigorous about the process, anyone can play
with statistics and it takes a village to really conceptualize what is
being looked at. I hope, to this end, developers will use my software
so that they can understand what it is doing and provide the valuable
feedback I need.

TODO

JetPerf is still very much at a proof of concept stage. Ignoring the
fact that none of it is in production, there are still many
outstanding questions
about basic facts of what we are doing here. But outside of polishing
rough edges, here are some things in the pipeline.

test more variations of addons; currently we just load a panel and
print something to a file

test on checkin (CI):
the main point of JetPerf is to get a better idea of which SDK
changes cause addon performance regressions or improvements, and to
be able to quantify them. While as stated this is a very open-ended
project, one thing that would turn it from a casual exploration into
a developer tool is running the tests on checkin. This would give a
real-time indication of whether a checkin breaks performance.

graphserver: in order to assess Jetpack's performance over time, we
will want to send numbers to some sort of
graphserver .
This will allow us to keep track of the data,
to view it, and apply various operations to it.

I may also spin off the (ad hoc) graphing portion and the Talos log parser
portions into their own modules, as they may be useful outside of just
Jetperf.

While buildbot comes with a
gitpoller ,
the version in
buildbot 0.8.5
(the current version on http://pypi.python.org/ ) did not work with
git 1.6.3, the version on k0s.org. Since my box is on an ancient
version of Ubuntu (and is remote and not trivially upgradable), I
brought the generic
autobot poller
from buildbot 0.8.3 compatibility to 0.8.5 compatibility
(which, it is worth noting, is not trivial).
Also, while there was
a patch for an hgpoller
submitted by
Mozilla
developers some four years ago, it has been WONTFIXed, so I
went ahead with a generic polling architecture, which (IMHO) seems a
wiser architectural choice. While I sympathize with the architectural
ideology of using a push-based architecture, and believe this is
closer to ideal, polling will always work and does not require access
to the repository servers, which is a huge factor when using
https://github.com or even Mozilla hg repositories. (Incidentally,
I found neither this patch nor
http://hg.mozilla.org/build/buildbotcustom/file/tip/changes/hgpoller.py
to work OOTB, so, sadly, I proceeded to roll my own. Also
incidentally, it is not trivial to depend on buildbotcustom using
install_requires due to its lack of a setup.py file.)
After debugging the gitpoller I pushed
a test change and was happy to see
that autobot built correctly. Autobot now listens to MozBase changes!

I was unable to finish the (parenthetical)
Q4 goal
of having autobot report to
autolog , so
this remains outstanding work. There is a lot that could be done with
autolog. The basic idea and TODOs are outlined in the
README
(which itself could use some work; it is largely up to date
except the Projects section, though incomplete). I will endeavor to
work on this in my available time or as need escalates, but my
priority for
2012 Q1
will be separating
Talos Signal From Noise
so it is unlikely I will be able to put a lot of time into autobot
(sadly). On the other hand, I am more than willing to help
and advise if anyone
wants any features or to iron out the crinkles. While the
architecture is not completely straightforward, it is a decent
approximation to a
convex hull
over the
problem space
of having simple to write, simple to maintain, simple to debug
continuous integration for small(er) projects. As usual, if anyone
wanted to seek out alternate solutions, that is fine too, but I am
essentially happy with my architecture decisions and technology
choices.

Regardless of whether the CI solution for MozBase is autobot or
(other), it is important to remember that continuous integration is a
safety net and not a first line of defense. It is regrettable that
autobot has no more notifications (yet) than the
waterfall display
and the autobot character lurking in
#ateam (the default
IRC bot
isn't very verbal OOTB and I haven't had time to customize
it). But I think having some (admittedly smoke-test level) automated
testing for MozBase is an important step in the evolution of the
software as well as in development practices in general.

From one point of view, this isn't exciting work. But I live for this
stuff. I think of software as an ecosystem to be cultivated and I
live to cultivate it. So while, for the most part, I can't point to
any exciting features that I implemented (nor were there planned to
be), in retrospect I am proud of the fruits of my efforts and those of
my team-mates and comrades. A big shout out to BYK and others who
have stepped up to the plate to help the
A-Team with these
super-important efforts.

MozBase didn't have documentation or tests worth speaking of. Now
it has at least a good start!

Talos even has a test for installation. We need more tests, but it's a good start!

There has been a lot of cleanup of Talos towards the end of making it more robust, easier to use, and easier to contribute to.

The A-Team didn't have any community contributors. Now we do!
This one actually makes me the happiest :)

When I look at the progress, I see Talos evolving towards what I would
call real software (instead of a one-off that has been extended to do
way too much to remain a one-off) that Mozillians can hack on,
extend, and make useful changes to. This also sets the stage for
making Talos easier for developers to use locally to test their
changes, for getting more of our test harnesses to use the
MozBase suite of utilities, and for making it easier to write new
harnesses without reinventing so much of the wheel.

One of our next priorities towards these ends is
Bug 713055 - get Talos on Mozharness in production .
This is a huge step towards making buildbot more extensible as well as
making desktop Talos more accessible to developers in a way that
should be identical to the way it is done in automation.
:aki has done a bunch of work to start
moving our aging buildbot infrastructure towards something more sane.
This is mozharness .

So a huge shout out to
:jmaher and
:wlach for all the Talos help, and
:ahal
and :ctalbert
as well as all the help from those in
release engineering
for making all of this possible. I look forward to getting this all
better in the coming year.

We need to fix Talos's importing of MozBase. We want to get Talos to consume
mozdevice, mozinfo, mozhttpd, mozrunner, mozprocess, and mozprofile.

The current state of things:
- talos includes copies of the mozdevice files and mozhttpd.py
- we mirror these manually, but things get out of sync

An interim solution is posed in bug 707218: mirror mozdevice,
mozinfo, and mozhttpd to talos for the purpose of creating a
tests.zip file and list them in setup.py for setuptools
installation. This works because these are all simple dependencies,
but will not work for mozprocess, mozrunner, and mozprofile, as these
all have dependencies of their own.
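
Sketched as a setup.py fragment (illustrative, not the final patch):

    from setuptools import setup

    setup(
        name='talos',
        # ...
        install_requires=[
            'mozdevice',   # simple, dependency-free packages can be
            'mozinfo',     # listed directly for setuptools installation
            'mozhttpd',
            # mozprocess, mozrunner, and mozprofile have dependencies of
            # their own, so they wait on a releng package index (bug 701506)
        ],
    )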

In order to use these dependencies (mozprocess, mozrunner,
mozprofile) in production talos, we will need a releng python
package index: Bug 701506 . I will do a mock-up there; whether it
will fulfill releng needs or not is hard to say. We will probably
want to transition to mozharness soon thereafter or at the same
time, but we shouldn't block on any more than we need to. These are
all big changes to our deployment strategy, and for purposes of QA
we will want to make as strategic and specific decisions as possible.

Once the transition is completely done, we can do away with
talos.zip entirely.

Additionally, in order to make talos work with setup.py, the
pywin32 package should be listed. However, pywin32 is in general
compiled for specific python and windows versions. See e.g.
https://bugzilla.mozilla.org/show_bug.cgi?id=673132#c8 . BYK is
looking into this, possibly switching the linked dependency based on
the platform you are on.

slewchuk is looking into Talos data aggregation: bug 707486

This is a rough map of what we want to do. As said, with so many
balls in the air, we will want to block on as little as possible and
make as few big changes at a time as we can, so that we can ensure
that each piece of the puzzle fits together correctly.

I've been developing
Talos
recently. There are many caveats to working on this test harness,
which demands a more rigorous process than, say, a webapp. It has a
large amount of necessary platform-specific code. It is deployed in a
complex infrastructure
environment. And it has no
tests.

In order to test Talos, the
A*Team
has an internal staging environment (thanks to the efforts of
anode and
bhearsum and others)
that mirrors the production testing infrastructure environment. Like
production, it requires an HTTP-hosted URL structure containing
pageloader , a pageset
(tp5 ), and other
resources necessary for
buildbot
plus Talos. (We should probably document the directory structure.)

In order to test Talos, you point the
A*Team staging environment
configuration to your HTTP-hosted location of your copy of
this structure of resources. Then you issue a buildbot sendchange
(which can be scripted for ease of use) that corresponds to a set of
Talos tests
that are run on each platform of interest.
We have some simple scripts (e.g. ./chrome.sh or
./dirty.sh) to run sets of tests as we do in production.
These translate to a variety of buildbot sendchange commands
appropriate for the tests to be run. Green runs mean good.

In order to test my Talos changes, I needed to set up a system whereby
I could translate my changes into a hosted copy of talos, pageloader,
etc. So here is what I did.

Based on
jmaher's update_talos.sh , I
wrote a script to help me turn changes into changes in my hosted copy
of talos.zip. Since I work largely in diffs hosted on
bugzilla or
my mercurial queue
of Talos patches, I wanted a script that would apply a series of
changes to a checkout of
talos .
In addition, I wanted to keep the flexibility of being able to edit
these files on disk.

The script lives at http://k0s.org/mozilla/update_talos.py . I will
endeavor to improve it as testing needs become more apparent. It
sadly loses
jmaher's update_talos.sh
feature of creating versioned zips. I thought about hosting a dedicated
talos repository for testing (and still may, if that seems better down
the line), but I usually want to test a specific change and roll back
to a known state.

After the HTTP copy is updated, I can run (e.g.) xperf.sh to
trigger that set of tests in the staging environment and watch the
waterfall to assess the viability of the change.

It would be nice to have something more generic, but the path to good
software is through iteration. Perhaps as more people develop their
own scripts to test Talos in the staging environment, we will evolve
towards a more generic script to update talos, as well as copies or
templates of the URL/directory structure that is needed and of the
staging software.

Over the years, Mozilla has developed a number of
test harnesses
for automated testing of Firefox and other applications. Most of the
harness code is written in
python due to its utility for this
type of development. As one would expect, the harnesses arose from
necessity and grew organically. However, as the harnesses grew it
became apparent that there were several generic tasks that the
harnesses shared:

creating and manipulating a profile

installing addons into the profile

invoking (e.g.) Firefox in a desired manner

process management

...a few other things

These pieces have largely been developed in a vacuum (in the early
stages) or copy+pasted from other harnesses (in the later stages).
This has led to duplicated functionality; harness software that is
difficult to maintain and inconsistent (since fixing something in one
place means it probably needs to be fixed in other places); and a
system which, once it became sufficiently complex, was fully
understood by no one. The harness software could not be reused because
it is tightly coupled to its implementation, even when the underlying
intent was generic.

Meet MozBase!

As software grows, it should be cultivated such that its effectiveness
and its knowledge base are maximized. Code should be made reusable
and the architecture evolved towards a representation of intent. This
is the goal of the MozBase effort by the
A-Team :
https://wiki.mozilla.org/Auto-tools/Projects/MozBase

we want to make high quality components to build test harnesses

... and other pieces of software

... that might be useful on their own

we want to replace existing code with these pieces

... but cultivate their knowledge base

we want to develop canonical and reusable python tools

... and encourage the community to use them

Developing
MozBase is
one of the
A-Team goals
this quarter. While cultivating software is an ongoing effort, we're
off to a good start. We already have several MozBase python packages:

Our immediate goals are to cultivate these into high-quality tools,
taking lessons from the existing harnesses. Then we will port the
harnesses to these tools so they can be maintained in a unified manner.
Right now, we're working on
Talos both because this
is a good proving ground for these tools and because much of its code
can be replaced with MozBase code easily (for some definition of
"easy").

While MozBase is about software, it is also about having a sane and
maintainable environment to cultivate software in. While modular
packages are great, their utility is in how they may be used together
(as well as with other code) instead of in the craft of an individual
package. So we're tackling these issues too.

Python importing in Mozilla Central: currently (most) python in
mozilla central is not packaged, and we manually
futz with pythonpath
and sys.path in several inconsistent and hard-to-maintain ways.
In order to move towards python packages in any reasonable fashion we
need to make importing easy and unified as well as moving towards how
the python world typically does importing. There is
bug 661908
for creating a unified virtualenv in
the $OBJDIR. Work is likely to start on this or a similar effort
soon (either this quarter or Q1 2012).

Mirroring software to Mozilla Central: we have hampered ourselves --
rewritten software and avoided fixing bugs -- by not using third-party
python packages for tools that live in mozilla-central. In addition,
since many of the test harnesses already
live in m-c ,
if we are going to move these to consume mozbase we will need a
strategy to mirror it and other software to the tree. While nothing
has been definitively decided, preliminary discussion has pointed
towards having a script to fetch resources from a variety of locations
and add them to mozilla-central or elsewhere. We're having a meeting
this week to figure out what we really want to do and go from there.

Such is the MozBase effort. I am excited to start moving our code
into a solid maintainable structure, and I hope you are too. If you
are, please check out our
github project or drop in to
#ateam and tell us what you think. We'd love contributors!

I am going to be maintaining mozregression going forward. I released a
0.6 version to pypi today which hopefully fixes a few setup.py issues.
You can find me at jhammel __at__ mozilla __dot__ com or as jhammel in #ateam.

The A-Team is working on
creating a set of high-quality python utilities that are consumable,
general purpose, and interoperable in an effort called
MozBase .
A huge part of
this quarter's effort
is to improve Talos
to consume MozBase software and to make it an extensible harness that
may also be consumed.

As one of the first steps towards making Talos consume upstream
MozBase packages, I have
made Talos a python package .
This allows Talos to depend on upstream python packages in an
automated fashion, permits additional setup/install-time steps to be
automated, and installs in a manner such that dotted paths against
talos can be resolved by python import. That is, other packages can now
usefully import talos without depending on a set directory structure.
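
Concretely (the module path here is illustrative), after a standard
install you can do things like:

    # python setup.py install (or develop), then from anywhere:
    from talos import PerfConfigurator  # no sys.path or PYTHONPATH futzing

rather than needing to know where the talos checkout lives on disk.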

Unfortunately, since the talos repository was arranged such that all
the python scripts and other data lived in a fairly disorganized
top-level directory, this involved making a talos subdirectory and
moving all files (except the README) into that subdirectory and
carefully ensuring that all data resources were properly installed
alongside the python scripts.

Even more unfortunately, this change led to some confusion that
could have been avoided ahead of time. Talos uses a tests.zip
file that contains both the scripts and the data, and though I would
have liked to do additional cleanup as part of making Talos a python
package, I deliberately held off on changing anything that would
invalidate this methodology. However, unbeknownst to me, there were
other resources that depended on the talos directory structure, and these
got broken with my change. I apologize for that, and will communicate these
changes more widely next time. In the meantime, if you have any tools
that depend on the talos directory structure, know that they will break
next time you update. If you have questions about this, please contact me.

Although the fallout was regrettable, I think this is a necessary and
forward-facing change in light of MozBase,
Mozharness , and general good
python practices. We're now looking at deprecating the tests.zip
methodology and moving towards a
Mozharness script for running Talos
for both desktop testers and production. More on that as things
progress.