However, in practice, a serial (or monotonically increasing) key can be
handy to have around. I was reminded of this during a recent situation
where we (app developers & ops) needed to be highly confident that a
replica was consistent before performing a failover. (None of us had
access to the back end to see what the DB thought the replication lag
was.)

Sphinx tip: mailhtml

I often find that I want to email around a doc I’ve put together with
sphinx (I often use the *diag or graphviz extensions). Sadly, the world
hasn’t embraced the obvious way of supporting this via ePub [1] readers
everywhere. What I want is plain html output, with nothing fancy. There’s
probably a style out there, but I just add the following target to the
Makefile generated by sphinx-quickstart:
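Something like the following works for me (a sketch: the $(SPHINXBUILD), $(ALLSPHINXOPTS), and $(BUILDDIR) variables come from the quickstart-generated Makefile; the target name is my own):

```make
# Plain, single-page HTML suitable for pasting into an email.
mailhtml:
	$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/mailhtml
	@echo "Build finished; the page is in $(BUILDDIR)/mailhtml."
```

The singlehtml builder puts everything on one page, which is what makes the result mailable.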

Py Bay 2016 - a First Report

PyBay held their first local Python conference this last weekend
(Friday, August 19 through Sunday, August 21). What a great event! I
just wanted to get down some first impressions - I hope to do more after
the slides and videos are up.

End of an Experiment

A long time ago we started an experiment to see if there was any
support for developing Mozilla products on social coding sites. Well,
the community-at-large has spoken, with the results many predicted:

pyenv & virtualenv can get along

In my (apparently) continuing list of tiny hassles with PyEnv, I
finally figured out how to “fix” the PyEnv notion of a virtualenv.
This may apply only to my setup: my main python version is managed by
homebrew.

Tuning Legacy vcs-sync for 2x profit!

One of the challenges of maintaining a legacy system is deciding how much
effort should be invested in improvements. Since modern vcs-sync is
“right around the corner”, I have been avoiding looking at improvements
to legacy (which is still the production version for all build farm use
cases).

While adding another gaia branch, I noticed that the conversion path for
active branches was both highly variable and frustratingly long. It
usually took 40 minutes for a commit to an active branch to trigger a
build farm build. And worse, that time could easily be 60 minutes if the
stars didn’t align properly. (Actually, that’s the conversion time for
git -> hg. There’s an additional 5-7 minutes, worst case, for b2g_bumper to
generate the trigger.)

The full details are in bug 1226805, but a simple rearrangement of
the jobs removed the 50% variability in the times and cut the average
time by 50% as well. That’s a savings of 20-40 minutes per gaia push!

Moral: don’t take your eye off the legacy systems – there still can be
some gold waiting to be found!

Complexity & * Practices

I was fortunate enough to be able to attend Dev Ops Days Silicon
Valley this year. One of the main talks was given by
Jason Hand, and he made some great points. I wanted to highlight two
of them in this post:

Post Mortems are really learning events, so you should hold them
when things go right, right? RIGHT!! (Seriously, why
wouldn’t you want to spot your best ideas and repeat them?)

Systems are hard – if you’re pushing the envelope, you’re
teetering on the line between complexity and chaos. And we’re
all pushing the envelope these days - either by getting fancy
or getting lean.

Post Mortems as Learning Events

Our industry has talked a lot about “Blameless Post Mortems”, and
techniques for holding them. Well, we can call them “blameless” all we
want, but if we only hold them when things go wrong, folks will get the
message loud and clear.

If they are truly blameless learning events, then you would also hold
them when things go right. And go meh. Radical idea? Not really - why
else would sports teams study game films when they win? (This point was
also made in a great Ignite by Katie Rose: GridIronOps - go read her
slides.)

My $0.02 is - this would also give us a chance to celebrate success.
That is something we do not do enough, and we all know the dedication
and hard work it takes to not have things go sideways.

And, by the way, terminology matters during the learning event. The
person who is accountable for an operation is just that: capable of
giving an account of the operation. Accountability is not
responsibility.

Terminology and Systems – Setting the right expectations

Part way through Jason’s talk, he has this awesome slide about how
system complexity relates to monitoring which relates to problem
resolution. Go look at slide 19 - here’s some of what I find
amazing in that slide:

It is not a straight line with a destination. Your most stable
system can suddenly display inexplicable behavior due to any
number of environmental reasons. And you’re back in the chaotic
world with all that implies.

Systems can progress out of chaos, but that is an uphill battle.
Knowing which stage a system is in (roughly) informs the approach
to problem resolution.

Note the wording choices: “known” vs “unknowable” – for all but
the “obvious” case, it will be confusing. That is a property of
the system, not a matter of staff competency.

While not in his slide, Jason spoke to how each level really has
different expectations. Or should have, but often the appropriate
expectation is not set. Here’s how he related each level to industry
terms.

Best Practices:

The only level with enough certainty to be able to expect the “best”
is the known and familiar one. This is the “obvious” one, because
we’ve all done exactly this before over a long enough time period to
fully characterize the system, its boundaries, and abnormal
behavior.

Once we back away from such certainty, it is only realistic to have
less certainty in our responses. With the increased uncertainty, the
linkage of cause and effect is more tenuous.

Even if we have all the event history and logs in front of us, more
analysis is needed before appropriate corrective action can be
determined. Even with automation, there is a latency to the
response.

Emergent Practices:

Okay, now we are pushing the envelope. The system is complex, and
we are still learning. We may not have all the data at hand, and may
need to poke the system to see what parts are stuck.

Cause and effect should be related, but how they relate will not be
visible until afterwards.
There is much to learn.

Novel Practices:

For chaotic systems, everything is new. A lot is truly unknowable
because that situation has never occurred before. Many parts of the
system are effectively black boxes. Thus resolution will often be a
process of trying something, waiting to see the results, and
responding to the new conditions.

Next Steps

There is so much more in that diagram I want to explore. The connecting
of problem resolution behavior to complexity level feels very powerful.

<hand_waving caffeine_level="deprived">

My experience tells me that many of these subjective terms are
highly context sensitive, and in no way absolute. Problem resolution
at 0300 local with a bad case of the flu just has a way of making
“obvious” systems appear quite complex or even chaotic.

By observing the behavior of someone trying to resolve a problem,
you may be able to get a sense of how that person views that system
at that time. If that isn’t the consensus view, then there is a gap.
And gaps can be bridged with training or documentation or
experience.

</hand_waving>

duo MFA & viscosity no-cell setup

The Duo application is nice if you have a supported mobile device, and it’s
usable even when you have no cell connection, via TOTP. However, getting
Viscosity to allow both choices took some work for me.

For various reasons, I don’t want to always use the Duo application, so I
would like for Viscosity to always prompt for a password. (I had already
saved a password - a fresh install likely would not have that issue.)
That took a bit of work, and some web searches.

Disable any saved passwords for Viscosity. On a Mac, this means
opening up “Keychain Access” application, searching for “Viscosity”
and deleting any associated entries.

Ask Viscosity to save the “user name” field (optional). I really
don’t need this, as my setup uses a certificate to identify me.
So it doesn’t matter what I type in the field. But, I like hints, so
I told Viscosity to save just the user name field:

defaults write com.viscosityvpn.Viscosity RememberUsername -bool true

With the above, you’ll be prompted every time. You have to put
“something” in the user name field, so I chose to put “push or TOTP” to
remind me of the valid values. You can put anything there, just do not check
the “Remember details in my Keychain” toggle.

Using Password Store

Password Store (aka “pass”) is a very handy wrapper for dealing
with pgp encrypted secrets. It greatly simplifies securely working with
multiple secrets. This is still true even if you happen to keep your
encrypted secrets in non-password-store managed repositories, although
that setup isn’t covered in the docs. I’ll show my setup here. (See the
Password Store page for usage: “pass show -c <spam>” & “pass
search <eggs>” are among my favorites.)
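As a sketch of the kind of setup the docs don’t cover (the repository path and entry name here are invented): pass honors the PASSWORD_STORE_DIR environment variable, so you can point it at an existing gpg-encrypted checkout directly:

```shell
# Point pass at a pre-existing encrypted-secrets repo (hypothetical path).
export PASSWORD_STORE_DIR=~/src/ops-secrets
pass show -c deploy/api-key   # copy one secret to the clipboard
```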

Decoding Hashed known_hosts Files

Modern ssh comes with the option to obfuscate the hosts it can connect
to, by enabling the HashKnownHosts option. Modern server installs
have that as a default. This is a good thing.

The obfuscation occurs by hashing the first field of the known_hosts
file - this field contains the hostname, port, and IP address used to
connect to a host. Presumably, there is a private ssh key on the host
used to make the connection, so this process makes it harder for an
attacker to utilize those private keys if the server is ever
compromised.

Super! Nifty! Now how do I audit those files? Some services have
multiple IP addresses that serve a host, so some updates and changes are
legitimate. But which ones? It’s a one-way hash, so you can’t just decode it.

Well, if you had an unhashed copy of the file, you could match host keys
and determine the host name & IP. [1] You might just have such a file on
your laptop (at least I don’t hash keys locally). [2] (Or build a
special file by connecting to the hosts you expect with the options
“-oHashKnownHosts=no -oUserKnownHostsFile=/path/to/new_master”.)

I threw together a quick python script to do the matching, and it’s at
this gist. I hope it’s useful - as I find bugs, I’ll keep it updated.
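For reference, the hashing itself is simple: each hashed first field looks like “|1|base64(salt)|base64(HMAC-SHA1(salt, hostname))”, so given a candidate hostname you can test it against an entry. A minimal sketch (the helper name is mine, not from the gist):

```python
import base64
import hashlib
import hmac

def matches_hashed_entry(hashed_field, hostname):
    """Check whether a hashed known_hosts first field matches a hostname.

    Hashed fields look like "|1|base64(salt)|base64(digest)", where the
    digest is HMAC-SHA1 keyed by the salt over the hostname string.
    """
    _, magic, salt_b64, digest_b64 = hashed_field.split("|")
    if magic != "1":
        raise ValueError("unknown hash scheme: " + magic)
    salt = base64.b64decode(salt_b64)
    expected = base64.b64decode(digest_b64)
    actual = hmac.new(salt, hostname.encode(), hashlib.sha1).digest()
    return hmac.compare_digest(actual, expected)
```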

GMail multi-inbox

As much as GMail’s search syntax makes me long for PCRE, there are some
unobvious gems lying around.

For example, I get tons of mail about releases. Occasionally, I need to
monitor a given release, paying attention to not only the automated
progress, but also human generated emails as well. Here’s my current
setup:
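A multiple-inbox setup is just a handful of saved searches, one per pane. For illustration only (the release name and sender address here are invented, not my actual setup):

```
Pane 1:  subject:"release 33.0"  from:automation@example.com
Pane 2:  subject:"release 33.0" -from:automation@example.com
```

The first pane collects the automated progress mail; the second catches the human discussion about the same release.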

Docker at Vungle

Tonight I attended the San Francisco Dev Ops meetup at Vungle. The
topic was one we often discuss at Mozilla - how to simplify a
developer’s life. In this case, the solution they have migrated to is
one based on Docker, although I guess the title already gave that away.

Long (but interesting - I’ll update with a link to the video when it
becomes available) story short, they are having much more success using
DevOps managed Docker containers for development than their previous setup of
Virtualbox images built & maintained with Vagrant and Chef.

Kaizen the low tech way

On Jan 29, I treated myself to a seminar on Successful Lean Teams,
with an emphasis on Kanban & Kaizen techniques.
I’d read about both, but found the presentation useful. Many
of the other attendees were from the Health Care industry and their
perspectives were very enlightening!

Hearing how successful they were in such a high risk,
multi-disciplinary, bureaucratic, and highly regulated environment is inspiring. I’m
inclined to believe that it would also be achievable in a
simple-by-comparison low risk environment of software development. ;)

What these hospitals are using is a lightweight, self-managed process
which:

ensures visibility of changes to all impacted folks

outlines the expected benefits

includes a “trial” to ensure the change has the desired impact

has a built in feedback system

That sounds achievable. In several of the settings, the traditional
paper and bulletin board approach was used, with 4 columns labeled “New
Ideas”, “To Do”, “Doing”, and “Done”. (Not a true Kanban board for
several reasons, but Trello would be a reasonable visual approximation;
CAB uses spreadsheets.)

Cards move left to right, and could cycle back to “New Ideas” if
iteration is needed. “New Ideas” is where things start, and they
transition from there (I paraphrase a lot in the following):

Everyone can mark up cards in New Ideas & add alternatives, etc.

A standup is held to select cards to move from “New Ideas” to “To Do”

The card stays in “To Do” for a while to allow concerns to be
expressed by other stake holders. Also a team needs to sign up to
move the change through the remaining steps. Before the card can move
to “Doing”, a “test” (pilot or checkpoints) is agreed on to ensure the
change can be evaluated for success.

The team moves the card into “Doing”, and performs PDSA cycles (Plan, Do,
Study, Adjust) as needed.

Assuming the change yields the projected results, the change is
implemented and the card is moved to “Done”. If the results aren’t as
anticipated, the card gets annotated with the lessons learned, and
either goes to “Done” (abandon) or back to “New Ideas” (try again) as
appropriate.

For me, I’m drawn to the 2nd and 3rd steps. That seems to be the change
from current practice in teams I work on. We already have a gazillion bugs
filed (1st step). We also can test changes in staging (4th step) and
update production (5th step). Well, okay, sometimes we skip the staging
run. Occasionally that *really* bites us. (Foot guns, foot guns –
get your foot guns here!)

The 2nd and 3rd steps help focus on changes. And make the set of changes
happening “nowish” more visible. Other stakeholders then have a small
set of items to comment upon. Net result - more changes “stick” with
less overall friction.

Painting with a broad brush, this Kaizen approach is essentially the CAB
process that Mozilla IT implemented successfully. I have experienced the
CAB reduce the amount of stress, surprises, and self-inflicted damage
among teams both inside and outside of IT. Over time, the velocity of
changes has increased and backlogs have been reduced. In short, it is a
“Good Thing(tm)”.

So, I’m going to see if there is a way to “right size” this process for
the smaller teams I’m on now. Stay tuned....

Pyenv & Tox Can Get Along

I fought this for quite a few days on a background project. I finally
found the answer, and want to ensure I don’t forget it.

tl;dr:

Activate all the python versions you need before running tox.

After I upgraded my laptop to OSX 10.10, I also switched to using pyenv
for installing non-system python versions. Things went well (afaict)
until they didn’t. All of a sudden, I could not get both my code tests
to pass and my doc build to succeed.

The error message was especially confusing:

pyenv: python2.7: command not found
The `python2.7' command exists in these Python versions:
2.7.5

Searching the web didn’t really shed any light. I’d find other folks who
had the problem - I wasn’t alone - but they all disappeared from the bug
traffic over a year ago (example), with no sign of resolution.

Finally, I tried different search terms, and landed on this post.
The secret – you can have multiple pyenv versions “active”. The first
listed is the one that a bare python will invoke. The others are
available as python<major>.<minor> (e.g. “python3.2”) and
python<major> (e.g. “python3”).
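Concretely, that means activating every interpreter tox needs before invoking it. A sketch (the version numbers are examples; substitute whatever you have installed):

```shell
# The first version listed is what a bare `python` resolves to;
# the rest become available as python2.7, python3.4, etc.
pyenv local 2.7.5 3.4.2
tox
```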

ChatOps Meetup

I had two primary goals in attending: I wanted to understand what
made ChatOps special, and I wanted to see how much was applicable to my
current work at Mozilla. The two presentations helped me accomplish the first. I’m
still mulling over the second. (Ironically, I had to shift focus during
the event to clean up a deployment-gone-wrong that was very close to
one of the success stories mentioned by Dan Chuparkoff.)

My takeaway on why chatops works is that it is less about the tooling
(although modern web services make it a lot easier), and more about
the process. Like a number of techniques, it appears to be more
successful when teams fully embrace their vision of ChatOps,
and make implementation a top priority. Success is enhanced when the tooling
supports the vision, and that appears to be what all the recent buzz is
about – lots of new tools, examples, and lessons learned make it easier
to follow the pioneers.

What are the key differentiators?

Heck, many teams use irc for operational coordination. There are scripts
which automate steps (some workflows can be invoked from the web even).
We’ve got automated configuration, logging, dashboards, and wikis – are we doing ChatOps?

Well, no, we aren’t.

Here are the differences I noted:

ChatOps requires everyone both agreeing and committing to a
single interface to all operations. (The opsbot, like hubot,
lita or Err.) Technical debt (non-conforming legacy systems)
will be reworked to fit into ChatOps.

ChatOps requires focus and discipline. There are a small number of
channels (chat rooms, MUC) that have very specific uses - and
folks follow that. High signal to noise ratio. (No animated gifs
in the deploy channel - that’s what the lolcat channel is for.)

A commitment to explicitly documenting all business rules as
executable code.

What do you get for giving up all those options and flexibility? Here
were the “ah ha!” concepts for me:

Each ChatOps room is a “shared console” everyone can see and
operate. No more screen sharing over video, or “refresh now”
coordination!

There is a bot which provides the “facts” about the world. One
view accessible by all.

The bot is also the primary way folks interact and modify the
system. And it is consistent in usage across all commands. (The
bot extensions perform the mapping to whatever the backend
needs. The code adapts, not the human!)

The bot knows all and does all:

Where’s the documentation?

How do I do X?

Do X!

What is the status of system Y?

The bot is “fail safe” - you can’t bypass the rules. (If you code
in a bypass, well, you loaded that foot gun!)

Thus everything is consistent and familiar for users, which helps during
those 03:00 forays into a system you aren’t as familiar with. Nirvana
ensues (remember, everyone did agree to drink the koolaid above).

Can you get there from here?

The speaker selection was great – Dan was able to speak to the benefits
of committing to ChatOps early in a startup’s life. James Fryman (from
StackStorm) showed a path for migrating existing operations to a
ChatOps model. That pretty much brackets the range, so yeah, it’s
doable.

The main hurdle, imo, would be getting the agreement to a total
commitment! There are some tensions in deploying such a system at a
highly open operation like Mozilla: ideally chat ops is open to everyone, and
business rules ensure you can’t do or see anything improper. That means
the bot has (somewhere) the credentials to do some very powerful
operations. (Dan hopes to get their company to the “no one uses ssh,
ever” point.)

My next steps? Still thinking about it a bit – I may load Err onto my
laptop and try doing all my local automation via that.

bz Quick Search

1. Determine the quick search parameters you want. Experimenting on the
   Bugzilla Quick Search page is useful.

2. If this is your first time, install a search engine that you can
   copy and modify. The bugzilla one is an obvious good choice.

3. Find the xml file for the search engine in the “searchplugins”
   directory of your profile. Modify the “template” attribute in the
   “os:Url” element based on your research in (1). I tend to put all my
   customization after the special token “{searchTerms}”, as that makes
   it easier to refine the search on the bugzilla search results page.
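For example, the modified element might look something like this (the parameters after {searchTerms} are illustrative, not my exact customization):

```xml
<os:Url type="text/html" method="GET"
  template="https://bugzilla.mozilla.org/buglist.cgi?quicksearch={searchTerms}&amp;resolution=---"/>
```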

New Hg Server Status Page

Just a quick note to let folks know that the Developer Services team
continues to make improvements on Mozilla’s Mercurial server. We’ve set
up a status page to make it easier to check on current status.

As we continue to improve monitoring and status displays, you’ll always
find the “latest and greatest” on this page. And we’ll keep the page
updated with recent improvements to the system. We hope this
page will become your first stop whenever you have
questions about our Mercurial server.

2014-06 try server update

Chatting with Aki the other day, I realized that word of all the
wonderful improvements to the try server has not been publicized.
A lot of folks have done a lot of work to make things better - here’s a
brief summary of the good news.

Before:

Try server pushes could appear to take up to 4 hours, during which
time others would be locked out.

Now:

The major time taker has been found and eliminated: ancestor
processing. And we understand the remaining occasional slowdowns
are related to caching. Fortunately, there are some steps that
developers can take now to minimize delays.

What folks can do to help

The biggest remaining slowdown is caused by rebuilding the
cache. The cache is only invalidated if the push is
interrupted. If you can avoid causing a disconnect until your push is
complete, that helps everyone! So, please, no Ctrl-C during the
push! The other changes should address the long wait times you used to
see.

What has been done to infrastructure

There has long been a belief that many of our hg problems, especially on
try, came from the fact that we had r/w NFS mounts of the repositories
across multiple machines (both hgssh servers & hgweb servers). For
various historical reasons, a large part of this was due to the way
pushlog was implemented.

What has been done to our hooks

All along, folks have been discussing our try server performance issues
with the hg developers. A key confusing issue was that we saw processes
“hang” for VERY long times (45 min or more) without making a system
call. Kendall managed to observe an hg process in such an
infinite-looking-loop-that-eventually-terminated a few times. A stack
trace would show it was looking up an hg ancestor without making system
calls or library accesses. In discussions, this confused the hg team,
as they did not know of any reason the ancestor code should be
invoked during a push.

Thanks to lots of debugging help from glandium one evening, we found and
disabled a local hook that invoked the ancestor function on every
commit to try. \o/ team work!

Caching – the remaining problem

With the ancestor-invoking-hook disabled, we still saw some longish
periods of time where we couldn’t explain why pushes to try appeared
hung. Granted it was a much shorter time, and always self corrected,
but it was still puzzling.

A number of our old theories, such as “too many heads”, were discounted
by hg developers as both (a) we didn’t have that many heads, and (b)
lots of heads shouldn’t be a significant issue – hg wants to support
even more heads than we have on try.

Greg did a wonderful bit of sleuthing to find the impact of ^C during
push. Our current belief is once the caching is fixed upstream, we’ll
be in a pretty good spot. (Especially with the inclusion of some
performance optimizations also possible with the new cache-fixed
version.)

What is coming next

To take advantage of all the good stuff upstream Hg versions have,
including the bug fixes we want, we’re going to be moving towards
removing roadblocks to staying closer to the tip. Historically, we had
some issues due to http header sizes and load balancers; ancient python
or hg client versions; and similar. The client issues have been
addressed, and a proper testing/staging environment is on the horizon.

There are a few competing priorities, so I’m not going to predict a
completion date. But I’m positive the future is coming. I hope you have
a glimpse into that as well.

CVS Attic in DVCS

One handy feature of CVS was the presence of the Attic directory. The
primary purpose of the Attic directory was to simplify trunk checkouts,
while providing space for both removed and added-only-on-branch files.

As a consequence of this, it was relatively easy to browse all such
file names. I often used this as my “memory” of scripts I had
written for specific purposes but no longer needed. Often these
would form the basis for a future special purpose script.

This isn’t a very commonly needed use case, but I have found myself
being a bit reluctant to delete files using DVCS systems, as I wasn’t
quite sure how to find things easily in the future.

Well, I finally scratched the itch – here are the tricks I’ve added to
my toolkit.

Hg version

A simplistic version, which just shows when file names were deleted, is
to add the alias to ~/.hgrc:

[alias]
attic=log --template '{rev}:{file_dels}\n'

Git version

Very similar for git:

git config --global alias.attic 'log --diff-filter=D --summary'

(Not actually ideal, as not a one liner, but good enough for how often I
use this.)
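Either alias then doubles as a searchable memory. For instance (the file names here are invented):

```
hg attic | grep cleanup      # which revision deleted my old cleanup script?
git attic -- '*.py'          # limit the git version to deleted python files
```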

Bluetooth Finder for Fitbit

Pro tip - if you have a Fitbit or other small BLE device, go get a
“bluetooth finder” app for your smartphone or tablet. NOW. No thanks
needed.

I ended up spending far-too-long looking for my misplaced black fitbit
One last weekend. Turned out the black fitbit was behind a black sock on
a shelf in a dark closet. (Next time, I’ll get a fuchsia colored one – I
don’t have too many pairs of fuchsia socks.)

After several trips through the house looking, I thought I’d turn to
technology. By seeing where in the house I could still sync with my
phone, I could confirm it was in the house. I tried setting alarms on
the fitbit, but I couldn’t hear them go off. (Likely, the vibrations
were completely muffled by the sock. Socks - I should just get rid of
them.)

Then I had the bright idea of asking the interwebs for help. Surely, I
couldn’t be the first person in this predicament. I was rewarded with
this FAQ on the fitbit site, but I’d already followed those
suggestions.

Finally, I just searched for “finding bluetooth”, and discovered the tv
ads were right: there is an app for that! Since I was on my android
tablet at the time, I ended up with Bluetooth Finder, and found my
Fitbit within 5 minutes. (I also found a similar app for my iPhone, but I
don’t find it as easy to use. Displaying the signal strength on a meter
is more natural for me than watching dB numbers.)

More VIM fun: Vundle

During this last RelEng workweek, I thought I’d try a new VIM plugin for
reST: RIV. While that didn’t work out great (yet), it did get me to
start using Vundle. Vundle is a quite nice vim plugin manager, and is
easier for me to understand than Pathogen.

One wrinkle is using Vundle with bundles not managed by either Pathogen
or Vundle. (While running Vundle won’t interfere with unmanaged bundles,
the :BundleClean command will claim they are unused and offer to
delete them. That’s just too risky for me.)

The two cases appear to have the same solution:

ensure all directories in the bundle location (typically
~/.vim/bundle/) are managed by Vundle.

use a file:// URI for any bundle you don’t want Vundle to
update.

For example, I installed the ctrlp bundle a while back, from the
Bitbucket (hg) repository. (Yes, there (now?) is a github repository,
but why spoil my fun.) Since the hg checkout already lived in
~/.vim/bundle, I only needed to add the following line to my vimrc
file:

Bundle 'file:///~/.vim/bundle/ctrlp.vim/'

Vundle no longer offers to delete that repository when BundleClean
is run.

I suspect I’ll get errors if I ever ask Vundle to update that repo,
but that isn’t in my plans. I believe my major use case for Vundle will
be to trial install plugins, and then BundleClean will clean things
up safely.

Inter Repository Operations

[This is an experiment in publishing a doc piece by piece as blog entries.
Please refer to the main page for additional context.]

Mozilla, like most operations, has the Repositories of Record (RoR) set
to only allow “fast forward” updates when new code is landed. In order
to fast forward merge, the tip of the destination repository (RoR) must
be an ancestor of the commit being pushed from the source repository. In
the discussion below, it will be useful to say if a repository is
“ahead”, “behind”, or “equal” to another. These states are defined as:

If the tip of the two repositories are the same reference, then
the two repositories are said to be equal (‘e‘ in table
below)

Else if the tip of the upstream repository is an ancestor of the
tip of the source repository, the upstream is defined to be
behind (‘B‘ in table below) the source repository

Otherwise, the upstream repository is ahead (‘A‘ in table
below) of the source repository.
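The three states above can be sketched as a tiny classifier (a sketch only: the function and predicate names are mine; in practice the ancestor test would be answered by hg or git):

```python
def repo_state(upstream_tip, local_tip, is_ancestor):
    """Classify the upstream repo relative to the local one.

    is_ancestor(a, b) answers: is commit a an ancestor of commit b?
    """
    if upstream_tip == local_tip:
        return "equal"    # 'e' in the tables
    if is_ancestor(upstream_tip, local_tip):
        return "behind"   # 'B': the local repo has commits upstream lacks
    return "ahead"        # 'A': upstream has moved on without us
```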

Landing a change in the normal two-repository case (RoR and lander’s
repository), the process is logically (assuming no network issues):

Make sure lander’s repository is equivalent to RoR (start with
equality)

Apply the changes (RoR is now “Behind” the local repository)

Push the changes to the RoR

if the push succeeds, then stop. (equality restored)

if the push fails, simultaneous landings were being attempted,
and you lost the race.

When simultaneous landings are attempted, only one will succeed,
and the others will need to repeat the landing attempt. The
RoR is now “Ahead” of the local repository, and the new
upstream changes will need to be incorporated, logically as:
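In hg terms, that retry loop is roughly the following (a sketch only; “RoR” stands in for a configured path alias, and whether you rebase or merge depends on project policy):

```shell
hg pull RoR          # incorporate the landing that won the race
hg rebase -d tip     # or merge, per project convention
hg push RoR          # retry; losing another race just repeats the cycle
```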

When an authorized committer wants to land a change set on an
hg RoR from git, there are three repositories involved. These are the
RoR, the git repository the lander is working in, and the internal
hggit repository used for translation. The sections below describe how
this affects the normal case above.

Land from git – Happy Path

On the happy path (no commit collisions, no network issues), the steps
are identical to the normal path above. The git commands executed by the
lander are set by the tool chain to perform any additional operations
needed.

Land from git – Commit Collision

Occasionally, multiple people will try to land commits simultaneously,
and a commit collision will occur (steps 3a, 3b, & 3c above). As long as the
collision is noticed and dealt with before additional changes are
committed to the git repository, the tooling will unapply the change to
the internal hggit repository.

Land from git – Sad Path

In real life, network connections fail, power outages occur, and other
gremlins create the need to deal with “sad paths”. The following
sections are only needed when we’re neither on the happy path nor
experiencing a normal commit collision.

Because these cases cover every possible case of disaster recovery, it
can appear more complex than it is. While there are multiple (6)
different sad paths, only one will be in play for a given repository.
And the maximum number of operations to recover is only three (3). The
relationship between each pair of repositories determines the correct
actions to take to restore the repositories to a known, consistent
state. The static case is simply:

Simplistic Recovery State Diagram

Note

The simplistic diagram assumes no changes to RoR during the
duration of the recovery (not a valid assumption for real
life). See the text for information on dealing with the
changes.

States “BB” & “BA” are not shown, as they represent invalid
states that may require restoring portions of the system
from backup before proceeding.

In reality, it is impractical to guarantee the RoR is static during
recovery steps. That can be dealt with by applying the process described
in the flowchart to restore equality and using the tables below to
locate the actions.

The primary goal is to ensure correctness based on the RoR. The
secondary goal is to make the interim repository as invisible as possible.

This “shouldn’t happen”, as it implies the git repository
has been restored from a backup and the “pending landing” in
the hggit repository is no longer part of the git history.
If there isn’t a clear understanding of why this occurred,
the client-side repository setup should be considered
suspect, and replaced.

The lander shot themselves in the foot: they have two incomplete
landings in progress. If they are extremely lucky, they can recover by
completing the first landing (“hg push RoR” -> “eB”), and proceed
from there.

The deterministic approach, which must also be used if the
landing of the first changeset fails, is to back out the second
landing from hggit and git, then back out the first landing from
hggit and git. Then equality can be restored, and each
landing redone separately.

DVCS Commands

  Next Step        Active Repository   Command
  ---------------  ------------------  -------------
  pull from RoR    hggit               hg pull
  pull to git      git                 git pull RoR
  push from git    git                 git push RoR
  push to RoR      hggit               hg push
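For concreteness, the table can be read as a tiny dispatch helper. This
is a sketch only: the step names and the “RoR” remote name are
placeholders taken from the table, not real tooling.

```shell
# Sketch: map the table's "next step" to the repository to work in and
# the command to run there. Step names are made up for illustration.
next_cmd() {
    case "$1" in
        pull-from-RoR) echo "hggit: hg pull" ;;
        pull-to-git)   echo "git: git pull RoR" ;;
        push-from-git) echo "git: git push RoR" ;;
        push-to-RoR)   echo "hggit: hg push" ;;
        *) echo "unknown step: $1" >&2; return 1 ;;
    esac
}
next_cmd pull-from-RoR    # → hggit: hg pull
```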

Note

If any of the above actions fail, it simply means we’ve lost
another race with someone else’s commit. The recovery path is simply
to re-evaluate the current state and proceed as indicated (as shown
in diagram 1).

Issues and solutions encountered in maintaining a single code
base under active development in both hg & git formats.

Background

Mozilla Corporation operates an extensive build farm that is mostly
used to build binary products installed by the end user. Mozilla
has been using Mercurial repositories for this since converting from
CVS in 2007. We currently use a 6 week “Rapid Release” cycle for
most products.

Speaker Notes

We currently have upwards of 4,000 hosts involved in the continuous
integration and testing of Mozilla products. These hosts do
approximately 140 hours of work on each commit.

Firefox Operating System is a new product that ships source to be
incorporated by various partners in the mobile phone industry. These
partners, experienced with the Android build process, require that
source be delivered via git repositories. This is close to a
“Continuous Release” process.

Speaker Notes

A large part of the FxOS product is code used in the browser
products. That is in Mercurial and needs to be converted to git.
Most new code modules for FxOS are developed on github, and need to
be converted to Mercurial for use in our CI & build
systems.

Summary

What we initially set out to do:

Make it purely a developer choice which dvcs to use.

Speaker Notes

The ideal was to allow developers to make the DVCS as personal a
choice as their editor.

Support multiple social coding sites.

Speaker Notes

These social coding sites, such as github and bitbucket,
make it much easier for new community members to
contribute.

That was much tougher than anticipated.

In theory, git & hg are very close...

... In practice, “the devil is in the details”.

Where we are:

Changed direction to support FFOS release to partners.

Quickly mirror Repository of Record (RoR) between git & hg.

CI/build system remains Mercurial centric.

Challenge Areas

Changesets have different hashes in Mercurial and git.

We added tooling to support both in static documents such as
manifest files.

All tools continue to use hg hash as primary value for indexing
and linking.
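One way such tooling can translate between the two hash namespaces is a
simple two-column map of corresponding shas (hg-git, for example,
maintains a map file of git/hg hash pairs). A sketch, where the file
format shown and the hashes themselves are illustrative placeholders:

```shell
# Sketch: look up the "other" hash in a two-column "<git-sha> <hg-sha>"
# map file. The hashes below are made-up placeholders.
map=$(mktemp)
cat > "$map" <<'EOF'
1111aaaa deadbeef
2222bbbb cafef00d
EOF
hg2git() { awk -v h="$1" '$2 == h {print $1}' "$map"; }
git2hg() { awk -v g="$1" '$1 == g {print $2}' "$map"; }
hg2git deadbeef    # → 1111aaaa
```

With a lookup like this, static documents such as manifests can carry
either hash and still resolve to the hg hash used for indexing.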

Propagation delays of changesets to the “other” system.

Speaker Notes

For most use cases, the approximately 20 minute average
we’re achieving is acceptable.

Compounded by hash differences between two systems.

Speaker Notes

A common use case here is a developer wanting to start a
self serve build. If the commit was to git, the self serve
build won’t be successful until that commit is converted to
hg.

We are continuing work on this. It is closely tied to
determining which commit broke the build, when multiple
repositories are involved.

Build details

Movable tags are not popular in git based workflows, but have been
a common technique at Mozilla to mark “latest”.
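Part of why movable tags are unpopular in git is that every move must be
forced, both locally and on push. A minimal sketch in a throwaway repo
(names and commits here are hypothetical):

```shell
# Sketch: re-point a movable "latest" tag at the new tip.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
g() { git -c user.email=t@example.com -c user.name=t "$@"; }
g commit -q --allow-empty -m "first landing"
git tag latest                    # initial tag
g commit -q --allow-empty -m "second landing"
git tag -f latest                 # moving the tag requires -f
# publishing the move also needs force:
#   git push --force origin refs/tags/latest
```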

Challenge Areas (Cont’d)

Mixed philosophies are often linked with mixed repositories.

Android never wants history to appear to change. Downstream
servers allow only fast forward changesets and deny deletions.

Mozilla uses “RoR is authoritative”.

Speaker Notes

Either approach is self-consistent. It is when the two need
to interact that challenges arise.
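The Android-style policy described above maps directly onto standard git
server configuration. A sketch on a throwaway bare repo (the path is a
placeholder for a real downstream server):

```shell
# Sketch: a bare downstream server that refuses history rewrites,
# i.e. fast-forward-only pushes and no ref deletions.
set -e
repo=$(mktemp -d)/downstream.git
git init -q --bare "$repo"
git -C "$repo" config receive.denyNonFastForwards true
git -C "$repo" config receive.denyDeletes true
git -C "$repo" config receive.denyNonFastForwards   # → true
```

Under the “RoR is authoritative” philosophy, by contrast, a backout or
restore on the RoR may force exactly the non-fast-forward update that
such a server rejects.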

The canonical commit/push/land cycle at Mozilla

[This is an experiment in publishing a doc piece by piece as blog entries.
Please refer to the main page for additional context.]

Untangling the terminology

In the old days, before DVCS, “commit” had only one real purpose:
it was how you published your work to the rest of the world (or your
project’s world, at least). With DVCS, you are likely committing quite
often, but still only occasionally publishing.
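That split shows up directly in day-to-day commands: several private
local commits, one publish. A sketch in throwaway repos (the repo names
and commit messages are made up):

```shell
# Sketch: commit often locally, publish (push) only once.
set -e
tmp=$(mktemp -d)
git init -q --bare "$tmp/world.git"
git clone -q "$tmp/world.git" "$tmp/work" 2>/dev/null
cd "$tmp/work"
g() { git -c user.email=t@example.com -c user.name=t "$@"; }
g commit -q --allow-empty -m "wip: first try"    # private
g commit -q --allow-empty -m "wip: cleanup"      # still private
git push -q origin HEAD:refs/heads/master        # one publish for both
git rev-list --count HEAD    # → 2
```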

Changes to commit workflow

[This is an experiment in publishing a doc piece by piece as blog entries.
Please refer to the main page for additional context.]

With all the changes to support git, how will that affect a committer’s
workflow? (For developer impact, see this post.)

The primary goal is to work within the existing Mozilla commit
policy[1]. Working within that constraint, the idea is “as little
as possible”, and this post will try to describe how big “as little” is.

Remember: all existing ways of working with hg will continue to
work! These are just going to be some additional options for folks who
prefer to use github & bitbucket.

Commit to git is not the issue:

... the Wowza feature of git

tl;dr

Wowza! I found the killer feature in git - you can have your cake and
eat it, too!

Every time I’ve had to move to a new VCS, there has never been enough
time available to move the complete history correctly. Linux had this
problem in spades when they moved off BitKeeper onto git in a very
short time.

The solution? Take your time to convert the history correctly (or not;
you can correct it later), then allow developers who want it to prepend
that history on their machines, without making their repo operate any
differently from the latest one.

Releng As Is - January 2012

Where we are in January 2012

The purpose of this post is to present a very high level picture of the
current Firefox build & release process as a set of requirements. Some
of these services are provided or supported by groups outside of releng
(particularly IT & webdev). This diagram will be useful in understanding
the impact of changes.

The Ideal Future

Based on discussions to date, everyone seems to have similar ideas about
what “supporting git for releng” means. Later posts will highlight the
work needed to ensure the ideal can be achieved, and how to arrive
there.

For this post, I intend to limit the viewpoint and scope to that of the
developer impact. Release notions (such as “system of record”) and
scaling issues won’t be mentioned here. (N.B. Those concerns will be a
key part of the path to verifying feasibility, but do not change the
goal.)

As a reminder, I’m just talking about repositories that are used to
produce products. [1]

... a View from Outside

tl;dr

One of the things that excited me about the opportunity to work at
Mozilla was the chance to change perspectives. After working in many
closed environments, I knew the open source world of Mozilla would be
different. And that would lead to a re-examination of basic questions,
such as:

Q: Are there any significant differences in the role a VCS plays
at Mozilla than at j-random-private-enterprise?