All in all, a great fun read, and I found the “extra” sidebar cartoons equally fun… especially the yakshaver! If you like xkcd, and don’t already have this book, go get it.

ps: He’s got a new book coming out in a few days, a book tour in progress, and a really subtle turtles-all-the-way-down comic which nudges about the new book… if you look *really* closely! I’m looking forward to getting my hands on it!

“Lost Cat” tells the true story of how an urban cat owner (one of the authors) loses her cat, then has the cat casually walk back in the door weeks later healthy and well. The book details various experiments the authors did using GPS trackers and tiny “CatCam” cameras to figure out where her cat actually went. Overlaying that data onto Google Maps surprised them both – they never knew their cats roamed so far and wide across the city. The detective work they did to track down, and then meet with, “Cat Stealer A” and “Cat Stealer B” made for a fun read… Just like “Meanwhile in San Francisco”, the illustrations are all paintings. Literally. My all-time favorite painting of any cat ever is on page 7.

We’re passionate about open source, and ensure that 100% of the code in a Hortonworks HDP release is open sourced in the Apache Software Foundation Hadoop project. We work with other large organizations to help them upstream their contributions to the Apache project, which helps accelerate the general Hadoop community. It’s so important to us that it is part of the Hortonworks Manifesto.

We’re proud of our HDP releases. Our clients rely on HDP in production environments where phrases like “petabytes per day” and “zettabytes” are common. We sim-ship on centos5, centos6, ubuntu, debian, suse and windows – all from the same changeset. Building and testing at this scale has its own special forms of challenges, and is exciting. In the rare case where customers hit production issues, we are able to deliver supported fixes super-quickly.

The Hortonworks Release Engineering team works hard behind the scenes to design, build and maintain the infrastructure-at-scale needed to make this possible. For more details, and to apply, click here.

Note: The current team is spread across 3 cities, so remoties are welcome, even encouraged! Hardly a surprise if you read the other remoties posts on my blog, but worth stating explicitly!

In April 1997, Netscape ReleaseEngineers wrote, and started running, the world’s first? second? continuous integration server. Now, just over 17 years later, in May 2014, the tinderbox server was finally turned off. Permanently.

This is a historic moment for Mozilla, and for the software industry in general, so I thought people might find it interesting to get some background, as well as outline the assumptions we changed when designing the replacement Continuous Integration and Release Engineering infrastructure now in use at Mozilla.

At Netscape, developers would check in a code change, and then go home at night, without knowing if their change broke anything. There were no builds during the day.

Instead, developers would have to wait until the next morning to find out if their change caused any problems. At 10am each morning, Netscape RelEng would gather all the checkins from the previous day, and manually start to build. Even if a given individual change was “good”, it was frequently possible for a combination of “good” changes to cause problems. In fact, as this was the first time that all the checkins from the previous day were compiled together, or “integrated” together, surprise build breakages were common.

This integration process was so fragile that all developers who did checkins in a day had to be in the office before 10am the next morning to immediately help debug any problems that arose with the build. Only after the 10am build completed successfully were Netscape developers allowed to start checking-in more code changes on top of what was now proven to be good code. If you were lucky, this 10am build worked first time, took “only” a couple of hours, and allowed new checkins to start lunchtime-ish. However, this 10am build was frequently broken, causing checkins to remain blocked until the gathered developers and release engineers figured out which change caused the problem and fixed it.

Fixing build bustages like this took time, and lots of people, to figure out which of all the checkins that day caused the problem. Worst case, some checkins were fine by themselves, but caused problems when combined with, or integrated with, other changes, so even the best-intentioned developer could still “break the build” in non-obvious ways. Sometimes, it could take all day to debug and fix the build problem – no new checkins happened on those days, halting all development for the entire day. More rare, but not unheard of, was that the build bustage halted development for multiple days in a row. Obviously, this was disruptive to the developers who had landed a change, to the other developers who were waiting to land a change, and to the Release Engineers in the middle of it all… With so many people involved, this was expensive to the organization in terms of salary as well as opportunity cost.

If you could do builds twice a day, you only had half-as-many changes to sort through and detangle, so you could more quickly identify and fix build problems. But doing builds more frequently would also be disruptive because everyone had to stop and help manually debug-build-problems twice as often. How to get out of this vicious cycle?

In these desperate times, Netscape RelEng built a system that grabbed the latest source code, generated a build, displayed the results in a simple linear time-sorted format on a webpage where everyone could see status, and then started again… grab the latest source code, build, post status… again. And again. And again. Not just once a day. At first, this was triggered every hour, hence the phrase “hourly build”, but that was quickly changed to starting a new build immediately after finishing the previous build.

All with no human intervention.

By integrating all the checkins and building continuously like this throughout the day, it meant that each individual build contained fewer changes to detangle if problems arose. By sharing the results on a company-wide-visible webserver, it meant that any developer (not just the few Release Engineers) could now help detangle build problems.
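To make the loop above concrete, here is a minimal sketch in Python. This is not the original tinderbox code; the names `run_build_loop`, `fetch_latest`, `build` and `BuildResult` are all hypothetical, and the loop is capped at a fixed number of cycles only so that the example terminates.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BuildResult:
    revision: str  # whatever identifies the source that was built
    ok: bool       # True = "green", False = "red"


def run_build_loop(fetch_latest: Callable[[], str],
                   build: Callable[[str], bool],
                   cycles: int) -> List[BuildResult]:
    """Grab latest source, build, post status, repeat (hypothetical sketch).

    Note that, like the loop described above, this never checks whether
    anything changed since the previous build.
    """
    waterfall: List[BuildResult] = []
    for _ in range(cycles):
        revision = fetch_latest()   # grab the latest source code
        ok = build(revision)        # generate a build
        # Post status: newest result at the top of a time-sorted column.
        waterfall.insert(0, BuildResult(revision, ok))
    return waterfall
```

The real system looped forever and published to a webpage; here the "webpage" is just a list with the newest result first.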

What do you call a new system that continuously integrates code checkins? Hmmm… how about “a continuous integration server“?! Good builds were colored “green”. The vertical columns of green reminded people of trees, giving rise to the phrase “the tree is green” when all builds looked good and it was safe for developers to land checkins. Bad builds were colored “red”, and gave rise to “the tree is burning” or “the tree is closed”. As builds would break (or “burn” with flames) with seemingly little provocation, the web-based system for displaying all this was called “tinderbox“.

Pretty amazing stuff in 1997, and a pivotal moment for Netscape developers. When Netscape moved to open source Mozilla, all this infrastructure was exposed to the entire industry and the idea spread quickly. This remains a core underlying principle in all the various continuous integration products, and agile / scrum development methodologies in use today. Most people starting a software project in 2014 would first set up a continuous integration system. But in 1997, this was unheard of and simply brilliant.

(From talking to people who were there 17 years ago, there’s some debate about whether this was originally invented at Netscape or inspired by a similar system at SGI that was hardwired into the building’s public announcement system using a synthesized voice to declare: “THE BUILD IS BROKEN. BRENDAN BROKE THE BUILD.” If anyone reading this has additional info, please let me know and I’ll update this post.)

If tinderbox server is so awesome, and worked so well for 17 years, why turn it off? Why not just fix it up and keep it running?

In mid-2007, an important criterion for the reborn Mozilla RelEng group was to significantly scale up Mozilla’s developer infrastructure – not just incrementally, but by orders of magnitude. This was essential if Mozilla was to hire more developers, gather many more community members, tackle a bunch of major initiatives, ship releases more predictably, and have these additional Mozilla developers and community contributors be able to work effectively. When we analyzed how tinderbox worked, we discovered a few assumptions from 1997 that no longer applied, and were causing bottlenecks we needed to solve.

1) Need to run multiple jobs-of-the-same-type at a time
2) Build-on-checkin, not build-continuously.
3) Display build results arranged by developer checkin not by time.

1) Need to run multiple jobs-of-the-same-type at a time
The design of this tinderbox waterfall assumed that you only had one job of a given type in progress at a time. For example, one linux32 opt build had to finish before the next linux32 opt build could start.

Mechanically, this was done by having only one machine dedicated to doing linux32 opt builds, and that one machine could only generate one build at a time. The results from one machine were displayed in one time-sorted column on the website page. If you wanted an additional different type of build, say linux32 debug builds, you needed another dedicated machine displaying results in another dedicated column.

For a small (~15?) number of checkins per day, and a small number of types of builds, this approach works fine. However, as the number of checkins per day increases, each “hourly” build ends up containing almost as many checkins as Netscape had in a whole day in 1997. By 2007, Mozilla was routinely struggling with multi-hour blockages as developers debugged integration failures.

Instead of having only one machine do linux32 opt builds at a time, we set up a pool of identically configured machines, each able to do a build-per-checkin, even while the previous build was still in progress. In peak load situations, we might still get more-than-one-checkin-per-build, but now we could start the 2nd linux32 opt build, even while the 1st linux32 opt build was still in progress. This got us back to having a very small number of checkins, ideally only one checkin, per build… identifying which checkin broke the build, and hence fixing that build, was once again quick and easy.
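A toy simulation shows why the pool helps. This sketch assumes made-up checkin times and a fixed build duration, and the function name `assign_builds` is hypothetical. Each build picks up every checkin that landed before it started, and a build starts as soon as a machine is free and there is something to build.

```python
import heapq
from typing import List


def assign_builds(checkin_times: List[int],
                  build_minutes: int,
                  machines: int) -> List[List[int]]:
    """Return, per build, the checkin times that build contains."""
    free_at = [0] * machines        # minute at which each machine is idle
    heapq.heapify(free_at)
    pending = sorted(checkin_times)
    builds: List[List[int]] = []
    i = 0
    while i < len(pending):
        machine_free = heapq.heappop(free_at)
        # The build starts once a machine is idle AND a checkin exists.
        start = max(machine_free, pending[i])
        j = i
        while j < len(pending) and pending[j] <= start:
            j += 1                  # everything landed so far rides along
        builds.append(pending[i:j])
        heapq.heappush(free_at, start + build_minutes)
        i = j
    return builds


checkins = [0, 10, 20, 30, 70]                 # minutes, invented data
one_machine = assign_builds(checkins, 60, 1)   # some builds bundle checkins
pool_of_four = assign_builds(checkins, 60, 4)  # one checkin per build
```

With a single machine, the second build has to untangle three checkins at once; with a pool of four, every build contains exactly one checkin, so the culprit for any breakage is obvious.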

Another related problem here was that there were ~86 different types of machines, each dedicated to running different types of jobs, on their own OS and each reporting to different dedicated columns on the tinderbox. There was a linux32 opt builder, a linux32 debug builder, a win32 opt builder, etc. This design had two important drawbacks.

Each different type of build took different times to complete. Even if all jobs started at the same time on day1, the continuous looping of jobs of different durations meant that after a while, all the jobs were starting/stopping at different times – which made it hard for a human to look across all the time-sorted waterfall columns to determine if a particular checkin had caused a given problem. Even getting all 86 columns to fit on a screen was a problem.

It also made each of these 86 machines a single point of failure to the entire system, a model which clearly would not scale. Building out pools of identical machines from 86 machines to ~5,500 machines allowed us to generate multiple jobs-of-the-same-type at the same time. It also meant that whenever one of these set-of-identical machines failed, it was not a single point of failure, and did not immediately close the tree, because another identically-configured machine was available to handle that type of work. This allowed people time to correctly diagnose and repair the machine properly before returning it to production, instead of being under time-pressure to find the quickest way to band-aid the machine back to life so the tree could reopen, only to have the machine fail again later when the band-aid repair failed.

All great, but fixing that uncovered the next hidden assumption.

2) Build-per-checkin, not build-continuously.

The “grab latest source code, generate a build, display the results” loop of tinderbox never looked to check if anything had actually changed. Tinderbox just started another build – even if nothing had changed.

Having only one machine available to do a given job meant that machine was constantly busy, so this assumption was not initially obvious. And given that the machine was on anyway, what harm in having it doing an unnecessary build or two?

Generating extra builds, even when nothing had changed, complicated the manual what-change-broke-the-build debugging work. It also introduced delays when a human actually did a checkin, as a build containing that checkin could only start after the unnecessary-nothing-changed-build-in-progress completed.

Finally, when we changed to having multiple machines run jobs concurrently, having the machines build even when there was no checkin made no sense. We needed to make sure each machine only started building when a new checkin happened, and there was something new to build. This turned into a separate project to build out an enhanced job scheduler system and machine-tracking system which could span 4 physical colos and 3 Amazon regions, assign jobs to the appropriate machines, take sick/dead machines out of production, add new machines into rotation, etc.
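The core of build-per-checkin is a single comparison, sketched below. This is only an illustration of the idea, not the scheduler we actually built (which also had to track machines across colos and regions); `builds_to_start` and the revision strings are hypothetical.

```python
from typing import Iterable, List, Optional


def builds_to_start(polled_revisions: Iterable[str]) -> List[str]:
    """Given the tip revision seen at each poll, return the revisions
    that should trigger a build: only those that differ from the last
    revision already built."""
    last_built: Optional[str] = None
    started: List[str] = []
    for revision in polled_revisions:
        if revision != last_built:   # something new to build
            started.append(revision)
            last_built = revision
        # else: nothing changed, so leave the machine idle
    return started


polls = ["a1", "a1", "a1", "b2", "b2", "c3"]   # invented poll results
on_checkin = builds_to_start(polls)            # 3 builds
build_continuously = list(polls)               # 6 builds, 3 of them wasted
```

In this toy run, building continuously does twice as many builds as building per checkin, and every extra build is one more column entry a human has to rule out while debugging.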

3) Display build results arranged by developer checkin not by time.

Tinderbox sorted results by time, specifically job-start-time and job-end-time. However, developers typically care about the results of their checkin, and sometimes the results of the checkin that landed just before them.

Further: Once we started generating multiple-jobs-of-the-same-type concurrently, it uncovered another hidden assumption. The design of this cascading waterfall assumed that you only had one build of a given type running at a time; the waterfall display was not designed to show the results of two linux32 opt builds that were run concurrently. As a transition, we hacked our new replacement systems to send tinderbox-server-compatible status for each concurrent build to the tinderbox server… more observant developers would occasionally see some race-condition bugs with how these concurrent builds were displayed on the one column of the waterfall. These intermittent display bugs were confusing, hard to debug, but usually self corrected.

As we supported more OS, more build-types-per-OS and started to run unittests and perf-tests per platform, it quickly became more and more complex to figure out whether a given change had caused a problem across all the time-sorted-columns on the waterfall display. Complaints about the width of the waterfall not fitting on developers’ monitors were widespread. Running more and more of these jobs concurrently made deciphering the waterfall even more complex.

Finding a way to collect all the results related to a specific developer’s checkin, and display these results in a meaningful way was crucial. We tried a few ideas, but a community member (Markus Stange) surprised us all by building a prototype server that everyone instantly loved. This new server was called “tbpl”, because it scraped the TinderBox server Push Logs to gather its data.

Over time, there have been improvements to tbpl.mozilla.org to allow sheriffs to “star” known failures, link to self-service APIs, link to the commits in the repo, link to bugs and most importantly gather all the per-checkin information directly from the buildbot scheduling database we use to schedule and keep track of job status… eliminating the intermittent race-condition bugs when scraping HTML pages on the tinderbox server. All great, but the user interface has remained basically the same since the first prototype by Markus – developers can easily and quickly see if a developer checkin has caused any bustage.
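The difference between the two displays comes down to which key you group by. Here is a small sketch, assuming job results arrive as simple tuples; this is not how tbpl itself is implemented, and `group_by_checkin`, the tuple layout and the sample data are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (end_time_minutes, job_type, revision, status), invented layout
JobResult = Tuple[int, str, str, str]


def group_by_checkin(results: Iterable[JobResult]) -> Dict[str, Dict[str, str]]:
    """Collect every job status under the checkin (revision) it ran
    against, regardless of when each job happened to finish."""
    grouped: Dict[str, Dict[str, str]] = defaultdict(dict)
    for _end_time, job_type, revision, status in results:
        grouped[revision][job_type] = status
    return dict(grouped)


# Jobs finish interleaved in time, as they would on a waterfall.
results = [
    (30, "linux32 opt", "rev1", "green"),
    (45, "linux32 opt", "rev2", "red"),
    (50, "win32 opt", "rev1", "green"),
    (55, "linux32 debug", "rev1", "green"),
    (80, "win32 opt", "rev2", "green"),
]
by_checkin = group_by_checkin(results)
```

Looking up `by_checkin["rev2"]` answers the question a developer actually has ("did my checkin break anything?") in one step, instead of scanning across time-sorted columns.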

Fixing these 3 root assumptions in tinderbox.m.o code would be “non-trivial” – basically a re-write – so we instead focused on gracefully transitioning off tinderbox. Since Sept2012, all Mozilla RelEng systems have been off tinderbox.m.o and using tbpl.m.o plus buildbot instead.

Making the Continuous Integration process more efficient has allowed Mozilla to hire more developers who can do more checkins, transition developers from all-on-one-tip-development to multi-project-branch-development, and change the organization from traditional releases to a rapid-release model. Game changing stuff. Since 2007, Mozilla has grown the number of employee engineers by a factor of 8, while the number of checkins that developers did has increased by a factor of 21. Infrastructure improvements have outpaced hiring!

On 16 May 2014, with the last Mozilla project finally migrated off tinderbox, the tinderbox server was powered off. Tinderbox was the first of its kind, and helped change how the software industry developed software. As much as we can gripe about tinderbox server’s various weaknesses, it carried Mozilla from 1997 until 2012, and spawned an industry of products that help developers ship better software. Given its impact, it feels like we should look for a pedestal to put this on, with a small plaque that says “This changed how software companies develop software, thank you Tinderbox”… As it has been a VM for several years now, maybe this blog post counts as a virtual pedestal?! Regardless, if you are a software developer, and you ever meet any of the original team who built tinderbox, please do thank them.

I’d like to give thanks to some original Netscape folks (Tara Hernandez, Terry Weissman, Lloyd Tabb, Leaf, jwz) as well as aki, brendan, bmoss, chofmann, dmose, myk and rhelmer for their help researching the origins of Tinderbox. Also, thank you to lxt, catlee, bhearsum, rail and others for inviting me back to attend the ceremonial final-powering-off event… After the years of work leading up to this moment, it meant a lot to me to be there at the very end.

pps: When a server has been running for so long, figuring out what other undocumented systems might break when tinderbox is turned off is tricky. Here’s my “upcoming end-of-life” post from 02-apr-2013 when we thought we were nearly done. Surprise dependencies delayed this shutdown several times and frequently uncovered new, non-trivial, projects that had to be migrated. You can see the various loose ends that had to be tracked down in bug#843383, and all the many many linked bugs.

This was the first significant feature release shipped since I joined Hortonworks at the start of the year. There are lots of interesting new features and functionality in this HDP2.1 release – already well covered by others in great detail here. Oh, and of course, you can download it from here.

In this post, I’ll instead focus on some of the behind-the-scenes mechanics. There were lots of major accomplishments in this release, but the ones that really stood out to me were:

1) sim-ship windows and linux.
This was the first HDP release where all OS were built from the same changeset and shipped at the same time. Making this happen was a hectic first priority in January. As well as the plumbing/mechanics within RelEng, it also took lots of coordination changes across different groups within Hortonworks to make this happen. The payoff on this was great. We sim-shipped, which is great and massively important for HWX as a company. Even more importantly, we set things up so we could sim-ship for every HDP2.1-and-above release going forward… and we proved it by sim-shipping the quick followup HDP2.1.2.0 release on 02may2014.

2) adding 5 new components.
HDP2.1 contained 17 components, compared to HDP 2.0 (with 12 components) and HDP 1.3 (with 10 components), making HDP2.1 the largest growth of components ever?!? Oh, and in addition to the new components, every one of the 12 pre-existing components was also significantly updated to a newer version. That meant that each required significant new integration work, new installers on all supported OS (…remember the “sim-ship” goal?). Oh, and we were to ship all this new functionality at the fastest cadence yet.

3) improving support for other trains.
In January, we were learning how to support 3 active trains of code: supporting 1.3 and 2.0 maintenance work, while also building out infrastructure for 2.1 new-product-development-work… even while the 2.1 development work was in progress, which obviously complicated things for developers. Today, we’re supporting 4 active trains: maintenance work for 1.3, 2.0 and 2.1, as well as the 2.2 new-product-development-work. This time, the 2.2 infrastructure was built out and live before developers finished working on 2.1… enabling the developers! Things are not perfect yet, by any means, but today (with 4 trains) feels calmer and more organized than earlier this year (with “only” 3 trains).

All great improvements to see up close, and all important to us as we scale. Big thanks to everyone for their help… and do stay tuned for even more improvements already underway.

I stumbled across this book by accident recently, and really enjoyed it. One of the reasons I love to travel is because of the different cultural norms… what is “normal” in one location would be considered downright “odd/strange/unusual” in another location. Since I first moved to San Francisco, the different types of people, from different backgrounds, who each call this town “home” continue to fascinate me… and all in a small 7mile x 7mile area.

This book is painted (yes really!) by a San Francisco resident, and does an excellent job of describing the heart of many different aspects of this unique town: Mah Jong in Chinatown, the SF City Library’s fulltime employee who is a social worker for homeless people, Frank Chu, Critical Mass, dogwalkers, Mission Hipsters, Muni drivers … and of course, everything you need to know about a Mission burrito!

A fun read… and a great gift to anyone who has patiently listened while you’ve tried to explain what makes San Francisco so special.

On 1st April, 1957, Panorama ended its show with a brief ~3minute segment on the early harvest of the Spaghetti trees along the Swiss-Italian border.

It is believed to be one of the first times an April Fools’ joke was played on television viewers, and caused quite the stir at the time. Excellently put together, with great attention to detail, and a script echoing an earlier segment about the French wine harvest, I found it a great fun 3minute watch.

If you build software delivery pipelines for your company, or if you work in a software company that has software delivery needs, I recommend you follow @relengcon, block off April 11th, 2014 on your calendar and book now. It will be well worth your time.

(Context: In case people missed this transition, my last day at Mozilla was Dec31, so obviously, I’m not going to be doing these monthly infrastructure load posts anymore. I started this series of posts in Jan2009, because the data, and analysis, gave important context for everyone in Mozilla engineering to step back and sanity-check the scale, usage patterns and overall health of Mozilla’s developer infrastructure. The data in these posts have shaped conversations and strategy within Mozilla over the years, so are important to continue. I want to give thanks to Armen for eagerly taking over this role from me during my transition out of Mozilla. Those of you who know Armen know that he’ll do this exceedingly well, in his own inimitable style, and I’m super happy he’s taken this on. I’ve already said this to Armen privately over the last few months of transition details, but am repeating here publicly for the record – thank you, Armen, for taking on the responsibility of this blog-post-series.)

December saw a big drop in overall load – 6,063 is our lowest load in almost half-a-year. However, this is no surprise given that all Mozilla employees were offline for 10-14 days out of the 31 days – basically 1/3rd of the month. At the rate people were doing checkins for the first 2/3rds of the month, December2013 was on track to be our first month ever over 8,000 checkins-per-month.
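For anyone who wants to sanity-check that “on track for over 8,000” claim, here is the back-of-envelope arithmetic as a small Python sketch. It makes the simplifying assumption that no checkins at all landed during the offline days, which is not strictly true (community contributors kept working), so treat the result as a rough upper estimate. The function name `projected_checkins` is hypothetical.

```python
def projected_checkins(actual: int, days_in_month: int, days_offline: int) -> float:
    """Scale the observed checkins up to a full month, assuming (as a
    simplification) that all of them landed on the non-offline days."""
    working_days = days_in_month - days_offline
    return actual * days_in_month / working_days


# December 2013: 6,063 checkins, with people offline 10 to 14 of 31 days.
low_estimate = projected_checkins(6063, 31, 10)    # shortest break
high_estimate = projected_checkins(6063, 31, 14)   # longest break
```

Even the more conservative of the two estimates lands comfortably above 8,000 checkins for the month.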

January saw people jump straight back into work full speed. 7,710 is our second heaviest load on record (slightly behind the current record 7,771 checkins in August2013).

Those are my quick highlights. For more details, you should go read Armen’s post for Dec2013 and post for Jan2014 yourself. He has changed the format a little, but the graphs, data and analysis are all there. And hey, Armen even makes the raw data available in html and json formats, so now you can generate your own reports and graphs if interested. A very nice touch, Armen.

[UPDATE: The newest version of this presentation is here. joduinn 09nov2014]

(My life has been hectic on several other fronts, so I only just now noticed that I never actually published this blog post. Sorry!!)

On 07-nov-2013, I was invited to present “We are all remoties” in Twilio’s headquarters here in San Francisco as part of their in-house tech talk series.
For context, it’s worth noting that Twilio is doing great as a company, which means they are hiring. And outgrowing their current space, so one option they were investigating was to keep the current space, and open up a second office elsewhere in the Bay Area. As they’d always been used to working in the one location, this “split into two offices” was top of everyone’s mind… hence the invitation from Thomas to give this company-wide talk about remoties.

Twilio’s entire office is a large, SOMA-style-warehouse-converted-into-open-plan-offices layout, packed with lots of people. The area I was to present in was their big “common area”, where they typically host company all-hands meetings, Friday socials and other big company-wide events. Quite, quite large. I’ve no idea how many people were there but it felt huge, and was wall-to-wall packed. The size gave an echo-y audio effect off the super-high concrete ceilings and far-distant bare concrete walls, with a weird couple of structural pillars right in the middle of the room. Despite my best intentions, during the session, I found myself trying to “peer around” the pillars, aware of the people blocked from view.

It’s great to see the response from folks when slides in a presentation *exactly* hit onto what is on top-of-their-minds. One section, about companies moving to multiple locations, clearly hit home with everyone… not too surprising, given the context. Another section, about a trusted employee moving out from office to start being a 100% remote employee, hit a very personal note – there was someone in the 2nd row who was a long-trusted employee actually about to embark on this exact change. He got quite the attention from everyone around him, and we stopped everything for a few minutes to talk about his exact situation. As far as I can tell, he found the entire session very helpful, but only time will tell how things work out for him.

The very great interactions, the lively Q+A, and the crowd of questions afterwards were all lots of fun and quite informative.

Big thanks to Thomas Wilsher @ Twilio for putting it all together. I found it a great experience, and the lively discussions before+during+after lead me to believe others did too.

John.
PS: For a PDF copy of the presentation, click on the smiley faces! For the sake of my poor blogsite, the much, much, larger keynote file is available on request.