All posts by laurathomson

On the first day of Mozlandia, Johnny Stenback and Doug Turner presented a list of key accomplishments in Platform Engineering/Engineering Operations in 2014.

I have been told a few times recently that people don’t know what my teams do, so in the interest of addressing that, I thought I’d share our part of the list. It was a pretty damn good year for us, all things considered, and especially given the level of organizational churn and other distractions.

We had a bit of organizational churn ourselves. I started the year managing Web Engineering, and between March and September ended up also managing the Release Engineering teams, Release Operations, SUMO and Input Development, and Developer Services. It’s been a challenging but very productive year.

Here’s the list of what we got done.

Web Engineering

Migrate crash-stats storage off HBase and into S3

Launch Crash-stats “hacker” API (access to search, raw data, reports)

Ship fully-localized Firefox Health Report on Android

Many new crash-stats reports including GC-related crashes, JS crashes, graphics adapter summary, and modern correlation reports

Prototype services for checking health of the browser and a support API

Solve scaling problems in Moztrap to reduce pain for QA

New admin UI for Balrog (new update server)

Bouncer: correctness testing, continuous integration, a staging environment, and multi-homing for high availability

Grew Air Mozilla community contributions from 0 to 6 non-staff committers

Many new features for Air Mozilla including: direct download for offline viewing of public events, tear out video player, WebRTC self publishing prototype, Roku Channel, multi-rate HLS streams for auto switching to optimal bitrate, search over transcripts, integration with Mozilla Popcorn functionality, and access control based on Mozillians groups (e.g. “nda”)

DXR

Modeless, explorable UI with all-new JS

Case-insensitive searching

Proof-of-concept Rust analysis

Improved C++ analysis, with lots of new search types

Multi-tree support

Multi-line selection (linkable!)

HTTP API for search

Line-based searching

Multi-language support (Python already implemented, Rust and JS in progress)

In Q4, Greg Szorc and the Developer Services team generally have been working on a headless try implementation to solve try performance issues as the number of heads increases. In addition, he’s made a number of performance improvements to hg, independent of the headless implementation.

I think these graphs speak for themselves, so I’m just going to leave them here.

Try queue depth (people waiting on an in-flight push to complete before their push can go through):

Try push median wait times:

(Thanks, too, to Hal Wine for setting up a bunch of analytics so we can see the effects of the work represented in such shiny graphs.)

tl;dr: The move of Mozilla’s Release Engineering infrastructure from the SCL1 datacenter to SCL3 will begin having an impact May 19, and continue for the following six weeks. No tree closures are anticipated.

On Monday, we will begin the major part of the work of moving out of the SCL1 datacenter. Some of our pandas have already been relocated, as a test run. The rest of our machines will now move on a series of “move trains”. The majority of non-pandas are in four big trains, which will move each Monday for the next few weeks.

Impact

The implications for engineering include:

Build farm capacity will be degraded at times (especially on Mondays and Tuesdays).

No tree closures are anticipated.

Datacenter Ops, Release Operations, and Release Engineering will be busy during the moves.

Background

We are moving out of SCL1 and consolidating our infrastructure into SCL3. This will produce cost savings of $900K/year and speed problem resolution (as we have staff located in SCL3).

The end-user-visible part of the process will start on Monday, May 19, and continue for 5-6 weeks. We’ll make a general announcement in the Monday project meeting on May 12, with a follow-up posting to dev-platform.

Data Center Operations, Release Operations, and Release Engineering have been planning for the move for several months. There are approximately 20 racks of equipment to move. The plan is to move key machines in batches (“trains”), spread across functional areas. That is, we’ll degrade all platforms slightly, rather than take a single platform offline. We don’t believe there will be any need to close the trees with this approach, although sheriffs may not merge as often.

The key systems will be moved each Monday morning, and should be back online no later than Wednesday noon, worst case. We are leaving slack in each week’s schedule to ensure some uninterrupted dev time each week, and to allow for minimal impact to release schedules. There is also some slack in the overall schedule for contingencies. If a critical event such as a chemspill occurs, we will be able to cancel that week’s move on short notice.

My thanks to IT for conceiving this grand plan, and all y’all in DCOps, Release Ops, and Release Engineering for execution.

My dad was an engineer and pilot in the RAF. We come from a long line of engineers, all the way back to Napier, the dude that figured out logarithms.

As a kid I wanted to learn everything about everything. I read every book I could lay my hands on, and I took things apart to see how they worked, notably, an alarm clock that never went back together right. (Why is there always a spring left over? The clock still worked, so I guess you could call it refactoring.)

I first programmed when I was in the fourth grade. I was eight. A school near me had an Apple II, and they set up a program to bring in kids who were good at math to learn to program. Everybody else was in the seventh or eighth grades, but my school knew I was bored, so they sent me. We learned LOGO, a Lisp.

In the seventh grade my school got BBC Bs. I typed in games in Basic from magazines (Computer and Video Games, anyone?) and modified them. I worked out how to put them on the file server so everybody could play. The teacher could not figure out how to get rid of them.

I saved up money from many odd jobs and bought myself a Commodore 64, and wrote code for that. All through this, I still wanted to be a lawyer/veterinarian/secret agent/journalist. I don’t think I ever considered being a programmer at that stage. I don’t think I knew it was a job, as such.

At the start of my final year of high school, I had a disagreement with my parents and moved out of home, and dropped out of school. After a short aborted career as a bicycle courier, I applied for and got a job working for the government as a trainee, a program where you worked three days a week and went to TAFE (community college) for two. They called and said, we have a new program which is on a technology track. Is that interesting? I said yes, and that was my first tech job.

I went from there to another courier firm where I did things with dBase, and worked in the evenings at a Laser Tag place. One night, at a party, I started talking to these guys who were doing stuff with recorded information services over POTS. They had the first NeXTs in Australia, and I really wanted to get my hands on them.

They offered me a job, and I was suddenly Operations Controller, leading a team of four people. Still not really sure how that happened.

The bottom fell out of that industry, and I went back to school, finished high school, and went to college. Best decision I ever made career wise was my choice of program. I studied Computer Science and Computer Systems Engineering at RMIT. I was the only woman in the combined program. It was intense: you took basically all the courses needed for both of those programs (one three years, one four years) in a five year period. We took more courses in a single semester than most people did in a year. I loved it. I had found my tribe.

One day, I went to the 24 hour lab and I saw a friend, Rosemary Waghorn, with something on her terminal I had never seen before. “What’s that?” I asked. “It’s called Mosaic,” she said. “This is the world wide web.”

I sat down. I was hooked. I knew right away that *this* was what I wanted to do.

Mike Morgan – morgamic – was my boss for nearly six years. Friday was his last day working at Mozilla. I wanted to write something to memorialize his departure, in the same way he did for others. Of course, this blog post will not be as eloquent as if he had written it, but I will do my best.

There are two things that stand out about Morgamic: his leadership, and his passion for Mozilla and the Open Web.

Morgamic is that rare leader who, rather than rallying the troops from the front, leads from beside you, encouraging you every step of the way. Morgamic is an introvert. Never let anyone tell you introverts can’t lead. He excels at leadership because of his special talents for introspection, reflection and the ability and willingness to listen.

He taught me, by example, and by teaching me to ask the right questions, three important things about leadership:

Enable autonomy by quiet leadership. In six years, I don’t think he ever really told me to do anything. Like Confucius, he simply asked questions that helped me figure it out for myself.

Trust people. I can get really mad about things being done in a way I consider wrong. He always encouraged me to ask myself why someone might be doing it that way, and to trust that they were doing the best they could.

Reframe problems. Mike sees problems as complex and nuanced. It’s never black or white: you just have to zoom out a little to see a million solutions to a problem that you might not have seen before.

We certainly had disagreements over the last few years, but we always managed to resolve them in a constructive way, and that might be the greatest lesson of all. As a technical leader, he goes out of his way to hire people that he is confident are smarter than him, and he never gets insecure about it. (In my case, I’m not sure he was right. He certainly outdoes me in wisdom.) He coaches those people into excellence. Morgamic is a force multiplier. Not only that, but he cares about his people, and will go out of his way to help them develop into the best and happiest versions of themselves.

Virtually every website you use at Mozilla was made with Morgamic’s hands, Morgamic’s help, or Morgamic’s leadership. We still use code from the first web app he ever built for Mozilla, when he was a volunteer: Every time you update or download Firefox, you can do that because of Morgamic.

Morgamic also has a vision and a passion for the Mozilla Mission and the Open Web. If you’ve ever talked with him about it, you’ll know exactly what I mean. With Mike, it always came back to two questions: How does this move the mission forward? How does this benefit the Open Web?

He also manages to bring humor and humanity into every action: whether it’s org charts with Care Bears, photoshopping your head onto a meerkat, or presenting interns with trophies at the end of the summer. Once, when I had a sick pet and he knew I was really upset, he sent me a giant bunch of flowers (‘From the webdev team’). That made me cry, quite a lot, but in a good way, I swear. I still have that card on my desk, and I tear up every time I look at it.

I’m not the only one with stories. Here are some from other people who have had the pleasure and privilege of working with Morgamic:

“I’m not sure I consistently hear more praise for any other manager at Mozilla as I do about Morgamic. That includes me hearing myself talk about how pleased I’ve been over the past two years to have him mentor me — and our entire team. I feel pushed to do great work because of him, but in a way unique to him and Fred (who I have to think he mentored well, given their similar management styles) — constantly encouraged and pushed but with amazing empathy and reason for pushing me. He also encouraged us all to get along with each other and all of Mozilla, taken what seems to me as the sanest growth plan in Mozilla, and strived to build an awesome team instead of just a big one. He encouraged us to reach out and include “former” colleagues and constantly bring potentials into the webdev world.”

“Mike Morgan brought me to Mozilla, a move I had always wanted and was appreciative of. It wasn’t until I had the opportunity to really work with Mike and see him in action that I realized how much of a compliment it was to have him seek me. Mike invests so much into each of his developers that they can’t help but strive for greatness to repay the favor. Morgamic fought hard for his developers and made sure they were working on something they were passionate about. I’m proud to have worked with and for Mike Morgan and I’m already jealous of the next set of developers he’ll lead. Mozilla won’t be the same without him. Legend.”

“Like many of us on the web development team, I came to Mozilla through Mike. I’ve worked closely with him for 7 years and watched him grow from a volunteer developer into a well respected leader. I watched a team of two turn into a team of fifty with his expertise and guidance. He is magnetic – someone who naturally acts as a hub, of people, of information, and of value. He strived to be a better leader, reading books, studying role models, and speaking with experts about how to encourage excellence on his team. People who have worked with him will understand how short “he’ll be missed” falls – we’re all fortunate to have worked with him for this long, and really, I guess we’ve been greedy, it’s only fair to let the rest of the world have a chance too.”

“Even when we disagreed he trusted me. He could have ordered me to do something else, or ordered my boss to order me. Instead he’d take me for coffee and try to convince me of another way. Usually he succeeded, but when he didn’t he would go out of his way to support my decision. Our products were a byproduct of his relentless focus on the team — hiring the right people and trusting them to make the right decisions.”

“Morgamic embodied Mozilla in so many ways. He was a continual positive influence in everything we did in WebDev, always believing in people and trying to get them to improve themselves. But he went far beyond the boundaries of the team and influenced so many others. His legacy at Mozilla will continue on from those lucky enough to have worked with him.”

“Soon after I switched from the Webdev team to the Engagement team, Morgamic walked by the glass walls of a meeting room I was doing a video conference in. He walked away, came back with a whiteboard marker, drew a heart, and left.
I’d follow that man to Hades.”

“He helped me feel good at Mozilla very quickly. I like how he can be totally not serious sometimes, but efficient when he needs to. He gathered an impressive team of wonderful, excellent, incredible Web devs (except me, of course, but every team has its weakness 🙂 ). We were the first interns to win the Annual Employees VS Interns Basketball match!”

“Morgamic exemplified Mozilla for me. Openness, transparency, and just plain fight-for-the-user awesomeness. Morgamic was one of the few managers I’ve had who was less my superior and more my facilitator. He often acted like a Mozilla concierge – ensuring I had what I needed, intervening where I was blocked, and making sure I was happy and headed in the right direction. I don’t think I ever disagreed with his strategic decisions, which often had included my input or had at least been communicated to me early & often. Not that my agreement is needed to run the company, but it at least felt like he always had my back and we were doing things the right way.”

“I’ve known and worked with Mike for ten years. On meeting him, I knew immediately that I had met one of those rare personalities that one encounters only a few times in life. I watched Mike mature over the years in both his personal and professional life. One Mozilla cantina night, I recall sitting with Shaver and, possibly, Schrep: the conversation was about finding good engineering management. I remember pointing across the room to Mike, who likely at that moment was doing something very silly/dangerous with alcohol and fire. ‘There is your man, promote from within and you’ll see amazing things from him.’ I think I nailed it.”

The words “he will be missed” are so far from adequate it’s not even funny.

His legacy at Mozilla will live on, in the projects he built, in the people he mentored, through every Open Source project that comes out of Webdev, in the Mozilla mission, and in the hearts of all of us. I can’t wait to see what he pulls off next – I’ll be watching, and so should you, because I have no doubt it will be amazing. Remember, too, that being a Mozillian isn’t something that stops just because you change jobs. It’s more like what happens when your best friend moves away. Nothing changes except the logistics.

As for the rest of us, we will miss him, but we will go on working for the Open Web as the better people he helped us to be.

I asked Erik Rose from my team to blog about his work on DXR (docs), the code search and static analysis tool for the Firefox codebase. He did so on the Mozilla Webdev blog, so it would show up on Planet Mozilla. Today, it was pointed out to me that the Webdev blog is not on Planet.

Talk of impostor syndrome is almost memetic at the moment. If you don’t know what it is, go look it up. I’ll wait.

Like lots of other people, I struggle with this constantly. I’m not as smart as everybody else in the room. I’m not as good a coder. I’m not as good a manager. Sooner or later I will be found out for what I am: an impostor.

Thing is, I can rationally defeat many of those things by looking at objective evidence. I recite the evidence to myself. I am smart: my IQ is nearly 150. I wrote a programming book that some people really like – note I first wrote that as “great”, deleted it, wrote “best-selling”, deleted it, and settled for “some people really like”. I have worked on some interesting coding projects. I manage a successful team at an interesting company doing things that are technically difficult and that will hopefully make a difference in the world.

But in the back of my brain, a little voice says, that was just luck.

I recently realized that impostor syndrome is present in all parts of my life, not just in my career. Everyone is better at riding horses than I am, even though I’ve been doing it since I was four. My fiction writing sucks, and my critique group will eject me once they figure it out. My house is messier than everyone else’s, and I think I’m a terrible cook. I can’t co-ordinate my wardrobe.

The worst part is standing at the playground, thinking that every other parent there knows what they are doing except for me.

I have to remind myself these things aren’t true. Every day. I heard some good advice recently, which was to speak to yourself as if you were your best friend. You wouldn’t say to your best friend, “You’re an idiot”, now, would you? Even if your BFF did something objectively stupid, you might tell them, “You’re not stupid. We all do dumb things, sometimes.”

How about you? If you have strategies for overcoming impostor syndrome, share them in the comments.

Elmo

Elmo is a localization management dashboard. We worked this into a Playdoh app, completed a redesign, built a new homepage, deployed it on new infrastructure, moved it to a new domain, added metrics and launched the app!

Bouncer

Bouncer is the download redirector and is one of the oldest webapps at Mozilla. In 2012 we revived the project in order to support the stub installer. We worked with IT to build out new dev, stage and prod clusters. We added support for the redirects that stub installer needed, and made Bouncer SSL aware. We also fixed a number of other issues.

Air Mozilla

We built and launched the brand-new Air Mozilla webapp, including support for Persona, secure/private streams, integrated event scheduling, and a bunch of other exciting features.

Perfomatic

We worked with A-team to update graph server into Datazilla to support changes to make Talos more statistically reliable.

DXR

DXR is a code search tool based on static analysis of the code. We ran a usability study and built mockups in preparation for the work we’ve been doing this year (new UI, MXR parity).

Etherpad / Etherpad Lite

We deployed Etherpad with Persona support, and added Persona and Teampad support to Etherpad Lite (staged on the PaaS so I won’t link it here). We are working on security review of EL prior to deployment, and also on getting our changes upstreamed.

Playdoh

Verbatim

PTO

We built out a new PTO app for reporting vacation. This was completed but did not launch as a different approach is being pursued.

Sheriffs

We built out a new app for co-ordinating the Sheriffs calendar. This was completed but did not launch due to hiring a perma-sheriff (probably a better solution than a webapp).

Bramble/Briar-patch

We prototyped a monitoring and capacity planning dashboard for the build farm. This project was later put on hold and did not launch.

Team growth and development

During the year, we welcomed new team members Selena Deckelmann and Erik Rose, and intern Tim Mickel. We participated in several Mozilla workweeks, including a Stability themed work week with Engineering, a team-only workweek at DjangoCon, and a Webdev workweek. We gave talks at several conferences and participated in HackerSchool.

We got better at working with Ops, QA, and RelEng and built trust and relationships with those groups.

We automated a bunch of processes, perhaps most notably building on pull requests with Leeroy (awesome!).

finally:

If I could change anything it would be avoiding the rabbithole of projects that were later killed – it’s a waste of team effort. We had a small handful of these.

Overall, it was an awesome, invigorating, and exhausting year. I hope we can do even more and cooler things in 2013.

One point to note is that we are a broadly distributed and largely remote team, but we work well together and ship a lot of stuff. We are currently spread across Mountain View, northern California, Oregon (multiple locations), Maryland (multiple locations), France, and South Africa.

Not long ago, in a datacenter not far away…this is a story about stuff going wrong and how we fixed it.

Prologue

My team works on Socorro, the Firefox crash reporting system, among other things.

When Firefox crashes, it submits two files to us via HTTP POST. One is JSON metadata, and one is a minidump. This is similar in nature to a core dump. It’s a binary file, median size between 150 and 200 kB.

When we have a plugin problem (typically with Flash), we get two of these crash reports: one from the browser process, and one from the plugin container. Until recently it was challenging to reunite the two halves of the crash information. Benjamin Smedberg made a change to the way plugin crashes are reported. We now get a single JSON metadata file, with both minidumps, the one from the browser, and the one from the plugin container. We may at some point get another 1-2 dumps as part of the same crash report.

We needed to make a number of code changes to Socorro to support this change in our data format. From here on in, I shall refer to this architectural change as “multidump support”, or just “multidump”.
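As a rough illustration of the data-format change, a multidump crash report can be thought of as one metadata document plus a set of named minidumps. This is a minimal sketch, not Socorro's actual classes: the names `CrashReport`, `dumps`, and `is_multidump`, the dump keys `"browser"` and `"plugin"`, and the placeholder payloads are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class CrashReport:
    """Hypothetical model of one submitted crash (not Socorro's real class)."""
    metadata: Dict[str, str]                               # the JSON metadata
    dumps: Dict[str, bytes] = field(default_factory=dict)  # dump name -> minidump bytes

    def is_multidump(self) -> bool:
        # A pre-multidump report carries exactly one minidump.
        return len(self.dumps) > 1


# Old style: each half of a plugin crash arrives as its own report.
browser_only = CrashReport({"ProductName": "Firefox"}, {"browser": b"placeholder"})

# Multidump: one metadata file with both minidumps attached.
plugin_crash = CrashReport(
    {"ProductName": "Firefox"},
    {"browser": b"placeholder", "plugin": b"placeholder"},
)
```

Keeping the dumps in a mapping rather than in two fixed fields leaves room for the extra 1-2 dumps that may show up later.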

Crashes arrive via our collectors. This is a set of boxes that run two processes:

1. Collector: this is Python (web.py) running in a mod_wsgi process via Apache. Collector receives crashes via POST, and writes them to local filesystem storage.

2. Crash mover: This is a Python daemon that picks up crashes from the filesystem and writes them to HBase.

You may be saying, “Wow, local disk? That is the worst excuse for a queue I’ve ever seen.” You would be right. The collector uses pluggable storage, so it can write wherever you want (Postgres, HBase, or the filesystem). We have previously written crashes to NFS and, more recently and less successfully, directly to HBase. That turned out to be a Bad Idea™, so about two years ago I suggested we write them to local disk “until we can implement a proper queue”. Local storage has largely turned out to be “good enough”, which is why it has persisted for so long.
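To make the “local disk as a queue” arrangement concrete, here is a minimal sketch of one pass of a crash mover. It is not the real Socorro daemon: `move_crashes`, the flat spool directory, and the `store` callable (standing in for a pluggable storage backend such as HBase) are assumptions made for illustration.

```python
import os
import tempfile
from typing import Callable


def move_crashes(spool_dir: str, store: Callable[[str, bytes], None]) -> int:
    """One pass of a hypothetical crash mover; returns how many crashes moved."""
    moved = 0
    for name in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, name)
        with open(path, "rb") as f:
            payload = f.read()
        try:
            store(name, payload)   # e.g. a write to HBase in the real system
        except IOError:
            continue               # leave the file on disk; retry on the next pass
        os.remove(path)            # delete only after a successful write
        moved += 1
    return moved


# Usage: a dict stands in for the long-term store.
saved = {}
with tempfile.TemporaryDirectory() as spool:
    for crash_id in ("a1", "b2"):
        with open(os.path.join(spool, crash_id), "wb") as f:
            f.write(b"dump")
    count = move_crashes(spool, saved.__setitem__)
    remaining = os.listdir(spool)
```

Deleting only after a successful write is what lets the disk behave like a queue: a failed write just leaves the crash in place for the next pass, which is also why a backlog shows up as files piling up on the collector.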

Adding multidump support changed the filesystem code, among other things.

Act I: An Unexpected Journey

1/10/2013
We had landed multidump support on our master and stage branch, but engineers and QA agreed that we were not quite comfortable enough with it to ship it. Although we had planned to ship it this day, we didn’t, but we had some other stuff we needed to ship. Instead of what we usually do (in git, push master to stage, which is our release branch), we stashed stage changes between the last release and now, and then cherry picked the stuff we needed to ship.

What we didn’t realize was that we had accidentally left multidump in the stage branch, so when we pushed, we pushed multidump support. It ran for several hours in production seemingly without problems. We had not applied the PostgreSQL schema migration, but we had previously changed the HBase schema to support multidump, so the push didn’t cause any visible problems; the feature simply was not end-user visible. When we realized the error, we rolled back, rebuilt, and pushed the intended changes. This happened within a couple of hours. (The rollback/rebuild/repush itself took a minute or two.)

1/17/2013
We intentionally pushed multidump support. It passed QA, and everything seemed to be going swimmingly.

1/22/2013
A Socorro user (Kairo) noticed that our crash volume had been lower than average for the last couple of days.

Investigation showed that many, many crashes were backed up in filesystem storage, and that HBase writes were giving frequent errors, meaning that the crashmovers were having trouble keeping up.

We decided to take one collector box at a time out of the pool, to allow it to catch up. We also noticed at this time that all the collectors were backed up except collector04, which was keeping up. This was a massive red herring, as it later turned out. We ran around checking that the config, build, and netflows on collector04 were the same as on the other collectors. While we watched, collector04 gradually began backing up, and then was in the same boat as the others.

Based on previous experiences, many bad words were said about Thrift at this point. (If you don’t know Thrift, it’s a mechanism we use for talking to HBase. We use it because our code is in Python and not a JVM language, so we use Thrift as middleman.) But this was instinct, not empirical evidence, and therefore not useful for problem solving.

To actually diagnose the problem, we first tried strace-ing the crashmover process, and then decided to push an instrumented build to a single box. By “instrumented” I mean “it logs a lot”. As soon as we had the instrumentation in place, syslog began to tell a story. Each crash move was taking 4-5 seconds to complete. Our normal throughput on a single collector topped out at around 2800-3000 crashes/minute, so something was horribly wrong.
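A quick back-of-the-envelope calculation shows how far off those numbers were. The figures come from the paragraph above; the assumption that a collector moves crashes one at a time is mine and only there to keep the arithmetic simple.

```python
# Figures from the post; the serial-processing assumption is for illustration.
normal_per_minute = (2800, 3000)      # crashes/minute on one collector
observed_seconds_per_crash = (4, 5)   # seconds per crash move during the incident

# Time budget per crash at normal throughput, in milliseconds.
normal_ms = [60_000 / rate for rate in normal_per_minute]

# Throughput at the observed latency, in crashes/minute.
degraded_per_minute = [60 / s for s in observed_seconds_per_crash]

slowdown = min(normal_per_minute) / max(degraded_per_minute)
print(f"normal budget: {min(normal_ms):.0f}-{max(normal_ms):.0f} ms per crash")
print(f"degraded rate: {min(degraded_per_minute):.0f}-{max(degraded_per_minute):.0f} crashes/minute")
print(f"slowdown: at least {slowdown:.0f}x")
```

Under that assumption, a healthy collector spends roughly 20 ms on each crash, while 4-5 seconds per crash works out to 12-15 crashes a minute, well over a hundred times slower.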

As it turned out the slow part was actually *deleting* the crashes from disk. That was consuming almost all of the 4-5 seconds.

While looking at the crashes on disk, trying to discern a pattern, we made an interesting discovery. Our filesystem code uses radix storage: files are distributed among directories on a YY/MM/DD/HH/MM/ basis. (There are also symlinks to access the crashes by the first digit of their hex OOID, or crash ID.) We discovered that instead of distributing crashes like this, all the crashes on each collector were in a directory named [YY]/[MM]/[DD]/00/00. Given the backlog, that meant that, on the worst collector, we had 750,000 crashes in a single directory, on ext4. What could possibly go wrong?
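The difference between the intended layout and what we found on disk can be sketched as follows. This is an illustration only: `radix_dir` is a hypothetical helper, the arrival times are made up, and zeroing the hour and minute simply mimics the 00/00 directories we observed rather than reproducing the real code path.

```python
from collections import Counter
from datetime import datetime, timedelta


def radix_dir(received: datetime) -> str:
    """YY/MM/DD/HH/MM directory for a crash received at `received`."""
    return received.strftime("%y/%m/%d/%H/%M")


# 600 hypothetical crashes, one every 30 seconds starting 2013-01-22 09:00.
start = datetime(2013, 1, 22, 9, 0)
arrivals = [start + timedelta(seconds=30 * i) for i in range(600)]

# Intended behaviour: crashes spread across one directory per minute.
spread = Counter(radix_dir(t) for t in arrivals)

# Observed behaviour: hour and minute always 00, so one directory per day.
collapsed = Counter(radix_dir(t.replace(hour=0, minute=0)) for t in arrivals)
```

With per-minute directories, no directory holds more than a minute's worth of crashes; with the hour and minute stuck at 00, an entire day's backlog lands in a single directory.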

At this point we formed the hypothesis that deletes were taking so long because of the sheer number of files in a directory. (If there’s any kind of stat in the code – and strace showed there was – then this would perform poorly.)

We moved the crashes manually out of the way, as a test. This sped things up quite a bit.

We also noticed at this point that the 00/00 crashes had backed up on several days. We had some orphaned crashes on disk (a known bug, when multiple retries fail), and this was the pattern.
01/10/00/00 – a moderate number of crashes
01/17/00/00 – ditto
(same for each succeeding day)
01/22/00/00 – a huge number of crashes

These days correlated to the days we had multidump code running in production. We had kind of suspected that, but this was proof.

We rolled back a single collector to pre-multidump code, and it immediately resumed running at full speed. We then rolled back the remainder of the collectors, and took them out of the pool one at a time so they could catch up.

Somewhere during our investigation (my notes don’t show when) the intermittent failures from HBase had stopped.

By Saturday 1/26, we had caught up on the backlog. By this time we had also discovered the code bug that wrote all files into a single directory, and patched it. (The filesystem code no longer had access to the time, so all times were 00/00.)

We thought we were out of the woods, and scheduled a postmortem for 1/31. However, it wasn’t going to be that easy.

Act II: All this has happened before, and will happen again.

1/28/2013
We ran backfill for our aggregate processing, in order to recalculate totals with the additional processed crashes included.

Our working hypothesis at this stage was as follows. An unknown event involving HBase connection outages (specifically on writes) had caused crashes to begin backing up, and then having a large number of crashes in a single directory made deletion slow. We still wanted to know what had caused the HBase issue, but there were two factors that we knew about. First, at the time of the problem, we had an outage on a single Region Server. This shouldn’t cause a problem, but the timing was suspicious. Secondly, we saw an increased number of errors from Thrift. This has happened periodically and is short-term solved by restarting Thrift. We believe it is partially caused by our code handling Thrift connections in a suboptimal way, something that is in the process of being solved by our intern.

1/31
A big day. We had two things planned for this day: first, a postmortem for the multidump issue, and second, a PostgreSQL failover from our primary to secondary master so we could replace the disks with bigger ones.

Murphy, the god of outages, intermittent errors, and ironic timing, did not smile fondly upon us this day.

Crashes began backing up on collectors once again (see https://bugzilla.mozilla.org/show_bug.cgi?id=836845). We saw no HBase connection errors at this time, and hence realized at this point that we must have missed something. We rolled back to a pre-multidump build on collectors, and they immediately began catching up. We held off running backfill of aggregates at this time, because we wanted to go ahead with the failover. Disk was getting desperately short and we had already had to delay the failover once due to external factors.

We postponed the postmortem, because clearly we didn’t have a handle on the root cause(s) at this time.

We proceeded with the planned failover from master01 to master02, and replaced the disks in master01. Our plan was to maintain master02 as primary, with master01 replicating from master02. The failover went well, but the new disks for master01 turned out to be faulty, post-installation. We were now in a position where we no longer had a hot standby. Our disk vendor did not meet their SLA for replacement.

2/1
We ran backfill of aggregate reports, and from an end-user perspective everything was back to normal.

2/2
Replaced disks on master01 (again). These too had some errors but we managed to solve that.

Later, we pushed a new build that solved the quickDelete() issue. We were officially out of the woods.

Epilogue

Things that went well:

The team, consisting of engineers, WebOps, and DCOps, worked extremely well and constructively together.

As a result of looking closely at our filesystem/HBase interactions, we tuned disk performance and ordered some SSDs which have effectively doubled performance since installation. Thrift appears to be the next bottleneck in the system.

Things we could have done better:

Release management: we broke our RM process and that led us to accidentally ship the code prematurely.

Not shipped broken code, you know, the usual. Although I do have to say this was more subtly broken than average. The preventative measure here would have been better in-code documentation in the old code (“Using quickDelete here instead of remove because remove performs badly”). We did go through code review, unit and integration testing, and manual QA, as per usual, but given that this code only performed poorly once other parts of the system showed degraded performance, this was hard to catch.

Relying on end-user observation to discover how the system was broken. Monitoring can solve this.

Things we will change:

Improvements to monitoring. We will now monitor the number of backed up crashes. It’s not a root cause monitor but an indicator of trouble somewhere in the system. We have a few others of these, and they are good catch-alls for things we haven’t thought to monitor yet. We are also working on better monitoring of Thrift errors using thresholds. Right now we consider a 1% error rate on Thrift connections normal, and support limited retries with exponential backoff. We want to alert if the percentage increases. We plan on doing more of these thresholded monitors by writing these errors to statsd, and pointing nagios at the rolling aggregates. This will also work for monitoring degraded performance over time.
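To make the retry and threshold idea concrete, here is a minimal Python sketch. It is not Socorro’s actual code: `call_with_backoff`, `RollingErrorRate`, and their parameters are hypothetical names made up for illustration, and the rolling window is kept in memory rather than in statsd, with a simple boolean standing in for the nagios check.

```python
import time
from collections import deque


def call_with_backoff(operation, max_retries=3, base_delay=0.1, sleep=time.sleep):
    """Run operation(), retrying a limited number of times on ConnectionError.

    The wait doubles after each failed attempt: base, 2*base, 4*base, ...
    (Hypothetical helper; the real Thrift handling is not shown in this post.)
    """
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_retries:
                raise  # out of retries: let the caller see the failure
            sleep(base_delay * (2 ** attempt))


class RollingErrorRate:
    """Tracks the error rate over the most recent `window` calls."""

    def __init__(self, window=1000, threshold=0.01):
        self.outcomes = deque(maxlen=window)  # True = success, False = error
        self.threshold = threshold            # 0.01 = the 1% we treat as normal

    def record(self, ok):
        self.outcomes.append(bool(ok))

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes)

    def should_alert(self):
        # Alert only when the rate rises above the normal baseline.
        return self.error_rate() > self.threshold
```

In this sketch every connection attempt would call `record()`, and the monitor would poll `should_alert()`. The point of the rolling window is that a short burst of errors ages out on its own, while a sustained increase above the baseline trips the alert.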

Improvements to our test and release cycles. A few times now, we have gotten a feature to staging and then decided it’s not ready to ship. Backing it out involves git wrangling and introduces a level of human error. Our intention is to build out a set of “try” environments, that is, parallel staging environments that run different branches from the repo.

Confession:
I like disasters. They always lead to a better process and better code. Also, when the team works well together, it’s a positive trust-building and team-building experience. Much better than trust falls in my experience.

A final note
All of the troubleshooting was done with a remote team, working from various locations across North America, communicating via IRC and Vidyo. It works.

Collected more than one billion crashes: more than 150TB of raw data, amounting to around half a petabyte stored. (Not all at once: we now have a data expiration policy.)

Shipped 54 releases

Resolved 1010 bugs. Approximately 10% of these were the Django rewrite, and 40% were UI bugs. Many of the others were backend changes to support the front end work (new API calls, stored procedures, and so on).

New features include:

Reports available in build time as well as clock time (graphs, crashes/user, topcrashers)

Rapid beta support

Multiple dump support for plugin crashes

New signature summary report

Per OS top crashers

Addition of memory usage information, Android hardware information, and other new metadata

Timezone support

Correlation reports for Java

Better admin navigation

New crash trends report

Added exploitability analysis to processing and exposed this in the UI (for authorized users)

Support for ESR channel and products

Support for WebRT

Support for WebappRTMobile

Support for B2G

Explosiveness reporting (back end)

More than 50 UI tweaks for better UX

Non-user facing work included:

Automated most parts of our release process

All data access moved into a unified REST API

Completely rewrote front end in Python/Django (from old KohanaPHP version with no upgrade path)

Implemented a unified configuration management solution

Implemented unified cron job management

Implemented auto-recovery in connections for resilience

Added statsd data collection

Implemented fact tables for cleaner data reporting

Added rules-based transforms to support greater flexibility in adding new products

Refactored back end into pluggable fetch-transform-save architecture

Automated data export to stage and development environments

Created fakedata sandbox for development for both Mozilla employees and outside contributors

Implemented automated reprocessing of elfhack broken crashes

Automated tests run on all pull requests

Added views and stored procedures for metrics analysts

Opened read-only access to PostgreSQL and HBase (via Pig) for internal users

I believe we run one of the biggest software error collection services in the world. Our code is used by open source users across the internet, games, gaming (casino), music, and audio industries.

As well as working on Socorro, the Webtools team worked on more than 30 other projects, fixed countless bugs, shipped many, many releases, and supported critical organizational goals such as stub installer and Firefox Health Report. We contributed to Gaia, too.

We could not have done any of this without help from IT (especially WebOps, SRE, and DB Ops) and WebQA. A huge thank you to those teams. <3

I’ll write a part two of this blog post to talk more about our work on projects other than crash reporting, but I figured collecting a billion crashes deserved its own blog post.

Edited to add: I learned from Corey Shields, our Systems Manager, that we had 100% uptime in Q4. (He’s still working on statistics for the whole of 2012.)