Summary

Socorro is the crash ingestion
pipeline for Mozilla's products like Firefox. When Firefox crashes, the Breakpad
crash reporter asks the user if the user would like to send a crash report. If
the user answers "yes!", then the Breakpad crash reporter collects data related
to the crash, generates a crash report, and submits that crash report as an HTTP
POST to Socorro. Socorro saves the crash report, processes it, and provides an
interface for aggregating, searching, and looking at crash reports.
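
To make the submission step concrete, here is a minimal sketch of how a crash report could be encoded as a multipart/form-data body for an HTTP POST. It is an illustration under assumptions, not the Breakpad client's code: the function name, the annotation names, and the minidump field name are all chosen for this example.

```python
import uuid


def build_crash_post(annotations, minidump, boundary=None):
    """Encode a crash report as a multipart/form-data body.

    Returns (headers, body) suitable for an HTTP POST. This is a
    hypothetical helper for illustration only.
    """
    boundary = boundary or uuid.uuid4().hex
    parts = []

    # Each crash annotation becomes a plain form field.
    for name, value in annotations.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n'
                "\r\n"
                f"{value}\r\n"
            ).encode("utf-8")
        )

    # The minidump travels as a binary file part. The field name and
    # filename used here are assumptions for this sketch.
    parts.append(
        (
            f"--{boundary}\r\n"
            'Content-Disposition: form-data; name="upload_file_minidump"; '
            'filename="crash.dmp"\r\n'
            "Content-Type: application/octet-stream\r\n"
            "\r\n"
        ).encode("utf-8")
        + minidump
        + b"\r\n"
    )

    # Closing boundary marks the end of the payload.
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    body = b"".join(parts)
    headers = {
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "Content-Length": str(len(body)),
    }
    return headers, body
```

The returned pair could then be handed to something like urllib.request.Request(url, data=body, headers=headers, method="POST"), where url is whatever submission URL the collector exposes.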

2017 was a big year for Socorro. In this blog post, I opine about our
accomplishments.

Turnover in 2017

At the beginning of the year, we had three full-time developers. Then Adrian
left for Pontoon (a completely different project) and Peter left to continue
working on Tecken (all things symbols in Socorro) on another team. That left me
alone-ish for a month which was tough. Then we picked up Mike in October.

Throughout, Lonnen managed to find time to fix Socorro things and review PRs.
That alleviated some of the lack-of-critical-mass problems we had during the
year.

We also had a cadre of engineers and other people who contributed fixes mostly
around signature generation.

That's a lot of turnover for a very very small team. At one point, I was the
only developer, which was really hard because Socorro is a huge code base. We
made do and still did great things.

Highlights 2017

2017 was a big year. I really can't overstate that. Despite the turnover, we
accomplished a lot. Some highlights:

Replaced the Socorro collector. We replaced the Socorro collector with a
top-to-bottom rewrite code-named Antenna. We put it in production in April 2017
and fixed a few minor issues that came up. We haven't touched it since then--it's
been solid.

Created a new Docker-based local development environment. This radically
improved our ability to trouble-shoot, debug, reproduce issues, fix issues,
and verify correctness of fixes. It was a game-changer.

Rewrote signature generation code and added a command line interface. This
allows us to verify signature generation changes and experiment with new ones.
We can confidently make changes to signature generation code now and know
roughly what the effects will be.

Not only that, but the tools are easy to use and make it possible for anyone
to test their signature generation changes.
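
As a rough illustration of what signature generation involves, here is a simplified sketch that turns a crashing stack into a signature by skipping frames that carry no information and joining the remaining frames until it reaches one that is specific enough. The rule lists, function name, and frame limit are hypothetical and chosen for this example. They are not Socorro's actual rules or code.

```python
import re

# Hypothetical rule lists for illustration only.
# Frames that say nothing about the crash and are skipped entirely.
IRRELEVANT = re.compile(r"^(raise|abort|__libc_.*)$")
# Frames that are too generic on their own, so the next frame is appended.
PREFIX = re.compile(r"^(mozalloc_abort|NS_DebugBreak|moz_xmalloc)$")


def generate_signature(frames, max_frames=5):
    """Build a signature string from a list of function names.

    frames is ordered from the top of the crashing stack downwards.
    """
    parts = []
    for frame in frames:
        if IRRELEVANT.match(frame):
            continue
        parts.append(frame)
        # Stop at the first frame that is specific enough to stand on
        # its own, or once the signature is long enough.
        if len(parts) >= max_frames or not PREFIX.match(frame):
            break
    return " | ".join(parts)
```

With a function like this, experimenting with a rule change amounts to editing one of the lists, running the function over a set of known stacks, and comparing the signatures before and after.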

Built a new Docker-based -stage environment. Our current infrastructure
has some rough edges and it's really different than the other systems at
Mozilla. In order to be more like other systems, we're building a new
infrastructure for Socorro that uses Ops-preferred Dockerflow bits. This new
infrastructure will make it easier to scale individual components, deploy,
back out deploys, and manage everything.

Getting a working -stage environment was a huge accomplishment. It involved
everything from writing Docker files and new command scripts, to
infrastructure glue and deploy pipeline bits, to getting everything including
our tests working on Circle CI, to rewriting Socorro code that had underlying
assumptions about how it was being run so that it works with the new system.

Work for this ongoing project is covered in [bug 1391034] and a bunch of bugs
blocking that one.

Rewrote Snappy symbolification server and all things symbols. We rewrote
the Snappy symbolification server which engineers use to symbolicate stacks to
get meaningful stack traces. This new system is code-named Tecken.

In addition to that, Peter took the project several steps further and
centralized all things symbols into Tecken.

Socorro's minidump stackwalker now asks Tecken for symbol lookups allowing
Tecken to keep track of missing symbols. Soon, we'll be able to remove all the
missing symbol bookkeeping code from Socorro.

We're also switching to Tecken for symbol uploads. Soon, we'll be able to
remove all the symbol upload code from Socorro, too.

Peter wrote a blog entry on load testing Tecken which covers some other
bits about Tecken as well.

Removed lots of code and other things from the repository. Adrian and
Peter worked on the "deprecation rampage" focusing on removing unused API
endpoints. We spent time removing Postgres tables, stored procedures, and
views we weren't using. We removed the fakedata generation code. We removed
the middleware component (most of it was folded into the webapp). We removed
the aging and broken Vagrant development environment. We removed a bunch of
scripts whose purpose has long been forgotten. We removed code for cron jobs
we no longer run. We removed bits and bobs for projects long abandoned
(running Socorro on Heroku, using hbase, etc).

There's still a lot of code ripe for removal and cleaning up, but we made
significant progress towards reducing the code base to a size that's
maintainable by a small team.

Updated Python dependencies and reworked how we manage them. We
updated all the Python dependencies (some of which were several years old),
switched to a requirements file and constraints file to specify them, and set
up monthly dependency reviews for non-security updates and daily dependency
reviews for security updates.

Updated JavaScript dependencies and switched to npm to manage them. Our
webapp relies on a bunch of JavaScript libraries. We had copies of these
libraries in the repository. We removed the vendored copies and switched to
npm to install them from a requirements file. Additionally, we updated the
dependencies to more recent versions and set up monthly review for updates.

Built better metrics infrastructure for the webapp. We switched the webapp
to use a library I wrote for Antenna called Markus. This makes it much easier to measure
things like how often API endpoints are being used. Adding metrics to the
webapp is now a two-line code-change.

I want to update the rest of Socorro in similar ways. Hopefully, I can get to
that in early 2018.
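
To give a feel for why counting endpoint usage can be a two-line change, here is a minimal sketch of the general pattern. It does not use Markus or reproduce its API. The in-memory counter, the decorator, and the endpoint below are hypothetical stand-ins for a real metrics backend and a real view.

```python
import functools
from collections import Counter

# In-memory stand-in for a metrics backend (hypothetical).
CALL_COUNTS = Counter()


def count_calls(key):
    """Return a decorator that increments a counter on every call."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            CALL_COUNTS[key] += 1
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Instrumenting an endpoint is then just the decorator line, plus the
# import that brings count_calls into the module.
@count_calls("api.hypothetical_endpoint")
def hypothetical_endpoint(request):
    return {"ok": True}
```

Because the counting lives in one place, swapping the in-memory counter for a real backend would not require touching the instrumented endpoints.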

Cleaned up bugs. We triaged and resolved 1,221 bugs. We resolved bugs that
were obsolete, that were for projects we abandoned, that were already fixed,
or that were otherwise no longer helpful. We're down to under 500 bugs now.

Switched from nose to pytest. We switched from nose to pytest. We have
hundreds of tests, so this was an overhaul of our test code which took a
while. The end result is that we're now using a test library that has features
that will make writing and maintaining tests much easier.

Linted Python code and added linting to CI. We linted all the Python code,
fixed issues, and added linting to our CI. Linting is an important tool for
finding certain classes of bugs. Being able to lint in CI reduces the risk of
code changes.

Overhauled documentation. We overhauled the documentation. We now have a
new Getting Started guide that
gets you a local development environment in roughly 4 steps. It documents the
scripts we use for manipulating that environment and running the various
components of Socorro individually as well as in conjunction with other
components.

We also updated all the documentation related to administrating and
maintaining the infrastructure.

There's still a lot of work to do here, but we made significant progress.

Wrote a system checklist. We wrote up a system checklist for verifying
that the entire system is working as expected. This is helpful after big
changes like upgrading Python versions or critical libraries.

This also gives us a list of important things in the system so we can automate
verification as much as possible and change parts that are hard to verify.

Radically reduced onboarding time for new developers. When I started in
2016, it took me more than 6 months to get up to speed and have a development
environment that worked well enough for me to be productive.

Contrast that experience with Mike who was up-to-speed in a few weeks.

It was a good year!

Lowlights 2017

We had a bunch of highlights, but we also had some lowlights:

Elasticsearch cluster upgrade fails. We've been having problems with our
Elasticsearch cluster for a while now. In 2017, we tried several times to
upgrade our Elasticsearch cluster from 1.4 to 5.1, hoping that this would
alleviate some of our problems.

We tried this in our -stage environment several times, failing each time. This
project was supposed to be pretty straight-forward, but it had complexities we
didn't understand until later.

First, we discovered we had a lot of problems in our data making it difficult
to migrate it over. We have a lot of data, so we had to copy the data from
one cluster to another cluster and transform it along the way. It's really
difficult to do that quickly. We were mucking with Groovy script embedded in a
reindex-from-remote command. The iteration cycle was rough, too--we'd run the
script for a day and then discover more issues that we had to fix.

Second, we had to rewrite and update a lot of code and our testing had a lot
of holes in it. We'd get some things working in -stage only to discover new
issues.

Since we have only one -stage environment, these experiments blocked all
Socorro development.

After the third abandoned attempt, I suggested we back up a step and build a
local development environment with both Elasticsearch cluster versions, test
everything out there, and work out the issues. In the meantime, we can fix
some of our data problems, which is probably a good idea anyhow.

Meanwhile, we're also trying to redo our infrastructure. We have a really
small team. We can't do two big projects like this at the same time. I
reprioritized them a few times hoping we could get one of them done and reduce
the number of big projects we were juggling. I think that only made things
worse.

Another year with Postgres crash storage. The Socorro processor
processes a raw crash into a processed crash and then saves it to a bunch of
crash stores. We've been trying to remove Postgres from the crash store
destinations.

This work has been really hard. The code is really tangled and slides between
Python-land and Postgres-stored-procedure-land. Some of it is well tested,
but some of it has no tests at all and interesting side-effects.

I thought we were really close to dropping Postgres as a crash store. I
tried to pick up where Adrian and Peter left off, but essentially ran out of
time in the year to finish this off.
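
The shape of the problem can be sketched as a fan-out: the processor turns a raw crash into a processed crash and then hands it to every configured crash store. The classes and functions below are hypothetical and exist only to illustrate that shape. They are not Socorro's actual classes, and they leave out the stored-procedure side-effects that make the real removal hard.

```python
class InMemoryCrashStore:
    """Hypothetical crash store that keeps processed crashes in a dict."""

    def __init__(self, name):
        self.name = name
        self.crashes = {}

    def save(self, crash_id, processed_crash):
        # Copy so that later changes to the caller's dict don't leak in.
        self.crashes[crash_id] = dict(processed_crash)


def process(raw_crash):
    """Placeholder for the real processing rules (hypothetical)."""
    processed = dict(raw_crash)
    processed["processed"] = True
    return processed


def process_and_save(crash_id, raw_crash, stores):
    """Process a raw crash and save the result to every crash store."""
    processed = process(raw_crash)
    for store in stores:
        store.save(crash_id, processed)
    return processed
```

In a design like this, dropping a destination means removing it from the list of stores. The hard part described above is everything else that quietly depends on that destination being written to.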

Switch from ftpscraper to buildhub. We currently have a script called
ftpscraper that scrapes archive.mozilla.org for new and updated build
information. It has a bunch of "interesting logic" for traversing the
directory trees and interpreting the data. It then executes a bunch of stored
procedures that convert that build information into some form and store it in
the database.

Those stored procedures do interesting things. They handle a bunch of "one-off"
scenarios in the build information, some of which stem from goofs and some from
the ever-evolving Firefox build system. They also enforce invariants that
aren't true anymore as far as I can tell. They have no tests.

Socorro's system for accruing build information is really hard to debug. It
takes days to understand how the data flowed and why weird things happened.
Many issues are ephemeral, so they're not reproducible after the fact.

Over the summer, Buildhub
was written and stood up to build and maintain a set of build data much like
what we're getting with ftpscraper. I looked at dropping our ftpscraper script
for a similar Buildhub-based script, but haven't had time to continue that
work and keep pushing it off in order to finish other things. In the meantime,
we continue to have problems with build information which we spend/waste gobs
of time debugging.

Spent the bulk of our time addressing technical debt. We worked through a
lot of technical debt that had been accreting for years. That's great, but it
was at the cost of spending time improving things that people use.

We could have spent more time honing the Crash Stats webapp interface. We
could have spent more time improving bits to make QA easier. We could have
spent more time fixing our API documentation to make it more usable.

There's never enough time to do everything, but it would be better if we had
accomplished more user-facing things.