My dad was an engineer and pilot in the RAF. We come from a long line of engineers, all the way back to Napier, the dude that figured out logarithms.

As a kid I wanted to learn everything about everything. I read every book I could lay my hands on, and I took things apart to see how they worked, notably, an alarm clock that never went back together right. (Why is there always a spring left over? The clock still worked, so I guess you could call it refactoring.)

I first programmed when I was in the fourth grade. I was eight. A school near me had an Apple II, and they set up a program to bring in kids who were good at math to learn to program. Everybody else was in the seventh or eighth grades, but my school knew I was bored, so they sent me. We learned LOGO, a Lisp.

In the seventh grade my school got BBC Bs. I typed in games in Basic from magazines (Computer and Video Games, anyone?) and modified them. I worked out how to put them on the file server so everybody could play. The teacher could not figure out how to get rid of them.

I saved up money from many odd jobs and bought myself a Commodore 64, and wrote code for that. All through this, I still wanted to be a lawyer/veterinarian/secret agent/journalist. I don’t think I ever considered being a programmer at that stage. I don’t think I knew it was a job, as such.

At the start of my final year of high school, I had a disagreement with my parents and moved out of home, and dropped out of school. After a short aborted career as a bicycle courier, I applied for and got a job working for the government as a trainee, a program where you worked three days a week and went to TAFE (community college) for two. They called and said, we have a new program which is on a technology track. Is that interesting? I said yes, and that was my first tech job.

I went from there to another courier firm where I did things with dBase, and worked in the evenings at a Laser Tag place. One night, at a party, I started talking to these guys who were doing stuff with recorded information services over POTS. They had the first NeXTs in Australia, and I really wanted to get my hands on them.

They offered me a job, and I was suddenly Operations Controller, leading a team of four people. Still not really sure how that happened.

The bottom fell out of that industry, and I went back to school, finished high school, and went to college. Best decision I ever made career-wise was my choice of program. I studied Computer Science and Computer Systems Engineering at RMIT. I was the only woman in the combined program. It was intense: you took basically all the courses needed for both of those programs (one three years, one four years) in a five-year period. We took more courses in a single semester than most people did in a year. I loved it. I had found my tribe.

One day, I went to the 24-hour lab and I saw a friend, Rosemary Waghorn, with something on her terminal I had never seen before. “What’s that?” I asked. “It’s called Mosaic,” she said. “This is the world wide web.”

I sat down. I was hooked. I knew right away that *this* was what I wanted to do.

Talk of impostor syndrome is almost memetic at the moment. If you don’t know what it is, go look it up. I’ll wait.

Like lots of other people, I struggle with this constantly. I’m not as smart as everybody else in the room. I’m not as good a coder. I’m not as good a manager. Sooner or later I will be found out for what I am: an impostor.

Thing is, I can rationally defeat many of those things by looking at objective evidence. I recite the evidence to myself. I am smart: my IQ is nearly 150. I wrote a programming book that some people really like - note I first wrote that as “great”, deleted it, wrote “best-selling”, deleted it, and settled for “some people really like”. I have worked on some interesting coding projects. I manage a successful team at an interesting company doing things that are technically difficult and that will hopefully make a difference in the world.

But in the back of my brain, a little voice says, that was just luck.

I recently realized that impostor syndrome is present in all parts of my life, not just in my career. Everyone is better at riding horses than I am, even though I’ve been doing it since I was four. My fiction writing sucks, and my critique group will eject me once they figure it out. My house is messier than everyone else’s, and I think I’m a terrible cook. I can’t co-ordinate my wardrobe.

The worst part is standing at the playground, thinking that every other parent there knows what they are doing except for me.

I have to remind myself these things aren’t true. Every day. I heard some good advice recently, which was to speak to yourself as if you were your best friend. You wouldn’t say to your best friend, “You’re an idiot”, now, would you? Even if your BFF did something objectively stupid, you might tell them, “You’re not stupid. We all do dumb things, sometimes.”

How about you? If you have strategies for overcoming impostor syndrome, share them in the comments.

In July, I was privileged to visit Hacker School as part of their Open Source week. Hacker School is an amazing place, where hackers from all walks of life work together to level up as programmers. It reminded me of all the good things about grad school. I really loved the atmosphere.

During Open Source Week, students’ goal is to submit their first patch to an existing Open Source project. A wide variety of projects were chosen by the students.

I gave a talk on getting started in Open Source, and then two of my Mozilla colleagues and I helped some students get started on some Mozilla projects. At the end of the week, the organizers gathered together a list of what the students had contributed on our projects. I’d like to share those contributions with you. They include patches, pull requests, and filed bugs.

The skill range of students varies from self-taught in the last six months, to several years’ experience, to PhD students on summer vacation. But everyone works side by side, productively and enthusiastically.

Calls to action

I learned a lot from my day at Hacker School, and it inspired me to issue these calls to action:

Coders: If you’re thinking about applying to Hacker School, do it. It’s a truly amazing place. Applications are open for the fall batch.

Hackers: Nominate people (including yourself!) to be a Hacker School resident, working alongside students for a couple of weeks.

Mozillians: we should sponsor and run and be involved with more hackathons on Mozilla projects. We should host Hackdays where we get brand new contributors involved with our projects. I propose we do this at existing Open Source conferences, get-togethers, and MozCamps, and at informal hackathons wherever the opportunity presents itself.

Finally

I’d like to thank Nick Bergson-Shilcock, David Albert, Sonali Sridhar, Thomas Ballinger, and Alan O’Donnell for running Hacker School and hosting us, and Etsy, 37signals, and Yammer for their sponsorship of the school. And of course, I’d like to thank the students for being awesome, and for their contributions!

VM Brasseur and I had a chat about what it means to be an engineering manager, as a follow up to her excellent talk on the subject at Open Source Bridge. I promised her I would put my (lengthy, rambling) thoughts into an essay of sorts, so here it is.

“Management is the art of getting things done through people.”
This is a nice pithy quote, but I prefer my version:

“Management is the craft of enabling people to get things done.”
Yes, it’s less grammatical. Sue me.

Why is management a craft?

It’s a craft for the same reasons engineering is a craft. You can read all the books you want on something but crafts are learned by getting your hands in it and getting them dirty. Crafts have rough edges, and shortcuts, and rules of thumb, and things that are held together with duct tape. The product of craft is something useful and pleasing.

(Art to me is a good deal purer: more about aesthetics and making a statement than it is about making a thing. Craft suits my analogy much better.)

Why enabling people to get things done?

Engineers, in general, know their jobs, to a greater or lesser extent. My job, as an engineering manager, is to make their jobs easier.

What do engineers value? This is of course going to be a sweeping generalization, but I’m going to resort to quoting Dan Pink: Mastery, autonomy, and purpose.

Mastery

Mastery has all kinds of implications. As a manager, my job is to enable engineers to achieve and maintain mastery. This means helping them to be good at and get better at their jobs. Enabling them to ship stuff they are passionate about. To learn the skills they need to do that. To work alongside others who they can teach and learn from. To have the right tools to do their jobs.

Autonomy

Autonomy is the key to scaling yourself as an engineering manager. As an engineer, I hate nothing more than being micromanaged. As an engineering manager, my job is to communicate the goals and where we want to get to, and work with you to determine how we’re going to get there. Then I’m going to leave you the hell alone to get stuff done.

The two most important things I do as a manager are in this section.
The first is to act as a BS umbrella for my people. This means going to meetings, fighting my way through the uncertainty, and coming up with clear goals for the team. I am the wall that stands between bureaucracy and engineers. This is also the most stressful part of my job.

The second is in 1:1s. While I talk to my remote, distributed team all day every day in IRC as needed, this is the sacrosanct time each week where we get to talk. There are three questions that make up the core of the 1:1:

How is everything going? This is an opportunity for any venting, and lets the engineer set the direction of the conversation.

What are you going to do next? Here, as a manager, I can help clarify priorities, and suggest next steps if the person is blocked.

What do you need? This can be anything from political wrangling to hardware. I will do my best to get them what they need.

In Vicky’s talk she talked about getting all your ducks in a row. In my view, the advantage of empowering your engineers with autonomy is that you get self-organizing ducks.

The key thing to remember with autonomy is this: Hire people you can trust, and then trust them to do their best.

Purpose

This is key to being a good manager, because you’re providing the purpose. You help engineers work out what the goals should be, prioritize them, clarify requirements, and make sure everybody has a clear thing they are working towards. Clarity of purpose is a powerful motivator. Dealing with uncertainty is yet another roadblock you remove from the path of your team.

Why is management fun? Why should I become a manager?

Don’t become an engineering manager because you want power - that’s the worst possible reason. A manager is a servant to their team. Become a manager if you want to serve. Become a manager if you want to work on many things at once. Becoming a manager helps you become a fulcrum for the engineering lever, and that’s a remarkably awesome place to be.

The Maker’s Schedule makes sense to me in a work setting, but how about for side projects, things you’re trying to do after hours?

I started fomenting this blog post a while ago. A very good engineer I know said something to me which I must admit rubbed me up the wrong way. He said something along the lines of, “See, you like to write for fun, and I like to code for fun.” Actually, I really like to code for fun too, but it’s much easier to write than code in fifteen minute increments, which is often all I have available to me on any given day.

Let’s be clear about one thing: I don’t think of myself as a consumer. I barely watch TV, only when my two year old insists. I can’t tell you the last time I had time to watch a movie, and I haven’t played a non-casual video game since college. I do read books, but books, too, lend themselves well to being read in fifteen minute increments.

I want to be a producer: someone who makes things. Unfortunately my life is not compatible with these long chunks of time that Paul Graham talks about. I think any parent of small children would say the same. When you’re not at work you are on an interrupt-driven schedule: not controlled by management, but controlled by the whims of the little people who are the center of your universe.

This is how I work:

When I’m doing one of the mindless things that consume some of my non-work time - showering, driving, grocery shopping, cleaning the house, laundry, barn chores - I’m planning. Whether it’s cranking away on a work problem, planning a blog post or a plot for a novel that I want to write, thinking of what projects to build for our next PHP book, mapping out a conference talk, planning code that I want to work on. This is brain priming time. When I get fifteen minutes to myself I can act on those things.

In other words, planning is parallelizable. Doing is not. Since I have so little uninterrupted time to *do*, I plan it carefully, and use it as much as I can.

When I get the occasional hour or two - nap time on a weekend (and to hell with the laundry), my husband taking our child out somewhere, or those blessed, perfect hours on a transcontinental flight - I can get so much done it makes my head hurt. But those are the exceptions, not the norm. I expect that to be the case until our child is a good deal older.

I had to train myself to do *anything* in fifteen minutes. It didn’t come naturally, but I heard the advice over and over again, particularly from women writers, some of them New York Times bestsellers. One has five children and wrote six books last year, so it can be done. The coding is coming. Training myself to code in fifteen minute increments has taken a lot longer than training myself to write in the same time.

The trick is to do that planning. Train your mind to immerse itself in the problem as soon as you get into the zone where your brain is being underutilized. This kind of immersion thinking has been useful to me for years for problem solving, and I just had to retrain myself to use it for planning.

In summary: don’t despair of Graham’s Maker’s Schedule if you just don’t have those big chunks of time outside of work. You can still be a maker. You can still be a creative person. You just have to practice. Remember: the things that count are the things we do every day, even if it’s only for fifteen minutes.

Continuous deployment is very buzzword-y right now. I have some really strong opinions about deployment (just ask James Socol or Erik Kastner, who have heard me ranting about this). Here’s what I think, in a nutshell:

You should build the capability for continuous deployment even if you never intend to do continuous deployment. The machinery is more important than your deployment velocity.

Let me take a step back and talk about deployment maturity.

Immature deployment

At the immature end, a developer or team works on a system that has no staging environment. Code goes from the development environment straight to production. (I’m not going to even talk about the situation where the development environment is production. I am fully aware that these still exist, from asking questions to that effect of the audience in conference talks.) I’m also assuming, in this era of GitHub, that everybody is using version control.

(I want to point out that it’s the easy availability of services like GitHub that has enabled even tiny, disorganized teams to use version control. VC is ubiquitous now, and that is a huge blessing.)

This sample scenario is very common: work on dev, make a commit, push it out to prod. Usually, no ops team is involved, or even exists. This can work really well in an early stage company or project, especially if you’re pre-launch.

This team likely has no automation, and a variable number of tests (0+). Even if they have tests, they may have no coverage numbers and no continuous integration.

When you hear book authors, conference speakers or tech bloggers talk about the wonders of continuous deployment, this scenario is not what they are describing.

The machinery of continuous deployment

Recently in the Mozilla webdev team, we’ve had a bunch of conversations about CD. When we talked about what was needed to do this, I had a revelation.

Although we were choosing not to do CD on my team, we had in place every requirement that was needed:

Continuous integration with build-on-commit

Tests with good coverage, and a good feel for the holes in our coverage

A staging environment that reflects production – our stage environment is a scale model of production, with the same ratios between boxes

Managed configuration

Scripted deployment to a large number of machines

I realized then that the machinery for continuous deployment is different from the deployment velocity that you choose for your team. If we need to, we can make a commit and push it out inside of a couple of minutes, without breaking a sweat.

Why we don’t do continuous deployment on Socorro

We choose not to, except in urgent situations, for a few reasons:

We like to performance test our stuff, and we haven’t yet automated that

We like to have a human QA team test in addition to automated tests

We like to version our code and do “proper” releases because it’s open source and other people use our packages

A commit to one component of our system is often related to other commits in other components, which make more sense to ship as a bundle

Our process looks like this:

The “dev” environment runs trunk, and the “stage” environment runs a release branch.

On commit, Jenkins builds packages and deploys them to the appropriate environment.

To deploy to prod, we run a script that pulls the specified package from Jenkins and pushes it out to our production environment. We also tag the revision that was packaged for others to use, and for future reference. Note we are pushing the same package to prod that we pushed to stage, and stage reflects production.

If we need to change configuration for a release, we change it first in Puppet on staging, and then update production the same way.
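The process above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Socorro's actual tooling: `Package`, `Pipeline`, `on_commit`, `deploy_to_prod`, and the branch names are all hypothetical, and the "deploy" step just records state in a dict. The point it illustrates is the rule that production only ever receives the exact package that has already been on stage.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from branch to the environment that tracks it.
BRANCH_TO_ENV = {"trunk": "dev", "release": "stage"}


@dataclass(frozen=True)
class Package:
    revision: str   # version-control revision the package was built from
    branch: str


@dataclass
class Pipeline:
    deployed: dict = field(default_factory=dict)   # environment -> Package
    tags: list = field(default_factory=list)       # revisions tagged as released

    def on_commit(self, revision, branch):
        """Build a package and deploy it to the environment for its branch."""
        package = Package(revision, branch)
        self.deployed[BRANCH_TO_ENV[branch]] = package
        return package

    def deploy_to_prod(self, package):
        """Push to prod only the exact package that is already on stage."""
        if self.deployed.get("stage") != package:
            raise ValueError("package has not been deployed to stage")
        self.deployed["prod"] = package
        self.tags.append(package.revision)   # tag for future reference


pipeline = Pipeline()
pipeline.on_commit("r100", "trunk")                 # lands on dev
candidate = pipeline.on_commit("r101", "release")   # lands on stage
pipeline.deploy_to_prod(candidate)
print(pipeline.deployed["prod"].revision, pipeline.tags)
```

Refusing anything that has not been through stage is what makes "stage reflects production" meaningful: whatever QA tested is byte-for-byte what ships.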

We do intend to increase our deployment velocity in the coming months, but for us that’s not about the machinery of deployment, it’s about increasing our delivery velocity.

Delivery velocity is a different problem, which I’m wrestling with right now. We have a small team, and the work we’re trying to do tends not to come in small chunks but big ones, like a new report, or a system-wide change to the way we aggregate data (the thing we’re working on at the moment). It’s not that changes sit in trunk, waiting for a release. It’s more that it takes us a while to get something to a deployable stage. That is, deployment for us is not the blocker to getting features to our users faster.

Finally:

It’s the same old theme you’ve seen ten times before on this blog: everybody’s environment is different, and continuous deployment may not be for everybody. On the other hand, the machinery for continuous deployment can be critical to making your life easier. Automating all this stuff certainly helps me sleep at night much better than I used to when we did manual pushes.

(Incidentally, I’d like to thank Rob Helmer and Justin Dow for automating our world: you couldn’t find better engineers to work with.)

I’ve been thinking a lot about this idea lately. I’ve spent a lot of years as an engineer and consultant fixing other people’s systems that suck, writing my own systems that suck, and working on legacy systems, that, well, suck.

If it’s an old system, there’s the part of the code that everybody is afraid to work on: the fragile code that is easier to replace than maintain or refactor. Sometimes this seems hard, or nobody really understands it. These parts of the code are almost always surrounded by an SEP field. If you’re unfamiliar with the term, it means “Somebody Else’s Problem”. Items with an SEP field are completely invisible to the average human.

New systems have the parts that haven’t been built yet, so you’ll hear things like “This will be so awesome once we build feature X”. That sucks.

There’s also the prototype that made it into production, a common problem. Something somebody knocked together over a weekend, whether it was because of lack of time, or because of their utter brilliance, is probably going to suck in ways you just haven’t worked out yet.

All systems, old and crufty or new and shiny, have bottlenecks, where a bottleneck is defined as the slow part, the part that will break first when the system is under excessive load. This is also part of your system that sucks.

If someone claims their system has no bugs, I have news for you: their system sucks. And they are overly optimistic (or naive). (Possibly they just suck as an engineer, too.)

In our heads as engineers we have the Platonic Form of a system: the system that doesn’t suck, that never breaks, that runs perfectly and quietly without anyone looking at it. We work tirelessly to make our systems approach that system.

Even if you produce this Platonically perfect system, it will begin to suck as soon as you release it. As data grows and changes, there will start to be parts of the system that don’t work right, or that don’t work fast enough. Users will find ways to make your system suck in ways you hadn’t even anticipated. When you need to add features to your perfect system, they will detract from its perfection, and make it suck more.

Here’s the punchline: sucking is like scaling. You just have to keep on top of it, keep fixing and refactoring and improving and rewriting as you go. Sometimes you can manage the suck in a linear fashion with bug fixes and refactoring, and sometimes you need a phase change where you re-do parts or all of the system to recover from suckiness.

This mess is what makes engineering and ops different from pure mathematics. Embrace the suck. It’s what gets me up in the mornings.

As readers of my blog posts know, Socorro is Mozilla’s crash reporting system. All of the code for Socorro (the crash catcher) and Breakpad (the client side) is open source and available on Google Code.

Some other companies are starting to use Socorro to process crashes. In particular, we are seeing adoption in the gaming and music sectors - people who ship connected client software.

One of these companies is Valve Software, the makers of Half Life, Left 4 Dead, and Portal, among other awesome games. Recently Elan Ruskin, a game developer at Valve, gave a talk at the Game Developers Conference about Valve’s use of crash analysis. His slides are up on his blog and are well worth a read.

If you’re thinking about trying Socorro, I’d encourage you to join the general discussion mailing list (or you can follow it on Google Groups). It’s very low traffic at present but I anticipate that it will grow as more people join.

Later in the year, we plan on hosting the inaugural Crash Summit at Mozilla, where we’ll talk about tools, crash analysis, and the future of crash reporting. Email me if you’re interested in attending (laura at mozilla) or would like to present. The event will be open to Mozillians and others. I’ll post updates on this blog as we develop the event.

We’ll be talking about what Big Data is, how to work with it, Big Data APIs (how to design and implement your own, and how to consume them), data visualization, and the wonders of MapReduce. I’ll talk through a case study around Socorro: the nature of the data we have, how we manage it, and some of the challenges we have faced so far.

Workshops are new at SXSW. They are longer than the traditional panel - 2.5 hours - so we can actually get into some technical content. We plan on making our presentation a conversation about data, with plenty of war stories.

This has been a mammoth exercise, that sprang from two separate roots:

In June last year, we had a configuration problem that looked like a spike in crashes for a pre-release version of Firefox (3.6.4). This incident (known, inaccurately, as “the crash spike”) made it clear that crash-stats is on the critical path to shipping Firefox releases.

Near the end of Q3 we realized we were rapidly approaching the capacity of the current Socorro infrastructure.

Around that time we also had difficulty with Socorro stability. This spawned the creation of a master Socorro Stability Plan, with six key areas. I’ll talk about each of these in turn.

Improve stability

Here, we solved some HBase-related issues, upgraded our software, scheduled restarts three times a week and, most importantly, sat down to conduct an architectural review.

The specific HBase issue that we finally solved had to do with intra-cluster communication. (We needed to upgrade our Broadcom NIC drivers and kernel to solve a known issue when used with HBase. This problem surfaces as the number of TCP connections growing and growing until the box crashes. Solving this removed the need for constant system restarts.)

Architectural improvements

We collect incoming crashes directly to HBase. We determined that, as HBase is relatively new and as we’d had stability problems, we should get HBase off the critical path for production uptime. We rewrote the collectors to fall back seamlessly to disk if HBase was unavailable, and optionally to use disk as primary storage. As part of this process, the system that moves crashes from disk to HBase was replaced. It went from single-threaded to multi-threaded, which makes playing catch-up after an HBase downtime much, much faster.
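Here is a minimal sketch of that fallback pattern, plus a multi-threaded mover. It is illustrative only and is not the Socorro collector code: `FlakyStore` is an in-memory stand-in for HBase, and `DiskStore`, `Collector`, and `move_spooled` are hypothetical names invented for the example.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


class FlakyStore:
    """Stand-in for HBase: an in-memory dict that can be switched off."""

    def __init__(self):
        self.rows = {}
        self.available = True

    def save(self, crash_id, crash):
        if not self.available:
            raise ConnectionError("primary store unavailable")
        self.rows[crash_id] = crash


class DiskStore:
    """Writes each crash as a JSON file in a spool directory."""

    def __init__(self, directory):
        self.directory = Path(directory)

    def save(self, crash_id, crash):
        (self.directory / f"{crash_id}.json").write_text(json.dumps(crash))

    def pending(self):
        return sorted(self.directory.glob("*.json"))


class Collector:
    def __init__(self, primary, fallback):
        self.primary = primary
        self.fallback = fallback

    def collect(self, crash_id, crash):
        """Try the primary store; spool to the fallback if it fails."""
        try:
            self.primary.save(crash_id, crash)
        except ConnectionError:
            self.fallback.save(crash_id, crash)


def move_spooled(disk, target, workers=4):
    """Move spooled crashes into the target store using a thread pool."""
    def move(path):
        target.save(path.stem, json.loads(path.read_text()))
        path.unlink()   # delete only after a successful save

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(move, disk.pending()))   # list() surfaces any exceptions


spool = tempfile.mkdtemp()
hbase, disk = FlakyStore(), DiskStore(spool)
collector = Collector(primary=hbase, fallback=disk)

collector.collect("a1", {"signature": "ok"})
hbase.available = False                        # simulate downtime
collector.collect("b2", {"signature": "spooled"})
hbase.available = True
move_spooled(disk, hbase)                      # play catch-up
print(sorted(hbase.rows), disk.pending())
```

In this sketch, passing the disk store as `primary` models the "disk as primary" mode. The mover deletes a spooled file only after the save succeeds, so a crash in the middle of catch-up loses nothing.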

We still want to put a write-through caching layer in front of HBase. This quarter, the Metrics team is prototyping a system for us to test, using Hazelcast.
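For readers who haven't met the term, this is roughly what "write-through" means. The sketch below is the textbook pattern using plain dicts. It does not use Hazelcast and says nothing about how the Metrics team's prototype works; `WriteThroughCache` and its parameters are hypothetical.

```python
class WriteThroughCache:
    """Keeps recent entries in memory and writes every update to the backing store."""

    def __init__(self, backing_store, capacity=1000):
        self.backing_store = backing_store   # any dict-like object
        self.capacity = capacity
        self.cache = {}                      # dicts preserve insertion order

    def put(self, key, value):
        # Write-through: the backing store is updated in the same call,
        # so the cache never holds data the store lacks.
        self.backing_store[key] = value
        self._remember(key, value)

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.backing_store[key]      # cache miss: read from the store
        self._remember(key, value)
        return value

    def _remember(self, key, value):
        self.cache.pop(key, None)            # re-insert so the key becomes newest
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            oldest = next(iter(self.cache))  # evict the oldest-inserted entry
            del self.cache[oldest]


store = {}
cache = WriteThroughCache(store, capacity=2)
for crash_id in ("a1", "b2", "c3"):
    cache.put(crash_id, {"id": crash_id})
print(len(store), sorted(cache.cache))
```

Every write reaches the store, while the cache keeps only the most recently inserted entries for fast reads.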

Build process

We now have an automated build system. When code is checked into svn, Hudson notices, creates a build and runs tests. This build is then deployed in staging. Now that we are on the new infrastructure, we will deploy new releases from Hudson as well.

Improved release practices

We have a greatly improved set of release guidelines, including writing a rollback plan for every major release. We did this as part of the migration, too. Developers now have read only access to everything in production: we can audit configs and tail logs.

The biggest change here, though, is switching to Puppet to manage all of our configurations. Socorro has a lot of moving parts, each with its own configuration, and thanks to the fine work by the team, all of these configurations are automatically and centrally managed and can be changed and deployed at the push of a button.

Improved insight into systems

As part of the migration, we audited our monitoring systems. We now have many more Nagios monitors on all components, and have spent a lot of time tuning these. We also set up Ganglia and have a good feel for what the system looks like under normal load.

We still intend to build a better ops dashboard so we can see all of these checks and balances in one place.

Move to bigger, better hardware in PHX

This one is the doozy. Virtually all Socorro team resources have been on this task full time since October, and we managed to subsume half of the IT/Ops team as well. I’d like to give a special shout-out here to NetOps, who patiently helped with our many many requests.

It’s worth noting that part of the challenge here is that Socorro has a fairly large number of moving parts. I present, for your perusal, a diagram of our current architecture.
I’ve blogged about our architecture before, and will do so again, soon, so this is just a teaser.

One of the most interesting parts of the migration was the extensive smoke and load testing we performed. We set up a farm of 40 SeaMicro nodes and used them to launch crashes at the new infrastructure. This allowed us to find network bottlenecks and misconfigurations and to tune the system, so that on the day of the migration we were really confident that things would go well. QA also really helped here because with the addition of a huge number of automated tests on the UI, we knew that things were looking great from an end user perspective.
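To give a flavor of what launching crashes at a system looks like, here is a small load-generator sketch. It is not the harness we used: `run_load_test` is a hypothetical helper, `fake_collector` is a stub that stands in for a real crash submission, and the 40 worker threads are only a nod to the 40 nodes, not a model of them.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(submit, total_crashes, workers):
    """Call submit(crash_number) total_crashes times from a pool of worker threads."""
    def attempt(n):
        started = time.perf_counter()
        try:
            submit(n)
            ok = True
        except Exception:
            ok = False   # count the failure and keep going
        return ok, time.perf_counter() - started

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(attempt, range(total_crashes)))

    latencies = sorted(elapsed for _, elapsed in results)
    return {
        "sent": len(results),
        "failed": sum(1 for ok, _ in results if not ok),
        "slowest_seconds": latencies[-1],
    }


def fake_collector(n):
    """Stand-in for a real submission; fails on every 50th crash."""
    if n % 50 == 49:
        raise ConnectionError("simulated network bottleneck")


report = run_load_test(fake_collector, total_crashes=200, workers=40)
print(report["sent"], report["failed"])
```

Swapping `fake_collector` for a function that actually submits a crash to a staging collector would turn this into a crude smoke test. Failure counts and the slowest latency are the first places bottlenecks and misconfigurations tend to show up.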

The future

We now have a great infrastructure and set of processes to build on. The goal of Socorro - and Webtools in general - is to help Firefox ship as fast and seamlessly as possible, by providing information and tools that help to build the best possible browser. (I’ll talk more about this in future blog posts, too.)

Thanks

I’ll wrap up by thanking every member of the team that worked on the migration, in no particular order: