Flickr "10+ Deploys Per Day" @ Velocity 2009

My favorite talk at Velocity was by Paul Hammond and John Allspaw from Flickr, who are doing real lowercase-a agile.

UPDATE

Here’s the video, highly recommended:

“If there’s one thing you do, it should be automated infrastructure”. This was a refrain through the conference – as Theo Schlossnagle put it, it doesn’t matter if it’s chef, puppet, bcfg2, cfengine – whatever works for you, just do it.

Some of their techniques:

One-step build. They literally go to a web page and click a button and watch the build take the full site from soup to nuts.

Deploys: Who. What. When. You want to make all the meta-details of a deploy easily visible to anyone. Deploy logs are readily accessible.
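To make the who/what/when idea concrete, here is a minimal sketch of a deploy log entry. This is not Flickr's actual tooling; the `DeployRecord` class, the `format_deploy` helper, the user name, and the revision string are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DeployRecord:
    who: str        # person who pushed the button
    what: str       # revision or change summary that went out
    when: datetime  # timezone-aware time of the deploy


def format_deploy(record: DeployRecord) -> str:
    """Render one deploy as a single, greppable log line."""
    # Normalize to UTC so entries from different machines sort together.
    stamp = record.when.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    return f"{stamp} | {record.who} | {record.what}"


# Hypothetical deploy; none of these values come from the talk.
record = DeployRecord(
    who="alice",
    what="r1234: fix photo upload retry",
    when=datetime(2009, 6, 25, 17, 30, tzinfo=timezone.utc),
)
print(format_deploy(record))
```

One line per deploy keeps the log easy to put on a web page or to pipe into whatever chat tool the team already watches.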

Always ship trunk. In a webapp this is possible. It vastly simplifies – everyone knows where to look for what’s going out, and what’s live.

Flickr does their branching in the code (take note git people), and these become natural ops levers (i.e. uh-oh turn that feature off / turn it down).

“Dark launches”. Facebook does this too: launch the guts of a new feature with the UI turned off so you can see how the technology works in real production.

Which results in anticlimactic launches, the best kind.

They roll forward (not back) to turn features off that aren’t working.
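Here is a rough sketch of what branching in code plus a dark launch might look like. It is an assumption-laden illustration, not Flickr's implementation: the `FLAGS` table, the flag names, and the `old_search` / `new_search` functions are all hypothetical.

```python
# Hypothetical flag table; in practice this would live in config that ops can change.
FLAGS = {
    "new_search_backend": True,   # dark launch: the new code runs in production
    "new_search_ui": False,       # ...but users do not see its results yet
}


def is_enabled(name: str) -> bool:
    """Unknown flags default to off, so a typo cannot turn a feature on."""
    return FLAGS.get(name, False)


def old_search(query: str) -> list:
    return [f"old:{query}"]


def new_search(query: str) -> list:
    return [f"new:{query}"]


def search(query: str) -> list:
    results = old_search(query)
    if is_enabled("new_search_backend"):
        try:
            shadow = new_search(query)  # exercise the new path under real load
        except Exception:
            shadow = None  # a failure in dark code must not reach users
        if is_enabled("new_search_ui") and shadow is not None:
            results = shadow
    return results


print(search("sunset"))
```

With both paths living on trunk, "launching" is flipping `new_search_ui` to true, and "rolling forward" out of trouble is flipping it back to false and shipping again.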

“Gather shitloads of metrics”. System metrics, app metrics, everything. John has a great writeup on their Ganglia setup in his book.

Developers watch the metrics just as obsessively as ops. Visualization is a powerful tool used day-to-day by the whole group.

Developers have a way of putting in their own metrics via a little framework.
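The talk did not show the framework itself, so the following is only a guess at the shape such a thing could take: a tiny in-process registry with counters and timers. The `Metrics` class, its method names, and the metric names are all assumptions; a real setup would ship these values to something like Ganglia rather than keep them in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """In-memory counters and timings that application code can write to."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def incr(self, name: str, amount: int = 1) -> None:
        self.counters[name] += amount

    @contextmanager
    def timer(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the timed block raises.
            self.timings[name].append(time.perf_counter() - start)


metrics = Metrics()


def upload_photo(data: bytes) -> int:
    """Hypothetical application function instrumented with two lines."""
    with metrics.timer("upload.seconds"):
        metrics.incr("upload.count")
        metrics.incr("upload.bytes", len(data))
        return len(data)
```

The point is that adding a metric costs a developer one line, which is what makes it likely to happen.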

They’re all on IRC. This kept coming up at the conference – teams are on some chat tool like IRC, Skype chat, or Campfire.

Important events are piped into IRC, so you’ll be right in the middle of a conversation and an alert will pop in.
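As a sketch of the plumbing, an alert only needs to be turned into one IRC `PRIVMSG` line before a bot writes it to its connection. The helper names, the `#ops` channel, and the alert fields below are hypothetical, and the code does not open a connection or enforce IRC's line-length limit.

```python
def format_alert(source: str, severity: str, message: str) -> str:
    """Turn an event into a short human-readable line."""
    return f"[{severity.upper()}] {source}: {message}"


def irc_privmsg(channel: str, text: str) -> bytes:
    """Build one IRC PRIVMSG line addressed to a channel."""
    # An IRC message is a single line ending in CRLF, so collapse any
    # embedded newlines or runs of whitespace into single spaces.
    clean = " ".join(text.split())
    return f"PRIVMSG {channel} :{clean}\r\n".encode("utf-8")


line = irc_privmsg("#ops", format_alert("monitor", "warn", "db1 load\nis high"))
print(line)
```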

Logs are piped into a search engine so they can find things in the log history, easily.
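They did not say which search engine they use, so here is only a toy stand-in for the idea: an inverted index that maps each word to the log lines containing it. The `LogIndex` class and the sample log lines are invented for illustration.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9_]+")


class LogIndex:
    """Tiny inverted index: token -> ids of the log lines that contain it."""

    def __init__(self):
        self.lines = []
        self.index = defaultdict(set)

    def add(self, line: str) -> None:
        line_id = len(self.lines)
        self.lines.append(line)
        for token in TOKEN.findall(line.lower()):
            self.index[token].add(line_id)

    def search(self, query: str) -> list:
        """Return lines containing every token in the query, oldest first."""
        tokens = TOKEN.findall(query.lower())
        if not tokens:
            return []
        ids = set.intersection(*(self.index.get(t, set()) for t in tokens))
        return [self.lines[i] for i in sorted(ids)]


logs = LogIndex()
logs.add("10:01 web3 ERROR upload timeout user=42")
logs.add("10:02 web1 INFO deploy finished")
logs.add("10:03 web3 ERROR db connection refused")
print(logs.search("web3 error"))
```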

Culture, philosophy:

There’s an ongoing conversation between dev and ops. They’re learning to solve the Flickr problem together. Each side’s way of thinking informs the other (a major conference theme as well).

Failure will happen. Develop your ability to respond. Like ER doctors, you practice on failures, and that makes you better at handling whatever comes along next.

In addition to the ops people on call, there’s always a developer who has a pager.

3 Comments

Chad Woolley says:

Great stuff Steve. I want to go to this next year :) Does his book or site have more details on all this stuff? I want to know more, like how they implement and manipulate the knobs and levers – guis? config files? what about required restarts?

June 26, 2009 at 12:55 am

Steve Conover says:

I think you need to buy his book ;-)

– guis: both Allspaw and Schlossnagle say “learn rrdtool” – I did and I’m very happy I did so. After that it’s a matter of getting ganglia up and running (via Chef of course) and stringing together the right gmetric commands to collect stats and the right rrdgraph commands to generate graphs. Then you just stick the stuff in a webpage using your favorite dynamic page generation framework.

– CI / builds – a lot of people at the conference are using Hudson. And they like it.

– Levers are just switches in code, along the lines of “if debug” except for real. From what they were saying there are different kinds of levers – on/off being the most obvious, but also numeric levers where contracting the size of a feature’s garden hose makes sense.
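To illustrate the numeric kind, here is a small sketch of a lever that shrinks a feature instead of switching it off outright. The `LEVERS` table, the lever name, and the `related_photos` function are invented for the example, not something shown in the talk.

```python
# Hypothetical lever table that an admin page or config push could change.
LEVERS = {"related_photos_max": 20}


def lever(name: str, default: int) -> int:
    """Read a numeric lever, falling back to a safe default."""
    return LEVERS.get(name, default)


def related_photos(candidates: list) -> list:
    # The default is 0, so a missing lever means the feature does nothing.
    limit = max(0, int(lever("related_photos_max", 0)))
    return candidates[:limit]


photos = list(range(100))
print(len(related_photos(photos)))

LEVERS["related_photos_max"] = 5   # ops narrows the hose under load
print(len(related_photos(photos)))
```

Setting the number to zero gives you the on/off case for free.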

But this is my big objection to git thinking (which admittedly seems to have calmed down some): forcing yourself to branch in the code naturally causes on/off levers to appear. If it’s a major new feature, you have an on/off lever (set to off) until it’s done. Then you flip it on. Then extracting that switch into something that’s configurable by an external admin is not hard.

Anyway that’s what they were talking about… Also, just about everyone mentioned operational levers: you can’t ever totally predict what’s going to go down in real prod, and the on-call people need an off switch if development “checks in a chainsaw”.

June 26, 2009 at 6:58 am

Mike Dalessio says:

I once worked in a large software shop that benefited hugely from using feature switches. This practice was the natural outcome of having a *very* painful rollback process (painful both technically and politically). A nice side-benefit was that the switches allowed features to be rolled out slowly, Google-style, to select groups of users.
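One common way to do that kind of slow rollout is to hash each user into a stable bucket and compare it against a percentage. The sketch below is an assumption about how such a switch could work, not a description of that shop's system; the function names and the feature name are hypothetical.

```python
import hashlib


def bucket(feature: str, user_id: str) -> int:
    """Map a (feature, user) pair to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def in_rollout(feature: str, user_id: str, percent: int) -> bool:
    """True if this user falls inside the first `percent` buckets."""
    return bucket(feature, user_id) < percent


# The same user always gets the same answer for a given percentage,
# and raising the percentage only ever adds users.
users = [f"user{i}" for i in range(1000)]
enabled = sum(in_rollout("new_uploader", u, 10) for u in users)
print(enabled, "of", len(users), "users see the feature at 10%")
```

Including the feature name in the hash means different features do not keep landing on the same unlucky users.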

The one danger to beware of, though, is the natural proliferation of deprecated code (and tests) as new code is turned on. YMMV, but it pays to have an aggressive pruning policy in place.