Tuesday, February 10, 2009

At long last, some of the actual implementers of the advanced systems we built at IMVU for rapid deployment and rapid response are starting to write about them. I find these on-the-ground descriptions of the systems and how they work so much more credible than theory-type posts that I am excited to share them with you. I can personally attest that these guys know what they are talking about; I saw them do it first-hand. I will always be full of awe and gratitude for what they accomplished.

It’s important to note that the system I’m about to explain evolved organically in response to new demands on the system and in response to post-mortems of failures. Nobody gets here overnight, but every step along the way has made us better developers.

At a high level, our process is dead simple: continuously integrate (commit early and often). On commit, automatically run all tests. If the tests pass, deploy to the cluster. If the deploy succeeds, repeat.
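The loop described above can be sketched in a few lines of Python. This is only an illustration, not IMVU's actual tooling: `continuous_deploy`, `run_tests`, and `deploy` are hypothetical names, and the choice to simply skip a revision whose tests or deploy fail is an assumption, since the description does not say what happens on failure.

```python
from typing import Callable, Iterable, List


def continuous_deploy(
    revisions: Iterable[str],
    run_tests: Callable[[str], bool],
    deploy: Callable[[str], bool],
) -> List[str]:
    """Return the revisions that made it to the cluster.

    run_tests and deploy are hypothetical callbacks that return True
    on success. Both stand in for whatever real test runner and
    deploy script a team uses.
    """
    deployed = []
    for revision in revisions:
        # On commit, automatically run all tests.
        if not run_tests(revision):
            continue  # assumption: a failing revision is never deployed
        # If the tests pass, deploy to the cluster.
        if deploy(revision):
            deployed.append(revision)
        # If the deploy succeeds, repeat with the next revision.
    return deployed


# Usage: revision "r2" fails its tests, so only r1 and r3 are deployed.
shipped = continuous_deploy(
    ["r1", "r2", "r3"],
    run_tests=lambda rev: rev != "r2",
    deploy=lambda rev: True,
)
print(shipped)
```

The point of the sketch is that there is no holding area between "tests passed" and "deployed": a revision is either rejected or in production.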

Our test suite takes nine minutes to run (distributed across 30-40 machines). Our code pushes take another six minutes. Since these two steps are pipelined, at peak we’re pushing a new revision of the code to the website every nine minutes. That’s six deploys an hour. Even at that pace we’re often batching multiple commits into a single test/push cycle. On average we deploy new code fifty times a day.
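The arithmetic behind those numbers is worth spelling out. The sketch below assumes a simple two-stage pipeline in which the slower stage sets the pace and a single revision passes through both stages in sequence; the function names are made up for illustration.

```python
TEST_MINUTES = 9  # test suite run time, from the post
PUSH_MINUTES = 6  # code push time, from the post


def cycle_minutes(test: int, push: int) -> int:
    """With the two stages pipelined, the slower stage is the bottleneck."""
    return max(test, push)


def full_deploys_per_hour(test: int, push: int) -> int:
    """Whole deploys that fit in an hour at peak rate."""
    return 60 // cycle_minutes(test, push)


def commit_to_production_minutes(test: int, push: int) -> int:
    """Assumed best-case latency for one revision: test, then push."""
    return test + push


print(cycle_minutes(TEST_MINUTES, PUSH_MINUTES))                 # 9
print(full_deploys_per_hour(TEST_MINUTES, PUSH_MINUTES))         # 6
print(commit_to_production_minutes(TEST_MINUTES, PUSH_MINUTES))  # 15
```

So a new revision can go out every nine minutes even though any single revision takes about fifteen minutes to travel from commit to production, under the assumptions above.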

We call this process continuous deployment because it seemed to us like a natural extension of the continuous integration we were already doing. Our eventual conclusion was that there was no reason to have code that had passed the integration step but was not yet deployed. Every batch of software for which that is true is an opportunity for defects to creep in: maybe someone is changing the production environment in ways that are incompatible with code-in-progress; maybe someone in customer support is writing up a bug report about something that's just being fixed (or worse, the symptom is now changing); and no matter what else is happening, any problems that arise due to the code-in-progress require that the person who wrote it still remember how it works. The longer you wait to find out about the problem, the more likely it is to have fallen out of the human-memory cache.

Now, continuous deployment is not the only possible way to solve these kinds of problems. In another post I really enjoyed, Timothy explains five non-solutions that seem like they will help, but really won't.

1. More manual testing

This obviously doesn’t scale with complexity. This also literally can’t catch every problem, because your test sandboxes or test clusters will never be exactly like the production system.

2. More up-front planning

Up-front planning is like spices in a cooking recipe. I can’t tell you how much is too little and I can’t tell you how much is too much. But I will tell you not to have too little or too much, because either one definitely ruins the food or the product. The natural tendency of overplanning is to concentrate on issues that aren’t real. Now you’ll be making more stupid mistakes, but they’ll be for requirements that won’t ever matter.

3. More automated testing

Automated testing is great. More automated testing is even better. No amount of automated testing ensures that a feature given to real humans will survive, because no automated tests are as brutal, random, malicious, ignorant or aggressive as the sum of all your users will be.

4. Code reviews and pairing

Great practices. They’ll increase code quality, prevent defects and educate your developers. While they can go a long way toward mitigating defects, ultimately they’re limited by the fact that while two humans are better than one, they’re still both human. These techniques only catch the failures your organization as a whole was already capable of discovering.

5. Ship more infrequently

While this may decrease downtime (things break and you roll back), the cost in development time from work and rework will be large, and mistakes will continue to slip through. The natural tendency will be to ship even more infrequently, until you aren’t shipping at all. Then you’ve gone and forced yourself into a total rewrite. Which will also be doomed.

What all of these non-solutions have in common is that they treat only one aspect of the problem, but at the expense of another aspect. This is a common form of sub-optimization, where you gain efficiency in one of the sub-parts at the expense of the efficiency of the overall process. You can't make these global efficiency improvements until you get clear about the goal of your development process.

That leads to a seemingly-obvious question: what is progress in software development? It seems like it should be the amount of correctly-working code we've written. Heck, that's what it says right there in the agile manifesto. But, unfortunately, startups can't afford to adopt that standard. As I've argued elsewhere, my belief is that startups (and anyone else trying to find an unknown solution to an unknown problem) have to measure progress with validated learning about customers. In a lot of cases, that's just a fancy name for revenue or profit, but not always. Either way, we have to recognize that the biggest form of waste is building something that nobody wants, and continuous deployment is an optimization that tries to shorten this code-data-learning feedback loop.

Assuming you're with me so far, what will that mean in practice? Throwing out a lot of code. That's because as you get better at continuous deployment, you learn more and more about what works and what doesn't. If you're serious about learning, you'll continuously learn to prune the dead weight that doesn't work. That's not entirely without risk, which is a lesson we learned all-too-well at IMVU. Luckily, Chad Austin has recently weighed in with an excellent piece called 10 Pitfalls of Dirty Code.

IMVU was started with a particular philosophy: We don't know what customers will like, so let's rapidly build a lot of different stuff and throw away what doesn't work. This was an effective approach to discovering a business by using a sequence of product prototypes to get early customer feedback. The first version of the 3D IMVU client took about six months to build, and as the founders iterated towards a compelling user experience, the user base grew monthly thereafter.

This development philosophy created a culture around rapid prototyping of features, followed by testing them against large numbers of actual customers. If a feature worked, we'd keep it. If it didn't, we'd trash it.

It would be hard to argue against this product development strategy, in general. However, hindsight indicates we forgot to do something important when developing IMVU: When the product changed, we did not update the code to reflect the new product, leaving us with piles of dirty code.

So that you can learn from our mistakes, Chad has helpfully listed ten reasons why you want to manage this dirty-code (sometimes called "technical debt") problem proactively. If I could do it over again, I would have started a full continuous integration, deployment, and refactoring process from day one, complete with Five Whys for root cause analysis. But, to me anyway, one of the most inspiring parts of the IMVU story is that we didn't start with all these processes. We hadn't even heard of half of them. Slowly, painfully, incrementally, we were able to build them up over time (and without ever having a full-stop-let's-start-over timeout). If you read these pieces by the guys who were there, you'll get a visceral sense for just how painful it was.

Did it ever occur to any of you how many man hours are wasted on writing software that gets thrown out because you didn't simply ask the user if they would even like that feature? Sounds like a fabulous waste of time to me. Maybe you're getting paid by the line.

@anonymous - Do you know how many man hours are wasted on writing software that users said they wanted but didn't use? Or how much software was never sold because users never said they wanted a feature that they would have used? I don't. Just wondering if you did because you seem to have the answers.