The Big-Bang Release Strikes Back

This is how I started off my New Year – merging the completely refactored code branch into the master. Test coverage was pitiful enough that there was no way we could do a release without a full regression test. Another term near and dear to my heart. I believe full regression testing was actually invented by Emperor Palpatine himself. And I’m pretty sure that guy didn’t use scrum in either of his Death Star projects.
No releases for almost two weeks! If you’re not doing continuous integration and deployment, two weeks may sound like a ridiculously short cycle. But we typically released every day (and sometimes more often). When the other departments heard there would be no release during this period, there was confusion and disappointment. The two sprint meetings we held during this time were a joke. The backlog was full of critical bugs that had top priority. Even if we did implement a user story (aka customer value) – we couldn’t release it until this beast was officially signed off.

100’s of commits

I watched the growing number of unreleased commits with a sinking feeling in my stomach. I’d been here and done this all too many times before. Day #1 – 23 commits, Day #2 – 38, Day #3 – 56 … this was going to be a bad release. I knew it, but the massive changes to both the code and the database left us with little choice. My only priority was to minify the release requirements drastically and push this damn thing out kicking and screaming if I had to.

“Do we absolutely need this feature to be running when we release? No? Excellent. What about this one? Really? Great!”

I think by the time we actually did the release, there were 187 commits – a record-breaking number that earned absolutely no one any recognition, awards or trophies.

As the possibility of release neared, I actually had to write a release HOWTO. Hadn’t written one of these gems in a few years. This is always a fun little exercise where you get to think like:

an end-user => “Site down – wtf??”

a software architect => “I wonder how this thing will hold up under load?”

a sysadmin => “Did they really tell me about all the Apache rewrites they need?”

And, yes, as I went through just this small exercise, I found critical problems that had to be addressed.

We don’t need another hero

So, there I was – 1am and time for the release. First back up the database, then push the code, then run the db migrations, and finally a smoke test. We should be done in 1 hour tops. All these steps had been run automatically on the staging server every night for the last two weeks. Jenkins (adieu sweet Hudson) had been glowingly green the last few days and it seemed as if the gods were finally smiling down on us. I broke out my Capistrano tool belt and went to work.
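
For what it’s worth, the four steps above boil down to an ordered list where nothing runs after the first failure. Here is a minimal Python sketch of that idea, not the Capistrano tasks we actually used. The names run_release and demo_steps are made up for illustration, and the steps are stand-in functions rather than real backup or deploy commands.

```python
from typing import Callable, List, Tuple

# A step is a label plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_release(steps: List[Step]) -> List[str]:
    """Run the steps in order and stop at the first one that fails."""
    log: List[str] = []
    for name, action in steps:
        try:
            ok = action()
        except Exception as exc:
            # An exception (e.g. a timeout) counts as a failed step.
            log.append(f"{name}: error ({exc})")
            break
        if not ok:
            log.append(f"{name}: failed")
            break
        log.append(f"{name}: ok")
    return log


def demo_steps() -> List[Step]:
    """Hypothetical stand-ins for the real release steps."""
    def migrate() -> bool:
        # Simulates the remote connection dropping mid-migration.
        raise TimeoutError("remote site timeout")

    return [
        ("backup database", lambda: True),
        ("push code", lambda: True),
        ("run db migrations", migrate),
        ("smoke test", lambda: True),
    ]


if __name__ == "__main__":
    for line in run_release(demo_steps()):
        print(line)
```

The point of stopping at the first failure is that the log tells you exactly which step to recover from, instead of leaving you to guess at 1am.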

Database backed up – check. Code pushed – check. Database migration –
“[error] – remote site timeout – disconnect”
Fuuuuuu …. did NOT see that one coming. The cap tasks always ran locally right on the staging server – there was no danger of a timeout. And what about the db migration? Checking top showed mysqld churning away, so I gently prodded it manually with the rest of the scripts knowing full well I had just b0rk3d the production database.

Which, indeed, the smoke test confirmed 20 minutes later. I diligently reimported the database at 1:45am – watching replication struggle mightily to keep up with the load. Ok, 2:30am – let’s do the db migration manually. Joy! But, wait – what’s this? This conversion step isn’t working as expected? Why not? It’s only 3am – I should be able to figure it out in a jiffy.
“UPDATE posts p SET p.url=CONCAT((select t.url from themes t where t.id = p.theme_id), p.url);”
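
To see what that statement does, here is a small sketch that replays it against a throwaway in-memory SQLite database from Python. The table and column names come from the query itself; the sample rows, the column types and the function names prefix_post_urls and demo are assumptions for illustration. SQLite does not accept the MySQL form with CONCAT and an aliased SET column, so the sketch uses the equivalent || concatenation.

```python
import sqlite3


def prefix_post_urls(conn: sqlite3.Connection) -> None:
    """Prepend each post's theme URL to the post URL (SQLite variant)."""
    conn.execute(
        "UPDATE posts SET url = "
        "(SELECT t.url FROM themes t WHERE t.id = posts.theme_id) || posts.url"
    )


def demo():
    """Build made-up sample data, run the update, return the resulting rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE themes (id INTEGER PRIMARY KEY, url TEXT)")
    conn.execute(
        "CREATE TABLE posts (id INTEGER PRIMARY KEY, theme_id INTEGER, url TEXT)"
    )
    conn.executemany(
        "INSERT INTO themes VALUES (?, ?)",
        [(1, "/cooking"), (2, "/travel")],
    )
    conn.executemany(
        "INSERT INTO posts VALUES (?, ?, ?)",
        [(1, 1, "/pasta"), (2, 2, "/rome"), (3, 99, "/orphan")],
    )
    prefix_post_urls(conn)
    rows = conn.execute("SELECT id, url FROM posts ORDER BY id").fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in demo():
        print(row)
```

One thing the sketch makes visible: a post whose theme_id matches no theme ends up with a NULL url, because the subquery returns NULL and concatenating with NULL yields NULL. MySQL’s CONCAT behaves the same way with a NULL argument, so it is worth counting such rows before running an update like this.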

Smoke test successfully completed at 3:15am. Time for bed. I dragged my ass into work by 10 o’clock the next morning. Everyone praised me for such a great job. I was a hero for saving the release. And I felt like a complete loser. Because I knew better. This is not the way to build and release software. Heroes only show up when you need them – and if you need them, you have bigger problems than you’d probably care to admit.

100’s of bugs

Like every late night party (release or otherwise), there’s the painful hangover the next day. This hangover lasted almost a week, and we ended up fixing hundreds of bugs that hadn’t been caught on the test system. But we made it – made our deadline and we could finally get back to delivering customer value.

Of course, we all know Luke did the right thing at the end. He rejected his father’s hand – that chance for glory, power and fame – and ultimately saved his father’s soul. Think hard about that the next time you’re planning a big-bang release. It may be the easy thing to do, but is it the right thing? For everyone involved?