System Fail

I came across an article on the New York Post via Hacker News: My Hell as a 911 Operator. The article describes the chaos that followed an overhaul of the NYC 911 computer system: the new system can't handle high call volumes and crashes frequently. Operators are forced to handle the system's tasks manually, to the point where they literally run to deliver a message to a dispatcher after taking a 911 call.

Designing and implementing large systems, especially mission-critical ones like the NYC 911 system, is extremely difficult. A complete one-shot overhaul is usually a bad idea, since a brand-new deployment rarely performs flawlessly. Having worked on medium and large systems, past and present, here are some rules and guidelines I've learned to follow that have helped prevent such epic fails.

1) Do incremental deployments. Start with a system that has the basic core features. Add features in small, frequent releases to minimize the number of potential bugs that go out with any single production release. It's always better to fight one or two bees at a time than the whole hive.
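One common way to keep releases small is to ship new code dark behind a flag, so a feature can be deployed, tested in production, and switched off instantly if it misbehaves. Here's a minimal, hypothetical sketch (the class and flag names are mine, not from the article); a real system would load the flags from external config rather than hard-coding them:

```java
import java.util.Map;

public class FeatureFlags {
    // Hypothetical in-memory flag store; in production this would be
    // backed by a config service so flags can change without a redeploy.
    private final Map<String, Boolean> flags;

    FeatureFlags(Map<String, Boolean> flags) {
        this.flags = flags;
    }

    boolean isEnabled(String name) {
        // Unknown flags default to off: new code paths ship disabled.
        return flags.getOrDefault(name, false);
    }

    public static void main(String[] args) {
        FeatureFlags ff = new FeatureFlags(Map.of("new-dispatch-queue", false));
        if (ff.isEnabled("new-dispatch-queue")) {
            System.out.println("routing through the new code path");
        } else {
            System.out.println("routing through the proven code path");
        }
    }
}
```

Because the old path stays in place until the flag flips, each release only adds one or two "bees" at a time.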

2) Always choose stability over performance. Address performance issues only when absolutely necessary. Increasing performance often means adding concurrency, adding specialized low-level code, and/or maintaining extra state of some sort. That means there are more things that can go wrong. If your first thought is to add multithreading, please do your customer a favor and smack yourself across the cheek.
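To make the multithreading warning concrete, here's a small sketch (my own example, not from the article) of the classic lost-update race: two threads bumping a shared counter with `++`, which is a non-atomic read-modify-write, versus an `AtomicLong`, which never loses an update:

```java
import java.util.concurrent.atomic.AtomicLong;

public class RaceDemo {
    static long plain = 0;                      // unsynchronized shared counter
    static final AtomicLong safe = new AtomicLong();

    static long run() throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 1_000_000; i++) {
                plain++;                 // read-modify-write, not atomic: updates can be lost
                safe.incrementAndGet();  // atomic: never loses an update
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        return safe.get();
    }

    public static void main(String[] args) throws InterruptedException {
        long atomicTotal = run();
        // The plain counter frequently falls short of 2,000,000 under
        // contention; the atomic counter always reaches it.
        System.out.println("plain:  " + plain);
        System.out.println("atomic: " + atomicTotal);
    }
}
```

The bug in `plain++` is invisible in the source, doesn't show up every run, and is exactly the kind of thing that takes down a system under the load it was "optimized" for.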

3) Think again if a new system is necessary. Is it better to just enhance the current system? If the current system is rock-solid and does what it needs to do, why replace it? I’m sure the NYC 911 operators would love having their old system back.

4) Plan for the worst and have a backup plan. Even if you are a super rockstar ninja software developer from a galaxy far, far away, please keep your ego in check. The ability to roll back to the old system would have been a great backup plan for the NYC 911 system.

These are great insights. For 2) I would add that it's rarely a good idea for a typical programmer (i.e. the 99.99% of programmers who aren't experts in concurrency) to do low-level concurrent programming in a critical production system. There are plenty of great frameworks that handle concurrency and allow you to plug in your code in the appropriate places. I remember an interesting comment from Prof. Doug Lea, the author of Concurrent Programming in Java, during a JCP discussion: "Concurrent systems are hard to write and even harder to test. I get it wrong all the time."
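As a sketch of what "plug in your code" looks like in practice (my example, using the standard `java.util.concurrent` library rather than any specific framework): an `ExecutorService` owns the threads, queues, and scheduling, and application code only supplies small independent tasks:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolDemo {
    // The framework owns thread creation and scheduling;
    // we only plug in tasks and collect results.
    static int runTasks() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Callable<Integer>> tasks = List.of(
                () -> 1 + 1,
                () -> 2 * 21,
                () -> "abc".length()
            );
            int sum = 0;
            for (Future<Integer> f : pool.invokeAll(tasks)) {
                sum += f.get(); // blocks until each task finishes
            }
            return sum;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runTasks()); // prints 47
    }
}
```

No locks, no `volatile`, no hand-rolled thread lifecycle; the tricky concurrency is delegated to code written by people like Doug Lea.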

Also, while I agree that you should always optimize for stability first, it's a good idea to think about future performance requirements so you don't make decisions that prevent scaling up later.