It’s Murphy’s Law. If it can go wrong, it will. What’s your application’s opinion on errors? If we take a stance on errors, that stance gives us principles that tell us how the application will behave when things go wrong. Shared understanding means less miscommunication. Here are some of the stances I’m talking about.

Operators have the accurate information – your data is already wrong

If you’ve ever heard the following conversation, you know data entry will always be out of sync with the real world. “I know the database is telling me that bus 7 has transmission 8 installed, but I’m looking at transmission 3. Don’t tell me I can’t uninstall transmission 3, because I just did.”

What does that mean for implementation? Favor “soft validations” that warn when errors are present but allow overrides. If I have a misallocated part in the database, I need to uninstall it from where the database “thinks” it is and mark that uninstall as something needing follow-up by people in the field.
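As a rough illustration, here is a minimal Python sketch of a soft validation. The `Inventory` class, its fields, and the `override` flag are all hypothetical names I made up for this example; the idea is only that a mismatch produces a warning, and an override goes through while leaving a follow-up record behind.

```python
from dataclasses import dataclass, field


@dataclass
class Inventory:
    # Maps a part to the vehicle the database believes it is installed on.
    installed_on: dict = field(default_factory=dict)
    # Overrides that someone in the field still needs to reconcile.
    follow_ups: list = field(default_factory=list)

    def uninstall(self, part, vehicle, override=False):
        """Uninstall `part` from `vehicle` and return any warnings raised."""
        warnings = []
        recorded = self.installed_on.get(part)
        if recorded != vehicle:
            warnings.append(f"{part} is recorded on {recorded}, not {vehicle}")
            if not override:
                # Soft failure: nothing changes, the caller may retry with override.
                return warnings
            # The operator knows better than the database, so flag it for follow-up.
            self.follow_ups.append((part, recorded, vehicle))
        self.installed_on.pop(part, None)
        return warnings
```

Calling `uninstall("transmission 3", "bus 7")` against a database that has the part on another bus returns a warning and changes nothing. Calling it again with `override=True` performs the uninstall and queues the discrepancy for review.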

It’s all fun and games until someone dies. – The data must have integrity

Particularly in the medical field, the configuration of a medical device should be exactly what you intend, with no mistakes or corruption. Corrupt data can kill. Data must also pass validation before it is used, even though it may fail validation temporarily while it is being edited.

Never write over the data file. Always write to a shadow file, test its integrity, and then, in one atomic action, move the new data file over the old one. If there is ANY failure of integrity, the old data file is a safe fallback position.
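Here is a minimal sketch of the shadow file approach in Python. The function name `save_atomically`, the JSON format, and the `is_valid` callback are assumptions for illustration. It relies on `os.replace`, which swaps the file in a single step when the shadow file lives on the same filesystem as the target.

```python
import json
import os
import tempfile


def save_atomically(path, data, is_valid):
    """Write `data` as JSON to a shadow file, verify it, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the shadow file next to the target so the final move stays on one filesystem.
    fd, shadow = tempfile.mkstemp(dir=directory, suffix=".shadow")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before trusting them
        # Integrity test: read back what was actually written and validate it.
        with open(shadow, encoding="utf-8") as f:
            reloaded = json.load(f)
        if not is_valid(reloaded):
            raise ValueError("shadow file failed integrity check")
        os.replace(shadow, path)  # the one atomic action
    except BaseException:
        # Any failure leaves the old data file untouched as the fallback.
        if os.path.exists(shadow):
            os.remove(shadow)
        raise
```

If validation fails, the exception propagates, the shadow file is cleaned up, and whatever was at `path` before is still there.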

In addition, the internal data structures might be “persistent,” allowing you to undo to a known good state if any operation fails an integrity check. Some transient operations might lead to bad values. For example, changing a medical dosing schedule might temporarily exceed safe dosages within a time frame. That is acceptable mid-edit, but you should not be able to exit the schedule edit until the schedule validates.
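A small Python sketch of that idea, under some loud assumptions: `Schedule`, `ScheduleEditor`, and the `MAX_MG_PER_DAY` limit are invented for illustration and have nothing to do with any real dosing rule. Each edit produces a new immutable snapshot, so undo is just dropping the latest one, and validation is enforced only when leaving the editor.

```python
from dataclasses import dataclass

MAX_MG_PER_DAY = 40  # hypothetical limit, for illustration only


@dataclass(frozen=True)
class Schedule:
    doses_mg: tuple = ()

    def with_dose(self, mg):
        # Returns a new schedule; the old one is never modified.
        return Schedule(self.doses_mg + (mg,))

    def without_dose(self, index):
        return Schedule(self.doses_mg[:index] + self.doses_mg[index + 1:])

    def is_safe(self):
        return sum(self.doses_mg) <= MAX_MG_PER_DAY


class ScheduleEditor:
    def __init__(self, schedule):
        self.history = [schedule]  # history[0] is the known good state

    @property
    def current(self):
        return self.history[-1]

    def apply(self, change):
        # Intermediate states may be unsafe; they are only snapshots in the editor.
        self.history.append(change(self.current))

    def undo(self):
        if len(self.history) > 1:
            self.history.pop()

    def finish(self):
        # The edit cannot be exited until the schedule validates.
        if not self.current.is_safe():
            raise ValueError("schedule exceeds the daily limit; keep editing or undo")
        return self.current
```

Adding a dose before removing another one may push the total over the limit for a moment. `finish()` refuses to return until the schedule is safe again, and `undo()` always leads back toward the original good state.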

If it crashes, I can use the browser, right? – The application must stay running

This is useful for kiosk or embedded applications. However, some long-running applications also need to handle failure.

Use multiple executables. The less an application does, the less likely it is to crash. In addition, a supervisor application can restart parts of the application that fail. Risky operations, such as those that use third-party drivers or that can enter bad states because of bad data, can run in their own sandbox and die if they ever enter a corrupt state.
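To make the supervisor idea concrete, here is a deliberately tiny Python sketch. The `supervise` function and its `max_restarts` parameter are my own invention for this example. A real supervisor would add back-off, logging, and health checks, but the core loop is just: run the worker as a separate process and start it again if it dies.

```python
import subprocess
import sys


def supervise(command, max_restarts=3):
    """Run `command` as a child process, restarting it when it exits with an error.

    Returns the number of restarts that were needed before a clean exit.
    """
    restarts = 0
    while True:
        # The worker runs in its own process, so a crash cannot take us down with it.
        result = subprocess.run(command)
        if result.returncode == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(f"worker failed {restarts + 1} times; giving up")
        restarts += 1


if __name__ == "__main__":
    # A stand-in worker that exits cleanly; replace it with the real executable.
    print(supervise([sys.executable, "-c", "print('worker ran')"]))
```

The worker's side of the bargain is simple: when it detects a corrupt state, it exits with a non-zero status instead of trying to limp along.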

There is real money involved

Real money requires an audit log. In addition, duplicate submissions are a killer and should be identified ASAP to avoid duplicate charges.

Operations, like purchases and fulfillments, are seen and recorded as first-class data. The engineering team may lean on CQRS for these operations. Operations may be given an identifier early in the process. For example, your cart may always have a “next order Id” so that commands can be made idempotent, preventing duplicate submissions.
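Here is a minimal sketch of the “next order Id” idea in Python. The `Cart` and `OrderService` classes, the in-memory dictionary, and the audit event names are assumptions for illustration, and a real system would persist both the orders and the log. Because the cart carries the order identifier before the purchase is submitted, a retry of the same submission is recognized and ignored rather than charged twice.

```python
import uuid


class Cart:
    def __init__(self):
        self.items = []
        # Assigned before checkout, so every retry carries the same identifier.
        self.next_order_id = str(uuid.uuid4())


class OrderService:
    def __init__(self):
        self.orders = {}     # order id -> order record
        self.audit_log = []  # append-only record of what happened

    def submit(self, cart):
        order_id = cart.next_order_id
        if order_id in self.orders:
            # Duplicate submission: record it and return the original order.
            self.audit_log.append(("duplicate_ignored", order_id))
            return self.orders[order_id]
        order = {"id": order_id, "items": list(cart.items)}
        self.orders[order_id] = order
        self.audit_log.append(("order_placed", order_id))
        return order
```

Submitting the same cart twice yields one order and two audit entries. Once the client has confirmation that the order went through, it would give the cart a fresh `next_order_id` for the next purchase.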

Take a bit and think about how your application can fail and what can go wrong. Pick your battles and identify the ones that are the most likely and the ones that cause the most havoc. How are you going to handle them?

About Brian Ball

I'm a Software Engineer in Indianapolis. I'm interested in programming languages. I love dynamic languages like Ruby and Perl. I'm exploring functional languages like Haskell, F#, Clojure, and Erlang. I also use the .NET framework and occasionally deal with System Testing.

If you're not careful, you might also see some sketches and other art come up every once in a while.