{

system validation in the age of CD

building

problem

Why do systems fail?

solution

Do not write bugs!

unit testing

integration testing

formal validation

however

knock knock

race condition!

who's there?

OS update

self-replicating bug

SSL expiration

“Windows Azure Storage experienced a worldwide outage impacting HTTPS traffic due to an expired SSL certificate,” Martin reported. “HTTP traffic was unaffected but the event impacted a number of Windows Azure services that are dependent on Storage.”
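An expiry like this is invisible to unit and integration tests, but it can be watched for. A minimal sketch, assuming the certificate's notAfter string is already at hand; the function names, the 30-day threshold, and the dates in the usage note are illustrative and not taken from the incident:

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format found in certificates,
    e.g. "Jan  1 00:00:00 2030 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_renewal(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True when the certificate expires within `warn_days` (assumed threshold)."""
    return days_until_expiry(not_after, now) <= warn_days
```

Called with a made-up expiry of "Jan  1 00:00:00 2030 GMT" and a current date of 20 December 2029, needs_renewal reports that renewal is due; an alarm on that result gives weeks of warning instead of an outage.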

Blast radius reduction

Canary deployments - deploy to one (or a few) instances, then compare their metrics for a while

Controlled rollout - deploy to X% at a time, auto-rollback on alarms; easier during low-traffic windows
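The two techniques above can be sketched together. This is an illustration under stated assumptions, not any particular deployment tool: the Metrics type, the error-rate tolerance, and the deploy, observe, and rollback callbacks are hypothetical stand-ins for whatever the real pipeline provides.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Metrics:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_is_healthy(baseline: Metrics, canary: Metrics,
                      tolerance: float = 0.01) -> bool:
    """Canary passes if its error rate is at most baseline + tolerance."""
    return canary.error_rate <= baseline.error_rate + tolerance


def staged_rollout(
    stages: Sequence[int],
    deploy: Callable[[int], None],
    observe: Callable[[int], Tuple[Metrics, Metrics]],
    rollback: Callable[[], None],
) -> bool:
    """Deploy to each percentage in turn; roll back at the first bad stage.

    `observe` returns (baseline, canary) metrics for the current stage.
    Returns True if every stage stayed healthy, False after a rollback.
    """
    for percent in stages:
        deploy(percent)
        baseline, canary = observe(percent)
        if not canary_is_healthy(baseline, canary):
            rollback()
            return False
    return True
```

With stages such as [1, 5, 25, 100], a regression that only shows up at 25% triggers the rollback before the remaining 75% of the fleet is touched, which is the blast radius reduction the slide refers to.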

off the beaten track

We don't have many tests. I'm not advocating that you shouldn't put in tests. [The reason we can get away with this] is that we have a great community. So instead of having automated test we have automated human tests.

Let's play

In the early 2000s, Amazon created GameDay, a program designed to increase resilience by purposely injecting major failures into critical systems semi-regularly to discover flaws and subtle dependencies. Basically, a GameDay exercise tests a company's systems, software, and people in the course of preparing for a response to a disastrous event.

DiRT was developed to find vulnerabilities in critical systems and business processes by intentionally causing failures in them, and to fix them before such failures happen in an uncontrolled manner. DiRT tests both Google's technical robustness, by breaking live systems, and our operational resilience by explicitly preventing critical personnel, area experts, and leaders from participating.
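At a much smaller scale, the same idea of intentionally causing failures can be tried in code. The sketch below is not how GameDay or DiRT are implemented; it is a hypothetical wrapper (all names are assumptions) that makes a dependency call fail at a chosen rate, so the fallback path gets exercised deliberately instead of being discovered during a real outage.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFailure(Exception):
    """Raised in place of a real dependency failure during an exercise."""


def with_fault_injection(call: Callable[[], T], failure_rate: float,
                         rng: random.Random) -> Callable[[], T]:
    """Wrap `call` so it raises InjectedFailure with probability `failure_rate`."""
    def wrapped() -> T:
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated dependency outage")
        return call()
    return wrapped


def call_with_fallback(primary: Callable[[], T],
                       fallback: Callable[[], T]) -> T:
    """Use `primary`; switch to `fallback` when a failure is injected."""
    try:
        return primary()
    except InjectedFailure:
        return fallback()
```

Setting failure_rate to 1.0 forces every call down the fallback path, and 0.0 disables injection entirely; passing in a seeded random.Random keeps an exercise reproducible.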