why

Production is never the same as staging. Think about context – does staging have all the same monitoring tools and alerts set up to notify your team of failures? To be sure your backup plans work, you will have to test on production.

Customers aren’t debugging tools

Make inevitable problems come up during business hours when engineers and snacks are available instead of at night

how

Failure Fridays – someone at PagerDuty makes a server fail in a controlled way, and the team sees what happens. This is usually done to just one service

“Nuclear Option” testing includes all teams while someone cranks up traffic across all services

Apply scientific method to load testing. Form a hypothesis, run a test, check results
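
The hypothesis, test, check loop can be sketched in code. This is a minimal illustration, not a real load-testing tool: `Hypothesis`, `run_load_test`, `supports`, and the thresholds are hypothetical names and values, and `send_request` stands in for whatever call exercises your service. It sends requests one at a time for simplicity; a real test would send them concurrently.

```python
import statistics
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Hypothesis:
    """A prediction written down before the test runs (hypothetical structure)."""
    description: str
    max_p95_seconds: float
    max_error_rate: float


def run_load_test(send_request: Callable[[], bool], total_requests: int) -> Dict[str, float]:
    """Call send_request repeatedly, recording latency and failures."""
    latencies = []
    errors = 0
    for _ in range(total_requests):
        start = time.perf_counter()
        try:
            ok = send_request()
        except Exception:
            ok = False  # an exception counts as a failed request
        latencies.append(time.perf_counter() - start)
        if not ok:
            errors += 1
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile
    p95 = statistics.quantiles(latencies, n=20)[-1]
    return {"p95_seconds": p95, "error_rate": errors / total_requests}


def supports(hypothesis: Hypothesis, results: Dict[str, float]) -> bool:
    """Check the measured results against the prediction."""
    return (results["p95_seconds"] <= hypothesis.max_p95_seconds
            and results["error_rate"] <= hypothesis.max_error_rate)


if __name__ == "__main__":
    # Assumed example thresholds; pick your own before running the test.
    hypothesis = Hypothesis("p95 stays under 500 ms with under 1% errors", 0.5, 0.01)
    results = run_load_test(lambda: True, total_requests=100)  # stand-in request
    print(hypothesis.description, "->", supports(hypothesis, results))
```

Writing the thresholds down first is the point: the test either supports the prediction or it produces a surprise worth investigating.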

Any surprises from this testing?

Multiplication! Think about spin up times of services that depend on each other in sequence and requests piling up while they wait
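
One way to read the multiplication: spin-up times of services that start in sequence add up, and the backlog is that total wait multiplied by the arrival rate. The sketch below works through the arithmetic. The function names and all numbers are made up for illustration, and it assumes a constant arrival rate and that no requests are dropped while waiting.

```python
from typing import Sequence


def cold_start_backlog(spin_up_seconds: Sequence[float], arrival_rate: float) -> float:
    """Requests queued while a chain of services starts one after another."""
    total_wait = sum(spin_up_seconds)  # sequential dependencies: waits add up
    return total_wait * arrival_rate   # requests keep arriving the whole time


def drain_seconds(backlog: float, capacity: float, arrival_rate: float) -> float:
    """Time to clear the backlog once everything is up."""
    spare = capacity - arrival_rate    # only spare capacity works off the queue
    if spare <= 0:
        return float("inf")            # the queue never shrinks
    return backlog / spare


if __name__ == "__main__":
    # Hypothetical numbers: three services taking 30 s, 45 s and 60 s to start,
    # 200 requests/s arriving, 250 requests/s of capacity once running.
    backlog = cold_start_backlog([30, 45, 60], arrival_rate=200)
    print(backlog)                            # 27000
    print(drain_seconds(backlog, 250, 200))   # 540.0
```

With these assumed numbers, 135 seconds of startup leaves 27,000 requests queued, and it takes another 9 minutes to catch up.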

HTTP rate limits at the periphery of the system don’t help if the event processors at the center of the system can’t keep up with the backlog of requests
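
A toy model makes the mismatch concrete. It is a sketch under simplifying assumptions (constant rates, one-second steps, nothing dropped after admission), and `backlog_after` and its parameters are hypothetical names rather than anything from a real system.

```python
def backlog_after(seconds: int,
                  incoming_per_second: int,
                  edge_limit_per_second: int,
                  processor_capacity_per_second: int) -> int:
    """Queue length at the processors after a period of steady traffic."""
    backlog = 0
    for _ in range(seconds):
        # The edge rate limit caps what is admitted into the system...
        admitted = min(incoming_per_second, edge_limit_per_second)
        # ...but the queue still grows whenever admitted > processor capacity.
        backlog = max(0, backlog + admitted - processor_capacity_per_second)
    return backlog


if __name__ == "__main__":
    # Hypothetical numbers: 1000 req/s arriving, processors handle 300 req/s.
    print(backlog_after(60, 1000, 500, 300))  # 12000
    print(backlog_after(60, 1000, 300, 300))  # 0
```

In this model the edge limit only protects the processors when it is set at or below what they can actually handle.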