It's a shame it's not easier or more common for people to create clones of (most|all) of their infrastructure for testing purposes.

Something like half of outages are caused by configuration oopsies.

If you accept that configuration is code, then you also come to the following disturbing conclusion: the usual test environment for critical network-related code in most environments is the production environment.

The main issue there is that "environments" are defined by configuration, so if you try to set up a configuration test environment, you run into a direct logical impasse: either your configs are production configs, and thus not a separate environment, or they're different from production configs, and thus may give different test results than production would.

While I agree with you, I think we could get closer to "production" than is common right now.

In an AWS environment, imagine a setup where all that differs is the API keys used (the API keys of the production vs test environment). What gets tricky is dealing with external dependencies, user data, and simulating traffic.
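A minimal sketch of that idea, in Python, assuming the only thing allowed to differ between environments is the credential. All names here (Stack, the fields, the key values) are illustrative, not a real AWS API:

```python
from dataclasses import dataclass, replace

# Hypothetical single source of truth for an environment's infrastructure.
# The api_key field is the ONLY thing that may differ between prod and test.
@dataclass(frozen=True)
class Stack:
    region: str
    instance_type: str
    api_key: str  # credential is the sole per-environment parameter

PROD = Stack(region="us-east-1", instance_type="m5.large", api_key="PROD_KEY")

# The test environment is the same definition with only the key swapped,
# so the configs you exercise in test are exactly the configs that ship.
TEST = replace(PROD, api_key="TEST_KEY")

# Sanity check: swapping the credential back recovers production exactly.
assert replace(TEST, api_key=PROD.api_key) == PROD
```

The point of the assert is the whole argument: if anything other than the credential differs, the two environments have diverged and your test results stop being evidence about production.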

For an example more relevant to today's issue: imagine a second simulated "internet" in a globally distributed lab environment, with its own BGP configs, fake external BGP sessions, servers receiving production traffic, and so on.
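To make the "fake external BGP sessions" idea concrete, here's a sketch of what one such session might look like in FRRouting syntax. This is a hypothetical lab fragment; the ASNs are from the private range and the addresses are documentation prefixes, all invented for illustration:

```
! Lab router impersonating an external transit provider.
! ASNs, neighbor IPs, and prefixes below are made up.
router bgp 64512
 neighbor 10.255.0.2 remote-as 64513
 neighbor 10.255.0.2 description fake-transit-provider
 address-family ipv4 unicast
  network 203.0.113.0/24
 exit-address-family
```

A handful of lab routers speaking real BGP to each other like this lets you push a candidate config change and watch what routes actually propagate, before the same change touches sessions with real peers.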

I get that it's a lot of work to set up and would require ongoing work to maintain - and that it's hard or impossible to correctly simulate the many nuances of real-world traffic - and yet I also think in many cases it would be sufficient to prevent issues from making it into production.