Migrating to the cloud is chaotic. Embrace it.

With a simple Google search, you will find hundreds of articles claiming that migrating to the cloud is simple and easy. But the truth is that cloud migration is simply chaotic, and the best way to get ahead of natural chaos is to synthesize your own in a controlled experiment. Matt Fornaciari explains how you can embrace the chaos and make the most of it!

There’s plenty to like about the cloud, namely not having to manage your own hardware and offloading the lion’s share of the nitty-gritty details of your infrastructure to people who make it their core competency. But it would behoove us all to acknowledge that in abstracting these details away, we also forfeit some semblance of control and put ourselves at the mercy of a third party. That means that, more than ever, we must plan for failure. We must architect it into every design decision we make. And more than anything, we must do our due diligence to understand the technologies we’re adopting.

As Chaos Engineering evangelists, we often hear “The cloud is chaotic enough! Why would we add more chaos on purpose?” We get it — we’ve answered the 3 a.m. pages too. But remember: the practice of Chaos Engineering was pioneered by the very same companies that helped develop the cloud and currently run in it, quite reliably I might add. Perhaps the most well-known example of Chaos Engineering is Netflix’s creation and use of Chaos Monkey. Instead of shying away from uncertainty, Netflix embraced the problem of ephemeral servers by constantly shutting down servers at random, to ensure that a downed server didn’t mean a bad customer experience. What’s more, they imposed the tool on all service owners as a requirement for operational maturity, as opposed to offering it as an optional tool engineers could use if they felt the calling.
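The core loop behind a Chaos Monkey-style tool is simple: pick a victim at random, terminate it, and confirm the service still meets its capacity requirement. Here is a minimal illustrative sketch in Python — the function name, instance IDs, and in-memory “fleet” are all hypothetical stand-ins; a real tool would call your cloud provider’s API instead of mutating a list:

```python
import random

def chaos_monkey_round(instances, min_healthy, rng=random):
    """Randomly terminate one instance, then report whether the
    service still meets its minimum healthy-capacity requirement."""
    victim = rng.choice(instances)
    instances.remove(victim)  # simulate the termination
    survived = len(instances) >= min_healthy
    return victim, survived

# Simulated fleet: three replicas, service needs at least two to serve traffic.
fleet = ["i-aaa", "i-bbb", "i-ccc"]
victim, ok = chaos_monkey_round(fleet, min_healthy=2)
print(f"terminated {victim}; service still healthy: {ok}")
```

The point of the exercise is the second return value: if losing one random instance ever drops you below minimum capacity, you have found a weakness before your customers did.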

In the same vein, we advocate that the best way to get ahead of natural chaos is to synthesize your own, in a controlled experiment, where you can closely observe the results, and pull the ripcord if everything goes south. We find this practice to be especially powerful when you apply it regularly — and when it’s non-negotiable. As you make continuous chaos a part of your normal engineering practice, you gradually learn to stop fearing the unknown.
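That pattern — inject a fault, watch closely, pull the ripcord if things go south — can be sketched in a few lines. Everything here is hypothetical: `inject_fault`, `revert_fault`, and `read_error_rate` stand in for whatever fault-injection and monitoring tooling you actually use:

```python
def run_experiment(inject_fault, revert_fault, read_error_rate,
                   abort_threshold, checks):
    """Inject a fault, watch a health metric, and pull the ripcord
    (revert immediately) if the metric crosses the abort threshold."""
    inject_fault()
    try:
        for _ in range(checks):
            if read_error_rate() > abort_threshold:
                return "aborted"
        return "completed"
    finally:
        revert_fault()  # always clean up, whatever happened

# Simulated run: the error rate spikes on the third health check.
rates = iter([0.01, 0.02, 0.25, 0.30])
result = run_experiment(
    inject_fault=lambda: None,       # e.g. kill a process, add latency
    revert_fault=lambda: None,       # e.g. restart it, remove the latency
    read_error_rate=lambda: next(rates),
    abort_threshold=0.05,
    checks=4,
)
print(result)
```

The abort threshold is what makes the chaos controlled: you define the blast radius and the exit condition before the experiment starts, not after the pager goes off.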

What’s even more chaotic than running in the cloud? Migrating to it. Although you might rather start your Chaos Engineering practice after your migration, the truth is that you should be proactive about testing and identifying weaknesses as you slowly move over pieces of your infrastructure. If you think about it pragmatically, this is a golden opportunity to test how your infrastructure behaves in a piecewise fashion, as your old infrastructure is still taking most of the production traffic, and you’re likely moving a single service at a time. Better to run your cloud environment through the wringer now, while it’s still in its embryonic stage and the stakes are low. What’s the alternative? Migrating all of your mission-critical services and finding out they don’t behave in the cloud like you’d hoped. In that scenario, you’re already pot-committed and you’re impacting live traffic.

Odds are, if you’re moving to the cloud, you’re probably trying to take advantage of some of the latest advances in infrastructure technologies. The landscape has been completely revolutionized by the introduction of containers and container orchestration, not to mention the brave new world of Functions as a Service (FaaS) and about a million other cloud offerings. With the advent of these technologies comes the promise of auto-scaling, auto-healing, and zero downtime. But I’m here to say there is no magic bullet, and unless you test the claims made by these technologies, you’ll have no idea if they’ll come to your aid when called upon.
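One way to test an auto-healing claim rather than take it on faith: kill a replica on purpose, then verify the platform heals back to the desired count within a deadline. A hedged sketch — the polling helper is generic, and `fake_ready_replicas` is a simulation standing in for a real status read from your orchestrator (e.g. a Kubernetes deployment’s ready-replica count):

```python
import time

def wait_for_recovery(get_ready_replicas, desired, timeout_s, poll_s=0.5):
    """Poll until the platform has healed back to the desired replica
    count, or give up once the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_ready_replicas() >= desired:
            return True
        time.sleep(poll_s)
    return False

# Simulated platform: one replica was just killed, and it comes
# back ready on the third status poll.
polls = {"count": 0}
def fake_ready_replicas():
    polls["count"] += 1
    return 3 if polls["count"] >= 3 else 2

print(wait_for_recovery(fake_ready_replicas, desired=3,
                        timeout_s=2, poll_s=0.01))
```

If this check fails — or succeeds, but only after a recovery window longer than your users will tolerate — you’ve learned exactly what the marketing page wouldn’t tell you.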

So whether you’re migrating to the cloud or even just implementing some new infrastructure practices, consider this your chance to start doing infrastructure-wide testing, before you’re too far down the rabbit hole to understand the complex interactions of your fully distributed system. After all, you wouldn’t wait until your application was code complete to start writing unit tests — why should implementing a brand new infrastructure be any different? We need to shift the mindset of operations from reactive to proactive, and the only way to do that is to test our assumptions early and often, in order to identify failure instead of responding to it.

Matt Fornaciari is Co-Founder and CTO of Gremlin. Previously, he was a Senior Platform Engineer at Salesforce, where he led the charge to bolster the experience of viewing and editing each and every record. Before that he improved the reliability and customer experience of the Amazon Retail website, where he founded the ‘Fatals’ team, which reduced the number of website errors by half in its first year.