Lessons Learned from Last Week's Windows Azure Outage – Redundancy

As you might already know, Windows Azure had an outage last week, and this generated a lot of "fuss", just as it did when the same thing happened to Amazon last year, or to other providers. Based on these outages, lots of people are now saying that Cloud Computing shouldn't be an option because it sometimes fails, because it has outages, and so on. That isn't really the right way to look at the issue, since bad things also happen when we have everything inside our own Data Center: someone applies a network update that crashes it, machines just "die" from one moment to the next, and a lot more.

What these outages remind us is that even when going to the Cloud, Windows Azure for example, we need to keep analyzing the impact that an outage in our solution might have on our business. Cloud Computing gives us a better platform and a way to be more secure, since providers already have some Disaster Recovery and Data Replication mechanisms in place, but they also come with SLAs, and if we need more than those offer, we have to work on it and architect for it right from the start. This isn't a fault of Cloud Computing. It's a requirement that our business has, and will have, inside or outside of our own Data Center.

When we are dealing with something inside our own Data Center, what we do is Redundancy. Let's take a real-world example that isn't IT related. Airplanes don't need all of their engines to fly, but they have them because if 50% fail, the other 50% will still get the airplane to its destination without problems. Of course, the level of redundancy depends on the reliability of the engines and on the impact of a failure. Since airplanes don't work very well without working engines, this is critical, so sometimes you see 50% Redundancy and sometimes 75% Redundancy, as was the case in the earlier days.

We need to do the same with our solutions when building them on Windows Azure: understand the impact that an outage has on the business, and then plan Redundancy and Disaster Recovery based on that. There are some things we can already count on, namely how Windows Azure takes care of Storage, SQL Azure, Compute and so on, since it provides SLAs that already give us a very good level of security. In the case of Storage, it also keeps 3 replicas of everything placed inside the Storage account, be it Tables, Queues or Blobs, and Geo-Replicates 1 copy into another Data Center in the same region.

What this means is that when a Data Center goes down, as happened last week with Windows Azure, it normally isn't the whole Data Center, so some part of it continues to be available. As soon as the platform identifies that part of the Data Center is down, it promotes one of the remaining replicas to primary, and then everything works again. If for some reason the whole Data Center goes down, there is still a replica in the other Data Center of the same Region.
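To put some rough numbers on the engine analogy, here is a minimal Python sketch of how redundancy changes availability. It assumes every copy fails independently of the others, which is exactly what stops being true when a whole Data Center goes down, so treat it as a best case. The 99% figure and the function names are illustrative assumptions of mine, not Windows Azure SLA values.

```python
def combined_availability(single: float, copies: int) -> float:
    """Chance that at least one of `copies` components is up.

    Assumes failures are independent, which is optimistic when all
    copies share the same Data Center.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # All copies must be down at the same time for the service to be down.
    return 1.0 - (1.0 - single) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime in a 365-day year."""
    return (1.0 - availability) * 365 * 24


if __name__ == "__main__":
    # 0.99 is an illustrative availability, not a published SLA.
    for copies in (1, 2, 3):
        a = combined_availability(0.99, copies)
        hours = downtime_hours_per_year(a)
        print(f"{copies} copies: {a:.6f} available, about {hours:.2f} h/year down")
```

Under these assumptions a second copy cuts the expected downtime by a factor of one hundred, which is why the reliability of each component and the number of copies have to be looked at together.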

If we talk about SQL Azure, the same thing happens as with Storage, except that Geo-Replication isn't there. If the whole Data Center goes down, there's no Geo-Replica to fall back on, so we need to plan for that ourselves.

Based on all this, we should treat Redundancy and Disaster Recovery as a very important part of our Architecture and System design. We also need to take into account that this means costs, so we have to find the right approach for each customer, because there's no "one size fits all" solution for this.
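One way to start that cost conversation with a customer is to compare what downtime is expected to cost per year with what the extra Redundancy costs per year. The Python sketch below is only a rough first pass: the function names are hypothetical, and the availability figures and costs in the example are made-up inputs that would have to come from the customer's own business analysis.

```python
HOURS_PER_YEAR = 365 * 24


def expected_annual_outage_cost(availability: float, cost_per_hour: float) -> float:
    """Expected yearly cost of downtime at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR * cost_per_hour


def redundancy_pays_off(base_availability: float,
                        redundant_availability: float,
                        cost_per_hour: float,
                        redundancy_cost_per_year: float) -> bool:
    """True when the downtime cost avoided exceeds the cost of the redundancy."""
    saved = (expected_annual_outage_cost(base_availability, cost_per_hour)
             - expected_annual_outage_cost(redundant_availability, cost_per_hour))
    return saved > redundancy_cost_per_year


if __name__ == "__main__":
    # Illustrative inputs only: two customers with the same platform
    # but very different costs per hour of downtime.
    for cost_per_hour in (1000.0, 100.0):
        worth_it = redundancy_pays_off(0.99, 0.9999, cost_per_hour, 50000.0)
        print(f"downtime costs {cost_per_hour:.0f}/hour -> redundancy pays off: {worth_it}")
```

The same Redundancy setup can be clearly worth it for one customer and not for another, depending only on what an hour of downtime costs them.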

In upcoming posts I'll talk about some approaches to designing Windows Azure solutions for Redundancy and Disaster Recovery.

You can also leave a comment saying what you'd like to hear about, and I'll do my best to write about it.