Microsoft’s Journey: Solving Cloud Reliability With Software

Every GPS Needs a Map

Resilient software solves problems beyond the physical world, but to get there, that software needs an intimate understanding of the physical environment it runs on. While the role of a data center manager might come with a satellite phone, it rarely comes with a GPS. Very few data center operators have a comprehensive view of how server or workload placements affect service availability. Typical placement activities are more art than science: a balancing act among capacity constraints, utilization targets, virtualization initiatives, and budgets. Relying on hardware takes a variable off the table in this complex dance. But there are a few things you can do to start building the maps and turn-by-turn directions that will enable resilient software in your environment, whether you prefer private, hybrid, or public cloud.

Map physical environment and availability domains: From a hardware standpoint, it’s important to look at the physical placement of hardware against infrastructure. We automate, and then integrate that automation, so that the data center, the network, the servers, and the operations teams running them can communicate with one another. Understanding the failure and maintenance domains of your data center, server, network, and manageability infrastructure is key to placing virtualized workloads for high availability. Trace the single-line diagram to identify common failure points, and place software replication pairs in uncorrelated environments. In most data centers, you’re limited to one failure domain, or a handful at best. However, with a cloud application platform like Windows Azure, a developer or IT professional can now choose from many different regions and availability domains to spread applications across many physical hardware environments.
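The placement rule above can be sketched in a few lines of Python. This is a minimal illustration, not a description of how Microsoft or Windows Azure places workloads: the `Host` fields, the host names, and the three failure-domain dimensions (power feed, network switch, maintenance zone) are all assumptions chosen to stand in for whatever your own single-line diagram reveals.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Host:
    """A server tagged with the failure domains it depends on (hypothetical fields)."""
    name: str
    power_feed: str
    network_switch: str
    maintenance_zone: str


def correlated(a: Host, b: Host) -> bool:
    """Two hosts are correlated if they share any single point of failure."""
    return (
        a.power_feed == b.power_feed
        or a.network_switch == b.network_switch
        or a.maintenance_zone == b.maintenance_zone
    )


def pick_replica_pair(hosts: List[Host]) -> Optional[Tuple[Host, Host]]:
    """Return the first pair of hosts with no shared failure domain, or None."""
    for a, b in combinations(hosts, 2):
        if not correlated(a, b):
            return (a, b)
    return None


# Illustrative inventory; every name here is made up.
HOSTS = [
    Host("srv-01", "feed-A", "tor-1", "mz-1"),
    Host("srv-02", "feed-A", "tor-2", "mz-2"),
    Host("srv-03", "feed-B", "tor-1", "mz-2"),
    Host("srv-04", "feed-B", "tor-3", "mz-3"),
]

pair = pick_replica_pair(HOSTS)
if pair is not None:
    print("Place replicas on:", pair[0].name, "and", pair[1].name)
```

With this inventory, the first three hosts all share something with one another (a power feed, a switch, or a maintenance zone), so a replica pair is only possible once the fourth host is available. That is the practical value of the map: it shows when apparent redundancy is actually correlated.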

Define hardware abstractions: As you look at private, public, and hybrid cloud solutions, now is a good time to start thinking about how you present the abstraction layer of your data center infrastructure. How workloads are placed on top of data center, server, and network infrastructure can make a significant difference in service resiliency and availability. Rather than assigning physical hardware to a workload, can you challenge your systems integrator or software developer to consume compute, storage, and bandwidth resources tied to an availability domain and a network latency envelope? In a hardware-abstracted environment, there is a lot of room for the data center to become an active participant in the real-time availability decisions made in software. Resilient software solves problems beyond the physical world. But to get there, the developers of that software need an intimate understanding of the physical infrastructure in order to abstract it away.
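One way to picture such an abstraction is a request that names resources and constraints instead of a specific machine. The sketch below is illustrative only: `ResourceRequest`, `CapacityPool`, `place`, and the domain labels are hypothetical names invented for this example, and the first-fit matching is a deliberately simple stand-in for a real scheduler.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ResourceRequest:
    """What a workload asks for: resources plus constraints, never a named server."""
    cores: int
    storage_gb: int
    bandwidth_mbps: int
    availability_domain: str  # hypothetical label, e.g. "ad-2"
    max_latency_ms: float     # latency envelope the workload can tolerate


@dataclass
class CapacityPool:
    """Free capacity the data center exposes for one availability domain."""
    name: str
    availability_domain: str
    latency_ms: float
    free_cores: int
    free_storage_gb: int
    free_bandwidth_mbps: int


def place(request: ResourceRequest, pools: List[CapacityPool]) -> Optional[str]:
    """First-fit placement: reserve capacity in the first pool that satisfies
    the request and return its name, or return None if nothing fits."""
    for pool in pools:
        fits = (
            pool.availability_domain == request.availability_domain
            and pool.latency_ms <= request.max_latency_ms
            and pool.free_cores >= request.cores
            and pool.free_storage_gb >= request.storage_gb
            and pool.free_bandwidth_mbps >= request.bandwidth_mbps
        )
        if fits:
            pool.free_cores -= request.cores
            pool.free_storage_gb -= request.storage_gb
            pool.free_bandwidth_mbps -= request.bandwidth_mbps
            return pool.name
    return None
```

Because the workload never learns which physical server it landed on, the operator is free to change what sits behind each pool, which is the sense in which the data center becomes a participant in availability decisions.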

Total cost of operations (TCO) performance and availability metrics: Measure constantly and focus on TCO-driven metrics like performance per dollar per kW-month, and balance that against revenue, risk, and profit. At cloud-scale, each software revision cycle is an opportunity to improve the infrastructure. The tools available to software developers, whether debuggers or coding environments, allow them to understand failures much more rapidly than we can model them in the data center space. Enabling shared key performance indicators (KPIs) across the business, developer, IT operations, and data center teams is key to demonstrating the value of infrastructure to the business’s bottom line. Finally, building bi-directional service contracts with software and business teams will enable these key business, service, and application insights to be holistically leveraged on your journey to the cloud.
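As a rough illustration of a TCO-driven metric, the sketch below computes performance per dollar per kW-month for two deployment options. The exact formula is an assumption: the text names the metric without defining it, so this example reads it as performance divided by dollars spent, divided again by kW-months of power consumed. The function name and all figures are hypothetical.

```python
def perf_per_dollar_per_kw_month(performance: float,
                                 cost_dollars: float,
                                 power_kw: float,
                                 months: float = 1.0) -> float:
    """Performance / dollars / kW-months (assumed reading of the metric).

    performance  : any consistent unit, e.g. requests served per second
    cost_dollars : total cost over the period
    power_kw     : average power draw over the period
    months       : length of the period in months
    """
    kw_months = power_kw * months
    if cost_dollars <= 0 or kw_months <= 0:
        raise ValueError("cost and energy must be positive")
    return performance / cost_dollars / kw_months


# Made-up figures for two options measured over one month.
option_a = perf_per_dollar_per_kw_month(12000, 3000, 4)  # 12000 / 3000 / 4
option_b = perf_per_dollar_per_kw_month(15000, 3000, 6)  # 15000 / 3000 / 6

print("Option A:", option_a)
print("Option B:", option_b)
```

In this made-up comparison, option B delivers more raw performance for the same spend, but option A scores higher once power is factored in. A single shared number like this is one way to give business, developer, and data center teams a common KPI to argue about.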

Resilient software is a key enabler of service availability in today’s complex IT landscape when operating at cloud-scale. By shifting the mindset away from hardware redundancy, Microsoft has made significant gains in service reliability (uptime), while lowering costs and increasing scalability, efficiency, and sustainability. So while we continue to deliver mission-critical services to more than one billion people and 20 million businesses in 76 marketplaces around the world, we’re doing it on significantly more resilient, highly integrated software that is delivered via hardware that is decidedly less than mission critical.
