Assessing Risk in the Virtual Data Center

Having lived through both the Internet bubble in 2001 and the Housing/CDO bubble in 2008, we’re all too familiar with what happens when large inter-connected entities start having problems. It can often be like watching dominoes fall. Things were great while the systems appeared to be working according to plan, but then “the plan” got sidetracked and bad things started happening that weren’t on the radar.

Thinking about today’s virtualized Data Centers, we don’t have any choice but to treat them as inter-connected entities. And these entities are more inter-connected than they ever were in the past. Sure, we’ve always connected servers to networks (internal & external) and to storage, but now those inter-connections are more consolidated and less well defined.

Part of the reason to utilize any company asset, for CAPEX or OPEX purposes, is to accelerate the activities of the business. The other reason is to reduce or mitigate risk. And by definition, IT technologies can both accelerate the activities of a business and minimize some of its risks (loss of business to competitors, dependencies on physical locations for operations, etc.). But as anyone who works in IT knows, every decision surrounding IT usage (internally or externally) must also be balanced against the risk associated with the equipment or service itself.

Will it operate properly?

How does it handle unexpected events within the system?

How does it handle different types of failure?

Is it secure?

What happens if the vendor falls behind in technology or goes out of business?

How will you upgrade or migrate as the technology or standards evolve?

THE NETWORK – CONVERGING THE INFRASTRUCTURE

In the past, the mitigation of risk within the Data Center could be somewhat contained to siloed entities. A critical application might be confined to a highly-available server with redundant components (power supplies, NICs, on-board RAID controllers and disks, etc.). If that individual server, or a small cluster of physical servers, failed, the impact would be felt but mostly contained to that application.

But today’s Data Center has changed. Virtualization places a far greater burden on both the LAN and SAN to support live migrations, high availability, storage mobility, fault tolerance, clustering and the unification of the LAN/SAN fabric. The risk of building a LAN or SAN that is not scalable, secure and virtualization-enabled begins to far outweigh the risk of a single device (server) failing. The wrong decisions for the LAN/SAN will impact 1000s, 10,000s, or millions of users, an impact that is orders of magnitude greater.

Assuming that you have a LAN/SAN in place that was designed for the virtual Data Center, let’s now look at other elements to consider for risk mitigation.

How does your Data Center infrastructure handle situations where ports/paths are oversubscribed? Does it have the ability to dynamically provide Quality of Service (QoS) to the highest priority traffic, or is the infrastructure statically configured for ports/paths?

How does your Data Center infrastructure handle situations where ports/paths are under-utilized? Does it have the ability to let other ports/paths utilize free resources when bursts are needed?

Does your Data Center infrastructure have the ability to utilize multiple paths across the LAN/SAN to efficiently use all resources, while still providing redundancy and high-availability?

Does your Data Center infrastructure provide the ability for the Security teams to have visibility and extend policies to adapt to the changes that Virtualization brings to the Data Center? Without this visibility, the value of Virtualization could quickly be reduced because your business is now exposed to threats that previously were preventable in a physical environment.

As problems arise within the Data Center infrastructure, do your technical teams have visibility into the inter-connected elements from multiple vantage points? Are they “plugged in” across their technology groups? Without this, teams can end up in situations where they cannot troubleshoot properly and may not even have permission to use the proper tools to solve the problem.

If the resolution to a problem requires new hardware (Servers, Network, Storage), can that hardware be swapped into the Data Center infrastructure with minimal physical intervention, or is manual reconfiguration required?
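The oversubscription question above can be made concrete with a little arithmetic. The following is only a rough sketch: the function name `oversubscription_ratio` and the port counts and speeds are hypothetical values chosen for illustration, not figures from any particular switch. It compares the bandwidth that servers could offer against the uplink bandwidth available, both in normal operation and after one uplink is lost.

```python
def oversubscription_ratio(downlink_gbps, uplink_gbps):
    """Return total server-facing bandwidth divided by total uplink bandwidth."""
    total_down = sum(downlink_gbps)
    total_up = sum(uplink_gbps)
    if total_up <= 0:
        raise ValueError("at least one uplink with positive bandwidth is required")
    return total_down / total_up


# Hypothetical access switch: 48 x 10 Gb server ports and 4 x 40 Gb uplinks.
server_ports = [10] * 48
uplinks = [40] * 4

normal = oversubscription_ratio(server_ports, uplinks)
# The same switch after one uplink fails: the ratio gets worse.
degraded = oversubscription_ratio(server_ports, uplinks[:-1])

print(f"normal:   {normal:.1f}:1")
print(f"degraded: {degraded:.1f}:1")
```

A ratio alone says nothing about how the infrastructure behaves once it is exceeded. That is where the QoS and multi-path questions above come in.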

With virtualization changing so many of the rules, processes and procedures within the Data Center, it is critical to ensure not only that your LAN/SAN infrastructure is world-class, but also that it has been designed to adapt to unforeseen problems. These are problems that don’t get addressed by marketing slides announcing new speeds and feeds, and they can have tremendous impacts on inter-connected systems.

SERVERS AND COMPUTING

Until now, I’ve focused on the broader risk mitigation that comes from the Data Center infrastructure: the fabric that interconnects 100s and 1000s of devices and ensures that virtualized applications can be highly efficient and highly available.

But we all know that the Computing environment risk is equally important to address. More and more applications are being written to depend on other applications such that their individual risk is heightened.

On the surface, there is some very good news here. x86 architectures are serving more and more of the applications within the Data Center. Customers realize that riding the Moore’s Law curve gives them tremendous computing power at an ever-decreasing price point. Every server vendor on the planet is working extremely closely with Intel to leverage its next-generation architectures (Nehalem, Westmere, etc.). Each new announcement delivers greater computing power in a smaller footprint. Due to various development cycles you will see the vendors leapfrog each other from quarter to quarter, with incremental change at each step.

So if everyone is leveraging very similar motherboard and CPU architectures, then the risk in computing must be somewhere else. In fact, it’s within the overall computing system architecture and its ability to adapt to dynamic environments today and improved technology in the future.

Does the Computing architecture allow the environment to grow and expand without large step-cost increases at small intervals?

Does the Computing architecture decrease in performance if the system has dynamically changing workloads and traffic patterns? Does it decrease in performance if advanced functionality is enabled?

Does the Computing architecture require significant changes as 40Gb and 100Gb technologies emerge?

All of these architectural elements should be considered when attempting to control risk within the Computing portion of the Data Center.
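The step-cost question in particular lends itself to a quick calculation. The sketch below is an illustration under stated assumptions: it imagines a chassis-based design where each chassis (with its shared power and connectivity) holds a fixed number of servers, and every name and price in it is hypothetical. It shows how the cost of adding one more server can jump when that server forces the purchase of a new chassis.

```python
import math


def infrastructure_cost(servers, servers_per_chassis, chassis_cost, server_cost):
    """Total cost of `servers` servers plus the chassis needed to house them."""
    if servers < 0 or servers_per_chassis <= 0:
        raise ValueError("servers must be >= 0 and servers_per_chassis > 0")
    chassis_needed = math.ceil(servers / servers_per_chassis)
    return chassis_needed * chassis_cost + servers * server_cost


def marginal_cost(servers, servers_per_chassis, chassis_cost, server_cost):
    """Cost of going from (servers - 1) to `servers`; servers must be >= 1."""
    return (
        infrastructure_cost(servers, servers_per_chassis, chassis_cost, server_cost)
        - infrastructure_cost(servers - 1, servers_per_chassis, chassis_cost, server_cost)
    )


# Hypothetical pricing: 8 servers per chassis, 20,000 per chassis, 5,000 per server.
for n in (8, 9):
    print(n, marginal_cost(n, servers_per_chassis=8, chassis_cost=20_000, server_cost=5_000))
```

With these made-up numbers, the eighth server costs only its own price, while the ninth also carries the cost of a whole new chassis. The smaller the interval between such steps, the more often growth triggers that kind of jump.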

The final piece to consider with Computing risk in today’s virtualized Data Center is VM Density. How many VMs should be put on individual ESX hosts? How many VMs in a Datastore? How many VMs in an ESX cluster? There are pros and cons to this discussion, with the end result coming down to the level of risk that each IT organization is willing to accept. Instead of giving an opinion, I thought it would be best to reference four of VMware’s 2010 vExperts, engineers identified to be the best and the brightest in the field today regarding virtualization deployments and architectures. As you will see from their discussions, the answer is not always “uber-density wins!”
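To make the density trade-off tangible, here is a rough sketch of what a single host failure costs at different densities. It is not VMware’s sizing method; the function `host_failure_impact`, the per-host cap, and the cluster sizes are all hypothetical, and it assumes VMs are spread evenly and are interchangeable in size. It reports how many VMs must restart elsewhere, what share of the cluster that is, and whether the surviving hosts have enough headroom to absorb them.

```python
def host_failure_impact(hosts, vms_per_host, max_vms_per_host):
    """Estimate the effect of losing one host in an evenly loaded cluster."""
    if hosts < 2:
        raise ValueError("need at least two hosts to survive a host failure")
    if not 0 < vms_per_host <= max_vms_per_host:
        raise ValueError("vms_per_host must be positive and within the per-host cap")
    total_vms = hosts * vms_per_host
    # Free VM slots left on the surviving hosts.
    spare_slots = (hosts - 1) * (max_vms_per_host - vms_per_host)
    return {
        "vms_to_restart": vms_per_host,
        "affected_share": vms_per_host / total_vms,
        "survivors_can_absorb": spare_slots >= vms_per_host,
    }


# Two hypothetical ways to run the same 160 VMs.
spread_out = host_failure_impact(hosts=8, vms_per_host=20, max_vms_per_host=24)
dense = host_failure_impact(hosts=4, vms_per_host=40, max_vms_per_host=48)

print("8 hosts x 20 VMs:", spread_out)
print("4 hosts x 40 VMs:", dense)
```

In this made-up comparison, the denser cluster uses half the hosts, but one failure takes down twice the share of VMs and the survivors cannot restart them all. Whether that trade is acceptable is exactly the risk decision described above.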

As I stated at the beginning, we now live in a world of highly inter-connected systems. This inter-connectedness provides us unprecedented access to information, new markets and new business opportunities. But it also creates new levels of risk that must be considered within the Data Center. It’s important to look at risk systematically, taking into consideration both the expected plan and those unforeseen problems that can make these inter-connected systems tumble like dominoes. I hope this post can help you begin to create a risk-management framework that allows you to take all the emerging virtualized technologies within the Data Center into consideration.

3 Comments.

Very cool, Brian. It seems like the data center issues lately have all been related to power, caused by either accidental outage or machine failure. What steps can be taken to prevent this, other than having an instant failover system in place in a separate location?
