Virtualization in the trenches with VMware, Part 1: Basics and benefits

In this four-part series, Ars looks at various aspects of the real-world …

IT in the enterprise is as much about technology as it is about people, processes, and business needs. In this four-part series, we will cover some of the challenges faced when trying to design and deploy a virtualization platform for a sizable enterprise and migrate its infrastructure into the cloud. This usually ends up being a far larger undertaking than imagined, partly due to technical challenges, but mostly due to the compromises that have to be made at every step of the way.

For reasons I'll cover in a moment, this series focuses on VMware. There are some fantastic alternatives to VMware out there, but VMware was the package that best suited my own company's needs. However, much of the discussion in this series can easily be applied to other virtualization platforms.

Product considerations

When talking about enterprise environments, it is important to remember that one rarely deals with purely logical situations. That is to say, while you may both logically and intuitively know the right answer to the technical challenge you face, the reality is usually that those in management—and often the accounting department—get the final say in which route you take. You are often left building your project on older or otherwise inadequate technology.

And then there are the typical enterprise hardware refresh cycles and the associated headaches of introducing, and transitioning to, a new platform. For example, a core switching platform upgrade and customer port migration can take up to two years, depending on the scale and uptime requirements. This becomes a serious problem if the enterprise works on a three-year hardware refresh cycle. Nor does this take into account business requirements such as waiting 6 to 12 months before adopting a new technology or hardware platform, or choosing to run one full software version behind the current stable release, as some financial services companies opt to do. That alone can cause plenty of problems, especially with something as fast-moving as virtualization software.

For my own company's enterprise virtualization rollout, the biggest question when choosing the virtualization platform wasn't about performance or features—it was about support. Who do you call at 4am when everything is down and not coming back up on its own, upset customers are calling in, overnight jobs are failing, and you have about four hours to get every service back up and running before the office denizens pour in to start their day? The choice, then, is about being able to hold a third party accountable in front of your customers and managers for the unexpected downtime. Having that third party contractually obligated to fix your issue as soon as possible is the driving force here, and it is a big part of why so many open-source projects offer a commercial variant with paid-for support.

The simple truth is that most enterprises will not touch a lot of technology without a strong support contingent backing it up, especially when it comes to Tier 1 applications like Exchange Server, SAP, and SQL Server. There are exceptions, of course (e.g., DNS and Web servers), but it's an entirely different story trying to justify running software with very expensive support contracts on top of a free platform such as CentOS.

On that note, the best-supported virtualization platform for commodity x86 hardware has been VMware's ESX/vSphere platform, in use by 100 percent of the Fortune 100, 98 percent of the Fortune 500, and 96 percent of the Fortune 1000. Add in some aggressive marketing touting not only full support for virtualizing Tier 1 applications, but also support and certification of the VMware platform by the vendors of those very applications, and we had a winner in management's eyes.

However, a reputation like that comes at a price, for both the software and the support. That price can lead companies to cut corners and forgo hardware and software for a 1:1 disaster recovery environment, a development/testing/staging/user acceptance environment, and a lab environment for training. It's scarily common for companies to skip some of these, and in some cases all of them, meaning that everything has to be done on the live, in-production cluster, with a dash of hope and a pinch of prayer. I have seen one company take this to the extreme: it bought dozens upon dozens of blades, yet a few months into the project realized that it had spent the entire budget on hardware and production software licenses and could not afford even a single lab license to verify that software platform updates would be successful. The company then realized that, due to business process demands, most of the blades would stay powered off for months, until customer demand ramped up enough to require the additional capacity and thereby justify the expenditure.

Virtualization benefits: live migration, high availability, and fault tolerance

There are two core benefits of virtualization that improve the computing experience while also adding another layer of complexity to it: consolidation and resiliency. Consolidation is easy to understand—it is literally multiple operating system instances living as tenants on a single physical server, storage, and networking environment. Provided there are adequate hardware resources to meet the requirements of the guest OSes, the benefits here are fantastic and direct, especially in the multi-core CPU era. Resiliency, on the other hand, is about adding enhanced survivability to the running guest OS, typically through features like vMotion (a tool for migrating VMs between hosts), high availability (HA), and now fault tolerance (FT).
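
To put some rough numbers on the consolidation side, the following Python sketch estimates how many guests a single host can absorb. The host specs, per-VM requirements, and overcommit ratios are hypothetical figures chosen purely for illustration, not sizing recommendations; the point is simply that whichever resource runs out first sets the ceiling.

    # Back-of-the-envelope consolidation estimate (all figures hypothetical).
    HOST_CORES = 16        # physical cores in the host
    HOST_RAM_GB = 96       # physical RAM in the host
    CPU_OVERCOMMIT = 4.0   # vCPUs we are willing to schedule per physical core
    RAM_OVERCOMMIT = 1.2   # modest RAM overcommit (page sharing, ballooning)

    VM_VCPUS = 2           # a typical guest in this example
    VM_RAM_GB = 4

    cpu_limit = int(HOST_CORES * CPU_OVERCOMMIT / VM_VCPUS)
    ram_limit = int(HOST_RAM_GB * RAM_OVERCOMMIT / VM_RAM_GB)

    # The tighter of the two constraints determines the consolidation ratio.
    print(f"CPU-bound limit: {cpu_limit} VMs")
    print(f"RAM-bound limit: {ram_limit} VMs")
    print(f"Practical ceiling: {min(cpu_limit, ram_limit)} VMs per host")

With these made-up numbers, RAM becomes the bottleneck at 28 guests even though the CPU overcommit would allow 32—and in practice, memory is very often the first resource to run out on a consolidation host.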

vMotion enables the live migration of a virtual machine between physical hosts, with no downtime or interruption in service. While not that exciting for a small computing environment, it's a fantastic advancement for environments that, for example, need to guarantee 99.99 percent uptime of a service or server. (99.99 percent, or "four nines," uptime means approximately 52.56 minutes of downtime per calendar year, or 4.32 minutes per 30-day month.) When you consider that a single reboot can blow your uptime rating for the month, it comes as no surprise that most enterprise apps end up being clustered at the application level, the server level, the application tier level, or all of the above.
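
For the curious, here is a quick Python sketch of the arithmetic behind those downtime figures; it simply converts an uptime percentage into a downtime budget, and it reproduces the 52.56-minute and 4.32-minute numbers quoted above.

    # Convert an uptime percentage into an allowed-downtime budget.
    MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600 minutes in a non-leap year
    MINUTES_PER_MONTH = 30 * 24 * 60    # 43,200 minutes in a 30-day month

    def downtime_budget(uptime_percent):
        """Return (minutes of allowed downtime per year, per 30-day month)."""
        down = 1 - uptime_percent / 100
        return MINUTES_PER_YEAR * down, MINUTES_PER_MONTH * down

    for uptime in (99.9, 99.99, 99.999):
        per_year, per_month = downtime_budget(uptime)
        print(f"{uptime}%: {per_year:.2f} min/year, {per_month:.2f} min/month")

At four nines, a single unplanned reboot plus OS startup can consume most of a month's 4.32-minute allowance, which is exactly why live migration of running guests matters so much.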