VMware (and Microsoft) continue to make excellent progress driving the penetration of their data center virtualization offerings. Over half of the servers run by VMware customers are now virtualized. The progress has been so good that it is now time to ask two important questions. Is what is left to virtualize different from what has already been virtualized? And, if what is left is virtualizing business critical applications, will running them on the virtualization platform be any different from what we experience today?

VMware’s Virtualization Progress

At its VMworld conference in the fall of 2012, VMware presented the following graph depicting its progress to date and its expected progress. The graph shows that VMware is approaching virtualizing 60% of the servers in its customer base, and expects to reach 73% within two years.

% of Workloads Virtualized

Progress Virtualizing Business Critical Applications

Even more impressive than the overall virtualization penetration approaching 60% is the fact that VMware has started to accelerate its progress in virtualizing business critical applications. Notice that the year-over-year progress virtualizing Oracle databases has accelerated from 3% to 7%, and the year-over-year progress virtualizing SAP has accelerated from 10% to 12%. It is fair to say that SAP is a business critical application, and that in most cases an Oracle database is part of a business critical application. This demonstrates tremendous progress with both the most important applications and the ones where business owners have the greatest concerns about the impact of virtualization on the operation of those applications.

% of Workload Instances that are Virtualized

Does the P2V Process Have to Change for Business Critical Applications?

The progress virtualizing business critical applications, and the obvious push to continue that progress, raises a very important question: how should the P2V process be modified to ensure that these applications deliver acceptable performance once they are moved from dedicated physical hardware to shared virtualized hardware?

Let’s briefly review how the performance and capacity management part of the P2V process is done today. The most common process is to use the VMware Capacity Planner to assess how the workloads use physical resources and to then map those physical resources into how virtual resources are allocated to those workloads.

VMware Capacity Planner
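The resource-inference step just described can be sketched roughly as follows. This is a minimal illustration only; the 30% headroom factor and per-core speed are assumptions for the example, not Capacity Planner's actual algorithm:

```python
# Sketch: map observed physical peaks to a candidate virtual allocation.
# The headroom factor and core speed below are illustrative assumptions.

def size_vm(peak_cpu_mhz, peak_mem_mb, peak_iops, core_mhz=2400, headroom=1.3):
    """Translate observed physical peaks into a candidate VM allocation."""
    vcpus = max(1, -(-int(peak_cpu_mhz * headroom) // core_mhz))  # ceiling division
    return {
        "vcpus": vcpus,
        "memory_mb": int(peak_mem_mb * headroom),
        "required_iops": int(peak_iops * headroom),
    }

# A workload that peaked at 5 GHz of CPU, 6 GB of RAM, and 100 IOPS:
print(size_vm(peak_cpu_mhz=5000, peak_mem_mb=6144, peak_iops=100))
```

As the flaws listed below show, an allocation derived this way is a starting point, not a performance guarantee.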

The approach of measuring how much CPU, memory, network I/O, and disk I/O a workload generates or consumes on physical hardware and then translating that into virtual resource allocations worked reasonably well for tactical, low-hanging-fruit applications. However, when it comes to trying to assure the performance of a business critical application on a virtual infrastructure, this resource utilization inference process has the following flaws:

Virtual resources are not the same as physical resources. A virtual CPU is not exactly the same thing as a physical CPU; the key difference is that the virtual CPU is managed by the hypervisor, which imposes some load of its own upon the CPU.

A virtual resource may not perform in the exact same manner as its physical counterpart. If your workload is doing 100 IOPS and you size the virtual environment so that the capacity for that many I/O operations is there for that workload, there is no guarantee that the end-to-end latency of those I/O operations will be the same in the virtual environment as it was in the physical environment.

Sharing of resources introduces serious problems. Let's assume that a workload runs on a server with four cores. The natural step would be to assign that workload four virtual CPUs. But if that workload is running on a host with eight cores, and there are four other workloads that each want two virtual CPUs (the host is over-committed), then the hypervisor will find two free physical cores much more frequently than it will find four free cores at the same time. Therefore the right thing to do might well be to assign fewer virtual CPUs so that the workload gets scheduled in more frequently.

Finally, and most importantly, all of the resource allocation in the world is not going to guarantee acceptable performance. The only way to guarantee acceptable performance is to measure it on the physical side, use that measurement as the baseline for what is expected on the virtual side, and proceed accordingly.
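The co-scheduling point above can be illustrated with a toy simulation. This is a deliberately simplified model (it is not vSphere's actual scheduling algorithm, and the 60% busy probability is an assumption), but it shows why a wide VM waits longer for a scheduling opportunity than a narrow one:

```python
import random

# Toy model: on an 8-core host, four neighbor VMs each run their 2 vCPUs
# with some probability per scheduling tick. How often does the scheduler
# find 4 cores free at once, versus only 2 cores free?
# (Simplified illustration; not vSphere's actual co-scheduling algorithm.)

def free_core_odds(cores=8, neighbors=4, vcpus_each=2, busy_prob=0.6, ticks=100_000):
    four_free = two_free = 0
    for _ in range(ticks):
        busy = sum(vcpus_each for _ in range(neighbors) if random.random() < busy_prob)
        free = cores - busy
        four_free += free >= 4
        two_free += free >= 2
    return four_free / ticks, two_free / ticks

random.seed(1)
p4, p2 = free_core_odds()
print(f"4 cores free: {p4:.0%}   2 cores free: {p2:.0%}")
```

With these assumed numbers, two free cores turn up substantially more often than four, which is exactly why a narrower VM can see better effective performance on an over-committed host.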

The Role of APM in the Virtualization Process

First of all, let's define success when it comes to the virtualization of a business critical or performance critical application. Success means that the application owner is satisfied with the response time and throughput of the application. Success means that the users of the application are getting their jobs done without the application getting in their way. Success means that users are not calling the application owner complaining about slowness in the application or stalled transactions. Success means that the application owner is not pulling expensive and busy IT operations staff and architects into blamestorming meetings (the objective of which is to assign the blame for an application problem). The important thing is that none of these success criteria can be met merely by ensuring that the application is getting a sufficient or normal amount of resources.

The success criteria can only be met by measuring the response time and throughput of the application and using those metrics (not resource utilization) as the SLA baselines for performance. So the critical first step is to define performance as response time and throughput, not as resource utilization. Once you have redefined performance this way, you then need to measure it. Modern APM tools include response time vs. load scalability analysis as a feature. The graph below, from the scalability analysis that New Relic performed on the TVP site, shows that as load increases, response time remains stable below the desired threshold of 1,000 ms. This is an example of a web site that has to always be up, always deliver acceptable response time, and is running in a production VMware vSphere environment.
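A scalability analysis of this kind boils down to bucketing response times by load level and checking each bucket against the SLA threshold. Here is a minimal sketch; the sample data and the choice of the 95th percentile are illustrative assumptions, not New Relic's implementation:

```python
# Sketch: response time vs. load scalability check against a 1,000 ms SLA.
# Sample data and the 95th-percentile choice are illustrative assumptions.

def p95(samples):
    """Nearest-rank 95th percentile of a list of response times (ms)."""
    s = sorted(samples)
    return s[int(0.95 * (len(s) - 1))]

def scalability_report(samples_by_load, sla_ms=1000):
    return {load: ("OK" if p95(times) <= sla_ms else "BREACH")
            for load, times in sorted(samples_by_load.items())}

samples = {  # requests/min -> observed response times in ms
    100: [220, 240, 260, 300, 310],
    500: [280, 300, 340, 380, 420],
    900: [450, 600, 900, 1200, 1400],
}
print(scalability_report(samples))  # the 900 req/min bucket breaches the SLA
```

If response time stays flat as load rises, the application scales; if the high-load buckets breach, you have found the knee of the curve before your users do.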

Measuring Performance and Throughput of Business Critical Applications

Here is an important assertion: if you are going to successfully run business critical and performance critical applications in a shared and dynamic virtualized environment, you need to instrument those applications for response time and throughput before you virtualize them, use the results of that pre-virtualization assessment as the baseline definition of a successful virtual deployment, and then continue to monitor the application with the very same tool in production. If you do not do this, you are setting yourself up for a painful virtualization process (with application owners acting as server huggers), and even if you overcome those objections, a substantial amount of your time will be spent in blamestorming meetings.

The good news is that there is a great set of new APM tools to choose from. These tools focus on being easy to implement and easy to operate, which gets around the problems with the previous generation of APM tools. They fall into two categories, which map to the kinds of applications you have. If you have custom developed applications and need to rapidly find problems in your code, then you want a DevOps focused solution. If you have a mixture of purchased, custom developed, and compound applications, then you are going to want an AppOps focused tool.

Vendors in these two categories of tools are profiled below. Remember that the right time to start with these tools is while the application is still running on physical hardware. Only then will you be able to establish a baseline that will keep you out of blamestorming meetings once the application has been virtualized.

One of the profiled tools is characterized as follows:

Function: collection of logs and many other metrics into an easily searchable "big data" database

Delivery model: on-premise or SaaS

Collection: a wide variety of collectors that interface to log sources and other sources of data

Applies to: any application for which a log of some type is generated

Conclusion

The process of virtualizing business critical applications should start with using a modern APM tool to establish a response time and throughput baseline for the application while it is still on physical hardware. That baseline should then serve as the reference SLA once the application is virtualized. Performance needs to be defined as response time and throughput, not resource utilization.


Bernd Harzog is the Analyst at The Virtualization Practice for Performance and Capacity Management and IT as a Service (Private Cloud).

Bernd is also the CEO and founder of APM Experts, a company that provides strategic marketing services to vendors in the virtualization performance management and application performance management markets.

Prior to these two companies, Bernd was the CEO of RTO Software, the VP Products at Netuitive, a General Manager at Xcellenet, and Research Director for Systems Software at Gartner Group. Bernd has an MBA in Marketing from the University of Chicago.