The VM Mobility Myth

It finally dawned on me that if I have a few hundred to a thousand people sitting in front of me at one of my presentations, I should take advantage of that collective intelligence to perform a little selfish information gathering.

I’ve had an opinion for quite some time that the rampant squawking and generalizations regarding hyper-mobility, suggesting VM sprawl and uncontrolled instance spawning, were nothing more than FUD given where we are today with the technology and platforms that supposedly enable it.

We constantly hear how organizations big and small are suffering (or will) from the evils of virtualization by way of VMs and information turning up everywhere, putting your data and assets at risk. It gets worse with the multi-tenancy issues surrounding moving to “The Cloud,” they say.

So in a couple of my panels at RSA, I asked for some sanity and fact checking.

Informally, 95% of those in attendance at the two RSA panels I engaged run VMware in production. I asked how many of those in the audience, in cases OTHER than failure, take advantage of VM mobility (such as VMotion) or some other technological capability to provide autonomic mobility of VMs in their enterprises.

About 5 people (in crowds of 100+ and 500+ respectively) raised their hands. Given that I asked this question the second time in front of a huge audience at RSA while sitting next to the CTOs of Citrix and VMware, I’m sure they were pretty surprised by the answer, too.

The reality is that in these environments, even extremely complex and large examples, there simply isn’t that much mobility, and customers are more interested in resilience than they are in agility in terms of what this feature brings. That’s a really interesting and important point.

The reason for this is pretty simple: the capability to provide for integrated networking and virtualization coupled with governance and autonomics simply isn’t mature at this point. Most people are simply replicating existing zoned/perimeterized non-virtualized network topologies in their consolidated virtualized environments and waiting for the platforms to catch up. We’re really still seeing the effects of what virtualization is doing to the classical core/distribution/access design methodology as it relates to how shackled much of this mobility is to critical components like DNS, IP addressing and layer 2 VLANs. See Greg Ness and Lori MacVittie’s scribblings.

Furthermore, workload distribution is simply impractical for anything other than monolithic stacks because the virtualization platforms, the applications and the networks aren’t at a point where, from a policy or intelligence perspective, they can easily and reliably self-orchestrate.

Don’t get me wrong, autonomics and business process/governance feedback loops are most definitely coming — and are absolutely required for Cloud — but they’re not here and not used much today. This is the hard stuff we’ve skipped over because it’s really freaking hard. Don’t believe me? See how long folks like HP have been at their “Adaptive Enterprise” solutions. That’s why unified fabrics make so much sense; you can get your arms around automating much, much more with a consistent set of enforceable policies and SLAs.

So the next time someone brings up this epidemic of runaway VMs, ask them to kindly provide you with empirical data demonstrating it. Just because it *might* happen doesn’t mean it *does* happen.

So many of the purported risks associated with virtualization and Cloud are based on what might happen. There’s a huge difference between possibility and probability. One of them is used for prudent analysis and risk assessment, the other for selling you something. I’ll let you figure out which is which.

The management, visibility and security tools and capabilities are arriving on our doorsteps. When and if this sort of problem actually becomes a problem, it’s quite likely we’ll have a good set of solutions to deal with it.

Until then, challenge these assertions and fears, and ask for proof, not pandering to panic.

IT Security, and even network engineering and "regular" system administration, is often at a SEVERE disconnect from the virtualization infrastructure administrators.

They have NO IDEA what is going on. The virtualization infrastructure and operation thereof is so siloized in most organizations that literally anyone outside of it sees nothing — they're invisible!

Servers come and go — disappear and reappear. Not because of VMotion!!!! I repeat — not because of VMotion!!! Because of the fact that you can shut down a server and copy it somewhere e.g. snapshots (or templates/clones) and bam! All of a sudden a new server comes up that is quite a lot like the old one but possibly not even exactly that one.

Consolidation calls for LESS servers, not MORE. This is another fallacy in your discussion on VM Mobility.

Runaway VMs are real, just not in the way that you think (almost in the opposite way). There is reason to panic on security grounds: servers may fall outside of AD and GPO OUs and filters, they may move across security or data boundaries, and they definitely could change MAC or IP addresses in a way that violates a live security policy such as a firewall rule, ACL, port-security, et al.
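
One way to turn the "may" and "could" above into something countable is to diff two inventory snapshots and list every VM that appeared, disappeared, or changed its MAC, IP, or zone. The following is only a minimal sketch: the `VmRecord` fields, the VM names, and the addresses are hypothetical illustrations, not the output of any real virtualization API.

```python
from typing import Dict, List, NamedTuple


class VmRecord(NamedTuple):
    # Hypothetical inventory fields; a real export would differ.
    mac: str
    ip: str
    zone: str  # label for a security or data boundary


def diff_inventory(before: Dict[str, VmRecord],
                   after: Dict[str, VmRecord]) -> List[str]:
    """Report VMs that appeared, disappeared, or changed MAC/IP/zone."""
    findings: List[str] = []
    for name in sorted(before.keys() | after.keys()):
        old, new = before.get(name), after.get(name)
        if old is None:
            findings.append(f"{name}: appeared")
        elif new is None:
            findings.append(f"{name}: disappeared")
        else:
            # Compare field by field so each change is reported separately.
            for field in VmRecord._fields:
                if getattr(old, field) != getattr(new, field):
                    findings.append(
                        f"{name}: {field} changed "
                        f"{getattr(old, field)} -> {getattr(new, field)}"
                    )
    return findings


# Hypothetical snapshots taken a day apart.
monday = {
    "owa": VmRecord("00:50:56:aa:00:01", "10.0.1.10", "dmz"),
    "exchange": VmRecord("00:50:56:aa:00:02", "10.0.2.10", "internal"),
}
tuesday = {
    "owa": VmRecord("00:50:56:aa:00:09", "10.0.1.10", "dmz"),       # new MAC
    "owa-clone": VmRecord("00:50:56:aa:00:01", "10.0.2.50", "internal"),
}

for line in diff_inventory(monday, tuesday):
    print(line)
```

Each finding is a candidate incident to check against firewall, ACL, or port-security policy, which is the kind of count the discussion here keeps asking for.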

Again, "may" or "might," "could" or "can." If this sort of thing happens and you're saying it's not detected, how besides fishing stories do you prove it does?

Shutting down/starting up VMs is, despite corner cases of command line fu, detectable, auditable and controllable and will be even more so as the toolsets improve (both built-in and augmented).

I'm specifically addressing capabilities such as VMotion since the press and people selling solutions continue to suggest that VMs simply move willy-nilly across the enterprise with no recourse.

So your statement that servers come and go, disappear and reappear, "Not because of VMotion!!!!" is exactly my point.

Most environments still make use of collections of flat networks, so what difference does it make where a VM spins up in a single VLAN/broadcast domain if there are no other security controls outside the guest?

The issue here is rogue behavior by admins: humans violating policy.

Even in segmented networks where zones are consolidated, I walk into virtualized datacenters all the time and see blade servers with labels on them that read "Exchange server" or "OWA." Why? Because they don't spin them up anywhere else other than on those physical hosts.

Could they? Sure. Do they? I don't know anyone who has substantive metrics that demonstrate this is a huge problem. Until I do, it's FUD.

I don't understand your comment where you said "Consolidation calls for LESS servers, not MORE. This is another fallacy in your discussion on VM Mobility" and I can't grok what "fallacy" you're referring to.

Please provide incident response statistics and/or someone I can talk to in order to validate your comments regarding "runaway VMs" as the 1000 or so people I've interacted with in the last week suggest otherwise.

Humans don't violate policy if there isn't any policy. Virtualization administrators are being told to consolidate infrastructure and save money using virtual infrastructure in almost any way that they can.

The statistic you may be looking for is this: when hardening networks/servers via policy — how do you count networks/servers that are decommissioned (or partially decommissioned)? My point is that you count them as non-existent or "half", which equates to less security. I would consider these kinds of cold (or live) migrations as policy violations and therefore incidents. What else are they?

I'm also saying that the thousand people in your audience don't know what is happening to their infrastructure because they are siloized away from it. If only 5 people in your audience raised their hands about VMotion, then it's quite likely that only those 5 people could even define VMotion. I'm suggesting that people don't know because they can't figure it out with the resources they are currently using.

I have to agree with your main assertion here; we haven't seen this sort of problem yet within Oracle for the very reasons you point out.

All of our internal cloud-like grids leveraging virtualization are carved up into little pools of resources. We do use live migration of VMs to consolidate and free up chunks for larger VMs periodically, but as you've noted above it's within the same subnet or VLAN. We've been struggling to overcome the lack of automatic and dynamically configured networks for much longer than the 4 years we've been using virtualization in production.

We've been working closely with many of the network vendors and have had a lot of early beta and even prototype HW in our lab, but as of yet no one is quite there with the right combination of factors to make VM sprawl from mobility an issue in the near future.

The one area in which I really see value in using VMotion or similar functionality is in our large test and dev grids that see a lot of cycling of VMs and at times abysmally low utilization. If we had a robust means of migrating VMs across a very large pool of servers, we could actually do so and deep sleep or power down unused servers. It may not seem like much could be gained by this, but when you're talking about hundreds of thousands of servers in some of the larger companies out there it can add up very fast.
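
To make the power-down idea concrete, here is a rough sketch of the packing step, using a first-fit decreasing heuristic. It assumes identical hosts and a single load number per VM. The VM names, loads, and capacity are made up for illustration, and nothing here reflects how any vendor's scheduler actually works.

```python
from typing import Dict, List


def consolidate(vm_load: Dict[str, int], host_capacity: int) -> List[List[str]]:
    """Pack VMs onto as few identical hosts as first-fit decreasing finds.

    vm_load maps a (hypothetical) VM name to its demand in arbitrary units.
    Returns one list of VM names per host that has to stay powered on.
    """
    hosts: List[List[str]] = []  # VM names placed on each host
    free: List[int] = []         # remaining capacity of each host

    # Place the largest VMs first; ties are broken by name for determinism.
    for name, load in sorted(vm_load.items(), key=lambda kv: (-kv[1], kv[0])):
        if load > host_capacity:
            raise ValueError(f"{name} does not fit on any host")
        for i, room in enumerate(free):
            if load <= room:
                hosts[i].append(name)
                free[i] -= load
                break
        else:
            # No existing host has room, so keep one more host powered on.
            hosts.append([name])
            free.append(host_capacity - load)
    return hosts


# Hypothetical test/dev grid: six lightly loaded VMs, one per host today.
demo = {"dev1": 10, "dev2": 20, "dev3": 5,
        "test1": 30, "test2": 15, "test3": 10}
placement = consolidate(demo, host_capacity=50)
print(len(placement), "hosts needed instead of", len(demo))
```

First-fit decreasing is a heuristic and does not always find the minimum number of hosts, and a real grid would also have to respect the subnet and VLAN limits described above. It does show why low utilization across a very large pool translates into many servers that could be put to sleep.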

Next time you get the opportunity to ask your question of an audience, ask one more:

How many utilized an outside consultant to help them set up their environment?

I think you'll find that many of the folks who do not use the "advanced" features did the work themselves and don't understand the technologies that support dynamic workload mobility. This lack of understanding results in a fear of the technology (fear of the unknown) and a reluctance to use it.

I'd be interested to find out if your audience substantiates my suspicions!

"The management, visibility and security tools and capabilities are arriving on our doorsteps. When and if this sort of problem actually becomes a problem, it’s quite likely we’ll have a good set of solutions to deal with it."

One hopes so, and one would like to believe it's because we're looking at what "could be" and trying to be proactive rather than reactive, when it's a much more difficult problem to solve. The ability to recognize where the pitfalls may lie earlier rather than later is the cornerstone of strategic thinking, and it's exactly that kind of thinking we need to continue to do in all aspects surrounding mass virtualization (security, reliability, portability) to ensure that many of the road bumps are ironed out before folks encounter them.

After all, if folks have the tools they need to properly manage such a dynamic environment, then "when" is unlikely to happen because the risk has already been recognized – and mitigated.

Excellent "real world" feedback and reminder that the sky is not falling. Yet. 😉

I'm a little surprised at the number of people who said they are using the VMotion ability with the full automatics. It must be what Ken Cline mentioned in the case of 'fear of the unknown'. Why would you pay for the enterprise version of such a license and then turn it to manual mode? Aren't you doing a disservice to your organization?

On the comment of 'Runaway VM's' — Am I missing something?

vAdministrators are the only group that can configure these types of migrations through a DRS/HA policy or a command line execution. In either case there is a finite number of servers that the sprawl can span (the cluster) and a finite set of storage that the cluster can see. These all seem like finite areas where controls can be put in place to define the proper execution. Storage, network, OS name, etc. can all be controlled by existing policy.
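
As a rough illustration of how bounded that space is, the sketch below lists the only hosts a VM could land on if migration is limited to the same cluster and to hosts that can see the VM's storage. The `Host` type, the blade and LUN names, and the two rules are assumptions made for the example. It is not a model of actual DRS/HA behavior.

```python
from typing import Dict, FrozenSet, List, NamedTuple


class Host(NamedTuple):
    cluster: str                 # hypothetical cluster label
    datastores: FrozenSet[str]   # storage this host can see


def allowed_targets(current_host: str, vm_datastore: str,
                    hosts: Dict[str, Host]) -> List[str]:
    """Hosts in the same cluster that can also see the VM's datastore."""
    cluster = hosts[current_host].cluster
    return sorted(
        name for name, host in hosts.items()
        if name != current_host
        and host.cluster == cluster
        and vm_datastore in host.datastores
    )


# Hypothetical four-blade environment split across two clusters.
inventory = {
    "blade1": Host("clusterA", frozenset({"lun1", "lun2"})),
    "blade2": Host("clusterA", frozenset({"lun1"})),
    "blade3": Host("clusterA", frozenset({"lun2"})),
    "blade4": Host("clusterB", frozenset({"lun1"})),
}

# A VM running on blade1 with its disks on lun1.
print(allowed_targets("blade1", "lun1", inventory))
```

Anything outside that list would require an administrator to change cluster membership or storage presentation first, which is a policy question rather than a technology one.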

As far as I can tell, the same administrator that has access to the iLO port and a CD kit can do the same amount of damage or duplication without authorization.

I don't think it is the technology here. I agree with Hoff. It's the admins.

Chris, how did you ask the question? Those numbers really don't align with our surveys of how people use VMotion and DRS. Either they didn't know (as Ken supposes) or they were thinking you were asking about moving to another zone/DC, not just moving a VM to its neighbor blade for maintenance.

(I agree largely with the bigger point you're making, although 'sprawl' usually refers IMHO to old-school provisioning processes that don't include chargeback or the realities of a new server in 10 mins, and don't have anything to do with VMotion.)

I agree that sprawl and vMotion are unrelated. From my experience sprawl is typically brought up in a couple of ways. First are organizations that already have a substantial number of VMs deployed but poor management solutions in place. Those customers typically are trying to 'get their hands around what they currently have'. Many consider that they already are in a state of sprawl.

The second is all about users circumventing process. This often leads to discussions on provisioning platforms like Lifecycle Manager to put controls into place that will 'prevent sprawl'. You can bet that VMs in the cloud and cloud sprawl will be the next marketing buzz item that vendors push.

On the number of 5 out of hundreds using vMotion, that just sounds low... but you'd likely get a similar response if you asked a VI admin if their company used a particular feature within a security vendor's product. You might want to post the question to the VMware community boards.