Google Compute Engine and live migration

Google Compute Engine (GCE) has been a potential cloud-emperor contender in the shadows, and although GCE is still in beta, it’s been widely speculated that Google will likely be the third vendor in the trifecta of big cloud IaaS market-share leaders, along with Amazon Web Services (AWS) and Microsoft Windows Azure.

Few would doubt Google’s technology prowess, if it decides to commit itself to a business. A critical question has remained, though: Will Google be able to deliver technology capabilities that can be used by mere mortals in the enterprise, and market, sell, contract for, and deliver service in a way that such businesses can use? (Its ability to serve ephemeral large-scale compute workloads, and perhaps meet the needs of start-ups, is not in doubt.)

One of the most heartburn-inducing aspects of GCE has been its scheduled maintenance. To quote Google: “For scheduled zone maintenance windows, Google takes an entire zone offline for roughly two weeks to perform various, disruptive maintenance tasks.” Basically, Google has said, “Your data center will be going away for up to two weeks. Deal with it. You should be running in multiple zones anyway.”

Even most cloud-native start-ups aren’t capable of easily executing this way. Remember that most applications are architected to have their data locally, in the same zone as the compute. Without using Google’s PaaS capabilities (like Datastore), this means that the customer needs to move and/or replicate storage into another zone, which also increases their costs. Many applications aren’t large enough to warrant the complexity of a multi-zone implementation, either — not only business applications, but also smaller start-ups, mobile back-end implementations, and so forth.

Inherently, then, a hard-line stance on taking zones offline for maintenance limited GCE’s market opportunity. Despite positioning this as a hard-line stance previously, Google has clearly changed its mind, introducing “transparent maintenance”. This is accomplished with a combination of live migration technology and some innovations in how Google carries out physical data center maintenance. It’s an interesting indication of Google listening to prospects and customers and flexing to do something that has not been the Google Way.

Not only will Google’s addition of live migration help with data center maintenance, but more importantly, it will mitigate downtime related to host maintenance. Although AWS, for instance, tries to minimize host maintenance in order to avoid instance downtime or reboots, host maintenance is necessary. It’s highly useful to have a technology that allows you to perform host maintenance without downtime for the instances, because this encourages you not to delay that maintenance (you want to keep the underlying host OS, hypervisor, etc. up to date).

VMware-based providers almost always do live migration for host maintenance, since it’s one of the core compelling features of VMware. But AWS, and many competitors that model themselves after AWS, don’t. I hope that Google’s decision to add live migration into GCE pushes the rest of the market — and specifically AWS, which today generally sets the bar for customer expectations — into doing the same, because it’s a highly useful infrastructure resilience feature, and it’s important to customers.

More broadly, though, AWS hasn’t really had innovation competitors to date. Microsoft Azure is a real competitor, but other than in PaaS, they’ve largely been playing catch-up. Thanks to its extensive portfolio of internal technologies, Google has the potential ability to inject truly new capabilities into the market. Similar to what customers have seen with AWS — when AWS has been successful at introducing capabilities that many customers weren’t really even aware that they wanted — I expect Google is going to launch truly innovative capabilities that will turn into customer demands. It’s not that AWS is going to simply mount a competitive response — it will become a situation where customers ask for these capabilities, pushing AWS to respond. That should be excellent for the market.

It’s worth noting that the value of Google is not just GCE; it is Google Cloud Platform as a whole, including the PaaS elements. This is similarly true with Microsoft Azure. And although AWS seems to be broadly bucketed as IaaS, in reality its capabilities overlap into the PaaS space. These vendors understand that the goal is the ability to develop and deliver business capabilities more quickly, not to provide cheap infrastructure.

Capabilities equate to lock-in, by the way, but historically, businesses have embraced lock-in whenever it results in more value delivered.

1) GCE using KVM makes migration much easier than with some other hypervisors – it’s much less fussy about having exactly the same hardware at both ends. vMotion needs you to be on the same stepping level of CPU; KVM just needs you to have at least the same number of bits. This is particularly important if the scheduled maintenance is to introduce newer hardware (no point in migrating one way if you can’t migrate back because of an upgrade).

2) GCE networks can span availability zones and regions (unlike, say, AWS VPC subnets). Having the subnet reach from source to destination is pretty important if you want to keep the same IP following migration, so this is a good example of Google using its more advanced networking architecture to facilitate innovations elsewhere in the service.

1. Neither Google nor AWS reveals the distance between its AZs. If it is in the 0-60 mile range, then live migration is possible. Ping latencies intra-AZ and cross-AZ indicate that these are the likely distances.
2. Live migration is a complex, expensive solution, yet it doesn’t guarantee high availability. Customers are still on the hook for HA for unanticipated failures. In this respect, the AWS approach of “no zone maintenance” is the more sensible approach. If AWS adds live migration for host maintenance instead of bringing down your VMs, that would be far superior to GCE’s approach.
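
As a rough sanity check on the distance argument in the comment above, ping times can be turned into an upper bound on how far apart two zones are. This is only a back-of-the-envelope sketch: it assumes light travels through optical fiber at about two-thirds of its vacuum speed (roughly 200 km per millisecond), and it ignores switching delay and indirect cable routes, so real distances will be shorter than the bound. The function names are made up for illustration and are not part of any provider's tooling.

```python
# Assumption: signal speed in fiber is ~200 km per millisecond (about 2/3 of c).
FIBER_KM_PER_MS = 200.0
KM_PER_MILE = 1.609344


def max_distance_miles(rtt_ms: float) -> float:
    """Upper bound on the one-way distance implied by a round-trip time."""
    one_way_ms = rtt_ms / 2.0  # a ping covers the path twice
    return one_way_ms * FIBER_KM_PER_MS / KM_PER_MILE


def min_rtt_ms(distance_miles: float) -> float:
    """Lowest possible round-trip time over a given one-way distance."""
    return 2.0 * distance_miles * KM_PER_MILE / FIBER_KM_PER_MS


if __name__ == "__main__":
    # A cross-zone ping of 1 ms cannot span much more than about 60 miles.
    print(f"1 ms RTT -> at most {max_distance_miles(1.0):.1f} miles")
    print(f"60 miles -> at least {min_rtt_ms(60.0):.2f} ms RTT")
```

Under these assumptions, a cross-zone round trip of about a millisecond or less is consistent with zones sitting inside the 0-60 mile range, though the measurement alone cannot say how much closer they actually are.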
