I have some bad news for the true believers in virtualization-supported high availability – quite a few of them probably don’t understand how it works. VMware HA is a great solution, but the best it can do is restart a VM after it crashes or after the hypervisor host fails (and because it works at the VM level, it usually can’t detect a hung service). The VM has to go through a full power-up process, and all the services it runs have to perform whatever recovery procedures they need before the VM (and its services) are fully operational.
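
To make the VM-level limitation concrete, here’s a minimal sketch (Python, with a made-up health-check URL and timeout) of the application-level probe a hypervisor-level HA solution cannot perform – it sees whether the VM is powered on, not whether the service inside still answers.

```python
import urllib.request

# Hypervisor-level HA only knows whether the VM (or its host) is alive;
# it cannot tell whether the service inside still answers requests.
# A probe like this one is what actually detects a hung service.
# The URL and timeout are illustrative assumptions.
def service_is_healthy(url="http://10.0.0.5/healthz", timeout=2):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # refused, reset, or timed out: service is gone
        return False
```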

VMware FT is an even more interesting case. It runs two parallel copies of the same VM (and ensures they're continuously synchronized) – a perfect solution if you’re running a very lengthy procedure and don’t want a hardware failure to interrupt it. Unfortunately, software failures happen more often than hardware ones ... and if the VM crashes, both copies (running in sync) will crash simultaneously. Likewise, if the application service running in the VM crashes (or hangs), it will do so in both copies of the VM.

You can read more about high availability fallacies in an article I wrote for SearchNetworking (the title is a bit misleading) ... and remember: scale-out application architecture combined with load balancers is still the only way to reach true high availability.
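
Spelling out the scale-out idea with a minimal sketch (the instance addresses and the caller-supplied health check are assumptions): when one instance dies, the load balancer simply sends traffic to the survivors – there is no restart or recovery window to sit through.

```python
# Minimal scale-out sketch: identical stateless instances behind a
# load balancer. Addresses and the health check are made up.
INSTANCES = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]

def pick_backend(healthy):
    """Return the first instance that passes its health check."""
    for ip in INSTANCES:
        if healthy(ip):
            return ip
    raise RuntimeError("no healthy instance left")

# If 10.0.0.11 has failed, the next request flows to 10.0.0.12
# immediately -- no VM restart, no service recovery to wait for.
print(pick_backend(lambda ip: ip != "10.0.0.11"))  # -> 10.0.0.12
```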

Ivan, the problem is that creating a high-availability solution for the front end is a no-brainer: put two or more instances and an LB in front. Done.

The problem is providing an HA solution for anything that has to do with persistent local data. This includes the database in a (relatively) modern three-tier app, but also more traditional enterprise applications (Exchange being an example).

It is not even worth discussing how to provide resiliency for the front end. It's done. Focus your energies on the back end.

We totally agree – the back end is a tough nut. However, until you solve the DB (more precisely, the ACID data store) problem, you won't have a truly HA application. VMware HA or Windows failover clusters buy you nothing but an automatic restart after a hardware failure. The DB service still has to restart (and roll back all pending transactions) after every failure, which takes a significant amount of time.
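
To show what that restart window looks like from the application side, here's a hedged sketch (generic Python; connect() stands in for whatever a real DB driver provides) of the reconnect loop every client effectively runs while the database replays its log and rolls back in-flight transactions.

```python
import time

# What clients experience during an HA-triggered DB restart: keep
# retrying until crash recovery (redo plus rollback of pending
# transactions) finishes. connect() is a placeholder for a real
# driver call; the retry budget is an illustrative assumption.
def connect_with_retry(connect, retries=30, delay=2.0):
    for _ in range(retries):
        try:
            return connect()
        except ConnectionError:
            time.sleep(delay)  # the DB is still recovering
    raise RuntimeError("database did not come back in time")
```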

However, both SQL Server and MySQL offer a redundant server configuration, where the second server can take over immediately when the first one fails. High-end MySQL offers an even better distributed solution. So the problems can be solved ... but it's easier to offload them to someone else and believe in unicorn tears.

None of the things you are referring to, Ivan, provides a consistent failover scenario, to the best of my knowledge. The only reason it starts sooner on the other side is that it has lost all the transactions the application thinks have been committed. That's good enough if you are hosting an application that shares pictures... not good if you deal with money.
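
That trade-off is easy to model. In this toy sketch (two Python lists stand in for the servers; it's a model, not a real replication protocol), the asynchronous variant can "start sooner" on the other side only because it may be missing commits the application was already told about.

```python
# Toy model: two lists stand in for the primary and standby servers.
primary, standby = [], []

def commit_async(txn):
    primary.append(txn)   # application gets its "committed" ack here...
    return "committed"    # ...replication to the standby happens later

def commit_sync(txn):
    primary.append(txn)   # standby confirms before the ack goes out
    standby.append(txn)
    return "committed"

# If the primary dies right after commit_async() returns, the standby
# takes over quickly -- but without a transaction the application
# believes was committed. commit_sync() avoids that, at a latency cost.
```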

Having said this, there is clearly a trend toward making this back end more "scale-out" friendly... but there is a long way to go.

MySQL Cluster provides true failover. When a data node dies, at least one other node already has all of its data. If I remember correctly, it's supported in a single-IP-subnet configuration (with database replication recommended for long-distance needs).

SQL Server provides database mirroring (which can be synchronous if you want to retain total consistency).
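
As a hedged illustration of how a mirroring-aware client names both servers up front (the server and database names are invented, and the exact Failover_Partner keyword spelling varies across SQL Server driver versions):

```python
import pyodbc  # third-party SQL Server driver: pip install pyodbc

# The client is told about the mirror at connect time, so it can
# reconnect there when the principal fails. All names are made up;
# treat the keyword spelling as an assumption for your driver.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=sql-principal;"
    "Failover_Partner=sql-mirror;"
    "DATABASE=orders;"
    "Trusted_Connection=yes;"
)
```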

I am reading this article again... The funny thing is that I understand what you are trying to get at, but this is only true in an ideal world where applications are specifically written to support a setup that includes load balancers and a shared database. Although everyone wants this to be true, the reality is that we are nowhere near this ideal world.

In most enterprise organizations I have been to, at least 80% of the applications essential to day-to-day line-of-business operations don't support this kind of setup. This is one of the reasons HA is so widely adopted today. On top of that, there is a substantial cost associated with load balancers and a shared database configuration (which, yes, needs to be clustered / distributed as well), and that cost might be more than the SLA requires. In those cases, vSphere HA / FT / VM and App Monitoring are the way to go: five clicks and it is configured, no special skills needed to enable it... just point and click.

Once again, I agree that a vFabric load-balanced setup (shameless plug :)) would be ideal, but there are far too many legacy apps out there. Even in the largest enterprise orgs the IT department cannot control this; even the line of business cannot control it... the main reason being that the suppliers are not taking the time to invest.

You are making a lot of assumptions here. You are assuming that all critical applications have a huge database. Many applications used on a day-to-day basis have a small database. Many apps used at financial institutions, for instance, are simple apps that just calculate what a mortgage would cost. Although this might be a 20 MB app, it is essential to the line of business; you might not think it is critical, but they feel it is.

We are planning to use VMware's FT to run a redundant Citrix NetScaler VPX for our Internet-facing applications (10-30k req/sec). We could go for NetScaler's traditional cluster setup, but that would require buying 2x licenses. With our existing FT license we get just as much reliability at no extra cost. If the software inside that VM were to die, we would be in exactly the same situation as running it on a dedicated box.

Ivan Pepelnjak, CCIE#1354 Emeritus, is an independent network architect. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.
