ESX guest balancing

We have two VMware hosts in a cluster. We have the resources to run all guests on one host and use the other as an HA failover host if necessary. Should we be doing this? What's the best practice recommendation here?

Are we halving the likelihood of an outage if we just run all machines on one host?

It depends on how critical the guest machines are. If a guest machine is critical to your business, then best practice is to reserve enough resources for it on both the main host and the failover host. If you can live with a temporary outage, or can shut down other guest machines in an emergency, it is also OK to overload the hosts a bit.

There are about 6-7 critical machines, in two sets of interdependent ones. To keep the external side of the business up we need a web server, an API and a DB. These are linked, and there is no point in one being up if all three aren't.
Our internal CRM is also like this. It needs an API and a DB server. So the question is: if the servers are balanced between hosts and one host goes down, it takes everything down anyway, so should they all be on one host, to reduce the chance (by half?) of all systems being rendered nonfunctional?

If you do, then you should use Microsoft Failover Clustering, and have two node VMs per host.

In the event a host fails, service will still be available, apart from a few seconds to account for failover.

If you require higher availability than this, you should look at VMware FT.

BUT what if your SAN fails? All the storage your VMs are on?

What would you do then?

If you were using DRS, you would put all these VMs together in a group, and the group would be moved as a whole between hosts.

You need this whole group of VMs to function as a "vApp". If you put them all on a single host and it fails, you will be up and running again in 1-2 minutes. If you spread them between hosts, you'll get better performance, and the outage time is the same 1-2 minutes.

So spreading the VMs between hosts will give you better performance, and the same outage time of 1-2 minutes.

If you want better than that, consider Microsoft Failover Clustering, but with the increased costs of OS license and VM management that come from doubling up all VMs.

Okay, so you have decided by your design, that a small outage is acceptable.

I would opt for spreading the VMs between hosts: it will give you better performance, and the same outage time of 1-2 minutes.

If you put them all on the same host (which seems a waste of hardware and license, with one host idle doing nothing!), you are still down for the same 1-2 minutes.

I don't think there is a best solution here. Both have an outage of 1-2 minutes, waiting for HA to discover that the VMs are not responding and then restarting the failed VMs.

However, based on your applications, is it better for them ALL to fail than just half?

You would have to test this....

We would spread the load across hosts: better performance, fewer VMs to restart, fewer VMs to move should you need to do maintenance or restart hosts, and fewer VMs affected if services fail on a host and HA and vMotion cannot be used.

If a host has a 1-in-1000 chance of failing and the servers are linked, then spreading them across both hosts gives a 1-in-500 chance of the whole system failing?
If it's all on one host, then it's 1-in-1000 again, right?
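
For anyone who wants to check that arithmetic, here is a rough sketch. It assumes each host fails independently with the same probability, and that the linked servers are down as soon as any host they run on is down. The function name is made up for illustration; the 1-in-1000 figure is the one from the question above.

```python
def outage_probability(p_host: float, hosts_used: int) -> float:
    """Chance that at least one of the hosts carrying the linked guests fails.

    Assumes hosts fail independently, each with probability p_host.
    """
    return 1.0 - (1.0 - p_host) ** hosts_used


p = 1 / 1000  # per-host failure chance taken from the question

all_on_one_host = outage_probability(p, hosts_used=1)
spread_over_two = outage_probability(p, hosts_used=2)

print(f"all on one host: {all_on_one_host:.6f}")
print(f"spread over two: {spread_over_two:.6f}")
print(f"ratio:           {spread_over_two / all_on_one_host:.3f}")
```

Under those assumptions the spread layout comes out at just under 1 in 500, so "roughly double" holds. This only counts how often an outage starts, not how long it lasts or how many VMs HA has to restart.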

The performance issue is a non-starter: the hosts are massively over-specced, so we can run everything more than comfortably on one host. When you say both have a 1-2 minute outage, that's not quite right. They both have it, but with everything on one host it happens half as often.

If performance is not an issue, there are still other factors when hosts go wrong and you cannot use HA or vMotion.

BUT if your applications, front end and back end, are designed to work together, then leave them on the same host. You have the advantage of knowing what these applications and servers are, and how they work.

So leave them all on the same host, and leave the other host running as a "hot spare".

We don't worry about host failure anymore, because we treat hosts in a cluster like disks in a RAID set.

We just keep adding ESXi hosts. If a host fails, fine, it's dead. As long as there are enough resources left in the cluster, that's fine.
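
As a back-of-the-envelope version of that "enough resources in the cluster" check, something like the sketch below would do. It is a simplification that only looks at memory and assumes the worst case is losing the largest host. The function name and all the numbers are hypothetical, and it is not how vSphere HA admission control actually calculates capacity.

```python
def survives_one_host_failure(host_capacity_gb, vm_demand_gb):
    """Return True if all VMs still fit after the largest host is lost.

    Memory only; ignores CPU, hypervisor overhead and reservations.
    """
    remaining = sum(host_capacity_gb) - max(host_capacity_gb)
    return sum(vm_demand_gb) <= remaining


# Hypothetical two-host cluster and seven critical VMs (GB of RAM).
hosts = [256, 256]
vms = [32, 32, 64, 16, 16, 32, 16]

print(survives_one_host_failure(hosts, vms))
```

If this comes back False, you are relying on shutting down less important guests when a host dies.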

We don't find with today's technology that the hardware fails, but the software does!