What is the most suitable deployment mode for vCenter Single-Sign On (SSO) in an environment where there are two (2) physical datacenters running in an Active/Active configuration?

Requirements

1. The solution must be a fully supported configuration
2. Meet/Exceed RTO of 4 hours
3. Environment must support SRM failover between Datacenter A and Datacenter B where an entire datacenter is lost

Assumptions

1.Three (3) vCenter servers will be used, One (1) at Datacenter A and Two (2) at Datacenter B
2. Environment has Two (2) Production clusters (One per Datacenter), and One (1) vCloud Cluster at Datacenter B each with a dedicated vCenter
3. Stretched clusters are not used
4. All vSphere Infrastructure servers (including SSO) are protected by SRM and vSphere HA
5. Inter-site Metropolitan Area Network is high bandwidth (>10Gb) , low latency (<5ms) and highly available (99.999%)
6. The average number of authentications per second for each SSO instance is <30 (Configuration Maximum)

Constraints

1. The environment uses traditional agent based backup solution which may not meet RPO/RTO requirements

Motivation

1. Future proof the environment

Architectural Decision

1. Use “Multi-site” SSO deployment mode
2. Do not use SSO “High Availability” clusters
3. The Primary SSO server will be at Datacenter B
4. The remaining vCenter servers will be “Secondaries” and point to the Datacenter B Primary SSO instance
5. The each SSO instance will be on a dedicated Windows 2008 x64 R2 instance
6. Each SSO instance will use the bundled SQL database
7. (Optional) For greater availability , vCenter Heartbeat will be used to protect each SSO instance

Justification

1. The environment is being designed (where) possible to sustain a Metropolitan Area Network failure between the two (2) datacenters

2. If “High Availability” mode is used, at least one (1) vCenter would be accessing SSO across the MAN link which introduces an unnecessary dependency on the MAN links

3. “High Availability” currently requires manual intervention which can be complicated and problematic

4. “Basic” mode prevents the use of Linked Mode which will make management of the environment more difficult

5. Using Multisite mode allows faster access to authentication services as each SSO instance is configured with Active Directory servers located at the same datacenter.

6. Multisite mode is required for the use of Linked-Mode and Linked Mode will make day to day management easier

7. If one instance SSO goes offline for any reason, this will not impact production virtual machines. It will simply prevent any authentication to the affected vCenter server.

8. Having the SSO Primary at Datacenter B ensures only traffic from one vCenter (Datacenter A vCenter) traverses the MAN link as the third vCenter (for vCloud Director) is at Datacenter B

9. In the event of Datacenter B having a full datacenter wide failure for any reason, the Primary SSO instance being offline will not impact the management of Datacenter A OR the ability for the environment being recovered by SRM.

10. During an SSO upgrade, multiple vCenter’s cannot co-exist and using a centralized (or shared) SSO instance would overly complicate the upgrade process and lead to extended impact to the vSphere environments.

Alternatives

1. Use “Basic” Mode, resulting in a standalone version of SSO for each vCenter server

2. Use “High Availability Cluster” (Shared the same SSO database and identity sources) with one SSO server per physical datacenter

3. Use “Multisite” deployment with “High Availability Clusters” per datacenter

4. Host SSO database on a SQL Server

5. Run SSO on the vCenter server with or without the SSO database locally

6. Run a single SSO instance shared by all three (3) vCenters and use vCenter Heartbeat running across the MAN to protect SSO

Implications

1. Without a “High Availability Cluster” or SSO being protected by vCenter Heartbeat at each datacenter, the SSO for each site is a Single point of failure where authentication to the affected vCenter will fail

2. In the event of one (1) SSO server failing at Datacenter A, the SSO role does not failover to Datacenter B, or vice versa. In this case, All authentication requests on the site where SSO has failed, will fail.

3. Requires the installable version of SSO, which is Windows Only. The use of the vCenter Server Appliance (VCSA) is not available.

What are the most suitable HA / host isolation settings where the environment uses Storage (IBM SVC) with FC connectivity via a dedicated highly available Storage Area Network (SAN) fabric where ESXi Management and Virtual Machine traffic run over a highly available data network?

Requirements

1. Ensure in the event of one or more hosts becoming isolated, the environment responds in an automated manner to recover VMs where possible

Assumptions

1.The Network is highly available (>99.999% availability)
2. The Storage is highly available (>99.999% availability)
3. vSphere 5.0 or later
4. ESXi hosts are connected to the network via two physical separate switches via two physical NICs

Configure Datastore Heartbeating with “Select any of the clusters datastores”

Configure Host Isolation Response to: “Shutdown”

Justification

1. When using FC storage, it is possible for the Management and Virtual Machine Networks to be unavailable, while the Storage network is working perfectly. In this case Virtual machines may not be able to communicate to other servers, but can continuing reading/writing from disk. In this case, they will likely not be servicing customer workloads, as such, Shutting the VM down gracefully allows HA to restart the VM/s on host/s which are not isolated gives the VM a greater chance of being able to resume servicing workloads than remaining on an isolated host.
2. Datastore heartbeating will allow HA to confirm if the host is “isolated” or “failed”. In either case, Shutting down the VM will allow HA to recover the VM on a surviving host.
3. As all storage is presented via Active/Active IBM SVC controllers, there is no benefit is specifying specific datastores to be used for heartbeating
4. The selected isolation addresses were chosen as they are both highly available devices in the network which are essential for network communication and cover the core routing and switching components in the network.
5. In an environment where the Network is highly available an isolation event is extremely unlikely as such, where the three (3) isolation addresses cannot be contacted, it is unlikely the network can be restored in a timely manner OR the host has suffered multiple concurrent failures (eg: Multiple Network Cards) and performing a controlled shutdown helps ensure when the network is recovered, the VMs are brought back up in a consistent state, OR in the event the isolation impacts only a subset of ESXi hosts in the cluster, the VM/s can be recovered by HA and resume normal operations.

Alternatives

1. Set Host isolation response to “Leave Powered On”
2. Do not use Datastore heartbeating
3. Use the default isolation address

Implications

1. In the event the host cannot reach any of the isolation addresses, virtual machines will be Shutdown
2. Using “Shutdown” as opposed to “Power off” ensures a graceful shutdown of the guest operating system, however this will delay the HA restart of the VM for up to 5 mins (300 seconds) if VMware Tools is unable to do a controlled shutdown, in which case after 300 seconds a “Power Off” will be executed.
3. In the unlikely event of network instability, VMs may be Shutdown prematurely.