Knowledge Base

vSphere 5.x support with NetApp MetroCluster (2031038)

Purpose

This article provides information about deploying a vSphere Metro Storage Cluster (vMSC) across two data centers or sites using the NetApp MetroCluster solution with vSphere 5.0, 5.1, or 5.5. For ESXi 5.0, 5.1, and 5.5, the article applies to FC, iSCSI, and NFS implementations of both Stretch and Fabric MetroCluster.

Resolution

What is vMSC?

vSphere Metro Storage Cluster (vMSC) is a new certified configuration for NetApp MetroCluster storage architectures. A vMSC configuration is designed to maintain data availability beyond a single physical or logical site. A storage device is supported in a vMSC configuration after it passes vMSC certification; all supported storage devices are listed on the VMware Storage Compatibility Guide.

What is a NetApp MetroCluster?

NetApp MetroCluster is a synchronous replication solution between two NetApp controllers that provides storage high availability and disaster recovery in a campus or metropolitan area. A MetroCluster (MC) configuration consists of two NetApp controllers clustered together, residing either in the same data center or at two different physical locations. MC handles any single failure in the storage configuration, and certain multiple failures, without disrupting data availability, and it provides single-command recovery in the event of a complete site disaster.

What is MetroCluster TieBreaker?

The MetroCluster TieBreaker (MCTB) solution is a plug-in that runs in the background as a Windows service or UNIX daemon on an OnCommand Unified Manager (OC UM) host; the host can be a physical machine or a virtual machine. MCTB provides automated failover in scenarios where MetroCluster cannot fail over automatically on its own, such as an entire site failure.

MCTB continuously monitors the MetroCluster controllers and corresponding network gateways from an OnCommand server at a third location. When MCTB detects conditions that require a Cluster Failover on Disaster (CFOD), it issues the necessary commands to initiate the CFOD. Log messages and OnCommand events are generated when necessary to keep the operator informed as to the state of the MetroCluster and MCTB.
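The decision logic described above can be sketched in a few lines. This is an illustrative model only, not MCTB code: the site names, hostnames, and the TCP-connect probe are all assumptions standing in for MCTB's actual health checks.

```python
import socket

# Hypothetical endpoints for illustration only.
SITES = {
    "site1": {"controller": "ctrl-a.example.com", "gateway": "gw-a.example.com"},
    "site2": {"controller": "ctrl-b.example.com", "gateway": "gw-b.example.com"},
}

def is_reachable(host, port=22, timeout=2.0):
    """Probe a host with a TCP connect; a stand-in for MCTB's health checks."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def needs_cfod(site, probe=is_reachable):
    """A site is declared dead only when both its controller and its network
    gateway are unreachable from the third (tie-breaker) site, which
    distinguishes a complete site failure from a partial outage."""
    hosts = SITES[site]
    return not probe(hosts["controller"]) and not probe(hosts["gateway"])

def monitor_once(probe=is_reachable):
    """One monitoring pass: return the site (if any) for which a cluster
    failover on disaster (CFOD) should be initiated."""
    for site in SITES:
        if needs_cfod(site, probe):
            return site  # the real tie-breaker would issue CFOD commands here
    return None
```

Requiring both the controller and its gateway to be unreachable mirrors the article's point that MCTB monitors the controllers and the corresponding network gateways before declaring a disaster.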

Configuration Requirements

These requirements must be satisfied to support this configuration:

For distances under 500 m, a Stretch MetroCluster configuration can be used; for distances over 500 m and up to 160 km (on systems running Data ONTAP 8.1.1), a Fabric MetroCluster configuration can be used.

The maximum round-trip latency between the two sites must be less than 10 ms for Ethernet networks and less than 3 ms for SyncMirror replication.
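A quick way to sanity-check the round-trip budget is to time TCP connections across the inter-site link. This sketch is illustrative only and is not a substitute for proper network qualification; the thresholds come from the requirement above, while the target host and port in any real use would be your own.

```python
import socket
import time

ETHERNET_RTT_LIMIT_MS = 10.0   # maximum inter-site RTT for Ethernet networks
SYNCMIRROR_RTT_LIMIT_MS = 3.0  # maximum RTT for SyncMirror replication

def average_rtt_ms(host, port, samples=5, timeout=2.0):
    """Time several TCP connects and return the mean round trip in ms."""
    total = 0.0
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            pass
        total += (time.perf_counter() - start) * 1000.0
    return total / samples

def check_link(host, port, limit_ms):
    """Return the measured RTT and whether it is within the given budget."""
    rtt = average_rtt_ms(host, port)
    return rtt, rtt <= limit_ms
```

For example, `check_link("peer-site-host", 443, SYNCMIRROR_RTT_LIMIT_MS)` (hypothetical hostname) would report whether the replication budget is met from the probing host's point of view.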

The storage network must be a minimum of 1 Gbps throughput between the two sites for ISL connectivity.

ESXi hosts in the vMSC configuration should be configured with at least two different IP networks, one for storage and the other for management and virtual machine traffic. The Storage network handles NFS and iSCSI traffic between ESXi hosts and NetApp Controllers. The second network (VM Network) supports virtual machine traffic as well as management functions for the ESXi hosts. End users can choose to configure additional networks for other functionality such as vMotion/Fault Tolerance. VMware recommends this as a best practice, but it is not a strict requirement for a vMSC configuration.
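On ESXi 5.x, a dedicated VMkernel interface for the storage network can be created with esxcli. This is a minimal sketch only: the port group name, vSwitch name, and IP addressing below are examples, not values from a certified configuration.

```
# Example values only: port group, vSwitch, and addresses are illustrative.
# Create a port group for storage (NFS/iSCSI) traffic on a second vSwitch:
esxcli network vswitch standard portgroup add --portgroup-name=Storage --vswitch-name=vSwitch1
# Add a VMkernel interface on that port group:
esxcli network ip interface add --interface-name=vmk1 --portgroup-name=Storage
# Assign it a static address on the storage network:
esxcli network ip interface ipv4 set --interface-name=vmk1 --ipv4=192.168.10.11 --netmask=255.255.255.0 --type=static
```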

FC Switches are used for vMSC configurations where datastores are accessed via FC protocol, and ESX management traffic will be on an IP network. End users can choose to configure additional networks for other functionality such as vMotion/Fault Tolerance. This is recommended as a best practice but is not a strict requirement for a vMSC configuration.

For NFS/iSCSI configurations, a minimum of two uplinks for the controllers must be used. An interface group (ifgroup) should be created using the two uplinks in multimode configurations.
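On a controller running Data ONTAP 7-Mode, a multimode interface group over two uplinks can be created as shown below. The ifgrp name, port names, and address are examples only; adjust them to your environment.

```
# Example values only: ifgrp name, ports, and address are illustrative.
# Create a multimode (all links active) interface group from the two uplinks:
ifgrp create multi ifgrp0 e0a e0b
# Assign the storage network address to the interface group:
ifconfig ifgrp0 192.168.10.20 netmask 255.255.255.0 partner ifgrp0
```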

The VMware datastores and NFS volumes configured for the ESX servers must be provisioned on mirrored aggregates.

vCenter Server must be able to connect to the ESX servers at both sites.

The number of hosts in an HA cluster must not exceed 32.

Notes:

A MetroCluster TieBreaker machine should be deployed at a third site and must be able to reach the storage controllers at Site 1 and Site 2 so that it can initiate a CFOD in the event of an entire site failure.

Solution Overview

The NetApp Unified Storage Architecture offers an agile and scalable storage platform. All NetApp storage systems use the Data ONTAP operating system to provide both SAN (FC, iSCSI) and NFS access.

MetroCluster leverages NetApp HA cluster failover (CFO) functionality to automatically protect against controller failures. Additionally, MetroCluster layers local SyncMirror, cluster failover on disaster (CFOD), hardware redundancy, and geographical separation to achieve extreme levels of availability.

Local SyncMirror synchronously mirrors data across the two halves of the MetroCluster configuration by writing data to two plexes: the local plex (on the local shelf), which actively serves data, and the remote plex (on the remote shelf), which normally does not serve data. If the local shelf fails, the remote shelf seamlessly takes over data-serving operations; because mirroring is synchronous, no data is lost.

Hardware redundancy is in place for all MetroCluster components: controllers, storage, cables, switches (Fabric MetroCluster), and adapters are all redundant.
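The two-plex behavior described above can be modeled in a few lines. This is a conceptual toy only (in-memory dictionaries standing in for the local and remote plexes), not ONTAP code, but it shows why a local shelf failure loses no data under synchronous mirroring.

```python
class MirroredAggregate:
    """Toy model of a SyncMirror aggregate: every write lands on both
    plexes before it is acknowledged, so losing one plex loses no data."""

    def __init__(self):
        self.plexes = {"local": {}, "remote": {}}  # plex name -> block store
        self.failed = set()                        # plexes currently offline

    def write(self, block, data):
        # Synchronous mirroring: the write is acknowledged only after it
        # has been applied to every surviving plex.
        targets = [p for p in self.plexes if p not in self.failed]
        if not targets:
            raise IOError("all plexes failed")
        for plex in targets:
            self.plexes[plex][block] = data
        return True  # acknowledgement to the client

    def read(self, block):
        # Normally served from the local plex; on local failure the
        # remote plex takes over seamlessly.
        for plex in ("local", "remote"):
            if plex not in self.failed:
                return self.plexes[plex].get(block)
        raise IOError("all plexes failed")

    def fail(self, plex):
        self.failed.add(plex)
```

Writing a block, failing the local plex, and reading the block back from the remote plex demonstrates the seamless takeover described above.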

A VMware HA/DRS cluster is created across the two sites using ESXi 5.x hosts and managed by vCenter Server 5.x. The vSphere Management, vMotion, and virtual machine networks are connected using a redundant network between the two sites. It is assumed that the vCenter Server managing the HA/DRS cluster can connect to the ESXi hosts at both sites.

Based on the distance considerations, NetApp MetroCluster can be deployed in two different configurations:

Stretch MetroCluster

Fabric MetroCluster

Stretch MetroCluster

This is a Stretch MetroCluster configuration:

Fabric MetroCluster

This is a Fabric MetroCluster configuration:

Note: These illustrations are simplified representations and do not indicate the redundant front-end components, such as Ethernet and fibre channel switches.

The vMSC configuration used in this certification program was configured with Uniform Host Access mode. In this mode, the ESX hosts at each site are configured to access storage at both sites.

In cases where RDMs are configured for virtual machines residing on NFS volumes, a separate LUN must be configured to hold the RDM mapping files. Ensure you present this LUN to all the ESX hosts.

Failure Scenarios

In a network partition between the sites, a new vSphere HA master is elected within each partition. Virtual machines remain running and do not need to be restarted.

Site 1 and Site 2 simultaneous failure (shutdown) and restoration

MetroCluster behavior: Controllers boot up and resync, and all LUNs and volumes become available. As a best practice, power on the NetApp controllers first and allow the LUNs/volumes to become available before powering on the ESXi hosts.

vSphere HA behavior: All iSCSI sessions and FC paths to the ESXi hosts are re-established, and virtual machines are restarted successfully.


ESXi Management network all ISL links failure

MetroCluster behavior: No impact to the controllers. LUNs and volumes remain available.

vSphere HA behavior: If the HA host isolation response is set to Leave Powered On, virtual machines at each site continue to run, because the storage heartbeat is still active. Partitioned hosts at the site without a Fault Domain Manager (master) elect a new master.

All Storage ISL links failure

MetroCluster behavior: No impact to the controllers. LUNs and volumes remain available. When the ISL links come back online, the aggregates resync.

vSphere HA behavior: No impact.

System Manager - Management Server failure

MetroCluster behavior: No impact. Controllers continue to function normally; the NetApp controllers can be managed from the command line.