
Protection and Recovery Limits of SRM 4.x to 5.5.x in a Shared Recovery Site (N:1) Configuration (2008061)

Purpose

A standard VMware vCenter Site Recovery Manager (SRM) configuration has one protection site and one recovery site, with one vCenter Server instance and one SRM Server instance on each site. A shared recovery site (N:1) configuration has multiple protection sites that all recover virtual machines to a single, shared recovery site. In an N:1 configuration, each protection site has its own vCenter Server and SRM Server instances. The recovery site in an N:1 configuration has one shared vCenter Server instance and multiple SRM Server instances that are all registered as extensions to the same shared vCenter Server instance. If you use vSphere Replication, the recovery site has one shared vSphere Replication management server. You can connect a maximum of 10 protected sites to a shared recovery site.

This article provides information about the scalability limits for an N:1 configuration with SRM 4.x to 5.5.x, for both array-based replication and vSphere Replication.

Using Array-based Replication in an N:1 Configuration

You can use array-based replication to perform recovery and reprotect in an N:1 configuration with all SRM 5.x releases. Performing recovery and reprotect with array-based replication in an N:1 configuration is subject to the same protection and recovery limits as for a standard 1:1 configuration. See Operational Limits for SRM and vSphere Replication (2034768).

Using vSphere Replication in an N:1 Configuration

In an N:1 configuration with vSphere Replication, the secondary vSphere Replication management server is shared across the different SRM Server pairs. You can use vSphere Replication to perform recovery and reprotect in an N:1 configuration with certain limitations.

Reprotect with vSphere Replication is not supported in SRM 5.0.x. You cannot perform reprotect by using vSphere Replication with SRM 5.0.x in either a 1:1 or an N:1 configuration. Reprotect with vSphere Replication is supported in SRM 5.1 and later, for both 1:1 and N:1 configurations.

In vSphere Replication 1.0.x and 5.1, the vSphere Replication management server cannot handle concurrent recoveries or reprotects from more than 3 sites. Additional recoveries can result in an operation timeout error in SRM, with the error message "Operation timed out: -1 seconds". To avoid operation timeout errors, do not run concurrent recoveries or reprotects for more than 3 sites. If you do run concurrent recoveries or reprotects for more than 3 sites and you encounter an operation timeout error, rerun the failed recovery or reprotect operation. This issue has been fixed in vSphere Replication 5.1.1 and vSphere Replication 5.5.

In vSphere Replication 1.0.x and 5.1, due to restrictions in the number of database connections, the vSphere Replication management server cannot handle concurrent recovery or reprotect operations for more than 80 virtual machines (80 long running operations, or LROs). If you run concurrent recovery or reprotect operations on more than 80 virtual machines, the vSphere Replication management server can encounter a deadlock while waiting for a free database connection. This causes operations to time out in SRM. When the number of LROs exceeds 80, the deadlock sometimes occurs during recovery and always occurs during reprotect. This issue has been fixed in vSphere Replication 5.1.1 and vSphere Replication 5.5.

SRM handles the recovery or reprotect of a virtual machine as a long running operation (LRO). Each SRM Server instance on the recovery site throttles the number of LROs that it can send simultaneously to the vSphere Replication management server to a maximum of 40. This applies to all releases of vSphere Replication.

In SRM and vSphere Replication 5.5, test recovery, recovery, and reprotect operations can fail in a shared recovery site configuration if the vSphere Replication server experiences a heavy load, resulting in the error "The connection to the remote server is down". When using SRM and vSphere Replication 5.5, do not perform concurrent operations on more than 200 virtual machines in total, with a maximum of 20 virtual machines per protected site.
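The two limits for SRM and vSphere Replication 5.5 can be checked before starting a set of concurrent operations. The following Python snippet is only a minimal sketch: the function name `check_vr55_limits`, the constant names, and the site labels are hypothetical and are not part of SRM or vSphere Replication. It assumes that you already know how many virtual machines each protected site will operate on concurrently.

```python
# Limits stated in this article for SRM and vSphere Replication 5.5.
MAX_TOTAL_VMS = 200      # concurrent operations across all protected sites
MAX_VMS_PER_SITE = 20    # concurrent operations per protected site


def check_vr55_limits(vms_per_protected_site):
    """Return a list of limit violations; the list is empty if the plan is within limits.

    vms_per_protected_site maps a site label (hypothetical) to the number of
    virtual machines on which that site runs concurrent operations.
    """
    problems = []
    for site, count in vms_per_protected_site.items():
        if count > MAX_VMS_PER_SITE:
            problems.append(
                f"{site}: {count} virtual machines exceeds the per-site maximum of {MAX_VMS_PER_SITE}"
            )
    total = sum(vms_per_protected_site.values())
    if total > MAX_TOTAL_VMS:
        problems.append(
            f"total: {total} virtual machines exceeds the overall maximum of {MAX_TOTAL_VMS}"
        )
    return problems


# Hypothetical plan: site-b is over the per-site maximum.
planned = {"site-a": 20, "site-b": 25, "site-c": 10}
for problem in check_vr55_limits(planned):
    print(problem)
```

An empty result means that the planned operations respect both limits described above. It does not guarantee that the operations succeed.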

How SRM Throttles Concurrent Long Running Operations

Due to the throttling of the number of LRO requests that SRM Server sends to the vSphere Replication management server, the key factor is not the number of simultaneous recovery or reprotect operations that you start for each SRM site pair. The key factor is the total number of concurrent recovery or reprotect operations (LROs) that all SRM site pairs send to the vSphere Replication management server on the recovery site.

Example 1: Excessive LROs

In an N:1 configuration with vSphere Replication 1.0.x or 5.1, you can have more than 3 sites, but you can only start recovery or reprotect operations from 3 sites simultaneously.

In Example 1, the number of recovery or reprotect operations that are started simultaneously is 165. However, the total number of concurrent LRO requests that the SRM Server instances send to the vSphere Replication management server is 100. This total exceeds the limit of 80 concurrent recovery or reprotect operations (80 LROs) that the vSphere Replication management server can handle simultaneously in vSphere Replication 1.0.x and 5.1.

The SRM Server logs on the recovery site show messages that indicate that the connection to the vSphere Replication management server is down. For example:

Example 2: Successful LROs

In an N:1 configuration with vSphere Replication 1.0.x or 5.1, the number of simultaneous recovery or reprotect operations that you start can exceed the number of LRO requests that SRM sites send to the vSphere Replication management server, as long as the number of LROs does not exceed 80.

In Example 2, the number of recovery or reprotect operations that are started simultaneously is 135. However, the total number of concurrent LRO requests that the SRM Server instances send to the vSphere Replication management server is 75. This total does not exceed the limit of 80 concurrent recovery or reprotect operations (80 LROs) that the vSphere Replication management server can handle simultaneously. The requests for recovery or reprotect operations succeed, even though the number of simultaneous recovery or reprotect operations that are started is 135.

Resolution

To avoid operation timeout errors in SRM caused by a deadlocked vSphere Replication management server in vSphere Replication 1.0.x and 5.1, do not perform recovery or reprotect operations on more than 80 virtual machines concurrently (80 LROs).

To determine whether the vSphere Replication management server in an N:1 configuration exceeds the limit of 80 LROs, calculate the total number of simultaneous recovery or reprotect operations (T) from all SRM site pairs:

T = Sum [ min(40, N[i]) ]

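In this formula, N[i] is the number of virtual machines to recover or reprotect simultaneously on the i-th SRM site pair, and min is the function that returns the smallest of the numbers given to it. The value 40 is the maximum number of LROs that each SRM Server instance sends simultaneously to the vSphere Replication management server. If T is greater than 80, the vSphere Replication management server exceeds the limit.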
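The formula can be evaluated with a few lines of Python. This is an illustrative sketch, not a VMware tool: the function and constant names are hypothetical. The per-site numbers are also assumptions, chosen only so that they add up to the totals quoted in Example 1 (165 started, 100 LROs) and Example 2 (135 started, 75 LROs). The article does not state the per-site breakdown.

```python
# Values taken from this article.
SRM_LRO_THROTTLE = 40   # maximum LROs one SRM Server instance sends at a time
VRMS_LRO_LIMIT = 80     # limit in vSphere Replication 1.0.x and 5.1


def total_concurrent_lros(vms_per_site_pair):
    """Compute T = Sum[min(40, N[i])] over all SRM site pairs."""
    return sum(min(SRM_LRO_THROTTLE, n) for n in vms_per_site_pair)


def exceeds_lro_limit(vms_per_site_pair):
    """Return True if T is greater than the 80-LRO limit."""
    return total_concurrent_lros(vms_per_site_pair) > VRMS_LRO_LIMIT


# Assumed per-site breakdowns (not given in the article).
example_1 = [105, 40, 20]   # 165 operations started
example_2 = [100, 20, 15]   # 135 operations started

for name, sites in (("Example 1", example_1), ("Example 2", example_2)):
    print(name, "started:", sum(sites),
          "T:", total_concurrent_lros(sites),
          "exceeds limit:", exceeds_lro_limit(sites))
```

Because of the per-server throttle, a site pair that starts more than 40 operations contributes only 40 to T, which is why T can be much lower than the number of operations started.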

If you encounter a timeout caused by a deadlocked vSphere Replication management server on the secondary site, restart the vSphere Replication management server.

If a recovery plan fails due to an overloaded vSphere Replication management server, you can rerun the plan. SRM retries the virtual machines that failed. The virtual machines that have already been recovered are left untouched and continue running. This workaround is available only for real recoveries. If a test recovery fails, you cannot rerun the test from the point of failure. You must clean up the test and then start it again.

You can also change the database settings to allow more database connections from the vSphere Replication management server, up to a maximum of 500.

NOTE: If you increase the number of database connections, the limits on the numbers of replications and on performing concurrent recovery or reprotect operations from multiple sites still apply. You can configure 500 replications per vSphere Replication appliance and perform concurrent recovery or reprotect operations from a maximum of 3 sites.
