SharePoint Server 2013 Disaster Recovery in Microsoft Azure

Summary: Using Azure, you can create a disaster-recovery environment for your on-premises SharePoint farm. This article describes how to design and implement this solution.

When disaster strikes your SharePoint on-premises environment, your top priority is to get the system running again quickly. Disaster recovery with SharePoint is quicker and easier when you have a backup environment already running in Microsoft Azure. This video explains the main concepts of a SharePoint warm failover environment and complements the full details available in this article.

Use this article with the following solution model: SharePoint Disaster Recovery in Microsoft Azure.

Many organizations do not have a disaster recovery environment for SharePoint, which can be expensive to build and maintain on-premises. Azure Infrastructure Services provides compelling options for disaster recovery environments that are more flexible and less expensive than the on-premises alternatives.

The advantages for using Azure Infrastructure Services include:

Fewer costly resources Maintain and pay for fewer resources than on-premises disaster recovery environments. The number of resources depends on which disaster-recovery environment you choose: cold standby, warm standby, or hot standby.

Better resource flexibility In the event of a disaster, easily scale out your recovery SharePoint farm to meet load requirements. Scale in when you no longer need the resources.

Lower datacenter commitment Use Azure Infrastructure Services instead of investing in a secondary datacenter in a different region.

There are less-complex options for organizations just getting started with disaster recovery and advanced options for organizations with high-resilience requirements. The definitions for cold, warm, and hot standby environments are a little different when the environment is hosted on a cloud platform. The following table describes these environments for building a SharePoint recovery farm in Azure.

The farm can be a smaller version of the production farm and then scaled out to serve the full user base.

Cold

The farm is fully built, but the virtual machines are stopped.

Maintaining the environment includes starting the virtual machines from time to time, patching, updating, and verifying the environment.

Start the full environment in the event of a disaster.

It’s important to evaluate your organization’s Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs). These requirements determine which environment is the most appropriate investment for your organization.

The guidance in this article describes how to implement a warm standby environment. You can also adapt it to a cold standby environment, although you need to follow additional procedures to support this kind of environment. This article does not describe how to implement a hot standby environment.

The warm standby disaster-recovery solution requires the following environment:

An on-premises SharePoint production farm

A recovery SharePoint farm in Azure

A site-to-site VPN connection between the two environments

The following figure illustrates these three elements.

Figure: Elements of a warm standby solution in Azure

SQL Server log shipping with Distributed File System Replication (DFSR) is used to copy database backups and transaction logs to the recovery farm in Azure:

DFSR transfers logs from the production environment to the recovery environment. In a WAN scenario, DFSR is more efficient than shipping the logs directly to the secondary server in Azure.

Logs are replayed to the SQL Server in the recovery environment in Azure.

You don’t attach log-shipped SharePoint content databases in the recovery environment until a recovery exercise is performed.

Perform the following steps to recover the farm:

Stop log shipping.

Stop accepting traffic to the primary farm.

Replay the final transaction logs.

Attach the content databases to the farm.

Restore service applications from the replicated services databases.

Update Domain Name System (DNS) records to point to the recovery farm.

Start a full crawl.

We recommend that you rehearse these steps regularly and document them to help ensure that your live recovery runs smoothly. Attaching content databases and restoring service applications can take some time and typically involves some manual configuration.

After a recovery is performed, this solution provides the items listed in the following table.

Table: Solution recovery objectives

Item

Description

Sites and content

Sites and content are available in the recovery environment.

A new instance of search

In this warm standby solution, search is not restored from search databases. Search components in the recovery farm are configured as similarly as possible to the production farm. After the sites and content are restored, a full crawl is started to rebuild the search index. You do not need to wait for the crawl to complete to make the sites and content available.

Services

Services that store data in databases are restored from the log-shipped databases. Services that do not store data in databases are simply started.

Not all services with databases need to be restored. The following services do not need to be restored from databases and can simply be started after failover:

Usage and Health Data Collection

State service

Word automation

Any other service that doesn’t use a database

You can work with Microsoft Consulting Services (MCS) or a partner to address more-complex recovery objectives. These are summarized in the following table.

Table: Other items that can be addressed by MCS or a partner

Item

Description

Synchronizing custom farm solutions

Ideally, the recovery farm configuration is identical to the production farm. You can work with a consultant or partner to evaluate whether custom farm solutions are replicated and whether the process is in place for keeping the two environments synchronized.

Connections to data sources on-premises

It might not be practical to replicate connections to back-end data systems, such as backup domain controller (BDC) connections and search content sources.

Search restore scenarios

Because enterprise search deployments tend to be fairly unique and complex, restoring search from databases requires a greater investment. You can work with a consultant or partner to identify and implement search restore scenarios that your organization might require.

The guidance provided in this article assumes that the on-premises farm is already designed and deployed.

Ideally, the recovery farm configuration in Azure is identical to the production farm on-premises, including the following:

The same representation of server roles

The same configuration of customizations

The same configuration of search components

The environment in Azure can be a smaller version of the production farm. If you plan to scale out the recovery farm after failover, it’s important that each type of server role be initially represented.

Some configurations might not be practical to replicate in the failover environment. Be sure to test the failover procedures and environment to help ensure that the failover farm provides the expected service level.

This solution doesn't prescribe a specific topology for a SharePoint farm. The focus of this solution is to use Azure for the failover farm and to implement log shipping and DFSR between the two environments.

In a warm standby environment, all virtual machines in the Azure environment are running. The environment is ready for a failover exercise or event.

The following figure illustrates a disaster recovery solution from an on-premises SharePoint farm to an Azure-based SharePoint farm that is configured as a warm standby environment.

Figure: Topology and key elements of a production farm and a warm standby recovery farm

In this diagram:

Two environments are illustrated side by side: the on-premises SharePoint farm and the warm standby farm in Azure.

Each environment includes a file share.

Each farm includes four tiers. To achieve high availability, each tier includes two servers or virtual machines that are configured identically for a specific role, such as front-end services, distributed cache, back-end services, and databases. It isn't important in this illustration to call out specific components. The two farms are configured identically.

The fourth tier is the database tier. Log shipping is used to copy logs from the secondary database server in the on-premises environment to the file share in the same environment.

DFSR copies files from the file share in the on-premises environment to the file share in the Azure environment.

Log shipping replays the logs from the file share in the Azure environment to the primary replica in the SQL Server AlwaysOn availability group in the recovery environment.

In a cold standby environment, most of the SharePoint farm virtual machines can be shut down. (We recommend occasionally starting the virtual machines, such as every two weeks or once a month, so that each virtual machine can sync with the domain.) The following virtual machines in the Azure recovery environment must remain running to help ensure continuous operations of log shipping and DFSR:

The following figure shows an Azure failover environment in which the file share virtual machine and the primary SharePoint database virtual machine are running. All other SharePoint virtual machines are stopped. The virtual machine that is running Windows Server Active Directory and DNS is not shown.

Figure: Cold standby recovery farm with running virtual machines

After failover to a cold standby environment, all virtual machines are started, and the method to achieve high availability of the database servers must be configured, such as SQL Server AlwaysOn availability groups.

If multiple storage groups are implemented (databases are spread across more than one SQL Server high availability set), the primary database for each storage group must be running to accept the logs associated with its storage group.

Multiple technologies are used in this disaster recovery solution. To help ensure that these technologies interact as expected, each component in the on-premises and Azure environment must be installed and configured correctly. We recommend that the person or team who sets up this solution have a strong working knowledge of and hands-on skills with the technologies described in the following articles:

Finally, we recommend scripting skills that you can use to automate tasks associated with these technologies. It’s possible to use the available user interfaces to complete all the tasks described in this solution. However, a manual approach can be time consuming and error prone and delivers inconsistent results.

In addition to Windows PowerShell, there are also Windows PowerShell libraries for SQL Server, SharePoint Server, and Azure. Don’t forget T-SQL, which can also help reduce the time to configure and maintain your disaster-recovery environment.

To support log shipping in a disaster-recovery solution, a file share virtual machine is added to the cloud service where the database roles reside. The file share also serves as the third node of a Node Majority for the SQL Server AlwaysOn availability group. This is the recommended configuration for a standard SharePoint farm that uses SQL Server AlwaysOn availability groups.

Figure: Placement of a file server used for a disaster recovery solution

In this diagram, a file share virtual machine is added to the same cloud service in Azure that contains the database server roles. Do not add the file share virtual machine to an availability set with other server roles, such as the SQL Server roles.

If you are concerned about the high availability of the logs, consider taking a different approach by using SQL Server backup and restore with Azure Blob Storage Service. This is a new feature in Azure that saves logs directly to a blob storage URL. This solution does not include guidance about using this feature.

When you design the recovery farm, keep in mind that a successful disaster recovery environment accurately reflects the production farm that you want to recover. The size of the recovery farm is not the most important thing in the recovery farm’s design, deployment, and testing. Farm scale varies from organization to organization based on business requirements. It might be possible to use a scaled-down farm for a short outage or until performance and capacity demands require you to scale the farm.

Configure the recovery farm as identically as possible to the production farm so that it meets your service level agreement (SLA) requirements and provides the functionality that you need to support your business. When you design the disaster recovery environment, also look at your change management process for your production environment. We recommend that you extend the change management process to the recovery environment by updating the recovery environment at the same interval as the production environment. As part of the change management process, we recommend maintaining a detailed inventory of your farm configuration, applications, and users.

Before this phase, you didn’t deploy virtual machines to the Virtual Network. The virtual machines for hosting Active Directory and DNS are likely not the largest virtual machines you need for the solution. Before you deploy these virtual machines, first create the largest virtual machine that you plan to use in your Virtual Network. This helps ensure that your solution lands on a tag in Azure that allows the largest size you need. You do not need to configure this virtual machine at this time. Simply create it, and set it aside. If you do not do this, you might run into a limitation when you try to create larger virtual machines later, which was an issue at the time this article was written.

To set up file replication by using DFSR, use the DNS Management snap-in. However, before the DFSR setup, log on to your on-premises file server and Azure file server and enable the service in Windows.

From the Server Manager Dashboard, complete the following steps:

Configure the local server.

Start the Add Roles and Features Wizard.

Open the File and Storage Services node.

Select DFS Namespaces and DFS replication.

Click Next to finish the wizard steps.

The following table provides links to DFSR reference articles and blog posts.

Log shipping is the critical component for setting up disaster recovery in this environment. You can use log shipping to automatically send transaction log files for databases from a primary database server instance to a secondary database server instance. To set up log shipping, see Configure log shipping in SharePoint 2013.

The goal of this final phase is to verify that the disaster recovery solution works as planned. To do this, create a failover event that shuts down the production farm and starts up the recovery farm as a replacement. You can start a failover scenario manually or by using scripts.

The first step is to stop incoming user requests for farm services or content. You can do this by disabling DNS entries or by shutting down the front-end web servers. After the farm is ”down,” you can fail over to the recovery farm.

You must stop log shipping before farm recovery. Stop log shipping on the secondary server in Azure first, and then stop it on the primary server on-premises. Use the following script to stop log shipping on the secondary server first and then on the primary server. The database names in the script might be different, depending on your environment.

Backups must be restored in the order in which they were created. Before you can restore a particular transaction log backup, you must first restore the following previous backups without rolling back uncommitted transactions (that is, by using WITH NORECOVERY):

The full database backup and the last differential backup – Restore these backups, if any exist, taken before the particular transaction log backup. Before the most recent full or differential database backup was created, the database was using the full recovery model or bulk-logged recovery model.

All transaction log backups – Restore any transaction log backups taken after the full database backup or the differential backup (if you restore one) and before the particular transaction log backup. Log backups must be applied in the sequence in which they were created, without any gaps in the log chain.

To recover the content database on the secondary server so that the sites render, remove all database connections before recovery. To restore the database, run the following SQL statement.

When you use T-SQL explicitly, specify either WITH NORECOVERY or WITH RECOVERY in every RESTORE statement to eliminate ambiguity—this is very important when writing scripts. After the full and differential backups are restored, the transaction logs can be restored in SQL Server Management Studio. Also, because log shipping is already stopped, the content database is in a standby state, so you must change the state to full access.

In SQL Server Management Studio, right-click the WSS_Content database, point to Tasks > Restore, and then click Transaction Log (if you have not restored the full backup, this is not available). For more information, see Restore a Transaction Log Backup (SQL Server).

You must start a full crawl for each content source to restore the Search Service. Note that you lose some analytics information from the on-premises farm, such as search recommendations. Before you start the full crawls, use the Windows PowerShell cmdlet Restore-SPEnterpriseSearchServiceApplication and specify the log-shipped and replicated Search Administration database, Search_Service__DB_<GUID>. This cmdlet gives the search configuration, schema, managed properties, rules, and sources and creates a default set of the other components.

To start a full crawl, complete the following steps:

In the SharePoint 2013 Central Administration, go to Application Management > Service Applications > Manage service applications, and then click the Search Service application that you want to crawl.

On the Search Administration page, click Content Sources, point to the content source that you want, click the arrow, and then click Start Full Crawl.

The following table shows how to recover services that have log-shipped databases, the services that have databases but are not recommended to restore with log shipping, and the services that do not have databases.

Important:

Restoring an on-premises SharePoint database into the Azure environment will not recover any SharePoint services that you did not already install in Azure manually.

Table: Service application database reference

Restore these services from log-shipped databases

These services have databases, but we recommend that you start these services without restoring their databases

These services do not store data in databases; start these services after failover

Machine Translation Service

Managed Metadata Service

Secure Store Service

User Profile. (Only the Profile and Social Tagging databases are supported. The Synchronization database is not supported.)

Microsoft SharePoint Foundation Subscription Settings Service

Usage and Health Data Collection

State service

Word automation

Excel Services

PerformancePoint Services

PowerPoint Conversion

Visio Graphics Service

Work Management

The following example shows how to restore the Managed Metadata service from a database.

This uses the existing Managed_Metadata_DB database. This database is log shipped, but there is no active service application on the secondary farm, so it needs to be connected after the service application is in place.

First, use New-SPMetadataServiceApplication, and specify the DatabaseName switch with the name of the restored database.

Next, configure the new Managed Metadata Service Application on the secondary server, as follows:

You must manually create DNS records to point to your SharePoint farm.

In most cases where you have multiple front-end web servers, it makes sense to take advantage of the Network Load Balancing feature in Windows Server 2012 or a hardware load balancer to distribute requests among the web-front-end servers in your farm. Network load balancing can also help reduce risk by distributing requests to the other servers if one of your web-front-end servers fails.

Typically, when you set up network load balancing, your cluster is assigned a single IP address. You then create a DNS host record in the DNS provider for your network that points to the cluster. (For this project, we put a DNS server in Azure for resiliency in case of an on-premises datacenter failure.) For instance, you can create a DNS record, in DNS Manager in Active Directory, for example, called http://sharepoint.contoso.com, that points to the IP address for your load-balanced cluster.

For external access to your SharePoint farm, you can create a host record on an external DNS server with the same URL that clients use on your intranet (for example, http://sharepoint.contoso.com) that points to an external IP address in your firewall. (A best practice, using this example, is to set up split DNS so that the internal DNS server is authoritative for contoso.com and routes requests directly to the SharePoint farm cluster, rather than routing DNS requests to your external DNS server.) You can then map the external IP address to the internal IP address of your on-premises cluster so that clients find the resources they are looking for.

From here, you might run into a couple of different disaster-recovery scenarios:

Example scenario: The on-premises SharePoint farm is unavailable because of hardware failure in the on-premises SharePoint farm. In this case, after you have completed the steps for failover to the Azure SharePoint farm, you can configure network load balancing on the recovery SharePoint farm’s web-front-end servers, the same way you did with the on-premises farm. You can then redirect the host record in your internal DNS provider to point to the recovery farm’s cluster IP address. Note that it can take some time before cached DNS records on clients are refreshed and point to the recovery farm.

Example scenario: The on-premises datacenter is lost completely. This scenario might occur due to a natural disaster, such as a fire or flood. In this case, for an enterprise, you would likely have a secondary datacenter hosted in another region as well as your Azure subnet that has its own directory services and DNS. As in the previous disaster scenario, you can redirect your internal and external DNS records to point to the Azure SharePoint farm. Again, take note that DNS-record propagation can take some time.

If you are using host-named site collections, as recommended in Host-named site collection architecture and deployment (SharePoint 2013), you might have several site collections hosted by the same web application in your SharePoint farm, with unique DNS names (for example, http://sales.contoso.com and http://marketing.contoso.com). In this case, you can create DNS records for each site collection that point to your cluster IP address. After a request reaches your SharePoint web-front-end servers, they handle routing each request to the appropriate site collection.

We designed and tested a proof-of-concept environment for this solution. The design goal for our test environment was to deploy and recover a SharePoint farm that we might find in a customer environment. We made several assumptions, but we knew that the farm needed to provide all of the out-of-the-box functionality without any customizations. The topology was designed for high availability by using best practice guidance from the field and product group.

The following table describes the Hyper-V virtual machines that we created and configured for the on-premises test environment.

Table: Virtual machines for on-premises test

Server name

Role

Configuration

DC1

Domain controller with Active Directory.

Two processors

From 512 MB through 4 GB of RAM

1 x 127-GB hard disk

RRAS

Server configured with the Routing and Remote Access Service (RRAS) role.

Two processors

2–8 GB of RAM

1 x 127-GB hard disk

FS1

File server with shares for backups and an end point for DFSR.

Four processors

2–12 GB of RAM

1 x 127-GB hard disk

1 x 1-TB hard disk (SAN)

1 x 750-GB hard disk

SP-WFE1, SP-WFE2

Front-end web servers.

Four processors

16 GB of RAM

SP-APP1, SP-APP2, SP-APP3

Application servers.

Four processors

2–16 GB of RAM

SP-SQL-HA1, SP-SQL-HA2

Database servers, configured with SQL Server 2012 AlwaysOn availability groups to provide high availability. This configuration uses SP-SQL-HA1 and SP-SQL-HA2 as the primary and secondary replicas.

Four processors

2–16 GB of RAM

The following table describes drive configurations for the Hyper-V virtual machines that we created and configured for the front-end web and application servers for the on-premises test environment.

Table: Virtual machine drive requirements for the Front End Web and Application servers for the on-premises test

The following table describes drive configurations for the Hyper-V virtual machines created and configured to serve as the on-premises database servers. On the Database Engine Configuration page, access the Data Directories tab to set and confirm the settings shown in the following table.

Table: Virtual machine drive requirements for the database server for the on-premises test

During the different deployment phases, the test team typically worked on the on-premises architecture first and then on the corresponding Azure environment. This reflects the general real-world cases where in-house production farms are already running. What is even more important is that you should know the current production workload, capacity, and typical performance. In addition to building a disaster recovery model that can meet business requirements, you should size the recovery farm servers to deliver a minimum level of service. In a cold or warm standby environment, a recovery farm is typically smaller than a production farm. After the recovery farm is stable and in production, the farm can be scaled up and out to meet workload requirements.

This phase involved setting up a domain environment for the on-premises farm and for the recovery farm in Azure. In addition to the normal tasks associated with configuring Active Directory, the test team implemented a routing solution and a VPN connection between the two environments.

In addition to the farm servers, it was necessary to provision servers for the domain controllers and configure a server to handle RRAS as well as the site-to-site VPN. Two file servers were provisioned for the DFSR service, and several client computers were provisioned for testers.

The SharePoint farms were deployed in two stages in order to simplify environment stabilization and troubleshooting, if required. During the first stage, each farm was deployed on the minimum number of servers for each tier of the topology to support the required functionality.

We created the database servers with SQL Server installed before creating the SharePoint 2013 servers. Because this was a new deployment, we created the availability groups before deploying SharePoint. We created three groups based on MCS best practice guidance.

Configure AlwaysOn and create the three availability groups for the farm.

Provision AZ-APP1 to host Central Administration.

Provision AZ-WFE1 and AZ-WFE2 to host the distributed cache.

After we configured the distributed cache and added test users and test content, we started stage two of the deployment. This required scaling out the tiers and configuring the farm servers to support the high-availability topology described in the farm architecture.

The following table describes the virtual machines, cloud services, and availability sets we set up for our recovery farm.

After the test team stabilized the farm environments and completed functional testing, they started the following operations tasks required to configure the on-premises recovery environment:

Configure full and differential backups.

Configure DFSR on the file servers that transfer transaction logs between the on-premises environment and the Azure environment.

Configure log shipping on the primary database server.

Stabilize, validate, and troubleshoot log shipping, as required. This included identifying and documenting any behavior that might cause issues, such as network latency, which would cause log shipping or DFSR file synchronization failures.

Check for a missing service application association between your content site collection and your content type hub. In addition, under the Managed Metadata - <site collection name> Connection properties screen, make sure this option is enabled: This service application is the default storage location for column specific term sets.

When setting up user profiles, you need the Active Directory forest name. In the Add Roles and Features Wizard, ensure that you have enabled the Active Directory Module for Windows PowerShell (under the Remote Server Administration Tools>Role Administration Tools>AD DS and AD LDS Tools section). In addition, run the following commands before using Get-ADForest to help ensure that your software dependencies are loaded.

This happens because the default backup preference for an availability group is Prefer Secondary. Ensure that you run the log shipping job from the secondary server for the availability group instead of the primary; otherwise, the job will fail silently.

Services might take several minutes to start, depending on the performance and current load of your SharePoint Server. Manually click Start for the service and provide adequate time for startup while occasionally refreshing the Services on Server screen to monitor its status. In case the service remains stopped, enable SharePoint diagnostic logging, attempt to start the service again, and then check the log for errors. For more information, see Configure diagnostic logging in SharePoint 2013