Create a high availability architecture and strategy for SharePoint 2013

Summary: Learn how to combine farm architecture and technology to create a highly available environment in a single SharePoint 2013 farm.

A high-availability strategy is an important requirement for a production SharePoint 2013 environment. An end-to-end strategy includes operational processes, platform governance, architecture, and technical solutions. This article focuses on the architectural and technical aspects of high availability. The guidance explains specific SharePoint design elements and the technical options that will determine your strategy for high availability.

Note:

High availability and disaster recovery are not the same thing. Although there is overlap in planning and solutions, they are subsets of business continuity. The purpose of high availability is to provide resiliency within the primary data center, including during planned downtime. The purpose of disaster recovery is to enable an organization to resume computer operations in a secondary data center when a disaster at the primary data center makes the infrastructure unusable. For information about disaster recovery for SharePoint 2013, see Choose a disaster recovery strategy for SharePoint 2013.

High availability is generally used to describe the ability of a system to continue operating and provide resources to its users when a failure occurs in one or more of the following categories in a fault domain: hardware, software, or application. The level of availability is expressed as a measure of the percentage of time that a system is continuously operational to support business functions. The required level of availability varies among organizations. Although this requirement may also vary among business units, a service level agreement is for the organization as a whole. From the perspective of users, a SharePoint farm is available when users can access the farm and use the features and services that they must have to do their work.

A highly available SharePoint farm has the following goals and characteristics:

The farm design reduces potential points of failure. Because it is improbable that you can eliminate all failure points, the overall strategy must address how to respond to a failure event.

Failover events are seamless and have minimal effect on user activities.

The farm continues to operate at reduced capacity instead of failing completely.

The farm is resilient. Incidents that affect service occur infrequently, and timely and effective action is taken when they do occur.

Before you can create a realistic and economical high-availability architecture and strategy for your SharePoint environment, you have to define and quantify your availability goals. These goals reflect the extent to which your organization depends on SharePoint 2013 and how a loss of service might affect the organization's operations. The effect of the loss of service depends on the nature of the loss (full or partial) and the duration of the loss.

Lost revenue is usually identified as a leading result of reduced service or a complete loss of service, especially for companies that conduct business online. However, other less visible consequences are equally damaging to an organization. For example, an organization can experience loss of confidence by partners, suppliers or customers, diminished respect of the corporate brand, and legal issues.

A successful high-availability strategy must reflect the specific needs of your organization. Additionally, it must provide an optimal balance between business requirements, IT service level agreements (SLAs), and the availability of technical solutions, IT support capabilities, and infrastructure costs.

After you identify availability requirements for your organization, you can begin to create a high-availability design and a strategy to reduce the risk of downtime and reduced operations. IT professionals who design and deploy highly available systems use the following guiding principles to meet their goals:

Eliminate single points of failure for each fault domain and the entire system at every possible layer (the operating system, software and the SharePoint application).

Note:

A fault domain provides the scope and boundary of a physical point of failure. The March 2011 Issue of IEEE Computer Magazine gives this definition: "A fault domain is a set of hardware components – computers, switches, and more – that share a single point of failure."
For more information about fault domains and upgrade domains, see Windows Azure Fault Domain and Upgrade Domain Explained for IT Pros.

Implement very rapid fault detection, isolation, and resolution.

Note:

An SLA is a negotiated agreement between IT service providers (an internal IT group or external vendor) and user representatives. An SLA is used to identify and quantify the required services and the support that the service provider will provide. SLAs are clear, specific, and precise to avoid misunderstandings about expectations of providers and users. Clarity and preciseness are important because SLAs typically specify significant financial penalties where third-party service providers are engaged.

High availability and fault tolerance are not the same things. A definition of high availability is important because fault tolerance is frequently used synonymously to describe how high availability is implemented.

High availability solutions are broad in scope and provide a set of system-wide, shared resources that are integrated to provide predefined required services. The solution uses different combinations of industry-standard hardware and software to minimize downtime and restore services when the system or part of the system fails.

A fault-tolerant solution is hardware-centric and uses specialized hardware to detect faults and instantly switch to a redundant hardware component. This component can be a processor, memory board, power supply, I/O subsystem, or storage subsystem. The switch to a redundant component provides a high level of service.

A cost-benefits analysis of fault-tolerant solutions and high availability solutions enables organizations to create an effective strategy to meet the availability goals for their SharePoint farm. Typically there are cost tradeoffs between the two solutions. For more information, see Evaluating High-Availability (HA) vs. Fault Tolerant (FT) Solutions.

Availability is measured in relation to being operational 100% of the time, or never down. The common measure of availability in the IT world is expressed as a number of 9s, ranging from one nine (90%) to five nines (99.999%), the ideal. The number-of-nines measure is the percentage of time that a given system is running, functioning, and available to users.

Note:

Uptime is frequently used synonymously with availability. However, this is misleading because a computer system can be running but not able to provide the services and functionality that users need.

In the availability calculation, y is the total number of minutes that a system or service is unavailable.
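As a hedged sketch of the calculation that uses y, percentage availability is commonly expressed in terms of downtime. Here x is an assumed symbol (it is not defined in this article) for the total number of minutes in the measurement period, and y is the number of minutes of unavailability:

```latex
\text{Availability (\%)} = \frac{x - y}{x} \times 100
```

For example, with x = 43,200 minutes (a 30-day month) and y = 43.2 minutes of downtime, the result is 99.9%.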

As you can see in the following table, which correlates availability percentage with calendar time equivalents, five 9s of availability is difficult to achieve. This level of availability is also expensive, complex, and in some cases involves risk. For more perspective on five 9s, read Vijay Gill's post, How many Nines? and Sean Hull's post, The myth of five nines - Why high availability is overrated.

Note:

Three 9s of availability is the norm for most businesses that operate SharePoint farms in their datacenter. This level of availability is also typical for a SharePoint farm that is deployed in a hosted environment or in the cloud.

Correlation of % availability to calendar downtime

Acceptable availability percentage | Downtime per day | Downtime per month | Downtime per year
-----------------------------------|------------------|--------------------|------------------
90 (one nine)                      | 144.00 minutes   | 72 hours           | 36.5 days
99 (two nines)                     | 14.40 minutes    | 7 hours            | 3.65 days
99.9 (three nines)                 | 86.40 seconds    | 43 minutes         | 8.77 hours
99.99 (four nines)                 | 8.64 seconds     | 4 minutes          | 52.60 minutes
99.999 (five nines)                | 0.86 seconds     | 26 seconds         | 5.26 minutes
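The correlation between an availability percentage and calendar downtime is simple arithmetic. The following Python sketch shows one way to derive such figures; the function name and the choice of a 30-day month and a 365.25-day year are assumptions made for illustration, not values stated in this article.

```python
# Sketch: convert an availability percentage into allowed downtime.
# Assumptions (not from the article): a 30-day month and a 365.25-day year.
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30
DAYS_PER_YEAR = 365.25


def allowed_downtime_seconds(availability_percent, period_days):
    """Return the downtime, in seconds, allowed during period_days."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_days * SECONDS_PER_DAY


for percent in (90, 99, 99.9, 99.99, 99.999):
    per_day = allowed_downtime_seconds(percent, 1)
    per_month = allowed_downtime_seconds(percent, DAYS_PER_MONTH)
    per_year = allowed_downtime_seconds(percent, DAYS_PER_YEAR)
    print(f"{percent}%: {per_day:.2f} s/day, "
          f"{per_month / 60:.1f} min/month, {per_year / 3600:.2f} h/year")
```

Under these assumptions, three 9s allows roughly 86.4 seconds of downtime per day. Small differences from published figures usually come from rounding and from the length of month or year that is assumed.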

Although SLAs themselves are out of scope for this article, the following aspects of an SLA should be reflected in a high-availability design.

Availability definition and scope

You can't build an environment for availability or negotiate an SLA until you define availability. It has to be a measure of the ability of users to complete the normal tasks that their job requires and use the functions and services that SharePoint provides. SharePoint workloads that an organization uses determine this definition and the scope of availability. Workload requirements vary among customers and are based on the specific needs of each organization, the functions that the farm provides, and the profile of the users of the farm.

Exclusions to availability calculations

Exclusions to availability are as important to the design of availability as they are to an SLA. Every system requires routine maintenance. Planned downtime or reduced levels of service are not part of availability calculations. Typical exclusions are scheduled maintenance hours and planned downtime for activities such as quarantining a virus or responding to a security threat.
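To illustrate how exclusions affect the numbers, the following Python sketch computes availability over the agreed service time only. Treating planned downtime as time removed from the measurement period is an assumption about how a particular SLA is written, and the function name and sample figures are hypothetical.

```python
def sla_availability_percent(period_minutes, planned_minutes, unplanned_minutes):
    """Availability over the agreed service time.

    Assumption: planned (excluded) downtime is removed from the measurement
    period, so only unplanned downtime counts against availability.
    """
    agreed_service_minutes = period_minutes - planned_minutes
    if agreed_service_minutes <= 0:
        raise ValueError("planned downtime covers the whole period")
    available_minutes = agreed_service_minutes - unplanned_minutes
    return 100 * available_minutes / agreed_service_minutes


# Hypothetical month: 30 days, 4 hours of scheduled maintenance,
# and 43 minutes of unplanned outage.
month_minutes = 30 * 24 * 60
print(round(sla_availability_percent(month_minutes, 240, 43), 3))
```

If the same 240 minutes of maintenance were counted as downtime instead of excluded, the reported availability for the month would be noticeably lower, which is why exclusions need to be agreed on up front.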

Downtime metrics

The previous "Correlation of % availability to calendar downtime" table identifies downtime for each 9 of availability. The following measures are used to calculate availability:

Mean time between failures (MTBF) - The expected time between two consecutive failures for a repairable system.

Mean time to failure (MTTF) - The expected time to failure for a system that cannot be repaired.

Mean time to repair or replace (MTTR) - The expected time to repair or replace a failed component.

The following formula calculates availability: Availability = MTTF / (MTTF + MTTR). You can use this formula to calculate total availability and improve it by using more reliable hardware and software components to increase MTTF and decrease MTTR.
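As a worked illustration of this formula, the following Python sketch compares two hypothetical components. The MTTF and MTTR values are invented for the example and do not describe any real hardware.

```python
def availability(mttf_hours, mttr_hours):
    """Availability = MTTF / (MTTF + MTTR), as a fraction between 0 and 1."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical component: fails on average every 1,000 hours.
baseline = availability(mttf_hours=1000, mttr_hours=4)
# Same component, but with spares on hand so repair takes 1 hour.
faster_repair = availability(mttf_hours=1000, mttr_hours=1)

print(f"baseline: {baseline:.4%}")
print(f"faster repair: {faster_repair:.4%}")
```

The comparison shows that reducing repair time can raise availability without changing how often the component fails.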

The relationship of performance to availability

Performance is not isolated from availability. Service providers and consumers define quantifiable performance benchmarks for service levels when a system is running under normal conditions. However, reduced availability occurs when an extraordinary event significantly affects performance to make the system basically unusable—by definition, the farm is not available. Here are typical examples of extraordinary events:

Denial of service attacks on public-facing web servers

Poorly formed queries that use up database server resources or database transactions that lock tables

Wide-area network (WAN) failures or high network latency caused by events in other locations

As more organizations move to geographically distributed SharePoint farms or hosted environments, network latency is extremely important for availability planning.

A process that implements high availability is one of the more expensive investments for a SharePoint farm. As the level of availability and the number of systems that you want to make highly available increase, the complexity and cost of an availability solution also increase. The following costs are typically part of an investment in high availability:

Additional infrastructure components such as network adapters, switches, and power for redundancy.

Additional hardware, software, or software licenses to support various farm roles that provide workload redundancy across the farm architecture.

Spare hardware to replace failed equipment.

Although it's common to prepare spare computers in varying states of readiness to use for routine maintenance or to replace a failed server, your investment is in hardware that sits idle.

Note:

Advances in virtualization technology enable organizations to use virtual computers as hot, warm, or cold spares. Virtual computers may be suitable to provide the same functionality. Virtualization can provide flexibility and cost efficiency. However, you must verify that a virtual machine has the capacity to handle the load of the physical computer that it will replace.

Increased maintenance and support costs that are proportional to the level of availability and the solutions that are used to meet availability requirements.

Anticipated changes to the farm, such as scaling out. When you scale out a farm, the availability solution has to be able to reflect all the changes to the farm topology. Costs will probably increase.

A robust detection and alerting system that provides rapid fault detection. This system can use existing fault detection tools and can include health monitoring and alerting tools such as System Center Operations Manager.

Integration or customization costs required to implement high availability for the farm itself or to meet broader data center requirements.

Evaluate the cost of better availability in the context of your core business needs. In many cases, not every organizational unit requires the same level of availability. Consider varying levels of availability for different sites, different services, or different farms.

Referring back to the "Correlation of % availability to calendar downtime" table, five 9s of availability means that over the course of a year the system is only down for 5.26 minutes! Although you can achieve this level of availability, the cost is prohibitive for many organizations. A key decision is to determine the point where an investment in an additional nine of availability does not provide a cost benefit in relation to the effect of a failure.
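One way to frame this decision is to compare the downtime cost that an additional nine avoids with what that nine costs each year. The following Python sketch does this under loud assumptions: the cost per hour of downtime and the added annual cost are hypothetical figures, and the function names are invented for the example.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumption: average year length


def annual_downtime_hours(availability_percent):
    """Expected hours of downtime per year at the given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


def next_nine_pays_off(current_percent, next_percent,
                       downtime_cost_per_hour, added_annual_cost):
    """True if the avoided downtime cost exceeds the added yearly cost."""
    hours_avoided = (annual_downtime_hours(current_percent)
                     - annual_downtime_hours(next_percent))
    return hours_avoided * downtime_cost_per_hour > added_annual_cost


# Hypothetical figures: downtime costs 10,000 per hour and each
# additional nine costs 50,000 per year to implement and operate.
print(next_nine_pays_off(99, 99.9, 10_000, 50_000))
print(next_nine_pays_off(99.99, 99.999, 10_000, 50_000))
```

With these invented numbers, moving from two 9s to three 9s avoids far more cost than it adds, whereas moving from four 9s to five 9s does not. Real evaluations also need to account for less visible costs such as loss of confidence or legal exposure.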

The following illustration shows how you can distribute and configure different parts of a SharePoint environment to increase availability across a farm. This example also shows how redundancy can address fault domains.

Note:

Our example is not comprehensive. For example, it does not show all the fault domains and fault-tolerant hardware.

Examples of redundancy in a farm topology to address points of failure

Referring to the topology in the previous illustration, note the following:

The farm servers in this example can be physical computers or virtual machines that are deployed on Hyper-V host servers. The principle of identifying and responding to points of failure applies to both types of environment.

Four servers (W1-W4) are dedicated to serving content and this redundancy increases availability if a failure occurs in one or more servers. This level of redundancy also enables the farm to continue operations when software updates are applied.

The farm database servers are redundant and database high availability can be achieved by using database mirroring or clustering.

In a virtual environment the virtual machines are put on separate Hyper-V host servers to eliminate a single point of failure. This approach to virtual machine placement follows best practice guidelines for availability and performance.

The primary database server (labeled 1) and Rack 2 (labeled 2), which contains two of the virtualization host computers, are identified as fault domains to show how your farm and infrastructure can be viewed as a collection of fault domains. This shows how you can do an in-depth analysis of your environment to develop an overall strategy and cost-benefit analysis.
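The effect of redundancy described in these points can be approximated with standard reliability arithmetic. The following Python sketch assumes that servers fail independently and that a tier stays available as long as one server in it is running; both are simplifications, and the 99% per-server figure is hypothetical.

```python
def parallel_availability(server_availability, server_count):
    """Availability of a tier that is up while at least one server is up.

    Assumption: servers fail independently of each other.
    """
    return 1 - (1 - server_availability) ** server_count


def series_availability(*tier_availabilities):
    """Availability of a farm that needs every listed tier to be up."""
    result = 1.0
    for tier in tier_availabilities:
        result *= tier
    return result


# Hypothetical: each server is available 99% of the time.
web_tier = parallel_availability(0.99, 4)       # W1-W4
database_tier = parallel_availability(0.99, 2)  # redundant database servers
farm = series_availability(web_tier, database_tier)

print(f"web tier: {web_tier:.8f}")
print(f"database tier: {database_tier:.6f}")
print(f"farm: {farm:.6f}")
```

The sketch ignores shared fault domains such as a rack or a switch, which is exactly why the fault domain analysis matters: servers that share a point of failure do not fail independently. It also ignores capacity, because a single remaining web server may not be able to carry the full load.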

Other farm roles and services

Our example does not include all the roles, services, and service applications that might be running in a specific SharePoint farm. You cannot use a generic approach to high availability for everything in a SharePoint farm. Some important exclusions to using a standard approach to high availability are as follows:

Although service applications can run on multiple computers, which we recommend, some have unique installation and configuration requirements for high availability. The User Profile application is a well-known example.

After you design an architecture that supports highly available roles and workloads, you can use fault-tolerant components to increase availability. Fault-tolerant solutions are available across the infrastructure, which includes the databases.

Fault tolerance is readily available for almost every hardware component in the infrastructure of a SharePoint farm. As part of your high availability design, determine the parts of the infrastructure that should be fault-tolerant from an operational and cost perspective. Just because you can make every part of the infrastructure fault-tolerant doesn't mean that you should.

Because the SharePoint platform and its application workloads depend on the availability and reliability of all the SharePoint databases, highly available databases are an extremely important aspect of your high availability strategy. You can use the following features as fault-tolerant solutions for SharePoint database servers and databases:

A failover cluster requires shared disk storage between two computers. In a two-node configuration, the computers are configured as active/passive, which provides a fully redundant instance of the primary node. The passive node is only brought online when the primary node fails. The shared disk is only presented to one computer at a time. This configuration typically requires the most additional hardware. In SQL Server 2012, this type of cluster configuration is an AlwaysOn Failover Cluster Instance, and it is a specific way to install SQL Server. Because of the configuration requirements, you cannot take a standard SQL Server installation and easily change it to a Failover Cluster Instance.

An AlwaysOn Availability Group is a different technology in SQL Server 2012 (think of it as a descendant of Database Mirroring) that uses some features exposed by Windows Clustering. However, it does not require shared disk storage, and the computers in an availability group do not have to have a specialized configuration of SQL Server installed on them. After a database server is added to a Windows Cluster, it is fairly easy to enable AlwaysOn Availability Groups and then configure the availability group that you want.

In summary, any server that runs SQL Server 2012 Enterprise Edition can use AlwaysOn Availability Groups by joining a cluster and configuring the availability group. AlwaysOn failover clusters require special hardware and configuration steps to set up Failover Cluster Instances. Each of these technologies has its use for specific environments, and the two are complementary as well as competing options. For more information about these features, see Microsoft SQL Server AlwaysOn Solutions Guide for High Availability and Disaster Recovery.

Important:

Because each SQL Server high availability option has its own features, strengths, and weaknesses, one option is not necessarily better than another. For example, in a given scenario that uses AlwaysOn Availability Groups, minimizing data loss might be better than any performance gain that AlwaysOn Failover Cluster Instances achieves. You must choose a high-availability solution that is based on your business requirements and IT infrastructure requirements.

A determining factor in selecting a SQL Server option to use is the SharePoint databases. You must understand the characteristics of the SharePoint 2013 databases. Each database may have specific requirements or constraints that will determine the SQL Server fault-tolerant solution that is appropriate and fully supported in your production environment. We recommend that you review the following articles:

A failover cluster is a combination of one or more nodes or servers, and two or more shared disks. Although an instance of a failover cluster appears as a single computer, the instance provides failover from one node to another if the current node becomes unavailable. SharePoint 2013 can run on any combination of active and passive nodes in a cluster that SQL Server supports.

SharePoint 2013 references the cluster as a whole. Therefore, failover is automatic and seamless from the perspective of SharePoint 2013.

Note:

When either a planned or unplanned failover happens, connections are dropped and must be established again when transitioning from one cluster node to another cluster node.

For detailed information about SQL Server failover clustering, see the following articles:

The key benefit of SQL Server AlwaysOn Availability Groups and SQL Server Database Mirroring is that both provide complete or almost complete data redundancy depending on how you configure them for transaction processing. In addition to minimizing data loss, automatic failover minimizes downtime for production databases.

Important:

Although SQL Server 2012 supports database mirroring, this feature is deprecated. We recommend that you avoid using this feature in new development work. Plan to change applications that currently use this feature. Use AlwaysOn Availability Groups instead.

AlwaysOn Availability Groups

The SQL Server AlwaysOn Availability Groups feature is both a high-availability and disaster-recovery solution that provides an enterprise-level alternative to database mirroring. AlwaysOn Availability Groups supports a failover environment for one or more user databases contained in a user-defined collection. This collection, an availability group, consists of the following components:

Replicas, which are a discrete set of user databases called availability databases that are handled as a single unit. An availability group supports one primary replica and up to four secondary replicas.

A specific instance of SQL Server to host each replica and to maintain a local copy of each database that belongs to the availability group.

When an availability group fails over to a target instance or target server, all databases in the group also fail over. Because SQL Server 2012 can host multiple availability groups on a single server, you can configure AlwaysOn to fail over to SQL Server instances on different servers. This reduces the need for idle, high-performance standby servers to handle the full load of the primary server, which is one of the many benefits of availability groups.

Note:

Database issues, such as a database becoming suspect due to a loss of a data file, deletion of a database, or corruption of a transaction log, do not cause a failover.

Database mirroring provides database redundancy by keeping a mirrored copy of the databases that are on the primary (principal) database server on a separate mirror server. Mirroring is implemented on a per-database basis and only works with databases that use the full recovery model.

Note:

There are two mirroring operating modes. One of them, high-safety mode, supports synchronous operation. In high-safety mode, when a session starts, the mirror server synchronizes the mirror database with the principal database as quickly as possible. After the databases are synchronized, each transaction is written to the log on the mirror server and then replayed there. (Control returns to the principal server as soon as the transaction is hardened.) The other mode, high-performance mode, uses asynchronous operation to reduce transaction latency, at the cost of a greater risk of data loss.

For high-availability mirroring in a SharePoint farm, you must use high-safety mode with automatic failover. This configuration requires three server instances: a principal, a mirror, and a witness. The witness server enables SQL Server to automatically fail over from the principal server to the mirror server. Failover from the principal database to the mirror database typically takes several seconds.

The choice of a SQL Server technology for high availability and disaster recovery should be based on your organization's business goals for Recovery Point Objective (RPO) and Recovery Time Objective (RTO). Although RPO and RTO are typically associated with disaster recovery, some failure events are outside the scope of a disaster but require recovery from local backup media in the primary datacenter.

The following comparison covers three options in the same order for each characteristic: SQL Server failover clustering, AlwaysOn Availability Groups, and high-availability database mirroring.

Time to failover
Failover clustering: Cluster member takes over almost immediately after failure. A lag occurs while the cluster node spins up.
AlwaysOn Availability Groups: Replica takes over almost immediately after failure. A lag occurs while the secondary replica spins up.
Database mirroring: Mirror takes over as soon as the redo queue is processed.

Transactional consistency
Failover clustering: Yes
AlwaysOn Availability Groups: Yes
Database mirroring: Yes

Transactional concurrency
Failover clustering: Yes
AlwaysOn Availability Groups: Yes
Database mirroring: Yes

Time to recovery
Failover clustering: Shorter time to recover than an availability group.
AlwaysOn Availability Groups: Longer time to recover than a failover cluster, but faster recovery time than a mirrored solution.
Database mirroring: Slightly longer time to recover than cluster or availability group.

Steps required for failover
Failover clustering: Database nodes automatically detect a failure. SharePoint 2013 references the cluster so that failover is seamless and automatic.
AlwaysOn Availability Groups: The Availability Group listener automatically detects a failure and failover is seamless and automatic.
Database mirroring: The database automatically detects failure. SharePoint 2013 is aware of the mirror location, if it was configured correctly, so that failover is automatic.

Protection against failed storage
Failover clustering: The failover cluster itself does not provide data protection. The amount of data loss depends on the storage system implementation. For example, a SAN environment has redundant components such as multiple file paths, RAID, and hot spares.
AlwaysOn Availability Groups: Protects against failed storage because the primary replica writes to the local disks on the secondary replicas.
Database mirroring: Protects against failed storage because both the principal and mirror database servers write to local disks.

Storage types supported
Failover clustering: Requires shared storage, which is more expensive than dedicated storage.
AlwaysOn Availability Groups: Can use less expensive directly attached storage solutions.
Database mirroring: Can use less expensive directly attached storage.

Location requirements
Failover clustering: Members of the cluster must be on the same subnet. Note: This is not the case with SQL Server 2012.
AlwaysOn Availability Groups: Replicas can be on different subnets as long as latency does not cause performance issues.
Database mirroring: Principal, mirror, and witness servers must be on the same LAN (up to 1 millisecond latency round-trip).

Recovery model
Failover clustering: SQL Server full recovery model recommended. You can use the SQL Server simple recovery model. However, the only available recovery point if the cluster is lost will be the last full backup.
AlwaysOn Availability Groups: Requires SQL Server 2012 full recovery model.
Database mirroring: Requires SQL Server full recovery model.

Performance overhead
Failover clustering: Some decrease in performance may occur while a failover is occurring. The server will be unavailable during failover and connections are dropped and then established again on the new active node.
AlwaysOn Availability Groups: AlwaysOn Availability Groups introduce transactional latency because of synchronous commit on the secondary replicas. The amount of latency depends on the number of secondary replicas that have to be synchronized. Memory and processor overhead is greater than clustering, but less than mirroring.
Database mirroring: High-availability mirroring introduces transactional latency because it is synchronous. It also requires additional memory and processor overhead.

Operations overhead
Failover clustering: Set up and maintained at the server level.
AlwaysOn Availability Groups: The operational overhead is greater than clustering and mirroring. AlwaysOn requires overhead at the level of the SQL Server database server in addition to the Windows Server level. Note: Server-level objects such as logins and agent jobs must be maintained manually. If you add content databases, you have to add them to an availability group and then synchronize the primary replica to the secondary replicas. A SharePoint farm environment requires multiple configuration steps to make sure that the SharePoint 2013 connection string is correctly associated with the availability group listener name.
Database mirroring: The operations overhead is more than clustering. Must be set up and maintained for all databases. Reconfiguring after failover is manual. Note: Server-level objects such as logins and agent jobs must be maintained manually. If you add content databases, you have to add them to the principal and then synchronize the principal to the mirror.

Some enterprises have data centers that are located in close proximity to one another, connected by high-bandwidth fiber optic links. When this environment is available it is possible to configure the two data centers as a single farm. This distributed farm topology is called a "stretched" farm.

For a stretched farm architecture to work as a supported high-availability solution, the following prerequisites must be met:

There is a highly consistent intra-farm latency of <1ms (one way), 99.9% of the time over a period of ten minutes. (Intra-farm latency is commonly defined as the latency between the front-end web servers and the database servers.)

The bandwidth speed must be at least 1 gigabit per second.
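The two prerequisites above can be checked against measurements. The following Python sketch assumes that you have already collected one-way latency samples, in milliseconds, over a ten-minute window; how the samples are collected is outside this sketch, and the function names are hypothetical.

```python
def meets_latency_requirement(one_way_latencies_ms,
                              limit_ms=1.0, required_fraction=0.999):
    """True if at least 99.9% of the samples are below the 1 ms limit."""
    if not one_way_latencies_ms:
        raise ValueError("no latency samples")
    within_limit = sum(1 for sample in one_way_latencies_ms
                       if sample < limit_ms)
    return within_limit / len(one_way_latencies_ms) >= required_fraction


def meets_bandwidth_requirement(bandwidth_gbps, minimum_gbps=1.0):
    """True if the link provides at least 1 gigabit per second."""
    return bandwidth_gbps >= minimum_gbps


# Hypothetical window: 1,000 samples with a single slow measurement.
samples = [0.4] * 999 + [2.5]
print(meets_latency_requirement(samples))
print(meets_bandwidth_requirement(10))
```

Because the requirement is for highly consistent latency, a check like this is more useful when it is repeated across many windows than when it is run once.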

To provide fault tolerance in a stretched farm, use the standard best practice guidance to configure redundant service applications and databases.

Your high availability strategy must include the appropriate backup and restore operations to make sure that the SharePoint farm is resilient. When an incident such as a media failure or user error occurs, you must be able to restore the affected part of the farm environment or farm data in a timely manner. An effective backup and restore solution should enable you to meet the Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) that you define.
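As a minimal sketch of how a backup schedule relates to these objectives, the following Python example checks a plan against an RPO and an RTO. It assumes that the worst-case data loss equals the time between backups and that the outage lasts as long as the restore; the function name and all durations are hypothetical.

```python
from datetime import timedelta


def backup_plan_meets_objectives(backup_interval, restore_duration, rpo, rto):
    """Check a backup schedule against recovery objectives.

    Simplifying assumptions: the worst-case data loss equals the time
    between backups, and the outage lasts as long as the restore.
    """
    return backup_interval <= rpo and restore_duration <= rto


# Hypothetical objectives and schedule, for illustration only.
meets = backup_plan_meets_objectives(
    backup_interval=timedelta(hours=1),
    restore_duration=timedelta(hours=2),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=3),
)
print(meets)
```

Measured restore times from regular restore tests give a more reliable value for the restore duration than estimates do.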