Be Invincible with AlwaysOn Availability Groups

I've got bad news: Superman doesn't respond to calls for database help. I know, outages seem really important to you and the rest of the IT crew at the Daily Planet, but Clark Kent doesn't put the cape on for your kind of corruption.

If you need your SQL Server databases to be highly available, you're going to have to do it yourself. In the past, this was a classic good news/bad news scenario. The good news was that SQL Server had an abundance of features to help with high availability, such as clustering, mirroring, replication, and log shipping. The bad news was that businesses' needs were usually complex enough that no one feature worked by itself, so you had to duct tape multiple features together. This meant you had to be good at a lot of things, and all of them had to interoperate well. That rarely happened.

SQL Server 2012's new AlwaysOn Availability Groups feature purports to be all good news. To figure out whether it meets your needs for high availability and disaster recovery, let's start by defining four acronyms: DR, HA, RTO, and RPO.

Defining DR, HA, RTO, and RPO

If you're responsible for making your SQL Server environment highly available, you need to hand the worksheet shown in Figure 1 to the business stakeholders for each application on the database server. Ask them to fill it in. Here's how to explain the worksheet to them.

Figure 1: Worksheet for the Business Stakeholders

Disaster recovery (DR) is what you do when your data center is reduced to a smoking crater because Superman archenemy Lex Luthor launched a missile that hit Metropolis. This scenario gets all the hype, but in reality, businesses usually suffer from more frequent hiccups that are less widespread.

High availability (HA) means that your databases won't disappear completely when your primary database server goes bump in the night. HA refers to smaller, localized outages like a server's motherboard flaking out, a memory chip coming down with Alzheimer's, or Daily Planet photojournalist Jimmy Olsen tripping over the power cord.

For both DR and HA scenarios, you need to come up with two numbers: RPO and RTO. Well, actually, you won't be the one coming up with the numbers. The business stakeholders need to make that decision, but you're the one who has to ask the tough questions.

Recovery time objective (RTO) sets the countdown clock. It's the amount of time the business is willing to be down. The clock starts at the moment the problem happens, not the time you find out about it. The clock keeps ticking until the end users are able to do their jobs again. Common RTOs are zero, one minute, one hour, and one day.

The business stakeholders will probably want to pick a range between one hour and one day, but that's not really how technology works. A one-hour solution can include some manual work to fail over between systems, but the standby system must already be built and the data must already be restored. If you have to start the restore from scratch, you need a one-day threshold. You might think you can execute a restore from scratch in an hour, but what happens when the restore fails or someone runs a RESTORE WITH RECOVERY along the way? You'll blow your timeline. RTO isn't about the best-case scenario. It's about the worst-case scenario, and things have already started to go wrong. Remember, you'll be in a failure situation. This won't be your finest hour.

Recovery point objective (RPO) represents how much data the business is willing to lose in the event of an incident. Like RTO, RPO is measured in terms of time (and not data size). Common RPOs are zero (no data loss), one minute's worth of data loss, one hour's worth of data loss, and one day's worth of data loss.

After you have the RTOs and RPOs from the business stakeholders, you can pick the right technology to meet their needs—and give them a rough idea of what it'll cost.

When you're just starting to build your documentation, stick with the RTO and RPO numbers. After you've rehearsed, polished, and optimized your recovery plan, you can start getting fancy with goals such as an RTO of 3 hours and 45 minutes.

Matching Business Needs to SQL Server Features

If the business stakeholders picked an RTO of zero, meaning no downtime is allowed whatsoever, then your choice is simple: You must have two independent SQL Server machines up and running at all times, always accepting connections that can delete, update, and insert data. There's no other method to accomplish zero RTO because any failover technology will take at least a few seconds to fail over in the best-case scenario. For everything else, here's how you decide.

If you need an RTO of one minute, you need automatic failover with no human intervention. You just can't guarantee that you can detect a problem, alert a human being, get the human to a computer, and get the problem solved in less than 60 seconds. Automatic failover means that you have two systems with exactly the same data at all times. SQL Server won't automatically fail over to a database server that's out of sync with the primary server. This requires synchronous transaction commits across two servers.
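In T-SQL terms, that requirement looks something like the following sketch. The availability group and server names here (SalesAG, SQLPROD1, SQLPROD2) are hypothetical:

```sql
-- Both partners must be in synchronous-commit mode before SQL Server
-- will accept FAILOVER_MODE = AUTOMATIC, so set the availability mode
-- first. Run on the primary replica.
ALTER AVAILABILITY GROUP [SalesAG]
    MODIFY REPLICA ON 'SQLPROD1'
    WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);
ALTER AVAILABILITY GROUP [SalesAG]
    MODIFY REPLICA ON 'SQLPROD1'
    WITH (FAILOVER_MODE = AUTOMATIC);

ALTER AVAILABILITY GROUP [SalesAG]
    MODIFY REPLICA ON 'SQLPROD2'
    WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);
ALTER AVAILABILITY GROUP [SalesAG]
    MODIFY REPLICA ON 'SQLPROD2'
    WITH (FAILOVER_MODE = AUTOMATIC);
```

SQL Server enforces the dependency for you: a replica still in asynchronous-commit mode can't be given automatic failover, because failing over to it could silently lose committed transactions.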

If you can live with a longer RTO and some data loss, you can have a secondary server that's slightly out of sync with the primary server. You can afford to commit transactions immediately on the primary server and just send transaction log data over to the secondary server in the background. When disaster strikes, you get a human being involved, decide to fail over with data loss, and issue the necessary commands.

Business stakeholders usually want both synchronous and asynchronous capabilities in the same system. For HA, they want automatic failover with no data loss. However, if disaster strikes and Superman can't stop the missile headed at the data center, they want manual failover to a separate data center with some acceptable data loss. In the past, meeting both of these needs meant combining clustering and log shipping, combining clustering and mirroring, or combining mirroring and log shipping.

With SQL Server Availability Groups, you can now accomplish both needs by using one feature but with different settings on each replica. For example, in Figure 2, there are three SQL Server machines in the availability group. Two of them are in the primary data center, and the other one is in a secondary data center. The two servers in the primary data center use synchronous mode so that transactions are written to both the primary server and the nearby secondary server at the same time before the application gets the message back that the transaction has been committed. Later, when convenient for SQL Server, the transaction is copied across the WAN to the secondary data center.
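A sketch of what the Figure 2 topology looks like in T-SQL. All of the names and endpoint URLs here are hypothetical, and the mirroring endpoints are assumed to already exist on each instance:

```sql
-- SQLPROD1 and SQLPROD2 sit in the primary data center: synchronous
-- commit with automatic failover. SQLDR1 sits across the WAN in the
-- secondary data center: asynchronous commit, manual failover only.
CREATE AVAILABILITY GROUP [SalesAG]
FOR DATABASE [SalesDB]
REPLICA ON
    'SQLPROD1' WITH (
        ENDPOINT_URL      = 'TCP://sqlprod1.metropolis.local:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE     = AUTOMATIC),
    'SQLPROD2' WITH (
        ENDPOINT_URL      = 'TCP://sqlprod2.metropolis.local:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE     = AUTOMATIC),
    'SQLDR1' WITH (
        ENDPOINT_URL      = 'TCP://sqldr1.metropolis.local:5022',
        AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
        FAILOVER_MODE     = MANUAL);
```

One statement, one feature, two different protection levels: the application blocks only on the nearby synchronous partner, never on the WAN.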

Figure 2: Sample SQL Server Availability Group

If the primary server dies for some reason (e.g., has a hardware failure, Windows crashes), the nearby replica takes over immediately. If the whole primary data center dies, you can manually fail over to a replica in the secondary data center at the cost of some data loss.

Flexible to Meet Changing Business Needs

You know how it goes—three months after you build this big, complex system, the boss is going to change his mind about what he wants. He's going to tell you that he wants more protection, more speed, more scale-out reads, different hardware, virtual servers, or more cowbell.

An availability group can have five total SQL Server machines involved, and up to three of them can do synchronous commits. You can even make changes between synchronous and asynchronous on the fly without taking the primary replica down. Just step into a phone booth, tweak a few settings, and emerge stronger.
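For example, to dial the DR replica up to synchronous-commit mode ahead of planned maintenance in the primary data center, and back down afterward (names hypothetical again):

```sql
-- Promote SQLDR1 to synchronous commit; the primary stays online the
-- whole time, though commits will now wait on the WAN round trip.
ALTER AVAILABILITY GROUP [SalesAG]
    MODIFY REPLICA ON 'SQLDR1'
    WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);

-- After maintenance, return it to asynchronous commit.
ALTER AVAILABILITY GROUP [SalesAG]
    MODIFY REPLICA ON 'SQLDR1'
    WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);
```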

But There's Kryptonite, and It's Everywhere

Despite his superhuman speed, strength, and x-ray vision, Superman has to be on the lookout for kryptonite, an element that can bring him to his knees. Unfortunately, an availability group solution can be full of hidden kryptonite just waiting to bring you to your knees. Knowing about the following hidden dangers can go a long way toward preventing problems.

Even though you might not be using a failover cluster instance (FCI) in your availability group, the Windows boxes involved in the availability group all need the Windows clustering feature. Availability groups rely on clustering services for things like knowing which node is primary, passing the server name around, and making your life miserable. Let's start with the basics—quorum.

Superheroes, as powerful as they are individually, still belong to teams like the Justice League of America or the X-Men. The team members work together and watch out for each other. If one member gets into trouble, the rest take over and help out. That's how Windows clustering works. The individual nodes (servers) form a team (cluster), and the nodes keep checking on each other. If one server locks up, experiences a hardware failure, or gets radiation sickness from being scratched by Nuclear Man, a different node takes over and starts hosting the SQL Server instance.

While it's obvious when a superhero is in trouble, it's not so obvious when a server is in trouble. Back in Figure 2, there are two servers in the primary data center and one server (SQLDR1) all by its lonesome in the secondary data center. By default, all three of these servers have voting rights in the cluster, and you need a majority of voters online for the availability group to stay online. If you lose the network connection between the two data centers, SQLDR1 will be all by itself, unable to see the other two nodes. It'll see only one of the three voters (itself), which is a minority and not enough to establish quorum. It'll think there's been some sort of a disaster, so it'll take its availability group replica offline. SQLPROD1 and SQLPROD2 will both still see each other because they're in the same data center, and with their two votes, they'll still have a majority. They'll stay online, serving their databases.
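You can check the vote math from inside SQL Server. The sys.dm_hadr_cluster_members DMV reports each cluster member, its current state, and its vote count:

```sql
-- One row per cluster member (nodes, file share witness, disk witness).
-- Add up number_of_quorum_votes for members that are UP and compare it
-- to the total: no majority, no quorum.
SELECT  member_name,
        member_type_desc,
        member_state_desc,
        number_of_quorum_votes
FROM    sys.dm_hadr_cluster_members;
```

Run this before you sign off on a design, not after the outage: it's the fastest way to confirm which members actually hold votes.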

But even though SQLPROD1 and SQLPROD2 are still online with two out of three votes, it doesn't mean you're safe if you're using Windows Server 2008 R2 or earlier versions. If either server drops offline for any reason—whether it's a reboot, a Windows crash, or someone trying to do ill-timed Windows updates—the remaining server will just have one out of three votes, and it'll take itself offline. Windows Server 2012 recalculates quorum voting and adjusts for this scenario—and that's just one of many reasons why I recommend using Windows Server 2012 when implementing AlwaysOn Availability Groups. (For another reason, check out the Microsoft article "SQL Server 2012 service crashes when a replica SQL Server 2012 instance goes offline on a Windows Server 2008 R2-based failover cluster," which explains a failure scenario I ran into at StackOverflow.com.)

Let's take another typical failure scenario: an outage at the primary data center. If the primary data center loses power or the network switches fail, both SQLPROD1 and SQLPROD2 will drop offline. SQLDR1, which is all by itself with just one out of three votes, will think something's wrong with its network connection. SQLDR1 will take itself offline just when you need it the most. You'll have to force it online manually. Thus, you'll need to take care of some housekeeping when SQLPROD1 and SQLPROD2 come back online because they might have committed transactions that SQLDR1 never saw.
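Forcing SQLDR1 online is a two-step, eyes-open operation: first force the cluster node to start without quorum (for example, with the FailoverClusters PowerShell cmdlet Start-ClusterNode -FixQuorum), then force the availability group to fail over, explicitly accepting data loss. A sketch of the T-SQL half (group name hypothetical):

```sql
-- Run on SQLDR1 after forcing cluster quorum on that node.
-- FORCE_FAILOVER_ALLOW_DATA_LOSS does exactly what it says: any
-- transactions SQLDR1 never received from the primary are gone.
ALTER AVAILABILITY GROUP [SalesAG] FORCE_FAILOVER_ALLOW_DATA_LOSS;
```

This is the command you rehearse on a whiteboard long before you type it at 3 A.M.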

Here's another scenario to keep in mind. Suppose that you take down the network switches as part of planned data center maintenance, or even worse, they fail unexpectedly. The instant any replica loses its network connection to the others, that replica's databases go offline. This is dramatically different from database mirroring or log shipping, where SQL Server just says, "Alright, high availability isn't working, but I'll keep the databases online until things are okay again." Even the slightest network hiccup can cause rolled back transactions, failed SQL Server Agent jobs, failed backups, and more. When working with AlwaysOn Availability Groups, it's absolutely crucial that every server have at least two network ports plugged into two completely separate network switches, with the ports combined using NIC teaming software. Redundancy is important here.

Understanding the voting implications of the quorum configuration is the single most important aspect of designing an availability group solution. You need to know what will stay online when each component fails. You also need to document and rehearse how to react to each failure. When I'm building an availability group solution for a client, I whiteboard out all the parts and we talk through the symptoms of failures. If there's an IP address conflict for the cluster admin IP, what will it look like? How will we recognize the problem? What pieces of the solution will stop working? How will we fix the problem, and how long will it take?

Even Backups and DBCCs Are Complicated

AlwaysOn Availability Groups let you run full backups and transaction log backups on any replica, not just the primary. In SQL Server Management Studio (SSMS), you can right-click an availability group and set the backup preferences, such as backing up only the primary server, backing up only secondary servers, and even picking which secondary servers to back up. This means you have to set up backup jobs on every replica and leave the jobs enabled. Both maintenance plans and Ola Hallengren's excellent free backup scripts will check your priority preferences for each database when they run. If the scripts or maintenance plans are on the right replica to run backups for a given database, they'll back the database up. If not, they'll skip the database. Either way, the job will run successfully on each replica.
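Under the hood, both maintenance plans and Ola Hallengren's scripts lean on the sys.fn_hadr_backup_is_preferred_replica function to make that decision, and you can use it the same way in your own jobs. A sketch (the database name and backup path are hypothetical):

```sql
-- Runs on every replica; only the replica that matches the availability
-- group's backup preferences actually takes the backup. Everyone else
-- skips it, and the Agent job still reports success.
IF sys.fn_hadr_backup_is_preferred_replica('SalesDB') = 1
    BACKUP LOG [SalesDB]
        TO DISK = 'R:\Backup\SalesDB.trn';  -- hypothetical path
```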

However, it's up to you to check for success, and it's not easy. It's a three-step process. First, you have to make sure the jobs are actually configured correctly on each node. I've seen scenarios where all the jobs were running successfully according to the parameters they were given, but all of them had a variety of configuration errors. As a result, one set of databases was never backed up whenever those databases failed over to a certain node in the cluster.

Second, to get a holistic backup picture, you have to manually combine the msdb backup history data across all replicas. If you only query one replica, it'll report that the databases haven't been backed up, unless it happened to be the replica running the backups. This causes all off-the-shelf monitoring software to constantly report false alarms that say no backups have been run recently.
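There's no single server you can ask, because each replica's msdb records only the backups taken locally. A query like this, run on every replica and combined, gives the holistic picture (databases and thresholds will vary):

```sql
-- Run on each replica; union or collect the result sets centrally.
-- msdb.dbo.backupset only knows about backups this instance performed.
SELECT  @@SERVERNAME               AS replica_name,
        bs.database_name,
        bs.type                    AS backup_type,  -- D = full, L = log
        MAX(bs.backup_finish_date) AS last_backup_finished
FROM    msdb.dbo.backupset AS bs
GROUP BY bs.database_name, bs.type;
```

The most recent backup for a given database, across all replicas, is the number your monitoring should alarm on.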

Finally, you need to check for corruption. This might seem simple—you just need to check the primary replica, right? After all, that's where the writes are taking place. Unfortunately, that isn't enough. You also have to check the replica performing the backups. If that replica experiences storage corruption, you could be backing up corrupt data. When you need to restore it, you won't have the primary replica around to give you clean pages. You have to know that you're backing up good data, and the only way to do that is to use DBCC statements to check the replica doing the backups.
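A sketch of that check, run on the replica doing the backups (database name hypothetical). Note that running DBCC CHECKDB directly against a secondary requires it to be configured as a readable secondary; otherwise, restore the backup somewhere and check the restored copy:

```sql
-- Check the copy of the data you're actually backing up.
-- NO_INFOMSGS cuts the noise; ALL_ERRORMSGS shows every error found.
DBCC CHECKDB ([SalesDB]) WITH NO_INFOMSGS, ALL_ERRORMSGS;
```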

Even Superheroes Need Discipline

AlwaysOn Availability Groups can help you meet business needs without leaping tall buildings in a single bound. They're chock full of power, but in order to use that power wisely, you have to do a lot of planning, testing, and documentation. Your next step is to check out my free AlwaysOn Availability Groups installation checklist. After that, it's time to get fitted for a cape.
