Monday, August 14, 2017

Taking the Availability Group Database Level Health Detection option for a spin

With SQL Server 2016, Availability Groups now include the option to enable Database Level Health Detection. While this option is turned off by default, Microsoft strongly recommends you enable it on all Availability Groups; it is off by default only for backwards compatibility, so enabling it is an opt-in choice.

In this post I will take a deep look into exactly what sort of database issues will cause the AG to fail over with this option enabled.

For this walkthrough I am using the SQL Server AlwaysOn deployment from the Azure Marketplace, and I have provided a link to all the scripts below so you can try out this functionality and show off your skills in your own demonstrations. If you don't have an Azure subscription, all you need is an environment with two replicas and multiple disks mounted in the virtual machines (so you can separate the transaction logs of the different databases).

To follow along with the scripts and screenshots, here are the details of the resources within my test lab.

To make this demo easier, I have created a file share on the File Share Witness to store backups. It is best practice when using Availability Groups to use a central location for storing backups from all replicas. Obviously a File Share Witness is not that place, but for this demo lab it is fine.

Central Backup Share: \\cluster-fsw\sqlbackups

All the scripts used in this post are located here. This link is also contained within my final thoughts at the end of this post.

OK, let's get started.

1. Create the demo databases
To enhance this demo, we will set up new databases and an availability group to show how this setting impacts only one AG and not another.

Run the following T-SQL to set up the demo databases. NOTE: I have explicitly put the transaction log for SuspectDb on a different volume to CorruptDb's.
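The creation script itself isn't reproduced in this excerpt; here is a minimal sketch of what it might look like. The file paths, file names, sizes, and the shape of the DemoData table are assumptions for illustration (F:\DATA matches the data-file path seen in the error message later, and G:\ is the separate log volume taken offline in step 7); adjust them for your lab.

```sql
-- Sketch only: paths, names, and table shape are assumptions for this lab.
USE [master];
GO
-- CorruptDb keeps data and log together on the F: volume.
CREATE DATABASE [CorruptDb]
ON PRIMARY (NAME = N'CorruptDb_data', FILENAME = N'F:\DATA\CorruptDb_data.mdf')
LOG ON     (NAME = N'CorruptDb_log',  FILENAME = N'F:\DATA\CorruptDb_log.ldf');
GO
-- SuspectDb's transaction log is deliberately placed on a separate volume (G:)
-- so that volume can be taken offline later without affecting CorruptDb.
CREATE DATABASE [SuspectDb]
ON PRIMARY (NAME = N'SuspectDb_data', FILENAME = N'F:\DATA\SuspectDb_data.mdf')
LOG ON     (NAME = N'SuspectDb_log',  FILENAME = N'G:\LOG\SuspectDb_log.ldf');
GO
-- Databases must be in FULL recovery and have a full backup
-- before they can join an Availability Group.
ALTER DATABASE [CorruptDb] SET RECOVERY FULL;
ALTER DATABASE [SuspectDb] SET RECOVERY FULL;
GO
BACKUP DATABASE [CorruptDb] TO DISK = N'\\cluster-fsw\sqlbackups\CorruptDb.bak' WITH FORMAT;
BACKUP DATABASE [SuspectDb] TO DISK = N'\\cluster-fsw\sqlbackups\SuspectDb.bak' WITH FORMAT;
GO
-- A simple table to generate reads and writes against (assumed shape,
-- matching the INSERT used later in the post).
USE [SuspectDb];
GO
CREATE TABLE dbo.DemoData (id INT IDENTITY PRIMARY KEY, demoData NVARCHAR(200));
GO
USE [CorruptDb];
GO
CREATE TABLE dbo.DemoData (id INT IDENTITY PRIMARY KEY, demoData NVARCHAR(200));
INSERT INTO dbo.DemoData (demoData) VALUES ('Seed row so there is a data page to corrupt');
GO
```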

After that step you should have an Availability Group that looks like this:

On your primary replica you should have the following databases

On the secondary replica you should only have the SuspectDb for now. We still need to do some work to setup the CorruptDb.

If you do not have SuspectDb on the secondary replica after joining the replica to the AG, then the Automatic Seeding option may not have succeeded. The script provided contains the required steps to restore and join the database on the secondary replica.

3. Verify Database Level Health Detection option is enabled
Open the properties of the AG and make sure the option is checked.

Alternatively, check that the value of db_failover is 1 in sys.availability_groups.
[04.VerifyAg.EnableDBHealthLevelDetection.sql]

--- YOU MUST EXECUTE THE FOLLOWING SCRIPT IN SQLCMD MODE.
:Connect SQLSERVER-0
USE [master]
GO
SELECT name, db_failover FROM sys.availability_groups
GO

If you need to enable the setting either use the GUI or run 03.AlterAG.EnableDBHealthLevelDetection.sql

--- YOU MUST EXECUTE THE FOLLOWING SCRIPT IN SQLCMD MODE.
:Connect SQLSERVER-0
USE [master]
GO
ALTER AVAILABILITY GROUP [DbHealthOptDemoAg] SET(DB_FAILOVER = ON);
GO

4. Corrupt the demo database

The first thing we are going to look at is what happens when you have page-level corruption, and whether corruption is really corruption as far as health detection is concerned. Full credit to Paul Randal for an existing blog post on how to do this. See https://www.sqlskills.com/blogs/paul/dbcc-writepage/

Because we are going to use DBCC WRITEPAGE and need the database to be in single-user mode, we have not yet been able to add it to the Availability Group. This won't impact the demo though, as the integrity of data pages is not validated when a database is brought online or added to an availability group.

Run the following statements to corrupt the database. WARNING!!! Use this at your own risk, and never ever do this in a production environment.
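The corruption script itself isn't shown in this excerpt; here is a hedged sketch of the approach from Paul Randal's post. The page ID (1:320), offset, and table name are assumptions taken from the error message below; in your lab, find a data page belonging to your table first (e.g. with the undocumented DBCC IND, or sys.dm_db_database_page_allocations on SQL Server 2012+).

```sql
-- DANGER: demo labs only. DBCC WRITEPAGE is undocumented and unsupported.
USE [master];
GO
-- Writing directly to disk (last parameter = 1) requires single-user mode.
ALTER DATABASE [CorruptDb] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
GO
-- Overwrite 2 bytes of page (1:320) with garbage, bypassing the buffer pool,
-- so the page checksum on disk no longer matches the page contents.
DBCC WRITEPAGE (N'CorruptDb', 1, 320, 4000, 2, 0x0000, 1);
GO
ALTER DATABASE [CorruptDb] SET MULTI_USER;
GO
-- Any read that touches the corrupted page now raises a Msg 824
-- logical consistency-based I/O error.
USE [CorruptDb];
GO
SELECT * FROM dbo.DemoData;
GO
```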

Msg 824, Level 24, State 2, Line 27
SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x3ea8609e; actual: 0x3ea8259e). It occurred during a read of page (1:320) in database ID 6 at offset 0x00000000280000 in file 'F:\DATA\CorruptDb_data.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

5. Add the database into the Availability Groups
Now that we have a corrupted database, let's add it to the availability group using the following statements.

[06.AddCorruptDbToAG.sql]

--- YOU MUST EXECUTE THE FOLLOWING SCRIPT IN SQLCMD MODE.
-- Add the database to the AG on the Primary
:Connect SQLSERVER-0
USE [master]
GO
ALTER AVAILABILITY GROUP [DbHealthOptDemoAg]
MODIFY REPLICA ON N'SQLSERVER-1' WITH (SEEDING_MODE = AUTOMATIC)
GO
USE [master]
GO
ALTER AVAILABILITY GROUP [DbHealthOptDemoAg]
ADD DATABASE [CorruptDb];
GO
-- Make sure the secondary is set to Auto Seed with CREATE DB permissions
:Connect SQLSERVER-1
ALTER AVAILABILITY GROUP [DbHealthOptDemoAg] GRANT CREATE ANY DATABASE;
GO
-- If auto seeding doesn't work automatically, check the logs: data and log files
-- left over from previous demos may prevent the seeding from succeeding.
-- The following statements can be used if auto seeding doesn't run (once the issue is resolved)
:Connect SQLSERVER-0
BACKUP DATABASE [CorruptDb] TO DISK = N'\\CLUSTER-FSW\SQLBACKUPS\CorruptDb_addtoAG.bak' WITH FORMAT;
GO
:Connect SQLSERVER-1
RESTORE DATABASE [CorruptDb] FROM DISK = N'\\CLUSTER-FSW\SQLBACKUPS\CorruptDb_addtoAG.bak' WITH NORECOVERY;
GO
:Connect SQLSERVER-0
BACKUP LOG [CorruptDb] TO DISK = N'\\CLUSTER-FSW\SQLBACKUPS\CorruptDb_addtoAG.trn' WITH FORMAT;
GO
:Connect SQLSERVER-1
RESTORE LOG [CorruptDb] FROM DISK = N'\\CLUSTER-FSW\SQLBACKUPS\CorruptDb_addtoAG.trn' WITH NORECOVERY;
GO
ALTER DATABASE [CorruptDb] SET HADR AVAILABILITY GROUP = [DbHealthOptDemoAg];
GO

6. Reviewing the behavior of a corrupted database in Availability Groups
Now you should have both databases in the Availability Group. Open the Availability Group Dashboard and take note that everything is in a healthy state.

.... but wait. Didn't we turn on Database Level Health Detection for the AG and corrupt the database? Run a statement to force SQL Server to generate the logical consistency I/O error like before.
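The statement isn't reproduced in this excerpt; any read that touches the corrupted page will do. A sketch, assuming the demo table is dbo.DemoData (run in SQLCMD mode like the other scripts):

```sql
--- YOU MUST EXECUTE THE FOLLOWING SCRIPT IN SQLCMD MODE.
:Connect SQLSERVER-0
USE [CorruptDb]
GO
-- A scan that reads the corrupted page raises the Msg 824 error again.
SELECT * FROM dbo.DemoData;
GO
```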

But did this cause a failover? Check the state of the Availability Group either within Object Explorer or within the dashboard.

Notice that this didn't cause a failover. That is because the error is not significant enough to affect the database state. If this database were not in an Availability Group it would still remain online (as it was when we corrupted it). Database Level Health Detection only triggers when an event is significant enough to affect the database state (e.g. Offline, Suspect, etc). To quote the official documentation: "The database level health detection notices when a database is no longer in the online status".

So the lesson at this point is that database corruption, at least page-level corruption, is not enough to trigger a failover. Even running DBCC CHECKDB won't trigger it (HINT: give that a go for yourself). So the recommendation here is that you should run DBCC CHECKDB regularly on your secondaries and fail over frequently (thus running consistency checks on all replicas regularly).
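If you want to try that hint for yourself, a minimal check looks like this. Expect consistency errors to be reported for CorruptDb, but the database (and the AG) to stay online:

```sql
--- YOU MUST EXECUTE THE FOLLOWING SCRIPT IN SQLCMD MODE.
:Connect SQLSERVER-0
-- Reports the checksum failure, but does not change the database state,
-- so Database Level Health Detection does not trigger a failover.
DBCC CHECKDB (N'CorruptDb') WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO
```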

7. Simulate Suspect database state
So now let's look at what happens when something significant enough occurs to affect the database status. The easiest way to simulate this would be to use the ALTER DATABASE statement; however, once a database is in an Availability Group, there are many statements the database engine will not allow you to execute, in order to protect the state of the database.

So the way we can replicate this is to simulate an I/O failure.

Remember that when we created SuspectDb on SQLSERVER-0, we set the transaction log's path to a different disk/partition to the other databases.

Open Disk Management on the primary node SQLSERVER-0

Locate the disk hosting the G:\ volume (or whichever volume stores the SuspectDb transaction log file). Right-click the disk and select Offline.

In SQL Management Studio, refresh the Availability Groups branch of Object Explorer for SQLSERVER-0. Notice that it is still the primary replica.

At this stage we haven't caused an I/O operation against the log (in our lab there are no maintenance tasks backing up the transaction logs). So let's insert some data to force the transaction log to be accessed.

[08.WriteToSuspectDbTransLog.sql]

--- YOU MUST EXECUTE THE FOLLOWING SCRIPT IN SQLCMD MODE.
:Connect SQLSERVER-0
USE [SuspectDb]
GO
INSERT INTO [dbo].[DemoData] (demoData) VALUES ('Data to push through the TransLog');
GO

Bingo! We generated an I/O error significant enough to affect the database status.

Msg 945, Level 14, State 2, Line 3
Database 'SuspectDb' cannot be opened due to inaccessible files or insufficient memory or disk space. See the SQL Server errorlog for details.

But what happened to the Availability Group? Refresh Object Explorer for SQLSERVER-0, and notice the status of the database and the Availability Group role.

So it looks like we failed over, but only for DbHealthOptDemoAg, because that was the only AG where a database status changed. Let's check the state of our Availability Group post-failover. Switch to SQLSERVER-1 in Object Explorer and expand Availability Groups. Open the dashboard for the Availability Group.
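You can also confirm the new roles from T-SQL rather than the dashboard. A sketch (run in SQLCMD mode like the other scripts):

```sql
--- YOU MUST EXECUTE THE FOLLOWING SCRIPT IN SQLCMD MODE.
:Connect SQLSERVER-1
-- Lists each replica of each AG with its current role;
-- SQLSERVER-1 should now show as PRIMARY for DbHealthOptDemoAg.
SELECT ag.name AS ag_name,
       ar.replica_server_name,
       ars.role_desc,
       ars.operational_state_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar ON ars.replica_id = ar.replica_id
JOIN sys.availability_groups  AS ag ON ars.group_id  = ag.group_id;
GO
```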

This confirms we have failed over, with SQLSERVER-1 now the primary replica, due to the I/O error experienced on SQLSERVER-0 from a storage subsystem failure (loss of disk). The Database Level Health Detection setting worked. Without that setting enabled, this would not have caused a failover.

My final thoughts

The Database Level Health Detection setting is a great new capability for increasing the availability of the databases in an Availability Group when one of them experiences an issue significant enough to affect its status, like a loss of disk. It still will not protect you from certain data issues like corruption, so monitoring and maintenance are still critical!

Finally, you should also consider the Flexible Automatic Failover Policy. While I do not recommend changing it blindly, you should ensure you understand what role that setting plays in automatic failover.
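To see the policy settings involved, you can query them alongside db_failover (a sketch; failure_condition_level defaults to 3 and health_check_timeout to 30000 ms):

```sql
-- One row per AG: the flexible failover policy settings
-- plus the Database Level Health Detection flag.
SELECT name,
       failure_condition_level,
       health_check_timeout,
       db_failover
FROM sys.availability_groups;
GO
```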

Legal Stuff: The contents of this blog is provided “as-is”. The information, opinions and views expressed are those of the author and do not necessarily state or reflect those of any other company with affiliation to the products discussed. This includes any URLs or Tools. The author does not accept any responsibility from the use of the information or tools mentioned within this blog, and recommends adequate evaluation against your own requirements to measure suitability.

