SQL Server Always ON and Change Data Capture: Friends or Foes?

SQL Server Always ON and Change Data Capture: Friends or Foes?

I talked about Always ON and Change Data Capture in one of the Open Talks at recently concluded SQLServerGeeks Annual summit – 2016. The topic of discussion was Always ON and Change Data Capture: Friends or Foes? I thought its worth a write as well because this combination is very less talked about.

Let me tell you a short story about how and why I chose the title for my talk.

We have a two node windows cluster in our environment. SQL Server 2012 Enterprise Edition is installed on both the nodes as standalone and both these nodes are a part of Always ON configuration. Always ON is configured in synchronous-commit mode which means that the transactions wait to send the transaction confirmation to the client until the secondary replica has hardened the log to disk. There is a database in this Always On configuration which has CHANGE DATA CAPTURE (abbreviated as CDC) feature enabled on it. Simple and straight till now.

One day we got a requirement to introduce another replica in Always ON configuration (Don’t ask me the reason, that doesn’t matter for this post). We followed basic steps. Another windows server was built and made a part of existing cluster. Then on a Friday afternoon, we started SQL Server related stuff with the weekend already on our minds. SQL Server 2012 was installed as standalone on new node and it was enabled for Always ON (By right clicking on SQL Server service and enabling Always ON, We all know this. Right?). This new node was added as a third replica in current Always ON configuration. While doing this I was continuously keeping a close eye on the clock at the bottom right of my laptop and I was delighted to see that this is nearly the end of my shift and the weekend is about to begin. I left the rest of the things like restoring the database, syncing it with other replicas, moving logins, jobs etc. pending for Monday.

Before the mysterious Monday begins, let me tell you how we use CDC. In our environment the production data gets inserted in to SQL Server’s CDC enabled database. We have a database in another RDBMS (other than SQL Server) which we initially sync’d with SQL server’s CDC enabled database using a manual process. The clients see the data from other RDBMS. This was implemented to separate out Read and write activity. Obviously there could have been other methods to achieve this but that’s a different discussion. In order to make sure that the data in the other RDBMS remains in sync with SQL server, a process is defined which queries SQL Server’s CDC tables using function cdc.fn_cdc_get_net_changes to see what changes those tables have and then it applies those changes to other RDBMS real time. So both the RDBMS’s always remain in sync with each other. A job has been scheduled to send emails if these RDBMS’s become out of sync.

So finally the Monday begins and I had no realization that the storm was brewing. I was still under weekend’s hangover when I logged on to my system. Hangover was torn apart in to pieces by a plethora of emails I saw in my Inbox. What did the emails say? SQL Server is out of sync with other RDBMS and the lag was huge. Clients were screaming and my desk soon looked like a honeycomb surrounded by bees.

What could have happened? I had no idea so I started to investigate. I let my memory rewind to think about Friday and I recalled that the only thing changed was the addition of third replica to Always ON configuration. Did this set the house on fire? Yes, indeed. Below is why:

One thing I did not do after adding the third replica was I did not bring the databases on third replica in sync with other replicas. I did not restore any databases on Friday. So what? Well, the thing to note about Change Data Capture is that it uses Transactional replication’s log reader agent. By default the log reader will not process log records that have not already been hardened at all availability group secondary replicas. So because the database was not restored on the new secondary instance, the logs were not able to harden, log reader agent was not able to process the records and hence Change Data Capture stopped working. This phenomenon is explained in detail here.

What this link also talks about is the failover situation in Synchronous and Asynchronous commit mode. If there is a catastrophic event, primary replica node gets compromised and the databases failover to secondary replica. As the old primary is not available, the logs will not be able to harden on old primary (new secondary) so Change Data Capture will not work. If you have Always ON configured in Asynchronous-Commit mode, Then you can enable trace flag 1448 and log reader will keep processing the log records. This trace flag has no effect if the Always ON is configured in synchronous-commit mode. In order to make sure log reader keeps on processing the log records in synchronous-commit mode, you have to take the database out of availability group till you fix the old primary node or remove the faulty replica from Always ON configuration.

Everything said above to handle failover situation is manual but there needs to be a way to automate this stuff so that it can be handled without waiting for human intervention. I wrote a stored procedure to automate all this. This stored procedure is scheduled to run every 5 mins via a SQL Server Agent job on all the Always ON replicas. This will ensure that in the event of any issue with Always ON, Change Data Capture continues to work. The procedure accepts two parameters which are Availability Group name and Database name (which is CDC enabled database for our purpose). It is assumed that the CDC capture and clean up jobs are already created on all the replicas. This is a very basic stored procedure and this can be improved with more additions based upon the requirement. The definition of this Stored Procedure is as given below:

Below are the details of what this Stored procedure does:
1. The values of current replica role, Availability Group Id and Availability Mode are assigned to defined variables.
2. If the replica role is Primary, It checks the state of secondary replica as seen from primary.
a. If secondary replica is not healthy (it checks the state of secondary replica 5 times in 3 minutes interval)
a1. It checks how many databases are not in sync. If there are any databases out of sync, it sends an email saying there are databases out of sync.
a2. Then it looks for the CDC database that was passed as parameter.
->If CDC enabled database is out of sync and availability mode is asynchronous, it enables Trace Flag 1448 and sends related email.
->If CDC enabled database is out of sync and availability mode is synchronous, it removes the database from availability group and sends related email
b. If secondary replica is healthy, it checks the status of capture and cleanup job on primary.
b1. If the jobs are disabled, it enables those and starts the capture job
b2. If the jobs are enabled, it checks whether capture is job is running, if not running , it starts the capture job.

3. If the replica role is secondary, It checks the status of capture and clean up job. If the jobs are enabled, it disables the jobs and stops the capture job if it is running.

So, the bottom line is that Always ON and Change Data Capture are not Friends inherently but you can be a mediator to help them shake hands and hug each other.

And, I accept that we missed a very basic step in the excitement of weekend which caused all this blunder but lessons learnt from such outages sit in memory for long (longer than book reading). Waiting for the next one to knock the door