Who is it for?

Administrators charged with diagnosing and resolving Database Replication Service (DRS) problems.

How does it work?

This guide helps determine the type of DRS problem you are experiencing, explains how the processes work and offers troubleshooting suggestions for some of the most common problems.

Estimated time of completion:

15-30 minutes.


Getting Started

When faced with a Database Replication Service (DRS) problem in Microsoft Configuration Manager, the beginning investigative phase is the most critical point. Any type of change or fix should be made only after careful study and understanding of the problem at hand.

Start by gathering information related to the history of the problem. Many times DRS problems can ultimately be traced back to a recent change made in the environment. Keep in mind that you should not focus solely on Configuration Manager, as changes to Windows or SQL Server can cause DRS problems as well. Having a clear understanding of any recent changes in the environment can provide important clues as to the source of the problem.

Note

You can find the latest Configuration Manager builds and updates here.

Once you’ve investigated environmental changes and made sure that your updates are in order, the next step is to run the Replication Link Analyzer (RLA). To launch RLA, open the Monitoring workspace, click the Database Replication node, then right-click the link that is having a problem and select Replication Link Analyzer.

Note

RLA runs in the context of the user who launches it from the console, so be sure that the account you use has administrative privileges on both the SQL Server and the site servers.

RLA will check the following on both sites:

- The SMS service is running
- The SMS Replication Configuration Monitor component is running
- The ports required for SQL Server replication are enabled
- The SQL Server version is supported
- The network is available between the two sites
- There is enough space for the SQL Server database
- The SQL Server Service Broker configuration exists
- The SQL Server Service Broker certificate exists
- Whether there are known errors in the SQL Server log files
- Whether the replication queues are disabled
- Whether time is in sync between the sites
- Whether the transmission of data is stuck
- Whether a key conflict exists

If RLA finds known problems, it will offer to try to fix them for you.

The RLA output report is also very straightforward: it tells you what it checked, which rules were run, and whether each rule passed or failed.


Getting details with spDiagDRS

Assuming Replication Link Analyzer was unable to detect and resolve the problem, run the spDiagDRS stored procedure to see if it offers any clues as to what may be failing.

To run spDiagDRS, open SQL Server Management Studio and connect to the servers on each side of the link having the problem. Against each CM_xxx database, run EXEC spDiagDRS.

The following is a breakdown of the various spDiagDRS sections and some common places to look for problems. A simple search for the error messages and codes found here will often guide you to the source of the problem.
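As a minimal sketch, the procedure is run in a query window like this (CM_PRI is a placeholder; substitute your actual CM_<sitecode> database name):

```sql
-- Run against the site database on each side of the failing link.
-- CM_PRI is a placeholder for your CM_<sitecode> database name.
USE CM_PRI;
EXEC spDiagDRS;
```

Run it on both databases so you can compare the two sides of the link section by section.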

Section 1

SiteStatus: This tells us whether the site is replicating or not. Anything other than ACTIVE is not good.

CertificateThumbprint: The thumbprint of the certificate used for authentication that contains the site’s public key (Local DB trusts remote DB).

Section 2

IncomingMessageInQueue: This tells us the incoming backlog that a site has. If the backlog is high due to the number of sites reporting to it, you may see the links going to a degraded or failed state because the heartbeat synchronizations are not processed in time.

OutgoingMessageInQueue: This tells us the backlog that has yet to clear as we wait for the sites to receive the messages. This generally fluctuates, however if it continues to grow then this can represent a problem. Further troubleshooting should be performed to determine which site is not getting the messages.
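If you want to look at the backlogs directly, you can query the standard SQL Server Service Broker catalog views against the site database. This is a generic Service Broker sketch, not a Configuration Manager-specific API:

```sql
-- Messages waiting in each local Service Broker queue (incoming backlog).
SELECT q.name AS QueueName, SUM(p.rows) AS MessageCount
FROM sys.service_queues AS q
JOIN sys.internal_tables AS it ON it.parent_object_id = q.object_id
JOIN sys.partitions AS p ON p.object_id = it.object_id AND p.index_id = 1
GROUP BY q.name
ORDER BY MessageCount DESC;

-- Messages that have not yet been delivered to a remote site (outgoing backlog).
SELECT COUNT(*) AS PendingOutgoing FROM sys.transmission_queue;
```

A steadily growing count in either query over repeated runs points to the same stuck-backlog condition spDiagDRS reports.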

Section 3

This is simply the detailed view of the "Initialization Detail" in the console.

Section 4

This is the detailed view of "Replication Detail" in the console. This gives more information about the flow between each replication group.

Section 5

This section has some important and useful information about the sites we are connecting TO. In this example we are on Primary site server 002 and 001 is the CAS. If we had a secondary site under 002 it would be shown here. On a CAS, all the Primaries would be reflected but not Secondaries.

Primary Site 002 example:

CAS Site 001 example:

Section 6

This provides the general information of the sites in the hierarchy, the SiteServerName and DBServer names, as well as the status and version. You can see here that a different primary site, 003, is showing as being in Maintenance Mode. On working systems, Section 6 should be identical between the CAS and all Primaries in the hierarchy.

Section 7

The two bottom sections contain detailed information on the heartbeat (LastSentStatus) for each group, the conversation IDs, and the built-in replication options configured for each group.
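The conversation state behind those conversation IDs can be inspected with a standard Service Broker dynamic management view; again, this is generic SQL Server rather than a spDiagDRS feature:

```sql
-- Summarize Service Broker conversations by state; a large number of
-- conversations stuck in states such as ERROR warrants a closer look.
SELECT state_desc, COUNT(*) AS Conversations
FROM sys.conversation_endpoints
GROUP BY state_desc
ORDER BY Conversations DESC;
```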


Check RCMCtrl.log for errors

Next you will want to check RCMCtrl.log on each site for errors, as this will often provide valuable clues regarding the source of the problem. For example, you may find that replication is in a Failed state for a site and that replication has not occurred for some time. In this scenario, you may find that RCMCtrl.log contains entries similar to the following:

If you see entries similar to these, make sure that the SMS Executive and Site Component Manager services are running on the site in question; if they are stopped, that may be why replication is in a Failed state. Start the stopped services manually, and troubleshoot them if they fail to start.

Another example of an error you might find in RCMCtrl.log is the following:

What’s happening here is that the drive ran out of space while the despooler was unpacking the .CAB file sent from the parent site, so it was only able to uncompress some of the files. If you view despool.log, it will contain a 2147024784 (0x80070070) failure, which indicates insufficient disk space. To resolve this type of issue, simply free up disk space on the drive.
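If the SQL Server instance shares the drive in question, one quick convenience check from a query window is the xp_fixeddrives extended stored procedure. Note that it reports free space only for the fixed drives on the SQL Server machine, so it is not a substitute for checking the site server's drives directly:

```sql
-- Free space (in MB) on each fixed drive of the SQL Server machine.
EXEC master.dbo.xp_fixeddrives;
```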


Check RCM.log for errors

After checking RCMCtrl.log you’ll next want to take a look at RCM.log.

You may find that replication is in a Failed state for a site and that replication has not occurred for some time. In this scenario, you may find that RCM.log contains entries similar to the following:

If you see entries similar to these, make sure that the SMS Executive and Site Component Manager services are running on the site in question; if they are stopped, that may be why replication is in a Failed state. Start the stopped services manually, and troubleshoot them if they fail to start.

Another example of an error you might find in RCM.log is the following:

What’s happening here is that the drive ran out of space while the despooler was unpacking the .CAB file sent from the parent site, so it was only able to uncompress some of the files. If you view despool.log, it will contain a 2147024784 (0x80070070) failure, which indicates insufficient disk space. To resolve this type of issue, simply free up disk space on the drive.


Check Despool.log for errors

After RCM.log, you should next check Despool.log for relevant information and error messages. For example, if you have packages that are not being distributed from a primary site to a secondary, you will typically see an error similar to the following in Despool.log:

The key here is "Despooler failed to execute the instruction, error code = 12". This means that the package has reached the distribution point, however the instructions to process it were not received at the secondary site.

The easiest fix is to reinitialize the Secondary Site Data replication group by doing the following:

1. Create a blank text file named 'Secondary Site Data.pub'.
2. Copy the file to the RCM.box of the secondary site server.

This will force replication between the primary and secondary site for the secondary site data replication group.
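To confirm that the reinitialization actually kicked off, you can watch the initialization state per replication group in the site database. The RCM_DrsInitializationTracking table is commonly referenced in DRS troubleshooting, but treat its exact schema as version-dependent:

```sql
-- Check reinitialization progress for replication groups on this site.
-- Column layout varies by Configuration Manager version, so SELECT *
-- is used here rather than guessing at specific column names.
SELECT * FROM RCM_DrsInitializationTracking;
```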


Check for BCP Problems

If you still haven’t found the source of the problem, it could be that the replication process was interrupted because the bulk copy program (BCP) transfer was running too slowly. Check whether the sender is throttled to this site, since throttling can slow down the BCP transfer: open the console, go to Administration -> Overview -> Hierarchy Configuration -> File Replication, then right-click the site that would be sending the data. Verify that the schedule availability is set to "Open for all Priorities" and that Rate Limits is set to "Unlimited to this Site".

If things are working but the data set from the BCP process is large and taking a long time to send, you can increase the number of sender threads to speed things up. By default, a sender uses a maximum of 5 concurrent sendings to all sites and 3 to any single site. If your sender log consistently reports "no more threads available", "Using 5 of 5", or "Using 3 of 3", this is a good indication that you may want to increase the sender threads.

Note

If increased, the setting takes effect immediately; no restart is required.

Also be aware that if you have a rate limit set to “Limited to specified maximum transfer rates by hour”, Configuration Manager will use only one sender thread at a time when transferring to that site, regardless of the sender thread setting. The default setting of “Unlimited when sending to this destination” will use all the configured sender threads.


Congratulations

Congratulations! Your Database Replication Service problem has been resolved.