Troubleshooting DRS in Configmgr

In this article, I will attempt to peel back some of the mystery that goes on with replication in Configmgr and provide some additional help when it comes to resolving replication issues, and they will occur.

The Basics of Replication in Configmgr

Configmgr underwent a shift several years ago as Microsoft tried to fix that backlog issue that customers often times encountered with the file based replication style. The hierarchy architecture was flattened and replication went from files to SQL, for the most part at least. This is well covered on TechNet so I will spare you the details.

There are some additional basics of replication that should be stated for completeness of the topic.

Configmgr does not use the built-in SQL replication it uses components of SQL and manages the replication of data on its own. The easiest way I can state this is, the product team coded their own replication into the product, it does not use the built in features of Microsoft SQL Replication.

There are mainly two types of data replication between sites, Global data and Site data. Global data is replicated from the CAS downward, while site data is the opposite.

Global data includes data such as collection rules.

Site data includes things like collection results and client data.

Each data type contains several replication groups that logically groups data from different tables together. These replication groups are classified as global and site replication groups.

select * from vReplicationData is a query to display all replication groups.

If you want to see what replication groups are included as part of global or site data, simply modify the previous query by specifying the type of data, similar to these.

select * from vReplicationData where Replicationpattern = ‘global‘

select * from vReplicationData where Replicationpattern = ‘site‘

Want to see what data is part of that replication group from the output of the previous two queries? The first column of the output should be ID, simply select the matching ID from the replication group you want to investigate and use the following query. Replace 30 with the ID number you are interested in.

select * from vArticleData where ReplicationID = 30

ID 1 – 14 is global data, 15 is cloud data, 16 – 30 is for site data, 31 – 33 is for secondary site replication data, beyond 33 is typically more site data replication groups.

Since we are almost to the real meat of what data is being sent why not look and see what tables are included in a group? And what better to look at than HWINV data? This query will return the list of tables that is being replicated as part of any of the replication groups with Hardware_Inventory in the name of the group.

select ArticleName from ArticleData where ReplicationID in (select ID from vReplicationData where ReplicationGroup like ‘Hardware_Inventory%’)

UPDATE: A friend was gracious enough to send me an email regarding the previous query and how to better see what data is in a replication group I recommend using his below. He is easily one of the foremost experts on DRS and how to troubleshoot replication. He and his team have one of the largest implementations of SCCM globally and it is magnificently run. Just from an efficiency standpoint, his query is much better and more elegant.

“HINV replication groups vary by install and by how much you extend the mof. I don’t think any 2 groups look alike or contain the same data.

To see what data is in each I use this:”

SELECT Rep.ReplicationGroup,

Rep.ReplicationPattern,

App.ArticleName, App.ReplicationID

FROM vArticleData AS App

INNER JOIN v_ReplicationData AS Rep ON App.ReplicationID = Rep.ID

ORDER BY Rep.ReplicationGroup, App.ArticleName

Another important aspect of DRS is the SQL Service Broker (SSB) which handles incoming and outgoing messages, guaranteeing their delivery by allowing them to use an asynchronous queue to store the messages in. If this cannot function the flow of data into and out of SQL replication will stop.

Site Data Processing – An Example

When a client runs its scheduled hardware inventory it stores the output of the WMI data in an XML file. Then the client copies that XML file up to its management point (MP). Assuming the MP is not a primary site server, the MP message handler processes the clients XML file and it gets converted into a MIF file. Then the MP File Dispatch Manager takes the MIF file and uploads into the clients primary site servers dataldr inbox folder. The MIF file is then read by the data loader component and the data is inserted in the SQL database.

Technically, there are more than two types of data being replicated, but technically Configmgr can also use the built-in SQL replication too, technically.

When the Data Replication Service (DRS) stops working it can be a nightmare. If replication breaks and you cannot fix it within a certain amount of time you will lose data and if you have to re-initialize your replication it generate gigs and gigs of network traffic across the wire as you replicate all that data to each site server, again. Below are some additional tips to help when troubleshooting DRS when the Replication Link Analyzer (RLA) doesn’t do the trick.

If RLA fails any remediation actions while it is running, the log files contain more detail than the XML file. Also, ensure that it was able to restart the SMS_SITE_COMPONENT_MANAGER and SMS_EXECUTIVE services.

Verbose Logging

The first step is to get more information from the log files, this is accomplished the same way as all other logs in CM, by modifying the amount of logging through registry keys. There are two different log types we are going to use, your standard text based log files you typically view with notepad, splunk, or cmtrace, while the other type of log files are viewed in SQL because they are SQL managed components of Configmgr and the logging is stored in SQL tables not like the .log files for the majority of Configmgr’s components. You can add this to the long list of reasons why SQL should be ON BOX and CM admins should have sysadmin forever on their SQL instance(s). Moving on.

*Make sure you return this back to 0 after you have resolved your DRS issues.

By default the two replication groups that record messages to the Rcmctrl.log are Site Control Data and Configuration Data, if you see errors with another replication group you can include it in the logging by adding it to the following registry key.

*Make sure you return this back to the default after you have resolved your DRS issues. The default is: Site Control Data,Configuration Data

SQL Components Logging

vLogs (stored in SQL views) – Because some components of RCM are running in SQL Server hosted managed code, each of these components is provided a table in CM SQL db to record log messages to. Logging is recorded in: vDrsReceivedMessages, vDrsReceivedHistory, vDrsSendHistory, vDrsSentMessages, vDrsSyncHistory, and vDrsSyncMessages

*Make sure you return this back to 1 after you have resolved your DRS issues.

Now that additional details are being logged here are two queries to run that will display details of the vLogs information. If you cannot determine the source of the problem using this information I have listed below some additional queries and troubleshooting information.

Query the SQL vLog

Now we are ready to query the vLog and get extra details. A word of caution first, I recommend that you use the first query that only returns the first 1000 messages, by running the query and not limiting it you risk making things worse by adding additional pressure to your SQL db when it may already be in a degraded state and using the maximum amount of resources it has available.

This query returns the last thousand messages which have been logged, ordered by the time they were written into the database.

select Top 1000 * from vLogs order by LogTime desc

This query does not limit the messages returned and may cause your server to fall over! This is your second and final warning.

select * from vLogs where LogTime > GETDATE()-1 and ProcedureName <> ‘spDRSSendChangesForGroup’ ORDER BY LogTime DESC

spdiagDRS (stored procedure) – This stored procedure provides an overview of the state of DRS replication at the site including status, messages in queue, messages processed, messages sent, conflicts, current link status, last sync for each replication group, and versions. This storedproc will give you most of the SQL related information you will need for a lot of replication troubleshooting, or at least point you in the best direction to look deeper.

exec spdiagdrs

Replication and Troubleshooting Certificates

Often times when the basics of DRS troubleshooting have not gotten to root cause and I get contacted it ends up being issues related to certificates. A couple common issues are that the certificates do not match or are missing. Another issue can be from changing the account that SQL is running under, if install and use the SYSTEM account to start the SQL services and later on change the accounts to a service account or domain account it will break because the new account does not have the rights to read the original master key used to generate and validate the certs.

“Service Broker login attempt failed with error: ‘Connection handshake failed. The certificate used by the peer is invalid…State89“

Let’s next verify we are headed down the correct rabbit hole by first running the following query. Withing the results of this query, the transmission_status column should display any errors that are related to network communications, such as firewalls blocking replication or authentication errors.

select * from sys.transmission_queue

In the case of Error 15581 you should see the following or something similar.

To resolve this you have two options, delete the current master key and generate a new master key in SQL or assign the new account full control to the MachineKeys folder. Giving the new account full control to all and the child objects as well is the faster and easier solution. If you want to go the route of generating a new key you will need to use the storedproc spCreateAndBackupSQLCert to build the new key and copy the certs to all of the site servers participating in replication. You can see more on how to accomplish the second method below but it also provides some good information on additional troubleshooting certs.

exec spCreateAndBackupSQLCert

Query to display certificates

spDiagDRSCertInfo

select * from vSMS_SC_SiteDefinition_Properties where name=’SQLServerSSBCertificateThumbprint’

use master

select name, cert_serial_number from sys.certificates

Here you can see the output from my CAS and it’s child primary site server. The first example is the CAS server row 9 shows the endpoint cert from the child primary site (MSC). Notice the thumbprints in row 8 and 9 show they have properly exchanged the correct certificate versions as the serial numbers in the second column match.

You should see the same certificate thumbprint listed in this registry key. If not that is a problem and see the script below on exporting and import certificates.

If you are still not able to determine the source of your replication issues here are a few SQL queries that may help isolate the problem.

Query to find a LockID of locked resources

select * from SEDO_LockState where LockStateID = 1

Query to list DRS Conflicts

select * from DrsConflictInfo

Query for the link status

select * from RCM_ReplicationLinkStatus

Query for Service Broker status

select * from sys.tcp_endpoints where type_desc = ‘SERVICE_BROKER’

SQL Routes

select * from sys.routes

Query to view the first 1000 messages in the DRS queue

select top 1000 *, casted_message_body = case message_type_name when ‘X’ then cast(message_body AS NVARCHAR(MAX)) else message_body end from [CM_CAS].[dbo].[ConfigMgrDRSQueue] with(NOLOCK)

Query for all sites replication status

select * from ServerData

There is so many things that can go wrong with replication and it is such a large topic it is difficult to cover it. Hopefully this information, while a little disorganized, is helpful in your troubleshooting efforts of replication. One final piece of advice, if you have to reinitialize your replication make sure you fully understand how much data is going to be sent before you do it. Feel free to contact me if you have questions.

Best,

Anthony

@_configmgr

(c) 2015 configmgr.com. All rights reserved. You may not copy and post more than a single paragraph without written authorization from the author. You may not copy and paste this article on any other blog or website without written authorization from the author.