Troubleshooting Transactional Replication

Transactional replication is a useful way to keep schema and data for specific objects synchronized across multiple SQL Server databases. Replication can be used in simple scenarios involving a few servers or can be scaled up to complex, multi-datacenter distributed environments. However, no matter the size or complexity of your topology, the number of moving parts involved with replication means that occasionally problems will occur that require a DBA’s intervention to correct.

In this article, I’ll show you how to use SQL Server’s native tools to monitor replication performance, receive notification when problems occur, and diagnose the cause of those problems. Additionally, I'll look at three common transactional replication problems and explain how to fix them.

A View Into Replication Health

Replication Monitor is the primary GUI tool at your disposal for viewing replication performance and diagnosing problems. Replication Monitor was included in Enterprise Manager in SQL Server 2000, but in SQL Server 2005, Replication Monitor was separated from SQL Server Management Studio (SSMS) into a standalone executable. Just like SSMS, Replication Monitor can be used to monitor Publishers, Subscribers, and Distributors running previous versions of SQL Server, although features not present in SQL Server 2005 won’t be displayed or otherwise available for use.

To launch Replication Monitor, open SSMS, connect to a Publisher in the Object Explorer, right-click the Replication folder, and choose Launch Replication Monitor from the context menu. Figure 1 shows Replication Monitor with several registered Publishers added. Replication Monitor displays a tree view in the left pane that lists Publishers that have been registered; the right pane’s contents change depending on what’s selected in the tree view.

Selecting a Publisher in the tree view shows three tabbed views in the right pane: Publications, which shows the name, current status, and number of Subscribers for each publication on the Publisher; Subscription Watch List, which shows the status and estimated latency (i.e., time to deliver pending commands) of all Subscriptions to the Publisher; and Agents, which shows the last start time and current status of the Snapshot, Log Reader, and Queue Reader agents, as well as various automated maintenance jobs created by SQL Server to keep replication healthy.

Expanding a Publisher node in the tree view shows its publications. Selecting a publication displays four tabbed views in the right pane: All Subscriptions, which shows the current status and estimated latency of the Distribution Agent for each Subscription; Tracer Tokens, which shows the status of recent tracer tokens for the publication (I’ll discuss tracer tokens in more detail later); Agents, which shows the last start time, run duration, and current status of the Snapshot and Log Reader agents used by the publication; and Warnings, which shows the settings for all warnings that have been configured for the publication.

Right-clicking any row (i.e., agent) in the Subscription Watch List, All Subscriptions, or Agents tabs will display a context menu with options that include stopping and starting the agent, viewing the agent’s profile, and viewing the agent’s job properties. Double-clicking an agent will open a new window that shows specific details about the agent’s status.

Distribution Agent windows have three tabs: Publisher to Distributor History, which shows the status and recent history of the Log Reader agent for the publication; Distributor to Subscriber History, which shows the status and recent history of the Distribution Agent; and Undistributed Commands, which shows the number of commands at the distribution database waiting to be applied to the Subscriber and an estimate of how long it will take to apply them. Log Reader and Snapshot Reader agent windows show only an Agent History tab, which displays the status and recent history of that agent.

When a problem occurs with replication, such as when a Distribution Agent fails, the icons for the Publisher, Publication, and agent will change depending on the type of problem. Icons overlaid by a red circle with an X indicate an agent has failed, a white circle with a circular arrow indicates an agent is retrying a command, and a yellow caution symbol indicates a warning. Identifying the problematic agent is simply a matter of expanding in the tree view the Publishers and Publications that are alerting to a condition, selecting the tabs in the right pane for the agent(s) with a problem, and double-clicking the agent to view its status and information about the error.

Measuring the Flow of Data

Understanding how long it takes for data to move through each step is especially useful when troubleshooting latency issues and will let you focus your attention on the specific segment that’s problematic. Tracer tokens were added in SQL Server 2005 to measure the flow of data and actual latency from a Publisher all the way through to Subscribers (the latency values shown for agents in Replication Monitor are estimated). Creating a tracer token writes a special marker to the transaction log of the Publication database that’s read by the Log Reader agent, written to the distribution database, and sent through to all Subscribers. The time it takes for the token to move through each step is saved in the Distribution database.

Tracer tokens can be used only if both the Publisher and Distributor are on SQL Server 2005 or later. Subscriber statistics will be collected for push subscriptions if the Subscriber is running SQL Server 7.0 or later and for pull subscriptions if the Subscriber is running SQL Server 2005 or higher. For Subscribers that don’t meet these criteria (non-SQL Server Subscribers, for example), statistics for tracer tokens will still be gathered from the Publisher and Distributor. To add a tracer token, you must be a member of the sysadmin fixed server role or db_owner fixed database role on the Publisher.

To add a new tracer token or view the status of existing tracer tokens, navigate to the Tracer Tokens tab in Replication Monitor. Figure 2 shows an example of the Tracer Tokens tab showing latency details for a previously inserted token. To add a new token, click Insert Tracer. Details for existing tokens can be viewed by selecting from the drop-down list on the right.

Know When There Are Problems

Although Replication Monitor is useful for viewing replication health, it’s not likely (or even reasonable) that you’ll keep it open all the time waiting for an error to occur. After all, as a busy DBA you have more to do than watch a screen all day, and at some point you have to leave your desk.

However, SQL Server can be configured to raise alerts when specific replication problems occur. When a Distributor is initially set up, a default group of alerts for replication-related events is created. To view the list of alerts, open SSMS and make a connection to the Distributor in Object Explorer, then expand the SQL Server Agent and Alerts nodes in the tree view. To view or configure an alert, open the Alert properties window by double-clicking the alert or right-click the alert and choose the Properties option from the context menu. Alternatively, alerts can be configured in Replication Monitor by selecting a Publication in the left pane, viewing the Warnings tab in the right pane, and clicking the Configure Alerts button. The options the Alert properties window offers for response actions, notification, and so on are the same as an alert for a SQL Server agent job. Figure 3 shows an example of the Warnings tab in Replication Monitor.

There are three alerts that are of specific interest for transactional replication: Replication: Agent failure; Replication: Agent retry; and Replication Warning: Transactional replication latency (Threshold: latency). By default, only the latency threshold alerts are enabled (but aren’t configured to notify an operator). The thresholds for latency alerts are configured in the Warnings tab for a Publication in Replication Monitor. These thresholds will trigger an alert if exceeded and are used by Replication Monitor to determine if an alert icon is displayed on the screen. In most cases, the default values for latency alerts are sufficient, but you should review them to make sure they meet the SLAs and SLEs you’re responsible for.

A typical replication alert response is to send a notification (e.g., an email message) to a member of the DBA team. Because email alerts rely on Database Mail, you’ll need to configure that first if you haven’t done so already. Also, to avoid getting inundated with alerts, you’ll want to change the delay between responses to five minutes or more. Finally, be sure to enable the alert on the General page of the Alert properties window.

Changes to alerts are applied to the Distributor and affect all Publishers that use the Distributor. Changes to alert thresholds are applied only to the selected Publication and can’t be applied on a Subscriber-by-Subscriber basis.

Other Potential Problems to Keep an Eye On

Two other problems can creep up that neither alerts nor Replication Monitor will bring to your attention: agents that are stopped, and unchecked growth of the distribution database on the Distributor.

A common configuration option is to run agents continuously (or Start automatically when SQL Server Agent starts). Occasionally, they might need to be stopped, but if they aren’t restarted, you can end up with transactions that accumulate at the Distributor waiting to be applied to the Subscriber or, if the log reader agent was stopped, transaction log growth at the Publisher. The estimated latency values displayed in Replication Monitor are based on current performance if the agent is running, or the agent’s most recent history if it’s stopped. If the agent was below the latency alert threshold at the time it was stopped, then a latency alert won’t be triggered and Replication Monitor won’t show an alert icon.

The dbo.Admin_Start_Idle_Repl_Agents stored procedure in Web Listing 1 can be applied to the Distributor (and subscribers with pull subscriptions) and used to restart replication agents that are scheduled to run continuously but aren’t currently running. Scheduling this procedure to run periodically (e.g., every six hours) will prevent idle agents from turning into bigger problems.

Unchecked growth of the distribution database on the Distributor can still occur when all agents are running. Once commands have been delivered to all Subscribers, they need to be removed to free space for new commands. When the Distributor is initially set up, a SQL Server Agent job named Distribution clean up: distribution is created to remove commands that have been delivered to all Subscribers. If the job is disabled or isn’t running properly (e.g., is blocked), commands won’t be removed and the distribution database will grow. Reviewing this job’s history and the size of the distribution database for every Distributor should be part of a DBA’s daily checklist.

Common Problems and Solutions

Now that you have the tools in place to monitor performance and know when problems occur, let’s take a look at three common transactional replication problems and how to fix them.

Distribution Agents fail with the error message The row was not found at the Subscriber when applying the replicated command or Violation of PRIMARY KEY constraint [Primary Key Name]. Cannot insert duplicate key in object [Object Name].

Cause: By default, replication delivers commands to Subscribers one row at a time (but as part of a batch wrapped by a transaction) and uses @@rowcount to verify that only one row was affected. The primary key is used to check for which row needs to be inserted, updated, or deleted; for inserts, if a row with the primary key already exists at the Subscriber, the command will fail because of a primary key constraint violation. For updates or deletes, if no matching primary key exists, @@rowcount returns 0 and an error will be raised that causes the Distribution Agent to fail.

Solution: If you don’t care which command is failing, you can simply change the Distribution Agent’s profile to ignore the errors. To change the profile, navigate to the Publication in Replication Monitor, right-click the problematic Subscriber in the All Subscriptions tab, and choose the Agent Profile menu option. A new window will open that lets you change the selected agent profile; select the check box for the Continue on data consistency errors profile, and then click OK. Figure 4 shows an example of the Agent Profile window with this profile selected. The Distribution Agent needs to be restarted for the new profile to take effect; to do so, right-click the Subscriber and choose the Stop Synchronizing menu option. When the Subscriber’s status changes from Running to Not Running, right-click the Subscriber again and select the Start Synchronizing menu option.

This profile is a system-created profile that will skip three specific errors: inserting a row with a duplicate key, constraint violations, and rows missing from the Subscriber. If any of these errors occur while using this profile, the Distribution Agent will move on to the next command rather than failing. When choosing this profile, be aware that the data on the Subscriber is likely to become out of sync with the Publisher.

If you want to know the specific command that’s failing, the sp_browsereplcmds stored procedure can be executed at the Distributor. Three parameters are required: an ID for the Publisher database, a transaction sequence number, and a command ID. To get the Publisher database ID, execute the code in Listing 1 on your Distributor (filling in the appropriate values for Publisher, Subscriber, and Publication).

To get the transaction sequence number and command ID, navigate to the failing agent in Replication Monitor, open its status window, select the Distributor to Subscriber History tab, and select the most recent session with an Error status. The transaction sequence number and command ID are contained in the error details message. Figure 5 shows an example of an error message containing these two values.

Finally, execute the code in Listing 2 using the values you just retrieved to show the command that’s failing at the Subscriber. Once you know the command that’s failing, you can make changes at the Subscriber for the command to apply successfully.

Distribution Agent fails with the error message Could not find stored procedure 'sp_MSins_'.

Cause: The Publication is configured to deliver INSERT, UPDATE, and DELETE commands using stored procedures, and the procedures have been dropped from the Subscriber. Replication stored procedures aren’t considered to be system stored procedures and can be included using schema comparison tools. If the tools are used to move changes from a non-replicated version of a Subscriber database to a replicated version (e.g., migrating schema changes from a local development environment to a test environment), the procedures could be dropped because they don’t exist in the non-replicated version.

Solution: This is an easy problem to fix. In the published database on the Publisher, execute the sp_scriptPublicationcustomprocs stored procedure to generate the INSERT, UPDATE, and DELETE stored procedures for the Publication. This procedure only takes one parameter—the name of the Publication—and returns a single nvarchar(4000) column as the result set. When executed in SSMS, make sure to output results to text (navigate to Control-T or Query Menu, Results To, Results To Text) and that the maximum number of characters for results to text is set to at least 8,000. You can set this value by selecting Tools, Options, Query Results, Results to Text, Maximum number of characters displayed in each column). After executing the stored procedure, copy the scripts that were generated into a new query window and execute them in the subscribed database on the Subscriber.

Distribution Agents won’t start or don’t appear to do anything.

Cause: This typically happens when a large number of Distribution Agents are running on the same server at the same time; for example, on a Distributor that handles more than 50 Publications or Subscriptions. Distribution Agents are independent executables that run outside of the SQL Server process in a non-interactive fashion (i.e., no GUI). Windows Server uses a special area of memory called the non-interactive desktop heap to run these kinds of processes. If Windows runs out of available memory in this heap, Distribution Agents won’t be able to start.

Solution: Fixing the problem involves making a registry change to increase the size of the non-interactive desktop heap on the server experiencing the problem (usually the Distributor) and rebooting. However, it’s important to note that modifying the registry can result in serious problems if it isn’t done correctly. Make sure to perform the following steps carefully and back up the registry before you modify it:

Start the Registry Editor by typing regedit32.exe in a run dialog box or command prompt.

Navigate to the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\SubSystems key in the left pane.

In the right pane, double-click the Windows value to open the Edit String dialog box.

Locate the SharedSection parameter in the Value data input box. It has three values separated by commas and should look like the following:

Monitoring Your Replication Environment

When used together, Replication Monitor, tracer tokens, and alerts are a solid way for you to monitor your replication topology and understand the source of problems when they occur. Although the techniques outlined here offer guidance about how to resolve some of the more common issues that occur with transactional replication, there simply isn’t enough room to cover all the known problems in one article. For more tips about troubleshooting replication problems, visit the Microsoft SQL Server Replication Support Team’s REPLTalk blog.