Loose Truncation in Exchange Server 2013 SP1

From the earliest days of Microsoft Exchange Server, running out of disk space has been a problem with very bad consequences. Thankfully, the era of limited and expensive disks is long gone. Disks are now plentiful and cheap. However, problems can still arise when disk space comes under pressure. When that happens, it's usually up to administrators to resolve the crisis. This isn't so bad when it happens in the middle of the day. It's a totally different matter when a disk runs out of space at 3 a.m.

Database availability groups (DAGs) are one of the fundamental building blocks of Exchange. Since their introduction in Exchange Server 2010, DAGs have proven their worth in terms of delivering high availability through database replication. On the downside, DAGs can be complex to deploy and manage. Microsoft acknowledged this in Exchange Server 2013 by introducing features such as database AutoReseed to help ease the administrative burden.

Loose truncation is a new feature in Exchange 2013 SP1. It's designed to ensure that disks holding the transaction logs waiting to be replayed into database copies don't run out of space. After I explain the problem that loose truncation solves, I'll explain how it works and how to configure and monitor it.

Understanding Why Loose Truncation Is Needed

To understand what problem loose truncation solves, you need to understand the replication process. In Exchange, the Replication service copies transaction logs to servers that hold copies of databases. Those logs are then inspected and replayed on the receiving server to bring their database copies up-to-date with the active copy.

Exchange keeps transaction logs until the Replication service is happy that the data contained in the logs has been committed to all copies. At this point, the Active Manager process on the server holding the active copy of the database calculates a global truncation point (the generation number of the logs that are no longer required) and communicates it to the servers hosting the other copies, which causes those servers to remove (truncate) the set of transaction logs that they hold. Log truncation is important because it ensures that Exchange doesn't occupy more disk space than it requires.

When the replication process is healthy, very few transaction logs will await replay. However, when replication is suspended for one or more passive database copies (by running the Suspend-MailboxDatabaseCopy cmdlet or by using the Exchange Admin Center, or EAC), Active Manager doesn't calculate a truncation point. (Sometimes you need to suspend replication, such as when you're reseeding a failed database copy or when maintenance will cause a database to be offline for more than a few hours.) All of this is logical because the transaction logs being generated on the active copy will be required to bring the now-suspended copies up-to-date when replication resumes, and you don't know which copy will be activated at that point. Therefore, it makes no sense to truncate the log set.

The resulting problem is potential space exhaustion on the disks holding the replicated transaction logs. Large and busy databases can generate gigabytes of logs daily, all of which are replicated. Everything works well if plenty of space is available on the drives that hold the replicated logs, and the log set is truncated regularly. When truncation isn't occurring, the log set continues to expand. If this continues for a number of days, it's easy to see how many gigabytes of logs might accumulate and potentially fill a disk.

Up to now, the solution has been to use the Remove-MailboxDatabaseCopy cmdlet to remove any passive database copies before maintenance, then use the Add-MailboxDatabaseCopy cmdlet to re-add them afterward. This solution is effective, but it requires manual administrator intervention and complicates the maintenance process. The loose truncation feature in Exchange 2013 SP1 offers a different solution.

Understanding How Loose Truncation Works

Loose truncation is disabled by default. When it's enabled, each database copy measures the free disk space on the drive holding the replicated logs. Normal (or "tight") truncation applies for as long as sufficient free disk space is available on the drive. Loose truncation kicks in when free space drops below a low-space threshold (200GB by default). No UI is available in EAC to control the low-space threshold for database copies, so this value must be set in the system registry on each server that holds passive copies. A number of other registry values are used to determine how loose truncation behaves.

When loose truncation is enabled, Active Manager on the server holding the active database copy continues to calculate and publish a truncation point as it normally would. The difference is that the calculation now ignores the passive database copy that has the most logs to replay. This copy is sometimes referred to as the "oldest straggler." The truncation point is then calculated based on the state of log replay that exists on the other servers that hold passive database copies. For example, consider a database with three copies, one of which (copy 3) is offline. The latest log generation is 173,501, copy 1 has 20 logs waiting to be replayed, copy 2 has no more than that, and copy 3 has fallen 110,200 logs behind.

Copy 3 is the oldest straggler, so Active Manager ignores it. Of the remaining copies, copy 1 has the largest queue (20), so the truncation point is calculated at 173,481 (173,501 − 20). Active Manager therefore advises all database copies that they can truncate logs up to generation 173,481. If loose truncation weren't used, the truncation point would be 63,301 (173,501 − 110,200) and the servers holding copy 1 and copy 2 would be forced to hold an additional 110,180 logs (173,481 − 63,301).
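To make the arithmetic concrete, here's a minimal Python sketch of the calculation just described. It's a toy model rather than a description of how Active Manager is implemented: the function name and its parameters are my own invention, and copy 2 is left out because its replay queue isn't stated (any value of 20 or less gives the same result).

```python
def global_truncation_point(latest_generation, replay_queues, loose=False):
    """Return the highest log generation that can be truncated.

    replay_queues maps a copy name to the number of logs that copy
    still has to replay. With loose=True, the copy with the longest
    queue (the "oldest straggler") is ignored.
    """
    queues = dict(replay_queues)
    if loose and len(queues) > 1:
        straggler = max(queues, key=queues.get)
        del queues[straggler]
    # The copy that is furthest behind dictates how far truncation can go.
    return latest_generation - max(queues.values())


queues = {"copy 1": 20, "copy 3": 110_200}
tight = global_truncation_point(173_501, queues)
loose = global_truncation_point(173_501, queues, loose=True)
print(tight, loose, loose - tight)
```

Run as is, this prints the two truncation points from the example along with the number of extra logs that tight truncation would force the healthy copies to keep.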

This description is a simplified version of what actually happens. It's intended to prove the usefulness of ignoring a suspended database copy when attempting to minimize the use of space. DAGs are deployed to achieve high availability, so minimizing disk space isn't the only factor that should be taken into consideration. It's important to balance disk space usage with retaining sufficient data to allow recovery to occur should a database copy fail. And really, given the size and cost of the disks available today, there's no excuse for not providing sufficient storage for database copies and their logs.

When you enable loose truncation, Exchange defines a threshold value for the minimum number of transaction logs that it needs to protect, or the number of logs that should be retained for active and passive copies, even when disk space usage is becoming low. By default, for the active copy, Exchange keeps an additional 10,000 logs over what it calculates should be truncated and an additional 100,000 logs for passive copies.

The number of retained logs for passive copies is further adjusted upward to include an additional 10 percent of the number of logs. This adjustment is necessary to ensure that lagged database copies (which always have a big replay queue) retain sufficient transaction logs to still be useful when loose truncation is active. Taking the values just shown, the truncation point for copy 1 is therefore:

173,501 − (10,000 + (20/10)) = 163,499
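The same calculation can be written as a small Python function. This is only a sketch of the formula above: the function and parameter names are hypothetical, and rounding the 10 percent adjustment down to a whole number of logs is my assumption.

```python
def protected_truncation_point(latest_generation, replay_queue,
                               min_logs_to_protect=10_000):
    """Truncation point after holding back a safety margin of logs.

    The margin is the minimum number of logs to protect plus
    10 percent of the replay queue (integer division is an assumption).
    """
    margin = min_logs_to_protect + replay_queue // 10
    return latest_generation - margin


print(protected_truncation_point(173_501, 20))
```

Passing a different min_logs_to_protect value shows how much further back the truncation point moves when a larger margin is protected.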

Managing the truncation point is a good way to stop transaction logs from exceeding the available space on a disk. Additional protection is secured because passive copies are able to make a decision, without reference to the active copy, about whether to truncate their logs should disk space become scarce.

Truncating logs without considering whether a suspended copy needs to use those logs to bring itself up-to-date has implications. When the suspended copy is brought back online, a chunk of data that it should have is missing and it can't be retrieved. Exchange therefore puts the database copy into a FailedAndSuspended state. If AutoReseed is configured for the database, the database copy will be automatically reseeded. If AutoReseed isn't configured, an administrator can use the Update-MailboxDatabaseCopy cmdlet to reseed the database copy from another healthy copy.

Configuring Loose Truncation

To configure loose truncation, you need to create three registry entries on each DAG member server. All three entries need to be created under the registry key HKLM\Software\Microsoft\ExchangeServer\V15\BackupInformation and need to use DWORD values. Table 1 lists the entries and their default values. After you create the registry entries, you don't need to restart the service unless you really want to. The Replication service reads these values every 15 minutes and alters Exchange's behavior accordingly.

Table 1: Loose Truncation Registry Entries

Registry Entry: LooseTruncation_MinCopiesToProtect
Description: Enables loose truncation on a server if set to anything other than 0. In some respects, it represents the number of passive copies to protect from loose truncation on the active server. Despite its name, it doesn't affect how Active Manager considers passive copies when it calculates a truncation point.
Default Value: 0

Registry Entry: LooseTruncation_MinDiskFreeSpaceThresholdInMB
Description: Specifies the free-space threshold (in MB) for the disk holding transaction logs. Loose truncation is used when the available space drops below this value.
Default Value: 200GB

Registry Entry: LooseTruncation_MinLogsToProtect
Description: Specifies the minimum number of logs to retain on healthy database copies when truncation is performed. If the default value is changed, the new value applies to both active and passive copies.
Default Value: 10,000 for active copies and 100,000 for passive copies
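If you need to apply the same settings to several DAG members, one option is to generate a .reg file and import it on each server. The following Python sketch renders such a file for the three entries in Table 1. The helper name is hypothetical, the values are purely illustrative (including my assumption that 200GB is expressed as 200 × 1,024 = 204,800MB), and you should test the result in a lab before using it in production.

```python
REG_KEY = r"HKEY_LOCAL_MACHINE\Software\Microsoft\ExchangeServer\V15\BackupInformation"


def render_reg_file(values):
    """Render DWORD values as the text of a Windows .reg file."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{REG_KEY}]"]
    for name, value in values.items():
        # .reg files store DWORD values as eight hexadecimal digits.
        lines.append(f'"{name}"=dword:{value:08x}')
    return "\n".join(lines) + "\n"


example_values = {
    "LooseTruncation_MinCopiesToProtect": 1,                     # switch the feature on
    "LooseTruncation_MinDiskFreeSpaceThresholdInMB": 200 * 1024, # assumed MB equivalent of 200GB
    "LooseTruncation_MinLogsToProtect": 100_000,                 # illustrative; applies to active and passive copies
}
print(render_reg_file(example_values))
```

Remember that setting LooseTruncation_MinLogsToProtect explicitly replaces both defaults, so leave that entry out of the dictionary if you want to keep the separate values for active and passive copies.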

The LooseTruncation_MinCopiesToProtect entry is badly named. Essentially, this entry controls whether loose truncation is in use; the default value of 0 means that it isn't. The value that you enter here is important. If the specified value is less than the number of passive copies (for example, set to 1 when there are two passive copies of each database), loose truncation is enabled. If the specified value is greater than or equal to the number of passive copies, loose truncation isn't used: the feature is enabled, but it's blocked because of the high value. It's helpful to think about this registry entry as a binary on-off switch, where 0 is off and 1 is on, and to use the value in that manner.
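The rule is easier to see when written as a few lines of Python. This is a sketch of the behavior described above, with a hypothetical function name, not code taken from Exchange.

```python
def loose_truncation_in_effect(min_copies_to_protect, passive_copies):
    """Return True if loose truncation can actually be used for a database."""
    if min_copies_to_protect == 0:
        return False  # default value: the feature is switched off
    # A value at or above the number of passive copies blocks the feature.
    return min_copies_to_protect < passive_copies


for value in (0, 1, 2, 3):
    print(value, loose_truncation_in_effect(value, passive_copies=2))
```

With two passive copies, only a value of 1 leaves loose truncation usable, which is why treating the entry as a simple on-off switch is the safest approach.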

The LooseTruncation_MinLogsToProtect entry also deserves discussion. One way of thinking about this entry is to consider that the feature is divided into two components: active copy loose truncation and passive copy loose truncation. If free space on the disk holding the logs for the active copy drops below the threshold (200GB by default), Active Manager ignores the oldest straggler and uses a safety margin of 10,000 logs to ensure that a reasonable number of logs are available should they be required by a passive copy. However, if the disk holding the logs for a passive copy hits the threshold, the Replication service retains 100,000 logs and deletes any others to free up space so that new logs can be replicated. Other servers that hold passive copies of the same database aren't affected by this action and won't delete any logs. Note that if you change the default value for the LooseTruncation_MinLogsToProtect entry, the new value is used for both active and passive copies.
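The passive side of this behavior can be sketched as follows. The function is a simplified illustration using names of my own choosing; it assumes the 200GB default equals 204,800MB and leaves out the 10 percent adjustment mentioned earlier.

```python
def passive_logs_to_delete(free_space_mb, logs_on_disk,
                           threshold_mb=200 * 1024,
                           min_logs_to_protect=100_000):
    """Number of logs a passive copy would remove on its own disk."""
    if free_space_mb >= threshold_mb:
        return 0  # enough free space: normal truncation rules apply
    # Below the threshold: keep the protected logs and delete the rest.
    return max(0, logs_on_disk - min_logs_to_protect)


print(passive_logs_to_delete(free_space_mb=150_000, logs_on_disk=250_000))
print(passive_logs_to_delete(free_space_mb=500_000, logs_on_disk=250_000))
```

The decision depends only on the local disk, which matches the point that other servers holding passive copies of the same database aren't affected.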

Monitoring Loose Truncation

How will you know when loose truncation is in effect? Assuming the registry updates are in place, loose truncation will kick in automatically when the thresholds are crossed. Although you can monitor the drives where the transaction logs are stored by checking free space and the number of logs held there, it's easier to check the TruncationDebug crimson channel in the Event Viewer. Exchange logs event 170 when normal log truncation occurs for a database, and that's what you should expect to see on DAG members. For example, Figure 1 shows that Active Manager has determined that the global truncation point for database DB2 can be moved from generation 4,448 to 4,459. Because transaction log file names use hexadecimal generation numbers, this equates to transaction logs 1160 to 116B. After the truncation point is moved, the Replication service will delete all logs prior to the truncation point, meaning that the oldest remaining log on all servers hosting copies of database DB2 will be generation 4,459, or filename Exx0000116B.log (where Exx is the database prefix).
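If you want to check the conversion between generation numbers and log file names yourself, a one-line Python function does the job. The function name is mine, and the Exx prefix is the same placeholder used above rather than a real database prefix.

```python
def log_file_name(prefix, generation):
    """Build a transaction log file name from a decimal generation number."""
    # The generation is written as eight uppercase hexadecimal digits.
    return f"{prefix}{generation:08X}.log"


print(log_file_name("Exx", 4448))
print(log_file_name("Exx", 4459))
```

Substitute the actual prefix of your database for Exx to get the file name you should expect to find on disk.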

When loose truncation is being used, you'll see events 848 (for the active database copy) and 861 (for a passive database copy) recorded instead of event 170. It's therefore possible to monitor for events 848 and 861 and use them as a danger signal indicating potential disk space exhaustion.
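A monitoring script could apply that rule along the following lines. This Python sketch covers only the classification step: it assumes that you've already exported the event IDs from the TruncationDebug channel by some other means, and the function name is hypothetical.

```python
NORMAL_TRUNCATION = 170       # normal log truncation
LOOSE_ACTIVE_COPY = 848       # loose truncation on the active copy
LOOSE_PASSIVE_COPY = 861      # loose truncation on a passive copy


def summarize_truncation_events(event_ids):
    """Count truncation events and flag any sign of loose truncation."""
    summary = {"normal": 0, "loose_active": 0, "loose_passive": 0}
    for event_id in event_ids:
        if event_id == NORMAL_TRUNCATION:
            summary["normal"] += 1
        elif event_id == LOOSE_ACTIVE_COPY:
            summary["loose_active"] += 1
        elif event_id == LOOSE_PASSIVE_COPY:
            summary["loose_passive"] += 1
    # Any loose truncation event suggests that disk space is under pressure.
    summary["warning"] = (summary["loose_active"] + summary["loose_passive"]) > 0
    return summary


print(summarize_truncation_events([170, 170, 848, 861, 170]))
```

A summary with the warning flag set is the cue to look at free space on the affected log drives before the problem becomes a 3 a.m. call.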

A Good Thing

Loose truncation might not be required in your deployment if you use large disks that have plenty of available space to hold transaction logs. However, this feature might prove very useful if you're concerned about disk management. Either way, it's evidence of the growing maturity of DAGs and their surrounding technology, which is a good thing.