Exchange 2010: Log Truncation and Checkpoint At Log Creation in a Database Availability Group

In previous versions of Exchange, when a backup was completed, almost all log files prior to the current log file were truncated from the system. Administrators monitoring the directory would originally see many logs, and post backup note that only a few logs remained. In Exchange 2010 Service Pack 1 and later administrators note that multiple log files remain on the disk post backup or the appearance that no log files have truncated at all. In many cases this leads to a belief that logs are actually not truncating successfully or that there is an issue with backups.

Why do we see logs remaining on disk for longer? Exchange 2010 SP1 and newer introduces a change in the behavior of log truncation. The changes were taken to ensure that replicated copies of databases within a database availability group always had the appropriate log files on the source server to complete an incremental resynchronization.

The change to log truncation is the tracking of Checkpoint At Log Creation. Remember that in a database availability group we can expect the checkpoint to be approximately 100 logs (or slightly more) off the current log file – this is known as checkpoint depth. As Exchange creates new log files we stamp into the header of the new log files what log file the checkpoint was pointing at when the current log was created. For example, let us say that log file 0xA679 (42617) was just created as the current ENN.log. We can expect that the checkpoint at log creation value stamped within the header of this log file would be approximately 0xA16 (42517). You can see the checkpoint at log creation value by using eseutil /ml <logfilename> to dump the header of a log file.

Within a DAG all servers that contain a replicated copy of a database report the maximum log file that is eligible for truncation. These values are reported to the active node which subsequently calculates the maximum log file for truncation. In simplest terms the following process occurs:

Passive copy on Node-2 reports OK to truncate log 0xA679 (42617).

Passive copy on Node-3 reports OK to truncate log file 0xA678 (42616)

Passive copy on Node-4 reports ok to truncate log 0xA679 (42617).

The active node determines that the best log file eligible for truncation based on the passive copies is 0xA678 (42616). [This is essentially the minimum of all reported OK logs to truncate.]

The active node then looks at the checkpoint at log creation of 0xA678 (42616) and determines that value is 0xA614 (42516). In this example that would be 100 logs off the best log reported for truncation of the passive copies.

The active node sets the truncation point to be log 0xA614 (42516).

Therefore after a successful backup logs prior to 0x614 (42516). would truncate.

This essentially means that 100 additional logs that would have previously truncated prior to this change do not truncate.

Taking into account checkpoint at log creation administrators can better understand how log files are truncated and why log files remain on disk after a backup that might have in prior versions been truncated.

I have a question. Is there any way to manipulate with Checkpoint At Log Creation time? Is it possible to influence on its value?

Some of our customers have problems after backup, because there copies sporadically become FailedAndSuspended with "Log file 1 missing" error. Since full backup is intended to truncate old log files, there should be such ancient "log file 1", because it was truncated long time ago. The behavior repeats sporadically, not always, on different databases. MS Partner Support does not what to do with it, and the backup software vendors also give up finding a solution for the problem. Fortunately, it is possible to resume the FailedAndSuspended copy manually in EMC. But we need to find a solution to get rid of the problem and routine to manually resume the copy every time it fails after a backup.

To my knowledge there is no difference between the versions documented here and SP3. Also – as far as I know – there is no correlation between the number of items that the index is backlogged and how we truncate log files.

There should be no expectation that the backup solution has any impact on how logs are truncated, as the information store / replication service is what's responsible for determining log file truncation.

With Exchange Exchange 2010 SP3 we have seen the amount of Logfiles that stay after successfull back on A DAG around 200+. Prior to SP3 this may have been 100+. This is also confusing and seeme to be undocumented on MS side which clearly makes a lot of trouble on the market. There are hundres of threads and blogs who describe this and tend to go in direction of search indexer/Catalog files blocking.

What Software you finally use Windows Backup, Netapp Snap Manager for Exchange or any Exchange Aware Backup seems to have no influence on the amount of logfiles the DAG keeps even after backup. Make sure you have enough Test data in your Test mailbox (Attachments).

In a pre-prod environment we have many dbs that don’t yet have active user mailboxes, just health mailboxes. Logs are generated every 15 mins exactly. In these cases the difference between the passive copy’s “ok to truncate” and the active copy’s is often times hundreds of logs. So after truncation you could have 300-500 or more logs remaining that are days old. Is that due to the infrequency that the logs are generated?