instead of completing properly and listing the corruptions in the database.

Whenever CHECKDB uses a database snapshot, it must check that each page it reads through the snapshot does not have an LSN (Log Sequence Number) higher than the LSN at which the snapshot was created. If a page did, that would mean the page was modified AFTER the snapshot was created, and hence CHECKDB would be working from an inconsistent view of the database. If this case is discovered, CHECKDB stops immediately. When I rewrote CHECKDB for SQL Server 2005, I changed a bunch of code assertions into separate states of the 8967 error, so that CHECKDB would fail gracefully if some condition occurred that indicates a bug or something that should never happen. State 216 is for the bad-LSN condition I've just described.

I used to think it was caused by a race condition with the NTFS code that implements sparse files, which is used by the hidden database snapshot that CHECKDB uses by default. However, I've come to learn that this is a bug in CHECKDB (not one of mine I should say :-)) that causes this behavior under certain circumstances when corruption is present. The bug is that if a corrupt page fails auditing inside CHECKDB, the LSN check is still performed. If the corruption affects the LSN stamped in the page header, the 8967 error could be triggered. I've seen this a handful of times in the last few weeks – hence the need for a blog post. I've discussed this with the dev team and hopefully the fix will make it into the next SPs for 2005 and 2008 (too late to fix such a rare problem in such a critical component at this stage of 2008 development). They're going to put a KB article together too – but in the meantime, I wanted to get this on the Internet so Google/Live Search pick it up.

[Edit: June 24 2010 – this still hasn't been fixed AFAIK – see KB 960791, which just discusses the workaround.]

Now let's repro the problem. Starting with a simple database and table, I'll find the first page so I can corrupt it.
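Something like the following would do. The table and column names here are illustrative, and DBCC IND is an undocumented command, so treat this as a sketch rather than a recipe:

CREATE DATABASE TestCheckdbBug;
GO
USE TestCheckdbBug;
GO
CREATE TABLE t1 (c1 INT, c2 CHAR (100));
INSERT INTO t1 VALUES (1, 'a');
GO
-- Undocumented: lists the pages allocated to the table.
-- The first row with PageType = 1 is the first data page.
DBCC IND ('TestCheckdbBug', 't1', -1);
GO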

Now I'm going to corrupt the LSN on that page such that it's guaranteed to be higher than the creation LSN of the database snapshot (basically by filling the first part of the page header LSN field with 0xFF).
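One way to do that is with the undocumented (and dangerous) DBCC WRITEPAGE command; a hex editor on the detached data file works too. The page ID (1:153) below is hypothetical – use whichever data page you found above – and the header offset is an assumption on my part (the m_lsn field is commonly at offset 0x28 in the page header, but verify with DBCC PAGE before trying this):

-- DANGEROUS and undocumented - never use this on a database you care about
ALTER DATABASE TestCheckdbBug SET SINGLE_USER;
GO
-- Overwrite the first 4 bytes of the page's m_lsn field with 0xFF
DBCC WRITEPAGE ('TestCheckdbBug', 1, 153, 40, 4, 0xFFFFFFFF);
GO
ALTER DATABASE TestCheckdbBug SET MULTI_USER;
GO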

DBCC CHECKDB ('TestCheckdbBug') WITH ALL_ERRORMSGS, NO_INFOMSGS;
GO

Msg 8967, Level 16, State 216, Line 1
An internal error occurred in DBCC that prevented further processing. Contact Customer Support Services.
Msg 8921, Level 16, State 1, Line 1
Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.

Bingo! And in the error log, there's some diagnostic information so we can tell which page caused the problem:

Note that the page ID in the output identifies the bad page, and the LSN reflects the corruption I caused. If the page ID field of the header were corrupt, these diagnostics couldn't tell us which page is bad. However, all is not lost. This bug means that under these circumstances the default online behavior of CHECKDB can't run. The workaround is to use the WITH TABLOCK option of CHECKDB, which performs offline checking and doesn't need the snapshot – but the trade-off is that an exclusive database lock is required for a short time, followed by shared table locks on all tables in the database (which is why online checking is the default). Running this option on my corrupt database gives:
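For my repro database, the workaround command is:

DBCC CHECKDB ('TestCheckdbBug') WITH ALL_ERRORMSGS, NO_INFOMSGS, TABLOCK;
GO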

These are the exact same results we got before I corrupted the LSN field (expected, as a page's LSN is only checked when CHECKDB is running from a database snapshot). Now we can proceed to restore/repair as appropriate.

So – a scary little bug that has caused some people headaches, but I want to stress again: this can only happen if the database is ALREADY corrupt, and it's quite rare. Hope this helps some of you picking this up from search engines in the future.

An excellent analysis and well put together article, much appreciated.

I have also run into this error in the following case:
1) SQL 2005 install on VMWare ESX server
2) two databases restored from same BAK file
3) db1 has all mdf & ldf files in single directory on hard drive on VMWare image
4) db2 has all mdf & ldf files in single directory on external hard drive using USB connection
5) DBCC CHECKDB on db1 returns expected output
6) DBCC CHECKDB on db2 returns Msg 8967, Level 16, State 216, Line 1…..

A couple of iterations with different BAK files yielded the same result. Any thoughts, particularly with reference to VMWare? The hardware (external HD and USB cable) has run fine for all other applications used to date.

Do you always get that problem storing the database on the USB drive? If so, the USB or VMWare I/O layer to the USB is corrupting the database… try running CHECKDB using WITH TABLOCK and it’ll tell you where.

What happens if you restore onto the USB through an instance not installed on VMWare?

Running a FULL backup of a 60 GB database over a slow link between sites, which takes over 4 hours to complete, will probably result in some corruption, especially if the DB has image LOBs.

The initial error during a test restore of the 5-hour backup said:

Damage to the backup set was detected.
RESTORE WITH CONTINUE_AFTER_ERROR was successful but some damage was encountered. Inconsistencies in the database are possible.

The database was left in a SUSPECT state and as "RESTORING", as I used NORECOVERY.
I removed the suspect bit and finished off the RESTORE with the differential backup taken after the full backup. That worked, but DBCC CHECKDB then told me:

Msg 8967, Level 16, State 216, Line 1
An internal error occurred in DBCC that prevented further processing. Contact Customer Support Services.
DBCC results for ‘thisDB60Gb’.
Msg 8921, Level 16, State 1, Line 1
Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.
CHECKDB found 0 allocation errors and 0 consistency errors in database ‘thisDB60Gb’.

This seems to be a case where the log was actually corrupt, rather than hitting the bug in question. In that case, the behavior is exactly as expected. As for trusting the repairs, I can only say that I wrote the repair code for 2005, and no bug has been found in it yet :-)