This was originally posted as two posts on the SQL Server Storage Engine site. It was very popular so I’ve combined the two posts together and added a bunch more commentary – especially on page checksums and IO errors.

It’s almost inevitable that at some point every DBA will have to deal with corruption – so it’s very important that you know how the server will behave when corruption happens. You also need to make sure that whatever logic you’ve created to catch corruptions (whether through error log parsing, alerts, or Agent jobs – a topic for a future post) will actually work.

To do all this you need a corrupt database to play with. Earlier this year I created two corrupt databases – one for 2000 (attached in broken2000.zip) and one for 2005 (attached in broken2005.zip). The two attached files can be restored by unzipping them and then using the following syntax (substituting the correct backup name):
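For example, for the 2005 database (the logical file names 'broken' and 'broken_log' match the sample database; the disk paths are just examples – adjust the MOVE targets to suit your system):

RESTORE DATABASE [broken]
FROM DISK = N'c:\sqlskills\broken2005.bck'
WITH MOVE N'broken' TO N'c:\sqlskills\broken.mdf',
MOVE N'broken_log' TO N'c:\sqlskills\broken_log.ldf',
REPLACE;
GO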

The databases have the same schema – a table called brokentable with schema (c1 int, c2 varchar(7000)) with one row in it. The table has a single data-page which I’ve corrupted differently in 2000 and 2005:
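For reference, the table could have been created with something like the following (the actual row contents are a guess on my part – any value long enough to fill most of the page would do):

CREATE TABLE [brokentable] (c1 INT, c2 VARCHAR(7000));
INSERT INTO [brokentable] VALUES (1, REPLICATE ('a', 7000));
GO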

2000: The corrupt page has page ID (1:75) and the page header is corrupt so that selecting from the table will result in a 605 error which will kill the connection.

2005: The corrupt page has page ID (1:143) and the page header is corrupt such that the page checksum is bad.

A cautionary note on page checksums – if you upgrade a database from 2000 to 2005 and turn on page checksums, nothing happens! It’s not until a database page is read into the buffer pool, changed in some way and then written back out to disk that it will have a page checksum stamped on it. This means that once you turn them on, you need to trigger a page checksum being written to each page in some way – e.g. rebuilding all indexes or forcing an in-place update of all table rows. Neither of these is very palatable and there’s no tool to force page checksums in SQL Server 2005. Unfortunately there are no plans to include such a tool in SQL Server 2008 either.
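If you want to try this yourself, turning on page checksums and then rebuilding indexes looks something like the following ([mytable] is a placeholder – repeat for each table; note that rebuilding indexes doesn’t rewrite the data pages of a heap, which is why neither workaround fully covers all pages):

ALTER DATABASE [broken] SET PAGE_VERIFY CHECKSUM;
GO
ALTER INDEX ALL ON [mytable] REBUILD;
GO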

Below I’ve listed a few things you can try out to see what would happen on your database if a checksum failed. These are all using the 2005 corrupt database.

Query errors

Any query that touches that page is going to fail with an 824 error. The IO errors in 2005 are different from 2000 – they’ve been split into three:

823 – a hard IO error. This is where SQL Server has asked the OS to read the page but the read fails outright.

824 – a soft IO error. This is where the OS could read the page but SQL Server decided that the page was corrupt – for example, with a page checksum failure.

825 – a read-retry error. This is where either an 823 or 824 occurred, SQL Server retried the IO automatically, and it succeeded. This error is written to the error log only – you need to be aware of these as they’re a sign of your IO subsystem going awry. There’s no way to turn off read-retry and force SQL Server to ‘fail fast’ – whether this behavior is a good or a bad thing can be argued both ways – personally I don’t like it.

SELECT * FROM [broken]..[brokentable];
GO

Msg 824, Level 24, State 2, Line 1
SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x7232c940; actual: 0x720e4940). It occurred during a read of page (1:143) in database ID 8 at offset 0x0000000011e000 in file 'c:\sqlskills\broken.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

DBCC CHECKDB

DBCC CHECKDB throws us some nice errors. Note that it doesn’t actually mention a page checksum failure. DBCC CHECKDB is the only thing in SQL Server that can ‘eat’ IO errors and convert them into non-fatal corruption errors. Note in the DBCC CHECKDB output below that the repair level needed to repair this error is ‘repair_allow_data_loss’ – this is because the repair for a page with any kind of IO error on it is to delete the page, fix-up all relevant linkages, and rebuild any referencing indexes.
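To see the errors for yourself:

DBCC CHECKDB (N'broken') WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO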

If you have page checksums turned on, you should always use the WITH CHECKSUM option when taking backups. This will cause the page checksums to be checked as they’re read into the backup. If a bad page checksum is found, the backup will stop and print a message identifying the bad page. Using the WITH CHECKSUM option on a backup will also generate a checksum over the entire backup stream and store it in the backup. This means we can detect a damaged backup by recalculating the checksum and comparing it against that stored in the backup – in much the same way that page checksums work.
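Let’s try taking a backup of the corrupt database with this option (the backup file name here is just an example; WITH INIT simply overwrites any existing backup set in the file):

BACKUP DATABASE [broken]
TO DISK = N'c:\sqlskills\broken2.bck'
WITH CHECKSUM, INIT;
GO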

The backup has failed because of a bad page checksum. However, we can force the backup to complete. If this is the only copy of the database we have, and we’re being forced to run repair to fix a corruption, for instance, then we want to make sure we have a backup to restore from in case something goes wrong with the repair. Even a backup that contains a corrupt database is better than no backup at all. In this case, we can use the CONTINUE_AFTER_ERROR option, which forces the backup to continue when it finds a bad page.
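For example (broken2.bck is the backup file used in the rest of the post; WITH INIT just overwrites any previous backup set in the file):

BACKUP DATABASE [broken]
TO DISK = N'c:\sqlskills\broken2.bck'
WITH CHECKSUM, CONTINUE_AFTER_ERROR, INIT;
GO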

You can check the high-level validity of any backup set using the RESTORE VERIFYONLY command.

RESTORE VERIFYONLY FROM DISK = N'c:\sqlskills\broken2005.bck';
GO

The backup set on file 1 is valid.

What about on the backup that we forced using CONTINUE_AFTER_ERROR?

RESTORE VERIFYONLY FROM DISK = N'c:\sqlskills\broken2.bck';
GO

The backup set was written with damaged data by a BACKUP WITH CONTINUE_AFTER_ERROR.

Isn’t that cool? It tells us that the backup was already corrupt when it was written. Ok – let’s ask it to specifically check the checksums in the backup. This will look through all the pages in the backup that have page checksums, check them, and recalculate the backup stream checksum.
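That check looks like this:

RESTORE VERIFYONLY FROM DISK = N'c:\sqlskills\broken2.bck' WITH CHECKSUM;
GO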

The backup set was written with damaged data by a BACKUP WITH CONTINUE_AFTER_ERROR.

In our case, we get the same as above because the database was known to be corrupt when the backup was taken, so none of the checksum checking is done. What about if we try to check the checksums on the initial backup?
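RESTORE VERIFYONLY FROM DISK = N'c:\sqlskills\broken2005.bck' WITH CHECKSUM;
GO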

Msg 3187, Level 16, State 1, Line 1
RESTORE WITH CHECKSUM cannot be specified because the backup set does not contain checksum information.
Msg 3013, Level 16, State 1, Line 1
VERIFY DATABASE is terminating abnormally.

We can’t do that as the backup wasn’t taken using the WITH CHECKSUM option in the first place, even though some of the database pages may have page checksums on them.

RESTORE

How about we try to overwrite the existing ‘broken’ database with the one from the second backup we took?
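An attempt like this (using WITH REPLACE to overwrite the existing database):

RESTORE DATABASE [broken] FROM DISK = N'c:\sqlskills\broken2.bck' WITH REPLACE;
GO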

It won’t let us because the backup contains corrupt data (and it knows that because we forced the backup to complete using the CONTINUE_AFTER_ERROR option). However, there may be cases where you’ve lost your database and all you have is a corrupt backup. In this case it may be better to restore what data you do have rather than lose everything. You can do it using the CONTINUE_AFTER_ERROR option on the RESTORE command this time:
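RESTORE DATABASE [broken]
FROM DISK = N'c:\sqlskills\broken2.bck'
WITH REPLACE, CONTINUE_AFTER_ERROR;
GO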

Processed 160 pages for database 'broken', file 'broken' on file 1.
Processed 1 pages for database 'broken', file 'broken_log' on file 1.
The backup set was written with damaged data by a BACKUP WITH CONTINUE_AFTER_ERROR.
RESTORE WITH CONTINUE_AFTER_ERROR was successful but some damage was encountered. Inconsistencies in the database are possible.
RESTORE DATABASE successfully processed 161 pages in 0.392 seconds (3.364 MB/sec).

Isn’t that cool? It works BUT it tells us that the backup set contained corrupt data and that the restored database may have corrupt data in it.

Summary

Have a play about with these databases to familiarize yourself with the kind of responses you’ll get from the various tools when a corruption exists, and how to work around it if need be.

Let me know if you want to see any particular kinds of corruptions explored, or want a database with something specific corrupted in.

12 Responses to Example 2000/2005 corrupt databases and some more info on backup, restore, page checksums and IO errors

Hi Paul,
Hope you are doing fine. Wonderful post – though I’m late to pick this up, I’m just now practicing with it.

Anyway, I’ll get to the point. Just for practice, I repaired the broken database with the REPAIR_ALLOW_DATA_LOSS option as that is the minimum level required. Before doing this I put it into SINGLE_USER mode; the repair completed successfully and the database was still in SINGLE_USER mode afterwards.

Now, to practice more, I restored the database from the corrupt backup, but the database goes back to MULTI_USER mode automatically.

I also tested the same repaired database while still in SINGLE_USER mode: restoring it from a clean backup left the database in SINGLE_USER mode.

My question:
Why does the database change to MULTI_USER mode when restored from a corrupt backup?

Do you have a recommended method for triggering a page checksum to be written to each page? I’m looking at enabling the CHECKSUM PAGE_VERIFY option in my databases and was wondering if either of your recommendations above was more preferred?

I’d say rebuilding indexes would be the most effective as you can combine it with your regular index maintenance. I’m still working on a way to do this without having to take such (possibly drastic) measures.

I was just trying different corruption scenarios. I created a database (5120 KB data file). Just after creating the database I stopped SQL Server and opened the data file in EditPlus. I deleted a few lines from the lower part of the data file and then followed these steps; my database keeps coming back as SUSPECT:

If the data file is corrupted:

SELECT * FROM sys.databases shows the database as SUSPECT.

1. EXEC sp_resetstatus 'sus'
Warning: You must recover this database prior to access.

2. DBCC CHECKDB (sus)
Check statement aborted. Database contains deferred transactions.
The database is still in SUSPECT mode.

3. ALTER DATABASE sus SET EMERGENCY
The database is now in EMERGENCY mode.

4. ALTER DATABASE sus SET SINGLE_USER WITH ROLLBACK IMMEDIATE

5. DBCC CHECKDB ('sus', REPAIR_ALLOW_DATA_LOSS)
Msg 5028, Level 16, State 4, Line 1
The system could not activate enough of the database to rebuild the log.
DBCC results for 'SUS'.
CHECKDB found 0 allocation errors and 0 consistency errors in database 'SUS'.
Msg 7909, Level 20, State 1, Line 1
The emergency-mode repair failed. You must restore from backup.

6. (Re-running the same) DBCC CHECKDB ('sus', REPAIR_ALLOW_DATA_LOSS)
Database 'SUS' cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.
The database goes into SUSPECT mode again.

7. ALTER DATABASE sus SET MULTI_USER
Msg 926, Level 14, State 1, Line 1
Database 'SUS' cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.
Msg 5069, Level 16, State 1, Line 1
ALTER DATABASE statement failed.
Msg 5125, Level 24, State 2, Line 1
File 'C:\SUS.mdf' appears to have been truncated by the operating system. Expected size is 5120 KB but actual size is 5112 KB.
Msg 3414, Level 21, State 1, Line 1
An error occurred during recovery, preventing the database 'SUS' (database ID 12) from restarting. Diagnose the recovery errors and fix them, or restore from a known good backup. If errors are not corrected or expected, contact Technical Support.

I have set up alerts to pick up any 3043 errors thrown by the backup jobs (which run WITH CHECKSUM). Earlier today we received one from a transaction log backup, and a few minutes ago I ran DBCC CHECKDB WITH NO_INFOMSGS on the database that had complained. It did not return any errors.

Could you explain what might have happened with the bad checksum that triggered the error?

It’s likely that whatever corruption existed at the time the backup failed was no longer part of the database when you ran the DBCC CHECKDB. A page might have been deallocated, so DBCC CHECKDB wouldn’t have any cause to read it.