
Don’t Break the Chain!

Let me start by saying I’ve often been very skeptical of SQL Server backups done by 3rd party tools. There are really two reasons. For one, many years ago (when I first started working with SQL Server) they often simply weren’t good. They had issues with consistency and the like. Over time and with the advent of services like VSS, that issue is now moot (though I’ll admit old habits die hard).

The second reason is that I hate to rely on things that I don’t have complete control over. As a DBA, I feel it’s my responsibility to make sure backups are done correctly AND are usable. If I’m not completely in the loop, I get nervous.

Recently, a friend had a problem that brought this issue to light. He was asked to go through their SQL Server backups to find the time period when a particular record was deleted so they could develop a plan for restoring the data deleted in the primary table and in the subsequent cascaded deletes. Nothing too out of the ordinary. A bit tedious, but nothing too terrible.

So he did what any DBA would do: he restored the full backup of the database for the date in question. Then he found the first transaction log backup and restored that. Then he tried to restore the second transaction log backup.

The log in this backup set begins at LSN 90800000000023300001, which is too recent to apply to the database. An earlier log backup that includes LSN 90800000000016600001 can be restored.

Huh? Yeah, apparently there’s a missing log. He looks at his scheduled tasks. Nope, nothing scheduled. He looks at the filesystem. Nope, no files there.

He tries a couple of different things, but nope, there’s definitely a missing file. Anyone who knows anything about SQL Server backups knows that you can’t break the chain. If you do, you can’t get too far. This can work both ways. I once heard of a situation where the FULL backups weren’t recoverable, but they were able to create a new empty database and apply five years’ worth of transaction logs. Yes, 5 years’ worth.

This was the opposite case. They had the full backup they wanted, but couldn’t restore even 5 hours’ worth of logs.

So where was that missing transaction log backup?

My friend did some more digging in the backup history tables in msdb and found this tidbit:

backup_start_date | backup_finish_date | first_lsn            | last_lsn             | physical_device_name
11/9/2016 0:34    | 11/9/2016 0:34     | 90800000000016600000 | 90800000000023300000 | NUL

There was the missing transaction log backup. It was a few minutes after the full backup, and definitely not part of the scheduled backups he had set up. The best he can figure is the sysadmin had set SAN Snapshot software to take a full backup at midnight and then for some reason a transaction log backup just minutes later.

That would have been fine, except for one critical detail. See that rightmost column (partly cut-off)? Yes, ‘physical_device_name’. It’s set to NUL. So the missing backup wasn’t made to tape or another spot on the disk or anyplace like that. It was sent to the great bit-bucket in the sky. In other words, my friend was SOL, simply out of luck.
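A backup sent to NUL is easy to miss because it looks perfectly normal in the history until you read the device name. Here is a minimal sketch in Python of the check my friend ended up doing by eye. The query is one way to pull the history from msdb (`backupset` and `backupmediafamily` are the real history tables, but the column selection is my choice), and it is only held as a string here, not executed. The `BackupRow` class, the helper names, the database name and the file paths are all hypothetical.

```python
from dataclasses import dataclass

# One way to pull backup history from msdb. Shown for reference only;
# this sketch does not connect to a server or run the query.
HISTORY_QUERY = """
SELECT bs.database_name, bs.type, bs.backup_start_date,
       bmf.physical_device_name
FROM msdb.dbo.backupset AS bs
JOIN msdb.dbo.backupmediafamily AS bmf
  ON bmf.media_set_id = bs.media_set_id
ORDER BY bs.backup_start_date;
"""


@dataclass
class BackupRow:
    """One row of backup history (hypothetical container for this sketch)."""
    database_name: str
    backup_type: str            # 'D' = full, 'I' = differential, 'L' = log
    backup_start_date: str
    physical_device_name: str


def is_null_device(device: str) -> bool:
    """True when the backup target is the NUL device (written 'NUL' or 'NUL:')."""
    return device.strip().rstrip(":").upper() == "NUL"


def discarded_backups(rows):
    """Return the history rows whose backup was written to NUL and thrown away."""
    return [r for r in rows if is_null_device(r.physical_device_name)]


# Sample history: only the 0:34 log backup to NUL comes from the story above;
# the database name, the other rows and the paths are invented for illustration.
history = [
    BackupRow("Sales", "D", "11/9/2016 0:00", r"D:\Backups\Sales_full.bak"),
    BackupRow("Sales", "L", "11/9/2016 0:15", r"D:\Backups\Sales_0015.trn"),
    BackupRow("Sales", "L", "11/9/2016 0:34", "NUL"),
    BackupRow("Sales", "L", "11/9/2016 1:00", r"D:\Backups\Sales_0100.trn"),
]

for row in discarded_backups(history):
    print(f"{row.database_name}: type {row.backup_type} backup at "
          f"{row.backup_start_date} went to NUL")
```

Run against real history on a schedule, a check like this would flag the stray backup the morning after it happened, rather than on the day someone needs a restore.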

Now, fortunately, the original incident, while a problem for his office, wasn’t a major business-stopping incident. And while he can’t fix the original problem he was facing, he discovered the issues with his backup procedures long before a major incident did occur.

I’m writing about this incident for a couple of reasons. For one, it emphasizes why I feel so strongly about realistic DR tests. Don’t just write your plan down. Do it once in a while. Make it as realistic as it can be.

BTW, one of my favorite tricks that I use for multiple reasons is to set up log shipping to a 2nd server. Even if the 2nd server can never be used for production because it may lack the performance, you’ll know very quickly if your chain is broken.
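Log shipping catches a break because every restore on the 2nd server has to pick up where the previous one left off. The same continuity rule can be sketched in a few lines of Python. This is a simplification of what a restore actually checks, not SQL Server’s own logic: I’m assuming a log backup can be applied only if it starts at or before the LSN the database has been restored to. The two LSNs 90800000000016600001 and 90800000000023300001 come from the error message earlier; every other LSN, the `LogBackup` class, the function name and the file names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class LogBackup:
    """A log backup file and the LSN range it covers (hypothetical container)."""
    name: str
    first_lsn: int
    last_lsn: int


def find_chain_break(restored_to_lsn, log_backups):
    """Walk the log backups in LSN order from the restored full backup.

    Returns the first log backup that starts after the point the database
    has reached (the "too recent to apply" case), or None if the chain holds.
    Simplified rule, assumed for this sketch.
    """
    current = restored_to_lsn
    for backup in sorted(log_backups, key=lambda b: b.first_lsn):
        if backup.last_lsn <= current:
            continue                    # covers nothing new, skip it
        if backup.first_lsn > current:
            return backup               # gap: an earlier log backup is missing
        current = backup.last_lsn       # applied, move the restore point forward
    return None


# Where the restored full backup leaves the database (invented value).
full_backup_lsn = 90800000000016000001

# The log backups my friend actually had on disk.
on_disk = [
    LogBackup("Sales_0015.trn", 90800000000015900001, 90800000000016600001),
    LogBackup("Sales_0100.trn", 90800000000023300001, 90800000000030000001),
]

# The backup that went to NUL would have bridged the gap.
lost = LogBackup("(sent to NUL)", 90800000000016600001, 90800000000023300001)

broken_at = find_chain_break(full_backup_lsn, on_disk)
print("chain breaks at:", broken_at.name if broken_at else "nowhere")
print("with the lost backup:", find_chain_break(full_backup_lsn, on_disk + [lost]))
```

The point of the sketch is the same as the log-shipping trick: the break shows up at the first file that can’t be applied, not at the file that is missing, so you have to go looking for what should have come before it.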

Also, I thought this was a great example of where doing things twice doesn’t necessarily make things more resistant to disaster. Yes, had this been set up properly it would have resulted in two separate full backups being taken, in two separate places. That would have been better. But because of a very simple mistake, the setup was worse than if only one backup had been written.