Off-Box Backups and Luke-Warm Standby Servers, Part III

Until now, my previous two posts have largely just been a review of best practices for making backups redundant – with only a single mention of RESTORE VERIFYONLY's LOADHISTORY argument as a means of providing a rationale for enabling luke-warm failover servers from those backups.

And, again, just for the record: there ARE better High Availability solutions out there for SQL Server (such as Mirroring, Log Shipping, SQL Server 2012's AlwaysOn happiness, clustering, and – in some cases – replication). As such, the point of these posts is really just to cover less-expensive failover options for organizations that don't need 'full-blown' failover solutions, and to describe some additional ways to make 'full-blown' HA solutions more redundant and capable of recovery – as you never know when you'll need some sort of Remote-Availability or smoke-and-rubble contingency. Because, as I've said over and over again, high availability isn't the same as disaster recovery.

So, with all of that said, the big problem at this point is that just having copies of your backups in secondary/remote locations is NOT a disaster recovery plan. Yes, having copies of your backups off-site is better than nothing – but merely having copies of your backups off-site isn't going to help much in an emergency – especially if we're talking about multiple databases of any decent size. That, and unless you're regularly testing these secondary/luke-warm backups, you not only have no idea whether they're viable or not, but you also can't accurately know whether they're capable of meeting RPOs and RTOs.

To that end, you need what I like to call a ‘mass recovery’ script.

The Benefits of a Mass Recovery Script

In my mind there are two primary benefits of a 'mass recovery' script. First and foremost, my experience is that most DBAs aren't as familiar with all of the ins and outs of restoring databases in an emergency as they need to be. Or, as I like to tell many of my clients: “The LAST thing you want to be doing in a disaster scenario (with your boss standing over your shoulder) is figuring out how to properly restore a number of databases while management is worried about when people, systems, and expensive activities can get back online.”

Second, even if you’re a ninja at restoring databases from scratch (where you’ll have to use T-SQL against raw files – unless you’ve done something with RESTORE VERIFYONLY’s LOADHISTORY functionality), you’ll still typically need to wire up the restoration of a decent number of log-file backups – which can quickly become tedious and error-prone. Especially if you have multiple databases that go down. (For example, assume that you’re making FULL backups every Sunday night, DIFFERENTIAL backups every weeknight at 3AM, and your database goes down at 10AM. If you’ve got any type of ‘real-world’ RPOs, you’re likely going to need log-file backups every 15 minutes (or less) in order to ensure that copies of completed operations are successfully duplicated off-box. That, in turn, means that you’ll be looking at applying roughly 7 hours’ worth of log file backups – or ~28 log file backups per database – in an emergency scenario.)
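To put that scenario in perspective, here's a rough sketch of the T-SQL it would require. (All database names, file names, and paths below are hypothetical placeholders – your naming conventions will differ.)

```sql
-- A minimal sketch of the restore sequence described above.
-- Restore the most recent FULL backup (Sunday night), leaving the
-- database in a restoring state so additional backups can be applied:
RESTORE DATABASE [SalesDB]
	FROM DISK = N'F:\SQLBackups\SalesDB\SalesDB_FULL_Sunday.bak'
	WITH NORECOVERY, REPLACE;

-- Apply the most recent DIFFERENTIAL (last night at 3AM):
RESTORE DATABASE [SalesDB]
	FROM DISK = N'F:\SQLBackups\SalesDB\SalesDB_DIFF_0300.bak'
	WITH NORECOVERY;

-- Then apply EVERY log backup taken since that DIFF, in order.
-- At 15-minute intervals from 3AM to 10AM, that's ~28 of these:
RESTORE LOG [SalesDB]
	FROM DISK = N'F:\SQLBackups\SalesDB\SalesDB_LOG_0315.trn'
	WITH NORECOVERY;
-- ... 27 more RESTORE LOG statements ...

-- Finally, bring the database online:
RESTORE DATABASE [SalesDB] WITH RECOVERY;
```

Hand-typing 28+ of those RESTORE LOG statements per database, under pressure, is exactly the tedium (and the opportunity for error) being described.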

So, in terms of this second point, I like to point out that manually restoring databases one file at a time from the GUI is simply an option that does not scale. So, unless you’ve gone ahead and imported backup file meta-data into your luke-warm standby server (using the LOADHISTORY argument of RESTORE VERIFYONLY) and can accurately count on that data being up-to-date, you’re going to be wrestling with lots of files in most luke-warm disaster recovery scenarios. As such, a mass-recovery script can very easily automate the process of bringing numerous databases (or even a single database) back online in a disaster scenario.
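For reference, here's what importing that backup meta-data looks like – LOADHISTORY verifies the backup AND pushes its header information into msdb's backup-history tables, so that the GUI (and msdb queries) can 'see' the copied file. The path below is a hypothetical placeholder:

```sql
-- Verify a copied backup AND record its metadata in msdb's history tables:
RESTORE VERIFYONLY
	FROM DISK = N'F:\SQLBackups\SalesDB\SalesDB_FULL_Sunday.bak'
	WITH LOADHISTORY;

-- Confirm that the metadata landed:
SELECT TOP 5 database_name, [type], backup_finish_date
FROM msdb.dbo.backupset
ORDER BY backup_finish_date DESC;
```

The catch, as noted above, is that this only helps if you've been diligently running it against every copied file – otherwise the history data can't be counted on.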

Better yet, once you’ve got a mass recovery script up and running, you can drop in on your luke-warm failover servers periodically (say, once a week – depending upon the sensitivity and importance of your data) and run this script to test it against both Recovery Time Objectives and RPOs.
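When running those periodic tests, it's worth actually capturing how long recovery takes so you have hard numbers to compare against your stated RTOs. A trivial (and purely illustrative) way to wrap the mass-recovery script:

```sql
-- Capture elapsed recovery time to compare against RTOs.
-- Assumes the mass-recovery logic runs between the two timestamps.
DECLARE @start datetime = GETDATE();

-- ... execute the mass-recovery script here ...

DECLARE @elapsedSeconds int = DATEDIFF(SECOND, @start, GETDATE());
RAISERROR('Total recovery time: %d seconds.', 1, 0, @elapsedSeconds) WITH NOWAIT;
```

Trend those numbers over time – databases grow, and a recovery that met RTOs six months ago may not meet them today.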

Creating a Mass Recovery Script

When it comes to building a mass-recovery script, there are sadly a large number of factors and concerns to pay attention to – to the point where there really isn’t going to be a ‘one size fits all’ approach to executing luke-warm failovers that will work everywhere and for every situation. For example, I keep referring to this encapsulation of the logic needed to restore databases as a 'mass' recovery script – meaning that, in my experience, it’s pretty typical to need to bring multiple databases online – instead of just a single database. But, in some environments, there might only be a single database. In other environments there might be multiple, different, databases, while in still other environments there might be multiple, similar, or even related databases that either have different business priorities or which might imply some sort of hierarchical set of dependencies. In other words, there’s simply a large amount of complexity that has to be adequately addressed when it comes to restoring off-box or luke-warm backups as part of a disaster recovery contingency plan.

To that end, I’ve created two lists – of obvious and not-so-obvious things you’ll want to account for when creating recovery scripts for luke-warm backup servers.

Obvious Concerns For Luke-Warm Backup Scripts

Paths. In an ideal world you’d have identical drive and folder/pathing configurations on luke-warm backup servers as you would on primary production hosts. Only, that typically ends up being cost-prohibitive in most cases – and difficult to keep in sync. Consequently, whatever scripts you’re using to recover databases on a luke-warm host are typically going to need to account for potentially different locations for data and log files.
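In practice, accounting for different paths means using RESTORE ... WITH MOVE. A hedged sketch (the logical file names and paths below are placeholders – run RESTORE FILELISTONLY against your own backups to find the real logical names):

```sql
-- First, discover the logical file names inside the backup:
RESTORE FILELISTONLY
	FROM DISK = N'F:\SQLBackups\SalesDB\SalesDB_FULL_Sunday.bak';

-- Then re-map data and log files to wherever they actually live
-- on the luke-warm host (names/paths here are hypothetical):
RESTORE DATABASE [SalesDB]
	FROM DISK = N'F:\SQLBackups\SalesDB\SalesDB_FULL_Sunday.bak'
	WITH NORECOVERY,
		MOVE N'SalesDB_Data' TO N'D:\SQLData\SalesDB.mdf',
		MOVE N'SalesDB_Log'  TO N'E:\SQLLogs\SalesDB.ldf';
```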

FULL vs DIFF vs Log File Backups. Another obvious thing that your mass-recovery script will have to deal with is the fact that all ‘backups’ are not created equal – meaning that any script you’ll create to quickly ‘whip through’ the process of restoring databases from backups will have to ‘know’ how to work with FULL, DIFFERENTIAL, and LOG FILE backups as needed in order to restore the latest version of a database from backup files on hand. And, of course, this means that whatever script you’ll end up creating needs to be 100% coupled to media choices, pathing, and naming conventions used in the creation of backups.
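If you'd rather not trust naming conventions alone, you can also interrogate a backup file directly – RESTORE HEADERONLY reports a BackupType column that identifies what the file actually contains. (The path below is a hypothetical placeholder.)

```sql
-- Inspect a backup file to confirm what kind of backup it really is.
-- BackupType: 1 = FULL, 5 = DIFFERENTIAL, 2 = TRANSACTION LOG.
RESTORE HEADERONLY
	FROM DISK = N'F:\SQLBackups\SalesDB\SalesDB_DIFF_0300.bak';
```

That said, a sane, consistent naming convention is still what makes a mass-recovery script simple – HEADERONLY is a sanity check, not a substitute.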

Third Party Backups. It should also go without saying that if you’re using third-party backups (for compression and/or encryption), you’ll need to account for these variables as well when creating a mass-recovery script. That said, in most cases, third party backup solutions are gobs easier to use against ‘piles’ of backup files than SQL Server Management Studio’s crappy Recovery GUI that ONLY allows you to select a single .trn file at a time (for example) – to the point where if you only have a few databases, you MAY actually find that using the GUI that ships with your third-party backup solution is good enough for recovery purposes. Of course, you won’t know this without testing. That, and while SQL Server’s licensing allows you to deploy full-blown instances of SQL Server at no cost for DR purposes, many third party vendors do NOT allow the same kind of deployment options – meaning that you may have to pay for full or partial licenses of third-party backup software on luke-warm backup servers – so make sure to check into these details as needed.

Not-So-Obvious Concerns for Luke-Warm Backup Scripts

Permissions. While you as a DBA may be able to see that all of your SQL Server backups are comfortably nestled in, say, your F:\SQLBackups\ folder, that doesn’t mean that SQL Server will be able to see those backups on a luke-warm failover server. (On the primary server where the backups are made, SQL Server will have rights to see and interact with the files it creates – but that will NOT be the same case on a failover/backup server where these files have merely been copied.) Consequently, you’ll need to make sure that the Service Account that SQL Server is running under for your luke-warm failover server has access to the copied backups that have been moved on-box from an off-box source.
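A quick way to check whether the SQL Server service account can actually see the copied files is to ask SQL Server itself. (The path below is a hypothetical placeholder; xp_dirtree is undocumented, but widely used for exactly this kind of check.)

```sql
-- If the service account lacks rights to the folder, this typically
-- returns nothing - rather than the list of copied backup files:
EXECUTE master.sys.xp_dirtree N'F:\SQLBackups\SalesDB', 1, 1;
```

If that comes back empty while YOU can see the files in Explorer, you've found a permissions problem before it found you mid-disaster.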

Logins. Recovering data after a disaster is your primary concern. But even if you’ve been able to recover all data within specified RPOs, that doesn’t do any good if end-users and applications can’t reach that data. Accordingly, you’ll need to make sure that you’ve got some sort of mechanism in place that allows applications to point at a new host. Similarly, you’ll need to make sure that you’ve properly defined, and mapped, logins on the failover server so that end-users and applications can log in and access data correctly. For more information on these concepts, I recommend that you watch Copying and Moving SQL Server Logins – where I describe many of the nuances associated with these needs (along with providing links to scripts you can use to script/migrate logins and to correct issues with ‘orphaned users’).
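As a quick sanity check after recovery, you can list 'orphaned' database users – users whose SIDs no longer map to any server-level login – with something like the following. (This is a common diagnostic pattern, not a complete login-migration solution; see the video mentioned above for the full treatment.)

```sql
-- List database users whose SIDs don't match any login on this server:
SELECT dp.name AS OrphanedUser
FROM sys.database_principals dp
LEFT JOIN sys.server_principals sp ON dp.sid = sp.sid
WHERE sp.sid IS NULL
	AND dp.type IN ('S', 'U')  -- SQL and Windows users
	AND dp.principal_id > 4;   -- skip dbo/guest/sys/INFORMATION_SCHEMA
```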

The Tail End of the Log. In a typical, on-box, disaster recovery scenario, the first operation in any DR plan should be to back up the tail end of the transaction log (or the active portion of the transaction log that hasn’t been backed up yet) – as a means of protecting transactions that have already completed but which haven’t been backed up yet. Only, when it comes to off-box/redundant backups of the type I’ve described in this and my two previous posts, it almost goes without saying that you’ll rarely have access to this valuable resource. If, however, you’re setting up some sort of secondary/off-box contingency and think you may have access to the tail-end of the log, you’ll want to account for that as needed – as any opportunity to leverage the tail-end of the log is commonly going to be worth its weight in gold in terms of the amount of work it will save by not having to ‘re-enter’ transactions and operations that would otherwise be lost.
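If you do find you have access to the primary's log file, a tail-log backup is how you capture that last, un-backed-up activity. (The database name and path below are placeholders.)

```sql
-- Back up the tail of the log on a damaged/offline primary.
-- NORECOVERY leaves the database in a restoring state; NO_TRUNCATE
-- allows the backup to proceed even when the data files are damaged
-- or unavailable:
BACKUP LOG [SalesDB]
	TO DISK = N'F:\SQLBackups\SalesDB\SalesDB_TAIL.trn'
	WITH NORECOVERY, NO_TRUNCATE;
```

That .trn file then gets applied as the very last RESTORE LOG in the recovery sequence – bringing you as close to zero data loss as the disaster allows.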

Errors. Nobody likes dealing with errors – and in a perfect world there wouldn’t be any. But we don’t live in a perfect world, and (frankly) since we’re already talking about mechanisms for dealing with the ugliness of system-wide failures that have forced you onto secondary/luke-warm hosts for potentially mission-critical data, it goes without saying that errors are a fact of life. More importantly, the last thing you want in a set of recovery scripts is to have an error go un-noticed, un-attended, or un-addressed – especially if that means you end up losing precious time. Accordingly, good mass recovery scripts account for the fact that errors can and will occur, and then attempt to minimize the potential impact of those errors. (In my experience, the occurrence of errors can be mitigated through regular testing, but that won’t completely eliminate the possibility of errors cropping up. As such, I typically take the approach of just ensuring that those errors are called out and NOT compounded by attempting to apply more backups or operations against databases that have encountered a recovery error. In this way I don’t waste precious time by having automation routines continue to work with something that can no longer be automated, and I also help avoid the mental stress and overhead that comes when you’re bombarded with gobs of errors being thrown as opposed to a single error being encountered and called out.)
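That 'call the error out and stop compounding it' approach can be sketched with a simple TRY/CATCH wrapper that records failures in a table variable. (The @Errors table here mirrors the one referenced in the sample script later in this post; database and file names are placeholders.)

```sql
-- A minimal sketch of recording (rather than compounding) restore errors:
DECLARE @Errors TABLE (dbname sysname, errorMessage nvarchar(2048));

BEGIN TRY
	RESTORE LOG [SalesDB]
		FROM DISK = N'F:\SQLBackups\SalesDB\SalesDB_LOG_0315.trn'
		WITH NORECOVERY;
END TRY
BEGIN CATCH
	-- Record the failure; subsequent operations against this database
	-- should check @Errors and skip it rather than pile on more errors:
	INSERT INTO @Errors (dbname, errorMessage)
	VALUES (N'SalesDB', ERROR_MESSAGE());
END CATCH
```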

Example Mass Recovery Script

As a more ‘hands-on’ approach to outlining some of the specifics that go into creating and using a mass-recovery script, I’m including a very ROUGH copy of a script I put together during the holidays. It’s similar to some scripts I’ve written for clients – but focuses on the use of ‘native’ SQL Server backups (instead of being designed to work against large numbers of databases that are backed up with third-party backup solutions).

/*
----------------------------------------------------------------------------------------
INSTRUCTIONS
A) Don't run this in production. Run it on a luke-warm backup server.
B) Check out the Notes/Warnings/Section Below.
This script WORKS, but it's very limited and should NEVER be used without testing
and validation prior to use with production/emergency data.
C) PRESS CTRL+T to kick output into 'text' mode. It runs fine in grid mode (CTRL+D), but
is optimized for execution in text-mode.
D) Set necessary parameters down in the PARAMETERS section.
Make sure that your SQL Server Service account has the necessary perms against the folders
specified.
E) To test it out, try running it with @debug = 1 and @executeCommmands = 0 first to see if the
syntax/pathing/etc. all look correct.
F) Once you've validated that this basic script works and can perform a 'mass' recovery operation
against your data, revisit the NOTES/WARNINGS/CONSIDERATIONS section and FINISH this script
for your environment. In other words, this script is a 'skeleton' designed to give you an idea
of KEY operations - not as a complete DR / Failover solution.

----------------------------------------------------------------------------------------
NOTES/WARNINGS/CONSIDERATIONS:
- DO NOT RUN THIS ON YOUR PRODUCTION SERVER. IT WILL OVERWRITE YOUR DATABASES.
- This Script REQUIRES xp_cmdshell. If you don't know how to turn that on, read up on
the SECURITY implications of turning that on in your environment - they're HUGE.
- This script ASSUMES that FULL, DIFF, and LOG backups contain FULL, DIFF, and LOG in the
respective file-names (somewhere).
- This script works - but should NEVER be used in your environment against production
data that you care about.
- This script is ONLY designed as a rough outline of KEY operations to tackle during
a mass-recovery operation on a LUKE-WARM standby server (or a server where SQL Server is already installed
but where no DBs are running).
- This Script WILL replace DBs if they already exist on the target. If they don't exist on the
target host, it will re-map their files to the directories specified.
- HOWEVER, this script ONLY accounts for a primary .mdf and a single .ldf - if you need more than
this, you'll need to address that yourself.
- This script does NOT account for users/apps being logged into dbs that are being restored.
- This script does NOT account for data at the tail-end of your log on a PRODUCTION server. It's designed
for a smoke and rubble/failover/remote-available contingency.
- This script does NOT account for permissions/logins. If you need help with those
see the following video: http://www.sqlservervideos.com/video/copying-and-moving-sql-server-logins/
- Error Handling in this script is ONLY stubbed in and has NOT been tested.

----------------------------------------------------------------------------------------
PARAMETERS:
@debug - specifies whether or not the commands to be executed will be 'echoed' out
to the 'messages' tab during execution of this routine. If @debug is set to 1 and
the @executeCommmands is set to 0, you'll get a verbose output/summary of all commands
that WOULD have been executed had this script been run normally - but NONE of the
actual commands will be run or executed.
@executeCommmands - specifies whether or not the actual RESTORE and other commands should
actually be run or executed.
@dbsToRestore - A comma-delimited list of databases to be restored. Databases are restored in
the order specified in this list. So, databases that should be restored sooner than
others should be listed first/prioritized as needed.
@backupsRoot - This is the path to the base backup folder for the server in question.
This script EXPECTS that the base/root folder for backups will have a sub-folder
in it for every database you wish/need to restore - and that each subfolder will
contain FULL, [DIFFERENTIAL], and LOG file backups as needed.
@dbsRoot - This is the root path to the location where end-user databases are stored.
@recoverDatabases - Specifies whether or not databases should be recovered. In most disaster
recovery scenarios, this should be true/1 - unless there's a need to add more transaction
log backups for some reason. However, if trying to set up log shipping or something else,
then set this to 0 - so that other log files can be added/etc.
*/

-- NOTE: this is an excerpt from the middle of the script - parameter declarations,
-- the 'targets' cursor (one row per database in @dbsToRestore), the @backupFiles and
-- @Errors table variables, and the FULL/DIFF restore logic all appear earlier.

-- Start applying Log File Backups:
DECLARE logFiles CURSOR FAST_FORWARD FOR
SELECT [output]
FROM @backupFiles
WHERE [output] NOT LIKE '%FULL%' AND [output] NOT LIKE '%DIFF%' -- don't grab the FULL backup or the DIFF backup if there was one.
ORDER BY id ASC;

DECLARE @logFilebackupName sysname;
DECLARE @command nvarchar(800);

OPEN logFiles;
FETCH NEXT FROM logFiles INTO @logFilebackupName;

WHILE @@FETCH_STATUS = 0 BEGIN

	-- build up a RESTORE LOG command for the current log-file backup:
	SET @command = N'RESTORE LOG [' + @targetDatabase + N'] FROM DISK = N'''
		+ @backupsRoot + N'\' + @targetDatabase + N'\' + @logFilebackupName
		+ N''' WITH NORECOVERY;';

	IF @debug = 1
		PRINT @command;

	IF @executeCommmands = 1 BEGIN
		-- don't apply additional log backups if we've bumped into errors:
		IF NOT EXISTS (SELECT NULL FROM @Errors WHERE dbname = @targetDatabase)
			EXECUTE (@command);
	END

	FETCH NEXT FROM logFiles INTO @logFilebackupName;
END

CLOSE logFiles;
DEALLOCATE logFiles;

IF @executeCommmands = 1 BEGIN
	-- don't recover the database if we've bumped into errors:
	IF NOT EXISTS (SELECT NULL FROM @Errors WHERE dbname = @targetDatabase)
		EXECUTE (@recovery);
	ELSE
		RAISERROR('NOT recovering %s due to errors.', 1, 0, @targetDatabase) WITH NOWAIT;
END

-- load up another db, and go around the horn again...
FETCH NEXT FROM targets INTO @targetDatabase;
END

CLOSE targets;
DEALLOCATE targets;

-- Report on Errors:
IF @executeCommmands = 1 BEGIN
	IF EXISTS (SELECT NULL FROM @Errors) BEGIN
		RAISERROR('Outputting Error Information', 1, 0) WITH NOWAIT;
		SELECT * FROM @Errors;
	END
END

GO

The script itself is fairly basic and simple. By specifying base paths to folders where backups for multiple databases can be stored (assuming, of course, that each database’s backups are stored in their own folder), the script takes advantage of xp_cmdshell (an extended procedure that hackers can use to wreak havoc on your box if they’re able to take control via SQL Injection) to query the OS and get a list of all backup files in a given directory. It then finds the latest FULL backup, restores it, looks for the latest/last DIFFERENTIAL backup if there is one (and applies it if needed), and then finds all Transaction Log backups since the last FULL/DIFFERENTIAL backup and applies them one after the other until they’re all gone – at which point it then attempts to RECOVER the database.

All in all, the script is VERY rudimentary, and just ‘automates’ the process of restoring files/media as needed. It also comes with a huge number of caveats (in the NOTES/WARNINGS/CONSIDERATIONS section) – as the script is NOT meant to go into your production environment as a full-blown solution. Instead, it’s best thought of as a bit of scaffolding or an outline of key concepts and concerns that you’ll want to address. But, by taking a script like this and making it your own, you CAN come up with some great solutions that will allow for excellent recovery options on luke-warm standby servers. I’ve actually got solutions similar to this one working for a couple of clients where the use of these kinds of ‘mass recovery’ scripts allows me (or my clients) to periodically go in and verify that RPOs and RTOs can be met in the WORST kind of disaster – such as when existing HA functionality fails, or when an entire data-center is lost. So, in that regard I have full confidence in the overall approach being outlined here. But in order for something like this to work for you, I can’t stress enough that you’ll need to take what I’ve provided here as a STARTING POINT from which you’ll need to throw in your own, additional, effort and testing to get it to be a viable solution for your needs.