Pop Quiz Hotshot! Your Server Dies! What do you do? What do you do?

You get a frantic phone call at 3am on a Sunday. It’s your boss, and there’s been a fire. The server room, and perhaps the whole building, is still smoldering. He asks you the following questions:

Is our data safe?

Is it off site? Is it secure or on someone’s kitchen table?

When was the last time you checked the backups? You are checking them, right?

Are you sure you can restore the M2M files? You have been performing test restores, right?

There are many blog articles on the web about the importance of disaster recovery, and you’ve probably read a few of them. I won’t re-hash everything here, but do you have the answers to all of those questions? If not (and you value your job), stop reading this and address them immediately. I’ll wait….

Good! Now that the above are covered, let’s address some M2M-specific issues.

Can you build your SQL Server and install everything without help? Servers seem to prefer to die on weekends, and M2M Support is not available late at night or on weekends and holidays.

Do you have documentation handy with all of your M2M and other settings?

Do you have instructions ready to install all of your optional modules? Do you have the latest install files for them? I recommend keeping all of this (including the install files for everything) in electronic format with your back-ups if possible.
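Part of that documentation can come straight from SQL Server itself. The following is a minimal sketch, assuming SQL Server 2005 or later (the sys.master_files view does not exist on SQL Server 2000). It covers only the SQL Server side, not M2M's own settings, and the output would need to be saved alongside the backups:

```sql
-- Version, service pack level, edition, and server collation:
-- the details you need in order to rebuild a matching instance.
SELECT SERVERPROPERTY('ProductVersion') AS ProductVersion,
       SERVERPROPERTY('ProductLevel')   AS ServicePackLevel,
       SERVERPROPERTY('Edition')        AS Edition,
       SERVERPROPERTY('Collation')      AS ServerCollation;

-- Current server-level configuration options (basic options only,
-- unless 'show advanced options' has been enabled).
EXEC sp_configure;

-- Logical name, physical path, and size of every database file.
-- size is stored in 8 KB pages, so size / 128 gives whole megabytes.
SELECT DB_NAME(database_id) AS DatabaseName,
       name                 AS LogicalFileName,
       physical_name        AS PhysicalPath,
       size / 128           AS SizeMB
FROM   sys.master_files
ORDER  BY DatabaseName, LogicalFileName;
```

The file list is worth keeping in particular, because a restore onto a rebuilt server may need those logical file names.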

I remember watching a movie where special forces soldiers spent hours assembling and disassembling their rifles blindfolded. The point was that they became intimately familiar with their weapon and knew everything about it. Ultimately, this made them more effective soldiers.

Recently I wrote disaster recovery instructions for SQL Server, Made2Manage, and the associated modules for my current employer. One of my fellow employees performed the same process on a Domain Controller and Exchange Server. We then verified them by restoring everything from bare hardware, back-ups, and our instructions in an isolated room without access to the internet. Our manager also put us under stress by telling us our results were going to be public and that we had better not let him down.

Why a closed room and stress? It simulates a disaster. Can you be assured that you will have access to the internet in a disaster? Even if you could, would your boss want to wait while you download service packs? Since we had all of the files and had practiced, we built and tested our servers in less than 4 hours.

What are the benefits of a complete test?

We proved we could do it.

I learned exactly what makes M2M work.

I’m now confident that I can fix anything on that server.

Incidentally, with the exception of our database server, the rest of our servers are replicated in two other sites in the United States. In the near future, our SQL Server will be replicated the same way. So we should never actually be “under the gun” so to speak.

18 comments to Pop Quiz Hotshot! Your Server Dies! What do you do? What do you do?

David,
Question on the practice restore. If it doesn’t work and you are practicing on your production server (say on the weekend) what do you do? If not on your production server, then is it valid? I back up server data to a Buffalo Terastation located in a separate building on site. The strategy assumes only one of the buildings would be destroyed. In my boss’s words, if both the buildings were destroyed no need to worry about the data….

Forget the problem of both of the buildings being *destroyed* – think smaller. What would happen if both buildings had no power for two weeks? No internet connectivity? These are completely legit problems – I faced them during the aftermath of Hurricane Wilma in Miami and Hurricane Ike in Houston. Nothing in my buildings got touched, but without power or internet, I needed another plan – fast.

Brent,
You make a valid point. However, as a manufacturer we simply do not operate when there is no power. Several years ago there was a meltdown that left a good part of the Midwest and parts of the East Coast without power. The culprit was supposedly FirstEnergy, located in our neck of the woods. We had to close down for a few days as without power we could not operate. It is simply not practical for a manufacturer to maintain duplicate facilities.

Rick – you don’t have to maintain duplicate facilities, you just have to have a plan. This is why there are offsite tape vendors like Iron Mountain that will come out to your location and pick up backup tapes on a regular basis. If you can’t afford that, you can rotate tapes out yourself, or log ship over the wire to a colo file server. The important thing is just to be able to recover – and every business needs to be able to recover.

I agree with Rick’s original point, only I have it even worse. I don’t have a backup or test server to practice on. I can’t afford to take down our production server on a weekend and risk not having it working again Monday morning. The budget simply does not allow for the purchasing of an additional server.

Scott,
Exactly. It is something I have never heard explained. By the way, the president of my company says that in the event we lose two or more buildings at the site, the owners will all be heading to warmer climes as the property is put up for sale. So I guess not all businesses need a continuation plan…

Rick, I personally would not use my production server for testing. In fact, I don’t do any development or customization on it if I can help it. Everyone should have a test server. If you (with your recovery plan) can build a server from scratch and install all the necessary components, there’s no reason to think that it won’t work flawlessly with your production server.

Scott, as far as cost, you can build one of these out of an old, decommissioned desktop computer and the only cost you have is for SQL Server Developer edition. The cost is about $50.

You don’t need to install Visual FoxPro on the server itself, so you could use your company PC for that. Also, you can use Windows Server for a 60-day trial period at no cost. Of course, you can complete disaster recovery exercises in less than 60 days.

In addition, a test server (for custom code and such) only needs Windows XP, not Windows Server, and most office PCs come with a licensed copy, so there’s no extra cost there either.

I read your Test Server article. Is a test server running on XP actually a valid test? This kind of goes along with Rick’s question. How do you know that code you write will work on your production server if you aren’t running the same operating system on your test box?

Andrew & Richard – great questions, but here’s the thing: some testing is better than no testing. If you’re sitting around waiting for the company to buy you an environment that’s an exact duplicate for production as your very first test environment, you’re working for someone with a lot more money than most. Your job counts on your ability to restore your data, and if your company hasn’t bought you a testing environment that’s robust, you have to start somewhere. David’s article is about getting started with fire drill restores, and sadly, I just see way too many “DBAs” who have never tested their restores. That’s just dangerous.

I’m using SQL 2008 on Microsoft Server 2003. I know that’s not the ideal situation. I should be using SQL 2000 on my development/test machine, but I am experimenting and learning 2008. So far I have not run into any inconsistencies.

I created a “test/Report” SQL server to provide access for Excel users to do as they please without affecting the prod environment. It runs without taking M2M offline – using a snapshot publication initially.

/* This procedure exists on the Report M2M server – */
/* I also have a code snippet that confirms that there are no open transactions or users on the table prior to copy */
/* I drop the report server tables, then move only the core M2M tables over, then run any SQL stored procedures on the report server – thus taking the load off the production server and generating whatever they need via a scheduled task using VB.Net */

CREATE PROCEDURE spn_CopyFromM2MProdSQL
AS
/* Drop the tables from the report server – the report server name is M2MRPTSQLServer, the database is 'RPTM2MDATA01' */
BEGIN
IF EXISTS (SELECT * FROM RPTM2MDATA01.INFORMATION_SCHEMA.TABLES WHERE TABLE_CATALOG = 'RPTM2MDATA01' AND TABLE_SCHEMA = 'dbo' AND TABLE_NAME = 'apitem')
DROP TABLE RPTM2MDATA01.dbo.apitem
IF EXISTS (SELECT * FROM RPTM2MDATA01.INFORMATION_SCHEMA.TABLES WHERE TABLE_CATALOG = 'RPTM2MDATA01' AND TABLE_SCHEMA = 'dbo' AND TABLE_NAME = 'apmast')
DROP TABLE RPTM2MDATA01.dbo.apmast
IF EXISTS (SELECT * FROM RPTM2MDATA01.INFORMATION_SCHEMA.TABLES WHERE TABLE_CATALOG = 'RPTM2MDATA01' AND TABLE_SCHEMA = 'dbo' AND TABLE_NAME = 'apvend')
DROP TABLE RPTM2MDATA01.dbo.apvend

/* Repeat for each table you may need, like SOMAST, INMAST etc. – follow the same pattern for other companies. */
END

/* Copy the database tables from the production SQL M2M instance – M2MProdSQL – into the report database 'RPTM2MDATA01' */
BEGIN
/* M2MProdSQL is the production M2M SQL server, referenced here by four-part name as a linked server */
/* The target must be the same database the DROP statements above clear, or SELECT INTO fails on the next run */
SELECT * INTO RPTM2MDATA01.dbo.apitem FROM M2MProdSQL.m2mdata01.dbo.apitem
SELECT * INTO RPTM2MDATA01.dbo.apmast FROM M2MProdSQL.m2mdata01.dbo.apmast
SELECT * INTO RPTM2MDATA01.dbo.apvend FROM M2MProdSQL.m2mdata01.dbo.apvend
END
GO

Do you want to know how many connections you have to each SQL database? I find this useful, as every now and then I find old connections that should be cleared but hung for some reason or other.
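One common way to get that count is to group the sysprocesses view by database and login. This is a minimal sketch rather than the exact query the comment refers to; it assumes you have permission to see other sessions, and the column aliases are only illustrative:

```sql
-- Count current connections per database and login.
-- master.dbo.sysprocesses exists on SQL Server 2000 and is kept as a
-- compatibility view on later versions.
SELECT DB_NAME(dbid) AS DatabaseName,
       loginame      AS LoginName,
       COUNT(*)      AS ConnectionCount
FROM   master.dbo.sysprocesses
WHERE  dbid > 0                       -- skip rows not tied to a database
GROUP  BY dbid, loginame
ORDER  BY DatabaseName, LoginName;
```

A login showing connections to an M2M database long after that user has gone home is a candidate for the kind of hung connection described above.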

Larry, I glanced at your code and it’s interesting. However, there are much easier ways to make a copy of the M2M database for reporting. The easiest I can think of is simply restoring a backup of your live data to another server for reporting. That way you don’t have to worry about users being out of the system or other issues.
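The restore-based approach can be sketched in a few statements. The backup path, data paths, and logical file names below are hypothetical placeholders; the database name RPTM2MDATA01 is borrowed from the comment above. RESTORE FILELISTONLY reports the real logical names to use in the MOVE clauses:

```sql
-- Check that the backup file is readable and complete.
-- This is not a substitute for actually restoring it.
RESTORE VERIFYONLY
FROM DISK = N'D:\Backups\m2mdata01.bak';          -- hypothetical path

-- List the logical file names stored in the backup.
RESTORE FILELISTONLY
FROM DISK = N'D:\Backups\m2mdata01.bak';

-- Restore the production backup under the reporting database name,
-- relocating the data and log files to this server's folders.
RESTORE DATABASE RPTM2MDATA01
FROM DISK = N'D:\Backups\m2mdata01.bak'
WITH MOVE N'm2mdata01_Data' TO N'D:\SQLData\RPTM2MDATA01.mdf',      -- hypothetical logical name and path
     MOVE N'm2mdata01_Log'  TO N'D:\SQLData\RPTM2MDATA01_log.ldf',  -- hypothetical logical name and path
     REPLACE,      -- overwrite the existing reporting copy
     STATS = 10;   -- report progress every 10 percent
```

Run on a schedule, this also doubles as a routine test restore: every refresh of the reporting copy shows that the latest backup can actually be restored.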

BTW, you really should look into the new M-Data Analytics project. I think we could use your skills and you seem like someone who would like to go further with SQL Server as well.

Good article, David, and I know I’m late to respond here. I think that Brent is right, you have to think smaller. In 20 years in this business, I’ve dealt with hurricanes and large disasters, but the most common ones are small. An internet line gets cut. A fire in the building that requires us to move out for 2 weeks. A server dropped on another server and we have to rebuild both.

I agree that most of the time companies don’t want duplicate facilities, and they rarely, in my experience, purchase hardware that mirrors production, but you still have to plan out, think about, and test some other hardware. I need to have test servers that I can set up to take some of the load from production. These days, in 2012, so much desktop hardware can run a decent load. Even if you have a 64-core production server with 512GB of RAM, I’m sure that a quad-core, 16GB server will be able to run the system. It can’t support the load, but it can run, and you can test to see if you’ve set it up properly.

Preparing for DR is the biggest issue, and I love the analogy of rifle disassembly/reassembly. You should be able to do this stuff easily, though I’m not sure we’re ever without the Internet in some form these days.