Monthly Archives: October 2011

I have a database that needs to be online 24/7/365.242199. Ok, so it doesn’t reeeealy need to be online that much. Uptime does not need to be measured in milliseconds for any of my company’s applications. No one’s life depends on this database being online so I’m quite alright with small outages. When revenue generating systems go offline, it’s bad, but not life threatening in my case.

That said, it’s a nice bonus for my reputation if I can keep a database online for a month straight or maybe even a year. However, I need to be careful not to make HA decisions purely based on my reputation.

I will not be upgrading any production or QA servers to SQL 2012 until at least six months after RTM. I am making this a rule for myself so I am not tempted by any of the shiny new features in AlwaysOn. I will investigate them, but from what I have heard, all solutions still drop connections, so it's not worth upgrading for that alone. Is a dropped connection an outage? I think so, but I'll let you talk to your users and decide.

The Project

We chose VMware and two physical host servers for minimal redundancy. VMware allows for flexibility, scalability, and some awesome HA/DR tools. For the server architecture, it was easy to spin up multiple DC/IIS/SQL servers to support this application. Database mirroring was the obvious choice because it let me patch the SQL Server with a few seconds or less of downtime for this particular application.

But we only had two physical servers. A requirement for automatic failover is a witness server in addition to the principal and mirror servers. We had separate VMs but only two physical hosts. I put some hard thought into this and decided that the witness VM should sit on the same physical host as the mirror. In the event of a hardware failure on the first host, the witness would still be able to "turn on" the mirror. With a DC/IIS/SQL box on each host, users wouldn't have to wait for the VMware HA features to kick in.

Even given the failure scenario I am about to describe, I will still keep it that way.

Boom, disaster strikes

Turns out vLAN settings are a single point of failure. If a vital vLAN gets blown away, some servers may still be able to talk to each other, but not to their redundant pairs. My particular failure allowed VMs sitting on the same host to talk to each other, but VMs on different hosts could not. On top of that, the VMs could not talk to anything else, intranet or internet. The host management traffic is on a separate vLAN, so I was still able to access the consoles of the VMs for troubleshooting.

Those of you familiar with database mirroring might have already guessed what happened, but there is an interesting twist. The witness wasn't able to talk to the principal database. It was able to talk to the mirror because it sat on the same host, so it promoted that mirror to principal. Boom, bam: we've got two working principals… or maybe not. Of course, I didn't really think about this scenario until after everything was working again. One of the troubleshooting steps we took was to vMotion all the servers onto one host; I believe at that point the witness was able to re-sync the new principal with the new mirror.

Is that a big deal? Well, it depends. Had internet traffic been flowing into the load-balanced IIS boxes, writes could have been occurring on both databases.

Unless, of course, the principal, unable to talk to either the witness or the mirror, took itself offline. The quorum works like it should: two out of three win.
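That two-out-of-three rule can be sketched as a tiny decision function. This is a simplified model of mirroring quorum as I understand it, not SQL Server's actual logic, and the names are mine:

```python
# Simplified model of database mirroring quorum: a server keeps (or takes)
# the principal role only when it can form a majority, counting itself
# plus each partner it can still reach. Illustrative only.

def has_quorum(sees_partner: bool, sees_witness: bool) -> bool:
    """One vote for this server plus one per reachable partner; 2 of 3 wins."""
    return 1 + int(sees_partner) + int(sees_witness) >= 2

# My vLAN failure: the principal was cut off from both mirror and witness,
# so it loses quorum and takes the database offline.
principal_stays_online = has_quorum(sees_partner=False, sees_witness=False)

# The mirror and witness shared a host and could still see each other,
# so the mirror forms quorum and gets promoted.
mirror_can_take_over = has_quorum(sees_partner=False, sees_witness=True)

print(principal_stays_online, mirror_can_take_over)  # False True
```

Which is exactly why there was no split-brain: only one side of the partition could ever count two of the three votes.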

#sqlhelp was able to calm my nerves and buy me some time to test this out myself. When designing the solution I wanted to believe split-brain couldn't happen, but when the failure actually occurred, I had doubts. Even in this scenario, the vLAN came back online and everything was fine: all databases and mirroring connections were restored and synchronized without intervention. Myth busted!

I've only done this a couple of times, and I want to get better at it. I have an opportunity to have this discussion in the near future, and this is my plan of attack. The first misstep would be to become combative and defensive, so the use of "attack" is sarcasm 🙂

So here is the kind of general back-and-forth discussion that can loop and become very time consuming.

Users Q: How much is this going to cost?
IT Q: What is your RPO/RTO?
Users Q: What is RPO/RTO?
IT A: RPO/RTO is how much data you can lose and how fast you need to be back up.
Users Q: How much is this going to cost?
IT Q: What is your RPO/RTO?
Users Q: What is RPO/RTO?

It's great to have a snowballing email chain and be able to watch everyone chime in on the discussion. This kind of thing happens, and here is how I plan to avoid it in the future.

So to exit this loop, you could provide a menu of basic options, something like good/better/best. However, that doesn't work out too well because it's difficult to put a price tag on a shared environment. If you tell them how expensive your storage is, they may send you a 1TB drive they picked up at Best Buy.

Newer high-end SANs have the ability to "tier" storage so the important applications get the right of way. I don't endorse this as a good practice because I have not seen it work… yet. What is happening is that more code is being injected into the SAN to compensate for architectural problems. More code means more code updates, and that can cause major problems you could have avoided in the first place.

Back to the discussion: the meeting starts, everything is fine, and people are happy to chat, but eventually work has to start.

IT Q: How much data loss is acceptable?
Users A: None.

So the frustrations start. A flurry of emotions is sparked by this response because we know all the time, effort, and money that will get tossed at the problem for, potentially, no valid reason.

It's easy to avoid this discussion and just do nightly backups. With nightly backups in simple recovery mode for SQL Server, you can promise a 24-hour RPO. But backups mostly happen at night, and 99.99% of the time IT isn't going to log in off-hours and run an extra backup; they will take care of it in the morning… or next week. So as long as the failure doesn't allow users to keep entering data into the system, you can hold a 24-hour RPO, but if it does, we're looking at more like a 36-hour RPO.
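The 24-hour versus 36-hour figure works out like this; a back-of-the-envelope sketch with hypothetical times:

```python
from datetime import datetime

# Hypothetical timeline: nightly backups run at midnight. A failure during
# the day means that night's backup never happens, and nobody looks at the
# system until the next morning.
last_good_backup = datetime(2011, 10, 3, 0, 0)  # midnight, day 1
it_responds = datetime(2011, 10, 4, 12, 0)      # noon, day 2

# If the failure stops all writes immediately, only data since the last
# backup is exposed: at worst the promised 24 hours. But if users can keep
# entering data until IT responds, everything up to that point is at risk.
exposure = it_responds - last_good_backup
print(exposure)  # 1 day, 12:00:00 -- the 36-hour RPO
```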

What I suggest is not leading with the RPO/RTO definitions. This conversation should start with a little about the application.

1. How many users?
2. How many concurrent users?
3. What kind of data?
4. How is the data accessed?
5. What kind of data lives outside the database (for app-consistent restores)?
6. What are the maintenance windows?
7. What will the growth rate be for data and users?
8. What is the rollout schedule (can we start small and grow)?

Getting to know the application shows that we, IT, care about it, and we really do. With this information we can compare and contrast it with some more important applications and some less important ones. Maybe at this point it is safe to let the users know what those other applications' restore objectives are. The users will recognize where their application sits in the food chain and will probably be totally fine with it. They don't use just one app all day long, so they will be happy to know you have taken the time to properly organize your priorities.

The end game for an internal RPO/RTO discussion should be communication. Communicate the challenges IT faces, in writing. It doesn't have to be some kind of contract. Just like in any other project that ran a year past due, restore objectives are really just goals. You cannot accurately predict future disasters. You should still try to predict the future, though, because you may be a lot closer than you think.

There are still a lot of servers I manage that are not part of WSUS yet. That means going in and manually installing patches. I knew I had a lot of these servers, and going down the list one by one isn't very efficient. One thing that would really help is a list ordered by the last time each server was patched. Then I could target the worst offenders first, and if I had time, move on to the servers that need fewer patches.

This simple command will get a list of all the patches that have been installed with a few other pieces of information.

wmic qfe get > c:\wu.txt
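To give a feel for what parsing that report involves, here is a rough Python sketch against a made-up sample of the fixed-width output. The column layout varies by system, so the header offset has to be computed rather than hard-coded:

```python
# Parse a (made-up) "wmic qfe get" dump by slicing each row at the column
# offset where the InstalledOn header starts. Illustrative only; the real
# report's columns and widths differ from system to system.
sample = (
    "Description      HotFixID   InstalledOn\n"
    "Update           KB2533552  10/5/2011\n"
    "Security Update  KB2572077  9/14/2011\n"
)

lines = sample.splitlines()
start = lines[0].index("InstalledOn")  # column offset from the header row

installed_on = [row[start:].strip() for row in lines[1:] if len(row) > start]
print(installed_on)  # ['10/5/2011', '9/14/2011']
```

Doing that for every server is a lot of text wrangling, which is what pushed me toward WMI proper.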

For my computer this report is formatted with fixed-width columns and could be parsed. I could add a node parameter to the above command and run it for each server; then I would have all the data I need for my report. However, dumping everything into one heap and then sorting through it would be inefficient: too much data and a lot of parsing. Instead, I went back to an older project that I still use, pulled out some of the WMI code, and looked at the WMI class Win32_QuickFixEngineering, or qfe.

The qfe class has this nice column, "InstallDate", which really sounded like a winner to me. It's actually a date type and could be very useful. However, it's always NULL on every server I checked. I went back to the text report scratching my head and saw there was another column, "InstalledOn". That is the column I need.

I noticed that on some servers this value was blank for a lot of patches. I am assuming a bit here, but they appear to be patches that were installed during sysprep or in some other fashion early in the server's life. The next thing I realized is that all I really need is the top 1 in descending order. This is when I found out there is no sorting in WMI; I would have to sort the results after the fact.

The ManagementObject that gets returned is a bit tricky to work with, so the first thing I do is read the results into an ArrayList. ArrayLists are easy to add to with .Add and also easy to sort: a .Sort() followed by a .Reverse() is all I needed, and the last patch date was sitting in my array(0).
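In Python the same trick looks like this; the dates are made up for illustration:

```python
from datetime import datetime

# Same idea as the ArrayList: collect one parsed date per patch, sort,
# reverse, and the most recent patch date lands at index 0.
patch_dates = [
    datetime(2011, 9, 14),
    datetime(2011, 10, 5),
    datetime(2011, 6, 1),
]

patch_dates.sort()
patch_dates.reverse()
print(patch_dates[0])  # the last time this server was patched
```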

I ran into a snag rather quickly. It turns out that in the Vista era, Microsoft decided it would be a great idea to start returning this value as a 16-character hex string. Part of my code had to handle four different date formats, if you count NULLs.

'oReturn is one record in the ManagementObject. You get one for each patch.
Dim oDate As Object
If oReturn("InstalledOn").Length = 16 Then
    'Vista-era hex FILETIME value
    oDate = DateTime.FromFileTimeUtc(Int64.Parse(oReturn("InstalledOn"), System.Globalization.NumberStyles.AllowHexSpecifier))
ElseIf oReturn("InstalledOn").Length > 7 And oReturn("InstalledOn").Length < 11 Then
    'Plain date string, e.g. "10/5/2011"
    oDate = CDate(oReturn("InstalledOn"))
Else
    'I don't know what date type you are... you must not be a recent patch.
    oDate = DateTime.Now.AddYears(-100)
End If
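For anyone outside of VB.NET, here is my Python translation of the same three-way handling. The 16-character value is a FILETIME, 100-nanosecond ticks since January 1, 1601; the helper name is mine:

```python
from datetime import datetime, timedelta

EPOCH_1601 = datetime(1601, 1, 1)  # the FILETIME epoch

def parse_installed_on(raw):
    """Handle the three InstalledOn shapes: 16-char hex FILETIME,
    a plain M/D/YYYY date string, or blank/None for unknown."""
    if raw and len(raw) == 16:
        # Vista-era hex FILETIME: 100-ns intervals since 1601-01-01 (UTC).
        ticks = int(raw, 16)
        return EPOCH_1601 + timedelta(microseconds=ticks // 10)
    elif raw and 7 < len(raw) < 11:
        # Plain date string such as "10/5/2011".
        month, day, year = raw.split("/")
        return datetime(int(year), int(month), int(day))
    # Unknown format: push it a century into the past so it sorts last.
    return datetime.now() - timedelta(days=365 * 100)

print(parse_installed_on("0000000000000000"))  # 1601-01-01 00:00:00
print(parse_installed_on("10/5/2011"))         # 2011-10-05 00:00:00
```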

When searching for answers to my questions, I did read some complaints about not all patches showing up in the results. For my problem, total accuracy wasn't really necessary; I just needed the date of the last time the server had some TLC. So far it has been very accurate.

My blog now runs on what is known generically as a virtual appliance. The idea behind an appliance is that the server prep work and software installs are done in advance, so you just have to press the power button.

VMware Player is software that fits in between your current operating system and your virtual machine. So if you download a virtual machine from http://www.turnkeylinux.org/, all you have to do is open the .vmx file and you have a computer inside a computer that can host your blog. For the WordPress appliance you just have to enter a few administrative passwords and start blogging. It literally takes minutes to be up and running. Some of the post-install configuration I did:

Set a static IP from Webmin so I could forward traffic from my router to a static IP instead of a dynamic one.

My key complaint about a turnkey solution is this: when a problem occurs, how do you fix it? Granted, troubleshooting isn't the greatest way to learn something, but it is the way I have learned the few Linux skills I have today. Maybe the best thing to do is get started with turnkey solutions and then graduate into a roll-your-own solution.

Another point to be made is about security. Turnkey solutions are about as secure as you are going to get; however, just like Windows, the more users, the bigger the target. I have always been skeptical of aftermarket operating systems, whether it is a pirated copy of Windows or a virtual appliance. In theory, someone could plant a piece of code so low in the system that you would never know it's there. That is probably as outlandish as proposing that someone could slip a backdoor into the .NET Framework. The core concepts of turnkey and .NET are similar: let an expert do the work so you don't have to.

The benefits FAR outweigh the costs (if any). I am so grateful to the folks who have taken their time to develop TurnKey; I can now explore many other platforms quite easily. What I would like to see next is an Android development virtual appliance.