
Recently, on one of our technical forums, I contributed to a discussion about the Windows operating system. One of our directors saw the post and thought it might be of interest to readers of the InnerLayer as well. The post focused on the pros and cons of Windows Server 2008 from the viewpoint of a systems/driver engineer (a.k.a. me). If you have no technical background, or no interest in Microsoft's operating system offerings, what follows probably will not be of interest to you; just the same, here is my two cents.

Microsoft is no different from any other software developer: they get better with each iteration. There is not a person out there who would argue that the world of home computing would have been better off if none of us had ever progressed beyond MS-DOS 1.0. That's not to say there is anything wrong with MS-DOS. I love it, and I still use it occasionally for embedded work. My point is that while there have certainly been some false starts along the way (can you say Bob?), Microsoft's operating systems generally get better with each release.

So why not go out and update everything the day the latest and greatest OS hits the shelves? Because, as most of you know, there are bugs that have to get worked out. On top of that, the more complex the OS gets, the more bugs there are and the more time it takes to shake them out. Windows Server 2008 is no different. In my experience there are still a number of troublesome issues with W2K8 that need to be addressed. Just to name a few:

UAC (User Account Control) - these are the security features that give us so much headache. I'm not saying we don't need the added security. I'm just saying this is a new arena for MS and they still have a lot to learn. After clicking YES, I REALLY REALLY REALLY WANT TO INSTALL SAID APPLICATION for the 40th time in a day, most administrators will opt to disable UAC entirely, thwarting the added security benefits altogether. If I were running this team at MS, I'd require all my developers to take a good hard look at Linux.

UMD (user-mode drivers) - the idea of running a device driver, or a portion of a device driver, in the restricted and therefore safer user-mode address space rather than in the kernel is a great idea in terms of improving OS reliability. I've seen numbers suggesting that as many as 90% of hard OS failures are caused by faulty third-party drivers mucking around in kernel mode. However, implementing user-mode drivers adds some new complexity if hardware manufacturers don't want to take a performance hit, and in my experience not all hardware vendors are up to speed yet.

Driver Verification - this, to me, is the most troublesome and annoying issue right now with the 64-bit-only version of W2K8. Only kernel-mode software that has been certified in the MS lab is allowed to execute on a production boot of the OS. Period. Since I am writing this on the SoftLayer blog, I am assuming most of you are not selecting hardware and drivers to run on your boxes; we are handling that for you. But let me tell you, it's a pain in the butt to run only third-party drivers that have been through the MS quality lab. Besides not being able to run drivers we have developed in house, it is impossible for us to apply a patch from even the largest of hardware vendors without waiting for that patch to be submitted to MS and then cleared for the OS. A good example was a problem we ran into with an Intel Ethernet driver. Here at SoftLayer we found a bug in the driver, and after a lot of back and forth with Intel's engineers we had a fix in hand. But that fix could not be applied to the W2K8 64-bit boxes until weeks later, when the fix finally made it from Intel to MS and back to Intel and us again. Very frustrating.

Okay, so now that you've seen some of the reasons NOT to use MS Windows Server 2008, what are some of the reasons it's at least worth taking a look at? Well, here are just a few that I know of from the work I have done keeping up to speed with the latest driver model.

Improved Memory Management – W2K8 issues fewer, larger disk I/Os than its 2003 predecessor. This applies not only to standard disk fetching, but also to paging and even read-aheads. On Windows 2003 it is not uncommon for a disk write to be broken into many small fixed-size blocks, where W2K8 can coalesce the same data into a single larger transfer.

Improved Data Reliability - Everyone knows how painful disk corruption can be. And everyone knows taking a server offline on a regular basis to run chkdsk and repair disk corruption is slow. One of the nicest improvements in terms of administering a webserver is that W2K8 employs a technology called NTFS self-healing. This new feature, built into the file system, detects disk corruption on the fly and quarantines the affected sectors, allowing system worker threads to execute chkdsk-like repairs on the corrupted area without taking the rest of the volume offline.

Scalability - The W2K8 kernel introduces a number of streamlining factors that greatly enhance system-wide performance. A minor but significant change to the operating system's low-level timer code, combined with new I/O completion handling and a more efficient thread pool, offers marked improvement in load-heavy server applications. I have read documentation supporting claims that the reduction in CPU synchronization alone directly yields a 30% gain in the number of concurrent Windows 2008 users over 2003. That's not to say that once you throw in all the added security and take the user-mode driver hit you won't be looking at 2003 speeds; I'm just pointing out hard kernel-level improvements that can be directly quantified by multiplying your resources against the number of saved CPU cycles.

Alright, no need to beat a dead horse. My hope was, if nothing else, to muddy the waters a bit. The majority of posts I read on our internal forums seemed to recommend avoiding W2K8 like the plague. I'm only suggesting that while it is certainly not perfect, there are some benefits to at least taking it for a test drive. Besides, with SoftLayer's handy dandy portal-driven OS deployment, in the amount of time it took you to read all my rambling you might have already installed Windows Server 2008 and tried it out for yourself. Okay, maybe that's a bit of an exaggeration. But still...you get the idea!

My grandmother used to say an ounce of prevention is worth a pound of cure. Usually this was her polite way of telling me to pick my skateboard up off the stairs before she stepped on it and broke her neck or to put a sheet of newspaper over her antique kitchen table before I began refueling my model airplane. All very sound advice looking back. And now here I find myself repeating the same adage some twenty years later in the context of predicting mechanical drive failure. An ounce of prevention is worth a pound of cure.

Hard disk drive manufacturers recognized both the reality and the advantages of being able to predict the normal hard disk failures associated with drive degradation back in the mid-1990s. This led a number of leading hard disk makers to collaborate on a standard that eventually became known as SMART. The acronym stands for Self-Monitoring, Analysis and Reporting Technology, and when used properly it is a formidable weapon in any system administrator's arsenal.

The basic concept is that firmware on the hard disk itself records and reports key "attributes" of that drive which, when monitored and analyzed over time, can be used to predict and avoid catastrophic hard disk failures. Anyone who has been around computers for more than a day knows the terrible feeling that manifests in the pit of your stomach when it becomes apparent that your server or workstation will not boot because the hard disk has cratered. Luckily, we ALL of course back up our hard drives daily! Right?

All kidding aside, even with a recent backup, just the task of restoring and getting your system back in working order is a serious hassle, and it's not something you get the luxury of scheduling if the machine is critical to operations and failed in the middle of your work day or, worse yet, the middle of your beauty sleep. That is where SMART comes in. Properly used, SMART data can give "clues" that a drive is reaching a failure point--prior to it failing. This in turn means you can schedule a drive cloning and replacement within your next regular maintenance window. Really, aside from a hard disk that lasts forever, what more could an administrator ask for?
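To make that a little more concrete, here is a minimal sketch of reading those "clues" programmatically. It parses an attribute table in the style of smartctl's -A output and flags any attribute whose normalized value has fallen to or below the manufacturer's threshold, which is SMART's own definition of a pending failure. The sample data and the helper names are illustrative assumptions, not SoftLayer's actual tooling.

```python
# Illustrative sample in the shape of a smartctl -A attribute table.
# The values are made up; Spin_Retry_Count is deliberately below threshold.
SAMPLE = """\
ID# ATTRIBUTE_NAME          VALUE WORST THRESH RAW_VALUE
  5 Reallocated_Sector_Ct     95    95    36   112
  7 Seek_Error_Rate          100   100    30     0
 10 Spin_Retry_Count          42    40    97     8
"""

def parse_attributes(text):
    """Return {name: (value, worst, thresh, raw)} from an attribute table."""
    attrs = {}
    for line in text.splitlines()[1:]:   # skip the header row
        parts = line.split()
        if len(parts) < 6:
            continue
        _id, name, value, worst, thresh, raw = parts[:6]
        attrs[name] = (int(value), int(worst), int(thresh), int(raw))
    return attrs

def failing(attrs):
    """Attributes whose normalized value is at or below the threshold,
    i.e. the drive's firmware considers them a failure predictor."""
    return [name for name, (value, _w, thresh, _r) in attrs.items()
            if value <= thresh]

print(failing(parse_attributes(SAMPLE)))  # ['Spin_Retry_Count']
```

On a real box you would feed this the live output of a SMART query tool run against each drive, then page someone (or open a maintenance ticket) whenever the list comes back non-empty.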

SMART drive data has been described as a jigsaw puzzle. That's because it takes monitoring a myriad of data points consistently over time to put together a picture of your hard disk's health. The idea is that an administrator regularly records and analyzes characteristics of the installed spinning media and looks for early warning signs that something is going wrong. While different drives expose different data points, some of the key and most common attributes are:

head flying height

data throughput performance

spin-up time

reallocated sector count

seek error rate

seek time performance

spin retry count

drive calibration retry count

These items are considered typical drive health indicators and should be baselined at drive installation and then monitored for significant degradation. While the experts still disagree on the exact value of SMART data analysis, I have seen sources claiming that at least 30% of drive failures can be detected some 60 days prior to the actual failure through the monitoring of SMART data.
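The baseline-then-monitor routine above can be sketched in a few lines. This toy version compares a drive's current normalized attribute values against the values recorded at installation and flags anything that has dropped more than a fixed number of points; the attribute values and the 10-point cutoff are arbitrary illustrative choices, not a vendor-recommended policy.

```python
# Hypothetical normalized values recorded when the drive was installed.
baseline = {"Reallocated_Sector_Ct": 100, "Seek_Error_Rate": 100,
            "Spin_Retry_Count": 100, "Spin_Up_Time": 97}

# Hypothetical values read during a later maintenance window.
current = {"Reallocated_Sector_Ct": 84, "Seek_Error_Rate": 99,
           "Spin_Retry_Count": 100, "Spin_Up_Time": 96}

def degraded(baseline, current, max_drop=10):
    """Names of attributes that fell more than max_drop points since
    the baseline was recorded -- candidates for proactive replacement."""
    return sorted(name for name, base in baseline.items()
                  if base - current.get(name, base) > max_drop)

print(degraded(baseline, current))  # ['Reallocated_Sector_Ct']
```

In practice you would tune the drop threshold per attribute (a single reallocated sector matters far more than a slightly slower spin-up) and keep the history, not just two snapshots, so you can see the trend over time.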

Of course, not all drive failures can be predicted, and some failures are caused by factors other than drive degradation. Drives damaged by power surges or dropped in shipping are good examples of failures that cannot normally be detected through SMART monitoring. However, in my humble opinion, even one hard disk failure prevented over the course of my career is something to celebrate--unless you happen to own stock in McNeil Consumer Healthcare, a.k.a. the distributors of Tylenol!

So what does this have to do with SoftLayer? Well, I am certainly not claiming that SoftLayer is going to predict all your hard drive disasters, so there is no reason for you to stop backing up your data. In fact, I recommend not just backing it up but backing it up in geographically disparate locations (did I mention we have data centers in Dallas and Seattle?). What I do mean to share is that technologies like SMART data are just one of the many ways SoftLayer is currently investigating to improve what is already the best hosting company in the business.

I should know. I was tasked with writing the low-level software to extract this data. That’s right. SoftLayer has engineers working at the application layer, down at the device driver layer, and everywhere in between. If that doesn’t give you a warm fuzzy about your hosting company, I don’t know what will.