Recovering from NT Startup Failures, Part 1

What would you do if one of your core production servers crashed the next time you reboot it? More important, how much time would you need to fix the problem? For most Windows NT administrators, the thought of a mission-critical production server experiencing STOP errors (aka the blue screen of death) or any form of server outage makes them break out in a cold sweat.

A hosed NT system is never fun, but an unavailable critical server means lost productivity, lost time, lost money, and, of course, an angry boss. In this first installment of a two-part article, I discuss advanced tools and procedures that you can use to improve the availability of your network servers and to increase your chances of recovering from an NT boot failure. In addition, I delve into lesser-known techniques that you can employ right away to help you recover a downed NT system in the future. In this article, I don't address clustering solutions, and I assume that each system is a standalone, nonclustered NT system without system-level failover.

Common Calamities

Although various circumstances can cause an NT system to crash at startup, the result of these circumstances is usually the dreaded blue screen of death, which Screen 1, page 100, exemplifies. After NT halts the system, it displays this screen to protect the system against data corruption. In addition to being blue as its name implies, a blue screen displays important information about the system's state at the time of the STOP error. The screen lists the STOP code, the location in memory where the problem occurred, and the drivers loaded in memory when the STOP took place. However, pinning down the source of a STOP error isn't always easy. In my experience, a problem usually develops from one of the following scenarios:

You install software that corrupts the HKEY_LOCAL_MACHINE portion of the Registry—particularly, software that installs new services or drivers. This action usually results in a STOP error or blue screen, which indicates that the system Registry or a particular hive file failed.

You change a system's network configuration, which causes NT to rewrite network bindings and their related Registry entries (i.e., NT corrupts or overwrites critical OS files with invalid or incompatible versions while the system is in use).

You install a new service or driver on the system, which causes a system-level incompatibility problem that results in a STOP error when you reboot (i.e., underlying file corruption has occurred on a key system file that you loaded into memory before the corruption).

Each of these situations has a different set of underlying causes and solutions, so let's look at each scenario individually.

Registry Corruption

The system Registry is the heart of an NT installation. Thus, depending on the nature and extent of the damage, a corrupted Registry often results in a STOP error or blue screen of death at startup. Damage to the Registry can be physical or logical. Physical damage means that something (usually disk-related corruption) has scrambled the Registry hive files (e.g., the SOFTWARE or SYSTEM files in the \%winntroot%\system32\config folder). Logical damage means that a third-party application, a user, or NT has written invalid data to the Registry, which can trigger an NT startup failure if the logically damaged Registry entry is critical.

Unfortunately, you can't always tell whether a damaged Registry is the cause of your system's STOP error. The STOP error might identify a telltale sign such as a hard Registry error or a reference to a particular damaged hive file. However, in some cases, the STOP error doesn't indicate Registry damage.

If you suspect a Registry-related problem, the first line of defense is to restore a previous known-good Registry configuration. You can use several methods to accomplish this solution.

The Last Known Good Configuration option. You access this option by pressing the space bar when the system prompts you during the NT boot process, and selecting the option to restore a previous configuration. This method is the quickest and easiest solution, if it works. Unfortunately, this solution's failures outweigh its successes in real-world applications because its scope is only a previously known-good incarnation of one portion of the Registry (i.e., a ControlSet00X Registry subtree of the HKEY_LOCAL_MACHINE\SYSTEM key). You have a better chance of success using the Last Known Good Configuration option if the problem is localized to this portion of the Registry and an event that immediately precedes the invocation of the Last Known Good Configuration option caused the problem. However, this procedure won't cure most of your Registry-corruption ills.

NT Setup's Repair process and an Emergency Repair Disk (ERD). You can use NT Setup's Repair process to inspect and replace individual Registry hive files if the Last Known Good Configuration option fails to resolve the problem. After you insert your ERD, Setup lists the options you can select to specify which portions of the NT installation you want Setup to inspect, as Screen 2 shows. If you select Inspect registry files, Setup displays a list of Registry hive files and lets you select which files you want Setup to replace. Setup takes the replacement files from the ERD or, if you didn't provide an ERD, from the \%systemroot%\repair folder. The ERD and the \%systemroot%\repair folder store replacement files in compressed format, and each hive file has an underscore (_) extension (e.g., SYSTEM._, SOFTWARE._).

Using the most recent replacement files is important so that you don't lose application and service configuration information. (For information about how to update your ERD, see Michael Reilly's "The Emergency Repair Disk," January 1997.) In addition, don't restore the SAM and SECURITY hives on an NT server domain controller, unless you used the rdisk /s (or /s-) option when you ran the ERD utility (i.e., rdisk.exe). Otherwise, Setup overwrites your SAM database with the database version Setup created during the original NT installation and creates a new set of problems. In addition, ensure that you created the replacement files under the same service pack level as the files you're replacing because Service Pack 3 (SP3) and later make security-related changes to the SAM and SECURITY hives. Otherwise, you might not be able to log on after the repair is complete. Restoring the SAM and SECURITY files usually won't resolve your Registry corruption problems anyway because the SYSTEM and SOFTWARE hives usually cause Registry boot problems. Thus, start restoring previous Registry files with the SYSTEM and SOFTWARE files, and replace the SYSTEM hive first because it contains references to important system components, including drivers and services.

An alternate/parallel NT installation. Using an alternate/parallel NT installation to recover the Registry is my favorite solution. Booting an alternate NT installation lets you access NTFS-based volumes on the system that would otherwise be inaccessible, and a parallel installation gives you access to the primary installation's Registry files so that you can repair or replace them. (You can also gain this type of access by using ERD Commander from Systems Internals at http://www.sysinternals.com or NTFSDOS from Winternals Software at http://www.winternals.com.) After you boot to an alternate installation, you can perform the same actions that you can perform using NT Repair, but with more flexibility and options. Although this method isn't the solution Microsoft recommends, I think it's the best Registry repair process for advanced NT users. (For more information about parallel NT installations, see the sidebar "Think Parallel.")

Before you begin, make a backup copy of the Registry files. I usually back up the existing files into a subdirectory of the folder that contains the Registry files (e.g., \%systemroot%\system32\config\backup). After you back up the files, you can experiment with replacing individual Registry hive files. However, you can't simply copy the replacement versions, because the ERD and \%systemroot%\repair folder store these files in compressed format. To use the files, employ the expand.exe command to manually expand them. For example, to expand a compressed copy of the SYSTEM hive from an ERD or the \%systemroot%\repair folder, type the following command at an NT or DOS command prompt:

expand system._ system

Copy the resulting file to the \%systemroot%\system32\config folder of the primary installation, and reboot the system.
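The backup-then-expand sequence described above can be sketched as a short script. This is a minimal Python illustration, not a tool from the article or from Microsoft: the function names `backup_hives` and `expand_command` are hypothetical, and the script assumes it runs from a working installation that can see the primary installation's config folder. Only the `backup` subfolder convention and the `expand system._ system` command line come from the text.

```python
import shutil
from pathlib import Path

# Hive names taken from the article; SYSTEM comes first because the
# article recommends replacing it before SOFTWARE.
HIVES = ["SYSTEM", "SOFTWARE"]


def backup_hives(config_dir, hives=HIVES):
    """Copy existing hive files into a 'backup' subfolder of config_dir.

    Returns the list of backup paths created; missing hives are skipped.
    """
    config_dir = Path(config_dir)
    backup_dir = config_dir / "backup"
    backup_dir.mkdir(exist_ok=True)
    copied = []
    for name in hives:
        src = config_dir / name
        if src.is_file():
            dst = backup_dir / name
            shutil.copy2(src, dst)  # copy2 also preserves timestamps
            copied.append(dst)
    return copied


def expand_command(hive):
    """Build the expand.exe command line for one compressed hive file."""
    name = hive.lower()
    return ["expand", f"{name}._", name]
```

Calling `expand_command("SYSTEM")` yields the same three-part command shown above. The sketch only builds the command; running it, and copying the expanded file into place, is left as a deliberate manual step.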

If you don't want to deal with compressed files, you can use the Microsoft Windows NT Server 4.0 Resource Kit regback.exe utility to maintain extra copies of the Registry. This handy tool makes a backup that contains all the system Registry hive files in uncompressed format. In addition, this tool automatically backs up the SAM and SECURITY hives, so you don't have to worry about using special switches. However, regback.exe's uncompressed Registry copies consume a lot of space and might not fit on a 3.5" disk. The safest place to store regback.exe-created Registry backups is on a partition other than the NT boot partition—preferably a partition on a different physical hard disk. For maximum protection against hardware-related failures that render the Registry hive files inaccessible, store an extra copy of each server's Registry on a different system.
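Keeping an extra copy of each server's Registry backup on another disk or system could be automated along these lines. This is a sketch under stated assumptions: `archive_registry_backup`, the per-server dated folder layout, and the paths passed in are all hypothetical, and the script assumes regback.exe has already written its uncompressed hive copies into a local folder.

```python
import shutil
from datetime import date
from pathlib import Path


def archive_registry_backup(regback_dir, archive_root, server_name, day=None):
    """Copy a folder of uncompressed hive backups to a dated archive folder.

    regback_dir  -- folder that already holds the hive copies (assumed)
    archive_root -- a folder on a different disk or system (assumed)
    Returns the path of the new archive folder.
    """
    day = day or date.today()
    target = Path(archive_root) / server_name / day.isoformat()
    # copytree raises an error if the target already exists, so an
    # earlier backup from the same day is never silently overwritten.
    shutil.copytree(regback_dir, target)
    return target
```

Dated folders make it easy to fall back to an older copy if the most recent backup already contains the damage.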

Overwritten or Corrupted Files

One of NT 4.0's serious downfalls is its use of shared system files, which third-party application vendors can freely overwrite with out-of-date or otherwise incompatible support files. In addition, NT doesn't do much to protect itself against the replacement of other key system files, such as system services' files and drivers. In some cases, these conflicts are merely annoying because they cause unwanted errors or application failures. However, this type of problem can result in the inability to start NT. (Windows 2000, or Win2K, removes some of this risky exposure by privatizing application DLLs and providing greater protection from overwriting critical system files.)

To repair damaged or incompatible files on an NTFS volume, you can use a parallel NT installation or NT Setup's Repair process. To repair FAT volumes, you can use a DOS or Windows 9x boot disk to access the volume.

Replacing files from a parallel installation is easier if you know which files are invalid or damaged. As a disaster-prevention measure, create an installation source on your hard disk or a CD-ROM that contains copies of the latest core NT system files for the service pack on your system. If you're running a parallel NT installation that you patched to the same service-pack level as the primary installation, you can use that installation as your source. However, if your parallel installation isn't the same service-pack level as your primary installation, create a separate directory that contains the latest versions of the primary installation's files.
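One way to narrow down which files might be invalid or damaged is to compare the primary installation's files against the known-good source directory. The following Python sketch is an illustration rather than a procedure from the article: the function names are hypothetical, and the hash comparison is a technique chosen here for the example. A difference does not prove corruption, because hotfixes and application installs legitimately change files, so treat the output as a list of candidates to investigate.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def find_suspect_files(reference_dir, installed_dir):
    """List files that are missing from, or differ in, installed_dir.

    reference_dir -- known-good copies at the same service-pack level
    installed_dir -- the matching folder of the primary installation
    Returns (file name, "missing" or "differs") tuples sorted by name.
    """
    reference_dir, installed_dir = Path(reference_dir), Path(installed_dir)
    suspects = []
    for ref in sorted(reference_dir.iterdir()):
        if not ref.is_file():
            continue  # this sketch compares one folder level only
        target = installed_dir / ref.name
        if not target.is_file():
            suspects.append((ref.name, "missing"))
        elif file_digest(ref) != file_digest(target):
            suspects.append((ref.name, "differs"))
    return suspects
```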

To use NT Setup's Repair process to replace damaged or conflicting files, select the Verify Windows NT system files option when Setup presents you with the list of repair options. Microsoft intended this feature to let you quickly identify files that are different from the original NT installation files. However, on an NT installation with a service pack applied, Setup lists most files as unoriginal because the service pack has modified them. Thus, your best bet is to instruct Setup to replace all nonoriginal files by selecting the A option, and then to reapply the latest service pack after NT is back up and running.

Alternatively, you can replace NT system files with original versions using NT Setup's upgrade option to reinstall NT. Although some users circumvent the previous NT Setup Repair process and jump into an upgrade installation, I don't recommend this solution for several reasons. First, the upgrade process usually takes much longer than the repair process. Second, the upgrade process is more involved and poses greater risks to your system. Finally, even if an upgrade installation successfully resolves your original problem, it will probably cause a tcpip.sys blue screen error (i.e., STOP error 0x00000050). When you install NT 4.0 or NT 4.0 SP1 over NT 4.0 SP2 or later, the installation doesn't replace the SP2 or later version of tcpip.sys. Thus, the newer driver fails when it loads under the base version of NT or NT SP1. To avoid this mess, first use the NT Setup Repair process' Verify Windows NT system files option to replace the existing files with the original versions. If NT Setup's Repair process doesn't resolve the boot problem, you can run the NT Setup upgrade option without fear of the tcpip.sys blue screen, because NT Setup's Repair process has replaced the SP2 or later version of tcpip.sys with the original version.

An Ounce of Prevention

The difference between a quick fix and a major nightmare is often one preparatory step. Tools, such as parallel NT installations and additional backup copies of the Registry, improve your chances of resolving NT startup failures. Therefore, be sure that your servers are always prepared for the worst.

Next month, I'll discuss the third most common cause of NT startup blue screens: an autostarting service or driver that causes a STOP blue screen when it initializes. I'll teach you about some additional recovery tricks, including a method for remotely repairing the Registry of a failed installation from within a parallel NT installation. In addition, I'll show you third-party tools that can bail you out of trouble when a system won't boot.

Sean Daily (not verified)

on Mar 6, 2001

In the sidebar "Think Parallel" in "Recovering from NT Startup Failures, Part 1" (September 1999), I discuss a procedure that you can use to solve your problem. You can find the article online at http://www.win2000mag.com. To view the sidebar, enter 7075 in the InstantDoc ID text box; enter 7076 to view the whole article. You can also find tips about building multi-OS systems in "Mastering Multibooting Madness" (July 1999) and "Multibooting Windows 2000 Systems" (Summer 2000).
--Sean Daily

Thank you very much for the advice in Sean Daily's "Recovering from NT Startup Failures, Part 1" (September 1999). The author mentions that creating a parallel Windows NT installation provides you with a back door to your system when your primary installation is down. I have several questions about parallel installations. Why can't I just use the boot disk that I created instead of a parallel NT installation? From the NT Setup's Repair options, I have the opportunity to inspect the Registry files. If I choose the \%systemroot%\repair folder instead of the Emergency Repair Disk (ERD), which one has the most recent files?
--Thomas Leung

I just want to let you know how much I appreciate Sean Daily's "Recovering from NT Startup Failures, Part 1" (September 1999). As a beginner who is working toward completing my six exams for an MCSE, I found the article very practical. I'm looking forward to part 2.

Setting up a parallel NT installation is no different from setting up an initial installation. Simply run NT Setup, but choose to install a new installation rather than to upgrade the existing one (a very important step and the only major catch in the process). When you're finished, the NT Boot Loader menu will show two (four if you count the VGA-mode entries) choices for NT. To make the menu options clearer, you can edit the boot.ini file and rename the parallel installation to something like Windows NT Recovery Installation.
--Sean Daily
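The boot.ini renaming step in the reply above can be sketched as follows. This is a hedged Python illustration, not part of the original reply: the function name `relabel_boot_entries` and the `\WINNTREC` folder used in the usage note are hypothetical, and the sketch assumes the usual boot.ini layout in which each line under [operating systems] has the form path="label" followed by optional switches. It works on the file's text only; on a real system, boot.ini is normally marked read-only, hidden, and system, so you would clear those attributes before saving changes and restore them afterward.

```python
import re

# One [operating systems] entry: ARC path, quoted label, optional switches.
ENTRY = re.compile(r'^(?P<path>[^=]+)="(?P<label>[^"]*)"(?P<switches>.*)$')


def relabel_boot_entries(boot_ini_text, arc_path, new_label):
    """Return boot.ini text with the menu label changed for one installation.

    Every entry whose path matches arc_path (case-insensitive) is renamed;
    a "[VGA mode]" suffix and any switches are kept.
    """
    out = []
    in_os_section = False
    for line in boot_ini_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Only entries under [operating systems] carry menu labels.
            in_os_section = stripped.lower() == "[operating systems]"
        elif in_os_section:
            m = ENTRY.match(stripped)
            if m and m.group("path").strip().lower() == arc_path.lower():
                label = new_label
                if "[VGA mode]" in m.group("label"):
                    label += " [VGA mode]"
                line = f'{m.group("path")}="{label}"{m.group("switches")}'
        out.append(line)
    return "\n".join(out)
```

For example, passing the ARC path of a parallel installation in a hypothetical `\WINNTREC` folder and the label Windows NT Recovery Installation renames both of that installation's menu entries and leaves the primary installation's entries untouched.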

You can recover from this situation in several ways, but the easiest way is to have your customer install a parallel installation of Windows NT. While your customer is booted under that installation, have her set whatever permissions are necessary to get the original installation back up and running. After she can boot back into the original installation, she can use Fixacls from the Microsoft Windows NT Server 4.0 Resource Kit to restore the original permissions on the \%systemroot% folder and its subdirectories.

I don't know how you do it, but just when I start to worry about something, I pick up an NTMagazine and there you are with the answer. It's amazing how you get right into the worry part of my brain and know just when and what I am worrying about.
Funny thing is, this time I was catching up on older issues I hadn't had a chance to read, and bingo! The first one I picked up had this article, answering the very question I had been asking but hadn't had time to turn into a written plan: What should be my plan of action in case of a server failure?
Thanks, NT Mag. You saved me hours, as you generally do.
Suzanne Foubert
Systems Administrator
Baylor College of Medicine
Houston, Texas

I'm interested in setting up a machine that will boot to DOS, Windows NT running multiple protocols, and NT running only TCP/IP. I've never set up a machine to boot to two different versions of NT. What's the best way to accomplish this task?

After reading "Recovering from NT Startup Failures, Part 1," I thought you might help me with a specific problem. In a stressful moment, one of our customers changed the permissions on the system share C$, setting System to No Access. Now she has an infinite boot on the server--it boots and boots and boots! My customer tried an Emergency Repair Disk (ERD) without any luck. The entire Microsoft BackOffice product line is installed on the server, and accessing SQL Server 7.0 and the databases is a major concern. Is there any way to help my customer? In the article, the author refers to ERD Commander and NTFSDOS--is it too late to use these tools?

Part 2 appears in the November 1999 issue (page 83), and I hope you find the article equally helpful. Part 2 delves into several disaster preparation and recovery topics that I didn't have space for in part 1.

I read Sean Daily's "Recovering from NT Startup Failures, Part 1" (September 1999), which includes the sidebar "Think Parallel." I've installed Windows NT 4.0 once from the three setup disks and another time from a bootable CD-ROM. I want to install a parallel NT installation on my workstation on the same partition as my original installation of NT 4.0, but I don't know where to start. Can you help?

Sean Daily's "Recovering from NT Startup Failures, Part 1" (September 1999) is very informative, but I'd like to know more about troubleshooting Windows NT's memory dump file. If I forward dump-file output to Microsoft, I usually get an answer, but it doesn't help me update my knowledge. Can you provide some guidelines for tracing problems in a memory dump file?

Unfortunately, I can't claim to be an expert at interpreting memory dump files. In the 7 years that I've been working with NT, I've never encountered a situation in which examining a memory dump file proved useful. However, I've successfully recovered from dozens of STOP errors on various NT systems.
The average network administrator will find blue screen information more helpful than a memory dump. The blue screen information contains valuable information about the drivers and services present in memory and the specific STOP error that halted the system. Your best bet is to concentrate on what changed just before the blue screen and which error message you received. Researching (e.g., in the Microsoft Knowledge Base, Deja.com, newsgroups) other occurrences of the particular STOP error is often the most efficient way to resolve problems.
Microsoft provides the memory dump file primarily for software developers, rather than users or network administrators. The dump file provides to developers a real-world stack dump from a customer site that might yield clues about whether a service or driver participated in a particular problem. However, sending these files to the software developer is often impractical because the files are so large. I've found that most developers aren't interested in receiving or analyzing these files, and even Microsoft has recently changed its policies about sending in dump files. Luckily, Windows 2000 (Win2K) makes the memory dump file more useful: You can pare down the file to a minimal set of information that can be more helpful to Microsoft and third-party developers. Perhaps with this change, memory dump files will become more useful.
--Sean Daily

In "Recovering from NT Startup Failures, Part 1" (September 1999), Sean Daily describes how to use a parallel installation to repair a crashed Windows NT system. I installed a parallel installation on my second hard disk, and I regularly run a batch file that calls Rdisk and copies the repair disk information to the second hard disk. The batch file also calls Regback to make an uncompressed copy of the hives on the second disk. I also created a readme.txt file that contains instructions that I prepared for the specific machine based on information from the article.
Last night, the dreaded blue screen of death appeared on my otherwise most reliable machine. The error message I got was Unhandled kernel exception. I looked up my repair procedure, which indicated that after booting the parallel installation and backing up C:\winnt\system32\config, I needed to copy the system hive from D:\regback to C:\winnt\system32\config and reboot.
Voilà! All is well. If that process hadn't worked, I would have pressed on, replacing other hives in the order the article described. Backing up the system hives regularly is the key.
I'm an end user--not an IT professional--so I carry the entire responsibility for my computer system on my shoulders. I have three important things to say: Thank you! Thank you! Thank you!
--Al Stanbury

Although you can use an NT startup disk in some cases (e.g., a file required to start NT, such as NTLDR, is corrupted or missing; the boot sector is damaged), this strategy wouldn't help in many other situations. For example, if files in the \winnt folder are damaged or the problem involves the Registry, using an NT boot disk won't help. In these cases, the parallel installation will let you access the NTFS boot volume and make the necessary repairs to files or the Registry.
In regard to your second question, the files in the \%systemroot%\repair folder and those on the ERD will usually be the same. Of course, that's assuming that you opted to update the ERD the last time you ran Rdisk (choosing to update is an option, not a requirement). NT first makes a compressed copy of the Registry into the repair folder on the hard disk, then copies those files to a 3.5" disk during the ERD creation process. However, if you have an out-of-date ERD or you didn't update the ERD during the last execution of Rdisk, the hard disk-based copy would be the most recent version.
--Sean Daily