Windows 2000: Troubleshooting Shock Troops

Nobody knows troubleshooting like Compaq.
The company's Global Services operation has 15,000 consultants
building and maintaining Microsoft-based enterprise solutions
globally. More than 3,200 of them are Windows 2000-certified.
They're so good at what they do, they supported Microsoft's
beta customers for the OS during those companies' deployments.

These guys have seen it all in the course
of their work, from bone-headed migration moves (read
on for details!) to brilliant and elusive technical mysteries.
They share what they know with each other, around
the world. When a problem arises, chances are, somebody
else in the organization has experienced the same dilemma and
has derived a solution.

And that's why MCP Magazine asked
a core group of them to share their best troubleshooting
secrets. What they proposed was massive, almost too
comprehensive for a single magazine article. So we let
them pick out and identify a number of problems, and
their solutions, to share with you. These final choices
are dilemmas experienced by a large number of people;
they're serious enough to warn you about beforehand; or
they help resolve a variety of related issues, such as
replication problems. We've divided the troubleshooting
evils into four categories: setup and installation, Active
Directory (AD), networking and clustering. Read and learn.

Setup and Installation

Problem: I've
implemented about 50 Remote Installation Services (RIS)
servers throughout my organization, but we only have one
image. Several of these servers are experiencing problems
with insufficient disk space. There's a Single Instance
Store (SIS) Common Store directory that has copies of
all the files in the image, which seems to be used for
multiple images so they can share these files. If I have
only one image, can I delete the SIS Common Store and
recover the disk space?

Solution: No.
Deletion of the SIS Common Store directory will prevent
RIS image files and any other application with files that
have been converted to reparse points from accessing the
backing file containing the data. In short, it'll break
RIS and, possibly, other applications installed on that
partition.

The function of the SIS Common Store, included
when RIS is installed, is to conserve disk space by eliminating
duplicate files on an NTFS volume. The two SIS components
that RIS installs are SIS filter driver and SIS Groveler.
SIS Groveler scans for files that are identical to one
or more files on the NTFS volume using signatures and
byte-by-byte comparison. It then reports the file to the
SIS filter driver that creates the SIS link (NTFS reparse
points), copies the file to the SIS Common Store Folder,
and renames it with an arbitrary 128-bit globally unique
identifier (GUID) with a .SIS extension. The original
files are changed to reparse points with a "size
on disk" equal to the default cluster size of the
disk in most cases. Only files larger than 32KB are processed
by SIS Groveler. Therefore, we can now have many instances
of a file represented by reparse points link to the actual
data for that file stored in the SIS Common Store Folder.
The file in the SIS Common Store Folder is also called
the "backing file" and contains the data. Figure
1 is an example of how ntoskrnl.exe is copied to SIS Common
Store and renamed. The lower box shows the location and
contents of the SIS Common Store folder, located at the
root level of the drive. It contains the actual files
where the reparse points are directed.

Figure 1. How ntoskrnl.exe is
copied to SIS Common Store and renamed. The lower
box shows the location and contents of the SIS Common
Store folder containing the files where the reparse
points are directed.
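To make the Groveler's two-stage check concrete, here's a minimal Python sketch of the same idea: use a signature to find candidate duplicates, then confirm them byte by byte. This is an illustration only, not how SIS Groveler is actually implemented. The function names, the choice of SHA-256 as the signature, and the folder walk are all assumptions. The sketch only reports duplicates; it never creates reparse points or touches the SIS Common Store.

```python
import hashlib
import os
from collections import defaultdict

# The article states that only files larger than 32KB are processed.
MIN_SIZE = 32 * 1024


def file_signature(path, chunk_size=65536):
    """Return a SHA-256 digest of the file (an assumed stand-in for the real signature)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b, chunk_size=65536):
    """Confirm two files are identical with a byte-by-byte comparison."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:
                return True


def find_duplicate_groups(root):
    """Group files under root that are larger than MIN_SIZE and byte-for-byte identical."""
    by_signature = defaultdict(list)
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            size = os.path.getsize(path)
            if size > MIN_SIZE:
                by_signature[(size, file_signature(path))].append(path)

    groups = []
    for candidates in by_signature.values():
        first = candidates[0]
        # A matching signature is only a hint; the byte comparison is the proof.
        group = [first] + [p for p in candidates[1:] if same_bytes(first, p)]
        if len(group) > 1:
            groups.append(sorted(group))
    return groups
```

Running find_duplicate_groups against a copy of a RIS image folder gives a rough feel for how much space a single backing file per group could save.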

One caveat: The backup/restore software must
be SIS link-aware. Ntbackup is SIS link-aware and will
call SISbkup.dll to back up and restore properly. Third-party
backup solutions have to know how to call SISbkup.dll
to work properly.

Problem: I just
added a new video card and now my system won't boot. How
can I recover without reinstalling Win2K?

Solution: In
Windows NT 4.0, there were several answers to this:

Boot to Last Known Good (which sometimes
works).

Use the Emergency Repair Disk (which
no one ever has available or updated).

Create a "parallel install."
Create a new installation of NT on another partition
on the disk, boot to that OS and go to the broken configuration
and remove the driver.

Fortunately, Win2K gives us some tools to
repair this problem without a parallel install.

If Last Known Good doesn't work and there's
no system state backup, this can be corrected with either
Safe Mode Boot or the Recovery Console.

Safe Mode Boot is much like Safe Mode Boot
in Windows 95 or Windows 98. You can start in Safe Mode
by pressing F8 at the boot loader screen, then selecting
Safe Mode. This will enable you to boot the system with
a minimum set of drivers and services, which allows you
to perform tasks such as disabling a driver or service,
including the one causing the problem. Options for Safe
Mode are basic Safe Mode, which starts the system with
basic drivers; Safe Mode with Networking, which is similar
to Safe Mode but includes networking services for connectivity;
and Safe Mode with Command Prompt, which starts a command
prompt instead of the GUI.

The Recovery Console is a new feature that gives
you a command-line environment for repairing a system that won't
start. You have three options for invoking the Recovery
Console: booting from the Win2K CD; booting from the startup
floppies; or selecting the Recovery Console from the boot
loader screen (assuming it's been installed). Here are
the console options:

CopyCopies files to another
location or name.

DelDeletes files.

DisableDisables services
or drivers.

FixbootWrites a new boot
sector.

FixmbrRepairs Master Boot
Record, much like FDISK /MBR in DOS.

The Recovery Console also can be customized.
For example, you can install it as part of a large deployment
by using winnt32.exe /cmdcons /unattend.

Active Directory

Problem: I've
heard that Win2K has a limit of about 250 sites. Our deployment
will require more than 1,000 sites. I've read somewhere
that if you have that many sites, you should turn off
the Knowledge Consistency Checker (KCC), but that seems
like a drastic step. What should I do?

Solution: This
is a much advertised and much misunderstood issue. During
the Win2K beta, Compaq was one of the first to see the
problem. The more sites, DCs, and the like you have, the
longer it takes the KCC, which by default runs every
15 minutes, to do its job. When it fires up, it takes
about 90 percent of the CPU of one processor on every
DC (staggered). So the "limit" is whatever you
can live with, remembering that you give up 90 percent
CPU utilization on the DCs.

We believe that with proper design and implementation,
there's no need to turn off the KCC. Doing so would force
you to do all the KCC's work manually, including creating
transitive links, routing around trouble spots, creating
and cleaning up connections, forming the topology using
the spanning tree algorithm, adjusting for failed bridgehead
servers, and so on. I don't believe this is practical.

Replication Repair Tip

When it comes to replication
repair, we've found that it's important
to be patient. After making changes, you
can try forcing replication (Replication
Monitor has the ability to push the changes
out to the enterprise), but it's
surprising how many issues get resolved
by just waiting and letting replication
move the changes out naturally.

There are several options you can choose
from to get around this limitation. An excellent reference
is KB Q244368, "How to Optimize Active Directory
Replication in a Large Network," that provides equations
to predict the KCC time based on number of sites and domains,
as well as good descriptions of workarounds to this problem.

One is to turn off Auto Site Link Bridging.
Using the equations in Q244368, if you have 1,000 sites
and five domains, the KCC time is about 45 minutes. That
means the KCC takes 45 minutes to do its job (eating
90 percent of the DC's CPU), goes to sleep for 15
minutes, then fires up for another 45 minutes. So out of every
hour, your DC gets 15 minutes of CPU to do other things.
Not good. However, if you turn Site Link Bridging off,
this drops the KCC time to about three minutes, eliminating
the problem. This eliminates transitive site links, but
in a pure hub and spoke configuration, this isn't usually
a problem. You can build some "backup" links
if you want some redundancy and don't want the KCC to
do it.
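The arithmetic behind that comparison is easy to check. The following Python sketch is only a back-of-the-envelope model, assuming the run-then-sleep cycle described above (the KCC runs, sleeps 15 minutes, then runs again) and the 90 percent CPU figure quoted earlier. It doesn't reproduce the Q244368 equations; the 45-minute and three-minute run times are simply taken from the text, and the function names are made up for illustration.

```python
def kcc_busy_fraction(run_minutes, sleep_minutes=15):
    """Fraction of wall-clock time the KCC is running in a run/sleep cycle."""
    return run_minutes / (run_minutes + sleep_minutes)


def kcc_cpu_share(run_minutes, sleep_minutes=15, cpu_while_running=0.90):
    """Average share of one processor consumed, assuming 90 percent CPU while running."""
    return cpu_while_running * kcc_busy_fraction(run_minutes, sleep_minutes)


if __name__ == "__main__":
    # Site link bridging on: 45-minute KCC run (figure from the article).
    print(kcc_busy_fraction(45))  # 0.75, i.e. 45 of every 60 minutes
    # Site link bridging off: three-minute KCC run (figure from the article).
    print(round(kcc_busy_fraction(3), 3))  # 0.167
```

Under this model, turning bridging off moves the KCC from three-quarters of the DC's time to about one-sixth of it.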

Another method is to use Super-Sites, which
Compaq employs. Rather than having every location defined
as a site, collect several locations into a single site.
Because this forces replication in those sites to intra-site
parameters (no data compression, urgent replication, and
so on), Compaq requires at least a 2MB link between these
locations. Even though Compaq has a number of physical locations
in Canada and Japan, because of the high-speed links between
locations, we only needed to define two Active Directory
sites in Canada and two in Japan. Using Super-Sites
reduced 700 locations to about 80 sites.

In addition to the design resolutions just
noted, there are some technical ways to solve this problem.
Schedule the KCC to run at certain times on each DC, thus
controlling when the CPU is hit. Load balancing is also
an issue when you have more than 100 satellite sites replicating
to a single hub site and one Bridgehead Server (BHS).
With manual intervention you can configure multiple BHS
to share the load. Because both of these issues are more
critical in a branch office environment where locations
are connected with VPN links, Microsoft recently published
an excellent white paper, "Active Directory Branch
Office Planning Guide." It includes a set of scripts
and procedures aimed at scheduling the KCC and building
connections for load balancing. Tools of this nature are
critical if you plan on turning off the KCC. You can download
the white paper at www.microsoft.com/WINDOWS2000/
techinfo/planning/activedirectory/branchoffice/default.asp.

By the way, Microsoft has promised that Windows 2002
will improve the performance of the KCC significantly,
so this problem should go away. [See "Sonic
Boom! Windows 2002 Smashes the Barriers" in the
July 2001 issue of MCP Magazine for more on this.
-Ed.]

Problem: I get
Event 1000 and 1001 errors in Application Event Log in
five-minute intervals; Group Policy is not taking effect;
or \%windir%\sysvol\staging and ...\staging areas folders
have large quantities of files.

Solution: This
is usually indicative of a File Replication Service (FRS)
issue. Note that Event 1000 is associated with a wide
variety of descriptions. In this case it's a Userenv event
with the error message "The Group Policy client-side
extension Security was passed flags (17) and returned
a failure status code of (3)." It's also accompanied
by Scecli event 1001 with the message "Security policy
cannot be propagated. Cannot access the template. Error
code = 3."

FRS Replication is probably not working.
FRS is one of the biggest problem areas in the original
release of Win2K, but has been improved in Service Pack
2. It's responsible for, among other things, replicating
Group Policy templates (and changes) to all DCs. When
changes are made to a GPO and saved, the changed file
is copied to the %systemroot%\sysvol\staging\domain and
%systemroot%\sysvol\staging areas\compaq.com directories
(note that this isn't the sysvol share). The screens in
Figure 2 show the result of making changes to a GPO. The
file name is NTFRS_CMP_ and is put in both directories.

Figure 2. The two default directories
to which changes in GPOs are replicated.

The DC then notifies its partners, which
pull it and notify their partners, and so on. These files
shouldn't stay in the staging folders longer than about
10 minutes. This happens for every change and for DFS
changes as well.
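One way to spot files that have overstayed that window is a small script run on the DC. The Python sketch below is an assumption-laden illustration, not a Microsoft tool: the function name is hypothetical, the 10-minute threshold is the rule of thumb from the text, and the folder path you pass in must be your own staging directory.

```python
import os
import time


def stale_staging_files(folder, max_age_minutes=10, now=None):
    """Return names of files in folder last modified more than max_age_minutes ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_minutes * 60
    stale = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        # Only plain files count; subfolders are skipped.
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            stale.append(name)
    return stale
```

A steadily growing list from this check suggests that partners aren't pulling changes, which matches the symptoms described in this problem.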

To resolve this problem, back up the group
policy files from %systemroot%\sysvol\sysvol\compaq.com\policies.
A simple copy to another directory or a network share
is fine. You'll be glad you did! Figure 3 shows the Sysvol
directory structure. Note that the policies are listed
by GUID and exist in the \winnt\sysvol and \winnt\sysvol\sysvol
directories. The GPOs in \winnt\sysvol\sysvol\policies
are the ones that get edited via the policy editor and
are replicated. The gpotool.exe output, gpotool.log, provides
a nice mapping of policy name to GUID as shown in Listing
1. Note the policy GUID at the top of the section and
the "Friendly name" below it.

Listing 1. This log, created
by gpotool.exe, maps the policy name to the GUID.

In diagnosing FRS problems, it's critical
to install Service Pack 2. If you can't install SP2, install
SP1 and hotfix Q272567. If you can't install SP1, just
install the hotfix. The hotfix can be installed pre- or
post-SP1 and is incorporated in SP2. You must minimally
install the hotfix or you may never get to the bottom
of your FRS problems.

Other matters to consider:

Resolve any AD replication problems. FRS
depends on AD replication, so if AD is broken, FRS won't
work either.

Stopping and restarting the File Replication
Service on each DC may fix the problem (watch the staging
areasthere will be a visible reduction in size).

If these tasks don't fix the problem, follow
this procedure, which uses information from KB Q257338,
"Troubleshooting Missing SYSVOL and NETLOGON Shares
on Windows 2000 Domain Controllers," and our experience:

Stop FRS service on all DCs.

Navigate to the Registry key HKLM\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters\Backup/Restore\Process at
Startup and set the BurFlags value to D4 on a source
DC. This is usually the PDC emulator.

Set the BurFlags value to D2 on all "satellite"
DCs in the domain as shown in Figure 4.

Start the FRS service on the source DC and
one other DC and wait for FRS to synchronize. Repeat
for every DC in the domain. You should see the size
of the staging directories change, and maybe even increase
as the files are moved. As long as they're changing
size, FRS is working. Be patient and let FRS work it
out.

If absolutely necessary, identify the
source DC (the one with the most files in the staging
directory) and delete the files from the staging areas
on the satellite DCs. Then repeat this procedure, turning
on FRS on each DC, one at a time, until it's synchronized.

Figure 4. Setting this Registry
value to D2 can help you get FRS working again.
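Watching the staging directories by hand gets tedious across many DCs. As a rough aid, the Python sketch below totals a folder's size and tells you whether a series of samples is still changing, which is the "FRS is working" signal described in the procedure above. The function names are hypothetical, and how often you sample is up to you.

```python
import os


def folder_size(folder):
    """Total size in bytes of all files under folder, including subfolders."""
    total = 0
    for root, _dirs, names in os.walk(folder):
        for name in names:
            total += os.path.getsize(os.path.join(root, name))
    return total


def is_changing(samples):
    """True if any two consecutive size samples differ."""
    return any(a != b for a, b in zip(samples, samples[1:]))
```

For example, collect folder_size for a staging directory every few minutes into a list and pass it to is_changing. If the samples stay flat while files remain in the folder, that DC deserves a closer look.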

Problem:
I get Event 13557 in the FRS Log: "Duplicate Connection
Objects."

Solution: This
event, like many in Win2K, has a standard troubleshooting
procedure. However, this is a quick fix and may not solve
the real problem. While I'm a big fan of the abilities
of the KCC, it doesn't do a great job of cleaning up old
connection objects. The easy answer is to go to the Sites
and Services snap-in, find the server logging these errors,
and open the NTDS Settings object. There should only be
one inbound connection object from any single DC.
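If you've exported the inbound connection list for a server (for example, by copying the "From Server" column out of the snap-in), the one-connection-per-DC rule is simple to check in code. This Python sketch assumes you already have that list of source DC names; it doesn't query Active Directory itself, and the function name is made up for illustration.

```python
from collections import Counter


def duplicate_sources(inbound_connections):
    """Return {source_dc: count} for every DC with more than one inbound connection."""
    # DC names are compared case-insensitively.
    counts = Counter(source.lower() for source in inbound_connections)
    return {source: n for source, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Hypothetical inbound list for one server's NTDS Settings object.
    print(duplicate_sources(["Qtest-DC2", "Qtest-DC2", "Qtest-DC3"]))
```

An empty result means each source DC appears only once.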

Duplicate connection objects will break FRS
and AD replication if left unresolved. It's possible that
eventually the KCC will clean them up; if not, you'll
need to do it manually. KB article Q251250, "NTFRS
Event ID 13557 Is Recorded When Duplicate NTDS Connection
Objects Exist," is a good reference, but my experience
has taught me to create a prioritized list of methods
to correct this problem, starting at the top and moving
down.

Remove the duplicates. Simply delete the
duplicate objects in the Sites and Services snap-in. If
they don't come back, you're done. Figure 5 shows duplicate
connection objects on Qtest-MDC1 from Qtest-DC2. In this
case, you could simply delete one of them to fix the problem.

Figure 5. To remove a duplicate
object, simply delete it from the Sites and Services
snap-in.

Figure 6. After deleting the
duplicate, make sure you have the KCC recheck the
connections.

If you see duplicate connections from several
DCs and don't know which ones to delete, you can delete
all of the connection objects, then right-click on the
NTDS settings object and go to All Tasks | Check Replication
Topology. In Figure 6 we deleted the duplicate connections
from Qtest-DC2 and are ready to "Check the Replication
Topology." This will fire up the KCC and make it
re-evaluate the connections for that DC. It will create
the connection objects needed.

If the duplicate connections get re-created,
you need to find out why. The "why" is most
likely a DNS misconfiguration or failure. In one case
in Compaq's Qtest forest, we noticed a DC in Europe with
2,100 connection objects, inbound from a single DC. We
deleted them, but within a few minutes there were 24 more.
We found that a DNS server had its IP address changed,
breaking the delegation. We corrected the delegation,
deleted all the connections, forced the KCC to check the
topology, and the duplicate connections ceased.

Problem: When
attempting to log on to a Win2K member server or Win2K
Pro workstation using a domain account, the following
error message appears: "Error: Trust Relationship
between this workstation and the Domain Controller Failed."

Solution: This
error is usually caused by the secure channel password
for the member server or workstation getting out of sync
with the DC, but it could be caused by a time-zone shift
between the client and the DC. A typical scenario for
this problem would be removing a computer from a Win2K
domain, A, and joining it to another domain, B, then later
moving it back to the original domain, A. Initially, there's
a machine account for this client on the A domain. When
it's moved to the B domain, it creates a new account on
the B domain and synchs the password with the client.
When it's moved back into the A domain, the machine account
is still there (it doesn't create a new one), but
now the passwords don't match, resulting in the error.
I've also seen it caused by moving a computer between
time zones and not changing the client's time zone information.

To resolve this problem, delete the client's
computer account from domain A and let replication in
the site occur, which should take a maximum of five to
10 minutes. Then configure the client to join a workgroup
and reboot it. This cleans up all the local machine account
information. After the reboot, configure the machine into
the domain and reboot again. This will create a new account
and synch the passwords with the client. The reboot, which
is required anyway, will purge the Kerberos tickets so
new ones will be created with the new access information.

If the problem still exists, it could be
a timing issue. Go to the client, open a command prompt
window, and enter this command:

net time \\domaincontroller /set

where "domaincontroller" is a valid
DC name that can be used to synchronize time on the client.
Remember that Kerberos requires that the time difference
between the two systems be less than five minutes.
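The five-minute rule can be expressed as a one-line comparison. The Python sketch below is just an illustration of that check, assuming you've obtained both clock readings in the same reference (UTC is the safe choice, since a wrong time zone setting is one of the causes mentioned above). The function name and the sample times are hypothetical.

```python
from datetime import datetime, timedelta

# Skew limit as stated in the article: less than five minutes.
MAX_SKEW = timedelta(minutes=5)


def within_kerberos_skew(client_time, dc_time, max_skew=MAX_SKEW):
    """True if the two clock readings differ by less than max_skew."""
    return abs(client_time - dc_time) < max_skew


if __name__ == "__main__":
    dc = datetime(2001, 7, 15, 12, 0, 0)
    client = datetime(2001, 7, 15, 12, 7, 30)
    print(within_kerberos_skew(client, dc))  # False: 7.5 minutes apart
```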

Be Resourceful

Microsoft doesn't want you flying blind
when troubleshooting. It offers many
useful diagnostic and troubleshooting
helpmates. Learn them, then use them.
They include Support Tools and Resource
Kit utilities. Remember to get verbose
output; when troubleshooting, more
knowledge is better.

Support Tools are found on the Win2K
Server and Advanced Server CDs in \Support\Tools.
Just run setup to install them. These
tools are lifeblood, so much so that
they should be installed on every domain
controller (DC).

For general AD diagnostics, netdiag.exe
and dcdiag.exe are two of the best.
They'll generate netdiag.log and dcdiag.log
files, which give great information
concerning trusts, DNS, NetBIOS names,
TCP/IP details and more.

Nltest.exe is a quick way to return
network information such as a computer's
site, site coverage and a list of DCs
in the domain. You can also use it to
query the domain trusts.

When it comes to replication issues,
Replication Monitor and repadmin.exe
are invaluable tools.

Problem: I just
upgraded my NT 4.0 domain to all Win2K DCs and everything
is broken. How can I recover my NT 4.0 domain? (By the
way, I didn't remove a BDC before the upgrade as Microsoft
recommends, and I have no backup!)

Solution: This
scenario describes a call I got from a customer. It's
absolutely the coolest thing I've done in Win2K troubleshooting.
He had a single NT domain with a PDC and two BDCs. He
upgraded the BDC first (don't ask me how), then the PDC.
In the meantime, the other BDC had a disk crash. The Win2K
domain was brokenno user authentication, no replication,
no services. He wanted to recover the NT domain, but had
no NT 4.0 machines left and no backup. Fortunately, he'd
left it in mixed mode, so he still had a copy of the SAM
database. In mixed-mode, you should still be able to add
an NT 4.0 BDC and get the NT domain back. Since he was
"dead" anyway, we had nothing to lose, so we
used the following process and it worked! I've never seen
this in any Microsoft document or training course. Here's
the process:

Pick the healthiest DC to be used as a
source.

Transfer all the FSMO roles to this machine
if it isn't the FSMO already.

Turn the other DC off.

Pre-create a computer account for a new
NT 4.0 BDC in the AD. This can be done by using Win2K's
Server Manager (srvmgr.exe) or with the netdom command.
Warning: Don't use NT 4.0's version of srvmgr.exe; it
won't work. Win2K's version is built in. To use
netdom on a Win2K DC, type:

netdom add bdcname /domain:domain name /dc

where bdcname is the name of the new BDC
and domain name is the name of the Win2K domain (such
as Compaq.com).

Install a computer (we picked the other
Win2K machine we just turned off) as the Windows NT
4.0 BDC and join the Win2K domain (using the NetBIOS
name, of course). Once this BDC joins the domain, it
will sync with the PDC and get the SAM. Now you have
the NT 4.0 domain intact on this BDC. Shut down the
Win2K DC, leaving only the NT 4.0 BDC.

Promote the NT 4.0 BDC to PDC.

Reinstall the Win2K DC as an NT 4.0 BDC
in the recovered NT 4.0 domain so you're back on solid
ground. Add a second BDC for safety, let it sync with
the others and pull it offline (which should have been
done in the first place).

Now do the migration right. Upgrade the
NT 4.0 PDC and create the Win2K domain.

Upgrade the BDC to Win2K as a replica
DC in the domain.

It took the customer the better part of a
day to do that, but it worked. He recovered all his accounts
and completed the Win2K upgrade. Note: If the
original Win2K domain (the broken one) had been changed
to Native mode, none of this would have worked.

Making Active Directory Happy

The two biggest issues with making
sure AD is working properly are DNS
and replication. If they work, AD's
generally happy. Here are some general
replication tips to make sure replication's
working:

Get comprehensive replication error
listings from all DCs in a domain
from Replication Monitor/Action Menu/Domain/Search
DCs for Replication Errors.

Get a status report from Replication
Monitor. Right click on a server icon
and select Generate Status Report.

Run repadmin.exe /showreps to look
for errors.

In Sites and Services or Replication
Monitor, force replication between
two DCs.

Force the KCC to regenerate the
topology (Sites and Services or Replication
Monitor). Look for failures.

To see if the domain naming context
is being replicated, create a test
user account on a DC, then force replication
to another DC. Look at the Users and
Computers snap-in on that DC and see
if the test user's there.

To see if the Configuration and
Schema naming contexts are being replicated,
create a test site, and force replication,
then see if the other DC gets the
new site.

Networking

Problem: Why
is it when I enter a Route Add command, the route doesn't
show up in the RRAS list of static routes?

Solution: There's
been quite a lot of confusion about the different ways
to define static routes in Win2K Server. It started with
the introduction of RRAS in NT 4.0, but it's still in
the product today. This issue must be understood before
any network troubleshooting takes place.

The problem is that Win2K Server allows for
two separate ways of adding routes. The best way is to
enter the static routes in RRAS. RRAS is a kernel-mode
service with sophisticated routing capabilities. The other
way, and the result of the ROUTE ADD command, is to enter
the routes as a user-mode function. This routing method
stems from NT 3.x days and shouldn't be used if you can
avoid it. (Microsoft kept it around to avoid breaking
existing scripts that customers might have.)

Figure 8. Persistent routes are
automatically established when a system comes online.

Figure 9. The Registry can help
confirm the persistent routes in your network.

As Figure 7 shows, there are two interfaces
in this system. The default gateway points to 216.82.49.33,
and there's an internal card with address 10.0.2.1. The
second route states that all 10.0.2.0 traffic is directly
available to the internal subnet. For our example of the
routing confusion, let's introduce a new internal subnet
of 11.11.11.0. The old way of doing this is to issue the
following command:

route add 11.11.11.0 mask 255.255.255.0 10.0.2.1 -p

The -p option at the end states that this
route is persistent and should always exist when the system
comes online. Figure 8 shows the result of this command.

Notice that the persistent route is clearly
listed in the routing table near the end. Additionally,
it's in the Active Routes list.

Since the route exists in the routing entries
list, the network works as expected. In fact, a peek into
the registry shows the persistent routes list (just like
in NT 4.0). The route's listed as expected, in Figure
9.

We've established that the backward compatibility
still exists and works in Win2K Server routing. Now let's
move forward.

Win2K Server has two new ways to add static
routes that allow the RRAS engine to handle the entries.
The first way is to simply use the RRAS snap-in (see Figure
10). This has the advantage of being fairly obvious, but
if you have more than just a few entries, this process
would be too time-consuming.

Win2K also introduces a powerful command
shell called NETSH. If you have a number of static routes
and you need to create or modify a batch file, use this
command. The equivalent command to the ROUTE ADD command
we were using is:

netsh routing ip add persistentroute 11.11.11.0 255.255.255.0 "Private" nhop=10.0.2.1

Here you're defining a persistent route,
but you must also define the interface that's handling
this route and the next hop address. On this server, the
internal address is named Private (Network Places | Properties
| Interfaces). Because this route is being handled directly
by this server instead of passing it off to another router,
our next hop is the same interface. Figures 11 and 12
show the ROUTE PRINT result from this command.

As you can see, neither of the backward compatibility
areas contains the new route that we've just added. The
ROUTE PRINT command lists it in the routing entries, but
doesn't know that it's a persistent route. RRAS, however,
does (see Figure 13).

As you can imagine, this can cause confusion.
If you manage servers performing routing functions and
you're using static routes, I'd recommend changing from
the user-mode ROUTE ADD command to using RRAS routing.
The server will be able to handle more traffic with better
performance; all your routing information will be in a
unified location; and the router will have more flexibility
in the RRAS environment.
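If you're converting a batch of ROUTE ADD entries, generating the NETSH lines from a table is less error-prone than typing them. The Python sketch below only builds the command text, in the same form as the NETSH example in this article, for pasting into a batch file. It doesn't run anything. The function name is hypothetical, and the second subnet in the usage example is invented for illustration.

```python
def netsh_route_commands(routes):
    """Build one 'netsh routing ip add persistentroute' line per route.

    Each route is a (destination, mask, interface_name, next_hop) tuple.
    """
    lines = []
    for dest, mask, interface, next_hop in routes:
        lines.append(
            f'netsh routing ip add persistentroute {dest} {mask} "{interface}" nhop={next_hop}'
        )
    return lines


if __name__ == "__main__":
    routes = [
        ("11.11.11.0", "255.255.255.0", "Private", "10.0.2.1"),  # from the article
        ("11.11.12.0", "255.255.255.0", "Private", "10.0.2.1"),  # hypothetical extra subnet
    ]
    for line in netsh_route_commands(routes):
        print(line)
```

Review the generated lines before running them on a router, since the interface name must match the one defined on that server.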

When troubleshooting any network or routing
issues, it's important to discover the complete picture
of the routes applied to a server to fully understand
the network details. Make sure that you look in both RRAS
and ROUTE PRINT or the Registry list.

Figure 10. You can add static
routes through the GUI shown here, but for more than
a few entries, using a command-line utility is better.

Keep in mind that DNS is the foundation
for Windows 2000, especially when you're
troubleshooting Win2K. DNS will touch
all aspects of the infrastructure. Make
sure it's working and error-free before
digging any deeper into a problem. An
entire article could be written on DNS
troubleshooting alone, but here are
some basics.

Design the DNS structure. Get help
if you don't know how.

Keep it simple. Unless you have
some very slow links to sites,
we usually recommend three name
servers per domain. You may want
more at remote (slow link) sites.

Work out interoperability with
your corporate root name server.
There are a number of options
here, and Win2K DNS will play
nicely with BIND servers if you
do it right.

Make sure the DNS server and zone
configurations are correct, with delegations,
forwarding and name server lists pointing
to the right IP addresses.

Make sure DC names and domain names
are resolved correctly.

Make sure client DNS configuration
is pointing to the right name servers.
Assuming a Win2K DNS name server is
hosting the Win2K domain:

DNS servers' TCP/IP properties
should point to themselves for preferred
DNS and to the other name servers
in the domain as "additional"
DNS servers.

DNS servers at the Win2K root
domain should forward to the name
servers registered on the Internet
for Internet access. This could
be a company-owned or ISP-owned
server.

Clients should point to the Win2K
DNS servers authoritative for their
domain. Order them with the "closest"
servers highest in the
list.

Watch the DNS event logs for
errors, but note that DNS errors
will occur in the Directory Services
and System logs as well.
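A quick first pass at the "are names resolving?" question can be scripted. The Python sketch below is a minimal illustration, assuming a simple forward lookup is enough for a sanity check; it isn't a substitute for netdiag or dcdiag. The function names are hypothetical, and the resolver is passed in as a parameter so you can swap in a different lookup method.

```python
import socket


def check_names(names, resolve=socket.gethostbyname):
    """Map each name to its resolved address, or None if the lookup fails."""
    results = {}
    for name in names:
        try:
            results[name] = resolve(name)
        except OSError:
            # socket.gaierror (lookup failure) is a subclass of OSError.
            results[name] = None
    return results


def unresolved(results):
    """Return the names that failed to resolve, in sorted order."""
    return sorted(name for name, address in results.items() if address is None)
```

Feed check_names the DC and domain names you expect clients to find, then look at unresolved for anything that needs attention in the zone or the client's name server list.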

Cluster Troubleshooting

In troubleshooting cluster problems, a number
of fundamental proactive and reactive tasks apply in almost
all cases.

Proactive Tasks
Get to know your cluster. It's hard to zero in on a problem
when you don't have a feel for how your cluster behaves
when healthy. To do this, make sure cluster logging is
enabledyou can't troubleshoot a cluster problem
otherwise. It's enabled by default in Win2K; but if you're
running NT 4.0 Enterprise Edition, refer to KB Q168801,
"How to Enable Cluster Logging in Microsoft Cluster
Server," to turn on cluster logging.

Next, get familiar with the content of the
log file. Because the content of the log file is verbose
and cryptic, it's often hard to determine if a message
is benign or malignant. Therefore, it's good practice
to periodically save a copy of the cluster log file on
all cluster servers. This can be used as a reference to
compare against, once you experience a problem. You should
also save a copy after you make changes to your cluster
configuration. The cluster log will look very different
before and after you've clustered SQL Server 2000!

Then remember the adage "When it rains,
it pours." Since there's a good chance that next
time you experience a cluster problem you'll also experience
other problems, download and print out some good troubleshooting
documentation, including:

KB Q223258, "How to Install the
NTOP on MSCS 1.0 with SQL Server 6.5 or 7.0."

Something else you can do is upgrade to Win2K.
Clustering is a lot more finicky on NT 4.0 than on Win2K,
mostly because Windows NT 4.0 has the Option Pack. Then
do the same for your cluster-aware BackOffice products.
You can cluster SQL 2000 more reliably than SQL 6.5 or
SQL 7.0!

Finally, I can't overemphasize the need for
a good backup. Make backups and once in a while test your
recovery procedures.

Reactive Tasks
Isolate the errors in the cluster log by comparing what's
normal from your saved cluster log with the events logged
during the problem time. You might need to look at the
cluster log on all servers in the cluster. Remember that
the cluster log timestamp is GMT, so you need to calculate
GMT based on your time zone setting. Once you've identified
a problem area, cross-reference with the event log. Remember
that those are in local time, not GMT! In Win2K you only
need to look at the event log from one server since it's
replicated among cluster servers.
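
The GMT arithmetic is easy to get wrong at 3 a.m., so here is a minimal sketch of the conversion. It assumes the cluster log timestamp looks like "2000/06/12-02:11:41.250" and that you supply your own offset from GMT by hand; the function name is illustrative and daylight saving time is deliberately ignored.

```python
from datetime import datetime, timedelta, timezone


def cluster_time_to_local(stamp, utc_offset_hours):
    """Convert a GMT cluster-log timestamp to local time.

    stamp: text such as "2000/06/12-02:11:41.250" (assumed format).
    utc_offset_hours: your zone's offset from GMT, e.g. -5 for US Eastern
    standard time. Daylight saving time is not handled here.
    """
    gmt = datetime.strptime(stamp, "%Y/%m/%d-%H:%M:%S.%f").replace(
        tzinfo=timezone.utc
    )
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    return gmt.astimezone(local_zone)


local = cluster_time_to_local("2000/06/12-02:11:41.250", -5)
print(local.strftime("%Y-%m-%d %H:%M:%S"))  # 2000-06-11 21:11:41
```

Note how the date rolls back a day in the example; that is exactly the kind of mismatch that sends you to the wrong part of the event log.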

If you don't understand an error code, use
the "Net HelpMsg" command to try to get a better
description of the error. Also use the Knowledge Base
whenever possible.
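
If you look up error codes often, a small wrapper saves typing. The sketch below assumes the code may show up in either decimal or hexadecimal form and that "Net HelpMsg" wants the decimal number; the function names are hypothetical, and the lookup itself only runs on a Windows machine.

```python
import subprocess
import sys


def helpmsg_command(code):
    """Build the "net helpmsg" command line for an error code.

    Accepts decimal ("5") or hexadecimal ("0x5") text; the command is
    always given the decimal form.
    """
    text = str(code).strip().lower()
    number = int(text, 16) if text.startswith("0x") else int(text, 10)
    return ["net", "helpmsg", str(number)]


def describe_error(code):
    """Run the command on Windows and return whatever text it prints."""
    if not sys.platform.startswith("win"):
        raise RuntimeError("net helpmsg is only available on Windows")
    done = subprocess.run(helpmsg_command(code), capture_output=True, text=True)
    return done.stdout.strip()


print(helpmsg_command("0x5"))  # ['net', 'helpmsg', '5']
```

On a cluster server you would call describe_error("5") and read the returned description; not every code has one, which is where the Knowledge Base comes in.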

This last bit of advice might come as a shock.
Most likely, your No. 1 requirement is to get the cluster
and application working as fast as possible. Even so, the most
important thing is to understand the root cause of the problem so you
can prevent it from occurring again. Once you know what
caused the problem, consider all the options: you
can attempt to fix the problem or you can re-install.
I've found that very often it's faster to re-install a
server or a cluster than to fix a complex problem. This
option is often overlooked until many hours have been
spent fighting a complex problem. The solution path you
take will depend on the clustered application.

Troubleshooting Disk Problems
So, what to do if the problem isn't the cluster, but the
disk? If your disk problem occurs right after you installed
your cluster, it's probably a misconfiguration. Backtrack
and verify the integrity of your shared I/O subsystem
without clustering. In general, it's easier to troubleshoot
standalone systems than clustered servers. Don't hesitate
to stress-test your disks before you cluster. SCSI
termination problems can hide during casual checks.
The simplest stress test might be a full (not fast) format
of the disk.
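
If a full format isn't practical, a write-and-read-back check is a lighter alternative. The sketch below is only an illustration under stated assumptions: it writes random data to a scratch file in a directory you choose, reads it back and compares checksums. The function name and sizes are arbitrary, and because the read may be served from the operating system's cache, a pass here proves less than a full format would.

```python
import hashlib
import os
import tempfile


def write_and_verify(directory, total_mb=16, chunk_mb=1):
    """Write random data to a scratch file in `directory`, read it back,
    and report whether the checksums match."""
    chunk_size = chunk_mb * 1024 * 1024
    written = hashlib.sha256()
    fd, path = tempfile.mkstemp(dir=directory, suffix=".iotest")
    try:
        with os.fdopen(fd, "wb") as out:
            for _ in range(total_mb // chunk_mb):
                block = os.urandom(chunk_size)
                written.update(block)
                out.write(block)
            out.flush()
            os.fsync(out.fileno())  # push the data out of the OS write cache
        read_back = hashlib.sha256()
        with open(path, "rb") as src:
            for block in iter(lambda: src.read(chunk_size), b""):
                read_back.update(block)
        return written.hexdigest() == read_back.hexdigest()
    finally:
        os.remove(path)  # always clean up the scratch file


# Demo against a temporary directory; point `directory` at the shared disk instead.
with tempfile.TemporaryDirectory() as scratch:
    print(write_and_verify(scratch, total_mb=4))
```

Raise total_mb well past the size of any controller cache and run it several times if you want the test to put real load on the bus.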

If, however, your disk problem occurs on
a mature cluster, it's most likely caused by hardware
failure. Since disk handling is very different in Win2K
than NT 4.0, make sure you follow the procedure for the
right version of Windows.

It's likely you'll need to disable clustering
temporarily and access your disk directly. Remember this:
Once you disable cluster service and the cluster disk
driver, make sure that you never boot more than one
server at a time or you will corrupt your shared
disks!

To access disks without cluster software
involvement temporarily:

Shut down and power off Server B.

Follow one of these routes:

If you're running NT 4.0:

On Server A, from Control Panel | Services
change the startup of the Cluster Server service from
Automatic to Disabled. To do this, highlight the Cluster
Server service, and select Startup. Note:
Don't stop the Cluster Server service.

From Control Panel | Devices, change the
startup of the Cluster Disk device from System to Disabled.
To do this, highlight the Cluster Disk device, and select
Startup. Note: Don't stop the Cluster Disk
device.

If you're running Win2K, work in Computer Management:

At the bottom, expand "Services
and Applications" and select Services. Right-click
on Cluster Service and expand Properties. In the Startup
type box, click the dropdown arrow and select Disabled.
Then select OK to go back to Computer Management. Note:
Don't stop the Cluster Server service.

At the top, select System Tools and highlight
Device Manager. The visible devices appear in the results
pane. On the toolbar, select View and click on Show
Hidden Devices. A Non-Plug and Play Drivers option will
appear in the results pane. Expand that. Right-click
on Cluster Disk Driver and select Properties, then click
on the Driver tab. The Startup box will be at the bottom.
Click on the options dropdown arrow and select Disabled.
Then select OK to return to Computer Management. Note:
Don't attempt to stop the Cluster Disk device.

In the results pane, right-click on Cluster
Network Driver and select Properties as before and select
the Driver tab. Select Disable and OK. Note:
Don't stop the Cluster Network device.

Finally, reboot Server A.

After the reboot, verify via the proper
disk administration utility your access to the shared
storage devices. The shared disks should show up as
available and online. If you still have disk problems,
it wasn't a cluster problem. If you need to format a
disk, do it now. If you need to set the disk signature
(in Win2K), do it now. If you want to perform an I/O
test, do it now. If you need to restore some data, you
can also do it now.

When you're finished working with the
disks in non-clustered mode, on Server A, follow one
of these paths:

For NT 4.0:

From Control Panel | Services, change
the startup of the cluster service from Disabled to
Automatic.

From Control Panel | Devices, change
the startup of Cluster Disk device from Disabled to
System.

For Win2K:

In Computer Management, under Services,
reset the Cluster Service to Automatic.

In Computer Management, under Device
Manager, display the hidden devices and reset the Cluster
Disk device to System.

In Computer Management, under Device
Manager, reset the Cluster Network to System.

Reboot Server A, then restart Server
B.
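
Because the restore steps have to mirror the disable steps exactly, it can help to keep both in one place. The sketch below simply encodes the startup settings named in the procedure above as a table and prints the matching checklist; it changes nothing on the server, and the function and variable names are illustrative.

```python
# Startup settings taken from the steps above: (component, normal, maintenance).
STARTUP = {
    "NT 4.0": [
        ("Cluster Server service", "Automatic", "Disabled"),
        ("Cluster Disk device", "System", "Disabled"),
    ],
    "Win2K": [
        ("Cluster Service", "Automatic", "Disabled"),
        ("Cluster Disk Driver", "System", "Disabled"),
        ("Cluster Network Driver", "System", "Disabled"),
    ],
}


def checklist(os_version, phase):
    """Return the ordered steps for entering or leaving non-clustered mode.

    phase: "disable" before working on the disks, "restore" afterwards.
    """
    if phase not in ("disable", "restore"):
        raise ValueError("phase must be 'disable' or 'restore'")
    steps = []
    if phase == "disable":
        steps.append("Shut down and power off Server B")
    for component, normal, maintenance in STARTUP[os_version]:
        target = maintenance if phase == "disable" else normal
        steps.append(f"Set {component} startup to {target}")
    steps.append("Reboot Server A")
    if phase == "restore":
        steps.append("Restart Server B")
    return steps


for step in checklist("Win2K", "restore"):
    print(step)
```

Printing the "disable" list before you start and the "restore" list when you finish makes it harder to forget a driver, or to bring Server B back too early.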

How To Become an Expert Troubleshooter

Let's review some of the basic troubleshooting
steps we use when dealing with a Windows
2000 problem at a client site.

Gather all the information about
the problem from the person experiencing
it. (This will be easier for you than
for us because you know your environment.)

To start, ask some probing questions:
"What was the exact error message?"
Get the user to e-mail a screenshot,
if necessary. "What were you
doing when it happened?" Get
the account and computer used, applications
running, and so on. "Have you
seen this before?" If so, get
exact details of the previous incident.
"Was it working before?"
"Is there anything else that
doesn't work?" "What changed
prior to this problem in your environment?"
Something had to change if it just
"quit working."

Next, remember that logs are your
troubleshooting friends. Get event
logs, from both the client and server.
You'd be surprised how many customers
call us before ever looking at the
event logs or getting exact error
messages. Several Registry settings
permit you to dump verbose output
to the event logs for a variety of
things such as replication, name resolution
and Group Policy application. See
Microsoft Knowledge Base articles
Q220940, "How to enable diagnostic
event logging for Active Directory
services," and Q186454, "How
to enable user environment event logging
in Windows 2000."

On the same topic, don't get just
any logsget relevant logs
such as dcpromo.log, userenv.log,
startup.log and netlogon.log. Win2K
has provided improved troubleshooting
capabilities with these logs, so
use them. If you haven't discovered
the userenv.log, see Q221833, "How
to enable user environment debug
logging in retail builds of Windows
2000."

Check out network connectivity.
Make sure everyone can talk to everyone
else. If not, find out if others on
different subnets or remote sites
are experiencing the same problem
or if it's isolated to a particular
site. Determine if it can be reproduced
elsewhere.
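
A quick way to compare sites is to run the same reachability check from a machine in each one. The sketch below is a minimal example that only tests whether a TCP connection to a given port succeeds; the function names are illustrative, and any host names you feed it (such as the ones in the comment) are placeholders for your own servers.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and name lookup failures
        return False


def reachability(targets, timeout=3.0):
    """Map each (host, port) pair to True or False."""
    return {target: can_connect(target[0], target[1], timeout) for target in targets}


# Example call with placeholder names (389 is LDAP, 53 is DNS over TCP):
# reachability([("dc1.example.com", 389), ("dc1.example.com", 53)])
```

A successful connection tells you only that one port answers from where you ran the check, so run it from each subnet or site you suspect and compare the results.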

Next, check Group Policies in Win2K.
They're complicated, to put it mildly,
and can cause a host of problems.
They touch network and domain security,
desktop environments, account authentication
and software installation.