Wednesday, February 11, 2009

Why some applications malfunction when one of the Domain Controllers is down?

Why some applications malfunction when one of the Domain Controllers is down? - Or - How to switch to a disaster recovery site without rebooting my clients

Every sysadmin installs at least two Domain Controllers in the domain for redundancy and fault tolerance. But what actually happens when one of the DCs is down?

If you run a simple test and disconnect one of your DCs from the network, you'll see that about half of the workstations and member servers that haven't rebooted since the DC went down experience problems such as sluggishness and poor performance, and some applications simply stop working. The reason is the way Netlogon works.

Netlogon is the service responsible, among other tasks, for detecting the Active Directory environment and the closest DC. The detection process is called DC Locator.

It is exposed through Netapi32.dll as a function named DsGetDcName and is invoked by the Netlogon service when the service starts. The DC Locator process sends a request to all Domain Controllers in the domain and waits for them to respond. Once they respond, Netlogon caches the details of the Domain Controller that responded first. From that moment, every call made by any application to the DsGetDcName function returns this DC.
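The "first responder wins, then gets cached" behaviour described above can be pictured with a small toy model. This is only a sketch in Python, not the real Netlogon code: the `LocatorCache` class, the `get_dc_name` method, the DC names and the response times are all made up for illustration.

```python
from typing import Dict, Optional


class LocatorCache:
    """Toy stand-in for the client-side DC cache (hypothetical, not Netlogon)."""

    def __init__(self, response_times_ms: Dict[str, float]) -> None:
        # Hypothetical DC names mapped to how long each takes to answer.
        self.response_times_ms = response_times_ms
        self.cached_dc: Optional[str] = None
        self.discoveries = 0  # how many times discovery actually ran

    def get_dc_name(self) -> str:
        # Discovery runs only when nothing is cached yet.
        if self.cached_dc is None:
            self.discoveries += 1
            # The DC with the smallest response time is the first to answer.
            self.cached_dc = min(
                self.response_times_ms, key=self.response_times_ms.get
            )
        # Every later call returns the cached answer without asking again.
        return self.cached_dc


cache = LocatorCache({"DC1": 12.0, "DC2": 30.0})
first = cache.get_dc_name()
second = cache.get_dc_name()
print(first, second, cache.discoveries)  # prints: DC1 DC1 1
```

The second call never touches the network in this model, which is exactly why a cached answer can outlive the DC it points to.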

The DC Locator process does not re-check the availability of the cached DC periodically. Therefore, if this DC goes away for any reason, workstations and member servers that have already cached it keep the stale entry until they are rebooted. As a result, any application that needs to access a DC (and calls DsGetDcName to find one) receives the failed DC and can be expected to have problems when trying to connect to it.
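To see why roughly half of the clients end up in trouble with two DCs, here is a rough simulation of the stale-cache effect. It assumes each client cached one of the DCs with equal probability at startup, which is a simplification; the `simulate_outage` function and the DC names are hypothetical.

```python
import random
from typing import List


def simulate_outage(
    num_clients: int, dcs: List[str], failed_dc: str, seed: int = 0
) -> float:
    """Return the fraction of clients left pointing at the failed DC."""
    rng = random.Random(seed)
    # Each client cached whichever DC happened to answer first at startup,
    # modelled here as a uniformly random pick (an assumption).
    cached = [rng.choice(dcs) for _ in range(num_clients)]
    # Nothing re-checks the cache, so every client that cached the
    # failed DC stays affected until it is rebooted.
    affected = sum(1 for dc in cached if dc == failed_dc)
    return affected / num_clients


fraction = simulate_outage(10_000, ["DC1", "DC2"], failed_dc="DC1")
print(f"{fraction:.1%} of clients still point at the failed DC")
```

With two DCs the result lands close to 50%, and with more DCs the share of affected clients shrinks accordingly under the same assumption.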

In recent years, fault tolerance has become an essential requirement in many organizations. Many enterprises implement expensive disaster recovery sites, buy expensive clusters and replicate data to at least one additional location.

When disaster does strike and the main site goes down, this limitation will cause you lots of trouble until you reboot your entire organization.