I have a production domain spanning two sites (subnets, configured via
"Sites and Services"), originally with two DCs. One went down due to an
HDD failure (old hardware). Now, occasionally, clients can't access or
find the file server (a domain member). This does not occur on all
clients at the same time, however, so I am fairly sure it is not the
file server itself, but a DNS problem.

I couldn't find anything useful in the logs. The default log level
was not informative, I think, while the output at log level 10 was
more than I could analyze properly.

Can someone recommend a log level? Should I look on the DC or on the
file server?

Do I have to remove the offline DC completely from DNS and Sites and
Services for this mess to stop?

I appreciate any advice.
Cheers,
Ole

Ole,

If you haven't already removed the dead DC from your network, you
should do that first.

Your clients' DNS may still be pointing to the offline DC, causing
lookup delays. Also, did you have your DCs pointing to themselves for
DNS or to each other?

Thank you for your help!

I had trouble with fail-safe tests regarding DC redundancy a while ago.
Some time after discussing it here on the list I finally got it working
(had something to do with IPv6). So I can say I have tested the absence
of a DC, and it did not lead to any trouble (except for a very short
moment due to DNS caching, supposedly). Now it does, which is weird.

When the drive errors on the now-broken DC first appeared, the domain
behaved erratically. When I took that DC completely offline, everything
went back to normal. Now issues are showing up. So much for the
background.

The current situation is very much like in the fail-safe tests, with two
exceptions: the remaining DC (FSMO role holder) is the primary DNS
server on all Windows machines, and I updated the resolv.conf on that DC
to only point to itself. This DC and several Windows clients got
restarted after that, but issues persist.
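
For anyone checking the same thing on their own DCs, here is a minimal
sketch (not a Samba tool) that lists the nameserver entries in
resolv.conf-style text and flags any that still point at a dead DC. The
addresses and the search domain in the example are hypothetical.

```python
def nameservers(resolv_conf_text):
    """Return the addresses from 'nameserver' lines, in file order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines ('#' or ';').
        if not line or line[0] in "#;":
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def stale_entries(resolv_conf_text, dead_addresses):
    """Return the nameserver entries that match a known-dead address."""
    dead = set(dead_addresses)
    return [s for s in nameservers(resolv_conf_text) if s in dead]


# Hypothetical resolv.conf content; 10.0.1.2 stands in for the dead DC.
EXAMPLE = """\
# hypothetical resolv.conf on the remaining DC
search samdom.example.com
nameserver 10.0.1.2
nameserver 10.0.1.1
"""

if __name__ == "__main__":
    print(stale_entries(EXAMPLE, ["10.0.1.2"]))
```

On a real machine the text would come from reading /etc/resolv.conf.
An empty result only says the resolver config is clean; it says nothing
about records inside AD DNS.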

Actually, the DCs (resolv.conf) were pointing to each other initially,
and I think that was at least one root of the problem. I think this
advice in the Samba wiki is actually rather bad (and unnecessary with
Samba, as has been pointed out before?).

Regarding demoting the dead DC: my Samba version is rather old (4.2.5).
The problem is that I chose the uid/gid scopes unwisely, and I read in
some patch notes that I can't update anymore because newer versions of
Samba require those scopes to be set in a very specific way. So perhaps
demoting via the newly available method is not an option here.

What I can think of is:
- removing the dead DC from the clients' DNS config, of course
- removing it from AD DNS
- removing it from AD Sites and Services
- and removing it from AD Users and Computers

What else does the Samba script for demoting a DC do? Can I do that
manually, too? I repeat: it was not the FSMO role holder.
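
For the "removing it from AD DNS" step, a minimal sketch of which
record names to look at: it only builds the standard AD DC-locator SRV
names for a realm and its sites, so each can be queried (for example
with dig or host) and checked for the dead DC's hostname. The realm and
site names are hypothetical, and this is not a description of what the
Samba demote script does.

```python
def dc_locator_records(realm, sites=()):
    """Return SRV record names that clients use to locate DCs."""
    realm = realm.lower().rstrip(".")
    # Domain-wide locator records.
    names = [
        "_ldap._tcp." + realm,
        "_ldap._tcp.dc._msdcs." + realm,
        "_ldap._tcp.gc._msdcs." + realm,
        "_gc._tcp." + realm,
        "_kerberos._tcp." + realm,
        "_kerberos._tcp.dc._msdcs." + realm,
        "_kerberos._udp." + realm,
        "_kpasswd._tcp." + realm,
        "_kpasswd._udp." + realm,
    ]
    # Per-site records, one set for each site in Sites and Services.
    for site in sites:
        names.append("_ldap._tcp.%s._sites.%s" % (site, realm))
        names.append("_ldap._tcp.%s._sites.dc._msdcs.%s" % (site, realm))
        names.append("_kerberos._tcp.%s._sites.%s" % (site, realm))
        names.append("_kerberos._tcp.%s._sites.dc._msdcs.%s" % (site, realm))
    return names


if __name__ == "__main__":
    # Hypothetical realm and site names.
    for name in dc_locator_records("samdom.example.com", ["Site-A", "Site-B"]):
        print(name)
```

The list is not exhaustive: the DC's own A record, its entry in the A
records for the realm name itself, and its GUID-named CNAME under
_msdcs can also still point at the dead machine.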