Just thought I’d put these here for my own easy reference. I keep forgetting these records and when there’s an issue I end up Googling and trying to find them! These are DNS records you can query to see if clients are able to look up the PDC, GC, KDC, and DC of the domain you specify via DNS. If this is broken nothing else will work. :)

PDC

_ldap._tcp.pdc._msdcs.<DnsDomainName>

GC

_ldap._tcp.gc._msdcs.<DnsDomainName>

KDC

_kerberos._tcp.dc._msdcs.<DnsDomainName>

DC

_ldap._tcp.dc._msdcs.<DnsDomainName>

You would look this up using nslookup -type=SRV <Record>.

As a refresher, SRV records are of the form _Service._Proto.Name TTL Class SRV Priority Weight Port Target. The _Service._Proto.Name part is what we are looking up above; our namespace just happens to be _msdcs.<DnsDomainName>.
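For instance (using a placeholder domain example.local – substitute your own), you could check the PDC record either with nslookup or with the Resolve-DnsName cmdlet:

```powershell
# Query the PDC locator record (example.local is a placeholder domain)
nslookup -type=SRV _ldap._tcp.pdc._msdcs.example.local

# PowerShell equivalent; returns objects with Priority, Weight, Port and NameTarget
Resolve-DnsName -Name "_ldap._tcp.pdc._msdcs.example.local" -Type SRV
```

A healthy reply lists one SRV record per DC holding that role (just the one, of course, for the PDC record).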

Cool tip via a Microsoft blog post. If you have a connection object in your AD Sites and Services that was manually created and you now want to switch over to letting KCC generate the connection objects instead of using the manual one, the easiest thing to do is convert the manually created one to an automatic one using ADSI Edit.

1.) Open ADSI Edit and go to the Configuration partition.

2.) Drill down to Sites, the site where the manual connection object is, Servers, the server where the manual connection object is created, NTDS Settings

3.) Right click on the manual connection object and go to properties

4.) Go to the Options attribute and change it from 0 to 1 (if it’s an RODC, then change it from 64 to 65)

5.) Either wait 15 minutes (that’s how often the KCC runs) or run repadmin /kcc to manually kick it off
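The same change can be sketched in PowerShell too (the DN below is made up – adjust the connection, server and site names to match yours):

```powershell
# Hypothetical DN of the manually created connection object
$dn = "CN=MyManualConnection,CN=NTDS Settings,CN=WIN-DC01,CN=Servers," +
      "CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=example,DC=local"

# Set the lowest bit of the options attribute (0 -> 1; 64 -> 65 for an RODC)
$obj = Get-ADObject -Identity $dn -Properties options
Set-ADObject -Identity $dn -Replace @{ options = ($obj.options -bor 1) }

# Kick off the KCC instead of waiting 15 minutes
repadmin /kcc
```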

A while ago I was reading up on the Domain Controller process to confirm some stuff before changes I was making at work. Found a couple of good links, still got them open in my browser but before closing them out I thought I should paste them here as a reference to myself. I know most of the info, but occasionally you forget what you know or need a quick confirmation.

I have been trying to read up on ADFS these days. It’s my new area of interest! :) I wrote a document at work sort of explaining it to others, so here are bits and pieces from that.

What does Active Directory Federation Services (ADFS) do?

Typically when you visit a website you’d need to log in to that website with a username/ password stored on their servers, and then the website will give you access to whatever you are authorized to. The website does two things basically – one, it verifies your identity; and two, it grants you access to resources.

It makes sense for the website to control access, as these are the website’s own resources. But there’s no need for the website to control identity too. There’s really no need for everyone who needs access to a website to have user accounts and passwords stored on that website. The two steps – identity and access control – can be decoupled. That’s what ADFS lets us do.

With ADFS in place, a website trusts someone else to verify the identity of users. The website itself is only concerned with access control. Thus, for example, a website could have trusts with (say) Microsoft, Google, Contoso, etc., and if a user is able to successfully authenticate with any of these services, which then lets the website know, the user is granted access. The website itself doesn’t receive the username or password. All it receives are “claims” about the user.

What are Claims?

A claim is a statement about “something”. Example: my username is ___, my email address is ___, my XYZ attribute is ___, my phone number is ____, etc.

When a website trusts our ADFS for federation, users authenticate against the ADFS server (which in turn uses AD or some other identity store to authenticate users), and the ADFS server passes a set of claims to the website. Thus the website has no info on the (internal) AD username, password, etc. All the website sees are the claims, using which it can decide what to do with the user.

Claims are per trust. Multiple applications can use the same trust, or you could have a trust per application (latter more likely).

All the claims pertaining to a user are packaged together into a secure token.

What is a Secure Token?

A secure token is a signed package containing claims. It is what an ADFS server sends to a website – basically a list of claims, signed with the token signing certificate of the ADFS server. We would have sent the public key part of this certificate to the website while setting up the trust with them; thus the website can verify our signature and know the tokens came from us.

Relying Party (RP) / Service Provider (SP)

Refers to the website/ service that is relying on us. They trust us to verify the identity of our users and have allowed access for our users to their services.

I keep saying “website” above, but really I should have been more generic and said Relying Party. A Relying Party is not limited to a website, though that’s how we commonly encounter it.

A Relying Party can be another ADFS server too. Thus you could have a setup where a Relying Party trusts an ADFS service (which is the Claims Provider in this relationship), and the ADFS service in turn trusts a bunch of other ADFS servers depending on (say) the user’s location (so the trusting ADFS service is a Relying Party in this relationship).

Claims Provider (CP) / Identity Provider (IdP)

The service that actually validates users and then issues tokens. ADFS, basically.

Note: Claims Provider is the Microsoft terminology; Identity Provider is the generic term.

Secure Token Service (STS)

The service within ADFS that accepts requests and creates and issues security tokens containing claims.

Claims Provider Trust & Relying Party Trust

Refers to the trust between a Relying Party and Identity Provider. Tokens from the Identity Provider will be signed with the Identity Provider’s token signing key – so the Relying Party knows it is authentic. Similarly requests from the Relying Party will be signed with their certificate (which we can import on our end when setting up the trust).

Claims Provider Trust is the trust relationship a Relying Party STS has with an Identity Provider STS. This trust is required for the Relying Party STS to accept incoming claims from the Identity Provider STS.

Relying Party Trust is the trust relationship an Identity Provider STS has with a Relying Party STS. This trust is required for the Identity Provider STS to send claims to the Relying Party STS.

Web Application Proxy (WAP)

Access to an ADFS server over the Internet is via a Web Application Proxy. This is a role in Server 2012 and above – think of it as a reverse proxy for ADFS. The ADFS server is within the network; the WAP server is in the DMZ and exposed to the Internet (at least port 443). The WAP server doesn’t need to be domain joined. All it has is a reference to the ADFS server – either via DNS, or even just a hosts file entry. The WAP server too contains the public certificates of the ADFS server.

Miscellaneous

ADFS Federation Metadata – this is a cool link that is published by the ADFS server (unless we have disabled it). It is https://<your-adfs-fqdn>/FederationMetadata/2007-06/FederationMetadata.xml and contains all the info required by a Relying Party to add the ADFS server as a Claims Provider.

This also includes Base64 encoded versions of the token signing certificate and token decrypting certificates.

SAML Entity ID – not sure of the significance of this yet, but this too can be found in the Federation Metadata file. It is usually of the form http://<your-adfs-fqdn>/adfs/services/trust and is required by the Relying Party to setup a trust to the ADFS server.

SAML endpoint URL – this is the URL where users are sent for authentication. Usually of the form https://<your-adfs-fqdn>/adfs/ls. This information too can be found in the Federation Metadata file.

Praise aside, it is a good article on how subnet and site definitions are used to find a Domain Controller closest to you, and especially how it works across forest trusts.

Update: There’s a part two which I came across later. This covers a specific edge case.

Using Catch-All subnets in AD – Wanted to know how catch-all subnets in AD Sites will interact with specific ones. This one explained it. The specific one takes precedence. Which is exactly what you want. :)

Do cross forest trusts need PDCs that can communicate with each other? The answer seems to be NO, check out this discussion for more.

Was trying to wrap my head around ADFS and Certificates today morning. Before I close all my links etc I thought I should make a note of them here. Whatever I say below is more or less based on these links (plus my understanding):

If the certificate is from a public or enterprise CA, however, then the key can be exported (I think).

The exported public certificate is usually loaded on the service provider (or relying party; basically the service where we can authenticate using our ADFS). They use it to verify our signature.

The ADFS server signs tokens using this certificate (i.e. uses its private key to encrypt the token or a hash of the token – am not sure). The service provider using the ADFS server for authentication can verify the signature via the public certificate (i.e. decrypt the token or its hash using the public key and thus verify that it was signed by the ADFS server). This doesn’t provide any protection against anyone viewing the SAML tokens (as it can be decrypted with the public key) but does provide protection against any tampering (and verifies that the ADFS server has signed it).

This can be a self-signed certificate. Or from enterprise or public CA.

By default this certificate expires 1 year after the ADFS server was set up. A new certificate will be automatically generated 20 days prior to expiration. This new certificate will be promoted to primary 15 days prior to expiration.

This auto-renewal can be disabled. The PowerShell expression (Get-AdfsProperties).AutoCertificateRollover tells you the current setting.

The lifetime of the certificate can be changed to 10 years to avoid this yearly renewal.

No need to worry about this on the WAP server.

There can be multiple such certificates on an ADFS server. By default all certificates in the list are published, but only the primary one is used for signing.
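The relevant PowerShell, as a sketch (run on the ADFS server; the cmdlets are from the ADFS module):

```powershell
# Is automatic certificate rollover enabled?
(Get-AdfsProperties).AutoCertificateRollover

# Change the lifetime of auto-generated certificates to ~10 years (value is in days)
Set-AdfsProperties -CertificateDuration 3650

# List the token-signing certificates; the primary one has IsPrimary set to True
Get-AdfsCertificate -CertificateType Token-Signing
```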

The “Token-decrypting” certificate.

This one’s a bit confusing to me. Mainly coz I haven’t used it in practice I think.

This is the certificate used if a service provider wants to send us encrypted SAML tokens.

Take note: 1) Sending us. 2) Not signed; encrypted.

As with “Token-signing” we export the public part of this certificate, upload it with the 3rd party, and they can use that to encrypt and send us tokens. Since only we have the private key, no one can decrypt en route.

This can be a self-signed certificate.

Same renewal rules etc. as the “Token-signing” certificate.

Important thing to remember (as it confused me until I got my head around it): This is not the certificate used by the service provider to send us signed SAML tokens. The certificate for that can be found in the signature tab of that 3rd party’s relying party trust (see screenshot below).

Another thing to take note of: A different certificate is used if we want to send encrypted info to the 3rd party. This is in the encryption tab of the same screenshot.

Turns out I was mistaken in my previous post. A few minutes after enabling inheritance, I noticed it was disabled again. So that means the groups must be protected by AD.

I knew of the AdminSDHolder object and how it provides a template set of permissions that are applied to protected accounts (i.e. members of groups that are protected). I also knew that there were some groups that are protected by default. What I didn’t know, however, was that the defaults can be changed.

Initially I did a Compare-Object -ReferenceObject (Get-ADPrincipalGroupMembership User1) -DifferenceObject (Get-ADPrincipalGroupMembership User2) -IncludeEqual to compare the memberships of two random accounts that seemed to be protected. These were accounts with totally different roles & group memberships so the idea was to see if they had any common groups (none!) and failing that to see if the groups they were in had any common ancestors (none again!)

Then I Googled a bit :o) and came across a solution.

Before moving on to that though, as a note to myself:

The AdminSDHolder object is at CN=AdminSDHolder,CN=System,DC=domain,DC=com. Find that via ADSI Edit (replace the domain part accordingly).

Right click the object and its Security tab lists the template permissions that will be applied to members of protected groups. You can make changes here.

SDProp is a process that runs every 60 minutes on the DC holding the PDC Emulator role. The period can be changed via the registry key HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Parameters\AdminSDProtectFrequency. (If it doesn’t exist, add it. DWORD).
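Rather than wait for the next SDProp cycle, you can ask the PDC Emulator to run it immediately by writing to the rootDSE. A sketch (this is the rootDSE modify Microsoft documents for 2008 and later; on 2003 the attribute was FixUpInheritance):

```powershell
# Trigger the AdminSDHolder (SDProp) task on demand; run against the PDC Emulator
$rootDSE = [ADSI]"LDAP://RootDSE"
$rootDSE.UsePropertyCache = $false
$rootDSE.Put("RunProtectAdminGroupsTask", "1")
$rootDSE.SetInfo()
```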

Bingo! Most of my admin groups were protected, so most admin accounts were protected. All I have to do now is either un-protect these groups (my preferred solution), or change the template to delegate permissions there.

Update: Simply un-protecting a group does not un-protect all its members (this is by design). The member objects too have their adminCount attribute set to 1, so apart from un-protecting the groups we must un-protect the members too.

Something I hadn’t realized about adminCount. This attribute does not mean a group/ user will be protected. Instead, what it means is that a group/ user is protected and its ACLs had changed and have since been reset to the defaults – when that happens, the adminCount attribute is set. So yes, adminCount will let you find groups/ users that are protected; but merely setting adminCount on a group/ user does not protect it. I learnt this the hard way while testing my changes – I set adminCount to 1 for a group and saw that nothing happened.

Also, it is possible that a protected user/ group does not have adminCount set. This is because adminCount is only set if there is a difference in the ACLs between the user/ group and the AdminSDHolder object. If there’s no difference, a protected object will not have the adminCount attribute set. :)
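With those caveats in mind, adminCount is still the easiest way to round up the affected objects. Something like:

```powershell
# List users and groups whose adminCount attribute is set to 1
Get-ADObject -LDAPFilter "(adminCount=1)" -Properties adminCount |
    Select-Object Name, ObjectClass, DistinguishedName
```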

Today I cracked a problem which had troubled us for a while but which I never really sat down and actually tried to troubleshoot. We had an OU with 3rd level admin accounts that no one else had rights to but wanted to delegate certain password related tasks to our Service Desk admins. Basically let them reset password, unlock the account, and enable/ disable.

Here are some screenshots for the delegation wizard. Password reset is a common task and can be seen in the screenshot itself. Enable/ Disable can be delegated by giving rights to the userAccountControl attribute. Only force password change rights (i.e. no reset password) can be given via the pwdLastSet attribute. And unlock can be given via the lockoutTime attribute.

Problem was that in my case in spite of doing all this the delegated accounts had no rights!

Snooping around a bit I realized that all the admin accounts within the OU had inheritance disabled and so weren’t getting the delegated permissions from the OU (not sure why; and no these weren’t protected group members).

Of course, enabling is easy. But I wanted to see if I could get a list of all the accounts in there with their inheritance status. Time for PowerShell. :)

The Get-ACL cmdlet can list access control lists. It can work with AD objects via the AD: drive. It just needs a distinguished name. So all you have to do is take (Get-ADUser <accountname>).DistinguishedName, prefix an AD: to it, and pass that to Get-ACL. Something like this:

Get-ACL AD:$((Get-ADUser rakhesh).DistinguishedName)

The default result is useless. If you pipe and expand the Access property you will get a list of ACLs.

Of interest to us is the AreAccessRulesProtected property. If this is True then inheritance is disabled; if False inheritance is enabled. So it’s straightforward to make a list of accounts and their inheritance status:
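Here’s a sketch that reports each account in an OU with its inheritance status (the OU path is a placeholder; this needs the ActiveDirectory module loaded for the AD: drive):

```powershell
# For every user in the OU, report whether inheritance is disabled
Get-ADUser -Filter * -SearchBase "OU=Admins,DC=example,DC=local" | ForEach-Object {
    [pscustomobject]@{
        Account             = $_.SamAccountName
        InheritanceDisabled = (Get-Acl "AD:$($_.DistinguishedName)").AreAccessRulesProtected
    }
}
```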

So that’s it. Next step would be to enable inheritance on the accounts. I won’t be doing this now (as it’s bed time!) but one can do it manually or script it via the SetAccessRuleProtection method. This method takes two parameters (enable/ disable inheritance; and if disable then should we add/ remove existing ACEs). Only the first parameter is of significance in my case, but I have to pass the second parameter too anyways – SetAccessRuleProtection($False,$True).
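For reference, the scripted version might look like this (an untested sketch; the account name is a placeholder):

```powershell
# Enable inheritance on an account ($false = do not protect, i.e. inherit)
$dn  = (Get-ADUser someaccount).DistinguishedName   # placeholder account
$acl = Get-Acl "AD:$dn"
$acl.SetAccessRuleProtection($false, $true)
Set-Acl -Path "AD:$dn" -AclObject $acl
```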

Update 2: Didn’t realize I had many users in the built-in protected groups (these are protected even though their adminCount is 0 – I hadn’t realized that). To unprotect these one must set the dsHeuristics flag. The built-in protected groups are 1) Account Operators, 2) Server Operators, 3) Print Operators, and 4) Backup Operators. See this post for instructions (actually, see the post below for even better instructions).

Update 3: Found this amazing page that goes into a hell of details on this topic. Be sure to read this before modifying dsHeuristics.

I have these tabs open in my browser from last month when I was doing some WMI based GPO targeting. Meant to write a blog post but I keep getting side tracked and now it’s been nearly a month so I have lost the flow. But I want to put these in the blog as a reference to my future self.

Note to self: the “Domain Admins” and “Enterprise Admins” groups aren’t the groups that actually carry the rights in a domain. That would be the “Administrators” group, present in the “Builtin” container. The other two groups are members of this group and thus get rights over the domain. The “Enterprise Admins” group is also a member of the “Administrators” group in all other domains/ child-domains of that forest, hence its members get rights over those domains too.

So if you want to create a separate group in your domain and want to give its members domain admin rights over (say) a child domain, all you need to do is create the group (must be Global or Universal) and add this group to the “Administrators” group in the child domain. That’s it!
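In PowerShell that could be sketched as (all names and paths below are placeholders):

```powershell
# Create a global group in the parent domain
New-ADGroup -Name "Child Domain Admins" -GroupScope Global `
    -Path "OU=Groups,DC=parent,DC=local" -Server "parent.local"

# Add it to the Builtin Administrators group of the child domain
Add-ADGroupMember -Server "child.parent.local" `
    -Identity "CN=Administrators,CN=Builtin,DC=child,DC=parent,DC=local" `
    -Members (Get-ADGroup "Child Domain Admins" -Server "parent.local")
```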

Earlier today I had blogged about nslookup working but ping and other methods not resolving names to IP addresses. That problem started again, later in the evening.

Today morning though as a precaution I had enabled the DNS Client logs on my computer. (To do that open Event Viewer with an admin account, go down to Applications and Services Logs > Microsoft > Windows > DNS Client Events > Operational – and click “Enable log” in the “Actions” pane on the right).
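The same log can also be enabled from an elevated prompt; I believe the channel name is as below, though it may differ between Windows versions:

```powershell
# Enable the DNS Client operational log (run elevated)
wevtutil set-log "Microsoft-Windows-DNS-Client-Events/Operational" /enabled:true
```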

That showed me an error along the following lines:

A name not found error was returned for the name vcenter01.rakhesh.local. Check to ensure that the name is correct. The response was sent by the server at 10.50.1.21:53.

Interesting. So it looked like a particular DC was the culprit. Most likely when I restarted the DNS Client service it just chose a different DC and the problem temporarily went away. And sure enough nslookup for this record against this DNS server returned no answers.

I fired up DNS Manager and looked at this server. It seemed quite outdated with many missing records. This is my simulated branch office DC so I don’t always keep it on/ online. Looks like that was coming back to bite me now.

The DNS logs in Event Viewer on that server had errors like this:

The DNS server was unable to complete directory service enumeration of zone TrustAnchors. This DNS server is configured to use information obtained from Active Directory for this zone and is unable to load the zone without it. Check that the Active Directory is functioning properly and repeat enumeration of the zone. The extended error debug information (which may be empty) is “”. The event data contains the error.

So Active Directory is the culprit (not surprising as these zones are AD integrated so the fact that they weren’t up to date indicated AD issues to me). I ran repadmin /showrepl and that had many errors:

Naming Context: CN=Configuration,DC=rakhesh,DC=local

Source: COCHIN\WIN-DC03 ******* WARNING: KCC could not add this REPLICA LINK due to error.

Great! I fired up AD Sites and Services and the links seemed ok. Moreover I could ping the DCs from each other. Event Logs on the problem DC (WIN-DC02) had many entries like this though:

The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server win-dc01$. The target name used was Rpcss/WIN-DC01. This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Please ensure that the target SPN is registered on, and only registered on, the account used by the server. This error can also happen when the target service is using a different password for the target service account than what the Kerberos Key Distribution Center (KDC) has for the target service account. Please ensure that the service on the server and the KDC are both updated to use the current password. If the server name is not fully qualified, and the target domain (RAKHESH.LOCAL) is different from the client domain (RAKHESH.LOCAL), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server.

Hmm, secure channel issues? I tried resetting it but that too failed:

C:\>nltest /sc_reset:rakhesh.local
I_NetLogonControl failed: Status = 5 0x5 ERROR_ACCESS_DENIED

(Ignore the above though. Later I realized that this was because I wasn’t running command prompt as an admin. Because of UAC even though I was logged in as admin I should have right clicked and ran command prompt as admin).

Since I know my environment, it looked likely to be a case of this DC losing trust with the other DCs. The KRB_AP_ERR_MODIFIED error also seems to show up in mixed Windows Server 2003 and Windows Server 2012 R2 environments, but mine wasn’t Windows Server 2003. That blog post confirmed my suspicion that this was password related.

That confirms the problem. WIN-DC02 thinks its password last changed on 9th May whereas WIN-DC01 changed it on 25th July and replicated it to WIN-DC03.

Interestingly that date of 25th July is when I first started having problems in my test lab. I thought I had sorted them but apparently they were only lurking beneath. The solution here is to reset the WIN-DC02 password on itself and WIN-DC01 and replicate it across. The steps are in this KB article, here’s what I did:

On WIN-DC02 (the problem DC) I stopped the KDC service and set its startup type to Manual.

Purge the Kerberos ticket cache. You can view the ticket cache by typing the command: klist. To purge, do: klist purge.

Open a command prompt as administrator (i.e. right click and do a “Run as administrator”) then type the following command: netdom resetpwd /Server:WIN-DC01.rakhesh.local /UserD:MyAdminAccount /PasswordD:*

Restart WIN-DC02.

After logon start the KDC service and set it to Automatic.
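The whole sequence, sketched as commands (run elevated on the problem DC; the admin account name is a placeholder):

```powershell
# 1. Stop the KDC and set it to Manual start
Stop-Service kdc
Set-Service kdc -StartupType Manual

# 2. Purge the Kerberos ticket cache
klist purge

# 3. Reset the machine account password against a healthy DC
netdom resetpwd /Server:WIN-DC01.rakhesh.local /UserD:MyAdminAccount /PasswordD:*

# 4. Reboot; then after logon set the KDC back to Automatic and start it:
#    Set-Service kdc -StartupType Automatic; Start-Service kdc
Restart-Computer
```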

Checked the Event Logs to see if there are any errors. None initially but after I forced a sync via repadmin /syncall /e I got a few. All of them had the following as an error:

2148074274 The target principal name is incorrect.

Odd. But at least it was different from the previous errors and we seemed to be making progress.

After a bit of trial and error I noticed that whenever the KDC service on the DC was stopped it seemed to work fine.

I could access other servers (file shares), connect to them via DNS Manager, etc. But start KDC and all these would break with errors indicating the target name was wrong or that “a security package specific error occurred”. Eventually I left KDC stay off, let the domain sync via repadmin /syncall, and waited a fair amount of time (about 15-20 mins) for things to settle. I kept an eye on repadmin /replsummary to see the deltas between WIN-DC02 and the rest, and also kept an eye on the DNS zones to see if WIN-DC02 was picking up newer entries from the others. Once these two looked positive, I started KDC. And finally things were working!

I spent the better part of today evening trying to sort this issue. But didn’t get anywhere. I don’t want to forget the stuff I learnt while troubleshooting so here’s a blog post.

Today evening I added one of my ESXi hosts to my domain. The other two wouldn’t add, until I discovered that the time on those two hosts was out of sync. I spent some time trying to troubleshoot that but didn’t get anywhere. The NTP client on these hosts was running, the ports were open, the DC (which was also the forest PDC and hence the time keeper) was reachable – but time was still out of sync.

The ntpq command has an interactive mode (which you get into if you run it without any switches; read the manpage for more info). The -p switch tells ntpq to output a list of peers and their state. The KB article above suggests running this command every 2 seconds using the watch command but you don’t really need to do that.

Important points about the output of this command:

If it says “No association ID's returned” it means the ESXi host cannot reach the NTP server. Considering I didn’t get that, it means I have no connectivity issue.

If it says “***Request timed out” it means the response from the NTP server didn’t get through. That’s not my problem either.

If there’s an asterisk before the remote server name it means there is a huge gap between the time on the host and the time given by the NTP server. Because of the huge gap NTP is not changing the time (to avoid any issues caused by a sudden jump in the OS time). Manually restarting the NTP daemon (/etc/init.d/ntpd restart) should sort it out.

The output above doesn’t show it but one of my problem hosts had an asterisk. Restarting the daemon didn’t help.

The refid field shows the time stream to which the client is syncing. For instance here’s the w32tm output from my domain:

C:\Windows\system32>w32tm /monitor /domain:rakhesh.local
WIN-DC01.rakhesh.local *** PDC *** [10.50.0.20:123]:
    ICMP: 0ms delay
    NTP: +0.0000000s offset from WIN-DC01.rakhesh.local
        RefID: 'LOCL' [0x4C434F4C]
        Stratum: 1
WIN-DC02.rakhesh.local [10.50.1.21:123]:
    ICMP: 1ms delay
    NTP: +0.0127058s offset from WIN-DC01.rakhesh.local
        RefID: WIN-DC01.rakhesh.local [10.50.0.20]
        Stratum: 2
WIN-DC03.rakhesh.local [10.50.0.22:123]:
    ICMP: 1ms delay
    NTP: +0.0183887s offset from WIN-DC01.rakhesh.local
        RefID: WIN-DC01.rakhesh.local [10.50.0.20]
        Stratum: 2
Warning:
Reverse name resolution is best effort. It may not be
correct since RefID field in time packets differs across
NTP implementations and may not be using IP addresses.

Notice the PDC has a refid of LOCL (indicating it is its own time source) while the rest have a refid of the PDC name. My ESXi host has a refid of .INIT. which means it has not received any response from the NTP server (shouldn’t the error message have been something else!?). So that’s the problem in my case.

Obviously the PDC is working because all my Windows machines are keeping correct time from it. So is vCenter. But some of my ESXi hosts aren’t.

I have no idea what’s wrong. After some troubleshooting I left it because that’s when I discovered my domain had some inconsistencies. Fixing those took a while, after which I hit upon a new problem – vCenter clients wouldn’t show me vCenter or any hosts when I login with my domain accounts. Everything appears as expected under the administrator@vsphere.local account but the domain accounts return a blank.

While double-checking that the domain admin accounts still have permissions to vCenter and SSO I came across the following error:

Great! (The message is “Cannot load the users for the selected domain“).

I am using the vCenter appliance. Digging through the /var/log/messages on this I found the following entries:

Searched Google a bit but couldn’t find any resolutions. Many blog posts suggested removing vCenter from the domain and re-adding but that didn’t help. Some blog posts (and a VMware KB article) talk about ensuring reverse PTR records exist for the DCs – they do in my case. So I am drawing a blank here.

Odd thing is the appliance is correctly connected to the domain and can read the DCs and get a list of users. The appliance uses Likewise (now called PowerBroker Open) to join itself to the domain and authenticate with it. The /opt/likewise/bin directory has a bunch of commands which I used to verify domain connectivity:

vcenter01:/opt/likewise/bin# ./lw-get-dc-list rakhesh.local
Got 3 DCs:
===========
DC 1: Name='win-dc01.rakhesh.local', Address='10.50.0.20'
DC 2: Name='win-dc03.rakhesh.local', Address='10.50.0.22'
DC 3: Name='win-dc02.rakhesh.local', Address='10.50.1.21'

vcenter01:/opt/likewise/bin# ./lw-get-dc-name rakhesh.local
Printing LWNET_DC_INFO fields:
===============================
dwDomainControllerAddressType = 23
dwFlags = 62461
dwVersion = 5
wLMToken = 65535
wNTToken = 65535
pszDomainControllerName = WIN-DC01.rakhesh.local
pszDomainControllerAddress = 10.50.0.20
pucDomainGUID(hex) = 16A28365D368F448814501162042CCDF
pszNetBIOSDomainName = RAXNET
pszFullyQualifiedDomainName = rakhesh.local
pszDnsForestName = rakhesh.local
pszDCSiteName = COCHIN
pszClientSiteName = COCHIN
pszNetBIOSHostName = WIN-DC01
pszUserName = <EMPTY>

vcenter01:/opt/likewise/bin# ./lw-enum-users
User info (Level-0):
====================
Name: RAXNET\admin
Uid: 1204290036
Gid: 1204290049
Gecos: <null>
Shell: /bin/sh
Home dir: /home/local/RAXNET/admin
User info (Level-0):
====================
Name: RAXNET\guest
Uid: 1204290037
Gid: 1204290050
Gecos: <null>
Shell: /bin/sh
Home dir: /home/local/RAXNET/guest

All looks well! In fact, I added a user to my domain, re-ran the lw-enum-users command, and it correctly picked up the new user. So the appliance can definitely see my domain and get a list of users from it. The problem appears to be in the upper layers.

In /var/log/vmware/sso/ssoAdminServer.log I found the following each time I’d query the domain for users via the SSO section in the web client:

For no apparent reason my home testlab went wonky today! Not entirely surprising. The DCs in there are not always on/ connected; and I keep hibernating the entire lab as it runs off my laptop so there’s bound to be errors lurking behind the scenes.

Anyways, after a reboot my main DC was acting weird. For one it took a long time to start up – indicating DNS issues, but that shouldn’t be the case as I had another DC/ DNS server running – and after boot up DNS refused to work. Gave the above error message. The Event Logs were filled with two errors:

Event ID 4000: The DNS server was unable to open Active Directory. This DNS server is configured to obtain and use information from the directory for this zone and is unable to load the zone without it. Check that the Active Directory is functioning properly and reload the zone. The event data is the error code.

Event ID 4007: The DNS server was unable to open zone <zone> in the Active Directory from the application directory partition <partition name>. This DNS server is configured to obtain and use information from the directory for this zone and is unable to load the zone without it. Check that the Active Directory is functioning properly and reload the zone. The event data is the error code.

A quick Google search brought up this Microsoft KB. Looks like the DC has either lost its secure channel with the PDC, or it holds all the FSMO roles and is pointing to itself as a DNS server. Either of these could be the culprit in my case as this DC indeed had all the FSMO roles (and hence was also the PDC), and so maybe it lost trust with itself? Pretty bad state to be in, having no trust in oneself … ;-)

The KB article is worth reading for possible resolutions. In my case since I suspected DNS issues in the first place, and the slow loading usually indicates the server is looking to itself for DNS, I checked that out and sure enough it was pointing to itself as the first nameserver. So I changed the order, gave the DC a reboot, and all was well!

In case the DC had lost trust with itself the solution (according to the KB article) was to reset the DC password. Not sure how that would reset trust, but apparently it does. This involves using the netdom command, which is installed on Server 2008 and up (as well as on Windows 8, or wherever RSAT is installed; for 2003 it can be downloaded as part of the Support Tools package). The command has to be run on the computer whose password you want to reset (so you must log in with an account whose credentials are cached, or use a local account). Then run the command thus:
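The command is the same shape as the one used on the problem DC earlier (the server and account names are placeholders):

```powershell
netdom resetpwd /Server:WIN-DC01.rakhesh.local /UserD:MyAdminAccount /PasswordD:*
```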

I have used netdom in the past to reset my testlab computer passwords. Since a lot of the machines are usually offline for many days, and after a while AD changes the computer account password but the machine still has the old password, when I later boot up the machine it usually gives an error like this: “The trust relationship between this workstation and the primary domain failed.”