Had an interesting issue at work today. When our Exchange servers (which are in a 2 node DAG) rebooted after patch weekend one of them had trouble starting the Information Store service. The System log had entries such as these (event ID 7024) –

Unable to initialize the Information Store service because the clocks on the client and server are skewed. This may be caused by a time change either on the client or on the server, and may require a restart of that computer. Verify that your domain is correctly configured and is currently online.

So it looked like time synchronization was an issue. Which is odd coz all our servers should be correctly syncing time from the Domain Controllers.

Our Exchange team fixed the issue by forcing a time sync from the DC –

net time \\NameOfDC /set

I was curious as to why, so I went through the System logs in detail. What I saw was a sequence of entries such as these –

13:21:32 - The operating system started at system time 2016-05-21T09:21:32.125599300Z.
13:27:40 - The system time has changed to 2016-05-21T09:27:40.802000000Z from 2016-05-21T09:22:12.545270300Z.
13:27:55 - The Windows Time service entered the running state.
13:27:57 - The time provider NtpClient is currently receiving valid time data from NameOfDC.domain (ntp.d|0.0.0.0:123->x.y.z.p:123).
13:22:43 - The system time has changed to 2016-05-21T09:22:43.211046200Z from 2016-05-21T09:28:11.768054400Z.
13:22:43 - The time service is now synchronizing the system time with the time source NameOfDC.domain (ntp.d|0.0.0.0:123->x.y.z.p:123).
13:22:43 - The system time has changed to 2016-05-21T09:22:43.211000000Z from 2016-05-21T09:22:43.211046200Z.

Notice how the time jumps ahead from 13:21, when the OS starts, to 13:27, then jumps back to 13:22 when the Windows Time service starts and begins syncing time from my DC. It looked like this jump of about 6 mins was confusing the Exchange services (understandably so). But why was this happening?
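The size of each jump can be read straight off the "system time has changed" entries. Here is a minimal Python sketch that does the subtraction for the two entries above. The helper names are mine, and the timestamps are truncated to microseconds because Python's datetime cannot hold the 100 ns precision that Windows logs.

```python
from datetime import datetime, timezone

def parse_event_time(stamp: str) -> datetime:
    """Parse a UTC timestamp such as 2016-05-21T09:27:40.802000000Z."""
    date_part, frac = stamp.rstrip("Z").split(".")
    micros = int(frac[:6].ljust(6, "0"))  # keep the first six fractional digits
    base = datetime.strptime(date_part, "%Y-%m-%dT%H:%M:%S")
    return base.replace(microsecond=micros, tzinfo=timezone.utc)

def jump_seconds(new: str, old: str) -> float:
    """Positive: the clock moved forward. Negative: the clock moved back."""
    return (parse_event_time(new) - parse_event_time(old)).total_seconds()

# Values copied from the two "system time has changed" log entries
forward = jump_seconds("2016-05-21T09:27:40.802000000Z",
                       "2016-05-21T09:22:12.545270300Z")
backward = jump_seconds("2016-05-21T09:22:43.211046200Z",
                        "2016-05-21T09:28:11.768054400Z")
print(f"forward jump:  {forward:+.3f} s")
print(f"backward jump: {backward:+.3f} s")
```

Both jumps come out at roughly five and a half minutes, one forward and one back.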

I checked the time configuration of the server –

C:\Users\zrsasiz>w32tm /query /configuration
[Configuration]
<snip>
[TimeProviders]

NtpClient (Local)
DllName: C:\Windows\system32\w32time.dll (Local)
Enabled: 1 (Local)
InputProvider: 1 (Local)
CrossSiteSyncFlags: 2 (Local)
AllowNonstandardModeCombinations: 1 (Local)
ResolvePeerBackoffMinutes: 15 (Local)
ResolvePeerBackoffMaxTimes: 7 (Local)
CompatibilityFlags: 2147483648 (Local)
EventLogFlags: 1 (Local)
LargeSampleSkew: 3 (Local)
SpecialPollInterval: 3600 (Local)
Type: NT5DS (Local)

VMICTimeProvider (Local)
DllName: C:\Windows\System32\vmictimeprovider.dll (Local)
Enabled: 1 (Local)
InputProvider: 1 (Local)

NtpServer (Local)
DllName: C:\Windows\system32\w32time.dll (Local)
Enabled: 0 (Local)
InputProvider: 0 (Local)

Seems to be normal. It was set to pick time from the site DC via NTP (the first entry under TimeProviders) as well as from the ESXi host the VM is running on (the second entry – VM IC Time Provider). I didn’t think much of the second entry because I know all our VMs have the VMware Tools option to sync time from the host to VM unchecked (and I double checked it anyways).

Only one of the mailbox servers was having this jump though. The other mailbox server had a slight jump but not enough to cause any issues. While the problem server had a jump of 6 mins, the ok server had a jump of a few seconds.

I thought I'd check the ESXi hosts of both VMs anyway. Yes, the VMs are not set to sync time from the host, but let’s double check the host times. And bingo! It turns out the ESXi hosts have NTP turned off and hence varying times. The host with the problem server was about 6 mins ahead of the DC, while the host with the ok server was about a minute or less ahead. That matched the time jumps of the VMs too closely to be a coincidence!

So it looked like the Exchange servers were syncing time from the ESXi hosts even though I thought they were not supposed to. I read a bit more about this and realized my understanding of host-VM time sync was wrong (at least with VMware). When you tick/ untick the option to synchronize VM time with the ESX host, all you are controlling is a periodic synchronization from host to VM. This does not control other scenarios where a VM could synchronize time with the host, such as when it moves to a different host via vMotion, has a snapshot taken, is restored from a snapshot, has its disk shrunk, or (tada!) when the VMware Tools service is restarted (like when the VM is rebooted, as was the case here). Interesting.

So that explains what was happening here. When the problem server was rebooted it synced time with the ESXi host, which was 6 mins ahead of the domain time. This was before the Windows Time service kicked in. Once the Windows Time service started, it noticed the incorrect time and corrected it. This time jump confused Exchange. I am thinking it didn’t confuse Exchange directly but rather one of the AD services running on the server, and because of that the Information Store was unable to start.

This is a good article that explains the Windows Time service and its configurations. Covers both registry keys and GPOs. This is another good article that goes into even more detail.

Any Windows machine can be set up to sync time in one of four ways: (1) no syncing! (2) sync from specified NTP servers (3) sync via domain hierarchy (i.e. members sync from a DC in the domain; DCs sync from PDC of the parent domain/ forest root domain) (4) use either of the above (i.e. NTP and domain hierarchy). Default mechanism on domain joined computers is domain hierarchy (the setting is called NT5DS). Stand-alone machines have the default as NTP servers (the setting is called NTP; the default server is time.windows.com though you can change it (and probably recommended that you change it?)).

For machines that are off and on the domain – e.g. laptops – it is better to set their time sync mechanism as any. They needn’t always have contact with the DC to sync time.

When specifying NTP time servers you also specify flags. Check this post for an explanation of the flags. There are four possible flags: 0x01 SpecialInterval; 0x02 UseAsFallbackOnly; 0x04 SymmetricActive; 0x08 Client.

Flag UseAsFallbackOnly means the server is only used if the others are unavailable. Check out this post for an example of this.

Flag SpecialInterval lets you change how often the NTP server is polled. By default the interval is determined by Windows based on the quality of time samples, but you can use the above flag and set the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\TimeProviders\NtpClient\SpecialPollInterval to change the polling interval.

I am not sure what the other two flags do. The Client flag seems to be a commonly used one. Some posts/ articles use it, others don’t. The default time.windows.com setting uses this flag as well as the SpecialInterval.
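Since the flags are just bits, a peer entry such as the default "time.windows.com,0x9" can be decoded mechanically. Below is a small Python sketch of that decoding. The flag values are the four listed above, while the class and function names are my own and not part of any Windows tooling.

```python
from enum import IntFlag

class NtpServerFlag(IntFlag):
    SPECIAL_INTERVAL = 0x01
    USE_AS_FALLBACK_ONLY = 0x02
    SYMMETRIC_ACTIVE = 0x04
    CLIENT = 0x08

def parse_peer(entry: str):
    """Split 'host,0xN' into (host, list of flag names that are set)."""
    host, _, raw = entry.partition(",")
    value = int(raw, 16) if raw else 0  # no flags given means 0
    names = [flag.name for flag in NtpServerFlag if value & flag]
    return host, names

print(parse_peer("time.windows.com,0x9"))
```

0x9 is 0x08 + 0x01, which is why the default entry comes out as Client plus SpecialInterval.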

I spent the better part of this evening trying to sort this issue, but didn’t get anywhere. I don’t want to forget the stuff I learnt while troubleshooting, so here’s a blog post.

This evening I added one of my ESXi hosts to my domain. The other two wouldn’t add, until I discovered that the time on those two hosts was out of sync. I spent some time trying to troubleshoot that but didn’t get anywhere. The NTP client on these hosts was running, the ports were open, the DC (which was also the forest PDC and hence the time keeper) was reachable – but time was still out of sync.

The ntpq command has an interactive mode (which you get into if you run it without any switches; read the manpage for more info). The -p switch tells ntpq to output a list of peers and their state. The KB article above suggests running this command every 2 seconds using the watch command but you don’t really need to do that.

Important points about the output of this command:

If it says “No association ID's returned” it means the ESXi host cannot reach the NTP server. Considering I didn’t get that, it means I have no connectivity issue.

If it says “***Request timed out” it means the response from the NTP server didn’t get through. That’s not my problem either.

If there’s an asterisk before the remote server name (like so) it means there is a huge gap between the time on the host and the time given by the NTP server. Because of the huge gap NTP is not changing the time (to avoid any issues caused by a sudden jump in the OS time). Manually restarting the NTP daemon (/etc/init.d/ntpd restart) should sort it out.

The output above doesn’t show it but one of my problem hosts had an asterisk. Restarting the daemon didn’t help.

The refid field shows the time stream to which the client is syncing. For instance here’s the w32tm output from my domain:

C:\Windows\system32>w32tm /monitor /domain:rakhesh.local
WIN-DC01.rakhesh.local *** PDC ***[10.50.0.20:123]:
    ICMP: 0ms delay
    NTP: +0.0000000s offset from WIN-DC01.rakhesh.local
        RefID: 'LOCL' [0x4C434F4C]
        Stratum: 1
WIN-DC02.rakhesh.local[10.50.1.21:123]:
    ICMP: 1ms delay
    NTP: +0.0127058s offset from WIN-DC01.rakhesh.local
        RefID: WIN-DC01.rakhesh.local [10.50.0.20]
        Stratum: 2
WIN-DC03.rakhesh.local[10.50.0.22:123]:
    ICMP: 1ms delay
    NTP: +0.0183887s offset from WIN-DC01.rakhesh.local
        RefID: WIN-DC01.rakhesh.local [10.50.0.20]
        Stratum: 2

Warning:
Reverse name resolution is best effort. It may not be
correct since RefID field in time packets differs across
NTP implementations and may not be using IP addresses.

Notice the PDC has a refid of LOCL (indicating it is its own time source) while the rest have a refid of the PDC name. My ESXi host has a refid of .INIT. which means it has not received any response from the NTP server (shouldn’t the error message have been something else!?). So that’s the problem in my case.

Obviously the PDC is working because all my Windows machines are keeping correct time from it. So is vCenter. But some of my ESXi hosts aren’t.

I have no idea what’s wrong. After some troubleshooting I left it because that’s when I discovered my domain had some inconsistencies. Fixing those took a while, after which I hit upon a new problem – vCenter clients wouldn’t show me vCenter or any hosts when I log in with my domain accounts. Everything appears as expected under the administrator@vsphere.local account but the domain accounts return a blank.

While double-checking that the domain admin accounts still have permissions to vCenter and SSO I came across the following error:

Great! (The message is “Cannot load the users for the selected domain“).

I am using the vCenter appliance. Digging through the /var/log/messages on this I found the following entries:

Searched Google a bit but couldn’t find any resolutions. Many blog posts suggested removing vCenter from the domain and re-adding but that didn’t help. Some blog posts (and a VMware KB article) talk about ensuring reverse PTR records exist for the DCs – they do in my case. So I am drawing a blank here.

Odd thing is the appliance is correctly connected to the domain and can read the DCs and get a list of users. The appliance uses Likewise (now called PowerBroker Open) to join itself to the domain and authenticate with it. The /opt/likewise/bin directory has a bunch of commands which I used to verify domain connectivity:

vcenter01:/opt/likewise/bin# ./lw-get-dc-list rakhesh.local
Got 3 DCs:
===========
DC 1: Name = 'win-dc01.rakhesh.local', Address = '10.50.0.20'
DC 2: Name = 'win-dc03.rakhesh.local', Address = '10.50.0.22'
DC 3: Name = 'win-dc02.rakhesh.local', Address = '10.50.1.21'

vcenter01:/opt/likewise/bin# ./lw-get-dc-name rakhesh.local
Printing LWNET_DC_INFO fields:
===============================
dwDomainControllerAddressType = 23
dwFlags = 62461
dwVersion = 5
wLMToken = 65535
wNTToken = 65535
pszDomainControllerName = WIN-DC01.rakhesh.local
pszDomainControllerAddress = 10.50.0.20
pucDomainGUID(hex) = 16 A2 83 65 D3 68 F4 48 81 45 01 16 20 42 CC DF
pszNetBIOSDomainName = RAXNET
pszFullyQualifiedDomainName = rakhesh.local
pszDnsForestName = rakhesh.local
pszDCSiteName = COCHIN
pszClientSiteName = COCHIN
pszNetBIOSHostName = WIN-DC01
pszUserName = <EMPTY>

vcenter01:/opt/likewise/bin# ./lw-enum-users
User info (Level-0):
====================
Name: RAXNET\admin
Uid: 1204290036
Gid: 1204290049
Gecos: <null>
Shell: /bin/sh
Home dir: /home/local/RAXNET/admin

User info (Level-0):
====================
Name: RAXNET\guest
Uid: 1204290037
Gid: 1204290050
Gecos: <null>
Shell: /bin/sh
Home dir: /home/local/RAXNET/guest

All looks well! In fact, I added a user to my domain and when I re-ran the lw-enum-users command it correctly picked up the new user. So the appliance can definitely see my domain and get a list of users from it. The problem appears to be in the upper layers.

In /var/log/vmware/sso/ssoAdminServer.log I found the following each time I’d query the domain for users via the SSO section in the web client:

The first of my (hopefully!) many posts on Active Directory, based on the WorkshopPLUS sessions I attended last month. Progress is slow as I don’t have much time, plus I am going through the slides and my notes and adding more information from the Internet and such.

This one’s on the services that are critical for Domain Controllers to function properly.

DHCP Client

In Server 2003 and before the DHCP Client service registers A, AAAA, and PTR records for the DC with DNS

In Server 2008 and above this is done by the DNS Client

Note that only the A, AAAA, and PTR records are registered. Other records are registered by the Netlogon service.

File Replication Services (FRS)

Replicates SYSVOL amongst DCs.

Starting with Server 2008 it is now in maintenance mode. DFSR replaces it.

To check whether your domain is still using FRS for SYSVOL replication, open the DFS Management console and see whether the “Domain System Volume” entry is present under “Replication” (if it is not, see whether it is available for adding to the display). If it is present then your domain is using DFSR for SYSVOL replication.

Alternatively, type the following command on your DC. If the output says “Eliminated” as below, your domain is using DFSR for SYSVOL. (Note this only works with domain functional level 2008 and above).

C:\>dfsrmig /getglobalstate

Current DFSR global state: 'Eliminated'
Succeeded.

Stopping FRS for long periods can result in Group Policy distribution errors as SYSVOL isn’t replicated. Event ID 13568 in FRS log.

DFS Replication (DFSR)

Apart from the dfsrmig command mentioned in the FRS section, the HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\DFSR\Parameters\SysVols\Migrating Sysvols\LocalState registry key can also be checked to see if DFSR is in use (a value of 3 means it is in use).

If a DC is offline/ disconnected from its peers for a long time and Content Freshness Protection is turned on, when the DC is online/ reconnected DFSR might block SYSVOL replications to & from this DC – resulting in Group Policy distribution errors.

Content Freshness Protection is off by default. It needs to be manually turned on for each server.

Content Freshness Protection exists because of the way deletions work.

DFSR is multi-master, like AD, which means changes can be made on any server.

When you delete an item on one server, it can’t simply be deleted because then the item won’t exist any more and there’s no way for other servers to know if that’s the case because the item was deleted or because it wasn’t replicated to that server in the first place.

So what happens is that a deleted item is “tombstoned“. The item is removed from disk but a record for it remains in the DFSR database for 60 days (this period is called the “tombstone lifetime”) indicating this item as being deleted.

During these 60 days other DFSR servers can learn that the item is marked as deleted and thus act upon their copy of the item. After 60 days the record is removed from the database too.

In such a context, say we have a DC that is offline for more than 60 days and say we have other DCs where files were removed from SYSVOL (replicated via DFSR). All the other DCs no longer have a copy of the file nor a record that it is deleted, as 60 days have passed and the file is removed for good.

When the previously offline DC replicates, it still has a copy of the file and it will pass this on to the other DCs. The other DCs don’t remember that this file was deleted (because they don’t have a record of its deletion any more as 60 days have passed) and so will happily replicate this file to their folders – resulting in a deleted file now appearing and causing corruption.

It is to avoid such situations that Content Freshness Protection was invented and is recommended to be turned on.
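To convince myself of the logic, here is a toy Python model of the scenario described above. It is only a sketch under my own assumptions: the class and function names are invented, days are plain integers, and real DFSR does far more than this. It models just the tombstone lifetime and the "block the stale member" rule.

```python
from dataclasses import dataclass, field

TOMBSTONE_DAYS = 60  # tombstone lifetime quoted above

@dataclass
class Replica:
    files: set = field(default_factory=set)
    tombstones: dict = field(default_factory=dict)  # file name -> day deleted

    def delete(self, name: str, today: int) -> None:
        self.files.discard(name)
        self.tombstones[name] = today

    def expire_tombstones(self, today: int) -> None:
        self.tombstones = {n: d for n, d in self.tombstones.items()
                           if today - d <= TOMBSTONE_DAYS}

def sync_from_returning(stale: Replica, healthy: Replica, today: int,
                        days_offline: int, freshness_protection: bool) -> set:
    """Return the files the healthy replica wrongly gets back from the stale one."""
    healthy.expire_tombstones(today)
    if freshness_protection and days_offline > TOMBSTONE_DAYS:
        return set()  # replication with the stale member is blocked
    resurrected = {n for n in stale.files
                   if n not in healthy.files and n not in healthy.tombstones}
    healthy.files |= resurrected
    return resurrected

def scenario(days_offline: int, protected: bool) -> set:
    healthy = Replica({"gpo.pol", "old.pol"})
    stale = Replica({"gpo.pol", "old.pol"})   # goes offline on day 0
    healthy.delete("old.pol", today=10)       # deleted while the other DC is away
    return sync_from_returning(stale, healthy, days_offline, days_offline, protected)

print(scenario(30, False))   # back within 60 days: tombstone still present
print(scenario(100, False))  # back after 100 days, no protection
print(scenario(100, True))   # back after 100 days, protection on
```

Only the middle case brings the deleted file back, which is exactly the situation Content Freshness Protection is meant to prevent.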

Here’s a good blog post from the Directory Services team explaining Content Freshness Protection.

DNS Client

For Server 2008 and above registers the A, AAAA, and PTR records for the DC with DNS (notice that when you change the DC IP address you do not have to update DNS manually – it is updated automatically. This is because of the DNS Client service).

Note that only the A, AAAA, and PTR records are registered. Other records are registered by the Netlogon service.

DNS Server

The glue for Active Directory. DNS is what domain controllers use to locate each other. DNS is what client computers use to find domain controllers. If this service is down both these functions fail.

Kerberos Key Distribution Center (KDC)

Required for Kerberos 5.0 authentication. AD domains use Kerberos for authentication. If the KDC service is stopped Kerberos authentication fails.

NTLM is not affected by this service.

Netlogon

Maintains the secure channel between DCs and domain members (including other DCs). This secure channel is used for authentication (NTLM and Kerberos) and DC replication.

Writes the SRV and other records to DNS. These records are what domain members use to find DCs.

The records are also written to a file %systemroot%\system32\config\Netlogon.DNS. If the DNS server doesn’t support dynamic updates then the records in this text file must be manually created on the DNS server.

Windows Time

The Windows Time service on every domain member looks to the DC that authenticates it for time updates.

DCs in the domain look to the domain PDC for time updates.

Domain PDCs look to the PDC of the domain above them (or of a sibling domain), except the forest root domain PDC, which gets time from an external source (hardware source, Internet, etc).

From this link: there are two registry keys HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\Config\MaxPosPhaseCorrection and HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\Config\MaxNegPhaseCorrection that restrict the time updates accepted by the Windows Time service to the number of seconds defined by these values (the maximum and minimum range). This can be set directly in the registry or via a GPO. The recommended value is 172800 (i.e. 48 hours).
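As a sanity check on that number, and on how I read the two keys, here is a tiny Python sketch. It assumes a positive offset means the clock has to move forward and a negative one means it has to move back. The function name is mine and the real service logic is certainly more involved than this.

```python
MAX_PHASE_CORRECTION = 172800  # seconds; the recommended value quoted above

def correction_accepted(offset_seconds: float,
                        max_pos: int = MAX_PHASE_CORRECTION,
                        max_neg: int = MAX_PHASE_CORRECTION) -> bool:
    """offset > 0: clock must move forward; offset < 0: clock must move back."""
    if offset_seconds >= 0:
        return offset_seconds <= max_pos
    return -offset_seconds <= max_neg

print(MAX_PHASE_CORRECTION / 3600)          # hours covered by the limit
print(correction_accepted(-328))            # about five and a half minutes back
print(correction_accepted(3 * 24 * 3600))   # three days forward
```

So with the recommended value a correction of a few minutes goes through, while anything beyond two days in either direction is refused.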

w32tm

The w32tm command can be used to manage time. For instance:

To get an idea of the time situation in the domain (who is the master time keeper, what is the offset of each of the DCs from this time keeper):

w32tm /monitor

To ask the Windows Time service to resync as soon as possible (the command can target a remote computer too via the /computer: switch)

w32tm /resync

Same as above but before resyncing redetect any network configuration changes and rediscover the sources:

w32tm /resync /rediscover

To get the status of the local computer (use the /computer: switch to target a different computer)

w32tm /query /status

To show what time sources are being used:

w32tm /query /source

To show who the peers are:

w32tm /query /peers

To show the current time zone:

w32tm /tz

You can’t change the time zone using this command; you have to do:

tzutil /s "Time Zone Name"

On the PDC in the forest root domain you would typically run a command like this if you want it to get time from an NTP pool on the Internet:

specify a list of peers to sync time from (in this example the NTP Pool servers on the Internet);

the /update switch tells w32tm to update the Windows Time service with this configuration change;

the /syncfromflags:MANUAL tells the Windows Time service that it must only sync from these sources (other options such as “DOMHIER” tells it to sync from the domain peers only, “NO” tells it sync from none, “ALL” tells it to sync from both the domain peers and this manual list);

the /reliable:YES switch marks this machine as special in that it is a reliable source of time for the domain (read this link on what gets set when you set a machine as RELIABLE).

Note: You must manually configure the time source on the PDC in the forest root domain and mark it as reliable. If that server were to fail and you transfer the role to another DC, be sure to repeat the step.
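For anyone scripting this, the switches described above can be assembled into a command line programmatically. Below is a hedged Python sketch that only builds the argument list and does not run anything. The function name is invented, the NTP Pool host names are placeholders of mine, and /manualpeerlist is the w32tm switch I am assuming for the peer list.

```python
def pdc_time_config(peers, reliable=True):
    """Build the argument list for a 'w32tm /config' call on the forest root PDC."""
    return [
        "w32tm", "/config",
        "/manualpeerlist:" + " ".join(peers),          # space-separated NTP servers
        "/syncfromflags:MANUAL",                       # sync only from the manual list
        "/reliable:" + ("YES" if reliable else "NO"),  # advertise as a reliable source
        "/update",                                     # have the service pick up the change
    ]

# Hypothetical peers: these host names are an assumption, substitute your own
args = pdc_time_config(["0.pool.ntp.org", "1.pool.ntp.org"])
print(args)
```

I have not run the result from Python myself. The idea is simply that the same function can be reused on whichever DC takes over the PDC role later.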

On other machines in the domain you would run a command like this:

w32tm /config /update /syncfromflags:DOMHIER /reliable:NO

This tells those DCs to follow the domain hierarchy (and only the domain hierarchy) and that they are not reliable time sources (this switch is not really needed if these other DCs are not PDCs).

Active Directory Domain Services (AD DS)

Provides the DC services. If this service is stopped the DC stops acting as a DC.

Pre-Server 2008 this service could not be stopped while the OS was online. But since Server 2008 it can be stopped and started.

The Active Directory Database Mounting Tool was new to me so here’s a link to what it does. It’s a pretty cool tool. Starting from Server 2008 you can take AD DS and AD LDS snapshots via the Volume Shadow Copy Service (VSS) (I am writing a post on VSS side by side so expect to see one soon). This makes use of the NTDS VSS writer which ensures that consistent snapshots of the AD databases can be taken. The AD snapshots can be taken manually via the ntdsutil snapshot command or via backup software or even via images of the whole system. Either way, once you have such snapshots you can mount the one(s) you want via ntdsutil and point the Active Directory Database Mounting Tool at it. As the tool name says, it “mounts” the AD database in the snapshot and exposes it as an LDAP server. You can then use tools such as ldp.exe or AD Users and Computers to go through this instance of the AD database. More info on this tool can be found at this and this link.

Active Directory Web Services (AD WS)

AD WS is what the PowerShell Active Directory module connects to.

It is also what the new Active Directory Administrative Center (which in turn uses PowerShell) connects to.

AD WS is installed automatically when the AD DS or AD LDS roles are installed. It is only activated once the server is promoted to a DC or if an AD LDS instance is created on it.