I recommend sticking to IT Pro features; the consumer side’s covered and the biggest value is your Administrator experience. The NDA is not off - I still cannot comment on the future of Windows 8 or tell you if we already have plans to do X with Y. This is a one-way channel from you to us (to the developers).

Question

We were chatting here about password synchronization tools that capture password changes on a DC and send the clear text password to some third party app. I consider that a security risk...but then someone asked me how the password is transmitted between a domain member workstation and a domain controller when the user performs a normal password change operation (CTRL+ALT+DEL and Change Password). I suppose the client uses some RPC connection, but it would be great if you could point me to a reference.

Answer

Windows can change passwords many ways - it depends on the OS and the component in question.

1. For the specific case of using CTRL+ALT+DEL because your password has expired or you just felt like changing your password:

If you are using a modern OS like Windows 7 with AD, the computer uses the Kerberos protocol end to end. This starts with a normal AS_REQ logon, but to a special service principal name of kadmin/changepw, as described in http://www.ietf.org/rfc/rfc3244.txt.

The computer first contacts a KDC over port 88, then communicates over port 464 to send along the special AP_REQ and AP_REP. You are still using Kerberos cryptography and sending an encrypted payload containing a KRB_PRIV message with the password. Therefore, to get to the password, you have to defeat Kerberos cryptography itself, which means defeating the crypto and defeating the key derived from the cryptographic hash of the user's original password. Which has never happened in the history of Kerberos.

The parsing of this kpasswd traffic is currently broken in NetMon's latest public parsers, but even when you parse it in WireShark, all you can see is the encryption type and a payload of encrypted goo. For example, here is that Windows 7 client talking to a Windows Server 2008 R2 DC, which means AES-256:

This uses “RPC over SMB with Named Pipes” with RPC packet privacy. You are using NTLM v2 by default (unless you set LMCompatibility unwisely) and you are still double-protected (the payload and packets), which makes it relatively safe. Definitely not as safe as Win7 though – just another reason to move forward.

You can disable NTLM in the domain if you have Win2008 R2 DCs and XP is smart enough to switch to using Kerberos here:

Question

If I try to remove the AD Domain Services role using ServerManager.msc, it blocks me with this message:

But if I remove the role using Dism.exe, it lets me continue:

This completely hoses the DC and it no longer boots normally. Is this a bug?

And - hypothetically speaking, of course - how would I fix this DC?

Answer

Don’t do that. :)

Not a bug, this is expected behavior. Dism.exe is a pure servicing tool; it knows nothing more of DCs than the Format command does. ServerManager and servermanagercmd.exe are the tools that know what they are doing. Update: Although as Artem points out in the comments, we want you to use the Server Manager PowerShell and not servermanagercmd, which is on its way out.

To fix your server, pick one:

Boot it into DS Repair Mode with F8 and restore your system state non-authoritatively from backup (you can also perform a bare metal restore if you have that capability - no functional difference in this case). If you do not have a backup and this is your only DC, update your résumé.

Boot it into DS Repair Mode with F8 and use dcpromo /forceremoval to finish what you started. Then perform metadata cleanup. Then go stand in the corner and think about what you did, young man!

Question

We are getting Event ID 4740s (account lockout) for the AD Guest account throughout the day, which is raising alerts in our audit system. The Guest account is disabled, expired, and even renamed. Yet various clients keep locking out the account and creating the 4740 event. I believe I've traced it back to the occasional attempt of a local account attempting to authenticate to the domain. Any thoughts?

Answer

You'll see that when someone has set a complex password on the Guest account, using NET USER for example, rather than having it be the null default. The clients never know what the guest password is, they always assume it's null like default - so if you set a password on it, they will fail. Fail enough and you lock out (unless you turn that policy off and replace it with intrusion protection detection and two-factor auth). Set it back to null and you should be ok. As you suspected, there a number of times when Guest is used as part of a "well, let's try that" algorithm:

To set it back you just use the Reset Password menu in Dsa.msc on the guest account, making sure not to set a password and clicking ok. You may have to adjust your domain password policy temporarily to allow this.

As for why it's "locking out" even though it's disabled and renamed:

It has a well-known SID (S-1-5-21-domain-501) so renaming doesn’t really do anything except tick a checkbox on some auditor's clipboard

Disabled accounts can still lock out if you keep sending bad passwords to them. Usually no one notices though, and most people are more concerned about the "account is disabled" message they see first.

Question

What are the steps to change the "User Account" password set when the Network Device Enrollment Service (NDES) is installed?

Answer

When you first install the Network Device Enrollment Service (NDES), you have the option of setting the identity under which the application pool runs to the default application pool identity or to a specific user account. I assume that you selected the latter. The process to change the password for this user account requires two steps -- with 27 parts (not really…).

1. First, you must reset the user account's password in Active Directory Users and Computers.

2. Next, you must change the password configured in the application pool Advanced Settings on the NDES server.

a. In IIS manager, expand the server name node.

b. Click on Application Pools.

c. On the right, locate and highlight the SCEP application pool.

d. In the Action pane on the right, click on Advanced Settings....

e. Under Process Model click on Identity, then click on the … button.

f. In the Application Pool Identity dialog box, select Custom account and then click on Set….

g. Enter the custom application pool account name, and then set and confirm the password. Click Ok, when finished.

h. Click Ok, and then click Ok again.

i. Back on the Application Pools page, verify that SCEP is still highlighted. In the Action pane on the right, click on Recycle….

j. You are done.

Normally, you would have to be concerned with simply resetting the password for any service account to which any digital certificates have been assigned. This is because resetting the password can result in the account losing access to the private keys associated with those certificates. In the case of NDES, however, the certificates used by the NDES service are actually stored in the local computer's Personal store and the custom application pool identity only has read access to those keys. Resetting the password of the custom application pool account will have no impact on the master key used to protect the NDES private keys.

Question

If I have only one domain in my forest, do I need a Global Catalog? Plenty of documents imply this is the case.

Answer

All those documents saying "multi-domain only" are mistaken. You need GCs - even in a single-domain forest - for the following:

(Update: Correction on single-domain forest logon made, thanks for catching that Yusuf! I also added a few more breakage scenarios)

Perversely, if you have enabled IgnoreGCFailures (http://support.microsoft.com/kb/241789); turning it on removes universal groups from the user security token if there is no GC, meaning they will logon but not be able to access resources they accessed fine previously).

If your users logon with UPNs and try to change their password (they can still logon in a single domain forest with UPN or NetBiosDomain\SamAccountName style logons).

Even if you use Universal Group Membership Caching to avoid the need for a GC in a site, that DC needs a GC to update the cache.

MS Exchange is deployed (All versions of Exchange services won't even start without a GC).

Using the built-in Find in the shell to search AD for published shares, published DFS links, published printers, or any object picker dialog that provides option "entire directory" will fail.

DPM agent installation will fail.

AD Web Services (aka AD Management Gateway) will fail.

CRM searches will fail.

Probably other third parties of which I'm not aware.

We stopped recommending that customers use only handfuls of GCs years ago - if you get an ADRAP or call MS support, we will recommend you make all DCs GCs, unless you have an excellent reason not. Our BPA tool states that you should have at least one GC per AD site: http://technet.microsoft.com/en-us/library/dd723676(WS.10).aspx.

Answer

The symlink replicates; however, the underlying data does not replicate just because there is a symlink. If the data is not stored within the RF, you end up with a replicated symlink to nowhere:

Server 1, replicating a folder called c:\unfiltersub. Note how the symlink points to a file that is not in the scope of replication:

Server 2, the symlink has replicated - but naturally, it points to an un-replicated file. Boom:

If the source data is itself replicated, you’re fine. There’s no real way to guarantee that though, except preventing users from creating files outside the RF by using permissions and FSRM screens. If your end users can only access the data through a share, they are in good shape. I'd imagine they are not the ones creating symlinks though. ;-)

Question

I read your post on career development. There are many memory techniques and I know everyone is different, but what do you use?

[A number of folks asked this question - Neditor]

Answer

When I was younger, it just worked - if I was interested in it, I remembered it. As I get older and burn more brain cells though, I find that my best memory techniques are:

Periodic skim and refresh. When I have learned something through deep reading and hands on, I try to skim through core topics at least once a year. For example, I force myself to scan the diagrams in the all the Win2003 Technical Reference A-Z sections, and if I can’t remember what the diagram is saying, I make myself read that section in detail. I don’t let myself get too stale on anything and try to jog it often.

Mix up the media. When learning a topic, I read, find illustrations, and watch movies and demos. When there are no illustrations, I use Visio to make them for myself based on reading. When there are no movies, I make myself demo the topics. My brain seems to retain more info when I hit it with different styles on the same subject.

I teach and publically write about things a lot. Nothing hones your memory like trying to share info with strangers, as the last thing I want is look like a dope. It makes me prepare and check my work carefully, and that natural repetition – rather than forced “read flash cards”-style repetition, really works for me. My brain runs best under pressure.

Your body is not a temple (of Gozer worshipers). Something of a cliché, but I gobble vitamins, eat plenty of brain foods, and work out at least 30 minutes every morning.

I hope this helps and isn’t too general. It’s just what works for me.

Other Stuff

Have $150,000 to spend on a camera, a clever director who likes FPS gaming, and some very fit paint ballers? Go make a movie better than this. Watch it multiple times.

Hi guys, Joji Oshima here again. Today I want to talk about using Custom Views in the Windows Event Viewer to filter events more effectively. The standard GUI allows some basic filtering, but you have the ability to drill down further to get the most relevant data. Starting in Windows Vista/2008, you have the ability to modify the XML query used to generate Custom Views.

Limitations of basic filtering:

Basic filtering allows you to display events that meet certain criteria. You can filter by the event level, the source of the event, the Event ID, certain keywords, and the originating user/computer.

Basic Filter for Event 4663 of the security event logs

You can choose multiple events that match your criteria as well.

Basic filter for Event 4660 & 4663 of the security event logs

A real limitation to this type of filtering is the data inside each event can be very different. 4663 events appear when auditing users accessing objects. You can see the account of the user, and what object they were accessing.

Sample 4663 events for users ‘test5’ and ‘test9’

If you want to see events that are only about user ‘test9’, you need a Custom View and an XML filter.

Using XML filtering and Custom Views:

Custom Views using XML filtering are a powerful way to drill through event logs and only display the information you need. With Custom Views, you can filter on data in the event. To create a Custom View based on the username, right click Custom Views in the Event Viewer and choose Create Custom View.

Click the XML Tab, and check Edit query manually. Click ok to the warning popup. In this window, you can type an XML query. For this example, we want to filter by SubjectUserName, so the XML query is:

After you type in your query, click the Ok button. A new window will ask for a Name & Description for the Custom View. Add a descriptive name and click the Ok button.

You now have a Custom View for any security events that involve the user test9.

Take It One Step Further:

Now that we’ve gone over a simple example, let’s look at the query we are building and what else we can do with it. Using XML, we are building a SELECT statement to pull events that meet the criteria we specify. Using the standard AND/OR Boolean operators, we can expand upon the simple example to pull more events or to refine the list.

Perhaps you want to monitor two users - test5 and test9 - for any security events. Inside the search query, we can use the Boolean OR operator to include users that have the name test5 or test9.

The query below searches for any security events that include test5 or test9.

Event Metadata:

At this point you may be asking, where did you come up with SubjectUserName and what else can I filter on? The easiest way to find this data is to find a specific event, click on the details tab, and then click the XML View radio button.

From this window, we can see the structure of the Event’s XML metadata. This event has a <System> tag and an <EventData> tag. Each of these data names can be used in the filter and combined using standard Boolean operators.

With the same view, we can examine the <System> metadata to find additional data names for filtering.

Now let’s say we are only interested in a specific Event ID involving either of these users. We can incorporate an AND Boolean to filter on the System data.

Broader Filtering:

Say you wanted to filter on events involving test5 but were unsure if it would be in SubjectUserName, TargetUserName, or somewhere else. You don’t need to specify the specific name that the data can be in, but just search that some data in <EventData> contains test5.

The query below looks for events that any data in <EventData> equals test5.

Multiple Select Statements:

You can also have multiple select statements in your query to pull different data in the same log or data in another log. You can specify which log to pull from inside the <select> tag, and have multiple <select> tags in the same <query> tag. The example below will pull 4663 events from the security event log and 1704 events from the application event log.

XPath 1.0 Limitations:

Windows Event Log supports a subset of XPath 1.0. There are limitations to what functions work in the query. For instance, you can use the "position", "Band", and "timediff" functions within the query but other functions like "starts-with" and "contains" are not currently supported.

Conclusion:

Using Custom Views in the Windows Event Log can be a powerful tool to quickly access relevant information on your system. XPath 1.0 has a learning curve but once you get a handle on the syntax, you will be able to write targeted Custom Views.

Joji "the sieve" Oshima

[Check out pseventlogwatcher if you want to combine complex filters with monitoring and automation. It’s made by AskDS superfan Steve Grinker: http://pseventlogwatcher.codeplex.com/ – Neditor]

Hi. Mark again. As part of my role in Premier Field Engineering, I’m sometimes called upon to visit customers when they have a critical issue being worked by CTS, needing another set of eyes. For today’s discussion, I’m going to talk you through, one such visit.

It was a dark and stormy night …

Well not really – it was mid-afternoon but these sorts of things always have that sense of drama.

The Problem

Custom applications were hard coded to use the PDC Emulator (PDCe) for authentication – a strategy the customer later abandoned to eliminate a single point of failure. The issue was hot because the PDCe was not processing authentication requests after a reboot.

The customer had noticed lsass.exe consuming a lot of CPU and this is where CTS were focusing their efforts.

The Investigation

Starting with the Directory Service event logs, I noticed the following:

Event Type:Information

Event Source:NTDS Replication

Event Category:Replication

Event ID:1555

Date:<Date>

Time:<Time>

User:NT AUTHORITY\ANONYMOUS LOGON

Computer:<Name of PDCe>

Description:

The local domain controller will not be advertised by the domain controller locator service as an available domain controller until it has completed an initial synchronization of each writeable directory partition that it holds. At this time, these initial synchronizations have not been completed.

The synchronizations will continue.

also:

Event Type:Warning

Event Source:NTDS Replication

Event Category:Replication

Event ID:2094

Date:<Date>

Time:<Time>

User:NT AUTHORITY\ANONYMOUS LOGON

Computer:<Name of PDCe>

Description:

Performance warning: replication was delayed while applying changes to the following object. If this message occurs frequently, it indicates that the replication is occurring slowly and that the server may have difficulty keeping up with changes.

A common reason for seeing this delay is that this object is especially large, either in the size of its values, or in the number of values. You should first consider whether the application can be changed to reduce the amount of data stored on the object, or the number of values. If this is a large group or distribution list, you might consider raising the forest version to Windows Server 2003, since this will enable replication to work more efficiently. You should evaluate whether the server platform provides sufficient performance in terms of memory and processing power. Finally, you may want to consider tuning the Active Directory database by moving the database and logs to separate disk partitions.

If you wish to change the warning limit, the registry key is included below. A value of zero will disable the check.

With this information in hand, I had a chat with the customer hoping we’d identify a relevant change in the environment leading up to the outage. It became apparent they’d configured a policy for deploying RDP session certificates. Furthermore, they’d noticed clients receiving many of these certificates instead of the expected one.

RDP session certificates are Secure Sockets Layer (SSL) certificates issued to Remote Desktop servers. It is also possible to deploy RDP session certificates to client operating systems such as Windows Vista and Windows 7. More on this later…

The customer and I examined a sample client and found 285 certificates! In addition to this unusual behaviour, the certificates were being published to Active Directory. There were 3700 affected clients – approx. 1 million certificates published to AD!

The Story So Far

We’ve injected huge amounts of certificate data into the userCertificate attribute of computer objects, we’ve got replication backlog due to memory allocation issues and the DC can’t complete an initial sync before advertising itself as a DC.

What Happened Next Uncle Mark?!

The CTS engineer back at home base wanted to gather some debug logging of LSASS.exe. While attempting to gather such a log, the PDCe became completely unresponsive and we had to reboot.

While the PDCe rebooted, the customer disabled the policy responsible for deploying RDP session certificates.

After the reboot, the PDCe had stopped logging event 1079 (for memory allocation failures) but in addition to event 1555 and 2094, we were now seeing:

Event TypeWarning

Event Source:NTDS Replication

Event Category:DS RPC Client

Event ID:1188

Date:<Date>

Time:<Time>

User:NT AUTHORITY\ANONYMOUS LOGON

Computer:<Name of PDCe >

Description:

A thread in Active Directory is waiting for the completion of a RPC made to the following domain controller.

Domain controller:

<_msdcs DNS record of replication partner>

Operation:

get changes

Thread ID:

<Thread ID>

Timeout period (minutes):

5

Active Directory has attempted to cancel the call and recover this thread.

Both of these changes require a reboot. The customer was hesitant to reboot again and while they thought it over, initial sync completed.

With the PDCe authenticating clients, I headed home to get some sleep. The customer had disabled the RDP session certificate deployment policy and was busy clearing the certificate data out of computer objects in Active Directory.

Why?

The next day, I went looking for root cause. The customer had followed some guidance to deploy the RDP session certificates. Some of the guidance noted during the investigation is posted here:

I set up a test environment and walked through the guidance. After doing so, I did not experience the issue. I was getting a single certificate no matter how often I would reboot or apply Group Policy. In addition, RDP session certificates were not being published in Active Directory. Publishing in Active Directory is easily explained by this checkbox:

An examination of the certificate template confirmed they had this checked.

So why were clients in the customer environment receiving multiple certificates while clients in my test environment received just one?

The Win

I noticed the following point in the guidance being followed by the customer:

A bit of an odd recommendation. Sure enough, the customer’s template had different names for “Template display name” and “Template name”. I changed my test environment to make the same mistake and suddenly I had a repro – a new certificate on every reboot and policy refresh.

Some research revealed that this was a known issue. One of these fields checks whether an RDP session certificate exists while the other field obtains a new certificate. Giving both fields the same name works around the problem.

Conclusion

So in the aftermath of this incident, there are some general recommendation that anyone can take to help avoid this kind of situation.

Hello guys and gals, Kim Nichols here with my first AskDS post. While deciding on a title, I did a quick search on the word "heck" on our AskDS blog to see if Ned was going to give me any grief. Apparently, we "heck" a lot around here, so I guess it's all good. :-)

I'm hoping to shed some light on USMT's /genmigxml switch and uncover the truth behind which XML files must be included for both scanstate and loadstate. I recently had a USMT 4 case where the customer was using the /genmigxml switch during scanstate to generate a custom XML file, mymig.xml. After creating the custom XML, the file was added to the scanstate command via the /i:mymigxml switch along with any other custom XML files. When referencing the file again on loadstate, loadstate failed with errors similar to the following:

From this error, it appears that user KIMN\test2 is invalid for some reason. What is interesting is if that user logs on to the computer prior to running loadstate, loadstate completes successfully. Requiring all users to log on prior to migrating their data is not recommended and can cause issues with application migration.

I did some research to get a better understanding of the purpose behind the /genmigxml switch and why we hadn't received more calls on this issue. Here's what I found:

“This option specifies that the ScanState command should use the document finder to create and export an .xml file that defines how to migrate all of the files found on the computer on which the ScanState command is running. The document finder, or MigXmlHelper.GenerateDocPatterns helper function, can be used to automatically find user documents on a computer without authoring extensive custom migration .xml files.”

"In USMT 4.0, the MigXmlHelper.GenerateDocPatterns function can be used to automatically find user documents on a computer without authoring extensive custom migration .xml files. This function is included in the MigDocs.xml sample file downloaded with the Windows AIK. "

We can use /genmigxml to get an idea of what the migdocs.xml file is going to collect for a specific user. We don't specifically document what you should do with the generated XML besides review it. Logic might lead us to believe that, similar to the /genconfig switch, we should generate this XML file and include it on both our scanstate and our loadstate operations if we want to make modifications to which data is gathered for a specific user. This is where we run into the issue above, though.

If we take a look inside this XML file, we see a list of locations from which scanstate will collect documents. This list includes the path for each user profile on the computer. Here's a section from mymigxml.xml in my test environment. Notice that this is the same user from my loadstate log file above.

So, if including this file generates errors, why use it? The answer is /genmigxml was only intended to provide a sample of what will be migrated using the standard XML files. The XML is machine-specific and not generalized for use on multiple computers. If you need to alter the default behavior of migdocs.xml to exclude or include files/folders for specific users on a specific computer, modify the file generated via /genmigxml for use with scanstate. This file contains user-specific profile paths so don't include it with loadstate.

But wait… I thought all XML files had to be included in both scanstate and loadstate?

The actual answer is it depends. In the USMT 4.0 FAQ, we specify including the same XML files for both scanstate and loadstate. However, immediately following that sentence, we state that you don't have to include the Config.xml on loadstate unless you want to exclude some files that were migrated to the store.

The more complete answer is the default XML files (migapp.xml & migdocs.xml) need to be included in both scanstate and loadstate if you want any of the rerouting rules to apply; for instance, migrating from one version of Office to another. Because migapp.xml & migdocs.xml transform OS and user data to be compatible with a different version of the OS/Office, you must include both files on scanstate and on loadstate.

As for your custom XML files (aside from the one generated from /genmigxml), these only need to be specified in loadstate if you are rerouting files or excluding files that were migrated to the store from migrating down to the new computer during loadstate.

To wrap this up, in most cases migdocs.xml migrates everything you need. If you are curious about what will be collected you can run /genmigxml to find out, but the output is computer-specific, you can’t use it without modification.

The Internet is a public toilet of misunderstanding, opinions masquerading as facts, and general ignorance.

Nothing wrong with the first point; we’re outnumbered at least a thousand to one, so it’s natural for advertising to target the majority. The second point I can’t abide by; I despise misinformation.

Nothing has changed with my NDA - I cannot discuss Windows 8 in detail, speak of the future, or otherwise get myself fired. Nevertheless, I can point you to accurate content that’s useful to an IT Professional craving more than just the new touchscreen shell for tablets. My links talk a little Windows Server and show features that Mom won’t be using.

So, in vague order and with no regard to the features being Directory Services or not, here are the goods. Some are movies and PowerPoint slides, some are text. Some are Microsoft and some are not. Many are buried in the //Build site. I added some exposition to each link so I don’t feel so dirty.

Mark Minasi's September NewsletterAn excellent take from the longtime Windows technology writer (this fits as both intro and AD details, but I recommend it especially if you're a DS geek - especially if you virtualize DCs)

Remember, everything is subject to change and refers only to the Developer Preview release from the //Build conference; Windows 8 isn’t even in beta yet. Grab the client or server and see for yourself.

And no matter what link you click, I don’t recommend reading the comments. See point 2.

Mark here again. Following a recent experience in the field (yes, Premier Field Engineers leave the office), I thought it’d be useful to discuss Active Directory Topology and how it influences DFS Namespace and DFS Folder referrals.

Let’s look at the following environment for the purposes of our discussion

Let’s suppose that the desired referral behaviour is for clients to use local DFS targets, then the DFS target in the hub site and finally, any other target.

Note: DFS Namespace and Folder referrals are site-costed by default in Windows Server 2008 or later. The feature is also available in Windows Server 2003 (SP1 or later) but is disabled by default. The use of site-costed referrals is controlled by the following registry value

Great! This is pretty much what we’d designed for – the local target first, the hub second and random ordering of equally costed targets after that. The exception is DFS Target B, which cannot be costed without a site-link between site Spoke-B and any other site.

Scenario B

Clients assigned an IP Address in the subnet 192.168.2.0/24 will associate themselves with the site Spoke-B. The DFS referral process will offer them the ordered list

In this scenario, we correctly receive the local DFS target heading the list but the rest of the referral order is random. Without a site-link from the site Spoke-B to any other site, DFS cannot calculate the cost of targets beyond the client site.

Scenario C

Clients assigned an IP Address in the subnet 192.168.3.0/24 do not associate with any site. The pattern seen in the site diagram above suggests the client should associate with the site Spoke-C but the missing subnet definition leaves clients in limbo. In fact, nltest.exe will show you this

Here the local target is ordered first. DFS Target E and DFS Target Hub have the same cost and are ordered next in random order. This is because sites Hub, Spoke-D and Spoke-E are all linked with the same site-link and therefore have the same cost.

DFS Target A and DFS Target C are offered in random order following DFS Target E and DFS Target Hub – again because they have the same cost. Lastly, the un-costed DFS Target B is ordered.

Clients assigned an IP Address in the subnet 192.168.5.0/24 will associate themselves with the site Spoke-E and experience the same referral order as clients in site Spoke-D except the position of DFS Target D and DFS Target E in the referral order will be swapped.

This is close to the design goal but the site-link connecting three sites may cause DFS Target Hub to appear slightly out of order.

Conclusion

I’ve seen many poorly managed Active Directory topologies – most often when Domain Controllers reside in a central site. A well-defined topology is important for other reasons than DC replication/location – DFS referrals being a big one. Without properly defined site-links, subnets and site-to-subnet mappings, users may find themselves unwittingly directed to a file server in Mordor.

It’s harder to let go of old components and protocols than dropping old habits. But, I’m falling back to an old habit myself…there goes the New Year resolution.

Quite recently we were faced with a new aspect of an old story. We hoped this problem would cease to exist as customers move forward with Kerberos-based solutions and other methods that facilitate Kerberos, such as smartcard PKINIT.

Yes, there are still some areas where we have to use NTLM for the sake of compatibility or absence of a domain controller. One of the most popular scenarios is disconnected clients using RPC over HTTP to connect to an Exchange mailbox. Another one is web proxy servers - which still often use NTLM although they and most browsers support Kerberos also.

With RPC over HTTP you have two discrete NTLM authentications: the outer HTTP session is authenticated on the frontend server and the inner RPC authentication is done on the mailbox server. The NTLM load from proxy servers can be even worse - as each TCP session has to be authenticated - and some browsers frequently recycle their sessions.

One way or the other, you end up with a high rate of NTLM authentication requests. And you may have already found the “MaxConcurrentAPI“ parameter, which is the number of concurrent NTLM authentications processed by the server. Historically there has been constant talk about a default of 2. However, the defaults are quite different:

Member-Workstation: 1

Member-Server: 2

Domain Controller: 1

The limit applies per Secure Channel. Members can only have one secure channel to a DC in the domain of which they are a member. Domain Controllers have one Secure Channel per trusted domain. However, as many customers follow a functional domain model of “user domains” and “resource domains”, the list of domains actually used for authentication is low and thus DCs are limited to 1 concurrent authentication for a certain “user domain”. Check out this diagram:

In this diagram, you see authentication requests started against servers in the left-hand forest as colored boxes by users in the right-hand forest. We are using the default values of MaxConcurrentAPI. The requests are forwarded along the trust paths to the right-hand forest. The trust paths used are shown by the arrows.

Now you see that on each resource forest DC up to 2 requests from member resource servers are queued. On the downstream DC, you get a maximum of 1 request from the grand-child domain. The same applies to the forest root DC. In this case, the only active authentication call for forest 1 is for the forest 2 grand-child domain, shown with brown API slots and arrows. Now that’s a real convoy…

The hottest link is between the forest root domains as every NTLM request needs to travel through the secure channels of forest1 root DCs with forest2 root DCs.

From the articles you may know “MaxConcurrentAPI” can be increased to 10 with a registry change. Well, Windows Server 2008 and Windows Server 2008 R2 have an update which pushes the limit to 150:

975363 A time-out error occurs when many NTLM authentication requests are sent from a computer that is running Windows Server 2008 R2, Windows 7, Windows Server 2008, or Windows Vista in a high latency network

This should be of some help… In addition, Windows Server 2008 and later include a performance object called ”Netlogon” which allows you monitoring the throughput, load and duration of NTLM authentication requests. You can add that to Windows Server 2003 using an update:

928576 New performance counters for Windows Server 2003 let you monitor the performance of Netlogon authentication

The article also offers a description of the counters. When you track the performance object you notice each secure channel is visible as a separate instance. This allows you to track activity per domain, what DCs are used and whether there are frequent fail-overs.

Beyond the article, these are our recommendations regarding performance baselines and alerts:

Performance counter

Recommendation

Semaphore Waiters

All Semaphores are busy, we have threads and thus logons waiting in the queue. This counter is a candidate for a warning.

Semaphore Holders

This is the number of currently active callers. This is a candidate for a baseline to monitor. If this is approaching your maximum setting in baselines, you need to act.

Semaphore Acquires

This counts the total # of requests over this secure channel. When the secure channel fails and is reestablished, the count restarts from 0. Check the _Total instance for a counter for the whole server. Good to monitor the trend in baselines.

Semaphore Timeouts

An authentication thread has hit the time-out for the waiting and the logon was denied. So the logon was slow, and then it failed. This is a very bad user experience and the secure channel is overloaded, hung or broken. Also check the _Total instance.

This is ALERT material.

Average Semaphore Hold Time

This should provide the average response time quite nicely. This is also a candidate for baseline monitoring for trends.

When it comes to discussing secure channels and maximum concurrency and queue depth, you also have to talk about how the requests are routed. Within a forest, you notice that the requests are sent directly to the target user domain.

When Netlogon finds that the user account is from another forest, it however has to follow the trust path, similar to what a Kerberos client would do (just the opposite direction). So the requests are forwarded to the parent domain and eventually arrive at the forest root DCs and from there across the forest boundary. You can easily imagine the Netlogon Service queues and context items look like rush hour at the Frankfurt airport.

So who cares?

You might say that besides the domains becoming bigger nowadays, there’s not a lot of news for folks running Exchange or big proxy server farms. Well, recently we became aware of a new source of NTLM authentication requests that was in the system for quite some time, but that now has reared its head. Recently customers have decided to turn this on, perhaps due to recommendations in a few of our best practices guides. We’re currently working on having these updated.

RPC Interface Restriction was introduced in Windows XP Service Pack 2 and Windows Server 2003 Service Pack 1 and offers the options to force authentication for all RPC Endpoint Mapper requests. The goal was to prevent anonymous attacks on the service. The goal may also have been avoiding denial of service attacks, but that one did not pan out very well. The details are described here:

In this description, the facility is hard-coded to use NTLM authentication. Starting with Windows 7, the feature can also use Kerberos for authentication. So this is yet another reason to update.

The server will only require authentication (reject anonymous clients) if “RestrictRemoteClients” is set to 1 or higher. When you have the combinations of applications with dynamic endpoints, many clients and frequent reconnects in the deployments, you get a sustainable number of authentications.

Some of the customers affected were quite surprised about the NTLM authentication volume, as they had everything configured to use Kerberos on their proxy servers and Exchange running without RPC over HTTP clients.

Exchange with MAPI clients is an application architecture that uses many different RPC interfaces, all using Endpoint Mapper. The list includes Store, NSPI, Referrer plus a few operating system interfaces like LSA RPC, each one of them triggering NTLM authentications. The bottleneck is then caused by the queuing of requests, done in each hop along the trust path.

Similar problems may happen with custom applications using RPC or DCOM to communicate. It all comes down to the rate of NTLM authentications induced on the AD infrastructure.

In our testing we found that not all RPC interfaces are happy with secure endpoint mapper, see the blog of Ned.

What are customers doing about it?

Most customers are then going to increase “MaxConcurrentAPI” which provides relief. Many customers also add monitoring of Netlogon performance counters to their baseline. We also have customers who start to use secure channel monitoring, and when they see that a DC is heaping incoming secure channels, they use “nltest /SC_RESET” to balance resource domain controllers or member servers evenly across the downstream domain controllers.

And yes, one way out of this is also setting the RPC registry entries or group policy to the defaults, so clients don’t attempt NTLM authentication. Since this setting was often required by the security department, it is probably not being changed in all cases. Some arguments that the secure Endpoint Mapper may not provide significant value are as follows:

1. The call is only done to get the server TCP port. The communication to the server typically is authenticated separately.

2. If the firewall does not permit incoming RPC endpoint mapper request from the Internet, the callers are all from the internal network. Thus no information is disclosed to outside entities if the network is secure.

3. There are no known vulnerabilities in the endpoint mapper. It was once justified when there were vulnerabilities, but not today.

4. If you can’t get the security policy changed, ask the IT team to expedite Windows 7 deployment as it does not cause NTLM authentication in this scenario.

Ah, those old habits, they always come back on you. The hope you now have tools and countermeasures to make all this more bearable.

Update 5/1/2012:

There is an update available that adds NetLogon events 5816-5819 when you experience Semaphore Waiters and Semaphore Timeouts. This will allow you to find bottlenecks in your trust graph quickly. Check out:

Hiya folks, Ned here again. When interviewing a potential support engineer at Microsoft, we usually start with a softball question like “what are the five FSMO roles?” Everyone nails that. Then we ask what each role does. Their face scrunches a bit and they get less assured. “The RID Master… hands out RIDs.” Ok, what are RIDs? “Ehh…”

That’s trouble, and not just for the interview. Poor understanding of the RID Master prevents you from adding new users, computers, and groups, which can disrupt your business. Uncontrolled RID creation forces you to abandon your domain, which will cost you serious money.

Today, I discuss how to protect your company from uncontrolled RID pool depletion and keep your domain bustling for decades to come.

Background

Relative Identifiers (RID) are the incremental portion of a domain Security Identifier (SID). For instance:

S-1-5-21-1004336348-1177238915-682003330-2100

==>

S-1-5-Domain Identifier-Relative Identifier

A SID represents a unique trustee, also known as a "security principal" – typically users, groups, and computers – that Windows uses for access control. Without a matching SID in an access control list, you cannot access a resource or prove your identity. It’s the lynchpin.

Every domain has a RID Master: a domain controller that hands each DC a pool of 500 RIDs at a time. A domain contains a single RID pool which generates roughly one billion SIDs (because of a 30-bit length, it’s 230 or 1,073,741,823 RIDs). Once issued, RIDs are never reused. You can’t reclaim RIDs after you delete security principals either, as that would lead to unintended access to resources that contained previously issued SIDs.

Anytime you create a writable DC, it gets 500 new RIDs from the RID Master. Meaning, if you promote 10 domain controllers, you’ve issued 5000 new RIDs. If 8 of those DCs are demoted, then promoted back up, you have now issued 9000 RIDs. If you restore a system state backup onto one of those DCs, you’ve issued 9500 RIDs. The balance of any existing RIDs issued to a DC is never saved – once issued they’re gone forever, even if they aren’t used to create any users. A DC requests more RIDs when it gets low, not just when it is out, so when it grabs another 500 that becomes part of its "standby" pool. When the current pool is empty, the DC switches to the standby pool. Repeat until doomsday.

The normal operations are out of your control and unlikely to cause problems even in the biggest environments. For example, even though Microsoft’s Redmond AD dates to 1999 and holds the vast majority of our resources, it has only consumed ~8 million RIDs - that's 0.7%. In contrast, some of the abnormal operations can lead to squandered RIDs or even deplete the pool altogether, forcing you to migrate to a new domain or recover your forest. We’ll talk more about them later; regardless of how you are using RIDs, the key to avoiding a problem is observation.

Monitoring

You now have a new job, IT professional: monitoring your RID usage and ensuring it stays within expected patterns. KB305475 describes the attributes for both the RID Master and the individual DCs. I recommend giving it a read, as the data storage requires conversion for human consumption.

Monitoring the RID Master in each domain is adequate and we offer a simple command-line tool I’ve discussed before – DCDIAG.EXE. Part of Windows Server 2008+ or a free download for 2003, it has a simple test that shows the translated number of allocated RIDs called rIDAvailablePool:

Turn one of those PowerShell samples into a script that runs as a scheduled task that updates a log every morning and alerts you to review it. You can also use LDP.EXE to convert the RID pool values manually every day, if you are an insane person.

You should also consider monitoring the RID Block Size, as any increase exhausts your global RID pool faster. Object Access Auditing can help here. There are legitimate reasons to increase this value on certain DCs. For example, if you are the US Marine Corps and your DCs are in a warzone where they may not be able to talk to the RID Master for weeks. Be smart about picking values - you are unlikely to need five million RIDs before talking to the master again; when the DC comes home, lower the value back to default.

The critical review points are:

You don’t see an unexpected rise in RID issuance.

You aren’t close to running out of RIDs.

Let’s explore what might be consuming RIDs unexpectedly.

Diagnosis

If you see a large increase in RID allocation, the first step is finding what was created and when. As always, my examples are PowerShell. You can find plenty of others using VBS, free tools, and whatnot on the Internet.

You need to return all users, computers, and groups in the domain – even if deleted. You need the SAM account name, creation date, SID, and USN of each trustee. There are going to be a lot of these, so filter the returned properties to save time and export to a CSV file for sorting and filtering in Excel. Here’s a sample (it’s one wrapped line):

Here I ran the command, then opened in Excel and sorted by newest to oldest:

Errrp, looks like another episode of “scripts gone wild”…

Now it’s noodle time:

Does the user count match actual + previous user counts (or at least in the ballpark)?

Are there sudden, massive blocks of object creation?

Is someone creating and deleting objects constantly – or was it just once and you need to examine your audit logs to see who isn’t admitting it?

Has your user provisioning system gone berserk (or run by someone who needs… coaching)?

Have you changed your password policy and are now trying to create enabled users that do not meet password requirements (this uses up a RID during each failed creation attempt).

Do you use a VDI system that constantly creates and deletes computer accounts when provisioning virtual machines - we’ve seen those too: in one case, a third party solution was burning 4 million computer RIDs a month.

If the RID allocations are growing massively, but you don’t see a subsequent increase in new trustees, it’s likely someone increased RID Block Size inappropriately. Perhaps they set hexadecimal rather than decimal values – instead of the intended 15,000 RIDs per allocation, for example, you’d end up with 86,016!

It may also be useful to know where the updates are coming from. Examine each DC’s RidAllocationPool for increases to see if something is running on - or pointed at – a specific domain controller.

Recovery

You know there’s a problem. The next step is to stop things getting worse (as you have no way to undo the damage without recovering the entire forest).

If you identified the cause of the RID exhaustion, stop it immediately; your domain’s health is more important. If that system continues in high enough volume, it’s going to force you to abandon your domain.

If you can’t find the cause and you are anywhere near the end of your one billion RIDs, get a system state backup on the RID Master immediately. Then transfer the RID Master role to a non-essential DC that you shut down to prevent further issuance. The allocated RID pools on your DCs will run out, but that stops further damage. This gives you breathing space to find the bad guy. The downside is that legitimate trustee creation stops also. If you don’t already have a Forest Recovery process in place, you had better get one going. If you cannot figure out what's happening, open a support case with us immediately.

... it is too late. Like having a smoke detector that only goes off when the house has burned down. Now you cannot create a trust for a migration to another domain. If you reach that stage, open a support case with us immediately. This is one of those “your job depends on it” issues, so don’t try to be a lone gunfighter.