Warning: As is my wont, this is a deep dive post. Make sure you’ve had your coffee before proceeding.

Last month at Microsoft Ignite, we talked about many of the exciting new features rolling out in Server 2019. (Watch the Ignite sessions here and here.)

But now I want to talk about an enhancement to on-premises Active Directory in Server 2019 that you won’t read or hear about anywhere else. This topic is near and dear to my heart.

The first section of this article discusses how Active Directory’s sizing of the ESE version store has changed in Server 2019 going forward. The second section covers some basic debugging techniques related to the ESE version store.

One component of all ESE database instances is known as the version store. The version store is an in-memory temporary storage location where ESE stores snapshots of the database during open transactions. This allows the database to roll back transactions and return to a previous state in case the transactions cannot be committed. When the version store is full, no more database transactions can be committed, which effectively brings NTDS to a halt.

In 2016, the CSS Directory Services support team blog (also known as AskDS) published some previously undocumented (and some lightly documented) internals regarding the ESE version store. Those new to the concept of the ESE version store should read that blog post first.

In the blog post linked above, it was demonstrated how Active Directory had calculated the size of the ESE version store since AD’s introduction in Windows 2000. When the NTDS service first started, a complex algorithm was used to calculate the version store size. This algorithm included the machine’s native pointer size, the number of CPUs, the version store page size (based on an assumption that was incorrect on 64-bit operating systems), the maximum number of simultaneous RPC calls allowed, the maximum number of ESE sessions allowed per thread, and more.

Since the version store is a memory resource, it follows that the most important factor in determining the optimal ESE version store size is the amount of physical memory in the machine, and that – ironically – seems to have been the only variable not considered in the equation!

The way that Active Directory calculated the version store size did not age well. The original algorithm was written during a time when all machines running Windows were 32-bit, and even high-end server machines had maybe one or two gigabytes of RAM.

As a result, many customers have contacted Microsoft Support over the years for issues arising on their domain controllers that could be attributed to or at least exacerbated by an undersized ESE version store. Furthermore, even though the default ESE version store size can be augmented by the “EDB max ver pages (increment over the minimum)” registry setting, customers are often hesitant to use the setting because it is a complex topic that warrants heavier and more generous amounts of documentation than what has traditionally been available.

The algorithm is now greatly simplified in Server 2019:

– When NTDS first starts, the ESE version store size is now calculated as 10% of physical RAM, with a minimum of 400MB and a maximum of 4GB.

The same calculation applies to physical machines and virtual machines. In the case of virtual machines with dynamic memory, the calculation is based on the amount of “starting RAM” assigned to the VM.

The “EDB max ver pages (increment over the minimum)” registry setting can still be used, as before, to add additional buckets over the default calculation (even beyond 4GB if desired). The registry setting is in terms of “buckets,” not bytes. Version store buckets are 32KB each on 64-bit systems. (They are 16KB on 32-bit systems, but Microsoft no longer supports any 32-bit server OSes.) Therefore, if one adds 5000 “buckets” by setting the registry entry to 5000 (decimal), then 156MB will be added to the default version store size.

A minimum of 400MB was chosen for backwards compatibility: with the old algorithm, the default version store size for a DC with a single 64-bit CPU was ~410MB, regardless of how much memory it had. (As in previous Windows versions, there is no way to configure less than the minimum of 400MB.) The advantage of the new algorithm is that the version store size now scales linearly with the amount of memory in the domain controller, where previously it did not.
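The new calculation and the effect of the registry increment can be sketched in a few lines of Python (a sketch only; the function names are mine, not anything shipped in Windows):

```python
MB = 1024 * 1024
GB = 1024 * MB
BUCKET_64BIT = 32 * 1024  # version store buckets are 32KB on 64-bit systems

def default_version_store(phys_ram_bytes):
    """Server 2019 default: 10% of physical RAM, clamped to [400MB, 4GB]."""
    return min(max(phys_ram_bytes // 10, 400 * MB), 4 * GB)

def effective_version_store(phys_ram_bytes, extra_buckets=0):
    """Default plus the 'EDB max ver pages (increment over the minimum)' buckets."""
    return default_version_store(phys_ram_bytes) + extra_buckets * BUCKET_64BIT
```

For example, a DC with 12GB of RAM defaults to ~1.2GB of version store, and setting the registry entry to 5000 adds another 5000 × 32KB ≈ 156MB on top of that.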

Defaults:

Physical Memory in the Domain Controller    Default ESE Version Store Size
1GB                                         400MB
2GB                                         400MB
3GB                                         400MB
4GB                                         400MB
5GB                                         500MB
6GB                                         600MB
8GB                                         800MB
12GB                                        1.2GB
24GB                                        2.4GB
48GB                                        4GB
128GB                                       4GB

This new calculation will result in larger default ESE version store sizes for domain controllers with greater than 4GB of physical memory when compared to the old algorithm. This means more version store space to process database transactions, and fewer cases of version store exhaustion. (Which means fewer customers needing to call us!)

Note: This enhancement currently only exists in Server 2019 and there are not yet any plans to backport it to older Windows versions.

Note: This enhancement applies only to Active Directory and not to any other application that uses an ESE database such as Exchange, etc.

ESE Version Store Advanced Debugging and Troubleshooting

This section will cover some basic ESE version store triage, debugging and troubleshooting techniques.

As covered in the AskDS blog post linked to previously, the performance counter used to see how many ESE version store buckets are currently in use is:

\\.\Database ==> Instances(lsass/NTDSA)\Version buckets allocated

Once that counter has reached its limit (~12,800 buckets, or ~400MB, by default), events will be logged to the Directory Services event log, indicating the exhaustion:

Figure 1: NTDS version store exhaustion.

The event can also be viewed graphically in Performance Monitor:

Figure 2: The plateau at 12,314 means that the performance counter “Version Buckets Allocated” cannot go any higher. The flat line represents a dead patient.

As long as the domain controller still has available RAM, try increasing the version store size using the previously mentioned registry setting. Increase it in gradual increments until the domain controller is no longer exhausting the ESE version store, or the server has no more free RAM, whichever comes first. Keep in mind that the more memory that is used for version store, the less memory will be available for other resources such as the database cache, so a sensible balance must be struck to maintain optimal performance for your workload. (i.e. no one size fits all.)

If the “Version Buckets Allocated” performance counter is still pegged at the maximum amount, then there is some further investigation that can be done using the debugger.

The eventual goal will be to determine the nature of the activity within NTDS that is primarily responsible for exhausting the domain controller of all its version store, but first, some setup is required.

First, generate a process memory dump of lsass on the domain controller while the machine is “in state” – that is, while the domain controller is at or near version store exhaustion. To do this, use the “Create dump file” option in Task Manager by right-clicking the lsass process on the Details tab. Alternatively, another tool such as Sysinternals’ procdump.exe can be used (with the -ma switch).

In case the issue is transient and only occurs when no one is watching, data collection can be configured on a trigger, using procdump with the -p switch.

Note: Do not share lsass memory dump files with unauthorized persons, as these memory dumps can contain passwords and other sensitive data.

It is a good idea to generate the dump after the Version Buckets Allocated performance counter has risen to an abnormally elevated level, but before the version store has plateaued completely. This is because the database transaction responsible may be terminated once exhaustion occurs, and the thread would then no longer be present in the memory dump. If the guilty thread is no longer alive when the memory dump is taken, troubleshooting will be much more difficult.

Next, gather a copy of %windir%\System32\esent.dll from the same Server 2019 domain controller. The esent.dll file contains a debugger extension, but it is highly dependent on the Windows version; a mismatched version can output incorrect results. It should match the same version of Windows as the memory dump file.

Once WinDbg is installed, configure the symbol path for Microsoft’s public symbol server:

Figure 3: srv*c:\symbols*http://msdl.microsoft.com/download/symbol

Now load the lsass.dmp memory dump file, and load the esent.dll module that you had previously collected from the same domain controller:

Figure 4: .load esent.dll

Now the ESE database instances present in this memory dump can be viewed with the command !ese dumpinsts:

Figure 5: !ese dumpinsts – The only ESE instance present in an lsass dump on a DC should be NTDSA.

Notice that the current version bucket usage is 11,189 out of 12,802 buckets total. The version store in this memory dump is very nearly exhausted. The database is not in a particularly healthy state at this moment.

The command !ese param <instance> can also be used, specifying the same database instance obtained from the previous command, to see the global configuration parameters for that ESE database instance. Notice that JET_paramMaxVerPages is set to 12800 buckets, which is 400MB worth of 32KB buckets:

Figure 6: !ese param

To see much more detail regarding the ESE version store, use the !ese verstore <instance> command, specifying the same database instance:

Figure 7: !ese verstore

The output of the command above shows us that there is an open, long-running database transaction, how long it’s been running, and which thread started it. This also matches the same information displayed in the Directory Services event log event pictured previously.

Neither the event log event nor the esent debugger extension were always quite so helpful; they have both been enhanced in recent versions of Windows.

In older versions of the esent debugger extension, the thread ID could be found in the dwTrxContext field of the PIB, (command: !ese dump PIB 0x000001AD71621320) and the start time of the transaction could be found in m_trxidstack as a 64-bit file time. But now the debugger extension extracts that data automatically for convenience.

Switch to the thread that was identified earlier and look at its call stack:

The four functions that are highlighted by a red rectangle in the picture above are interesting, and here’s why:

When an object is deleted on a domain controller, and that object has links to other objects, those links must also be deleted/cleaned by the domain controller. For example, when an Active Directory user becomes a member of a security group, a database link between the user and the group is created that represents that relationship. The same principle applies to all linked attributes in Active Directory. If the Active Directory Recycle Bin is enabled, then the link-cleaning process will be deferred until the deleted object surpasses its Deleted Object Lifetime – typically 60 or 180 days after being deleted. This is why, when the AD Recycle Bin is enabled, a deleted user can be easily restored with all of its group memberships still intact – because the user account object’s links are not cleaned until after its time in the Recycle Bin has expired.

The trouble begins when an object with many backlinks is deleted. Some security groups, distribution lists, RODC password replication policies, etc., may contain hundreds of thousands or even millions of members. Deleting such an object will give the domain controller a lot of work to do. As you can see in the thread call stack shown above, the domain controller had been busily processing links on a deleted object for 47 seconds and still wasn’t done. All the while, more and more ESE version store space was being consumed.

When the AD Recycle Bin is enabled, this can cause even more confusion, because no one remembers that they deleted that gigantic security group 6 months ago. A time bomb has been sitting in the AD Recycle Bin for months. But suddenly, AD replication grinds to a standstill throughout the domain and the admins are scrambling to figure out why.

The performance counter “\\.\DirectoryServices ==> Instances(NTDS)\Link Values Cleaned/sec” would also show increased activity during this time.

Domain controllers process the deletion of these links in batches. There are two main ways to fight this problem: increasing the version store size with the “EDB max ver pages (increment over the minimum)” registry setting, decreasing the batch size with the “Links process batch size” registry setting, or a combination of both. The smaller the batch size, the shorter the individual database transactions will be, thus relieving pressure on the ESE version store.
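To see why a smaller batch size helps, consider the arithmetic. Each batch of link deletions is one database transaction, so fewer links per batch means shorter transactions and less version store held at any one time (a sketch; the 10,000 default batch size used here is an assumption for illustration):

```python
import math

def link_cleanup_transactions(total_links, batch_size=10_000):
    """Number of transactions needed to clean all backlinks of a deleted
    object, given that each batch of deletions is one transaction."""
    return math.ceil(total_links / batch_size)
```

A deleted group with 1,000,000 backlinks cleaned in batches of 10,000 takes 100 transactions; halving the batch size doubles the transaction count, but roughly halves the version store space each transaction holds open.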

Though the default values are properly sized for almost all Active Directory deployments, and most administrators should never have to worry about them, the two previously mentioned registry settings are supported, and well-informed enterprise administrators are encouraged to tweak the values (within reason) to avoid ESE version store depletion. Contact Microsoft customer support before making any modifications if there is any uncertainty.

At this point, one could continue diving deeper, using various approaches (e.g. consider not only debugging process memory, but also consulting DS Object Access audit logs, object metadata from repadmin.exe, etc.) to find out which exact object with many thousands of links was just deleted, but in the end that’s a moot point. There’s nothing else that can be done with that information. The domain controller simply must complete the work of link processing.

In other situations, however, it will be apparent, using the same techniques shown previously, that an incoming LDAP query from some network client is performing inefficient queries, leading to version store exhaustion. Other times it will be DirSync clients; other times it may be something else. In those instances, there may be more you can do besides tweaking the version store variables, such as tracking down and silencing the offending network client(s), optimizing LDAP queries, creating database indices, etc.

When a user on a Windows client visits a secure Web site (by using HTTPS/TLS), reads a secure email (S/MIME), or downloads an ActiveX control that is signed (code signing) and encounters a certificate which chains to a root certificate not present in the root store, Windows will automatically check the appropriate Microsoft Update location for the root certificate.

If it finds it, it downloads it to the system. To the user, the experience is seamless; they don’t see any security dialog boxes or warnings and the download occurs automatically, behind the scenes.

During TLS handshakes, any certificate chains involved in the connection will need to be validated, and, from Windows Vista/2008 onwards, the automatic disallowed root update mechanism is also invoked to verify if there are any changes to the untrusted CTL (Certificate Trust List).

A certificate trust list (CTL) is a predefined list of items that are authenticated and signed by a trusted entity.

It expands on the automatic root update mechanism technology (for trusted root certificates) mentioned earlier to let certificates that are compromised or are untrusted in some way be specifically flagged as untrusted.

Customers therefore benefit from periodic automatic updates to both trusted and untrusted CTLs.

So, after the preamble, what scenarios are we talking about today?

Here are some examples of issues we’ve come across recently.

1)

Your users may experience browser errors after several seconds when trying to browse to secure (https) websites behind a load balancer.

They might receive an error like “The page cannot be displayed. Turn on TLS 1.0, TLS 1.1, and TLS 1.2 in the Advanced settings and try connecting to https://contoso.com again. If this error persists, contact your site administrator.”

If they try to connect to the website via the IP address of the server hosting the site, the https connection works after showing a certificate name mismatch error.

All TLS versions ARE enabled when checking in the browser settings:

Internet Options

2)

You have a 3rd-party appliance making TLS connections to a domain controller via LDAPS (LDAP over SSL), which may experience delays of up to 15 seconds during the TLS handshake.

The issue occurs randomly when connecting to any eligible DC in the environment targeted for authentication.

There are no intervening devices that filter or modify traffic between the appliance and the DCs.

2a)

A very similar scenario to the above is in fact described in the following article by our esteemed colleague, Herbert:

“A user sends a certificate on a session. The server needs to check for certificate revocation, which may take some time.”

This becomes problematic if network communication is restricted and the DC cannot reach the Certificate Distribution Point (CDP) for a certificate.

To determine if your clients are using secure LDAP (LDAPS), check the counter “LDAP New SSL Connections/sec”.

If there are a significant number of sessions, you might want to look at CAPI2 logging.

3)

A 3rd-party meeting server performing LDAPS queries against a domain controller may fail the TLS handshake on the first attempt after surpassing a pre-configured timeout (e.g. 5 seconds) on the application side.

During certificate validation operations, the CTL engine gets periodically invoked to verify if there are any changes to the untrusted CTLs.

In the example scenarios we described earlier, if the default public URLs for the CTLs are unreachable, and there is no alternative internal CTL distribution point configured (more on this in a minute), the TLS handshake will be delayed until the WinHttp call to access the default CTL URL times out.

By default, this timeout is usually around 15 seconds, which can cause problems when load balancers or 3rd party applications are involved and have their own (more aggressive) timeouts configured.

If we enable CAPI2 Diagnostic logging, we should be able to see evidence of when and why the timeouts are occurring.

We will see events like the following:

Event ID 20 – Retrieve Third-Party Root Certificate from Network:

Trusted CTL attempt

Disallowed CTL attempt

Event ID 53 error message details showing that we have failed to access the disallowed CTL:

Event ID 53

The following article gives a more detailed overview of the CAPI2 diagnostics feature available on Windows systems, which is very useful when looking at any certificate validation operations occurring on the system:

To help us confirm that the CTL updater engine is indeed affecting the TLS delays and timeouts we’ve described, we can temporarily disable it for both the trusted and untrusted CTLs and then attempt our TLS connections again.

After applying these steps, you should find that your previously failing TLS connections no longer time out. Your symptoms may vary slightly, but you should see speedier connection times, because we have eliminated the delay of trying and failing to reach the CTL URLs.

So, what now?

We should now REVERT the above registry changes by restoring the backup we created, and evaluate the following, more permanent solutions.

We previously stated that disabling the updater engine should only be a temporary measure to confirm the root cause of the timeouts in the above scenarios.

Put on your crash helmets and fasten your seatbelts for a JET engine / ESE database special…

This is Linda Taylor, Senior AD Escalation Engineer from the UK, here again. And WAIT… I also somehow managed to persuade Brett Shirley to join me in this post. Brett is a Principal Software Engineer on the ESE development team, so you can be sure the information in this post is going to be deep and confusing, but really interesting, useful, and the kind you cannot find anywhere else. :-) BTW, Brett used to write blogs before he grew up and got very busy. Just for fun, you might find this old “Brett” classic entertaining; I have never forgotten it. :-) Back to today’s post… this will be a rather more grown-up post, although we will talk about DITs, in a very scientific fashion.

In this post, we will start from the ground up and dive deep into the overall file format of an ESE database, including practical skills with esentutl, such as how to look at raw database pages. And as the title suggests, this is Part 1, so there will be more!

What is an ESE database?

Let’s start basic. The Extensible Storage Engine (ESE), also known as JET Blue, is a database engine from Microsoft that does not speak SQL. And Brett adds: for those with a historical bent, or from academia, who remember ‘before SQL’ rather than ‘NoSQL’, ESE is modelled after the ISAMs (indexed sequential access methods) that were in vogue in the mid-70s. ;-p If you work with Active Directory (which you must, if you are reading this post), then you will (I hope!) know that it uses an ESE database. The respective binary is esent.dll (or, for the Exchange Server install – Brett loves Exchange – ese.dll). Applications like Active Directory are all ESE clients and use the JET APIs to access the ESE database.

This post will dive deep into the Blue parts above. The ESE side of things. AD is one huge client of ESE, but there are many other Windows components which use an ESE database (and non-Microsoft software too), so your knowledge in this area is actually very applicable for those other areas. Some examples are below:

Tools

There are several built-in command line tools for looking into an ESE database and related files.

esentutl. This is a tool that ships in Windows Server by default for use with Active Directory, Certificate Authority and any other built in ESE databases. This is what we will be using in this post and can be used to look at any ESE database.

eseutil. This is the Exchange version of the same and gets installed typically in the Microsoft\Exchange\V15\Bin sub-directory of the Program Files directory.

ntdsutil. A tool specifically for managing AD or AD LDS databases; it cannot be used with generic ESE databases (such as the one produced by the Certificate Authority service). This is installed by default when you add the AD DS or AD LDS role.

For read operations, such as dumping file or log headers, it doesn’t matter which tool you use. But for operations which write to the database, you MUST use the matching tool for the application and version (for instance, it is not safe to run esentutl /r from Windows Server 2016 on a Windows Server 2008 DB). Throughout this article, if you are looking at an Exchange database instead, use eseutil.exe instead of esentutl.exe. For AD and AD LDS, always use ntdsutil or esentutl; they have different capabilities, so I use a mixture of both. And Brett says: if you think you cannot keep the read operations straight from the write operations, play it safe and match the tool to the application and version.

During this post, we will use an AD database as our victim example. We may use other ones, like ADLDS for variety in later posts.

Database logical format – Tables

Let’s start with the logical format. From a logical perspective, an ESE database is a set of tables which have rows and columns and indices.

Below is a visual of the list of tables from an AD database in Windows Server 2016. Different ESE databases will have different table names and use those tables in their own ways.

In this post, we won’t go into the details of DNTs, PDNTs, and how to analyze an AD database dump taken with LDP, because that is AD-specific, and here we are looking at the ESE level. Also, there are other blogs and sources where this has already been explained, for example here on AskPFEPlat. However, if such a post is wanted, tell me and I will endeavor to write one!

It is also worth noting that all ESE databases have a table called MSysObjects, and a table called MSysObjectsShadow which is a backup of MSysObjects. These are also known as “the catalog” of the database, and they store metadata about the client’s schema of the database, i.e.:

All the tables and their table names and where their associated B+ trees start in the database and other miscellaneous metadata.

All the columns for each table and their names (of course), the type of data stored in them, and various schema constraints.

All the indexes on the tables and their names, and where their associated B+ trees start in the database.

This is the boot-strap information for ESE to be able to service client requests for opening tables to eventually retrieve rows of data.

Database physical format

From a physical perspective, an ESE database is just a file on disk. It is a collection of fixed size pages arranged into B+ tree structures. Every database has its page size stamped in the header (and it can vary between different clients, AD uses 8 KB). At a high level it looks like this:

The first “page” is the Header (H).

The second “page” is a Shadow Header (SH) which is a copy of the header.

However, in ESE, “page number” (frequently abbreviated “pgno”) has a very specific meaning, and it often shows up in ESE events: the first NUMBERED page of the actual database is page number / pgno 1, but it is actually the third “page” (if you are counting from the beginning :-).

From here on out, though, we will not count the header and shadow header as proper pages; page number 1 will be the third page, at byte offset = <page size> * 2 = 8192 * 2 = 16384 (for AD databases).
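Converting a pgno to a byte offset in the file is therefore simple (a sketch; the helper name is mine):

```python
def pgno_to_offset(pgno, page_size=8192):
    """Header and shadow header occupy the first two page-sized slots,
    so numbered page 1 begins at byte offset page_size * 2."""
    return page_size * (pgno + 1)
```

For an AD database, pgno 1 starts at offset 16384, and pgno 31 (the datatable root page we find later in this post) at offset 262144.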

If you don’t know the page size, you can dump the database header with esentutl /mh.

Here is a dump of the header for an NTDS.DIT file – the AD database:

The page size is the cbDbPage value. AD and AD LDS use a page size of 8k; other databases use different page sizes.

A caveat is that to be able to do this, the database must not be in use. So, you’d have to stop the NTDS service on the DC or run esentutl on an offline copy of the database.

But the good news is that in WS2016 and above we can now dump a LIVE DB header with the /vss switch! The command you need would be “esentutl /mh ntds.dit /vss” (note: must be run as administrator).

All these numbered database pages logically are “owned” by various B+ trees where the actual data for the client is contained … and all these B+ trees have a “type of tree” and all of a tree’s pages have a “placement in the tree” flag (Root, or Leaf or implicitly Internal – if not root or leaf).

Ok, Brett, that was “proper” tree and page talk – I think we need some pictures to show them…

Logically the ownership / containing relationship looks like this:

More about B+ Trees

The pages are in turn arranged into B+ trees, where the top page is known as the ‘Root’ page and the bottom pages are ‘Leaf’ pages, where all the data is kept. Something like this (note this particular example does not show ‘Internal’ B+ tree pages):

The upper / parent page has partial keys indicating that all entries with 4245 + A* can be found in pgno 13, and all entries with 4245 + E* can be found in pgno 14, etc.

Note this is a highly simplified representation of what ESE does … it’s a bit more complicated.

This is not specific to ESE; many database engines have either B trees or B+ trees as a fundamental arrangement of data in their database files.
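The routing logic in a parent page can be mimicked with a sorted list of separator keys (a toy sketch, not ESE’s actual page format; the keys and page numbers follow the 4245 example above):

```python
import bisect

# Separator keys from the example: entries below "4245E" live in pgno 13,
# entries from "4245E" up to (but not including) "4245I" in pgno 14, etc.
separators = ["4245E", "4245I"]
child_pgnos = [13, 14, 15]

def find_child_page(key):
    """Binary-search the separator keys to pick the child page for a key."""
    return child_pgnos[bisect.bisect_right(separators, key)]
```

A seek descends the tree by repeating this step at each internal level until it reaches a leaf page.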

The Different trees

You should know that there are different types of B+ trees inside the ESE database, needed for different purposes. These are:

Data Trees – the primary trees, which store the actual records (rows) of each table.

Long Value (LV) Trees – used to store long values, in other words large chunks of data which don’t fit into the primary record.

Index Trees – B+ trees used to store indexes.

Space Trees – used to track which pages are owned and which are free / available as new pages for a given B+ tree. Each of the previous three types of B+ tree (data, LV, and index) may (if the tree is large) have a set of two space trees associated with it.

Storing large records

Each row of a table is limited to 8k (or whatever the page size is) in Active Directory and AD LDS, i.e. each record has to fit into a single 8k database page. But you are probably aware that you can fit a LOT more than 8k into an AD object or an Exchange e-mail! So how do we store large records?

Well, we have different types of columns as illustrated below:

Tagged columns can be split out into what we call the Long Value (LV) tree. In the tagged column we store a simple 4-byte number called a LID (Long Value ID), which points to an entry in the LV tree. We take the large piece of data, break it up into small chunks, and prefix each chunk with a key made of the LID and the offset.

So, if every part of the record were a LID / pointer to an LV, then essentially we could fit about 1300 LV pointers onto the 8k page. By the way, this is what creates the ~1300-attribute limit in AD; it is all down to the ESE page size.
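A back-of-the-envelope version of that limit looks like this (the page header size and per-column overhead here are rough assumptions of mine, just to show the order of magnitude):

```python
PAGE_SIZE = 8192
PAGE_HEADER = 80         # assumed fixed per-page overhead (rough)
LID_SIZE = 4             # the 4-byte Long Value ID stored in the record
PER_COLUMN_OVERHEAD = 2  # assumed bytes of tagged-column bookkeeping

# Maximum LV pointers that fit in one page-sized record, under these assumptions.
max_lv_pointers = (PAGE_SIZE - PAGE_HEADER) // (LID_SIZE + PER_COLUMN_OVERHEAD)
```

With these assumed overheads the result lands in the mid-1300s, in line with the ~1300 figure quoted above once the remaining record overhead is accounted for.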

Now you can also start to see that when you are looking at a whole AD object you may read pages from various trees to get all the information about your object. For example, for a user with many attributes and group memberships you may have to get data from a page in the ”datatable” \ Primary tree + “datatable” \ LV tree + sd_table \ Primary tree + link_table \ Primary tree.

Index Trees

An index is used for a couple of purposes: firstly, to list the records in an intelligent order, such as by surname in alphabetical order; and secondly, to cut down the number of records to examine, which can greatly speed up searches (especially when the ‘selectivity is high’, meaning few entries match).

Below is a visual illustration (with the B+ trees turned on their side to make the diagram easier) of a primary index, which is the DNT index in the AD database – the Data Tree – and a secondary index on dNSHostName. You can see that the secondary index only contains the records which have a dNSHostName populated; it is smaller.

You can also see that in the secondary index, the primary key is the data portion (the name) and then the data is the actual Key that links us back to the REAL record itself.
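In miniature, the relationship between the primary index and a secondary index looks like this (a sketch with made-up DNTs and names):

```python
# Primary "index": records keyed by DNT (the primary key in AD's datatable).
records = {
    1001: {"name": "DC01", "dNSHostName": "dc01.contoso.com"},
    1002: {"name": "GroupA"},  # no dNSHostName, so absent from the index
    1003: {"name": "FS01", "dNSHostName": "fs01.contoso.com"},
}

# Secondary index: key is the indexed value, data is the primary key,
# which leads us back to the real record.
dns_index = {rec["dNSHostName"]: dnt
             for dnt, rec in records.items()
             if "dNSHostName" in rec}

def lookup_by_dns(name):
    """Seek in the secondary index, then fetch the full record by DNT."""
    return records[dns_index[name]]
```

Note how the secondary index holds fewer entries than the table, and its “data” is just the pointer back to the real row.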

Inside a Database page

Each database page has a fixed header. And the header has a checksum as well as other information like how much free space is on that page and which B-tree it belongs to.

Then we have these things called TAGS (or nodes), which store the data.

A node can be many things, such as a record in a database table or an entry in an index.

The TAGs are actually out of order on the page, but order is established by the tag array at the end.

TAG 0 = Page External Header

This contains variable sized special information on the page, depending upon the type of B-tree and type of page in B tree (space vs. regular tree, and root vs. leaf).

TAG 1,2,3, etc are all “nodes” or lines, and the order is tracked.

The key & data is specific to the B Tree type.

And TAG 1 is actually node 0!!! So here is a visual picture of what an ESE database page looks like:

It is possible to calculate this key if you have an object’s primary key. In AD this is a DNT.

The formula for that (if you are ever crazy enough to need it) would be:

Start with a 0x7F prefix byte; if the column is a signed INT, OR the number with 0x80000000 and append the result as four big-endian bytes.

And finally, other non-integer column types, such as String and Binary types, have different, more complicated key formats.

Why is this useful? Because, for example, you can take the DNT of an object, calculate its key, and then seek to its page using the esentutl.exe page-dump (/m) functionality with the /k option.
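Following that recipe for a signed-integer primary key, the key for a DNT can be computed like this (a sketch; the function name is mine, and this covers only the simple signed-INT case described above):

```python
def dnt_to_ese_key(dnt):
    """0x7F prefix byte, then the value biased by OR-ing in 0x80000000 so
    signed values sort correctly as raw bytes, appended big-endian."""
    return bytes([0x7F]) + (dnt | 0x80000000).to_bytes(4, "big")
```

For example, DNT 4245 (0x1095) yields the key bytes 7f 80 00 10 95, which is what you would hand to the /k option.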

The Nodes also look different (containing different data) depending on the ESE B+tree type. Below is an illustration of the different nodes in a Space tree, a Data Tree, a LV tree and an Index tree.

The green are the keys. The dark blue is data.

What does a REAL page look like?

You can use esentutl to dump pages of the database if, for example, you are investigating corruption.

Before we can dump a page, we want to find a page of interest (picking a random page could give you just a blank page), so first we need some information about the table schema. To start, you can dump all the tables and their associated root page numbers like this:

Note that we have filtered the output with findstr again to get a nice view of just the tables and their pgnoFDP and objidFDP. Findstr.exe is case-sensitive, so use the exact casing or use the /i switch.

objidFDP identifies this table in the catalog metadata. When looking at a database page we can use its objidFDP to tell which table this page belongs to.

pgnoFDP is the page number of the Father Data Page – the very top page of that B+ tree, also known as the root page. If you run esentutl /mm <dbname> on its own, you will see a huge list of every table and B-tree (except internal “space” trees), including all the indexes.

So, in this example, page 31 is the root page of the datatable.

Dumping a page

You can dump a page with esentutl using /m and /p. Below is an example of dumping page 31 from the database – the root page of the “datatable” table as above.

The objidFDP is the number indicating which B-tree the page belongs to. And the cbFree tells us how much of this page is free. (cb = count of bytes). Each database page has a double header checksum – one ECC (Error Correcting Code) checksum for single bit data correction, and a higher fidelity XOR checksum to catch all other errors, including 3 or more bit errors that the ECC may not catch. In addition, we compute a logged data checksum from the page data, but this is not stored in the header, and only utilized by the Exchange 2016 Database Divergence Detection feature.
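As a purely illustrative toy (the real ESE ECC and XOR checksums are considerably more involved than this), an XOR-style page checksum has this flavor:

```python
def xor_checksum(page: bytes) -> int:
    """Toy illustration only: XOR the page contents together four bytes
    at a time. Real ESE pages carry an ECC checksum (for single-bit
    correction) plus a higher-fidelity XOR checksum; this sketch just
    shows why an XOR checksum detects any odd number of bit flips."""
    acc = 0
    for i in range(0, len(page) - len(page) % 4, 4):
        acc ^= int.from_bytes(page[i:i + 4], "little")
    return acc

page = (5).to_bytes(4, "little") * 2
print(xor_checksum(page))  # 0: identical DWORDs cancel out
```

A single flipped bit anywhere in the page changes the result, which is how checksum verification catches torn or corrupted pages.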

You can see this is a root page with 3 nodes (4 TAGs – remember, TAG 1 is node 0, also known as line 0), and it is nearly empty (cbFree = 8092 bytes, so only about 100 bytes are used for these 3 nodes plus the page header and external header).


And notice the PageFlushType, which is related to the JET Flush Map file we could talk about in another post later.

The nodes here point to pages lower down in the tree. We could dump a next-level page (pgno 1438), and we would see the pages getting deeper and more spread out, with more nodes.

So you can see this page has 294 nodes, which again all point to other pages! It is also a ParentOfLeaf, meaning these pgno / page numbers actually point to leaf pages (with the final data on them).

Are you bored yet?

Or are you enjoying this like a geek? Either way, we are nearly done with the page internals and the tree climbing here.

If you navigate further down, you will eventually reach a page with some data on it. For example, let’s dump page 69, which TAG 6 is pointing to:

So this one has some data on it (as indicated by the “Leaf page” indicator under the fFlags).

Finally, you can also dump the data (the contents of a node, i.e. a TAG) with the /n switch, like this:

Remember: the /n specifier takes a pgno:line (node) specifier. This means that the :3 here dumped TAG 4 from the previous screen. Note that trying to dump “/n69:4” would actually fail.

The /n switch dumps all the raw data on the page, along with information about the columns, their contents and their types. The output also needs some translation, because it gives us the column ID (711 in the above example) and not the attribute name in AD (or whatever your database may be). The application developer would then be able to translate those column IDs into something meaningful. For AD and AD LDS, we can translate them to attribute names using the source code.

Finally, there really should be no need to do this in real life, other than in a situation where you are debugging a database problem. However, we hope this provided a good and ‘realistic’ demo to help understand and visualize the structure of an ESE database and how the data is stored inside it!

Stay tuned for more parts …. which Brett says will be significantly more useful to everyday administrators!

Because Justin’s blog post from 2014 covers the fundamentals of what lingering objects are so well, I don’t think I need to go over it again here. If you need to know what lingering objects in Active Directory are, and why you want to get rid of them, then please go read that post first.

The new version of the Lingering Object Liquidator tool began its life as an attempt to address some of the long-standing limitations of the old version. For example, the old version would just stop the entire scan when it encountered a single domain controller that was unreachable. The new version will just skip the unreachable DC and continue scanning the other DCs that are reachable. There are multiple other improvements in the tool as well, such as multithreading and more exhaustive logging.

Before we take a look at the new tool, there are some things you should know:

1) Lingering Object Liquidator – neither the old version nor the new version – is covered by CSS Support Engineers. A small group of us (including yours truly) have provided this tool as a convenience to you, but it comes with no guarantees. If you find a problem with the tool, or have a feature request, drop a line to the public AskDS email address, or submit feedback to the Windows Server UserVoice forum, but please don’t bother Support Engineers with it on a support case.

2) Don’t immediately go into your production Active Directory forest and start wildly deleting things just because they show up as lingering objects in the tool. Please carefully review and consider any AD objects that are reported to be lingering objects before deleting.

3) The tool may report some false positives for deleted objects that are very close to the garbage collection age. To mitigate this issue, you can manually initiate garbage collection on your domain controllers before using this tool. (We may add this so the tool does it automatically in the future.)

4) The tool will continue to evolve and improve based on your feedback! Contact the AskDS alias or the Uservoice forum linked to in #1 above with any questions, concerns, bug reports or feature requests.

Graphical User Interface Elements

Let’s begin by looking at the graphical user interface. Below is a legend that explains each UI element:

Lingering Object Liquidator v2

A) “Help/About” label. Click this and a page should open up in your default web browser with extra information and detail regarding Lingering Object Liquidator.

B) “Check for Updates” label. Click this and the tool will check for a newer version than the one you’re currently running.

C) “Detect AD Topology” button. This is the first button that should be clicked in most scenarios. The AD Topology must be generated first, before proceeding on to the later phases of lingering object detection and removal.

D) “Naming Context” drop-down menu. (Naming Contexts are sometimes referred to as partitions.) Note that this drop-down menu is only available after AD Topology has been successfully discovered. It contains each Active Directory naming context in the forest. If you know precisely which Active Directory Naming context that you want to scan for lingering objects, you can select it from this menu. (Note: The Schema partition is omitted because it does not support deletion, so in theory it cannot contain lingering objects.) If you do not know which naming contexts may contain lingering objects, you can select the “[Scan All NCs]” option and the tool will scan each Naming Context that it was able to discover during the AD Topology phase.

E) “Reference DC” drop-down menu. Note that this drop-down menu is only available after AD Topology has been successfully discovered. The reference DC is the “known-good” DC against which you will compare other domain controllers for lingering objects. If a domain controller contains AD objects that do not exist on the Reference DC, they will be considered lingering objects. If you select the “[Scan Entire Forest]” option, then the tool will (arbitrarily) select one reachable global catalog from each domain in the forest. It is recommended that you choose a known-good DC yourself, because the tool doesn’t know which reference DC is “best”; it will pick one at random.

F) “Target DC” drop-down menu. Note that this drop-down menu is only available after AD Topology has been successfully discovered. The Target DC is the domain controller that is suspected of containing lingering objects. The Target DC will be compared against the Reference DC, and each object that exists on the Target DC but not on the Reference DC is considered a lingering object. If you aren’t sure which DC(s) contain lingering objects, or just want to scan all domain controllers, select the “[Target All DCs]” option from the drop-down menu.

G) “Detect Lingering Objects” button. Note that this button is only available after AD Topology has been successfully discovered. After you have made the appropriate selections in the three aforementioned drop-down menus, click the Detect Lingering Objects button to run the scan. Clicking this button only runs a scan; it does not delete anything. The tool will automatically detect and avoid certain nonsensical situations, such as the user specifying the same Reference and Target DCs, or selecting a Read-Only Domain Controller (RODC) as a Reference DC.

H) “Select All” button. Note that this button does not become available until after lingering objects have been detected. Clicking it merely selects all rows from the table below.

I) “Remove Selected Lingering Objects” button. This button will attempt to delete the lingering objects that you have selected from the list of detected objects. You can select a range of items from the list using the shift key and the arrow keys. You can select and unselect specific items by holding down the control key and clicking on them. If you want to select all items, click the “Select All” button.

J) “Removal Method” radio buttons. These are mutually exclusive. You can choose which of the two supported methods you want to use to remove the lingering objects that have been detected. The “removeLingeringObject” method refers to the rootDSE modify operation, which can be used to “spot-remove” individual lingering objects. In contrast, the DsReplicaVerifyObjects method will remove all lingering objects all at once. This intention is reflected in the GUI by all lingering objects automatically being selected when the DsReplicaVerifyObjects method is chosen.

Q) The “Lingering Object ListView”. This “ListView” works similarly to a spreadsheet. It will display all lingering objects that were detected. You can think of each row as a lingering object. You can click on the column headers to sort the rows in ascending or descending order, and you can resize the columns to fit your needs. NOTE: If you right-click on the lingering object listview, the selected lingering objects (if any) will be copied to your clipboard.

R) The “Status” box. The status box contains diagnostics and operational messages from the Lingering Object Liquidator tool. Everything that is logged to the status box in the GUI is also mirrored to a text log file.

User-configurable Settings

The user-configurable settings in Lingering Object Liquidator are alluded to in the Status box when the application first starts.

This setting affects the “Detect Lingering Objects” scan. Lingering Object Liquidator establishes event log “subscriptions” to each Target DC that it needs to scan. The tool then waits for the DC to log an event (Event ID 1942 in the Directory Service event log) signaling that lingering object detection has completed for a specific naming context. Only once a certain number of those events (depending on your choices in the “Naming Context” drop-down menu) have been received from the remote domain controller does the tool know that the domain controller has been fully scanned. However, there is an overall timeout, and if the tool does not receive the requisite number of Event ID 1942 events in the allotted time, it “gives up” and proceeds to the next domain controller.
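The wait-for-N-events-with-an-overall-timeout pattern described above can be sketched like this (all names here are hypothetical; this is not the tool’s actual code):

```python
import queue
import time

def wait_for_events(event_queue, expected_count, timeout_seconds):
    """Wait until `expected_count` completion events arrive (think: one
    Event ID 1942 per naming context), or give up when the overall
    timeout expires and move on to the next DC."""
    deadline = time.monotonic() + timeout_seconds
    received = []
    while len(received) < expected_count:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return received, False          # timed out: DC not fully scanned
        try:
            received.append(event_queue.get(timeout=remaining))
        except queue.Empty:
            return received, False
    return received, True                    # DC fully scanned

# Usage sketch: two naming contexts report completion in time.
q = queue.Queue()
for nc in ("DC=contoso,DC=com", "CN=Configuration,DC=contoso,DC=com"):
    q.put({"event_id": 1942, "naming_context": nc})
events, complete = wait_for_events(q, expected_count=2, timeout_seconds=1)
print(complete)  # True
```

The key design point is the single overall deadline: each blocking wait is given only the time remaining, so a slow DC cannot stall the scan indefinitely.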

This setting sets the maximum number of threads to use during the “Detect Lingering Objects” scan. Using more threads may decrease the overall time it takes to complete a scan, especially in very large environments.
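A multithreaded scan of this shape might be sketched as follows (scan_dc is a hypothetical placeholder for the real per-DC work; the thread count plays the role of the user-configurable setting):

```python
from concurrent.futures import ThreadPoolExecutor

def scan_dc(dc_name):
    """Hypothetical stand-in for the per-DC lingering-object scan."""
    return f"{dc_name}: scanned"

def scan_all(dcs, max_threads=4):
    # max_threads caps concurrency, mirroring the tool's thread setting;
    # more threads can shorten the wall-clock time in large environments.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(scan_dc, dcs))

print(scan_all(["DC1", "DC2", "DC3"], max_threads=2))
```

pool.map preserves input order, so results line up with the DC list even though the scans run concurrently.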

Tips

The domain controllers must allow the network connectivity required for remote event log management for Lingering Object Liquidator to work. You can enable the required Windows Firewall rules with a line of PowerShell such as (assuming the built-in rule group name): Enable-NetFirewallRule -DisplayGroup "Remote Event Log Management"

Check the download site often for new versions! (There’s also a handy “check for updates” option in the tool.)

Final Words

We provide this tool because we at AskDS want your Active Directory lingering object removal experience to go as smoothly as possible. If you find any bugs or have any feature requests, please drop a note to our public contact alias.

Release Notes

v2.0.19:
– Initial release to the public.

v2.0.21:
– Added new radio buttons that allow the user more control over which lingering object removal method they want to use – the DsReplicaVerifyObjects method or removeLingeringObject method.
– Fixed issue with Export button not displaying the full path of the export file.
– Fixed crash when unexpected or corrupted data is returned from event log subscription.

Would you like to join the U.S. Directory Services team and work on the most technically challenging and interesting Active Directory problems? Do you want to be the next Ned Pyle or Linda Taylor?

Then read more…

We are an escalation team based out of Irving, Texas; Charlotte, North Carolina; and Fargo, North Dakota. We work with enterprise customers helping them resolve the most critical Active Directory infrastructure problems as well as enabling them to get the best of Microsoft Windows and Identity-related technologies. The work we do is no ordinary support – we work with a huge variety of customer environments and there are rarely two problems which are the same.

You will need strong AD knowledge and strong troubleshooting skills, along with great collaboration, teamwork and customer service skills.

Ryan Ries here, and today I have a relatively “hardcore” blog post that will not be for the faint of heart. However, it’s about an important topic.

The behavior surrounding security tokens and logon sessions has recently changed on all supported versions of Windows. IT professionals – developers and administrators alike – should understand what this new behavior is, how it can affect them, and how to troubleshoot it.

But first, a little background…

Figure 1 – Tokens

Windows uses security tokens (or access tokens) extensively to control access to system resources. Every thread running on the system uses a security token, and may own several at a time. Threads inherit the security tokens of their parent processes by default, but they may also use special security tokens that represent other identities in an activity known as impersonation. Since security tokens are used to grant access to resources, they should be treated as highly sensitive, because if a malicious user can gain access to someone else’s security token, they will be able to access resources that they would not normally be authorized to access.

Note: Here are some additional references you should read first if you want to know more about access tokens:

If you are an application developer, your application or service may want to create or duplicate tokens for the legitimate purpose of impersonating another user. A typical example would be a server application that wants to impersonate a client to verify that the client has permissions to access a file or database. The application or service must be diligent in how it handles these access tokens by releasing/destroying them as soon as they are no longer needed. If the code fails to call the CloseHandle function on a token handle, that token can then be “leaked” and remain in memory long after it is no longer needed.
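In outline, the release-as-soon-as-done discipline looks like this (Python stand-ins only; real code would use the Win32 token APIs such as DuplicateTokenEx, ImpersonateLoggedOnUser and CloseHandle):

```python
class TokenHandle:
    """Stand-in for a Win32 token handle (hypothetical; it only counts
    how many handles are currently open)."""
    _open_handles = 0

    def __init__(self):
        TokenHandle._open_handles += 1

    def close(self):
        TokenHandle._open_handles -= 1


def impersonate_and_check_access(resource):
    token = TokenHandle()       # duplicate the client's token
    try:
        # ... impersonate the client and check ACLs on the resource ...
        return f"access check on {resource}"
    finally:
        token.close()           # always release; a missed close leaks the
                                # token (and, post-MS16-111, pins its logon
                                # session in memory too)

impersonate_and_check_access(r"\\server\share\file.txt")
print(TokenHandle._open_handles)  # 0
```

The try/finally (or an RAII wrapper in C++) is the whole point: every code path, including error paths, must release the token.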

A locally authenticated attacker who successfully exploited the vulnerabilities could hijack the session of another user.
To exploit the vulnerabilities, the attacker could run a specially crafted application.
The update corrects how Windows handles session objects to prevent user session hijacking.

Those vulnerabilities were fixed with that update, and I won’t further expound on the “hacking/exploiting” aspect of this topic. We’re here to explore this from a debugging perspective.

This update is significant because it changes how the relationship between tokens and logon sessions is treated across all supported versions of Windows going forward. Applications and services that erroneously leak tokens have always been with us, but the penalty paid for leaking tokens is now greater than before. After MS16-111, when security tokens are leaked, the logon sessions associated with those security tokens also remain on the system until all associated tokens are closed… even after the user has logged off the system. If the tokens associated with a given logon session are never released, then the system now also has a permanent logon session leak as well. If this leak happens often enough, such as on a busy Remote Desktop/Terminal Server where users are logging on and off frequently, it can lead to resource exhaustion on the server, performance issues and denial of service, ultimately causing the system to require a reboot to be returned to service.

Therefore, it’s more important than ever to be able to identify the symptoms of token and session leaks, track down token leaks on your systems, and get your application vendors to fix them.

How Do I Know If My Server Has Leaks?

As mentioned earlier, this problem affects heavily-utilized Remote Desktop Session Host servers the most, because users are constantly logging on and logging off the server. The issue is not limited to Remote Desktop servers, but symptoms will be most obvious there.

Figuring out that you have logon session leaks is the easy part. Just run qwinsta at a command prompt:

Figure 2 – qwinsta

Pay close attention to the session ID numbers, and notice the large gap between session 2 and session 152. This is the clue that the server has a logon session leak problem. The next user that logs on will get session 153, the next will get session 154, and so on, but the leaked session IDs will never be reused. We have about 150 “leaked” sessions in the screenshot above: no one is logged on to those sessions, no one will ever be able to log on to them again (until a reboot), yet they remain on the system indefinitely. This means each user who logs on to the system is inadvertently leaving tokens lying around in memory, probably because some application or service on the system duplicated the user’s token and didn’t release it. These leaked sessions will remain unusable and soak up system resources, and the problem will only get worse as users continue to log on. In an optimal situation with no leaks, sessions 3-151 would have been destroyed after the users logged out, and the resources consumed by those sessions would be reusable by subsequent logons.
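The gap-spotting arithmetic can be sketched as follows (the session IDs here are made up, and parsing real qwinsta output is left out):

```python
def leaked_session_estimate(session_ids):
    """Count session IDs in the gap below the current high-water mark
    that are no longer in use; since leaked IDs are never reused, each
    missing ID below the maximum represents a leaked logon session."""
    active = set(session_ids)
    return len(set(range(min(active), max(active) + 1)) - active)

# e.g. sessions 0, 1, 2 and 53 still active
print(leaked_session_estimate([0, 1, 2, 53]))  # 50
```

On a healthy server the active IDs are densely packed, so the estimate is near zero; a large number here is the same smell as the qwinsta gap above.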

How Do I Find Out Who’s Responsible?

Now that you know you have a problem, next you need to track down the application or service that is responsible for leaking access tokens. When an access token is created, the token is associated to the logon session of the user who is represented by the token, and an internal reference count is incremented. The reference count is decremented whenever the token is destroyed. If the reference count never reaches zero, then the logon session is never destroyed or reused. Therefore, to resolve the logon session leak problem, you must resolve the underlying token leak problem(s). It’s an all-or-nothing deal. If you fix 10 token leaks in your code but miss 1, the logon session leak will still be present as if you had fixed none.
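The all-or-nothing reference counting described above can be modeled in a few lines (a sketch of the concept, not the kernel’s actual implementation):

```python
class LogonSession:
    """Sketch: a logon session whose lifetime is tied to a token refcount."""
    def __init__(self, logon_id):
        self.logon_id = logon_id
        self.ref_count = 0          # outstanding token references

    def add_token(self):
        self.ref_count += 1

    def release_token(self):
        self.ref_count -= 1
        return "session destroyed" if self.ref_count == 0 else "session still alive"

# Fix 10 of 11 token leaks and the session still leaks:
s = LogonSession(0x4567)
for _ in range(11):
    s.add_token()
for _ in range(10):
    s.release_token()
print(s.ref_count)  # 1: the logon session can never be destroyed
```

One stray reference is enough to pin the session forever, which is why every single token leak has to be found and fixed.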

Before we proceed: I would recommend debugging this issue on a lab machine, rather than on a production machine. If you have a logon session leak problem on your production machine, but don’t know where it’s coming from, then install all the same software on a lab machine as you have on the production machine, and use that for your diagnostic efforts. You’ll see in just a second why you probably don’t want to do this in production.

The first step to tracking down the token leaks is to enable token leak tracking on the system.

The registry setting (the SeTokenLeakDiag value, a REG_DWORD set to 1 under HKLM\SYSTEM\CurrentControlSet\Control\Lsa) won’t exist by default unless you’ve done this before, so create it. It also did not exist prior to MS16-111, so don’t expect it to do anything if the system does not have MS16-111 installed. This registry setting enables extra accounting on token issuance that you will be able to detect in a debugger, and there may be a noticeable performance impact on busy servers. Therefore, it is not recommended to leave this setting in place unless you are actively debugging a problem. (i.e. don’t do it in production, exhibit A.)

Prior to the existence of this registry setting, token leak tracing of this kind used to require a checked build of Windows. And Microsoft does not appear to be releasing a checked build of Server 2016, so… good timing.

Next, you need to configure the server to take a full or kernel memory dump when it crashes. (A live kernel debug may also be an option, but that is outside the scope of this article.) I recommend using DumpConfigurator to configure the computer for complete crash dumps. A kernel dump should be enough to see most of what we need, but get a Complete dump if you can.

Figure 3 – DumpConfigurator

Then reboot the server for the settings to take effect.

Next, you need users to log on and off the server so that the logon session IDs continue to climb. Since you’re doing this in a lab environment, you might want to use a script to automatically log on and log off a set of test users. (I provided a sample script for you here.) Make sure you’ve waited 10 minutes after the users have logged off to verify that their logon sessions are permanently leaked before proceeding.

Finally, crash the box. Yep, just crash it. (i.e. don’t do it in production, exhibit B.) On a physical machine, this can be done by holding the right Ctrl key and pressing Scroll Lock twice, if you configured the appropriate setting with DumpConfigurator earlier. If this is a Hyper-V machine, you can use the following PowerShell cmdlet on the Hyper-V host:

Debug-VM -VM (Get-VM RDS1) -InjectNonMaskableInterrupt

You may have at your disposal other means of getting a non-maskable interrupt to the machine, such as an out-of-band management card (iLO/DRAC, etc.,) but the point is to deliver an NMI to the machine, and it will bugcheck and generate a memory dump.

Now transfer the memory dump file (C:\Windows\Memory.dmp usually) to whatever workstation you will use to perform your analysis.

Note: Memory dumps may contain sensitive information, such as passwords, so be mindful when sharing them with strangers.

Next, install the Windows Debugging Tools on your workstation if they’re not already installed. I downloaded mine for this demo from the Windows Insider Preview SDK here. But they also come with the SDK, the WDK, WPT, Visual Studio, etc. The more recent the version, the better.

Next, download the MEX Debugging Extension for WinDbg. Engineers within Microsoft have been using the MEX debugger extension for years, but only recently has a public version of the extension been made available. The public version is stripped-down compared to the internal version, but it’s still quite useful. Unpack the file and place mex.dll into your C:\Debuggers\winext directory, or wherever you installed WinDbg.

Now, ensure that your symbol path is configured correctly to use the Microsoft public symbol server within WinDbg:

Figure 4 – Example Symbol Path in WinDbg

The example symbol path above (typically something like srv*C:\Symbols*https://msdl.microsoft.com/download/symbols) tells WinDbg to download symbols from the specified URL and store them in your local C:\Symbols directory.

Finally, you are ready to open your crash dump in WinDbg:

Figure 5 – Open Crash Dump from WinDbg

After opening the crash dump, the first thing you’ll want to do is load the MEX debugging extension that you downloaded earlier, by typing the command:

Figure 6 – .load mex

The next thing you probably want to do is start a log file. It will record everything that goes on during this debugging session, so that you can refer to it later in case you forgot what you did or where you left off.

Figure 7 – !logopen

Another useful command that is among the first things I always run is !DumpInfo, abbreviated !di, which simply gives some useful basic information about the memory dump itself, so that you can verify at a glance that you’ve got the correct dump file, which machine it came from and what type of memory dump it is.

Figure 8 – !DumpInfo

You’re ready to start debugging.

At this point, I have good news and I have bad news.

The good news is that there already exists a super-handy debugger extension that lists all the logon session kernel objects, their associated token reference counts, what process was responsible for creating the token, and even the token creation stack, all with a single command! It’s !kdexts.logonsession, and it is awesome.

The bad news is that it doesn’t work… not with public symbols. It only works with private symbols. Here is what it looks like with public symbols:

Since public symbols are all you have unless you work at Microsoft (and we wish you did), I’m going to teach you how to do what !kdexts.logonsession does, manually. The hard way. Plus some extra stuff. Buckle up.

First, you should verify whether token leak tracking was turned on when this dump was taken. (That was the registry setting mentioned earlier.)

Figure 10 – x nt!SeTokenLeakTracking = <no type information>

OK… That was not very useful. We’re getting <no type information> because we’re using public symbols. But this symbol corresponds to the SeTokenLeakDiag registry setting that we configured earlier, and we know that’s just 0 or 1, so we can just guess what type it is:

Figure 11 – db nt!SeTokenLeakTracking L1

The db command means “dump bytes.” (dd, or “dump DWORDs,” would have worked just as well.) You should have a symbol for nt!SeTokenLeakTracking if you configured your symbol path properly, and the L1 tells the debugger to dump just the first byte it finds. It should be either 0 or 1. If it’s 0, then the registry setting that we talked about earlier was not set properly, and you can basically discard this dump file and get a new one. If it’s 1, you’re in business and may proceed.

Next, you need to locate the logon session lists.

Figure 12 – dp nt!SepLogonSessions L1

Like the previous step, dp means “display pointer,” then the name of the symbol, and L1 to just display a single pointer. The 64-bit value on the right is the pointer, and the 64-bit value on the left is the memory address of that pointer.

Now we know where our lists of logon sessions begin. (Lists, plural.)

The SepLogonSessions pointer points to not just a list, but an array of lists. These lists are made up of _SEP_LOGON_SESSION_REFERENCES structures.

Using the dps command (display contiguous pointers) and specifying the beginning of the array that we got from the last step, we can now see where each of the lists in the array begins:

Figure 13 – dps 0xffffb808`3ea02650 – displaying pointers that point to the beginning of each list in the array

If there were not very many logon sessions on the system when the memory dump was taken, you might notice that not all the lists are populated:

Figure 14 – Some of the logon session lists are empty because not very many users had logged on in this example

The array doesn’t fill up contiguously, which is a bummer. You’ll have to skip over the empty lists.

If we wanted to walk just the first list in the array (we’ll talk more about dt and linked lists in just a minute,) it would look something like this:

Figure 15 – Walking the first list in the array and using !grep to filter the output

Notice that I used the !grep command to filter the output for the sake of brevity and readability. It’s part of the Mex debugger extension. I told you it was handy. If you omit the !grep AccountName part, you would get the full, unfiltered output. I chose “AccountName” arbitrarily as a keyword because I knew that was a word that was unique to each element in the list. !grep will only display lines that contain the keyword(s) that you specify.

Next, if we wanted to walk through the entire array of lists all at once, it might look something like this:

Figure 16 – Walking through the entire array of lists!

OK, I realize that I just went bananas there, but I’ll explain what just happened step-by-step.

When you are using the Mex debugger extension, you have access to many new text parsing and filtering commands that can truly enhance your debugging experience. When you look at a long command like the one I just showed, read it from right to left. The commands on the right are fed into the command to their left.

So from right to left, let’s start with !cut -f 2 dps ffffb808`3ea02650

We already showed what the dps <address> command did earlier. The !cut -f 2 command filters that command’s output so that it only displays the second part of each line separated by whitespace. So essentially, it will display only the pointers themselves, and not their memory addresses.

Like this:

Figure 17 – Using !cut to select just the second token in each line of output

Then that is “piped” line-by-line into the next command to the left, which was:

!fel -x “dt nt!_SEP_LOGON_SESSION_REFERENCES @#Line -l Next”

!fel is an abbreviation for !foreachline.

This command instructs the debugger to execute the given command for each line of output supplied by the previous command, where the @#Line pseudo-variable represents the individual line of output. For each line of output that came from the dps command, we are going to use the dt command with the -l parameter to walk that list. (More on walking lists in just a second.)

Next, we use the !grep command to filter all of that output so that only a single unique line is shown from each list element, as I showed earlier.

Finally, we use the !count -q command to suppress all of the output generated up to that point, and instead only tell us how many lines of output it would have generated. This should be the total number of logon sessions on the system.

And 380 was in fact the exact number of logon sessions on the computer when I collected this memory dump. (Refer to Figure 16.)
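Conceptually, the whole pipeline (walk every list in the array and count the nodes) reduces to something like this Python model (the real structures live in kernel memory; names are simplified):

```python
class SessionNode:
    """Model of _SEP_LOGON_SESSION_REFERENCES: singly linked via Next."""
    def __init__(self, account, nxt=None):
        self.AccountName = account
        self.Next = nxt

def count_sessions(array_of_list_heads):
    total = 0
    for head in array_of_list_heads:   # like dps: one pointer per list
        node = head                    # empty slots in the array are None
        while node is not None:        # like dt ... -l Next: walk the list
            total += 1
            node = node.Next
    return total

lists = [SessionNode("SYSTEM", SessionNode("alice")), None, SessionNode("bob")]
print(count_sessions(lists))  # 3
```

The !cut / !fel / !grep / !count chain is just this nested loop, expressed as debugger text-processing commands instead of code.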

Alright… now let’s take a deep breath and a step back. We just walked an entire array of lists of structures with a single line of commands. But now we need to zoom in and take a closer look at the data structures contained within those lists.

Remember, ffffb808`3ea02650 was the very beginning of the entire array.

Let’s examine just the very first _SEP_LOGON_SESSION_REFERENCES entry of the first list, to see what such a structure looks like:

Figure 18 – dt _SEP_LOGON_SESSION_REFERENCES* ffffb808`3ea02650

That’s a logon session!

Let’s go over a few of the basic fields in this structure. (Skipping some of the more advanced ones.)

Next: This is a pointer to the next element in the list. You might notice that there’s a “Next,” but there’s no “Previous.” So, you can only walk the list in one direction. This is a singly-linked list.

LogonId: Every logon session gets a unique one (a locally unique identifier, or LUID). For example, 0x3e7 is always the "System" logon.

ReferenceCount: This is how many outstanding token references this logon session has. This is the number that must reach zero before the logon session can be destroyed. In our example, it’s 4.

AccountName: The user who does or used to occupy this session.

AuthorityName: Will be the user’s Active Directory domain, typically. Or the computer name if it’s a local account.

TokenList: This is a circular, doubly-linked list of the tokens that are associated with this logon session. The number of tokens in this list should match the ReferenceCount.

The following is an illustration of a doubly-linked list:

Figure 19 – A circular, doubly-linked list

“Flink” stands for Forward Link, and “Blink” stands for Back Link.

So now that we understand that the TokenList member of the _SEP_LOGON_SESSION_REFERENCES structure is a linked list, here is how you walk that list:

Figure 20 – dt nt!_LIST_ENTRY* 0xffffb808`500bdba0+0x0b0 -l Flink

The dt command stands for “display type,” followed by the symbol name of the type that you want to cast the following address to. The reason why we specified the address 0xffffb808`500bdba0 is because that is the address of the _SEP_LOGON_SESSION_REFERENCES object that we found earlier. The reason why we added +0x0b0 after the memory address is because that is the offset from the beginning of the structure at which the TokenList field begins. The -l parameter specifies that we’re trying to walk a list, and finally you must specify a field name (Flink in this case) that tells the debugger which field to use to navigate to the next node in the list.

We walked a list of tokens and what did we get? A list head and 4 data nodes, 5 entries total, which lines up with the ReferenceCount of 4 tokens that we saw earlier. One of the nodes won’t have any data – that’s the list head.
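The TokenList walk can be modeled the same way. Below is a small Python sketch, with made-up node contents, of a circular doubly-linked list with a list head, walked by following Flink until we arrive back at the head, just as dt -l Flink does:

```python
class ListEntry:
    """Mimics nt!_LIST_ENTRY: Flink/Blink pointers plus a payload.
    The list head carries no payload."""
    def __init__(self, payload=None):
        self.Flink = self
        self.Blink = self
        self.payload = payload

def insert_tail(head, entry):
    # Standard circular doubly-linked insertion just before the head.
    entry.Blink = head.Blink
    entry.Flink = head
    head.Blink.Flink = entry
    head.Blink = entry

def walk_flink(head):
    """Follow Flink from the head until we return to it; returns every
    entry touched, including the head itself."""
    entries = [head]
    node = head.Flink
    while node is not head:   # circular: the walk ends back at the head
        entries.append(node)
        node = node.Flink
    return entries

head = ListEntry()                # the TokenList head
for i in range(4):                # four token references
    insert_tail(head, ListEntry(payload=f"token{i}"))

entries = walk_flink(head)
print(len(entries))                                  # 5: head + 4 nodes
print(sum(e.payload is not None for e in entries))   # 4 == ReferenceCount
```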

Now, for each entry in the linked list, we can examine its data. We know the payloads that these list nodes carry are tokens, so we can use dt to cast them as such:

Figure 21 – dt _TOKEN*0xffffb808`4f565f40+8+8 – Examining the first token in the list

The reason for the +8+8 on the end is that 16 bytes is the offset of the payload: it sits just after the two 8-byte Flink and Blink pointers shown in Figure 19, and you want to skip over them.
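To make the offset arithmetic concrete, here is a hypothetical ctypes layout (64-bit pointers assumed, field names invented) showing that the payload of such a node begins at offset 16, i.e. +8+8:

```python
import ctypes

# On 64-bit Windows, a _LIST_ENTRY is two 8-byte pointers: Flink then
# Blink. The payload of a node starts right after them, at offset 16,
# which is why +8+8 is added to the node address before casting to _TOKEN.
class LIST_NODE(ctypes.Structure):
    _fields_ = [
        ("Flink",   ctypes.c_uint64),  # offset 0
        ("Blink",   ctypes.c_uint64),  # offset 8
        ("Payload", ctypes.c_uint64),  # offset 16 = +8+8
    ]

print(LIST_NODE.Flink.offset)    # 0
print(LIST_NODE.Blink.offset)    # 8
print(LIST_NODE.Payload.offset)  # 16
```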

We can see that this token is associated with SessionId 0x136/0n310. (Remember, I had 380 leaked sessions in this dump.) If you examine the UserAndGroups member by clicking its DML link, you can then use !sid to see the SID of the user this token represents:

Figure 22 – Using !sid to see the security identifier in the token

The token also has a DiagnosticInfo structure, which is super-interesting, and is the coolest thing that we unlocked when we set the SeTokenLeakDiag registry setting on the machine earlier. Let’s look at it:

Figure 23 – Examining the DiagnosticInfo structure of the first token

We now have the process ID and the thread ID that were responsible for creating this token! We could examine the ImageFileName, or we could use the ProcessCid to see who it is:

Figure 24 – Using !mex.tasklist to find a process by its PID

Oh… Whoops. Looks like this particular token leak is lsass’s fault. You’re just going to have to let the *ahem* application vendor take care of that one.

Let’s move on to a different token leak. We’re moving on to a different memory dump file as well, so the memory addresses are going to be different from here on out.

I created a special token-leaking application specifically for this article. It looks like this:

Figure 25 – RyansTokenGrabber.exe

It monitors the system for users logging on, and as soon as they do, it duplicates their token via the DuplicateToken API call. I purposely never release those tokens, so if I collect a memory dump of the machine while this is running, then evidence of the leak should be visible in the dump, using the same steps as before.

Using the same debugging techniques I just demonstrated, I verified that I have leaked logon sessions in this memory dump as well, and each leaked session has an access token reference that looks like this:

Figure 26 – A _TOKEN structure shown with its attached DiagnosticInfo

And then by looking at the token’s DiagnosticInfo, we find that the guilty party responsible for leaking this token is indeed RyansTokenGrabber.exe:

Figure 27 – The process responsible for leaking this token

By this point you know who to blame. Now you can go find the author of RyansTokenGrabber.exe and show them the stone-cold evidence you've collected: their application is leaking access tokens, which leads to logon session leaks, which forces you to reboot your server every few days. That is a ridiculous and inconvenient thing to have to do, and you shouldn't stand for it!

We're almost done, but I have one last trick to show you.

If you examine the StackTrace member of the token’s DiagnosticInfo, you’ll see something like this:

Figure 28 – DiagnosticInfo.CreateTrace

This is a stack trace: a snapshot of all the function calls that led up to this token's creation. These stack traces grow upward, so the function at the top of the stack was called last. But the function addresses are not resolving, so we must do a little more work to figure out the names of the functions.

First, clean up the output of the stack trace:

Figure 29 – Using !grep and !cut to clean up the output

Now, using all the snazzy new Mex magic you've learned, see if you can unassemble (that's the u command) each address to see if it resolves to a function name:

Figure 30 – Unassemble instructions at each address in the stack trace

The output continues beyond what I’ve shown above, but you get the idea.

The function on top of the trace will almost always be SepDuplicateToken, but could also be SepCreateToken or SepFilterToken, and whether one creation method was used versus another could be a big hint as to where in the program’s code to start searching for the token leak. You will find that the usefulness of these stacks will vary wildly from one scenario to the next, as things like inlined functions, lack of symbols, unloaded modules, and managed code all influence the integrity of the stack. However, you (or the developer of the application you’re using) can use this information to figure out where the token is being created in this program, and fix the leak.
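Conceptually, what u does for us here is a nearest-symbol lookup: find the symbol with the highest start address at or below the given return address. A Python sketch with an entirely invented symbol table:

```python
import bisect

# Hypothetical symbol table: (start address, name), sorted by address.
symbols = [
    (0xFFFFF8000100_0000, "nt!SeAccessCheck"),
    (0xFFFFF8000100_4000, "nt!SepCreateToken"),
    (0xFFFFF8000100_9000, "nt!SepDuplicateToken"),
]
addrs = [a for a, _ in symbols]

def resolve(addr):
    """Return 'name+offset' for the nearest symbol at or below addr."""
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None  # address below every known symbol
    start, name = symbols[i]
    return f"{name}+0x{addr - start:x}"

print(resolve(0xFFFFF8000100_9010))  # nt!SepDuplicateToken+0x10
```

The debugger does this against real PDB symbol data, which is why missing symbols or unloaded modules leave you with raw, unresolvable addresses.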

Alright, that’s it. If you’re still reading this, then… thank you for hanging in there. I know this wasn’t exactly a light read.

And lastly, allow me to reiterate that this is not just a contrived, unrealistic scenario; there's a lot of software out there on the market that does this kind of thing. And if you happen to write such software, then I really hope you read this blog post. It may help you improve the quality of your software in the future. Windows needs application developers to be "good citizens" and avoid writing software that can destabilize the operating system. Hopefully this blog post helps someone out there do just that.

I have spent the last month working with customers worldwide who experienced password change failures after installing the updates in the MS16-101 security bulletin KBs (listed below), as well as working with the product group to get those issues addressed and documented in the public KB articles under the known issues section. It has been busy!

In this post I will aim to provide you with a quick “cheat sheet” of known issues and needed actions as well as ideas and troubleshooting techniques to get there.

Let’s start by understanding the changes.

The following 6 articles describe the changes in MS16-101 as well as a list of Known issues. If you have not yet applied MS16-101 I would strongly recommend reading these and understanding how they may affect you.

The good news is that this month’s updates address some of the known issues with MS16-101.

The bad news is that not all of the issues are caused by a code defect in MS16-101. In some cases the right solution is to make your environment more secure by ensuring that the password change can happen over Kerberos and does not need to fall back to NTLM. That may include opening the TCP ports used by Kerberos, fixing other Kerberos problems like missing SPNs, or changing your application code to pass in a valid domain name.

Let’s start with the basics…

Symptoms:

"The system detected a possible attempt to compromise security. Please make sure that you can contact the server that authenticated you." or "The system cannot contact a domain controller to service the authentication request. Please try again later."

This text maps to the error codes below:

Hexadecimal / Decimal / Symbolic / Friendly:

0xC0000388 / -1073740920 / STATUS_DOWNGRADE_DETECTED:
"The system detected a possible attempt to compromise security. Please make sure that you can contact the server that authenticated you."

0x800704F1 / 1265 / ERROR_DOWNGRADE_DETECTED:
"The system detected a possible attempt to compromise security. Please make sure that you can contact the server that authenticated you."
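The relationship between these columns is mechanical: interpreting the 32-bit NTSTATUS as a signed integer yields the decimal form, and wrapping the Win32 error code in an HRESULT (the HRESULT_FROM_WIN32 macro) yields the 0x8007xxxx form. A quick Python illustration:

```python
def ntstatus_to_signed(status):
    """Interpret a 32-bit NTSTATUS value as a signed integer."""
    return status - (1 << 32) if status & 0x80000000 else status

def win32_to_hresult(code):
    """HRESULT_FROM_WIN32: FACILITY_WIN32 (7) with the severity bit set."""
    return 0x80070000 | code

print(ntstatus_to_signed(0xC0000388))   # -1073740920
print(hex(win32_to_hresult(1265)))      # 0x800704f1
```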

Question: What does MS16-101 do and why would password changes fail after installing it?

Answer: As documented in the listed KB articles, the security updates provided in MS16-101 disable the ability of the Microsoft Negotiate SSP to fall back to NTLM for password change operations when Kerberos fails with the STATUS_NO_LOGON_SERVERS (0xC000005E) error code.
In this situation, the password change will now fail (post-MS16-101) with the above-mentioned error codes (ERROR_DOWNGRADE_DETECTED / STATUS_DOWNGRADE_DETECTED).

Important: Password RESET is not affected by MS16-101 at all in any scenario. Only password change using the Negotiate package is affected.
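The decision Negotiate now makes can be condensed into a few lines of hedged Python pseudologic. This is a model of the documented behavior, not Microsoft's implementation; the allow_ntlm_fallback_key parameter stands in for the NegoAllowNtlmPwdChangeFallback registry key discussed later in this post:

```python
STATUS_SUCCESS            = 0x00000000
STATUS_NO_LOGON_SERVERS   = 0xC000005E
STATUS_DOWNGRADE_DETECTED = 0xC0000388

def negotiate_password_change(kerberos_status, ms16_101_installed,
                              allow_ntlm_fallback_key=False):
    """Model of Negotiate's decision for a password CHANGE (not reset)."""
    if kerberos_status != STATUS_NO_LOGON_SERVERS:
        # Kerberos reached a DC: its result (success or a real failure
        # like STATUS_PASSWORD_RESTRICTION) is passed through unchanged.
        return kerberos_status
    if not ms16_101_installed or allow_ntlm_fallback_key:
        # Pre-patch (or opted back in): silently fall back to NTLM,
        # which typically succeeds.
        return STATUS_SUCCESS
    # Post-patch: the NTLM fallback is forbidden.
    return STATUS_DOWNGRADE_DETECTED

# Before MS16-101, "no reachable DC" still worked via NTLM:
print(hex(negotiate_password_change(STATUS_NO_LOGON_SERVERS, False)))  # 0x0
# After MS16-101, the same situation surfaces the downgrade error:
print(hex(negotiate_password_change(STATUS_NO_LOGON_SERVERS, True)))   # 0xc0000388
```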

So, now you understand the change, let’s look at the known issues and learn how to best identify and resolve those.

Summary and Cheat Sheet

To make it easier to follow I have matched the ordering of known issues in this post with the public KB articles above.

First, when troubleshooting a failed password change post MS16-101 you will need to understand HOW and WHERE the password change is happening and if it is for a domain account or a local account. Here is a cheat sheet.

Summary of scenarios and a quick reference table of actions needed.

1. Domain password change fails via CTRL+ALT+DEL with the text: "The system detected a possible attempt to compromise security. Please ensure that you can contact the server that authenticated you."

Troubleshoot using this guide and fix Kerberos.

2. Domain password change fails via application code with an INCORRECT/UNEXPECTED error code when a password that does not meet password complexity requirements is entered.

For example, before installing MS16-101, such a password change may have returned a status like STATUS_PASSWORD_RESTRICTION; after installing MS16-101 it returns STATUS_DOWNGRADE_DETECTED, causing your application to behave in an unexpected way or even crash.

Note: In these cases password change works ok when correct new password is entered that complies with the password policy.

4. Passwords for disabled and locked-out user accounts cannot be changed using the Negotiate method.

None. By design.

5. Domain password change fails via application code when a good password is entered.

This is the case where you pass a server name to NetUserChangePassword: post-MS16-101, the password change fails, because it previously worked only by relying on the NTLM fallback. NTLM is insecure and Kerberos is always preferred, so passing a valid domain name here is the way forward.

One thing to note here is that most of the ADSI and C#/.NET ChangePassword APIs end up calling NetUserChangePassword under the hood. Therefore, passing invalid domain names to these APIs will also fail. I have provided a detailed walkthrough example in this post with log snippets.

Troubleshoot using this guide and fix code to use Kerberos.

6. After you install the MS16-101 update, you may encounter 0xC0000022 NTLM authentication errors.

After you install the security updates that are described in MS16-101, remote programmatic changes of local user account passwords, and password changes across untrusted forests, fail with the STATUS_DOWNGRADE_DETECTED error as documented in this post.

This happens because the operation relies on NTLM fallback, since there is no Kerberos without a trust, and NTLM fallback is forbidden by MS16-101.

For this scenario you will need to install the October fixes in the table below and set the registry key NegoAllowNtlmPwdChangeFallback, documented in the KBs below, which allows the NTLM fallback to happen again and unblocks this scenario.

Note: you may also consider using this registry key in an emergency for Known Issue #5 when it takes time to update the application code. However, please read the above articles carefully and only consider this a short-term solution for scenario 5.

Troubleshooting

As I mentioned, this post is intended to supplement the documentation of the known issues in the MS16-101 KB articles and provide help and guidance for troubleshooting. It should help you identify which known issue you are experiencing, as well as provide resolution suggestions for each case.

I have also included a troubleshooting walkthrough of some of the more complex example cases. We will start with the problem definition, and then look at the available logs and tools to identify a suitable resolution. The idea is to teach "how to fish", because there can be many different scenarios, and hopefully you can apply these techniques and use the log files documented here to help resolve the issues when needed.

Once you know the scenario that you are using for the password change, the next step is usually to collect some data on the server or client where the password change is occurring. For example, if you have a web server running a password change application that changes passwords on behalf of users, you will need to collect the logs there. If in doubt, collect the logs from all involved machines and then identify the one doing the password change using the snippets in the examples. Here are the helpful logs.

DATA COLLECTION

The same logs will help in all the scenarios.

LOGS

1. SPNEGO debug log / LSASS.log

To enable this log, set the Negotiate debug registry keys documented in the MS16-101 KB articles from an elevated admin CMD prompt.

This will log Negotiate debug output to %windir%\system32\lsass.log.

There is no need for reboot. The log is effective immediately.

Lsass.log is a text file that is easy to read with a text editor such as Wordpad.

2. Netlogon.log:

This log has been around for many years and is useful for troubleshooting DC LOCATOR traffic. It can be used together with a network trace to understand why the STATUS_NO_LOGON_SERVERS is being returned for the Kerberos password change attempt.

· To enable Netlogon debug logging run the following command from an elevated CMD prompt:

nltest /dbflag:0x26FFFFFF

· The resulting log is found in %windir%\debug\netlogon.log & netlogon.bak

· There is no need for a reboot. The log is effective immediately. See also KB 109626, Enabling debug logging for the Net Logon service

· The Netlogon.log (and Netlogon.bak) is a text file.

Open the log with any text editor (I like good old Notepad.exe)

3. Collect a Network trace during the password change issue using the tool of your choice.

Scenarios, Explanations and Walkthroughs:

When reading this you should keep in mind that you may be seeing more than one scenario. The best thing to do is to start with one, fix that and see if there are any other problems left.

1. Domain password change fails via CTRL+ALT+DEL

This is most likely a Kerberos DC locator failure of some kind where the password changes were relying on NTLM before installing MS16-101 and are now failing. This is the simplest and easiest case to resolve using basic Kerberos troubleshooting methods.

Solution: Fix Kerberos.

Some tips from cases which we saw:

1. Use the Network trace to identify if the necessary communication ports are open. This was quite a common issue. So start by checking this.

In order for Kerberos password changes to work, communication on TCP port 464 needs to be open between the client doing the password change and the domain controller.

Note on RODCs: Read-only domain controllers (RODCs) can service password changes if the user is allowed by the RODC's password replication policy. Users who are not allowed by the RODC password policy require network connectivity to a read/write domain controller (RWDC) in the user account domain to be able to change the password.

If you find signs of blocked traffic in the trace, investigate the firewall and open the necessary ports. It is often useful to take a simultaneous trace from the client and the domain controller and check whether the packets are arriving at the other end.

2. Make sure that the target Kerberos names are valid.

IP addresses are not valid Kerberos names.

Kerberos supports short names and fully qualified domain names, like CONTOSO or contoso.com.

3. Make sure that service principal names (SPNs) are registered correctly.

2. Domain password change fails via application code with an INCORRECT/UNEXPECTED Error code when a password which does not meet password complexity is entered.

For example, before installing MS16-101, such a password change may have returned a status like STATUS_PASSWORD_RESTRICTION. After installing MS16-101 it returns STATUS_DOWNGRADE_DETECTED, causing your application to behave in an unexpected way or even crash.

Note: In this scenario, password change succeeds when correct new password is entered that complies with the password policy.

Cause:

This issue is caused by a code defect in ADSI whereby the status returned from Kerberos was not correctly returned to the caller by ADSI.
Here is a more detailed explanation of this one for the geek in you:

Before MS16-101 behavior:

1. An application calls the ChangePassword method using the ADSI LDAP provider. (Setting and changing passwords with the ADSI LDAP provider is documented here.) Under the hood this calls Negotiate/Kerberos to change the password using a valid realm name. Kerberos returns STATUS_PASSWORD_RESTRICTION or another failure code.

2. A second ChangePassword call is made via the NetUserChangePassword API with a <dcname> as the realm name, which uses Negotiate and retries Kerberos. Kerberos fails with STATUS_NO_LOGON_SERVERS, because a DC name is not a valid realm name.

3. Negotiate then retries over NTLM, which succeeds or returns the same previous failure status.

The password change fails if a bad password was entered, and the NTLM error code is returned to the application. If a valid password was entered, everything works: the first ChangePassword call passes in a good name, and if Kerberos works, the password change succeeds and you never enter step 3.

Post-MS16-101 behavior / why it fails with MS16-101 installed:

1. An application calls the ChangePassword method using the ADSI LDAP provider. This calls Negotiate for the password change with a valid realm name. Kerberos returns STATUS_PASSWORD_RESTRICTION or another failure code.

2. A second ChangePassword call is made via NetUserChangePassword with a <dcname> as the realm name, which fails over Kerberos with STATUS_NO_LOGON_SERVERS and triggers NTLM fallback.

3. Because NTLM fallback is blocked by MS16-101, the error STATUS_DOWNGRADE_DETECTED is returned to the calling application.

Solution: Easy. Install the October update, which fixes this issue. The fix lies in adsmsext.dll, included in the October updates.

3. Local user account password changes fail after installing MS16-101.

MS16-101 had a defect whereby Negotiate did not correctly determine that the password change was local, and would try to find a DC using the local machine name as the domain name.

This failed, and NTLM fallback was no longer allowed post-MS16-101. Therefore, the password changes failed with STATUS_DOWNGRADE_DETECTED.

Example:

One such scenario I saw, where password changes of local user accounts via CTRL+ALT+DEL failed with the message "The system detected a possible attempt to compromise security. Please ensure that you can contact the server that authenticated you.", was when the following group policy was set and you tried to change the password of a local account:

"." (without quotes). The period, or "dot", designates the local machine name

Notes

Cause: In this case, post-MS16-101, Negotiate incorrectly determined that the account was not local and tried to discover a DC using \\<machinename> as the domain, and failed. This caused the password change to fail with the STATUS_DOWNGRADE_DETECTED error.

Solution: Install the October fixes listed in the table at the top of this post.

4. Passwords for disabled and locked-out user accounts cannot be changed using the Negotiate method.

Important: Password reset is not affected by MS16-101 at all in any scenario; only password change is. Therefore, any application doing a password reset is unaffected by MS16-101.

Another important thing to note is that MS16-101 only affects applications using Negotiate. Therefore, it is possible to change locked-out and disabled account passwords using other methods, such as LDAP.

For example, the PowerShell cmdlet Set-ADAccountPassword will continue to work for locked out and disabled account password changes as it does not use Negotiate.

There are two possibilities here:

(a) The application code is passing an incorrect domain name parameter, causing the Kerberos password change to fail to locate a DC.

(b) The application code is good, and the Kerberos password change fails for another reason, like a blocked port, a DNS issue, or a missing SPN.

Let’s start with (a) The application code is passing an incorrect domain name/parameter causing Kerberos password change to fail to locate a DC.

(a) Data Analysis Walkthrough Example based on a real case:

1. Start with Lsass.log (SPNEGO trace)

If you are troubleshooting a password change failure after MS16-101, look for the following text in lsass.log to indicate that Kerberos failed and NTLM fallback was forbidden by MS16-101:

0xc000005E is STATUS_NO_LOGON_SERVERS
0xc0000388 is STATUS_DOWNGRADE_DETECTED

If you see this, it means Kerberos failed to locate a domain controller in the domain, and fallback to NTLM is not allowed by MS16-101. Next you should look at the Netlogon.log and the network trace to understand why.
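When the lsass.log is large, a few lines of Python can flag the tell-tale pair of status codes. The sample lines below are invented, since the real log format is noisier:

```python
import re

# 0xC000005E = STATUS_NO_LOGON_SERVERS (Kerberos could not find a DC)
# 0xC0000388 = STATUS_DOWNGRADE_DETECTED (NTLM fallback forbidden)
PATTERN = re.compile(r"0xc000005e|0xc0000388", re.IGNORECASE)

def suspicious_lines(log_text):
    """Return the log lines mentioning either status code."""
    return [line for line in log_text.splitlines() if PATTERN.search(line)]

sample = """\
Negotiate: kerberos failed with 0xC000005E
Negotiate: NTLM fallback for password change blocked, returning 0xC0000388
Negotiate: unrelated chatter
"""
for line in suspicious_lines(sample):
    print(line)
```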

2. Network trace

Look at the network trace and filter the traffic based on the client IP, DNS and any authentication related traffic.
You may see the client is requesting a Kerberos ticket using an invalid SPN like:

So here the client tried to get a ticket for the ldap/Contoso.com SPN and failed with KDC_ERR_S_PRINCIPAL_UNKNOWN, because this SPN is not registered anywhere.

This is expected: a valid LDAP SPN looks like ldap/DC1.contoso.com.

Next let’s check the Netlogon.log

3. Netlogon.log:

Open the log with any text editor (I like good old Notepad.exe) and check the following:

Is a valid domain name being passed to DC locator?

Invalid names such as \\servername.contoso.com or an IP address like \\x.y.z.w will cause DC Locator to fail, and thus the Kerberos password change to return STATUS_NO_LOGON_SERVERS. Once that happens, NTLM fallback is not allowed and you get a failed password change.

If you find this issue examine the application code and make necessary changes to ensure correct domain name format is being passed to the ChangePassword API that is being used.

\\contoso.com is not a valid domain name. (contoso.com is a valid domain name)
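As a sketch, the formats this article calls out as invalid can be screened with a few heuristics before your code ever hands a name to a ChangePassword API. This is illustrative only and is not the real DC Locator validation:

```python
import ipaddress

def plausible_kerberos_domain(name):
    """Reject the formats the article calls out as invalid:
    UNC-style names (leading backslashes) and raw IP addresses.
    Short NetBIOS names and DNS names are accepted."""
    if not name or name.startswith("\\\\"):
        return False          # \\contoso.com, \\server, \\x.y.z.w
    try:
        ipaddress.ip_address(name)
        return False          # IP addresses are not valid Kerberos names
    except ValueError:
        pass
    return True               # CONTOSO or contoso.com style names

print(plausible_kerberos_domain(r"\\contoso.com"))  # False
print(plausible_kerberos_domain("10.1.2.3"))        # False
print(plausible_kerberos_domain("contoso.com"))     # True
print(plausible_kerberos_domain("CONTOSO"))         # True
```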

This error translates to:

0x4BC / 1212 / ERROR_INVALID_DOMAINNAME (winerror.h):
"The format of the specified domain name is invalid."

So what happened here?

The application code passed an invalid TargetName to Kerberos. It used the domain name as a server name, and so we see the SPN LDAP/contoso.com.

The client tried to get a ticket for this SPN and failed with KDC_ERR_S_PRINCIPAL_UNKNOWN, because this SPN is not registered anywhere. As noted, this is expected: a valid LDAP SPN looks like ldap/DC1.contoso.com.

The application code then tried the password change again and passed in \\contoso.com as the domain name. Anything beginning with \\ is not a valid domain name, and neither is an IP address, so DC Locator will fail to find a DC when given such a name. We can see this in the Netlogon.log and the network trace.

Conclusion and Solution

If the domain name is invalid here, examine the code snippet which is doing the password change to understand why the wrong name is passed in.

The fix in these cases will be to change the code to ensure a valid domain name is passed to Kerberos to allow the password change to successfully happen over Kerberos and not NTLM. NTLM is not secure. If Kerberos is possible, it should be the protocol used.

SOLUTION

The solution here was to remove "ContextOptions.ServerBind | ContextOptions.SimpleBind" and allow the code to use the default (Negotiate). Note that using a Domain context together with ServerBind is what caused the issue; Negotiate with a Domain context is the option that works and is able to use Kerberos successfully.

Why does this code work before MS16-101 and fail after?

Specifically: “This parameter specifies the options that are used for binding to the server. The application can set multiple options that are linked with a bitwise OR operation. “

Passing in a domain name such as contoso.com with the ContextOptions ServerBind or SimpleBind causes the client to attempt to use an SPN like ldap/contoso.com, because it expects the name that is passed in to be a server name.

This is not a valid SPN and does not exist; therefore Kerberos will fail with STATUS_NO_LOGON_SERVERS.
Before MS16-101, in this scenario, the Negotiate package would fall back to NTLM, attempt the password change using NTLM, and succeed.
Post-MS16-101, this fallback is not allowed and Kerberos is enforced.

(b) If Application Code is good but Kerberos fails to locate a DC for other reason

If you see a correct domain name and SPNs in the above logs, then the issue is that Kerberos fails for some other reason, such as blocked TCP ports. In this case, revert to scenario 1 to troubleshoot why Kerberos failed to locate a domain controller.

There is a chance that you may also have both (a) and (b). Traces and logs are the best tools to identify.

I will not go into detail on this scenario (known issue 6), as it is well described in KB3195799, NTLM authentication fails with 0xC0000022 error for Windows Server 2012, Windows 8.1, and Windows Server 2012 R2 after update is applied.

That’s all for today! I hope you find this useful. I will update this post if any new information arises.

Access-Based Enumeration (ABE) Troubleshooting (part 2 of 2)

Wed, 21 Sep 2016

Hello everyone! Hubert from the German Networking Team here again with part two of my little blog post series about Access-Based Enumeration (ABE). In the first part I covered some of the basic concepts of ABE. In this second part I will focus on monitoring and troubleshooting Access-Based Enumeration. We will begin with a quick overview of Windows Explorer's directory change notification mechanism (Change Notify) and how that mechanism can lead to performance issues, before moving on to monitoring your environment for performance issues.

Change Notify and its impact on DFSN servers with ABE

Let's say you are viewing the contents of a network share while a file or folder is added to the share remotely by someone else. Your view of this share will be updated automatically with the new contents of the share, without you having to manually refresh (press F5) your view. Change Notify is the mechanism that makes this work in all SMB protocol versions (1, 2 and 3). The way it works is quite simple:

The client sends a CHANGE_NOTIFY request to the server indicating the directory or file it is interested in. Windows Explorer (as an application on the client) does this by default for the directory that is currently in focus.

Once there is a change to the file or directory in question, the server will respond with a CHANGE_NOTIFY Response, indicating that a change happened.

This causes the client to send a QUERY_DIRECTORY request (in case it was a directory or DFS Namespace) to the server to find out what has changed.

QUERY_DIRECTORY is the operation we discussed in the first post that causes ABE filter calculations. Recall that it's these filter calculations that result in CPU load and client-side delays. Let's look at a common scenario:

During login, your users get a mapped drive pointing at a share in a DFS Namespace.

This mapped drive causes the clients to connect to your DFSN servers.

The client sends a Change Notification (even if the user hasn’t tried to open the mapped drive in Windows Explorer yet) for the DFS Root.

Nothing more happens until there is a change on the server side. Administrative work, such as adding and removing links, typically happens during business hours, whenever the administrators find the time or whenever the script that does it runs.

Back to our scenario. Let’s have a server-side change to illustrate what happens next:

We add a Link to the DFS Namespace.

Once the DFSN server picks up the new link in the namespace from Active Directory, it will create the corresponding reparse point in its local file system. If you do not use Root Scalability Mode (RSM), this will happen almost at the same time on all of the DFS servers in that namespace. With RSM the changes will usually be applied by the different DFS servers over the next hour (or whatever your SyncInterval is set to).

These changes trigger CHANGE_NOTIFY responses to be sent out to any client that indicated interest in changes to the DFS Root on that server. This usually applies to hundreds of clients per DFS server.

This causes hundreds of Clients to send QUERY_DIRECTORY requests simultaneously.

What happens next strongly depends on the size of your namespace (larger namespaces lead to longer duration per ABE calculation) and the number of Clients (aka Requests) per CPU of the DFSN Server (remember the calculation from the first part?)

As your server does not have hundreds of CPUs, there will definitely be some backlog. The numbers above determine how big this backlog will be, and how long it takes for the server to work its way back to normal. Keep in mind that while pedaling out of the backlog situation, your server still has to answer other, ongoing requests that are unrelated to our Change Notify event. Suffice it to say, this backlog and the CPU demand associated with it can also have a negative impact on other jobs. For example, if you use this DFSN server to make a bunch of changes to your namespace, these changes will appear to take forever, simply because the executing server is starved of CPU cycles. The same holds true if you run other workloads on the same server or want to RDP into the box.
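To get a feel for the magnitude of such a backlog, here is a back-of-the-envelope model in Python; every number in it is invented for illustration:

```python
def drain_seconds(clients, ms_per_enumeration, cpu_cores, spare_capacity=1.0):
    """Rough model: every connected client re-enumerates the DFS root at
    once, each ABE filter pass costs ms_per_enumeration of one core, and
    only spare_capacity (0..1) of each core is free for the burst."""
    total_cpu_ms = clients * ms_per_enumeration
    return total_cpu_ms / (cpu_cores * spare_capacity) / 1000.0

# 800 connected clients, 500 ms of ABE filtering per enumeration of a
# large namespace, 8 cores with half their capacity busy on normal load:
print(round(drain_seconds(800, 500, 8, spare_capacity=0.5), 1))  # 100.0
```

Under those invented numbers, the server spends over a minute and a half doing nothing but ABE filter calculations, which matches the "appears to take forever" experience described above.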

So! What can you do about it? As is common with an overloaded server, there are a few different approaches you could take:

Distribute the load across more servers (and CPU cores)

Make changes outside of business hours

Disable Change Notify in Windows Explorer

Approach

Method

Distribute the load / scale up

An expensive way to handle the excessive load is to throw more servers/CPU cores into the DFS infrastructure. In theory, you could increase the number of Servers and the number of CPUs to a level where you can handle such peak loads without any issues, but that can be a very expensive approach.

Make changes outside business hours

Depending on your organization’s structure, your business needs, SLAs and other requirements, you could simply make planned administrative changes to your namespaces outside the main business hours, when fewer clients are connected to your DFSN servers.

Disable Change Notify in Windows Explorer

You can set the registry values NoRemoteChangeNotify and NoRemoteRecursiveEvents (see https://support.microsoft.com/en-us/kb/831129) to prevent Windows Explorer from sending change notification requests. This is, however, a client-side setting that disables change notify not just for DFS shares but for any file server the client works with. Users then have to actively press F5 to see changes to a folder or a share in Windows Explorer. This might or might not be a big deal for your users.

Monitoring ABE

As you may have realized by now, ABE is not a fire and forget technology—it needs constant oversight and occasional tuning. We’ve mainly discussed the design and “tuning” aspect so far. Let’s look into the monitoring aspect.

Using Task Manager / Process Explorer

This is a bit tricky, unfortunately, as any load caused by ABE shows up in Task Manager inside the System process (as do many other things on the server). In order to correlate high CPU utilization in the System process with ABE load, you need a tool such as Process Explorer, configured to use public symbols. With this configured properly, you can drill deeper into the System process and see the individual threads and the component names. Note that ABE and the file server both use functions in srv.sys and srv2.sys, so strictly speaking it’s not possible to differentiate between them by component names alone. However, if you are troubleshooting a performance problem on an ABE-enabled server where most of the threads in the System process are sitting in functions from srv.sys and srv2.sys, then it’s very likely due to expensive ABE filter calculations. This is, aside from disabling ABE, the best approach to reliably prove that your problem is caused by ABE.

Using Network trace analysis

Looking at CPU utilization shows us the server-side problem; we must use other measures to determine the client-side impact. One approach is to take a network trace and analyze the SMB/SMB2 service response times. You may, however, end up having to capture the trace on a mirrored switch port. To make the analysis a bit easier, Message Analyzer has an SMB Service Performance chart you can use.

You get there by using a New Viewer, like below.

Wireshark also has a feature that provides you with statistics under Statistics -> Service Response Times -> SMB2. Ignore the values for ChangeNotify (it’s normal for them to be several seconds or even minutes). All other response times translate into delays for the clients. If you see values over a second, you can consider your file service not only slow but outright broken. While you have that trace in front of you, you can also look for SMB/TCP connections that are terminated abnormally by the client because the server failed to respond to the SMB requests in time. If you have any of those, then you have clients unable to connect to your file service, likely throwing error messages.

Using Performance Monitor

If your server is running Windows Server 2012 or newer, the following performance counters are available. All of them belong to the SMB Server Shares object, with the ABE-enabled share as the instance:

Avg. sec/Data Request

Avg. sec/Read

Avg. sec/Request

Avg. sec/Write

Avg. Data Queue Length

Avg. Read Queue Length

Avg. Write Queue Length

Current Pending Requests

Most notable here is the Avg. sec/Request counter, as it contains the response time to the QUERY_DIRECTORY requests (Wireshark displays them as Find requests). The other values will suffer from the lack of CPU cycles in varying ways, but all indicate delays for the clients. As mentioned in the first part: we expect single-digit millisecond response times from well-performing non-ABE file servers. For ABE-enabled servers (more precisely, shares) the values for QUERY_DIRECTORY / Find requests will always be higher due to the inevitable length of the ABE calculation.

When you reach a state where all SMB requests aside from QUERY_DIRECTORY are consistently answered in less than 10 ms, and QUERY_DIRECTORY consistently in less than 50 ms, you have a very well performing ABE server.
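As a sketch, the thresholds above can be turned into a trivial health check. The function name and the 10 ms / 50 ms limits come from the guidance in this post, not from any Microsoft tooling:

```python
# Health-check sketch based on the thresholds above. Perfmon's
# "Avg. sec/..." counters report seconds, hence the decimal values.

def abe_share_healthy(avg_sec_per_request: float, avg_sec_per_find: float) -> bool:
    """avg_sec_per_find covers QUERY_DIRECTORY (Find) requests; everything
    else should stay below 10 ms, Find below 50 ms."""
    return avg_sec_per_request < 0.010 and avg_sec_per_find < 0.050

print(abe_share_healthy(0.004, 0.035))  # -> True: a very well performing ABE share
print(abe_share_healthy(0.004, 0.300))  # -> False: Find requests far too slow
```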

Other Symptoms

There are other symptoms of ABE problems that you may observe; however, none of them is very telling on its own without the information from the points above.

At first glance, high CPU utilization and a long processor queue are indicators of an ABE problem; however, they are also indicators of other CPU-related performance issues. Not to mention there are cases where you encounter ABE performance problems without saturating all your CPUs.

The Server Work Queues\Active Threads (NonBlocking) will usually rise to their maximum allowed limit (MaxThreadsPerQueue), and the Server Work Queues\Queue Length will increase as well. Both indicate that the file server is busy, but on their own they don’t tell you how bad the situation is. There are also scenarios where the file server will not use up all allowed worker threads because of a bottleneck somewhere else, such as in the disk subsystem or the CPU cycles available to it.

See the following should you choose to set up long-term monitoring (which you should) in order to get some trends:

If you collect those values every day (or a shorter interval), you can get a pretty good picture how much head-room you have left with your servers at the moment and if there are trends that you need to react to.

Feel free to add more information to your monitoring to get a better picture of the situation. For example, gather information on how many DFS servers were active on any given day for a certain site, so you can tell whether unusually high numbers of user requests on the other servers stem from a server downtime.

ABELevel

Some of you might have heard about the registry value ABELevel. The ABELevel value specifies the maximum folder depth on which the ABE feature is enabled. While the title of the KB sounds very promising, and the hotfix is presented as a “Resolution”, the hotfix and registry value have very little practical application. Here’s why: ABELevel is a system-wide setting and does not differentiate between different shares on the same server. If you host several shares, you cannot filter them to different depths, as the setting forces you to go with the deepest folder hierarchy. This results in unnecessary filter calculations for the other shares.

Usually the widest directories are on the upper levels—those levels that you need to filter. Disabling the filtering for the lower level directories doesn’t yield much of a performance gain, as those small directories don’t have much impact on server performance, while the big top-level directories do. Furthermore, the registry value doesn’t make any sense for DFS Namespaces as you have only one folder level there and you should avoid filtering on your fileservers anyway.

Well then, this concludes this small (my first) blog series. I hope you found reading it worthwhile and got some input for your infrastructures out there.

With best regards,

Hubert

Access-Based Enumeration (ABE) Concepts (part 1 of 2)
https://blogs.technet.microsoft.com/askds/2016/09/01/access-based-enumeration-abe-concepts-part-1-of-2/
Thu, 01 Sep 2016 21:50:57 +0000

Hello everyone, Hubert from the German Networking Team here. Today I want to revisit a topic that I wrote about in 2009: Access-Based Enumeration (ABE).

This is the first part of a 2-part Series. This first part will explain some conceptual things around ABE. The second part will focus on diagnostic and troubleshooting of ABE related problems. The second post is here.

Access-Based Enumeration has existed since Windows Server 2003 SP1 and has not changed in any significant way since my blog post in 2009. However, what has significantly changed is its popularity.

With its integration into V2 (2008 Mode) DFS Namespaces and the increasing demand for data privacy, it became a tool of choice for many architects. However, the same strict limitations and performance impact it had in Windows Server 2003 still apply today. With this post, I hope to shed some more light here as these limitations and the performance impact are either unknown or often ignored. Read on to gain a little insight and background on ABE so that you:

Understand its capabilities and limitations

Gain the background knowledge needed for my next post on how to troubleshoot ABE

Two things to keep in mind:

ABE is not a security feature (it’s more of a convenience feature)

There is no guarantee that ABE will perform well under all circumstances. If performance issues come up in your deployment, disabling ABE is a valid solution.

“Access-based enumeration displays only the files and folders that a user has permissions to access. If a user does not have Read (or equivalent) permissions for a folder, Windows hides the folder from the user’s view. This feature is active only when viewing files and folders in a shared folder; it is not active when viewing files and folders in the local file system.”

Note that ABE has to check the user’s permissions at the time of enumeration and filter out files and folders they don’t have Read permissions to. Also note that this filtering only applies if the user is attempting to access the share via SMB versus simply browsing the same folder structure in the local file system.

For example, let’s assume you have an ABE-enabled file server share with 500 files and folders, but a certain user only has Read permissions to 5 of those folders. The user is only able to view 5 folders when accessing the share over the network. If the user logs on to this server and browses the local file system, they will see all of the files and folders.

In addition to file server shares, ABE can also be used to filter the links in DFS Namespaces.

With V2 namespaces, DFSN gained the capability to store permissions for each DFSN link and to apply those permissions to the local file system of each DFSN server.

Those NTFS permissions are then used by ABE to filter directory enumerations against the DFSN root share thus removing DFSN links from the results sent to the client.

Therefore, ABE can be used to either hide sensitive information in the link/folder names, or to increase usability by hiding hundreds of links/folders the user does not have access to.

How does it work?

The filtering happens on the file server at the time of the request.

Any Object (File / Folder / Shortcut / Reparse Point / etc.) where the user has less than generic read permissions is omitted in the response by the server.

Generic Read means:

List Folder / Read Data

Read Attributes

Read Extended Attributes

Read Permissions

If you take any of these permissions away, ABE will hide the object.

So you could create a scenario (e.g. by removing only the “Read Permissions” permission) where the object is hidden from the user, but the user could still open/read the file or folder if they know its name.

That brings us to the next important conceptual point we need to understand:

ABE does not do access control.

It only filters the response to a Directory Enumeration. The access control is still done through NTFS.

Aside from that, ABE only works when the access happens through the Server service (i.e. the file server). Any local access to the file system is not affected by ABE. Restated:

“Access-based enumeration does not prevent users from obtaining a referral to a folder target if they already know the DFS path of the folder with targets. Permissions set using Windows Explorer or the Icacls command on namespace roots or folders without targets control whether users can access the DFS folder or namespace root. However, they do not prevent users from directly accessing a folder with targets. Only the share permissions or the NTFS file system permissions of the shared folder itself can prevent users from accessing folder targets.” Recall what I said earlier, “ABE is not a security feature”. TechNet

ABE does not do any caching.

Every request causes a filter calculation. There is no cache. ABE will repeat the exact same work for identical directory enumerations by the same user.

ABE cannot predict the permissions or the result.

It has to do the calculations for each object in every level of your folder hierarchy every time it is accessed.

If you use inheritance on the folder structure, a user will have the same permissions, and thus the same filter result from ABE, through the entire folder structure. Still, ABE has to calculate this result, consuming CPU cycles in the process.

If you enable ABE on such a folder structure you are just wasting CPU cycles without any gain.

With those basics out of the way, let’s dive into the mechanics behind the scenes:

When a client enumerates a directory, the file system first returns the complete list of objects to the server service. With ABE enabled, this list is not immediately sent out to the client, but is instead passed over to ABE for processing.

ABE will iterate through EVERY object in this list and compare the user’s permissions with the object’s ACL.

The objects where the user does not have generic read access are removed from the list.

After ABE has completed its processing, the client receives the filtered list.
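The filtering steps above can be sketched as follows. This is a toy model in Python, not the actual srv.sys logic; the object names and rights labels are purely illustrative:

```python
# Toy model of ABE filtering: an object survives the enumeration only if
# the user holds ALL four generic-read rights on it.

GENERIC_READ = {
    "ListFolder/ReadData",
    "ReadAttributes",
    "ReadExtendedAttributes",
    "ReadPermissions",
}

def abe_filter(listing: dict) -> list:
    """listing maps object name -> set of rights the user has on that object."""
    return [name for name, rights in listing.items()
            if GENERIC_READ <= rights]  # subset test: all four rights present

listing = {
    "HR": set(GENERIC_READ),                        # fully readable -> visible
    "Payroll": GENERIC_READ - {"ReadPermissions"},  # one right missing -> hidden
}
print(abe_filter(listing))  # -> ['HR']
```

Note how removing even a single one of the four rights hides the object, matching the behavior described earlier.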

This yields two effects:

This comparison is an active operation and thus consumes CPU Cycles.

This comparison takes time, and this time is passed down to the user, as the results are only sent once the comparisons for the entire directory are complete.

This brings us directly to the core point of this Blog:

In order to successfully use ABE in your environment you have to manage both effects.

If you don’t, ABE can cause a wide spread outage of your File services.

The first effect can cause a complete saturation of your CPUs (all cores at 100%).

This not only increases the response times of the file server to the point where the server stops accepting new connections, or clients kill their connections after not getting a response from the server for several minutes; it can also prevent you from establishing a Remote Desktop connection to the server to make any changes (like disabling ABE, for instance).

The second effect can increase the response times of your file server (even if it is otherwise idle) to a magnitude that users will no longer accept.

The comparison for a single directory enumeration by a single user can keep one CPU in your server busy for quite some time, thus making it more likely for new incoming requests to overlap with already running ABE calculations. This eventually results in a Backlog adding further to the delays experienced by your clients.

To illustrate this let’s roll some numbers:

A little disclaimer:

The following calculation reflects what I’ve seen; your results may differ, as there are many moving pieces in play here. In other words, your mileage may vary. That aside, the numbers seen here are not entirely off; they stem from real production environments. The performance of disk and CPU, as well as other workloads, plays into these numbers too.

Thus the calculation and numbers are for illustration purposes only. Don’t use it to calculate your server’s performance capabilities.

We usually expect single digit millisecond response times measured at the fileserver to achieve good performance (network latency obviously adds to the numbers seen on the client).

In our scenario (10,000 links, ABE, 3.5 GHz CPU), it is not unheard of for a single enumeration of the namespace to take 500 ms.

CPU cores and speed | DFS namespace links | RSS configured per recommendations | ABE enabled? | Response time
4 @ 3.5 GHz | 10,000 | Yes | No | <10 ms
4 @ 3.5 GHz | 10,000 | Yes | Yes | 300 – 500 ms

That means a single CPU can handle up to 2 Directory Enumerations per Second. Multiplied by 4 CPUs the server can handle 8 User Requests per Second. Any more than those 8 requests and we push the Server into a backlog.

Backlog in this case means new requests are stuck in the Processor Queue behind other requests, therefore multiplying the wait time.

This can reach dimensions where the client (and the user) is waiting for minutes and the client eventually decides to kill the TCP connection, and in case of DFSN, fail over to another server.
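The same arithmetic also shows how fast the backlog grows under sustained overload. A sketch with the post’s illustrative figures (the function name and arrival rate are hypothetical):

```python
# Sketch of sustained overload: once enumerations arrive faster than
# cores / calc_seconds can service them, the queue grows linearly.

def queue_after(arrivals_per_sec: float, cores: int,
                calc_seconds: float, elapsed_sec: float) -> float:
    service_per_sec = cores / calc_seconds           # 4 / 0.5 = 8 enumerations/s
    growth = max(0.0, arrivals_per_sec - service_per_sec)
    return growth * elapsed_sec

# 12 enumerations/s arriving against a capacity of 8/s:
print(queue_after(12, 4, 0.5, 60))  # -> 240.0 requests queued after one minute
```

At that point clients are waiting behind hundreds of calculations, which is exactly when they start killing TCP connections and failing over.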

Anyone remotely familiar with Fileserver Scalability probably instantly recognizes how bad and frightening those numbers are. Please keep in mind, that not every request sent to the server is a QUERY_DIRECTORY request, and all other requests such as Write, Read, Open, Close etc. do not cause an ABE calculation (however they suffer from an ABE-induced lack of CPU resources in the same way).

SMB2 and later clients briefly cache the results of directory enumerations; there is no such cache for SMB1. Thus SMB1 clients will send more directory enumeration requests than SMB2 or SMB3 clients (particularly if you keep the F5 key pressed).

It should now be obvious that you should use SMB2/3 versus SMB1 and ensure you leave the caches enabled if you use ABE on your servers.

As you might have realized by now, there is no easy or reliable way to predict the CPU demand of ABE. If you are developing a completely new environment you usually cannot forecast the proportion of QUERY_DIRECTORY requests in relation to the other requests or the frequency of the same.

Recommendations!

The most important recommendation I can give you is:

Do not enable ABE unless you really need to.

Let’s take the Users Home shares as an example:

Usually no user browses manually through this structure; instead, users get a mapped drive pointing to their own folder, so the usability aspect does not count. Additionally, most users will know (or can find out from the office address book) the names or aliases of their colleagues, so there is no sensitive information to hide here. For ease of management, most home shares live in big namespaces or server shares, which makes them very unfit for use with ABE. And in many cases the user has full control (or at least write permissions) inside his own home share. Why should I waste my CPU cycles to filter the requests inside someone’s own home share?

Considering all those points, I would be intrigued to hear a compelling argument for enabling ABE on user home shares or roaming profile shares. Please sound off in the comments.

If you have a data structure where you really need to enable ABE, your file service concept needs to facilitate these four requirements:

You need Scalability.

You need the ability to increase the number of CPUs doing the ABE calculations in order to react to increasing numbers (directory sizes, number of clients, usage frequency) and thus performance demand.

That way you can easily add more CPUs by simply adding further namespace servers in the sites where they are required.

Also keep in mind, that you should have some redundancy and that another server might not be able to take the full additional load of a failing server on top of its own load.

You need small chunks.

The number of objects that ABE needs to check for each calculation is the single most important factor for the performance requirement.

Instead of having a single big 10,000-link namespace (the same applies to directories on file servers), build 10 smaller 1,000-link namespaces and combine them into a DFS cascade.

That way, ABE only needs to filter 1,000 objects for every request.

Just re-do the example calculation above with 250ms, 100ms, 50ms or even less.

You will notice that you are suddenly able to reach very decent numbers in terms of Requests/per Second.

The other nice side effect is that you will do fewer calculations overall, as a user usually follows only one branch of the directory tree and thus does not cause ABE calculations for the other branches.
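Re-doing the example calculation for the smaller chunks is straightforward. A sketch with the post’s illustrative figures (4 cores @ 3.5 GHz, as in the example; the millisecond values are not guarantees):

```python
# Requests/second a server can sustain as the ABE calculation time shrinks
# with smaller namespaces.

def requests_per_second(cores: int, calc_ms: float) -> float:
    return cores * 1000.0 / calc_ms

for calc_ms in (500, 250, 100, 50):
    print(f"{calc_ms} ms per calculation -> {requests_per_second(4, calc_ms)} req/s")
```

Cutting the calculation from 500 ms to 50 ms takes the same 4-core server from 8 to 80 enumerations per second, which is why chunk size is the single biggest lever you have.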

You need Separation of Workloads.

Having your SQL Server run on the same machine as your ABE Server can cause a lack of Performance for both workloads.

Having ABE run on your domain controller exposes the domain controller role to the risk of being starved of CPU cycles, and thus no longer servicing domain logons.

You need to test and monitor your performance

In many cases you are deploying a new file service concept into an existing environment.

Thus you can get some numbers regarding QUERY_DIRECTORY requests, from the existing DFS / Fileservers.

Monitor the SMB Service Response Times, the Processor utilization and Queue length and the feel on the client while browsing through the structures.

This should give you an idea on how many servers you will need, and if it is required to go for a slimmer design of the data structures.

Keep monitoring those values through the lifecycle of your file server deployment in order to scale up in time.

Any deployment of new software, clients or the normal increase in data structure size could throw off your initial calculations and test results.

This point should imho be outlined very clearly in any concept documentation.

This concludes the first part of this Blog Series.

I hope you found it worthwhile and got an understanding of how to successfully design a file service with ABE.

Now to round off your knowledge, or if you need to troubleshoot a Performance Issue on an ABE-enabled Server, I strongly encourage you to read the second part of this Blog Series. This post will be updated as soon as it’s live.

With best regards,

Hubert

Deploying Group Policy Security Update MS16-072 \ KB3163622
https://blogs.technet.microsoft.com/askds/2016/06/22/deploying-group-policy-security-update-ms16-072-kb3163622/
Wed, 22 Jun 2016 13:37:40 +0000

My name is Ajay Sarkaria & I work with the Windows Supportability team at Microsoft. There have been many questions on deploying the newly released security update MS16-072.

This post was written to provide guidance and answer questions needed by administrators to deploy the newly released security update, MS16-072 that addresses a vulnerability. The vulnerability could allow elevation of privilege if an attacker launches a man-in-the-middle (MiTM) attack against the traffic passing between a domain controller and the target machine on domain-joined Windows computers.

The table below summarizes the KB article number for the relevant Operating System:

What does this security update change?

The most important aspect of this security update is to understand the behavior changes affecting the way User Group Policy is applied on a Windows computer. MS16-072 changes the security context with which user group policies are retrieved. Traditionally, when a user group policy is retrieved, it is processed using the user’s security context.

After MS16-072 is installed, user group policies are retrieved by using the computer’s security context. This by-design behavior change protects domain joined computers from a security vulnerability.

When a user group policy is retrieved using the computer’s security context, the computer account will now need “read” access to retrieve the group policy objects (GPOs) needed to apply to the user.

Traditionally, all group policies were read if the “user” had read access, either directly or through membership in a domain group (e.g. Authenticated Users).

What do we need to check before deploying this security update?

As discussed above, by default “Authenticated Users” have “Read” and “Apply Group Policy” on all Group Policy Objects in an Active Directory Domain.

Below is a screenshot from the Default Domain Policy:

If the permissions on the Group Policy Objects in your Active Directory domain have not been modified and are still using the defaults, and as long as Kerberos authentication is working fine in your Active Directory forest (i.e. there are no Kerberos errors visible in the System event log on client computers while accessing domain resources), there is nothing else you need to verify before you deploy the security update.

In some deployments, administrators may have removed the “Authenticated Users” group from some or all Group Policy Objects (Security filtering, etc.)

In such cases, you will need to make sure of the following before you deploy the security update:

Check if “Authenticated Users” group read permissions were removed intentionally by the admins. If not, then you should probably add those back. For example, if you do not use any security filtering to target specific group policies to a set of users, you could add “Authenticated Users” back with the default permissions as shown in the example screenshot above.

If the “Authenticated Users” permissions were removed intentionally (security filtering, etc), then as a result of the by-design change in this security update (i.e. to now use the computer’s security context to retrieve user policies), you will need to add the computer account retrieving the group policy object (GPO) to “Read” Group Policy (and not “Apply group policy“).

Example Screenshot:

In the above example screenshot, let’s say an Administrator wants “User-Policy” (Name of the Group Policy Object) to only apply to the user with name “MSFT Ajay” and not to any other user, then the above is how the Group Policy would have been filtered for other users. “Authenticated Users” has been removed intentionally in the above example scenario.

Notice that no other user or group is included to have “Read” or “Apply Group Policy” permissions other than the default Domain Admins and Enterprise Admins. These groups do not have “Apply Group Policy” by default so the GPO would not apply to the users of these groups & apply only to user “MSFT Ajay”

What will happen if there are Group Policy Objects (GPOs) in an Active Directory domain that are using security filtering as discussed in the example scenario above?

Symptoms when you have security filtering Group Policy Objects (GPOs) like the above example and you install the security update MS16-072:

Printers or mapped drives assigned through Group Policy Preferences disappear.

Shortcuts to applications on users’ desktops are missing

Group policies that use security filtering no longer process

You may see the following change in gpresult: Filtering: Not Applied (Unknown Reason)

If you are using Folder Redirection and the Folder Redirection group policy removal option is set to “Redirect the folder back to the user profile location when policy is removed,” the redirected folders are moved back to the client machine after installing this security update

What is the Resolution?

Simply adding the “Authenticated Users” group with the “Read” permissions on the Group Policy Objects (GPOs) should be sufficient. Domain Computers are part of the “Authenticated Users” group. “Authenticated Users” have these permissions on any new Group Policy Objects (GPOs) by default. Again, the guidance is to add just “Read” permissions and not “Apply Group Policy” for “Authenticated Users”

What if adding Authenticated Users with Read permissions is not an option?

If adding “Authenticated Users” with just “Read” permissions is not an option in your environment, you will need to add the “Domain Computers” group with “Read” permissions instead. If you want to limit access beyond the Domain Computers group, administrators can also create a new domain group and add the relevant computer accounts to it, limiting “Read” access on the Group Policy Object (GPO) to that group. Keep in mind, however, that computers will not pick up membership in the new group until a reboot. Also note that with this security update installed, this additional step is only required if the default “Authenticated Users” group has been removed from a policy where user settings are applied.

Example Screenshots:

Now in the above scenario, after you install the security update, as the user group policy needs to be retrieved using the system’s security context, (domain joined system being part of the “Domain Computers” security group by default), the client computer will be able to retrieve the user policies required to be applied to the user and the same will be processed successfully.

How to identify GPOs with issues:

In case you have already installed the security update and need to identify affected Group Policy Objects (GPOs), the easy way is to run gpupdate /force on a Windows client computer, then run gpresult /h new-report.html, open new-report.html, and review it for errors like “Reason Denied: Inaccessible, Empty or Disabled”.

The script can run only on Windows 7 and later operating systems that have RSAT or the GPMC installed, or on domain controllers running Windows Server 2008 R2 and later.

The script works in a single domain scenario.

The script will detect all GPOs in your domain (Not Forest) which are missing “Authenticated Users” permissions & give the option to add “Authenticated Users” with “Read” Permissions (Not Apply Group Policy). If you have multiple domains in your Active Directory Forest, you will need to run this for each domain.

Domain Computers are part of the Authenticated Users group

The script can only add permissions to the Group Policy Objects (GPOs) in the same domain as the context of the current user running the script. In a multi domain forest, you must run it in the context of the Domain Admin of the other domain in your forest.

Sample Screenshots when you run the script:

In the first sample screenshot below, running the script detects all Group Policy Objects (GPOs) in your domain that are missing the Read permission for “Authenticated Users”.

Select and deploy the GPOs again. Note: To modify permissions on multiple AGPM-managed GPOs, use Shift+click or Ctrl+click to select multiple GPOs at a time, then deploy them in a single operation. Ctrl+A does not select all policies.

The targeted GPOs now have the new permissions when viewed in AD:

Below are some Frequently asked Questions we have seen:

Frequently Asked Questions (FAQs):

Q1) Do I need to install the fix on only client OS? OR do I also need to install it on the Server OS?

A1) It is recommended you patch Windows and Windows Server computers which are running Windows Vista, Windows Server 2008 and newer Operating Systems (OS), regardless of SKU or role, in your entire domain environment. These updates only change behavior from a client (as in “client-server distributed system architecture”) standpoint, but all computers in a domain are “clients” to SYSVOL and Group Policy; even the Domain Controllers (DCs) themselves

Q2) Do I need to enable any registry settings to enable the security update?

A2) No, this security update will be enabled when you install the MS16-072 security update, however you need to check the permissions on your Group Policy Objects (GPOs) as explained above

Q3) What will change in regard to how group policy processing works after the security update is installed?

A3) Prior to the installation of MS16-072, the connection to the Windows domain controller (DC) to retrieve user policy is made under the user’s security context. With this security update installed, Windows Group Policy clients instead force the local system’s security context, thereby forcing Kerberos authentication

Q4) We already have the security updates MS15-011 and MS15-014 installed, which harden the UNC paths for SYSVOL and NETLOGON, and we push the following registry settings using Group Policy:

RequirePrivacy=1

RequireMutualAuthentication=1

RequireIntegrity=1

Should UNC hardening with the above registry settings not already take care of this vulnerability when processing Group Policy from SYSVOL?
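For reference, the three UNC hardening settings named above are typically delivered as Hardened UNC Paths entries under the NetworkProvider policy key. A representative .reg fragment (the exact value names match the UNC paths you choose to harden) might look like this:

```reg
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows\NetworkProvider\HardenedPaths]
"\\\\*\\SYSVOL"="RequireMutualAuthentication=1, RequireIntegrity=1, RequirePrivacy=1"
"\\\\*\\NETLOGON"="RequireMutualAuthentication=1, RequireIntegrity=1, RequirePrivacy=1"
```

(In .reg syntax, backslashes in the value names are escaped, so `\\\\*\\SYSVOL` denotes the path `\\*\SYSVOL`.)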

A4) No. UNC hardening alone will not protect against this vulnerability. To be protected, both of the following must be true: UNC hardened access is enabled for SYSVOL/NETLOGON as suggested, and the client computer is configured to require Kerberos FAST armoring.

Q5) If we have security filtering on Computer objects, what change may be needed after we install the security update?

A5) Nothing changes in how computer Group Policy retrieval and processing works.

Q6) We are using security filtering for user objects, and after installing the update, Group Policy processing is no longer working.

A6) As noted above, the security update changes the way user Group Policy settings are retrieved. Group Policy processing fails after the update is installed because you may have removed the default “Authenticated Users” group from the Group Policy Object (GPO). The computer account now needs “Read” permission on the GPO. You can add the “Domain Computers” group with “Read” permission on the GPO so that the computer can retrieve the list of GPOs to download for the user.
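Granting that permission can be scripted with the RSAT GroupPolicy module. This is a hedged sketch rather than the article’s script; “Example GPO” is a placeholder name for an affected GPO in your domain:

```powershell
# Sketch: grant "Domain Computers" Read on an affected GPO so the computer
# account can retrieve it during user Group Policy processing.
# Assumes the RSAT GroupPolicy module; "Example GPO" is a placeholder.
Import-Module GroupPolicy

Set-GPPermission -Name 'Example GPO' `
                 -TargetName 'Domain Computers' `
                 -TargetType Group `
                 -PermissionLevel GpoRead
```

The same cmdlet with `-TargetName 'Authenticated Users'` restores the default Read permission instead, if that is the remediation you choose.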

Q7) Will this security update impact cross-forest user Group Policy processing?

A7) No, this security update will not impact cross-forest user Group Policy processing. When a user from one forest logs onto a computer in another forest and the Group Policy setting “Allow Cross-Forest User Policy and Roaming User Profiles” is enabled, the user Group Policy during the cross-forest logon will be retrieved using the user’s security context.

Q8) Is there a need to specifically add “Domain Computers” to make user Group Policy processing work, or should adding “Authenticated Users” with just Read permission suffice?

A8) Adding “Authenticated Users” with Read permission should suffice. If you already have “Authenticated Users” with at least Read permission on a GPO, no further action is required. “Domain Computers” are by default part of the “Authenticated Users” group, so user Group Policy processing will continue to work. You only need to add “Domain Computers” with Read permission to a GPO if you do not want “Authenticated Users” to have Read.