The Mystery of Lsass.exe Memory Consumption, (When all components get involved)

Hi All, this is Karim Elsaid and I’m a Support Escalation Engineer working with the Dubai platforms support team.Recently I was working on a very challenging and interesting case, and I wanted to share that experience with you.

One of our customers was experiencing a problem on all his Domain Controllers (Running x86 Windows Server 2003), where intermittently the DCs ran out of resources. When one of the servers ran out of resources it had to be restarted to restore its functionality. The following event was logged on the DC when the problem occurred:

This event gets logged when the Local Security Authority Subsystem (lsass.exe) process is unable to allocate memory from the heap.

When this problem occurs, users are not able to authenticate and Exchange servers cease functioning until the DC gets rebooted.

By the time we got the call, the DC that was exhibiting the issue had been rebooted, hence when we checked the lsass.exe memory usage it was very low, but when we checked another DC on the same site, the lsass.exe memory usage was around 1.7GB.Well, that’s pretty high but it is worth mentioning that by default lsass.exe may utilize a large amount of virtual memory for the ESENT Cache (Active Directory database), nevertheless this should not cause any issues on DCs as it should back-off when needed.

The issue here is that the memory usage for lsass.exe kept growing and never got released back again – this is called a process memory leak.

To confirm our theory we obtained a Perfmon Log from the DC and checked how Lsass.exe is doing with memory, which proved the theory that lsass.exe is actually leaking memory.

We can see from the graph above a steady increase in lsass.exe private bytes.The DC was booted without /3GB and hence the limit of the user mode process address space is 2GB.

Now, we need to explore the options on how to track and address such leaks. We could obtain an lsass.exe process dump and check the heap, or obtain a Full memory dump of the server while in the failed state, to check the lsass.exe heap usage and overall system state, but the thing is none of the above may immediately reveal what’s going on.We would need to be very lucky to get the dump while the actual memory call that happens inside lsass.exe process is executing, which is very slim.

What we are looking for is a “pattern”, the pattern which is causing the actual leak inside the lsass.exe process.

For that we will turn to a tool that I see many of the customers are not aware of, but in fact it is very handy and easy to troubleshoot such a memory leak problem. We will use the “User Mode Dump Heap” or UMDH. By using this tool it will do the job for us, it will monitor the all heap calls that happen in the lsass.exe process.All we need to do is to download the “Debugging Tools For Windows”, enable user mode stack acquisition from the command line, download the symbols from Microsoft public symbol servers, reboot the server and we are good to go.

So now, all heap blocks that the lsass.exe process allocates will get monitored by UMDH, and the tool will tell us information about that specific call stack.For example, the number of allocations using this stack, number of bytes consumed, etc…

Back to our main problem: We have setup the DC for UMDH and started monitoring the lsass.exe process.We verified UMDH is working after the reboot by generating a sample log – it is very critical to generate this simple log to make sure that you are able to resolve the symbols correctly.

We kept generating UMDH logs daily and then the interesting pattern started to appear, which shows:

+ 123097760 ( 123098480 –720) 1538731 allocsBackTrace4BA0

+ 1538722 ( 1538731 –9)BackTrace4BA0allocations

ntdll!RtlAllocateHeapSlowly+00000041

ntdll!RtlAllocateHeap+00000E9F

msvcrt!malloc+0000006C

NTDSA!ABBind_local+0000031C

NTDSA!NspiBind+0000013D

RPCRT4!Invoke+00000030

RPCRT4!NdrStubCall2+00000299

RPCRT4!NdrServerCall2+00000019

RPCRT4!DispatchToStubInCNoAvrf+00000038

RPCRT4!RPC_INTERFACE::DispatchToStubWorker+0000011F

RPCRT4!RPC_INTERFACE::DispatchToStub+000000A3

RPCRT4!OSF_SCALL::DispatchHelper+00000149

RPCRT4!OSF_SCALL::DispatchRPCCall+0000010D

RPCRT4!OSF_SCALL::ProcessReceivedPDU+0000057F

RPCRT4!OSF_SCALL::BeginRpcCall+00000194

RPCRT4!OSF_SCONNECTION::ProcessReceiveComplete+00000435

RPCRT4!ProcessConnectionServerReceivedEvent+00000021

RPCRT4!LOADABLE_TRANSPORT::ProcessIOEvents+000001B8

RPCRT4!ProcessIOEventsWrapper+0000000D

RPCRT4!BaseCachedThreadRoutine+0000009D

RPCRT4!ThreadStartRoutine+0000001B

kernel32!BaseThreadStart+00000034

From the backtrace 4BA0 we can see that there are ~1.5 heap allocation calls that have occurred without being freed from the first day we started to monitor lsass using UMDH.

There is also another call stack with exactly the same symptoms:

+ 51525199 ( 51525506 –307) 1538802 allocsBackTrace4B92

+ 1538793 ( 1538802 –9)BackTrace4B92allocations

ntdll!RtlAllocateHeapSlowly+00000041

ntdll!RtlAllocateHeap+00000E9F

LSASRV!LsapAllocatePrivateHeapNoZero+0000004F

RPCRT4!operator new+0000000D

RPCRT4!W2AAttachHelper+00000032

RPCRT4!CNlDelUnicodeAnsi::Convert+00000014

NTDSA!ABBind_local+0000005D

NTDSA!NspiBind+0000013D

RPCRT4!Invoke+00000030

RPCRT4!NdrStubCall2+00000299

RPCRT4!NdrServerCall2+00000019

RPCRT4!DispatchToStubInCNoAvrf+00000038

RPCRT4!RPC_INTERFACE::DispatchToStubWorker+0000011F

RPCRT4!RPC_INTERFACE::DispatchToStub+000000A3

RPCRT4!OSF_SCALL::DispatchHelper+00000149

RPCRT4!OSF_SCALL::DispatchRPCCall+0000010D

RPCRT4!OSF_SCALL::ProcessReceivedPDU+0000057F

RPCRT4!OSF_SCALL::BeginRpcCall+00000194

RPCRT4!OSF_SCONNECTION::ProcessReceiveComplete+00000435

RPCRT4!ProcessConnectionServerReceivedEvent+00000021

RPCRT4!LOADABLE_TRANSPORT::ProcessIOEvents+000001B8

RPCRT4!ProcessIOEventsWrapper+0000000D

RPCRT4!BaseCachedThreadRoutine+0000009D

RPCRT4!ThreadStartRoutine+0000001B

kernel32!BaseThreadStart+00000034

From both call stacks above we can see that the originating function that leads to the heap allocation is “NTDSA!NspiBind”

NSPI is the interface which the Exchange Address Book provider uses to talk to the Global Catalogs. Typically when Outlook clients want to talk to the directory for address book, they talk to the Exchange server directly and then the Exchange server will look up the appropriate GC, forward the request to it and finally send the result back to the originating clients.This method is called the DSproxy method. Outlook clients can also send referral requests (RFR) to the Exchange server to get a referral to a GC and then start talking to this GC directly. You can find more information about this operation “Here”

With this new information in hand, we turned our attention to the Exchange server for a while. We did so by obtaininga netmon trace from the Exchange server, and started examining it to reveal a massive number of RFR requests originating from outlook client machines toward the Exchange server:

In the netmon capture we saw an excessive number of NspiBind requests been sent from the Outlook client machines to the Exchange server and Exchange server in turn sending these NspiBind requests to the GC, in the netmon capture we didn’t find any frames for NspiUnBind requests, which should be used to destroy that connection.

When we checked the Exchange server security logs we found a massive number of logon events that are occurring for every single client:

Event Type:Success Audit

Event Source:Security

Event Category:(2)

Event ID:540

Date:5/26/2010

Time:3:49:09 PM

User:

Computer:EXCHSERVER

Description:

Successful Network Logon:

User Name:User_Name

Domain:CONTOSO.COM

Logon ID:(0x0,0x3DA1F42)

Logon Type:3

Logon Process:Kerberos

Authentication Package:Kerberos

Workstation Name:

Logon GUID:{b03f9ad2-3150-e1be-f50a-b3abb9d2c09d}

Caller User Name:–

Caller Domain:–

Caller Logon ID:–

Caller Process ID: –

Transited Services: –

Source Network Address:10.11.12.13

Source Port:0

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

So, now we know the problem of lsass.exe memory consumption on the DCs is actually occurring because of a massive number of NspiBind requests that are being sent to the DCs without proper NspiUnbind requests.

So we turned our attention to the client machines that have Outlook installed and started to list all 3rd party add-ins that are installed on them.

We found that all affected client machines are actually sharing 3rd party add-ins for email content archival. Further investigation on this software revealed a strange MAPI Wrapper DLL library, and our customer confirmed that the start time of the problem approximately matches the time they have rolled out the content archival software on the client machines.

We uninstalled this content archival add-in from a couple of machines and these machines stopped hammering the Exchange Server and GC with NspiBind requests.

Later we uninstalled it from the rest of the client machines and the lsass.exe process on the DCs has finally stabilized and we no longer have any issues with its memory usage.

I’d like to mention that starting with Windows Server 2008 we have implemented a “NSPI Max Sessions Per User” where you can limit the number of NSPI requests from a client to the server.

That’s it! I hope you enjoyed reading and hopefully this should help you to troubleshoot any memory leaks you may encounter.The information contained here can be applied to other processes when you suspect heap leaks.

1. You said: "From the backtrace 4BA0 we can see that there are ~1.5 heap allocation calls that have occurred"

Question: Is that in reference to "1538731 allocs" ? And do you mean ~1.5million ?

[Exactly, this is ~1.5 million calls that occurred since the first day we ran the UMDH trace.]

2. You said: "From both call stacks above we can see that the originating function that leads to the heap allocation is “NTDSA!NspiBind”"

Question: How are you so sure it was the function NTDSA!NspiBind and not for example NTDSA!ABBind_local or one of the RPCRT4! functions ?

[We are only interested for the calls that lead to the malloc() function. In this case I was interested in the NspiBind() as it was the function that is responsible for initiating the NSPI session between the client and the server which resulted at the end for a malloc() call. See MSDN for more information on NspiBind(): http://msdn.microsoft.com/en-us/library/dd941979(v=prot.10).aspx]