Zombies probably won’t consume 32 GB of your memory like they did to me, but zombie processes do exist, and I can help you find them and make sure that developers fix them. Tool source link is at the bottom.

Is it just me, or do Windows machines that have been up for a while seem to lose memory? After a few weeks of use (or a long weekend of building Chrome over 300 times) I kept noticing that Task Manager showed me running low on memory, but it didn’t show the memory being used by anything. In the example below, Task Manager shows 49.8 GB of RAM in use, plus 4.4 GB of compressed memory, and yet only 5.8 GB of paged/non-paged pool, few processes running, and no process using anywhere near enough to explain where the memory had gone:

My machine has 96 GB of RAM – lucky me – and when I don’t have any programs running I think it’s reasonable to hope that I’d have at least half of it available.

Sometimes I have dealt with this by rebooting but that should never be necessary. The Windows kernel is robust and well implemented so this memory disappearing shouldn’t happen, and yet…

The first clue came when I remembered that a coworker of mine had complained of zombie processes being left behind – processes that had shut down but not been cleaned up by the kernel. He’d even written a tool that would dump a list of zombie processes – their names and counts. His original complaint was of hundreds of zombies. I ran his tool and it showed 506,000 zombie processes!

It occurred to me that one cause of zombie processes could be one process failing to close its handles to other processes. And the great thing about having a huge number of zombies is that they are harder to hide. So, I went to Task Manager’s Details tab, added the Handles column, and sorted by it. Voila. I immediately saw that CcmExec.exe (part of Microsoft’s Systems Management Server) had 508,000 handles open, which is both a lot and also amazingly close to my zombie count.

I held my breath and killed CcmExec.exe, unsure of what would happen:

The results were as dramatic as I could imagine. As I said earlier, the Windows kernel is well written and when a process is killed then all of its resources are freed. So, those 508,000 handles that were owned by CcmExec.exe were abruptly closed and my available memory went up by 32 GB! Mystery solved!

What is a zombie process?

Until this point we weren’t entirely sure what was causing these processes to hang around. In hindsight it’s obvious that these zombies were caused by a trivial user-space coding bug. The rule is that when you create a process you need to call CloseHandle on its process handle and its thread handle. If you don’t care about the process then you should close the handles immediately. If you do care – if you want to wait for the process to quit – WaitForSingleObject(hProcess, INFINITE); – or query its exit code – GetExitCodeProcess(hProcess, &exitCode); – then you need to remember to close the handles after that. Similarly, if you open an existing process with OpenProcess you need to close that handle when you are done.

If the process that holds on to the handles is a system process then it will even continue holding those handles after you log out and log back in – another source of confusion during our investigation last year.

So, a zombie process is a process that has shut down but is kept around because some other still-running process holds a handle to it. It’s okay for a process to do this briefly, but it is bad form to leave a handle unclosed for long.

Where is that memory going?

Another thing I’d done during the investigation was to run RamMap. This tool attempts to account for every page of memory in use. Its Process Memory tab had shown hundreds of thousands of processes that were each using 32 KB of RAM and presumably those were the zombies. But ~500,000 times 32 KB only equals ~16 GB – where did the rest of the freed up memory come from? Comparing the before and after Use Counts pages in RamMap explained it:

We can plainly see the ~16 GB drop in Process Private memory. We can also see a 16 GB drop in Page Table memory. Apparently a zombie process consumes ~32 KB of page tables, in addition to its ~32 KB of process private memory, for a total cost of ~64 KB. I don’t know why zombie processes consume that much RAM; presumably nobody has optimized this cost because there should never be enough zombies for it to matter.

A few types of memory actually increased after killing CcmExec.exe, mostly Mapped File and Metafile. I don’t know what that means but my guess would be that that indicates more data being cached, which would be a good thing. I don’t necessarily want memory to be unused, but I do want it to be available.

Trivia: RamMap opens all processes, including zombies, so it needs to be closed before the zombies will go away.

Windows has a reputation for not handling process creation as well as Linux and this investigation, and one of my previous ones, suggest that that reputation is well earned. I hope that Microsoft fixes this bug – it’s sloppy.

Why do I hit so many crazy problems?

I work on the Windows version of Chrome, and one of my tasks is optimizing its build system, which requires doing a lot of test builds. Building Chrome involves creating between 28,000 and 37,000 processes, depending on build settings. When using our distributed build system (goma) these processes are created and destroyed very quickly – my fastest full build ever took about 200 seconds. This aggressive process creation has revealed a number of interesting bugs, mostly in Windows or its components:

What now?

If you aren’t on a corporate managed machine then you probably don’t run CcmExec.exe and you will avoid this particular bug. And if you don’t build Chrome or something equivalent then you will probably avoid this bug. But!

CcmExec is not the only program that leaks process handles. I have found many others leaking modest numbers of handles and there are certainly more.

The bitter reality, as all experienced programmers know, is that any mistake that is not explicitly prevented will be made. Simply writing “This handle must be closed” in the documentation is insufficient. So, here is my contribution towards making this something detectable, and therefore preventable. FindZombieHandles is a tool, based on NtApiDotNet and sample code from @tiraniddo, that prints a list of zombies and who is keeping them alive. Here is sample output from my home laptop:

274 total zombie processes.
249 zombies held by IntelCpHeciSvc.exe(9428)
249 zombies of Video.UI.exe
14 zombies held by RuntimeBroker.exe(10784)
11 zombies of MicrosoftEdgeCP.exe
3 zombies of MicrosoftEdge.exe
8 zombies held by svchost.exe(8012)
4 zombies of ServiceHub.IdentityHost.exe
2 zombies of cmd.exe
2 zombies of vs_installerservice.exe
3 zombies held by explorer.exe(7908)
3 zombies of MicrosoftEdge.exe
1 zombie held by devenv.exe(24284)
1 zombie of MSBuild.exe
1 zombie held by SynTPEnh.exe(10220)
1 zombie of SynTPEnh.exe
1 zombie held by tphkload.exe(5068)
1 zombie of tpnumlkd.exe
1 zombie held by svchost.exe(1872)
1 zombie of userinit.exe

274 zombies isn’t too bad, but it represents some bugs that should be fixed. The IntelCpHeciSvc.exe one is the worst, as it seems to leak a process handle every time I play a video from Windows Explorer.

Visual Studio leaks handles to at least two processes and one of these is easy to reproduce. Just fire off a build and wait ~15 minutes for MSBuild.exe to go away. Or, if you “set MSBUILDDISABLENODEREUSE=1” then MSBuild.exe goes away immediately and every build leaks a process handle. Unfortunately some jerk at Microsoft fixed this bug the moment I reported it, and the fix may ship in VS 15.6, so you’ll have to act quickly to see this (and no, I don’t really think he’s a jerk).

You can also see leaked processes using Process Explorer, by configuring the lower pane to show handles, as shown here (note that both the process and thread handles are leaked in this case):

Lenovo’s tphkload.exe leaks one handle, their SUService.exe leaks three handles

Synaptic’s SynTPEnh.exe leaks one handle

googledrivesync.exe leaks one handle (reported internally)

Process handles aren’t the only kind that can be leaked. For instance, the “Intel(R) Online Connect Access service” (IntelTechnologyAccessService.exe) only uses 4 MB of RAM, but after 30 days of uptime had created 27,504 (!!!) handles. I diagnosed this leak using just Task Manager and reported it here. I also used the awesome !htrace command in windbg to get stacks for the CreateEventW calls from Intel’s code. Think they’ll fix this?

Using Process Explorer I could see that NVDisplay.Container.exe from NVIDIA has ~5,000 handles to the \BaseNamedObjects\NvXDSyncStop-61F8EBFF-D414-46A7-90AE-98DD58E4BC99 event, creating a new one about every two minutes. I guess they want to be really sure that they can stop NvXDSync? Reported, and a fix has been checked in.

Apparently ETDCtrl.exe (11.x), some app associated with ELANTech/Synaptics trackpads, leaks handles to shared memory. The process accumulated about 16,000 handles and when the process was killed about 3 GB of missing RAM was returned to the system – quite noticeable on an 8 GB laptop with no swap.

Apparently nobody has been paying attention to this for a while – hey Microsoft, maybe start watching for handle leaks so that Windows runs better? And Intel and NVIDIA? Take a look at your code. I’ll be watching you.

So, grab FindZombieHandles, run it on your machine, and report or fix what you find, and use Task Manager and Process Explorer as well.

Updates: Microsoft recommended disabling the feature that leaks handles and doing so has resolved the issue for me (and they are fixing the leaks). It’s an expensive feature and it turns out we were ignoring the data anyway! Also, all Windows 10 PIDs are multiples of four which explains why ~500,000 zombies led to PIDs in the 2,000,000+ range.


About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x faster. Unless it's eliminating large numbers of bugs.
I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8.

RAII is so important that this is one of those rare occasions where it’s worth dropping everything you’re doing and implementing it around important resources like process handles. The fact that Microsoft can ship code in 2018 that leaks process handles is disgraceful.

The API is to blame, of course; it should be reimplemented as an RAII API immediately. Barring that, you could probably fix it with minimal impact by wrapping the handle in a std::unique_ptr with CloseHandle as the deleter function.

If one subdivides objects into those that own non-fungible resources, those that have mutable state but own no resources, and those which have neither mutable state nor fungible resources, RAII is much better than GC for the first, both are good for the second, and GC is better than RAII for the third. Consequently, a good language really should provide for both RAII and GC. Unfortunately, languages that support GC don’t really accommodate RAII, and vice versa.

That was the reason, I think, that MS COM was built over the ARC model (perhaps SOM and CORBA, predating it, were too). Apple managed to bring GC into non-managed Objective-C, but after a few years moved back to ARC because of it.

Frankly, at this point both RAII and IDisposable/using seem about the same.
In both cases you have to manually implement the boilerplate, and your resource-owning class either frees resources in the destructor or in some finalizer method, which frankly looks like an implementation detail. Also, while merely creating an object is surely much easier than using{} or try{}finally{} control constructs, from the “bigger picture” it is a comparable hassle.

I guess the reasons are mostly psychological. GC languages push programmers to never think about owning and managing things, because the smart runtime will do it better. And modern computers have an abundance of all the generic resources like CPU power and memory. So when those “non-fungible resources” require switching your mindset to good ol’ manual lifetime control – they just don’t even have the idea.
The Rust language has a point with its separation of ownership and borrowing concepts, at the expense of yet more boilerplate 😀

Yes, both usual and elevated.
OTOH I think that is another miss for the tool; it could self-elevate.

The tool does not know to wait for the user to read its output, so it should be run from some other console application, like cmd.exe.

Then the tool does not know how to elevate itself, which means I have to spawn one more cmd window, this time elevated, then manually navigate into the download folder (where I just downloaded four files one after another) and run it.
Tedious, quite.

I understand this is only a proof of concept, so I can only hope SysInternals will pick it up and make a somewhat more user-friendly tool, something with a UI like their RootkitRevealer.

Having an *option* to self elevate would be nice, but I wouldn’t want it to require elevation since then some people would not be able to run it. The elevation doesn’t change the counts reported, it just affects whether all process names are retrieved.

But can a tool without elevation reliably scan other processes’ handles, especially those of elevated processes? See Process Explorer and Process Monitor – the latter installs a custom driver so it requires elevation, and the former only shows a subset of data when not elevated.

What does this tool look for: the current interactive user’s applications, or a total count of zombie processes on the machine? If the latter, then elevation is required.

So in the end, the question is whether it would be a reasonable use case to only count zombies among Current User processes, ignoring all other zombies.

I’d say “no”, but I do not have to work on terminal servers and similar environments, so YMMV

So we are back at square one: whether “find some zombies, not all” is reasonable for the general situation, especially for a non-informed user (one that did not read the entire manual, and all the blogs/forums, before running the tool). So perhaps the ideal behavior would be for the tool to elevate itself by default, but have a bail-out option to stay non-elevated.

That said, I still think this tool just asks to be included in the SysInternals suite or NirSoft suite, etc. And since they implement their tools in classic old C++, closer to the API, I guess if they can be persuaded they are better suited to polish both UI and compatibility. You found the problem and made the concept demonstrator, and that is great. You obviously don’t plan to extend it into a feature-full toolkit of utilities anyway.

@BruceDawson, the specific issue you are referring to is an issue with CcmExec.exe and its associated driver, prepdrv.sys. The issue is not with the OS. The driver registers for process create and delete notifications (PsSetCreateProcessNotifyRoutine), then queues a notification to its usermode component (CcmExec) to let it know about either a process create or exit. The issue is that the queue the driver uses has a limit of 1024 entries. This means that if you are on a machine that has a very high process create/delete rate in a very short duration (e.g. a build machine), the notifications from the driver are lost. The System Center team is working on it.

Interesting, I will test. I have strange problems with Edge when it runs long, roughly a week.
Some convenience for the program, for UAC – in app.manifest:

  <requestedPrivileges>
    <requestedExecutionLevel level="requireAdministrator" uiAccess="false" />
  </requestedPrivileges>

and, at the end of the program:

  Console.ReadKey();

🙂

I’ve been having memory issues for weeks, and was able to track it down to zombie processes thanks to this post. Unfortunately, it seems that “System” is the process holding open the handles… Any idea what that could mean, or where I should start looking next? It seems every process that executes turns into a zombie.

I’d hit over 50k after only a few hours, and eventually run out of main memory after 8-10 hours or so. I’m running Windows 10.

The issue seemed to go away when in safe mode, so I tried removing/updating some drivers and it didn’t seem to fix it. I think I’m just going to reformat at this point. Thanks anyways, this was a very helpful post regardless!

Wow – that’s crazy. 50k zombie processes will consume about 3.2 GB of RAM which on most “normal” machines is a huge percentage. I think you have two separate mysteries:
1) Who is creating all of those processes? 10k+ per hour is extremely high for anyone not building Chrome.
2) Why are they not closing the process handles?

This is just from normal usage. Short-lived processes such as Chrome’s made up the bulk of what I was seeing. When literally every thread and process leaves behind a zombie, I guess it fills up pretty quickly.

I unfortunately already nuked the PC, but if for some reason it happens again I’ll definitely get you a trace!

Zombie threads should be much more lightweight than zombie processes – if zombie threads are even a thing.

50,000 zombies in a few hours suggests several zombie processes *per second*. That is not normal. Chrome, for instance, creates a new process on most navigations, but most people don’t navigate to several new sites per second.

@Rob @brucedawson
We’re experiencing a similar issue of the “System” process creating and holding open quite a few zombie processes (although a lot less than you’re talking about). Still, after several days of uptime our developer workstations run out of memory and must be restarted to clear the issue.

I’ve been investigating with the Sysinternals tools and the FindZombieHandles tool from this post, but I’m no closer to finding the root cause of the issue. We’ve previously gone through support cases with many of our software vendors trying to determine the cause (we had suspicions it was related to our security software, our time-tracking software, our file-sharing software, etc.).

Did you ever find a cause for your issue? Or any other guidance on troubleshooting exactly why the System process is spawning zombies?

By the way, this post and the comments were excellent and very timely for us. Thank you both for your contributions!

In my case it wasn’t the “System” process holding the handles, it was a SYSTEM process. That is, there is a process called “System” and there are several processes that run as the SYSTEM user. In my case the CcmExec.exe process (a SYSTEM process) was holding the handles, so I knew which team at Microsoft to contact. If the “System” process (with that name) is leaking handles then I don’t know what is going on. Driver bug?

Try in safe mode, maybe try using App Verifier or similar to grab call stacks for handle creation. Or, just try disabling one piece of software at a time until the problem stops happening.

It is quite possible that the process that is leaking the process handles is not actually creating the processes, but is merely getting handles to processes created by others – that is quite common.

Hi Bruce, this is probably being caused by the use of software metering in the SCCM environment (used to gather info on processes that are executed for reporting in SCCM). It isn’t something you can turn off on your own – your SCCM admins have that power.

It is fixed in the recent SCCM 1802 release – so you might pass this on to whomever manages your Windows computers.

And, if the zombie count is increasing by one per 3 s then after a day or two it should be clearly visible as an elevated handle count in *some* process. So, just wait and then look at task manager or process explorer and see which process has the highest handle count. That’s probably your culprit. Good luck, and let us know what you find.

Curious. I am unaware of a way to have a zombie process without somebody having a handle to it. At one per 3 s you should be leaking ~28,000 zombies per day, which should be very noticeable. If you get up to 30,000+ zombies and that isn’t visible in somebody’s handle count (make sure Task Manager is showing processes from all users) then I’m afraid you are in terra incognita. I guess you could try killing processes (and not restarting them) until the leaking stops, but ???