Winbind using 100% CPU

(Re-posting on this email list per Jeremy Allison's request.)

I am trying to figure out why winbind is using 100% CPU on my file server.
I am using Samba version 4.0.4. Everything is fine for a few minutes when I
start winbind, however after a while it begins using 100% CPU. I haven't
been able to narrow down what triggers this CPU usage spike, but I did
attach the GNU debugger to find out what's going on in the process. The
backtrace revealed this information:

Re: Winbind using 100% CPU

On Thursday 11 April 2013 09:58:33 Dylan Klomparens wrote:

> (Re-posting on this email list per Jeremy Allison's request.)
>
> I am trying to figure out why winbind is using 100% CPU on my file server.
> I am using Samba version 4.0.4. Everything is fine for a few minutes when I
> start winbind, however after a while it begins using 100% CPU. I haven't
> been able to narrow down what triggers this CPU usage spike, but I did
> attach the GNU debugger to find out what's going on in the process. The
> backtrace revealed this information:
>
> #0 0x000000000041cf30 in _talloc_free@plt ()
> #1 0x0000000000452320 in winbindd_reinit_after_fork ()
> #2 0x00000000004524e6 in fork_domain_child ()
> #3 0x0000000000453585 in wb_child_request_trigger ()
> #4 0x000000381d2048e2 in tevent_common_loop_immediate () from
> /lib64/libtevent.so.0
> #5 0x00007fbed6b98e17 in run_events_poll () from /lib64/libsmbconf.so.0
> #6 0x00007fbed6b9922e in s3_event_loop_once () from /lib64/libsmbconf.so.0
> #7 0x000000381d204060 in _tevent_loop_once () from /lib64/libtevent.so.0
> #8 0x000000000042049a in main ()

Re: Winbind using 100% CPU

Here are two valgrind reports, one with leak-check=summary (the default),
and the other with leak-check=full (very verbose). The exact command line
is at the top of each file. The Samba executable I am running is from a
Fedora package so debug symbols are not included in the executable. Thus, I
couldn't run the desired gdb commands. Is this the first time this bug has
been encountered? I can try and download/compile the source and attempt to
re-capture the problem with more information if that is necessary.

Re: Winbind using 100% CPU

> (Re-posting on this email list per Jeremy Allison's request.)
>
> I am trying to figure out why winbind is using 100% CPU on my file server.
> I am using Samba version 4.0.4. Everything is fine for a few minutes when I
> start winbind, however after a while it begins using 100% CPU. I haven't
> been able to narrow down what triggers this CPU usage spike, but I did
> attach the GNU debugger to find out what's going on in the process. The
> backtrace revealed this information:
>
> #0 0x000000000041cf30 in _talloc_free@plt ()
> #1 0x0000000000452320 in winbindd_reinit_after_fork ()
> #2 0x00000000004524e6 in fork_domain_child ()
> #3 0x0000000000453585 in wb_child_request_trigger ()
> #4 0x000000381d2048e2 in tevent_common_loop_immediate () from
> /lib64/libtevent.so.0
> #5 0x00007fbed6b98e17 in run_events_poll () from /lib64/libsmbconf.so.0
> #6 0x00007fbed6b9922e in s3_event_loop_once () from /lib64/libsmbconf.so.0
> #7 0x000000381d204060 in _tevent_loop_once () from /lib64/libtevent.so.0
> #8 0x000000000042049a in main ()
>
> Apparently it's stuck in the winbindd_reinit_after_fork (and more
> specifically the _talloc_free function). This code resides in
> $SOURCE_HOME\source3\winbindd\winbindd_dual.c.

That looks like corrupted memory - probably a loop
in the talloc tree.

which is probably from dcerpc_add_auth_footer() but this is not the codepath
from the backtrace. Did you run into the 100% CPU problem while running under
valgrind?

> The exact command line
> is at the top of each file. The Samba executable I am running is from a
> Fedora package so debug symbols are not included in the executable. Thus, I
> couldn't run the desired gdb commands.

Please install the debuginfo package so we get the symbols. You can do this
with:

debuginfo-install samba

> Is this the first time this bug has been encountered?

More or less. Did winbind hit the 100% problem while running with valgrind?

> I can try and download/compile the source and attempt to
> re-capture the problem with more information if that is necessary.

debuginfo-install samba

Then please get us a full backtrace and the talloc report. After that please
run again with valgrind.

Re: Winbind using 100% CPU

> On Thu, Apr 11, 2013 at 09:58:33AM -0400, Dylan Klomparens wrote:
>> (Re-posting on this email list per Jeremy Allison's request.)
>>
>> I am trying to figure out why winbind is using 100% CPU on my file server.
>> I am using Samba version 4.0.4. Everything is fine for a few minutes when I
>> start winbind, however after a while it begins using 100% CPU. I haven't
>> been able to narrow down what triggers this CPU usage spike, but I did
>> attach the GNU debugger to find out what's going on in the process. The
>> backtrace revealed this information:
>>
>> #0 0x000000000041cf30 in _talloc_free@plt ()
>> #1 0x0000000000452320 in winbindd_reinit_after_fork ()
>> #2 0x00000000004524e6 in fork_domain_child ()
>> #3 0x0000000000453585 in wb_child_request_trigger ()
>> #4 0x000000381d2048e2 in tevent_common_loop_immediate () from
>> /lib64/libtevent.so.0
>> #5 0x00007fbed6b98e17 in run_events_poll () from /lib64/libsmbconf.so.0
>> #6 0x00007fbed6b9922e in s3_event_loop_once () from /lib64/libsmbconf.so.0
>> #7 0x000000381d204060 in _tevent_loop_once () from /lib64/libtevent.so.0
>> #8 0x000000000042049a in main ()
>>
>> Apparently it's stuck in the winbindd_reinit_after_fork (and more
>> specifically the _talloc_free function). This code resides in
>> $SOURCE_HOME\source3\winbindd\winbindd_dual.c.
>
> That looks like corrupted memory - probably a loop
> in the talloc tree.

I've got a user who sees this and we're adding the same dlinklist
element twice, creating a loop in the winbind child list.

I've got a broken wrist so responses take a while, but that's my
current hint. This is on 3.6.3 and 3.6.13.
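
To illustrate the hint above, here is a minimal sketch of how adding the same element twice can turn a NULL-terminated doubly linked list into a loop. It assumes a simplified head insertion, not Samba's actual DLIST_ADD macro, and the names struct child, list_add and list_walk are hypothetical, made up for this example only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a winbind child record; not Samba's struct. */
struct child {
    struct child *prev, *next;
    int id;
};

/* Simplified head insertion (an assumption, not Samba's DLIST_ADD).
 * It does not check whether p is already on the list. */
static void list_add(struct child **list, struct child *p)
{
    assert(list != NULL && p != NULL);
    p->prev = NULL;
    p->next = *list;          /* if p is already the head, p->next = p */
    if (*list != NULL) {
        (*list)->prev = p;
    }
    *list = p;
}

/* Walk the list, giving up after `limit` steps.
 * Returns the number of elements visited, or -1 if the limit was
 * exceeded, which suggests the list has a loop. */
static int list_walk(const struct child *list, int limit)
{
    int n = 0;
    for (const struct child *c = list; c != NULL; c = c->next) {
        if (++n > limit) {
            return -1;
        }
    }
    return n;
}
```

With two elements added once each, list_walk() reaches the NULL terminator. After the head element is added a second time, its next pointer refers to itself, so an unbounded "for (c = list; c != NULL; c = c->next)" style traversal would never finish, which would be consistent with a process spinning at 100% CPU.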

Re: Winbind using 100% CPU

> On Thu, Apr 11, 2013 at 6:59 PM, Jeremy Allison <[hidden email]> wrote:
>> On Thu, Apr 11, 2013 at 09:58:33AM -0400, Dylan Klomparens wrote:
>>> (Re-posting on this email list per Jeremy Allison's request.)
>>>
>>> I am trying to figure out why winbind is using 100% CPU on my file server.
>>> I am using Samba version 4.0.4. Everything is fine for a few minutes when I
>>> start winbind, however after a while it begins using 100% CPU. I haven't
>>> been able to narrow down what triggers this CPU usage spike, but I did
>>> attach the GNU debugger to find out what's going on in the process. The
>>> backtrace revealed this information:
>>>
>>> #0 0x000000000041cf30 in _talloc_free@plt ()
>>> #1 0x0000000000452320 in winbindd_reinit_after_fork ()
>>> #2 0x00000000004524e6 in fork_domain_child ()
>>> #3 0x0000000000453585 in wb_child_request_trigger ()
>>> #4 0x000000381d2048e2 in tevent_common_loop_immediate () from
>>> /lib64/libtevent.so.0
>>> #5 0x00007fbed6b98e17 in run_events_poll () from /lib64/libsmbconf.so.0
>>> #6 0x00007fbed6b9922e in s3_event_loop_once () from /lib64/libsmbconf.so.0
>>> #7 0x000000381d204060 in _tevent_loop_once () from /lib64/libtevent.so.0
>>> #8 0x000000000042049a in main ()
>>>
>>> Apparently it's stuck in the winbindd_reinit_after_fork (and more
>>> specifically the _talloc_free function). This code resides in
>>> $SOURCE_HOME\source3\winbindd\winbindd_dual.c.
>>
>> That looks like corrupted memory - probably a loop
>> in the talloc tree.
> I've got a user who sees this and we're adding the same dlinklist
> element twice, creating a loop in the winbind child list.
>
> I've got a broken wrist so responses take a while, but that's my
> current hint. On 3.6.3 and 3.6.13.
>

I see this in the parent winbind log. The last 3 entries come from changes
I made to the dlist macros within winbind only. You can see a 2-second
delay and then a second add of the same item to the child winbind list.
There are no entries in between (it is a production system, so there is
reluctance to increase the debug level).
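
The actual changes to the dlist macros are not shown in this message. As a rough sketch of the kind of check that could expose a second add of the same item, the following wraps a simplified head insertion with a membership test and a log line. The names struct tracked_child, list_contains and checked_add are hypothetical, and the insertion logic is an assumption rather than Samba's macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a winbind child record; not Samba's struct. */
struct tracked_child {
    struct tracked_child *prev, *next;
    int id;
};

/* Return true if p is already reachable from the list head.
 * Assumes the list is still NULL-terminated (no loop yet). */
static bool list_contains(const struct tracked_child *list,
                          const struct tracked_child *p)
{
    for (const struct tracked_child *c = list; c != NULL; c = c->next) {
        if (c == p) {
            return true;
        }
    }
    return false;
}

/* Head insertion that refuses to add an element twice.
 * Returns true if p was added, false if it was already present,
 * in which case a message is logged and the list is left untouched. */
static bool checked_add(struct tracked_child **list, struct tracked_child *p)
{
    assert(list != NULL && p != NULL);
    if (list_contains(*list, p)) {
        fprintf(stderr, "duplicate add of child %d (%p) skipped\n",
                p->id, (void *)p);
        return false;
    }
    p->prev = NULL;
    p->next = *list;
    if (*list != NULL) {
        (*list)->prev = p;
    }
    *list = p;
    return true;
}
```

A check like this costs a full list walk on every add, which is acceptable for a short list of child processes but is meant here as a debugging aid, not as a fix for whatever causes the second add.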

Re: Winbind using 100% CPU

On Thu, Apr 18, 2013 at 12:16 PM, Dylan Klomparens
<[hidden email]> wrote:
> The patch has exposed the problem. I've attached the output from winbind.
> Please let me know if you need additional information!
Wow, that's quick. Any other clues, like what was going on on the
system then? What was causing winbind to do work? How many trusted
domains?

I seem to have the exact same problem in my environment - I have three
domains, and it seems that at some point multiple winbindd instances go into
an infinite loop in the same location in winbindd_dual, because of duplication
in the winbindd_child list: