Description of problem:
We use LDAP for user authentication on a number of RHEL 3, 4, and 5 servers.
Several months ago, nscd began to crash every few weeks on the RHEL 4 and 5
servers. We did not consider it a big deal because nscd is not a vital service.
Lately, the nscd problems have become serious. Beginning about 1-2 months ago,
on two of our servers (one RHEL 3 and one RHEL 4), nscd began claiming that
certain users did not exist, while lookups for other users worked fine. Once
nscd was turned off, the problem went away.
At that point we disabled nscd on all of our servers with chkconfig.
Unfortunately, at least two (possibly all) of our RHEL 4 boxes will not boot
with nscd disabled. They hang on "starting system message bus." We are
currently turning nscd off by hand on these servers after startup.
Version-Release number of selected component (if applicable):
Varies.
How reproducible:
The messagebus dependency is easy to reproduce. The rest is not.
Steps to Reproduce: (messagebus dependency)
1. Enable messagebus (on RHEL4).
2. Disable nscd.
3. Attempt to reboot machine.
Actual results:
Hang on starting system message bus.
Expected results:
Normal start up.
Additional info:

I believe the message bus problem is related to having 'ldap' listed as a
source for protocols in your /etc/nsswitch.conf file. Remove ldap from that
line and see if that helps.
I'm also curious about the rest of the ldap/nscd issues. Do they look like Bug
#428837?
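
To see at a glance which nsswitch.conf databases consult ldap, something like the following Python sketch could be used. It is only an illustration: the helper name `databases_using` and the sample text are made up here and are not taken from the affected machines. On a real host you would pass in the contents of /etc/nsswitch.conf.

```python
import re


def databases_using(source, conf_text):
    """Return the nsswitch databases whose lookup order mentions `source`."""
    hits = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if ":" not in line:
            continue
        database, _, services = line.partition(":")
        # Bracketed items such as [NOTFOUND=return] are actions, not sources.
        services = re.sub(r"\[[^\]]*\]", " ", services)
        if source in services.split():
            hits.append(database.strip())
    return hits


# Hypothetical excerpt, for illustration only.
SAMPLE = """\
passwd:     files ldap
group:      files ldap
protocols:  files ldap
services:   files
"""

if __name__ == "__main__":
    print(databases_using("ldap", SAMPLE))
```

If `protocols` shows up in the result, that is the line the suggestion above is about.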

We've removed ldap from protocols in /etc/nsswitch.conf. The next time either
our user server or our mail server is rebooted (hopefully not for a while) we'll
let you know if that was the fix.
This issue does not look at all like #428837. I've never seen nscd hit 100% of
the CPU. The two nscd issues that we are having are:
1) nscd crashes at random intervals (usually after running for a week) -- this
behavior has existed for a number of months
2) nscd returns bad or missing information for random users -- this behavior has
existed for 1 or 2 months.
I attempted to debug the first issue, but I was never able to capture a crash
with nscd in debug mode -- it generates a lot of debug data.

Yes, we've got the following line:
nss_initgroups_ignoreusers root,ldap
Also, I've finally been able to record a failure in the ldap logs. My username
is "thras" with uid 4954.
With nscd on, I ran 'id thras' a couple times from the command line, and it
returned no such user. Then I turned nscd off and 'id thras' worked. I wasn't
able to reproduce this again with myself or any other users.
But here is (I think -- there aren't any timestamps) the relevant portion of the
nscd log:
2906: handle_request: request received (Version = 2) from PID 9515
2906: GETFDPW
2906: provide access to FD 6, for passwd
2906: handle_request: request received (Version = 2) from PID 9515
2906: GETFDGR
2906: provide access to FD 8, for group
2906: handle_request: request received (Version = 2) from PID 9515
2906: GETGRBYGID (4954)
2906: Haven't found "4954" in group cache!
2906: handle_request: request received (Version = 2) from PID 9529
2906: GETFDPW
2906: provide access to FD 6, for passwd
2906: handle_request: request received (Version = 2) from PID 9529
2906: GETFDGR
2906: provide access to FD 8, for group
2906: handle_request: request received (Version = 2) from PID 9529
2906: GETGRBYGID (4954)
2906: Haven't found "4954" in group cache!
2906: pruning hosts cache; time 1203108144
About 2000 lines (~3 minutes) earlier in the log this shows up, but I don't
think that's when 'id' failed:
2906: considering INITGROUPS entry "thras", timeout 1200467811
2906: Reloading "thras" in group cache!
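
Since the debug output is large, a small filter can help pull out the cache misses. The Python sketch below assumes the log lines look exactly like the excerpt above (a numeric prefix, then the message). The function name `count_misses` is hypothetical, and other nscd versions may word these messages differently.

```python
import re
from collections import Counter

# Matches lines such as: 2906: Haven't found "4954" in group cache!
# The pattern is derived only from the excerpt above.
MISS = re.compile(r"""^\d+: Haven't found "(?P<key>[^"]*)" in (?P<cache>\w+) cache!""")


def count_misses(log_lines):
    """Count cache misses per (cache name, lookup key)."""
    misses = Counter()
    for line in log_lines:
        match = MISS.match(line.strip())
        if match:
            misses[(match.group("cache"), match.group("key"))] += 1
    return misses


# A few lines copied from the log excerpt above.
EXCERPT = """\
2906: handle_request: request received (Version = 2) from PID 9515
2906: GETGRBYGID (4954)
2906: Haven't found "4954" in group cache!
2906: handle_request: request received (Version = 2) from PID 9529
2906: GETGRBYGID (4954)
2906: Haven't found "4954" in group cache!
2906: pruning hosts cache; time 1203108144
"""

if __name__ == "__main__":
    for (cache, key), count in count_misses(EXCERPT.splitlines()).items():
        print(cache, key, count)
```

Run against the full debug log, this would show whether the misses cluster around particular users or groups.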

We were never able to solve the dbus problem. Currently, downgrading nss_ldap seems to fix all sorts of problems. I don't know what sort of testing process is going on with this package before release, but it may need some modifications.

We've seen similar problems on RHEL4 since about 17 February, when our updates upgraded nss_ldap and nscd. We had not seen this before on RHEL4.
We also see it on some 5.3 boxes, but they weren't in production on anything before 5.3.
However, we have seen problems looking up local users (e.g. 'getent passwd root' fails), so I suspect this is an nscd bug and not an nss_ldap bug.
For example, we have about 10 servers that are very similar software-wise; one of them did not get the updates at the same time, and that host is not seeing the problem.
(I don't agree with the nss_initgroups_ignoreusers workaround; we use 'bind_policy soft' to restore the older nss_ldap behaviour.)
Is anyone seeing this problem without nscd?
This could just be bug #495515 ...
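
Because the lookup failures are intermittent and the nscd log carries no timestamps, a small probe that records when a lookup fails may help correlate events. The Python sketch below is only an illustration. The function name `probe` and its parameters are made up here, and it assumes that Python's `pwd.getpwnam` goes through the same NSS path (and therefore nscd, when it is running) as 'getent passwd'.

```python
import pwd
import time


def probe(names, lookup=pwd.getpwnam, rounds=1, delay=0.0, clock=time.time):
    """Look up each name `rounds` times; return (timestamp, name) per miss."""
    failures = []
    for _ in range(rounds):
        for name in names:
            try:
                lookup(name)
            except KeyError:  # pwd.getpwnam raises KeyError for unknown users
                failures.append((clock(), name))
        if delay:
            time.sleep(delay)
    return failures


if __name__ == "__main__":
    # "root" comes from the report above; add LDAP-only accounts as needed.
    for stamp, name in probe(["root"], rounds=3, delay=1.0):
        when = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(stamp))
        print(when, "lookup failed for", name)
```

Running it once with nscd enabled and once with it stopped would give a rough answer to the question above.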

Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested us to review is now End of Life.
Please see https://access.redhat.com/support/policy/updates/errata/
If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.
