Followup 1

--On November 16, 2009 8:27:21 PM +0000 whm@stanford.edu wrote:
> Full_Name: Bill MacAllister
> Version: 2.4.19+
> OS: Debian 5
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (171.64.19.165)
>
>
> On a system where slapd is the top CPU process and is consuming 4% of the
> CPU, simple queries are taking tens of seconds to complete. Here is an
> example query:
Last time we saw this it appeared that it might be related to DNS. How
long does a netstat -a take on the affected systems when the LDAP queries
are slow?
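One way to check the DNS theory is to time "netstat -a" against "netstat -an". The -n flag prints numeric addresses and skips reverse lookups, so if the first is slow while the second is fast, name resolution is the likely bottleneck. The Python sketch below assumes a Unix host with netstat on the PATH; time_command and compare_netstat are hypothetical helper names, not part of any OpenLDAP tooling.

```python
import shutil
import subprocess
import time


def time_command(argv):
    """Run argv to completion, discarding its output, and return wall-clock seconds."""
    start = time.monotonic()
    subprocess.run(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False)
    return time.monotonic() - start


def compare_netstat():
    """Hypothetical check: time netstat with and without hostname resolution."""
    if shutil.which("netstat") is None:
        raise RuntimeError("netstat not found on PATH")
    resolved = time_command(["netstat", "-a"])  # resolves addresses to hostnames (DNS)
    numeric = time_command(["netstat", "-an"])  # -n: numeric addresses, no lookups
    return resolved, numeric


# Example (not executed here):
# resolved, numeric = compare_netstat()
# print(f"netstat -a: {resolved:.2f}s  netstat -an: {numeric:.2f}s")
```

Running both while the LDAP queries are slow, and again when they are not, gives a rough before/after comparison.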
--Quanah
--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra :: the leader in open source messaging and collaboration

--On Monday, November 16, 2009 07:37:49 PM -0800 Quanah Gibson-Mount
<quanah@zimbra.com> wrote:
> Last time we saw this it appeared that it might be related to DNS.
> How long does a netstat -a take on the affected systems when the
> LDAP queries are slow?
I don't have that number. I have only seen it once in the test
environment, so I am not sure how long it will take me to get it. Since
this plastered the production service today, though, I will have to
reproduce it before we attempt another production upgrade.

It is not clear to me why a DNS issue would appear only under load and
not at other times. It also seems strange that it appears in
lenny/openldap-2.4 and not in etch/openldap-2.3.
Bill
--
Bill MacAllister, System Software Programmer
Unix Systems Group, Stanford University

On Nov 16, 2009, at 9:03 PM, Bill MacAllister <whm@stanford.edu> wrote:
> I don't have that number. I have only seen it once in the test
> environment, so I am not sure how long it will take me to get it. Since
> this plastered the production service today, though, I will have to
> reproduce it before we attempt another production upgrade.
>
> It is not clear to me why a DNS issue would appear only under load and
> not at other times. It also seems strange that it appears in
> lenny/openldap-2.4 and not in etch/openldap-2.3.
For historical note, this was caused by cyrus-sasl being built
incorrectly by the Debian packagers when Heimdal is used.
--Quanah

Followup 4

quanah@zimbra.com writes:
> For historical note, this was caused by cyrus-sasl being built
> incorrectly by the Debian packagers when Heimdal is used.
Well, rather, that's a theory which currently looks good. It definitely
hasn't been proven yet.
--
Russ Allbery (rra@stanford.edu) <http://www.eyrie.org/~eagle/>

--On Tuesday, November 17, 2009 09:11:23 PM -0800 Quanah Gibson-Mount
<quanah@zimbra.com> wrote:
> For historical note, this was caused by cyrus-sasl being built
> incorrectly by the Debian packagers when Heimdal is used.
>
> --Quanah
I don't understand why you refer to this finding as historical. If I
am reading this correctly, you and Howard have found the underlying
cause. Now that the problem is understood, can you suggest a way for
us to reproduce the problem in our test environments? At this point we
really need to convince ourselves that the problem is indeed fixed
before we try to deploy 2.4 in our production environment again.
Bill

--On November 18, 2009 9:39:02 AM +0000 Bill MacAllister
<whm@stanford.edu> wrote:
>> For historical note, this was caused by cyrus-sasl being built
>> incorrectly by the Debian packagers when Heimdal is used.
>>
>
> I don't understand why you refer to this finding as historical.
It is not a historical finding; it is a record for anyone who comes across
this ITS and wants to know what was found.
> If I am reading this correctly, you and Howard have found the underlying
> cause. Now that the problem is understood, can you suggest a way for
> us to reproduce the problem in our test environments? At this point we
> really need to convince ourselves that the problem is indeed fixed
> before we try to deploy 2.4 in our production environment again.
First off, note that this issue was not a bug in OpenLDAP, and the project
went beyond its scope in tracking down why cyrus-sasl was behaving the way
it was. Developing test cases for you to explore is also beyond the scope
of the OpenLDAP project when dealing with non-OpenLDAP issues.
However, given what is known (that the NTLM code path was being called
during SASL/GSSAPI binds), I would suggest that you either set up a number
of Windows boxes that attempt SASL/GSSAPI authentication with NTLM against a
test server, or write a script that does the same and run it from multiple
systems.
Some reference points:
<http://www.netid.washington.edu/documentation/ldapAuth.aspx>
It also seems that it may be possible to use python-ldap to do this. I don't
know whether it is possible with Net::LDAP or Net::LDAPapi.
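A minimal load-generation sketch in Python, assuming python-ldap is installed on the client, a Kerberos ticket has already been obtained (e.g. with kinit), and ldap://ldap-test.example.com stands in for your own test server. It performs many concurrent SASL/GSSAPI binds using python-ldap's sasl_interactive_bind_s with ldap.sasl.gssapi and records per-bind latency. It does not force the NTLM path; whether a Unix GSSAPI client reaches the same cyrus-sasl code path as the Windows clients described above is an untested assumption. The helper names gssapi_bind_once, run_load, and summarize are hypothetical.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def gssapi_bind_once(uri):
    """One SASL/GSSAPI bind plus unbind via python-ldap.

    Assumes python-ldap is installed and a Kerberos ticket is in the
    credential cache. The import is deferred so the timing harness below
    can be exercised without python-ldap present.
    """
    import ldap
    import ldap.sasl

    conn = ldap.initialize(uri)
    try:
        conn.sasl_interactive_bind_s("", ldap.sasl.gssapi())
    finally:
        conn.unbind_s()


def run_load(bind, workers, total):
    """Call bind() `total` times across `workers` threads.

    Returns per-call latencies in seconds; an exception raised by any
    bind propagates to the caller.
    """
    def timed(_):
        start = time.monotonic()
        bind()
        return time.monotonic() - start

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(timed, range(total)))


def summarize(latencies):
    """Return (median, worst) latency; the worst case exposes multi-second stalls."""
    return statistics.median(latencies), max(latencies)


# Example (not executed here; the URI is a placeholder):
# latencies = run_load(lambda: gssapi_bind_once("ldap://ldap-test.example.com"), 20, 500)
# print("median %.3fs, worst %.3fs" % summarize(latencies))
```

Running the script from several client machines at once and watching the worst-case latency, rather than the median, may better approximate the "only under load" pattern reported earlier in this thread.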
--Quanah