You don't need to switch - this fix is in 4.* as well.
David
On Mon, Jun 25, 2012 at 9:36 AM, Phil Regier <pregier at ittc.ku.edu> wrote:
> Nice; that's pretty slick! I'm sure that will solve the problem; I'll
> switch back to 3.0.5 in a bit to try it out.
>
> Thanks!
>
> Phil
>
> ----- Original Message -----
> From: "David Beer" <dbeer at adaptivecomputing.com>
> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> Sent: Monday, June 25, 2012 10:01:29 AM
> Subject: Re: [torqueusers] Sporadic UID errors
>
> Phil,
>
> We have had other customers/users that had this problem due to LDAP
> failing sometimes. We added a retry parameter for the moms. You can set it
> in the mom's config file; just add the line:
>
> $ext_pwd_retry <num retries>
>
> If you don't really have users going to machines that they shouldn't go
> to, then you might want to set this to a fairly high number so that jobs
> aren't lost unnecessarily.
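
The idea behind a retry count such as $ext_pwd_retry can be sketched in a few lines of Python. This is only an illustration of retrying a user lookup that fails transiently (for example when LDAP is briefly unavailable); it is not Torque's code, and the helper name `lookup_with_retry` and its `retries`, `delay`, and `lookup` parameters are hypothetical. How the mom actually counts attempts is not stated in this thread.

```python
import pwd
import time


def lookup_with_retry(username, retries=3, delay=0.1, lookup=pwd.getpwnam):
    """Look up a user, retrying when the lookup fails.

    Makes one initial attempt plus up to `retries` further attempts.
    Hypothetical helper for illustration; not Torque's implementation.
    """
    if retries < 0:
        raise ValueError("retries must be >= 0")
    last_error = None
    for attempt in range(retries + 1):
        try:
            return lookup(username)
        except KeyError as err:
            # pwd.getpwnam raises KeyError when no entry is found, which is
            # also what a transient directory failure can look like.
            last_error = err
            if attempt < retries:
                time.sleep(delay)
    # Every attempt failed: report the last error to the caller.
    raise last_error
```

A larger retry count trades a slower rejection of genuinely unknown users for fewer jobs lost to a momentary lookup failure, which matches the advice above.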
>
> David
>
> On Fri, Jun 22, 2012 at 2:48 PM, Phil Regier <pregier at ittc.ku.edu> wrote:
>
> Oops. An error and an omission: I meant 4.0.2 instead of 4.0.4 (trying
> 4.0.3 snapshot now), and it should also be noted that as part of the stress
> test I am constantly watching repeated qstats. The problem does not seem to
> appear with 4.0.x as such; might this be related to the switch from a
> single-threaded server to multi-threaded?
>
> ----- Original Message -----
> From: "Phil Regier" <pregier at ittc.ku.edu>
> To: torqueusers at supercluster.org
> Sent: Friday, June 22, 2012 2:14:12 PM
> Subject: Sporadic UID errors
>
> Sorry if this has been raised before (there is another LDAP thread active,
> but I think the problem is very different); I'm still going through the
> archives.
>
> I'm trying to evaluate (stress test) Torque 3.0.5 and 4.0.4 for a possible
> upgrade from 2.x and have come across some odd behaviors. In particular,
> when I submit 1000 small jobs to a fake one-node cluster running Torque
> 3.0.5 and Maui 3.3.1 (built in-house as RPMs -- not by me, but I can
> retrieve specfiles etc. if that would help) and authenticated against LDAP,
> I tend to get 2-3 failed submissions (i.e., about 0.25% of my jobs never
> get accepted); for example:
>
> ...
> 14289.localhost
> 14290.localhost
> 14291.localhost
> qsub: Bad UID for job execution MSG=User pregier does not exist in server password file
>
> 14293.localhost
> 14294.localhost
> 14295.localhost
> ...
>
> This is just a loop; there is no difference between job 14291, 14293, and
> what should have been 14292.
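
For a stress test like this, the output of the submission loop can be tallied automatically. The following is a minimal Python sketch, assuming the output looks like the excerpt above (one `<number>.<host>` line per accepted job and one `qsub:` line per rejected one) and that job IDs are handed out sequentially, so a gap marks a rejected submission. The function name `summarize_submissions` is hypothetical.

```python
import re

# An accepted submission prints a job ID such as "14289.localhost".
JOB_ID = re.compile(r"^(\d+)\.\S+$")


def summarize_submissions(lines):
    """Return (accepted job numbers, qsub error count, missing job numbers)."""
    accepted = []
    failures = 0
    for line in lines:
        text = line.strip()
        match = JOB_ID.match(text)
        if match:
            accepted.append(int(match.group(1)))
        elif text.startswith("qsub:"):
            failures += 1
        # Anything else (e.g. a wrapped continuation line) is ignored.
    missing = []
    if accepted:
        present = set(accepted)
        # Numbers absent from the observed range are presumed rejected jobs.
        missing = [n for n in range(min(accepted), max(accepted) + 1)
                   if n not in present]
    return accepted, failures, missing
```

Fed the excerpt above, this would report one failure and flag 14292 as the missing job number.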
>
> Is this normal? Are there precautions to avoid it, or is this a bug I
> should be reporting in more detail?
>
> Thanks for any suggestions; I'm not terribly experienced with Torque, so
> I'm not sure how quickly I should be bringing this sort of thing to the
> list. I can provide more details about my setup and/or stress tests, but
> didn't want to dump too much useless information in my first post.
>
> Phil Regier
> Student assistant system administrator
> University of Kansas, ITTC
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
> --
> David Beer | Software Engineer
> Adaptive Computing
--
David Beer | Software Engineer
Adaptive Computing