Well,
as a quick follow-up to my post: while the error message in TORQUE
would suggest a bug, the problem turned out to be in another piece of
software. In addition to the logs I first showed, we also saw these
errors in /var/log/messages:
Jun 28 12:42:10 10.10.10.101 n1 Jun 28 12:42:10 pbs_mom:
LOG_ERROR::Resource temporarily unavailable (11) in fork_me, fork failed
Jun 28 13:21:09 10.10.10.104 n4 Jun 28 13:21:09 pbs_mom:
LOG_ERROR::Resource temporarily unavailable (11) in fork_me, fork failed
Jun 28 13:21:09 10.10.10.104 n4 Jun 28 13:21:09 pbs_mom:
LOG_ERROR::fork_to_user, forked failed, errno=25 (Inappropriate ioctl
for device)
Jun 28 13:21:09 10.10.10.104 n4 Jun 28 13:21:09 pbs_mom:
LOG_ERROR::Inappropriate ioctl for device (25) in req_cpyfile,
fork_to_user failed with rc=-15010 'forked failed, errno=25
(Inappropriate ioctl for device)' - returning failure
Jun 28 13:21:10 10.10.10.100 n0 Jun 28 13:21:10 pbs_mom:
LOG_ERROR::Resource temporarily unavailable (11) in fork_me, fork failed
It turned out to be an issue with a new signal being passed up from
copy_process() inside the kernel. The kernel was refusing to complete
the fork while a signal was pending for the parent, so fork() failed
with EAGAIN. A quick patch to some of our code cleaned up the problem.
-Joshua Bernstein
Software Development Manager
Penguin Computing
Ken Nielson wrote:
> On 06/23/2010 05:26 PM, Joshua Bernstein wrote:
>> David Beer wrote:
>>> ----- Original Message -----
>>>> Hello Folks!
>>>> When testing this specific workload within TORQUE we are running into
>>>> an issue where TORQUE (qstat) thinks there is still a job in the
>>>> running state (R), though logging into that node and running ps -ef
>>>> confirms that no job is in fact running there. Further, the
>>>> application's logfile confirms the job in question exited cleanly.
>>>> If we have a peek inside of the pbs_mom's log for that node, pbs_mom
>>>> prints out a nice message saying something about a bug here. Notice
>>>> the BUG: line in the excerpt below:
>>>> 20100621:06/21/2010 17:46:48;0001; pbs_mom;Job;TMomFinalizeJob3;job
>>>> 994.scyld.localdomain started, pid = 73977
>>>> 20100621:06/21/2010 17:46:49;0080;
>>>> pbs_mom;Job;994.scyld.localdomain;scan_for_terminated: job
>>>> 994.scyld.localdomain task 1 terminated, sid=73977
>>>> 20100621:06/21/2010 17:46:50;0008;
>>>> pbs_mom;Job;994.scyld.localdomain;job was terminated
>>>> 20100621:06/21/2010 17:46:50;0001;
>>>> pbs_mom;Job;951.scyld.localdomain;BUG: mismatched jobid in
>>>> preobit_reply
>>>> (994.scyld.localdomain != 951.scyld.localdomain)
>>>> This particular job isn't doing anything MPI related, and is single
>>>> threaded. I've been able to duplicate this behavior in TORQUE 2.3.10,
>>>> and even in the SVN checkout of the 'trunk' from yesterday.
>>>> Has anybody seen this, or have any idea what's going on here? I'm
>>>> happy, of course, to supply a patch once the problem is better
>>>> understood.
>>> I haven't seen this behavior, but I'd be happy to try to reproduce it.
>>> Do you have some suggestions for reproducing it?
>> Thanks David. I'd love to say I had an easy way to reproduce it, but it
>> seems to involve thousands of jobs running concurrently on the system.
>> I should add that the scheduler in this case is Maui rather than
>> pbs_sched or Moab.
>> Currently, I'm thinking there is a context-switch problem inside of
>> catch_child.c, but it's just an early idea.
>>
>> -Josh
> With over a thousand jobs running I would not be surprised if we were
> caught by a race condition. TORQUE global variables are completely
> unprotected and I am surprised we do not run into more problems.
>
> Ken
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers