We recently received four new systems to add to our cluster. They have a
pure UEFI BIOS and do not PXE boot, so I had to install them manually
instead of just letting Rocks build the nodes. We have run into a problem
when we try to use more than one of these new nodes to run Open MPI jobs.
We compiled our own Open MPI with Torque support and have used it on the
existing hardware for a while.
What does work:
serial jobs submitted to each node
parallel jobs that fit into one node
a parallel program launched from the command line on either node
(mpirun -host compute-2-0,compute-2-1 -np 2 hostname)
parallel jobs run on the other nodes
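For reference, the failing batch jobs look roughly like this minimal test script (a sketch; the resource request here is just an example, our real mpitest.sh is similar):

```shell
#!/bin/sh
# Minimal two-node test job (sketch; the resource request is an example).
#PBS -N mpitest
#PBS -l nodes=2:ppn=1
#PBS -j oe
# With Open MPI built with Torque (tm) support, mpirun takes the node
# list from the job's allocation, so no -host/-np arguments are needed.
if command -v mpirun >/dev/null 2>&1; then
    mpirun hostname
    STATUS=ran
else
    # Lets the script be sanity-checked on a machine without Open MPI.
    STATUS=no-mpirun
fi
```

Submitted with qsub; the equivalent mpirun line works fine when run by hand across both nodes, as noted above.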
But when a multi-node job is submitted to these new nodes through the
queue, the output files are never created.
FYI, we use Moab, if that matters.
I have ensured that:
the MPI/Torque libraries are loaded
the Torque versions are the same (3.0.5-1, as supplied by the Rocks
install on the other nodes)
/etc/hosts is the same on all systems
LDAP is working correctly on all systems
home directories on the shared file system are mounted and writeable
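For what it's worth, those checks can be scripted roughly like this; a hedged sketch that assumes passwordless ssh to the nodes and an RPM-based Torque install (the hostnames are our new nodes):

```shell
#!/bin/sh
# Sketch: verify the checklist above on each new node. Assumes
# passwordless ssh and that Torque was installed as an RPM.
for n in compute-2-0 compute-2-1; do
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$n" '
        rpm -q torque                    # same Torque version?
        md5sum /etc/hosts                # identical hosts file?
        getent passwd "$USER"            # LDAP lookups working?
        touch "$HOME/.wtest" && rm "$HOME/.wtest" && echo "home writeable"
    ' 2>/dev/null || echo "$n: unreachable"
done
STATUS=done
```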
Please help, I am at a loss as to what is happening.
Dan
Pertinent log information:
compute-2-0:
12/13/2012 19:46:02;0008; pbs_mom;Job;do_rpp;got an internal task
manager request in do_rpp
12/13/2012 19:46:02;0002; pbs_mom;Svr;im_request;connect from
10.1.255.226:1023
12/13/2012 19:46:02;0008;
pbs_mom;Job;296009.server.name.edu;im_request:received request
'ABORT_JOB' (10) for job 296009.server.name.edu from 10.1.255.226:1023
12/13/2012 19:46:02;0008; pbs_mom;Job;296009.server.name.edu;ERROR:
received request 'ABORT_JOB' from 10.1.255.226:1023 for job
'296009.server.name.edu' (job does not exist locally)
12/13/2012 19:46:02;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file
descriptor (9) in do_rpp, cannot get protocol End of File
12/13/2012 19:46:02;0002; pbs_mom;Svr;im_eof;End of File from addr
10.1.255.226:1023
compute-2-1:
12/13/2012 19:48:06;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file
descriptor (9) in do_rpp, cannot get protocol Premature end of message
12/13/2012 19:48:06;0002; pbs_mom;Svr;im_eof;Premature end of message
from addr 10.1.255.227:15003
12/13/2012 19:48:06;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::node_bailout,
296009.server.name.edu join_job failed from node compute-2-0 1 -
recovery attempted)
12/13/2012 19:48:06;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::sister could
not communicate (15061) in 296009.server.name.edu, job_start_error from
node compute-2-0 in job_start_error
12/13/2012 19:48:06;0008; pbs_mom;Req;send_sisters;sending command
ABORT_JOB for job 296009.server.name.edu (10)
12/13/2012 19:48:06;0008; pbs_mom;Req;send_sisters;sending ABORT to
sisters for job 296009.server.name.edu
12/13/2012 19:48:06;0001;
pbs_mom;Job;296009.server.name.edu;send_sisters: sister #1
(compute-2-0) is not ok (1099)
12/13/2012 19:48:06;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::node_bailout,
node_bailout: received KILL/ABORT request for job 296009.server.name.edu
from node compute-2-0
12/13/2012 19:48:06;0080; pbs_mom;Svr;scan_for_exiting;searching for
exiting jobs
12/13/2012 19:48:06;0008; pbs_mom;Job;kill_job;scan_for_exiting:
sending signal 9, "KILL" to job 296009.server.name.edu, reason: local
task termination detected
12/13/2012 19:48:06;0008; pbs_mom;Job;296009.server.name.edu;kill_job
done (killed 0 processes)
12/13/2012 19:48:06;0080; pbs_mom;Job;296009.server.name.edu;sending
preobit jobstat
12/13/2012 19:48:06;0008; pbs_mom;Job;do_rpp;got an internal task
manager request in do_rpp
12/13/2012 19:48:06;0002; pbs_mom;Svr;im_request;connect from
10.1.255.227:15003
12/13/2012 19:48:06;0008;
pbs_mom;Job;296009.server.name.edu;im_request:received request 'ERROR'
(99) for job 296009.server.name.edu from 10.1.255.227:15003
12/13/2012 19:48:06;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request,
event 4 taskid 0 not found
12/13/2012 19:48:06;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request,
error sending command 99 to job 296009.server.name.edu
12/13/2012 19:48:06;0002; pbs_mom;Svr;im_eof;No error from addr
10.1.255.227:15003
12/13/2012 19:48:06;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
12/13/2012 19:48:06;0080;
pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top
of while loop
12/13/2012 19:48:06;0080; pbs_mom;Svr;preobit_reply;in while loop, no
error from job stat
12/13/2012 19:48:06;0080; pbs_mom;Job;296009.server.name.edu;performing
job clean-up in preobit_reply()
12/13/2012 19:48:06;0080; pbs_mom;Job;296009.server.name.edu;epilog
subtask created with pid 72123 - substate set to JOB_SUBSTATE_OBIT -
registered post_epilogue
Head node:
12/13/2012 19:47:22;0008;PBS_Server;Job;296009.server.name.edu;Job Run
at request of root at server.name.edu
12/13/2012 19:47:22;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate:
setting job 296009.server.name.edu state from QUEUED-QUEUED to
RUNNING-PRERUN (4-40)
12/13/2012 19:47:22;0008;PBS_Server;Job;296009.server.name.edu;forking
in send_job
12/13/2012 19:47:22;0004;PBS_Server;Svr;svr_connect;attempting connect
to host 10.1.255.226 port 15002
12/13/2012 19:47:22;0008;PBS_Server;Job;296009.server.name.edu;entering
post_sendmom
12/13/2012 19:47:22;0002;PBS_Server;Job;296009.server.name.edu;child
reported success for job after 0 seconds (dest=compute-2-1), rc=0
12/13/2012 19:47:22;0008;PBS_Server;Job;reply_send;Reply sent for
request type RunJob on socket 13
12/13/2012 19:47:22;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate:
setting job 296009.server.name.edu state from RUNNING-PRERUN to
RUNNING-RUNNING (4-42)
12/13/2012 19:47:22;0004;PBS_Server;Svr;svr_connect;attempting connect
to host 10.1.255.226 port 15002
12/13/2012 19:47:27;0004;PBS_Server;Svr;svr_connect;attempting connect
to host 10.1.255.226 port 15002
.......
12/13/2012 19:50:30;0009;PBS_Server;Job;296009.server.name.edu;obit
received - updating final job usage info
12/13/2012 19:50:30;0008;PBS_Server;Job;296009.server.name.edu;attr
resources_used modified
12/13/2012 19:50:30;0008;PBS_Server;Job;296009.server.name.edu;attr
Error_Path modified
12/13/2012 19:50:30;0008;PBS_Server;Job;reply_send;Reply sent for
request type JobObituary on socket 14
12/13/2012 19:50:30;0009;PBS_Server;Job;296009.server.name.edu;job exit
status -3 handled
12/13/2012 19:50:30;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate:
setting job 296009.server.name.edu state from RUNNING-RERUN1 to
EXITING-RERUN1 (5-61)
12/13/2012
19:50:30;0009;PBS_Server;Job;296009.server.name.edu;on_job_rerun task
assigned to job
12/13/2012
19:50:30;0009;PBS_Server;Job;296009.server.name.edu;req_jobobit completed
12/13/2012 19:50:30;0004;PBS_Server;Svr;svr_connect;attempting connect
to host 10.1.255.226 port 15002
12/13/2012 19:50:30;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate:
setting job 296009.server.name.edu state from EXITING-RERUN1 to
EXITING-RERUN2 (5-62)
12/13/2012 19:50:30;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate:
setting job 296009.server.name.edu state from EXITING-RERUN2 to
EXITING-RERUN3 (5-63)
.......
and this keeps repeating
We also get logs like:
Unable to copy file
/var/spool/torque/spool/296008.biocluster.igb.illinois.edu.OU to
/home/a-m/danield/mpitest.sh.o296008 *** error from copy /bin/cp: cannot
stat `/var/spool/torque/spool/296008.biocluster.igb.illinois.edu.OU': No
such file or directory *** end error output
But I am fairly sure that is because the MPI job never actually starts on
the second node after the first node receives the job information.
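In case it helps, the sister-mom communication in the logs above can also be probed directly with the standard Torque client tools (momctl, pbsnodes); a sketch using our hostnames:

```shell
#!/bin/sh
# Sketch: query each new node's pbs_mom for diagnostics (connectivity,
# known jobs) and check the server's view of the nodes. Guarded so it
# degrades gracefully where the Torque client tools are absent.
for n in compute-2-0 compute-2-1; do
    if command -v momctl >/dev/null 2>&1; then
        momctl -d 3 -h "$n" || true
    fi
done
if command -v pbsnodes >/dev/null 2>&1; then
    pbsnodes compute-2-0 compute-2-1 || true
fi
STATUS=done
```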