Hi all.
Please bear with me if I say something foolish; I've been involved with this Torque/Maui stuff for, oh, about six days.
I'm going to give a few bullet points, for those who don't have time to read a long post, and then a longer explanation.
Here's the scenario:
We recently upgraded from maui-3.2.6p16/torque-2.2.1 to maui-3.2.6p21/torque-2.3.6. Since then we've been having very
serious performance problems that appear to stem from failed connections to the pbs_server.
A few issues:
1. If pbs_mom fails to notify the pbs_server that a job has terminated (via post_epilogue()), it never tries again.
This results in an inconsistent state where the pbs_server believes jobs are still running that have in fact
terminated. We can confirm this by examining the pbs_mom logs and finding where the mom failed to notify the pbs_server
that a job completed. Running "pbsnodes {node}" on the server still shows those jobs as running. The message in the mom
log looks like this:
03/19/2009 13:15:07;0001; pbs_mom;Svr;pbs_mom;Operation now in progress (115) in post_epilogue, cannot bind to port
1023 in client_to_svr - connection refused
2. Even more strangely, we find messages in the maui.log on the head node showing a job trying to start on one of these
worker nodes, which pbs indicates are 'job-exclusive'. Those job startups fail with:
03/20/2009 00:01:33;0080;PBS_Server;Req;req_reject;Reject reply code=15044(Resource temporarily unavailable
REJHOST=worker186 MSG=cannot allocate node 'worker186' to job - node not currently available (nps needed/free: 1/0,
joblist: 123281.cab1.fnal.gov:0,127446.cab1.fnal.gov:1)), aux=0, type=RunJob, from root@cab1.fnal.gov
3. We also have a problem where maui freezes for 15 minutes trying to query the pbs_server process. During this time,
no jobs are scheduled at all. I have seen this reported on the lists, but the only fixes I've found are an unofficial
patch against 3.2.6p20 that someone proposed, and modifying a maui parameter (we've already changed the parameter).
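To find the affected jobs from item 1 without reading the mom logs by eye, something like the following could scan a log for the failed-obit message. This is only a rough sketch: the function names are made up for illustration, and the two match strings come from the single sample line quoted above, so other TORQUE versions may word the message differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if a pbs_mom log line looks like the failed-obit message
 * quoted in item 1, 0 otherwise. The match strings are an assumption
 * based on that one sample line. */
int is_failed_obit_line(const char *line)
{
    assert(line != NULL);
    return strstr(line, "in post_epilogue") != NULL &&
           strstr(line, "client_to_svr") != NULL;
}

/* Counts matching lines in an already opened log stream.
 * (Hypothetical helper; the caller opens and closes the file.) */
int count_failed_obits(FILE *fp)
{
    char buf[4096];
    int count = 0;

    assert(fp != NULL);
    while (fgets(buf, sizeof(buf), fp) != NULL) {
        if (is_failed_obit_line(buf))
            count++;
    }
    return count;
}
```

Each hit marks a job the server was probably never told about, so those are the ones to compare against the "pbsnodes {node}" output.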
So my questions are:
Why is pbs_server getting so busy it won't allow connections?
How do I get the pbs_mom and pbs_server to synchronize jobs after the mom fails to notify the server of a job completion?
Why is maui doing job starts on nodes that pbs_server, incorrectly, thinks are job-exclusive?
How do I stop maui from freezing for 15 minutes at a time?
Here's the longer analysis for those who haven't fallen asleep:
If we look at "pbsnodes {node}", we'll see jobs listed as running that we know are listed as completed in the worker
node mom log. I traced this through the code, and it appears to be an issue with the post_epilogue() procedure. That
procedure attempts to open a connection to the pbs_server and notify it that a job has finished; the code refers to
this as an "obit". If that connection fails, the pbs_mom process never tries again to notify the server that the job
has completed.
The relevant portion of the code is:
Source File: resmom/catch_child.c, post_epilogue() procedure
  /* open new connection */
  sock = mom_open_socket_to_jobs_server(pjob, id, obit_reply);
  if (sock < 0)
    {
    /* FAILURE */
    if ((errno == EINTR) || (errno == ETIMEDOUT) || (errno == EINPROGRESS))
      {
      /* transient failure - server/network up but busy... retry */
      int retrycount;
      for (retrycount = 0;retrycount < 2;retrycount++)
        {
        sock = mom_open_socket_to_jobs_server(pjob, id, obit_reply);
        if (sock >= 0)
          break;
        } /* END for (retrycount) */
      }
    if (sock < 0)
      {
      /* We are trying to send obit, but failed - where is this retried?
       * Answer: I think that the main_loop should examine jobs and try
       * every so often to send the obit. This would work for recovered
       * jobs also.
       */
      return(1);
      }
    }
Please note that the comment at the end recognizes that a failed connection will require that the obit be attempted
later, but I cannot see anywhere in the code that this is being done. There is just a little snippet that hints that
someone was thinking about this in mom_main.c:
int MOMRetryObit = 0; /* NOTE: change to TRUE is 2.4 */
So it looks like this is being planned for 2.4, but not working in 2.3.6.
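For what it's worth, the retry that the source comment describes ("the main_loop should examine jobs and try every so often to send the obit") might look roughly like the sketch below. None of this is TORQUE code: struct pending_job, retry_pending_obits(), the 30-second interval, and the two stand-in senders are all hypothetical names and values chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define OBIT_RETRY_INTERVAL 30  /* seconds between attempts; arbitrary */

/* Hypothetical stand-in for the mom's per-job bookkeeping. */
struct pending_job {
    int    obit_pending;   /* 1 while the server has not been told */
    time_t next_attempt;   /* earliest time for the next attempt   */
    int    attempts;       /* how many sends have been tried       */
};

/* Tries to deliver one obit; returns 0 on success, nonzero on failure. */
typedef int (*send_obit_fn)(struct pending_job *job);

/* Meant to be called once per main-loop pass. Retries every obit that is
 * due and returns how many obits are still undelivered afterwards. */
int retry_pending_obits(struct pending_job *jobs, size_t njobs,
                        time_t now, send_obit_fn send_obit)
{
    int still_pending = 0;
    size_t i;

    assert(send_obit != NULL);
    assert(jobs != NULL || njobs == 0);

    for (i = 0; i < njobs; i++) {
        struct pending_job *job = &jobs[i];

        if (!job->obit_pending)
            continue;
        if (now >= job->next_attempt) {
            job->attempts++;
            if (send_obit(job) == 0) {
                job->obit_pending = 0;   /* server now knows */
                continue;
            }
            /* failed again: back off until the next interval */
            job->next_attempt = now + OBIT_RETRY_INTERVAL;
        }
        still_pending++;
    }
    return still_pending;
}

/* Stand-in senders for trying the sketch out; a real sender would open
 * the socket to the server and write the obit. */
int obit_send_fails(struct pending_job *job)    { (void)job; return 1; }
int obit_send_succeeds(struct pending_job *job) { (void)job; return 0; }
```

The point of keeping the pending flag on the job rather than returning from post_epilogue() and forgetting about it is that a busy server only delays the obit instead of losing it.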
Similarly, maui on the head node freezes while trying to run a pbs_disconnect() (after timing out requesting info from
the pbs server). There are various messages about this problem floating around, but I still haven't seen a solution.
We've already set RMCFG[base] TIMEOUT=120, but that doesn't seem to help.
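As a point of comparison for the freeze, here is a generic POSIX sketch of a connection attempt that cannot block longer than a chosen timeout. It is not how Maui or TORQUE connect (that code isn't shown here); connect_with_timeout() is a hypothetical helper. It also shows where an EINPROGRESS like the one in the mom log normally comes from: a non-blocking connect() that has not finished yet.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Connects to an IPv4 address, waiting at most timeout_ms milliseconds.
 * Returns a connected (still non-blocking) socket, or -1 on failure. */
int connect_with_timeout(const char *ip, unsigned short port, int timeout_ms)
{
    struct sockaddr_in addr;
    struct pollfd pfd;
    int fd, flags, err = 0;
    socklen_t len = sizeof(err);

    assert(ip != NULL);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return -1;                 /* not a dotted-quad IPv4 address */

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        goto fail;

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
        return fd;                 /* connected immediately */
    if (errno != EINPROGRESS)
        goto fail;                 /* hard failure, e.g. ECONNREFUSED */

    /* EINPROGRESS: the handshake is still running. Wait for the socket
     * to become writable, but never longer than timeout_ms. */
    pfd.fd = fd;
    pfd.events = POLLOUT;
    if (poll(&pfd, 1, timeout_ms) != 1)
        goto fail;                 /* timed out, interrupted, or error */

    /* Writable does not mean connected; SO_ERROR holds the real result. */
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err != 0)
        goto fail;
    return fd;

fail:
    close(fd);
    return -1;
}
```

With something along these lines a busy server costs the caller at most timeout_ms per attempt, rather than stalling it for minutes.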
It looks to me like the main problem here is that pbs_server is not accepting connections for some reason. That seems
to cause these other problems.
I apologize for the length of this post...
Thanks in advance!
Ed