Hi all,
The function start_process() in src/resmom/start_exec.c seems to have a superfluous write() in the parent code after the fork. Here is the section of code where this occurs:
  /*
  ** Begin a new process for the fledgling task.
  */
  if ((pid = fork_me(-1)) == -1)
    {
    /* fork failed */
    return(-1);
    }

  if (pid != 0)
    {
    /* parent */
    int gotsuccess = 0;

    close(kid_read);
    close(kid_write);

    /* read sid */
    for (;;)
      {
      i = read(parent_read, (char *)&sjr, sizeof(sjr));

      if ((i == -1) && (errno == EINTR))
        continue;

      if ((i == sizeof(sjr)) && (sjr.sj_code == 0) && !gotsuccess)
        {
        gotsuccess = 1;

        if (write(parent_write, &sjr, sizeof(sjr)) == -1) {}

        continue;
        }

      if (gotsuccess)
        {
        i = sizeof(sjr);
        }

      break;
      }  /* END for(;;) */

    j = errno;

    close(parent_read);

    if (i != sizeof(sjr))
      {
      sprintf(log_buffer, "read of pipe for sid job %s got %d not %ld (errno: %d, %s)",
              pjob->ji_qs.ji_jobid,
              i,
              (long)sizeof(sjr),
              j,
              strerror(j));

      log_err(j, id, log_buffer);

      close(parent_write);

      return(-1);
      }

    /* *************** This is the write that seems to be extra ******************** */
    if (write(parent_write, &sjr, sizeof(sjr)) == -1) {}

    close(parent_write);
The child process calls starter_return(), which writes a code to the pipe the parent reads from (parent_read) and then waits for the acknowledgment from the parent. Once the acknowledgment is received, the child closes its read end of the acknowledgment pipe (kid_read in the code above). The parent then tries to write to that pipe once more. This is a race: if the child reads the acknowledgment and closes its end before the parent's second write, the kernel sends SIGPIPE to the parent and pbs_mom is terminated.
This write() statement seems to have been in place from the beginning. I was working on the 2.4 branch and was able to reproduce the SIGPIPE every time. I commented out the write() statement and the code worked as expected.
Is there a reason we should not remove this write() statement from the code?
Ken Nielson
Cluster Resources