On 10/22/12 12:52 PM, "Michael Jennings" <mej at lbl.gov> wrote:
>Thanks for the pointers. Unfortunately, we've been running with those
>changes in place for quite some time now, and it doesn't seem to have
>fixed the problem. So I guess we'll keep looking.
Interesting. Without the first patch, I couldn't start jobs. Without the
second, jobs never showed as completed or disappeared from the server. Do
your jobs fail to start, or fail to exit? Is there any difference if you
do a single-node job versus a multi-node job?
If you turn logging up to 7 on the pbs_server and pbs_moms, is there
anything interesting written to the logs?
>For what it's worth, I found this error some time ago (which, based on
>the revision numbers you gave me, came from your patch). It doesn't
>seem to fix the issue either, but it's still likely needed (because
>dash will always be exactly equal to dot as a result):
>
>Index: src/server/job_func.c
>===================================================================
>--- src/server/job_func.c (revision 6967)
>+++ src/server/job_func.c (working copy)
>@@ -2197,7 +2197,7 @@
> * the get the external sub-job */
> if (get_subjob == TRUE)
> {
>- dot = strchr(jobid, '-');
>+ dot = strchr(jobid, '.');
>
> if (((dash = strchr(jobid, '-')) != NULL) &&
> (dot != NULL) &&
The patch I sent in had a '.' in there, but apparently that isn't what got
committed. As written, that's going to break the heterogeneous subjob
feature. But if you aren't using that feature, this should still protect
you from the original defect.
~Matt
---
Matt Ezell
HPC Systems Administrator
Oak Ridge National Laboratory