I have a subnode that is currently using 7 out of its 8 slots. I have jobs waiting in the queue, but they will not start processing. Everything was working fine a couple weeks ago, and then it just stopped. I restarted the subnode a couple of times to try to fix it, but that did not work. I also modified np_load_avg so that it equals 100, but that did not work either. There is probably a really easy answer for this. Can anyone point me in the right direction?

Post by jsadino
---------- Forwarded message ----------
Date: Fri, Dec 3, 2010 at 1:10 PM
Subject: subnode with empty slots but jobs in queue

Hello,
I have a subnode that is currently using 7 out of its 8 slots. I have jobs waiting in the queue, but they will not start processing. Everything was working fine a couple weeks ago, and then it just stopped. I restarted the subnode a couple of times to try to fix it, but that did not work. I also modified np_load_avg so that it equals 100, but that did not work either.

The load_thresholds can also be set to NONE when cores = slots.
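
Something along these lines should do it; the queue name all.q is only a guess here, adjust it to your actual queue:
---
# show the current queue definition, including load_thresholds
qconf -sq all.q

# clear the load threshold entirely (sensible when cores = slots)
qconf -mattr queue load_thresholds NONE all.q
---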

Did you define/request any memory or other resource? Any resource quota set in place?

Are the waiting jobs serial ones?
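
To check the resource/quota side, something like the following might help (the job ID 1234 is just a placeholder for one of your pending jobs):
---
# list any resource quota sets and their rules
qconf -srqsl
qconf -srqs

# current quota usage for all users
qquota -u "*"

# for a pending job, the scheduling info (if enabled) often names what blocks it
qstat -j 1234
---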

-- Reuti

Post by jsadino
There is probably a really easy answer for this. Can anyone point me in the right direction?
Thank you!
Jeff Sadino

Post by jsadino
I have a subnode that is currently using 7 out of its 8 slots. I have jobs waiting in the queue, but they will not start processing. Everything was working fine a couple weeks ago, and then it just stopped.

The load_thresholds can also be set to NONE when cores = slots. Did you define/request any memory or other resource? Any resource quota set in place? Are the waiting jobs serial ones?

I have a similar problem with SGE 6.2u4. I have a node with 48 cores which will only run 30 jobs. Here is the relevant output from qconf:

---
seq_no                0
load_thresholds       NONE
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make mpi mpich orte
rerun                 FALSE
slots                 1,[compute-0-0.local=4],[compute-0-1.local=4], \
                      [compute-0-2.local=4],[compute-0-3.local=4], \
                      [compute-0-5.local=4],[compute-0-4.local=4], \
                      [compute-0-6.local=4],[compute-0-7.local=48], \
                      [compute-0-8.local=48]
---
Right now compute-0-8 is down, although qstat still shows some jobs for it. (Why would this happen?)

SGE assumes some network problems. You will have to use `qdel -f ...` to get rid of these jobs.
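
For example, with a placeholder job ID:
---
# force-delete a job the (dead) execd can no longer report on
qdel -f 1234
---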

Post by jlforrest
All the jobs in this cluster are serial jobs. Any idea why I can't run 18 more jobs on compute-0-7? I restarted the qmaster but it didn't make any difference.

Cordially,
--
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA 94720-1460
510-643-1032

No, I forgot to mention that you need -u "*" in addition, to get the list of all users' jobs.
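
I.e. something like this, assuming the earlier command was a plain qstat -f:
---
# full queue listing, including every user's jobs
qstat -f -u "*"
---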

No problem. At least it wasn't me screwing up. The output is below.

I think I might have some idea of what might be causing this. compute-0-7 crashed last week, I think on 12/02/2010. I brought it up soon afterwards. So, the jobs that show a submit time of before 12/02/2010 are not really there. I counted and there are 19 of them. This, plus the 29 that are running, equals 48, which is the number of cores.

So the real question is why these jobs remained visible to SGE after compute-0-7 was rebooted.

Was the node only rebooted, or was the local spool directory of SGE also removed? When the local spool directory still exists after the reboot, the execd will inform the qmaster about the failed jobs. When there is no information on the node about the last running jobs, the execd won't tell the qmaster anything, and the qmaster on its own just keeps waiting for the jobs to reappear.
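
If you want to see what the execd side still knows, something like this might help (the spool path below is only an example; the real location depends on your installation):
---
# where is the execd spool directory configured?
qconf -sconf | grep execd_spool_dir
# a host-local configuration, if present, may override it:
qconf -sconf compute-0-7

# the execd keeps one subdirectory per job it believes is running
ls /path/to/execd_spool_dir/compute-0-7/active_jobs
---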

Post by reuti
Was the node only rebooted, or was the local spool directory of SGE also removed? When the local spool directory still exists after the reboot, the execd will inform the qmaster about the failed jobs. When there is no information on the node about the last running jobs, the execd won't tell the qmaster anything, and the qmaster on its own just keeps waiting for the jobs to reappear.

This is a Rocks cluster, so after the node crashed it was reinstalled from scratch. This removed the local spool directory, which would explain my problem. In fact, from what you say, this would happen whenever a Rocks node is reinstalled if there were running SGE jobs when the node crashed, right?

Post by reuti
When the local spool directory still exists after the reboot, the execd will inform the qmaster about the failed jobs. When there is no information on the node about the last running jobs, the execd won't tell the qmaster anything, and the qmaster on its own just keeps waiting for the jobs to reappear.

I was thinking about this. I wonder if this is the right thing to do. If the actual contents of the local spool directory are empty, or different from what the qmaster expects, then what point is there for the qmaster to continue to think that the jobs exist, or will ever come back? In other words, shouldn't the contents of the local spool directory determine the qmaster's conception of reality?

--
Jon Forrest

I remember that this discussion came up on the list before. I'm not sure of the final conclusion, and I also can't find any issue entered for it. The idea was something like a one-time check when the execd comes up again: compare what should be there from the qmaster's point of view with what the execd finds in its local (spool) directory.

SGE was just not designed to handle reinstalled nodes in combination with local spool directories.
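
As a rough, untested sketch only: if the local spool directory lived on a partition that survives a Rocks reinstall (often /state/partition1 on Rocks, but that is an assumption about your setup), the execd could still report the failed jobs when the node comes back:
---
# set a host-local execd_spool_dir on persistent storage, e.g.
#   execd_spool_dir   /state/partition1/sge/spool
qconf -mconf compute-0-7

# the execd on that node has to be restarted to pick up the change
---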

This is not only an issue for the jobs listed in `qstat`, but also for possible tightly integrated tasks which were started by `qrsh -inherit`. If there was one, the complete job is invalid. If there was none (as your parallel jobs have serial and parallel steps), then you may be lucky and the job won't notice the reinstallation of the node at all.

Feel free to enter an issue about it.

More complicated still would be this issue in combination with persistent scratch directories on the nodes, which I suggested here:

http://gridengine.sunsource.net/issues/show_bug.cgi?id=3292

If the local persistent scratch directory is gone, something happened to the node...