However, lately we observed the following. We have a bunch of 8-core nodes connected by InfiniBand and running MPI jobs across nodes. We found that processes often get placed on full nodes which already have 8 MPI processes running. This leaves us with many oversubscribed nodes (load 16 instead of 8). This happens although there are many empty nodes left in the queue. It is almost as if the slots already taken on a node are ignored by SGE.

This is seen with Open MPI and Intel MPI and with different applications. No application does threading or anything else that would create more processes than the requested slots.

Did anybody have similar observations? We are thankful for any hints on how to debug this.

Post by steve_s
We've been using SGE for a while now and are quite happy with it. However, lately we observed the following. We have a bunch of 8-core nodes connected by InfiniBand and running MPI jobs across nodes. We found that processes often get placed on full nodes which already have 8 MPI processes running. This leaves us with many oversubscribed nodes (load 16 instead of 8). This happens although there are many empty nodes left in the queue. It is almost as if the slots already taken on a node are ignored by SGE.

how many slots are defined in the queue definition, and how many queues do you have defined?

-- Reuti

Post by steve_s
This is seen with Open MPI and Intel MPI and with different applications. No application does threading or anything else that would create more processes than the requested slots. Did anybody have similar observations? We are thankful for any hints on how to debug this.

Post by steve_s
However, lately we observed the following. We have a bunch of 8-core nodes connected by InfiniBand and running MPI jobs across nodes. We found that processes often get placed on full nodes which already have 8 MPI processes running. This leaves us with many oversubscribed nodes (load 16 instead of 8). This happens although there are many empty nodes left in the queue. It is almost as if the slots already taken on a node are ignored by SGE.

how many slots are defined in the queue definition, and how many queues do you have defined?

$ qconf -sql
adde.q
all.q
test.q
vtc.q

Only the first and last queue are used, and only the first is used for parallel jobs. Nodes belong to only one queue at a time, such that jobs in different queues cannot run on the same node.

This is a known issue. When scheduling parallel jobs with 6.2 to 6.2u5, the scheduler ignores host load. This often results in jobs piling up on a few nodes while other nodes are idle. The issue is fixed in 6.2u6 (currently only available in product form).

Post by steve_s
However, lately we observed the following. We have a bunch of 8-core nodes connected by InfiniBand and running MPI jobs across nodes. We found that processes often get placed on full nodes which already have 8 MPI processes running. This leaves us with many oversubscribed nodes (load 16 instead of 8). This happens although there are many empty nodes left in the queue. It is almost as if the slots already taken on a node are ignored by SGE.

how many slots are defined in the queue definition, and how many queues do you have defined?

$ qconf -sql
adde.q
all.q
test.q
vtc.q

Only the first and last queue are used, and only the first is used for parallel jobs. Nodes belong to only one queue at a time, such that jobs in different queues cannot run on the same node.

8 slots (see attachment for full output).

$ qconf -sq adde.q | grep slots
slots                 8

Thank you.

Post by steve_s
However, lately we observed the following. We have a bunch of 8-core nodes connected by InfiniBand and running MPI jobs across nodes. We found that processes often get placed on full nodes which already have 8 MPI processes running. This leaves us with many oversubscribed nodes (load 16 instead of 8). This happens although there are many empty nodes left in the queue. It is almost as if the slots already taken on a node are ignored by SGE.

how many slots are defined in the queue definition, and how many queues do you have defined?

$ qconf -sql
adde.q
all.q
test.q
vtc.q

Only the first and last queue are used, and only the first is used for parallel jobs. Nodes belong to only one queue at a time, such that jobs in different queues cannot run on the same node.

Did you change the host assignment to certain queues while jobs were still running? Maybe you need to limit the total number of slots per machine to 8 in an RQS, or set it in each host's complex_values.

Another reason for virtual oversubscription: processes in state "D" count as running, and despite the high load, everything is actually in order.

Post by templedf
This is a known issue. When scheduling parallel jobs with 6.2 to 6.2u5, the scheduler ignores host load.

Yep.

Post by templedf
This often results in jobs piling up on a few nodes while other nodes are idle.

OK, good to know. We're running 6.2u3 here.

I'm not sure if I get this right: even if the load is ignored, doesn't SGE keep track of the slots already given away on each node? I always thought that this is how jobs are scheduled in the first place (besides policies and all that, but those should have nothing to do with load or slots in this context).

Given that SGE knows, e.g., np_load_avg on each node, I thought we could circumvent the problem by setting np_load_avg to requestable=YES and then using something like

$ qsub -hard -l 'np_load_avg < 0.3' ...

but this gives me

"Unable to run job: denied: missing value for request "np_load_avg". Exiting."

whereas using "=" or ">" works. I guess the reason is what is stated in complex(5):

">=, >, <=, < operators can only be overridden, when the new valueis more restrictive than the old one."

So, I cannot use "<". If that is the case, what can we do about it? Do we need to define a new complex attribute (say 'np_load_avg_less') along with a load_sensor, or can we hijack np_load_avg in another way?
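Something along these lines in the complex configuration ("qconf -mc") is what I have in mind -- the column values below are just a guess on my side, with a made-up shortcut, and the relop column would hold exactly the relation that cannot be overridden per job:

#name               shortcut  type    relop  requestable  consumable  default  urgency
np_load_avg_less    nla       DOUBLE  <=     YES          NO          0        0

A load sensor would then report each node's np_load_avg under this new name, giving "load below X" semantics for -l requests.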

Post by reuti
As far as I understood the problem, the nodes are oversubscribed by getting more than 8 processes scheduled.

Exactly.

Post by reuti
Did you change the host assignment to certain queues while jobs were still running? Maybe you need to limit the total number of slots per machine to 8 in an RQS, or set it in each host's complex_values.

No, we didn't change the host assignment.

Sorry, but what do you mean by RQS? I did not see that in the documentation so far.

Post by reuti
Another reason for virtual oversubscription: processes in state "D" count as running, and despite the high load, everything is actually in order.

Oversubscribed nodes do not always run 16 instead of 8 processes, some only 14 or so. Nevertheless, the load is always almost exactly 16. As far as I can see, processes on these oversubscribed nodes (with > 8 processes) run with ~50% CPU load each.

Post by templedf
This is a known issue. When scheduling parallel jobs with 6.2 to 6.2u5, the scheduler ignores host load.

Yep.

Post by templedf
This often results in jobs piling up on a few nodes while other nodes are idle.

OK, good to know. We're running 6.2u3 here. I'm not sure if I get this right: even if the load is ignored, doesn't SGE keep track of the slots already given away on each node? I always thought that this is how jobs are scheduled in the first place (besides policies and all that, but those should have nothing to do with load or slots in this context). Given that SGE knows, e.g., np_load_avg on each node, I thought we could circumvent the problem by setting np_load_avg to requestable=YES and then using something like

$ qsub -hard -l 'np_load_avg < 0.3' ...

You can only specify a value; the relation is already defined in the complex definition.

Post by steve_s
but this gives me

"Unable to run job: denied: missing value for request "np_load_avg". Exiting."

whereas using "=" or ">" works. I guess the reason is what is stated in complex(5):

When > is working, it's a bug. I get: Unable to run job: unknown resource "fubar>12". (Same for <; maybe it was fixed in 6.2u5.)

Post by steve_s
">=, >, <=, < operators can only be overridden, when the new value is more restrictive than the old one."

So, I cannot use "<". If that is the case, what can we do about it? Do we need to define a new complex attribute (say 'np_load_avg_less') along with a load_sensor, or can we hijack np_load_avg in another way?

Post by reuti
As far as I understood the problem, the nodes are oversubscribed by getting more than 8 processes scheduled.

Post by reuti
Did you change the host assignment to certain queues while jobs were still running? Maybe you need to limit the total number of slots per machine to 8 in an RQS, or set it in each host's complex_values.

No, we didn't change the host assignment. Sorry, but what do you mean by RQS? I did not see that in the documentation so far.

man sge_resource_quota

When you have more than one queue on a machine, all slots in each queue might get used, thus oversubscribing the machine. Hence the total number of slots in use across all queues on each machine must be limited at any time. When you have only one queue per machine, this can't happen, though.
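As a sketch (the rule name is arbitrary; see the man page above for the exact syntax), such an RQS could look like:

{
   name         max_slots_per_host
   description  "cap total slots per host across all queues"
   enabled      TRUE
   limit        hosts {*} to slots=8
}

The complex_values alternative is to set slots=8 in the complex_values of each execution host, e.g. via "qconf -me <hostname>".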

Post by reuti
Another reason for virtual oversubscription: processes in state "D" count as running, and despite the high load, everything is actually in order.

Oversubscribed nodes do not always run 16 instead of 8 processes, some only 14 or so. Nevertheless, the load is always almost exactly 16. As far as I can see, processes on these oversubscribed nodes (with > 8 processes) run with ~50% CPU load each.

What does:

ps -e f

(f w/o -) show on such a node? Are all the processes bound to an sge_shepherd, or did some jump out of the process tree and weren't killed?

You can only specify a value; the relation is already defined in the complex definition.

[...]

Post by reuti
When > is working, it's a bug. I get: Unable to run job: unknown resource "fubar>12". (Same for <; maybe it was fixed in 6.2u5.)

Yes, you are right. The only thing that works is "=":

$ qsub -hard -l 'np_load_avg=0.3' ...

That is no solution to the original problem, though (but apparently not required, either -- see my last post).

[...]

Post by reuti
What does ps -e f (f w/o -) show on such a node? Are all the processes bound to an sge_shepherd, or did some jump out of the process tree and weren't killed?

There are no sge_shepherd's on the nodes. I did not set up SGE on the machine, but what I understand from the documentation is that sge_shepherd is only used in the case of "tight integration" of PEs. In our case, the PE starts the MPI processes.

You can only specify a value; the relation is already defined in the complex definition.

[...]

Post by reuti
When > is working, it's a bug. I get: Unable to run job: unknown resource "fubar>12". (Same for <; maybe it was fixed in 6.2u5.)

$ qsub -hard -l 'np_load_avg=0.3' ...

That is no solution to the original problem, though (but apparently not required, either -- see my last post).

[...]

Post by reuti
What does ps -e f (f w/o -) show on such a node? Are all the processes bound to an sge_shepherd, or did some jump out of the process tree and weren't killed?

There are no sge_shepherd's on the nodes. I did not set up SGE on the machine, but what I understand from the documentation is that sge_shepherd is only used in the case of "tight integration" of PEs. In our case, the PE starts the MPI processes.

Well, even with a loose integration, you have to honor the list of granted machines for your job. What do you mean in detail by "the PE starts the MPI processes"? You will need at least an sge_execd on the nodes, so that SGE is aware of their existence and can make a suitable slot allocation for your job. (The sge_execd will then start the shepherd in case of a tight integration.)

-- Reuti


Post by reuti
What does ps -e f (f w/o -) show on such a node? Are all the processes bound to an sge_shepherd, or did some jump out of the process tree and weren't killed?

There are no sge_shepherd's on the nodes. I did not set up SGE on the machine, but what I understand from the documentation is that sge_shepherd is only used in the case of "tight integration" of PEs. In our case, the PE starts the MPI processes.

Well, even with a loose integration, you have to honor the list of granted machines for your job. What do you mean in detail by "the PE starts the MPI processes"? You will need at least an sge_execd on the nodes, so that SGE is aware of their existence and can make a suitable slot allocation for your job. (The sge_execd will then start the shepherd in case of a tight integration.)

Yes, sge_execd is present on each node, as well as sge_shepherd-$JOB_ID on the master node, where the job-script is executed:

 4693 ?  Sl  33:32 /cm/shared/apps/sge/current/bin/lx26-amd64/sge_execd
12165 ?  S    0:00  \_ sge_shepherd-60013 -bg
12389 ?  S    0:00      \_ python /cm/shared/apps/intel/impi/3.2.2.006/bin64/mpiexec ....

Apparently, we have tight integration then. I did look for sge_shepherd on the wrong node (not the master node). This is the first time I've taken a closer look at these daemons; that's why there was a little confusion here (we got the machine pre-configured and all, and getting familiar with the system always takes a factor of pi longer than expected). Sorry for the noise.

Now that we know what to look for, we can search for jobs which do not behave.

Post by reuti
What does ps -e f (f w/o -) show on such a node? Are all the processes bound to an sge_shepherd, or did some jump out of the process tree and weren't killed?

There are no sge_shepherd's on the nodes. I did not set up SGE on the machine, but what I understand from the documentation is that sge_shepherd is only used in the case of "tight integration" of PEs. In our case, the PE starts the MPI processes.

Well, even with a loose integration, you have to honor the list of granted machines for your job. What do you mean in detail by "the PE starts the MPI processes"? You will need at least an sge_execd on the nodes, so that SGE is aware of their existence and can make a suitable slot allocation for your job. (The sge_execd will then start the shepherd in case of a tight integration.)

Yes, sge_execd is present on each node, as well as sge_shepherd-$JOB_ID on the master node, where the job-script is executed:

 4693 ?  Sl  33:32 /cm/shared/apps/sge/current/bin/lx26-amd64/sge_execd
12165 ?  S    0:00  \_ sge_shepherd-60013 -bg
12389 ?  S    0:00      \_ python /cm/shared/apps/intel/impi/3.2.2.006/bin64/mpiexec ....

Apparently, we have tight integration then. I did look for sge_shepherd on the wrong node (not the master node). This is the first time I've taken a closer look at these daemons; that's why there was a little confusion here (we got the machine pre-configured and all, and getting familiar with the system always takes a factor of pi longer than expected). Sorry for the noise.

The sge_shepherd will be started on each slave node in case of a tight integration, too. When you have a loose integration and no sge_shepherd on the slaves, there may be processes which survive the crash of a job and hence result in the effect you observed. Simply because SGE doesn't know anything about processes started by a simple rsh/ssh outside of SGE's context.

There is a Howto for the tight integration of MPICH2 prior to 1.3 and Intel MPI (which you are using) into SGE:

http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
http://gridengine.sunsource.net/howto/remove_orphaned_processes.html

Intel MPI will at some point in the future also use the Hydra startup manager.

-- Reuti

Post by steve_s
Now that we know what to look for, we can search for jobs which do not behave.

Post by reuti
The sge_shepherd will be started on each slave node in case of a tight integration, too. When you have a loose integration and no sge_shepherd on the slaves, there may be processes which survive the crash of a job and hence result in the effect you observed. Simply because SGE doesn't know anything about processes started by a simple rsh/ssh outside of SGE's context.

OK, makes sense. I checked again, and yes: sge_shepherd only on the master. The sge_shepherds on the slaves are from different jobs.

Post by reuti
There is a Howto for the tight integration of MPICH2 prior to 1.3 and Intel MPI (which you are using) into SGE:

http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
http://gridengine.sunsource.net/howto/remove_orphaned_processes.html

Intel MPI will at some point in the future also use the Hydra startup manager.

I may be either missing info or context, but we had this problem with 6.2 with overlapping Qs, and it was resolved by explicitly specifying the threshold for the Qs, setting np_load_avg to just over 1.

$ qconf -sq long | grep load_thresholds
load_thresholds       np_load_avg=1.1
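For completeness: we set this by editing the queue configuration with something like

$ qconf -mq long

and changing the load_thresholds line to the value shown above; the queue name is of course site-specific.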

We often get overlapping-Q execution hosts registering their displeasure by entering an overload state, but only by a few percentage points (1 compute process per core plus a few % due to system processes).

Almost all our Qs are overlapping due to competing requirements / hardware, and this seems to address that part of it fine (though I'd much prefer to keep them separate for simplicity's sake).


Post by hjmangalam
I may be either missing info or context, but we had this problem with 6.2 with overlapping Qs, and it was resolved by explicitly specifying the threshold for the Qs, setting np_load_avg to just over 1.

$ qconf -sq long | grep load_thresholds
load_thresholds       np_load_avg=1.1

We often get overlapping-Q execution hosts registering their displeasure by entering an overload state, but only by a few percentage points (1 compute process per core plus a few % due to system processes).

Yes, this avoids oversubscription, but it may leave slots unused, as processes in state "D" also count as running and can create an artificially higher load. The usual approach to limit slots across several queues is one of these:

Post by hjmangalam
Almost all our Qs are overlapping due to competing requirements / hardware, and this seems to address that part of it fine (though I'd much prefer to keep them separate for simplicity's sake).

hjm


http://gridengine.sunsource.net/ds/viewMessage.do?dsMessageId=253527&dsForumId=38

We have adopted this solution (set up an RQS to limit slots per node)and it seems to work so far.

Our queues do not overlap, but the overload was (at least partly) caused by dead jobs of which SGE had apparently no knowledge.
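In case it helps others hitting the same problem: our rule follows the shape from the referenced post (the rule name here is just ours), added with "qconf -arqs" and checked with "qconf -srqs":

{
   name         slots_per_node
   enabled      TRUE
   limit        hosts {*} to slots=8
}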