We've set up Olesen to help users run jobs on the cluster that require FLEXlm licenses, and would also like to be able to set up a resource quota so that when users launch jobs they're not able to lock up all of the licenses:

{
   name         moe_limit
   description  limit everyone to no more than 20 moe license
   enabled      TRUE
   limit        users {*} to moe=20
}
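For reference, a resource quota set like this is normally managed with qconf (a sketch; only the rule-set name moe_limit comes from the config above):

```shell
# list all resource quota sets defined on the cluster
qconf -srqsl

# show the moe_limit rule set as configured above
qconf -srqs moe_limit

# open the rule set in $EDITOR to change or disable it
qconf -mrqs moe_limit
```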

For some reason, though, we're running into problems with some users: jobs that use PEs and also request certain resources with the "-l" switch get stuck in a qw state, and the message references the resource quota:

scheduling info: queue instance "***@compute-1-25.local" dropped because it is disabled
                 queue instance "***@compute-0-11.local" dropped because it is disabled
                 queue instance "***@compute-1-26.local" dropped because it is full
                 cannot run in queue "himem.q" because it is not contained in its hard queue list (-q)
                 cannot run because it exceeds limit "steevmi1/////" in rule "moe_limit/1"
                 cannot run in PE "orte" because it only offers 0 slots
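For context, scheduling messages like the ones above can be pulled up per job, and the current consumption against each quota rule can be checked with qquota (a sketch; the job id is a placeholder):

```shell
# show the scheduler's per-job diagnostics, including the
# "scheduling info:" text quoted above
qstat -j <jobid>

# show how much of each resource quota rule (e.g. moe_limit)
# is currently consumed, for all users
qquota -u '*'
```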

Post by mdsteeves
We're running SGE 6.2u4 on RHEL5.4.

#!/bin/bash
#$ -S /bin/ksh
#$ -j y
#$ -cwd
#$ -q mpi.q
#$ -pe orte 8
#$ -N mdsTest
## #$ -l h_cpu=1
## #$ -l mem_total=5G
## #$ -l arch=lx26-amd64
## #$ -l moe=1
## Any of the following do not work, and cause the job to hang in the qw state:
## #$ -l q=mpi.q
## #$ -l hostname="compute-0-2"
## #$ -l hostname="compute-0-78|compute-0-106|compute-0-69|compute-0-68|compute-0-100|compute-0-63|compute-0-93|compute-0-82|compute-0-76"

I don't see any resource reservation in the above lines: #$ -R

And to have an effect it's necessary to set "max_reservation 20" (or an appropriate value) in the scheduler configuration. Then slots should be reserved for this job, so that it won't starve.
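A sketch of what the suggestion amounts to (the value 20 is from the post above; the script name is a placeholder):

```shell
# enable resource reservation for a single submission
qsub -R y -pe orte 8 job.sh

# or put it into the job script itself:
#   #$ -R y

# and raise the scheduler's reservation limit (opens $EDITOR):
qconf -msconf
#   set:  max_reservation    20
```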

Does this fix the issue?

-- Reuti

Post by mdsteeves
hostname
sleep 300

Even switching from "-q mpi.q" to "-masterq mpi.q" doesn't help any. If we disable the resource quota rule, then the jobs run without any problems. Is there something that we're missing?
-Mike
--------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=305177

Post by reuti
I don't see any resource reservation in the above lines: #$ -R
And to have an effect it's necessary to set "max_reservation 20" (or an appropriate value) in the scheduler configuration. Then slots should be reserved for this job, so that it won't starve.
Does this fix the issue?

Resource reservation for the resource quota piece? We don't use that at the moment -- the moe_limit that's currently in place limits each user to only be able to have 20 jobs running, which is the behavior that we want. The problem we're having is that other jobs, that don't need or use these licenses, get stuck in a "qw" state, and reference the moe_limit resource quota. If we go in and disable the resource quota, then the job gets dispatched to a node and runs without problem.

If we don't use either "-l qname=...." or "-l hostname=...." when we submit the job, then it launches without problem.

If we don't specify a parallel environment, but leave the -l requests in the job submission, then it launches without a problem.

While I haven't tested each and every resource that could be requested when a job is submitted, the jobs only seem to stick in a qw state if we try to request either a queue or a host.

Post by mdsteeves
Resource reservation for the resource quota piece? We don't use that at the moment -- the moe_limit that's currently in place limits each user to only be able to have 20 jobs running, which is the behavior that we want. The problem we're having is that other jobs, that don't need or use these licenses, get stuck in a "qw" state, and reference the moe_limit resource quota. If we go in and disable the resource quota, then the job gets dispatched to a node and runs without problem.

AFAICS you are limiting the number of potential queue instances with all the examples you mentioned as not working:

## #$ -l q=mpi.q
## #$ -l hostname="compute-0-2"
## #$ -l hostname...

Hence SGE has fewer options to schedule the job. Or does it also happen in an empty cluster?

Nevertheless, one bug to mention: you can't use -q in combination with -l h=. The workaround is to request the hostnames in the -q request:

-q ***@compute-0-2
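A sketch of that workaround (queue name mpi.q and the script name are assumptions; the hostnames are from the thread):

```shell
# instead of combining -q with -l hostname=..., attach the host
# to the queue request itself:
qsub -pe orte 8 -q 'mpi.q@compute-0-2' job.sh

# several hosts can be given as a comma-separated queue list:
qsub -pe orte 8 -q 'mpi.q@compute-0-2,mpi.q@compute-0-11' job.sh
```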

-- Reuti

Post by mdsteeves
If we don't use either "-l qname=...." or "-l hostname=...." when we submit the job, then it launches without problem.
If we don't specify a parallel environment, but leave the -l requests in the job submission, then it launches without a problem.
While I haven't tested each and every resource that could be requested when a job is submitted, the jobs only seem to stick in a qw state if we try to request either a queue or a host.
-Mike
--------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=305578

Post by reuti
## #$ -l q=mpi.q
## #$ -l hostname="compute-0-2"
## #$ -l hostname...
Hence SGE has fewer options to schedule the job. Or does it also happen in an empty cluster?

We're working with the user to see what they're trying to accomplish with the resource requests, but we're also trying to figure out why the moe_limit is causing these jobs to sit in qw when enabled.

Post by mdsteeves
We're running SGE 6.2u4 on RHEL5.4.
{
   name         moe_limit
   description  limit everyone to no more than 20 moe license
   enabled      TRUE
   limit        users {*} to moe=20
}
For some reason, though, we're running into problems with some users that submit jobs that use PEs, and also request certain resources with the "-l" switch.

I have just seen this behaviour with 6.2u2_1 when requesting a $fill_up PE and having a -masterq specification. Disabling the resource limit caused the job to be scheduled on the next scheduler run.

It did not happen in tests with 6.2u5 and u6, so it seems to be a bug fixed somewhere in between. I do not have 6.2u4 to compare.

Post by mdsteeves
cannot run because it exceeds limit "steevmi1/////" in rule "moe_limit/1"
cannot run in PE "orte" because it only offers 0 slots