First, Maui has pretty much solved our problems with submitting 100k+
jobs; that volume was simply overloading the stock PBS scheduler. But now
I'm running into what may be a configuration problem.
I have one user who submits 100k+ jobs. This works fine for the first
couple thousand jobs, which line up in the queue and begin to execute.
However, after a certain time, only 15 jobs are visible in the queue, even
though thousands more are still waiting. I can see this when I run qstat
several times in a row: each time, the queue has been repopulated with 15
more jobs. The jobs only run for about 45 seconds, so I'm thinking Maui
isn't picking up on how quickly they finish?
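To put numbers on the behaviour above, here is a minimal Python sketch that
samples qstat a few times and counts jobs per state. It assumes the default
qstat listing (a dashed separator line under the header, with the job state
in the second-to-last column); the function names, sample count and
interval are made up for illustration, so adjust them to your site.

```python
import subprocess
import time
from collections import Counter


def count_job_states(qstat_text):
    """Count jobs per state letter (e.g. R, Q) in default-format qstat output.

    Assumption: job rows follow a dashed separator line and the state is the
    second-to-last whitespace-separated column.
    """
    counts = Counter()
    in_body = False
    for line in qstat_text.splitlines():
        if line.startswith("-"):
            # The dashed line separates the header from the job rows.
            in_body = True
            continue
        fields = line.split()
        if in_body and len(fields) >= 6:
            counts[fields[-2]] += 1
    return counts


def poll_queue(samples=5, interval=10):
    """Run qstat `samples` times, `interval` seconds apart, printing counts."""
    for _ in range(samples):
        out = subprocess.run(
            ["qstat"], capture_output=True, text=True, check=True
        ).stdout
        print(time.strftime("%H:%M:%S"), dict(count_job_states(out)))
        time.sleep(interval)
```

Calling poll_queue() on the submit host should show whether the visible
count really stays pinned at 15 between samples.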
My second question is probably more important. We run two PBS environments
on all of our clusters. One is our 'default' high priority queue and the
second is the 'default' low priority queue. The low priority queue is for
jobs that run at nice 19 or 20. So we load up the low priority queue with
niced jobs and don't care how long they take to finish. This leaves the
high priority queue to process our own grid and MPI jobs.
This has worked fine for a while, but now I have a user who wants to run a
few hundred thousand jobs in our low priority queue (see paragraphs 1 and
2). The stock pbs_sched was simply getting overloaded and would crash.
That is when I set up a test cluster using PBS/Maui, and we haven't had a
problem since (other than the 15-job queue limit I mentioned above).
Yesterday, I set up a second Maui to schedule the low priority queue. I
could submit jobs and check job status, but the jobs would never run. When
I checked the Maui logs I found checksum errors, and that led me to the
problem. The $PATH environment variable picks up the normally installed
Maui, so its binaries are the ones used to perform client functions. It
turns out the checksum error occurs when the normally installed Maui tries
to process and query the second, low priority queue. I can get around this
by using the second installed Maui's binaries to query the low priority
queue. If there is a way to disable this check at compilation time, I
think my problem will go away.
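The workaround above (using the second install's binaries for the low
priority queue) can be sketched in Python as a small wrapper that puts one
install's bin directory first on PATH before resolving a client command.
This is only an illustration: the directory /opt/maui-low/bin and the
helper names are hypothetical, and it does nothing about the checksum check
itself.

```python
import os
import shutil
import subprocess


def env_for_maui(maui_bin_dir, base_env=None):
    """Return a copy of the environment with maui_bin_dir first on PATH."""
    env = dict(os.environ if base_env is None else base_env)
    old_path = env.get("PATH", "")
    env["PATH"] = maui_bin_dir + (os.pathsep + old_path if old_path else "")
    return env


def resolve_client(command, env):
    """Return the full path `command` resolves to under env's PATH, or None."""
    return shutil.which(command, path=env["PATH"])


def run_client(command, args, maui_bin_dir):
    """Run a Maui client command from one specific install."""
    env = env_for_maui(maui_bin_dir)
    binary = resolve_client(command, env)
    if binary is None:
        raise FileNotFoundError(f"{command} not found under {maui_bin_dir}")
    # Pass the resolved path so the choice of install is explicit.
    return subprocess.run([binary, *args], env=env, capture_output=True, text=True)
```

For example, run_client("showq", [], "/opt/maui-low/bin") would query the
low priority scheduler with its own showq, regardless of which Maui comes
first in the login shell's PATH.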
I look forward to any comments or questions!
--
Jeremy Mann
jeremy at biochem.uthscsa.edu
University of Texas Health Science Center
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: (210) 567-2672