(The UNICOS/mk command, "grmview -l", will show the current configuration.)

Under UNICOS/mk 2.0.3 (the current OS on yukon) and prior versions, all 1-PE jobs run on CMD PEs, not on APP PEs.

If you request 1 PE from NQS, however, NQS treats the job like a parallel application (even though the OS runs the job on a CMD PE). Thus, NQS believes that one less APP PE is available, which may lead to another user's job being unnecessarily blocked.

At ARSC, use the "single" queue, not the "small" queue, to run 1-PE jobs. "single" is a special queue created to work around the 1-PE problem.

Using "single" prevents NQS from mistakenly crediting an APP PE to your job. It enables you to run two or more 1-PE jobs simultaneously, and it keeps your jobs out of contention with jobs in the "small" queue.

From "news queues" on yukon, here's how to use "single":

> To route a job to the "single" queue, do not specify a PE or
> runtime limit. (If either limit is specified, the job will be
> routed to another queue.) For example, given this specification:
>
> #QSUB -q mpp
>
> NQS would route the request to the "single" queue.

Why all the confusion? NQS, the OS, and the applications themselves play different roles, and don't always cooperate.

NQS keeps its own tally of how many PEs it has assigned and how many are available. For instance, on a system with the NQS "global PE limit" set to 128, NQS might "release" a 128-PE request. It would then calculate that 0 PEs remain available. Once "released," the request is handled by the OS, not by NQS.

In one scenario, the job (qsub script) might sit and compile Fortran code for 30 minutes, thus idling 128 PEs, but NQS wouldn't know about it, and wouldn't be able to assign other waiting jobs to those PEs. NQS must assume that all requested PEs are in use.

On the other hand, if an interactive user launched a 2-PE job while the 128-PE NQS request was compiling, the OS would indeed notice the available processors and start the 2-PE job. Unfortunately, the 128-PE request would then be blocked by the 2-PE job when it finished compiling, and NQS wouldn't be able to run anything else because, to its knowledge, 128-PEs would be in use.

In another scenario, a 1-PE request might appear in the "small" queue. When NQS releases a 1-PE "small" request, it subtracts 1 from its pool of available APP PEs, even though the job runs on a CMD PE, and thus under-counts the number of available APP PEs. This can again lead to jobs being blocked.

This problem with 1-PE jobs occurs from time to time on yukon, which has a global MPP PE limit of 256. Here's an example. The "qstat -m" command shows NQS's count of the APP PEs in use, listing them by queue and for the entire machine. At the time, three actual parallel applications were running, with sizes 18, 50, and 60, for a total of 128 PEs.

Examine the row:

small 4/1 30/1 10 28800 28800

which shows how NQS mistakenly counted 1 PE used in the "small" queue, and thus obtained a total of 129 PEs in use. At this point, NQS refused to launch a waiting 128-PE request that could have run. From NQS's point of view, this would have consumed a total of 257 PEs, exceeding the global limit. This is apparent in the row for the overall totals:

yukon 100/5 256/129

(The situation is resolved fairly easily by a sysadmin, but only when someone is on duty to notice it.)

The column:

RUN
CNT

shows the count of jobs running, by queue. Note that there was one job in the single queue. Following the row:

single 10/1 0/0 1 -- --

to the

QUEUE-PE'S/CNT

column, note that this request did not count against NQS's PE total. This is the correct behavior for the "single" queue.

Run 1-PE jobs in "single"!

Debugging Debugging With FLUSH?

The print statement is an ever-popular debugging tool.

This week, a user was trying to diagnose a Fortran code. It launched okay, but immediately hung. He inserted this,

write(*,*)'BEGIN'

as the first executable statement in the program, recompiled, and ran. Again, it hung; it never even printed "BEGIN".

The solution was to debug the debugging statement by changing it to this:

write(*,*)'BEGIN'
CALL FLUSH (101)

The write statement had executed successfully the first time, but the write buffer had not filled up before the program reached the problem that caused the "hang," and thus, "BEGIN" had never been "flushed" to the user's console. "FLUSH" forces the contents of a write buffer out to the specified unit number (101 is used for standard output), even if the buffer is not yet full.

Quick-Tip Q & A

A:{{ You're not sure if you compiled with Apprentice, PAT, or VAMPIR
enabled in your current executable. How can you find out? }}
Two answers. The first didn't work for VAMPIR, and in either case
you may have to scrutinize the output for hints:

  what a.out    | egrep -i "apprentice|vampir|pat"

  strings a.out | egrep -i "apprentice|vampir|pat"
Three examples:

  yukon$ what a.out.1 | egrep -i "apprentice|vampir|pat"
  apprentice/Lib/apprif.c 30.0 11/20/97 14:50:55
  apprentice/Lib/cal.s 20.3 05/22/97 12:27:01
  apprentice/Lib/comm.c 30.0 11/20/97 14:50:55

  yukon$ strings a.out.1 | egrep -i "apprentice|vampir|pat" | head -5
  @(#)apprentice/Lib/apprif.c
  WARNING: The Apprentice Runtime Information File (RIF) is being written
  WARNING FROM APPRENTICE INSTRUMENTATION
  barrier that not all PE's entered. The Apprentice
  PROGRAM ERROR DETECTED BY APPRENTICE INSTRUMENTATION

  yukon$ strings a.out.2 | egrep -i "apprentice|vampir|pat" | head -5
  VAMPIRtrace
  VAMPIRtrace
  VAMPIRtrace
  VAMPIRtrace
  VAMPIRtrace
Q: "I can't login! I keep on trying... The Kerberos (so-called) server
accepts my 'kerberos password,' asks for my 'card-code,' which
I enter, but then it says:
Enter Next Token:
I enter my SecurID PIN into my SecurID card (AGAIN), type the 'next
token' which appears on the card, but it doesn't work!"
(What should this person do?)

The University of Alaska Fairbanks is an affirmative action/equal
opportunity employer and educational institution and is a part of the University
of Alaska system.
Arctic Region Supercomputing Center (ARSC) |PO Box 756020, Fairbanks, AK 99775 | voice: 907-450-8602 | fax: 907-450-8601 | Supporting high performance computational research in science and engineering with emphasis on high latitudes and the arctic.
For questions or comments regarding this website, contact info@arsc.edu