Note to readers: 34 tara nodes were in production at
the time of this writing - Dec 29, 2009

Tools for monitoring your jobs

Introduction

There are several tools available to help you monitor jobs on the cluster.
We will discuss some of them here.

squeue / sinfo

The most basic way to check the status of the batch system is with the programs
squeue and sinfo. These are not graphical programs, but we mention them
here for comparison.
We can check which jobs are active with squeue.
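For example, running it with no arguments lists all jobs currently in the queue:

[araim1@tara-fe1 ~]$ squeue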

Notice that the first job is in state PD (pending), and is waiting for
32 nodes to become available. The second job is in state R (running),
and is executing on node n7. We can also see what's going on with the
batch system from the perspective of the queues, using sinfo.
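The basic command is simply:

[araim1@tara-fe1 ~]$ sinfo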

We can see that the two nodes (n1, n2) in the develop queue are idle. The
other queues share nodes n3 - n84, and currently n3 is in use for a running
job. By combining this with the Linux watch command, we can make a simple
display that refreshes periodically. Try something like the following.
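For instance, this refreshes the squeue listing every second (the one second interval is only an example; adjust it as you like):

[araim1@tara-fe1 ~]$ watch -n 1 squeue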

You can also customize the output of the squeue and sinfo commands.
Many fields are available that aren't shown in the default output format.
For example, we can add a SHARED field, which tells whether a job allows
its nodes to be shared, and a TIME_LEFT field, which says how much time
is left before the job's walltime limit is reached.
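One way to do this is to pass a format string to squeue; the set of standard fields shown here is only illustrative:

[araim1@tara-fe1 ~]$ squeue -o "%.7i %.9P %.8j %.8u %.2t %.10M %.6D %.8h %.12L"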

We've specified "%.8h %.12L", in addition to some other standard fields,
to obtain this output. For all available fields and other output options,
see the squeue and sinfo man pages.

scontrol

SLURM maintains more information about the system than is available through
squeue and sinfo. The scontrol command allows you to see this. First, let's
see how to get very detailed information about all jobs currently in the
batch system (this includes running, recently completed, pending, etc.).
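For example, the show job command with no job ID reports on every job in the system:

[araim1@tara-fe1 ~]$ scontrol show job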

From this output, we can see for example that the job was submitted at
2010-02-13T18:31:55, has 11 tasks (NumCPUs) running on nodes n33 and n34,
and its working directory is /home/araim1/parallel-test.
One thing that's missing is how many processes are running on each node.
Fortunately, we can get this by specifying the "--detail" option.
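For example (on some versions of SLURM the option is spelled "--details" or "-d"):

[araim1@tara-fe1 ~]$ scontrol --detail show job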

See the man page for scontrol ("man scontrol") for more details about the
command, especially for help interpreting the many fields that are reported.
Also note that some features of the scontrol command, such as modifying job
information, are only available to system administrators.

smap

smap is similar to the previous commands, but a bit more interactive. It
provides an ncurses graphical interface to the information. Try the command

[araim1@tara-fe1 ~]$ smap

to get a display of running jobs like the following

At the top, notice the symbols A, B, C, and "dot", which illustrate how
jobs have been allocated on the cluster. There are 84 slots, corresponding
to the 84 nodes currently deployed. The symbols A, B, and C correspond to the
job descriptions below. A dot means that no job is running on that node. We
can also see the queue perspective

[araim1@tara-fe1 ~]$ smap -Ds

This view is slightly misleading. There are two nodes devoted to the develop
queue, but the remaining 82 do not belong exclusively to the performance
queue. As we noted earlier, those 82 nodes are shared among the non-develop
queues. This view also does not display running jobs.

If you would like the display to refresh periodically (say, every second),
launch smap with the following

[araim1@tara-fe1 ~]$ smap -i 1

sview

sview is an X Windows application, so you'll need to set up your terminal to
display graphics. See Running X Windows programs remotely for more information.
Once your terminal is configured, you can start sview

[araim1@tara-fe1 ~]$ sview

By default, you'll get the familiar jobs view

And you may also see the status of the queues

The information shown is similar to that in smap, except jobs are identified by
color codes rather than ID symbols. In addition, we can also see queue
usage in this display. In the example above, however, all nodes are idle.
The display automatically refreshes periodically.

Ganglia

Ganglia is a higher-level monitoring tool that lets you see usage of the
cluster. You can get an idea of the current usage, for example which nodes
are currently down or how much memory is in use. You can also see historical
information, like a graph of CPU load over the last month.

You can access the Ganglia webpage for tara
here. However, note that it is
currently only available from within the campus network.