In just about any computing activity, it's important to ensure that your programs
are using memory efficiently. This is especially crucial in high performance
computing, where your problem may be so large that it won't fit on a single
machine, or even a few machines. In this page, we'll have a look at how to
monitor memory usage on the cluster.

Checking the top command

The easiest way to check the memory usage of a running process is to use the
interactive "top" command. At the command line, try running

[araim1@tara-fe1 ~]$ top

You'll probably get a long list of processes, most of which you aren't
interested in. You'll also see some interesting numbers like free memory,
swap space used, and percent CPU currently utilized. Each process has several
memory statistics shown. The most conservative one is VIRT, the total virtual
memory of the process: code, data, and shared libraries, including pages that
have been swapped out or mapped but never touched. The one that probably
reflects our actual usage the most is RES, the resident portion that currently
occupies physical memory. These two values together give us a good idea of our
usage. The top display automatically updates itself every few seconds. For more
information, see the top manual page ("man top").

The issue with top is that it's interactive. When we're running our high
performance parallel code, we may want to log the memory usage at some
very specific times, for example right after we finish allocating a large
data structure. We'd prefer not to have to watch the top command and track
things manually.

Checking the proc filesystem

Let's take one step in the direction of automating memory checking. To do
this, we'll use the proc filesystem. This is a special filesystem on Linux
(and some other Unix-like) machines which contains information about the
system and its running processes. We'll try a few commands to get a feel
for it.

For example, "cat /proc/cpuinfo" prints information about the CPU cores on
the node where it is run, such as the front end node. This is fairly static
information which we do not expect to change much.

We can also check the memory usage of a specific process. Try
"cat /proc/<PID>/status" to get information about a process with a given
PID (process ID). We can check "cat /proc/self/status" to get information
about the current process.

Notice that we're getting information about the "cat" command, which is the
"self" when we run "cat /proc/self/status" directly from the command line.
Earlier when we ran the top command, we looked at the VIRT and RES columns. The
same information is available in the VmSize and VmRSS fields of the status
output, respectively, both reported in kB.
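
To log these numbers from inside a program rather than from the command line,
one option is to read /proc/self/status ourselves. The following is only a
minimal sketch, assuming a Linux system; the function names parse_status_kb
and get_memory_usage_kb are hypothetical and chosen for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: scan a stream laid out like /proc/<PID>/status and
 * pick out the VmSize and VmRSS values, which the kernel reports in kB.
 * Returns 0 if both fields were found, -1 otherwise. */
int parse_status_kb(FILE *fp, long *vmsize_kb, long *vmrss_kb)
{
    char line[256];

    *vmsize_kb = -1;
    *vmrss_kb = -1;

    while (fgets(line, sizeof line, fp) != NULL) {
        /* Each field is a label, whitespace, a number, then " kB". */
        if (strncmp(line, "VmSize:", 7) == 0)
            sscanf(line + 7, "%ld", vmsize_kb);
        else if (strncmp(line, "VmRSS:", 6) == 0)
            sscanf(line + 6, "%ld", vmrss_kb);
    }

    return (*vmsize_kb >= 0 && *vmrss_kb >= 0) ? 0 : -1;
}

/* Hypothetical helper: report the memory usage of the calling process.
 * Assumes /proc is available (Linux). Returns 0 on success, -1 on failure. */
int get_memory_usage_kb(long *vmsize_kb, long *vmrss_kb)
{
    FILE *fp = fopen("/proc/self/status", "r");
    int rc;

    if (fp == NULL) {
        *vmsize_kb = -1;
        *vmrss_kb = -1;
        return -1;
    }

    rc = parse_status_kb(fp, vmsize_kb, vmrss_kb);
    fclose(fp);
    return rc;
}
```

Because the function opens /proc/self/status from within our own program,
"self" now refers to our process rather than to "cat". We could call
get_memory_usage_kb right after allocating a large data structure and print
the two values to a log.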

Checking memory from a parallel C program

We've seen how to check memory usage for a single process, but what about
an MPI job with multiple processes? Let's suppose we want to see the usage
for each process, as well as the total (sum) across all processes. We'll
use the serial function from the previous section, and gather the results
into an array on a single process (with ID "root"). We'll also make a
simple helper function that sums over this array (with the result being
stored in process 0).

Notice that the global memory usage reported in the output is much higher
than the sum of the per-process usages reported immediately before it. The
difference comes from the MPI library itself, which allocates some memory as
we use it. We could verify this in our test program: if we call both memory
functions a second time, the numbers should remain steady.