Linux Cluster Disk Guide

This is a guide to issues with disk storage on the Linux cluster. To find out how disks on one system can be accessed from another, see the automount page. To understand the different partition names (e.g., why there are /share and /scratch directories), see the disk sharing page.

How much disk space do I have?

To find out how much disk space you have available, use the df command. You'll probably always want to use the -h option, so the sizes appear in human-readable form:

df -h

You'll almost certainly see disks in the list that are mounted via automount. If you find the automounted disks to be distracting, add -l to the command:

df -hl

Bear in mind that you don't want to use the -l option if your home directory is not on the machine to which you've logged in. (As of Jan-2017, this mainly applies to ATLAS users logged onto xenia.)

Here's the result of executing df -hl on the machine tanya on 28-Jan-2017:

If we ignore partitions that relate to the operating system, we're left with two key user-accessible filesystems: /home and /data. (Many systems have other key partitions, such as /share and /scratch.)
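If you only care about the user-accessible filesystems, you can hand df the directories directly instead of reading the whole list. Here's a minimal sketch; the function name show_user_space is my own invention, and the directory list assumes the /home and /data partitions described above.

```shell
#!/bin/sh
# Report free space only for the filesystems that hold user files.
# show_user_space is a hypothetical helper name, not a cluster command.
show_user_space() {
    for dir in "$@"; do
        # Skip anything that does not exist on this machine.
        if [ -d "$dir" ]; then
            df -h "$dir"
        else
            echo "$dir: not present on this machine" >&2
        fi
    done
}

# /home and /data are the partitions named in this guide; add /share
# or /scratch if your machine has them.
show_user_space /home /data
```

When df is given a directory, it reports on the filesystem that contains it, so this works even if you name a subdirectory rather than the mount point.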

It's not much

Your first reaction may be: "There's not much disk space for my home directory, and I have to share that space with other people in the collaboration. Why, my watch has more storage than that!"

You're right. The /home partition is intended for "source" files (program code, scientific papers, plots, etc.); /data or /scratch should be used for large and re-creatable files (compiled binaries, data summaries, temporary work files, etc.). We have to ask you to use judgement and discipline, and to be aware that you're sharing space with your fellow scientists.
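To see what's taking up room in your home directory (and what might belong on /data instead), something like the following can help. It's a sketch, not a cluster tool: largest_items is a name I made up, and it assumes a du that accepts the -d option, as the GNU version on Linux does.

```shell
#!/bin/sh
# List the largest items directly under a directory, smallest first.
# Sizes are in kilobytes (du -k) so that a plain numeric sort works.
# largest_items is a hypothetical helper name.
largest_items() {
    target="${1:-$HOME}"
    # -a includes plain files as well as directories;
    # -d 1 stops du from listing anything deeper than one level.
    # The directory's own total normally appears on the last line.
    du -a -k -d 1 "$target" 2>/dev/null | sort -n | tail -n 10
}

# Example: largest_items "$HOME"
```

Anything large near the bottom of the list that you could regenerate is a candidate for moving to /data or /scratch.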

If you're just skimming this page, stop and read this

The reason why /home is small and /data is big is that the /home partition is backed up; /data is not. In fact, it goes one step further: the /data partition is always considered expendable for any type of system maintenance activity. If a system is being repaired, upgraded, or restored, the /data partition may be erased.

There's more about this in the section on backups below.

What do I do if I need more disk space?

First, look to /data partitions on other systems in your working group. The /data partitions on all the systems that belong to a group are intended to be a shared resource; if you don't have enough space on /nevis/yourmachine/data, cd to /nevis/othermachine/data on another machine in your group and run df -h . to see how much free space it has.
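If your group has several machines, you can check them all in one pass. This is a rough sketch that assumes the /nevis/machinename/data layout described above; check_group_data is a made-up name, and the machine names you pass in are whatever systems belong to your group.

```shell
#!/bin/sh
# Print one line of free-space information for the /data partition of
# each machine named on the command line.
# check_group_data is a hypothetical helper name.
check_group_data() {
    for machine in "$@"; do
        dir="/nevis/$machine/data"
        # Accessing the path should prompt automount to mount it if needed.
        if ls "$dir" >/dev/null 2>&1; then
            # -P keeps each filesystem on a single line, so tail -n 1
            # returns the data line rather than the header.
            df -hP "$dir" | tail -n 1
        else
            echo "$machine: $dir not reachable" >&2
        fi
    done
}

# Example (placeholder machine names):
# check_group_data machine1 machine2
```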

I strongly advise you to exercise common courtesy as you're scrounging for disk space. If I found someone had used a big chunk of my server's /data partition without asking, I might be annoyed.

If there still isn't enough disk space across all your group's machines to satisfy your needs, you may have to request that more disks be added to the existing systems (or buy a new box).

Backups

The Nevis Linux cluster is backed up nightly onto shelley, the Nevis backup server. This includes the systems at the Nevis Annex.

For speed, we don't copy every file from every system; we use a program called rsync to copy over only those files that have changed since the day before.

We don't back up every file on every system on the cluster. The policy is: the /home and /share partitions are backed up; /data is not. There is a web page that contains the list of which partitions are backed up.

We maintain previous versions of old files on shelley. (Actually we do an incremental tar of the disk images after the rsync procedure has run for all the machines in the cluster.) This means we can recover old versions of files if necessary. However, there's a time limit: we only keep old file versions for 30 days. We cannot recover files that were deleted or overwritten more than 30 days ago.

The reason /home partitions are kept small is backup. There are roughly 35-40 systems on the Linux cluster, and we back them all up every night. At present the backup job takes 8-10 hours to run, and we back up about 2TB of files. Even if we had more disk space, as a practical matter we can't have a daily backup that takes more than 24 hours to run.

We therefore have to ask users to segregate their files into key files that will be backed up, and re-creatable files that won't. The relative sizes of /home versus /data partitions help enforce this segregation.

Why don't you back up /data partitions?

We have vastly more disk storage on the Nevis cluster than we can hope to back up on any system that we can afford. As of Jan-2017 we have over 340TB of storage assigned to /data partitions on different systems.

I've got files in a /data partition that would be a pain to re-create. How can I back them up myself?

The simplest thing to do is to make copies on other /data partitions in your workgroup's cluster. After all, that's all a backup is: a second copy of your files.
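As a sketch of what "make a second copy" might look like in practice, the function below copies a directory and then checks that the copy matches. The name copy_and_verify and the example paths are placeholders of mine; substitute your own directories.

```shell
#!/bin/sh
# Copy the contents of one directory into another, then confirm that
# the two trees match. copy_and_verify is a hypothetical helper name.
copy_and_verify() {
    src="$1"
    dest="$2"
    mkdir -p "$dest" || return 1
    # -a preserves permissions, timestamps, and symlinks; "src/." copies
    # the directory's contents, hidden files included.
    cp -a "$src"/. "$dest"/ || return 1
    # diff -r exits non-zero if the two trees differ in any way.
    diff -r "$src" "$dest" >/dev/null
}

# Example (both paths are placeholders):
# copy_and_verify /data/myfiles /nevis/othermachine/data/myfiles
```

The function returns success only if the copy completed and the comparison found no differences, so you can test its exit status before relying on the second copy.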

Why only a few weeks' worth of backups and versions? Why not a year?

We're doing what we can with the resources we have available. We don't have the disk space on our backup server for a year's worth of backups.

I've got critical files that I want backed up even more often. What can I do?

You can supplement our backups with copies of your own. For example, I have my own private backup procedure for my critical source files. The procedure makes use of the rsync command; you can see it in ~seligman/bin/rsync.sh on the Linux cluster.

I run this script automatically a few times per day using cron. Here's a sample line from my crontab file:

10 */6 * * * /nevis/tanya/home/seligman/bin/rsync.sh

This translates to: Every six hours, at ten minutes past the hour, run my script.

As you look over my files, notice that I copy a subset of my home directory onto a data disk on another machine. Abusing this facility by backing up gigabytes of files in your home directory onto someone else's system may get you yelled at.
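If you'd rather start from scratch than read my files, here is a minimal sketch of the same idea. To be clear, this is not the contents of my rsync.sh; backup_subset and the example paths are hypothetical, and the only rsync options used are -a (archive mode) and -q (quiet).

```shell
#!/bin/sh
# Copy selected subdirectories of a source tree into a backup tree.
# backup_subset is a hypothetical helper, not the rsync.sh mentioned above.
# Usage: backup_subset SOURCE_ROOT DEST_ROOT SUBDIR [SUBDIR...]
backup_subset() {
    src_root="$1"
    dest_root="$2"
    shift 2
    status=0
    for sub in "$@"; do
        # Create the destination first so nested subdirectories work.
        mkdir -p "$dest_root/$sub" || { status=1; continue; }
        # -a preserves permissions and timestamps; -q suppresses the
        # file list, so the script is silent unless something goes wrong.
        # The trailing slashes copy the directory's contents.
        rsync -aq "$src_root/$sub/" "$dest_root/$sub/" || status=1
    done
    return "$status"
}

# Example (placeholder paths; back up only a small subset):
# backup_subset "$HOME" /nevis/othermachine/data/mybackup thesis analysis/src
```

Keeping the script quiet matters when it runs from cron, since cron typically mails you any output a job produces. Naming the subdirectories explicitly also keeps the copy small, which is the courteous way to use someone else's data disk.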

Long-term data storage

For the purposes of this section, "long-term" means more than six months or so.

By the above definition, there is no long-term data storage at Nevis. As noted above:

we back up /home directories, but keep old back-ups for no more than a few weeks;

/data directories are not backed up at all;

RAID arrays can and do fail. (This section is being written on 25-Apr-06; on that day, we lost the contents of a RAID5 array.)

If you need long-term storage for any of your files, I suggest you consider the facilities at BNL, FNAL, or CERN.