Working with large datasets and long-running programs on ECCO and SDSx

ECCO and SDSx are clusters, shared by a large number of users, with finite resources. To keep the clusters usable for everyone, users are expected to follow some rules. Some are soft but monitored; others are hard constraints that will affect your own work if not taken into account.

File sizes

Some simple rules:

$HOME (your home directory, your desktop) is small, and should be used only for reduced derivative files and final result files

Quotas might affect your ability to store files in $HOME

$HOME is shared: it is available on the head node and on all compute nodes

$SCRATCH (/scratch or /temporary - they both point to the same filesystem) is designed for large files.

Not all $SCRATCH filesystems are created equal, though - the easiest way to find out is to open an interactive qsub job and run 'df -h /temporary/', which reports the size in human-readable units (typically terabytes)
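A minimal sketch of that check (the queue name is a placeholder; use whatever interactive queue your site provides):

```shell
# Request an interactive job (-I is the PBS/Torque flag for an
# interactive session; the queue name "interactive" is a placeholder).
qsub -I -q interactive

# Once the shell opens on the compute node, check the size of that
# node's local scratch filesystem in human-readable units:
df -h /temporary/
# The "Size" column shows the capacity of that node's $SCRATCH.
```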

$SCRATCH (or /temporary) is ... well, temporary. We clean it out: any file that hasn't been used in more than two weeks is removed. There is no backup; "cleaning out" means deleting, forever.

Most shared data (public-use datasets on ECCO, synthetic datasets on SDSx) are in a common location, for your use.

The above rules imply the following structure on any of your programs:

Read the shared data from the shared location. Do NOT copy them to your desktop.

Define a way to reference the $SCRATCH location for your temporary files. This can be used to share storage in between two Stata or SAS or R jobs, or simply within your program. Store files here which, even if it takes a long time, can be reproduced using your programs (which in principle means ALL files you create), but which are not needed in the long-run.

Actively clean up at the end of a job: delete data files you no longer need

Write to $HOME only analysis results, or greatly reduced files that are of long-term use.
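The structure above can be sketched as a single qsub script. All paths, the dataset name, and the Stata invocation here are hypothetical; substitute your own:

```shell
#!/bin/bash
#PBS -l walltime=02:00:00

# Hypothetical per-user scratch location on the node's local filesystem.
SCRATCHDIR=/temporary/$USER/myproject
mkdir -p "$SCRATCHDIR"

# 1. Read shared data directly from the shared location -- do not
#    copy it to $HOME. (This shared path is a made-up example.)
SHAREDDATA=/shared/public-use/somedataset

# 2. Run the job: intermediate files go to $SCRATCHDIR; only small,
#    long-term results are written back to $HOME.
stata-mp -b do myanalysis.do "$SHAREDDATA" "$SCRATCHDIR" "$HOME/results"

# 3. Actively clean up: delete intermediate files that are no longer
#    needed (they can be reproduced by re-running the program).
rm -rf "$SCRATCHDIR"
```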

We remind users on ECCO that there is NO BACKUP of any files that you create. You must either commit to your own offsite Subversion or Git repository, or make other accommodations to copy files offsite.

Programming with limits

Most simple jobs can be run with our 'qcommands', but as soon as your job sequence becomes more complex, you will want to create your own qsub scripts. With custom qsub scripts, you can

control which nodes your programs get allocated to. You may want to do this when you have intermediate files on $SCRATCH. You do NOT want to do this in general, since it may delay when your job can be run.

ask for longer walltime. Most queues have a default wallclock limit (the time your job is allowed to run). They also have a maximum wallclock limit, the longest run time a job may request. If your job runs for a very long time, you will want to increase the requested time. (PBS command: "-l walltime=HH:MM:SS")
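A custom qsub script can combine both requests. The node name and the times below are made up for illustration:

```shell
#!/bin/bash
# Request a specific node, e.g. when intermediate files already sit on
# that node's $SCRATCH. (Torque accepts a node name in -l nodes=...;
# requesting a specific node may delay when your job can run.)
#PBS -l nodes=compute-0-3

# Request a longer walltime than the queue default (HH:MM:SS),
# up to the queue's maximum:
#PBS -l walltime=48:00:00

stata-mp -b do 02_analysis.do
```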

You will also want to cut jobs into smaller programs, so that they can run within the wallclock limits, and in order to make them robust to restarts.

Unless you have a regression routine that by itself pushes the limits, you can at a minimum separate the data preparation steps from the analysis steps, writing the output of the data preparation step to a location on $SCRATCH, from where the analysis step can read it.
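A sketch of that split as two qsub scripts sharing a file on $SCRATCH (do-file names follow the example later in this page; the intermediate filename and walltimes are hypothetical):

```shell
# --- 01_prep.qsub: data preparation writes its output to $SCRATCH ---
#PBS -l walltime=04:00:00
stata-mp -b do 01_prep.do        # saves $SCRATCHDIR/prepared.dta

# --- 02_analysis.qsub: analysis reads the prepared file back in ---
#PBS -l walltime=01:00:00
stata-mp -b do 02_analysis.do    # uses $SCRATCHDIR/prepared.dta
```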

You can have one data preparation routine, and multiple regression routines. By separating them into separate qsub programs, you can submit all regression routines simultaneously (within the limits imposed on the system). At a minimum, if running two of them in parallel, you reduce your overall time waiting for all the results by up to 50%. Not bad.
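Submitting one preparation job followed by several analysis jobs can be done with PBS/Torque job dependencies (the afterok dependency type is standard; the qsub script names are hypothetical):

```shell
# Submit the preparation job and capture its job ID.
PREPID=$(qsub 01_prep.qsub)

# Submit several analysis jobs that only start once the preparation
# job has finished successfully; they then run in parallel,
# within the limits imposed on the system.
qsub -W depend=afterok:$PREPID 02_analysis_modelA.qsub
qsub -W depend=afterok:$PREPID 02_analysis_modelB.qsub
```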

Another example: Stata with intermediate storage and a config file

The programs below (config.do, 01_prep.do, 02_analysis.do, the qsub programs) split the processing into two steps: the preparation, which stores a file on $SCRATCH, and the analysis, which reads that file. Because $SCRATCH is specific to a particular compute node (not shared), we capture what node the 01_prep.qsub was submitted to, and add it to the
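One way to capture the node is sketched below. The mechanism (writing the hostname to a file in $HOME, then requesting that node for the analysis job) is an assumption for illustration, not necessarily the exact method used in the example programs:

```shell
# Inside 01_prep.qsub: record which node this job ran on, so the
# analysis job can be sent back to the same node's $SCRATCH.
hostname > $HOME/prep_node.txt

# Later, submit the analysis job to that node (Torque syntax;
# note that requesting a specific node may delay scheduling):
qsub -l nodes=$(cat $HOME/prep_node.txt) 02_analysis.qsub
```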