
Checkpointing HOWTO

Many codes do checkpointing. That is, they periodically write the
current state of the computation to a file. The idea is that if the
code dies for whatever reason, it can be restarted from the last good
checkpoint. This saves time, particularly for long computations.

Remember that PVFS is intended as a high-speed scratch file system.
Writing your checkpoint files to PVFS is a very good idea
because the file system is so fast. However, there is a chance one
of the IO servers could go down, in which case you will lose access
to any files stored on that server. This danger is true for any
file system, not just PVFS. Let's take a few moments to examine how
we might modify our codes to do better checkpointing.


A simple approach to checkpointing is to write the state of the
computation to a file at some interval during the run.
The checkpoint file name is usually kept the same, since this saves
file space. This approach is also convenient from a coding point
of view, since the code uses the same file name for both writing
and reading the checkpoint. However, if a problem occurs while
the checkpoint file is being written, the file will be corrupt
and you will have lost the benefit of checkpointing; that is,
you must restart your entire program.
Moreover, if the file system becomes corrupt or goes off-line,
you will have to wait until it has been repaired or restored
to get the checkpoint file back.
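The single-file approach can be sketched as follows. This is only an illustration; the `write_checkpoint` and `read_checkpoint` names, and the use of Python's pickle format for the state, are assumptions for the sketch, not part of any particular code:

```python
import pickle

CHECKPOINT = "checkpoint.dat"  # same name every time, to save file space

def write_checkpoint(state):
    # If the process dies part-way through this write, the file is
    # corrupt and the previous checkpoint is already gone -- the
    # weakness described above.
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(state, f)

def read_checkpoint():
    # On restart, reload the last state that was written.
    with open(CHECKPOINT, "rb") as f:
        return pickle.load(f)
```

A restarted run would simply call `read_checkpoint()` and resume from whatever state it returns.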

There are several ways to avoid some of these problems, and they are
the same for PVFS as for any other file system. The first thing you
should do is write to multiple files and partitions. I would
recommend rotating through at least two, and preferably three, files
and partitions, if available.

The first checkpoint should write to the first file name. If
possible, you should read the data back in to make sure the file
is the correct size (this is optional, of course).
You can also estimate the expected size of the file to make
sure it is correct. After the file has been written and you
have determined that its size is correct, compute an md5sum
of the file and save that as well. Also, if possible, the file should
be copied from PVFS to a file system that is backed up.
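The verification steps above can be sketched like this. The function name `verify_and_record` and the `backup_dir` destination are assumptions for illustration; the expected size would come from your own estimate of the data you wrote:

```python
import hashlib
import os
import shutil

def verify_and_record(path, expected_size, backup_dir=None):
    """Check the checkpoint file's size, record its md5sum alongside it,
    and optionally copy both to a file system that is backed up."""
    actual = os.path.getsize(path)
    if actual != expected_size:
        raise IOError("size mismatch: expected %d, got %d" % (expected_size, actual))
    # Compute the md5sum in chunks so large checkpoint files
    # do not have to fit in memory.
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    with open(path + ".md5", "w") as f:
        f.write(md5.hexdigest() + "\n")
    if backup_dir is not None:
        # e.g. copy off PVFS to a backed-up home file system
        shutil.copy2(path, backup_dir)
        shutil.copy2(path + ".md5", backup_dir)
    return md5.hexdigest()
```

On restart you can recompute the md5sum of the checkpoint file and compare it against the saved value before trusting the data.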

The next checkpoint should write to the next file name. After
writing, follow the same process of checking the size,
computing the md5sum of the checkpoint file, and copying the
file to a file system that is backed up.

This process continues for as many checkpoint files as you want. After
you have written the last in the series, you then use the first
filename, then the second, and so on.
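The rotation itself can be sketched with a simple cycle over the file names. The base name `checkpoint` and the helper name `checkpoint_names` are assumptions for the sketch; in practice each name could live on a different partition:

```python
import itertools

NFILES = 3  # rotate through three names, on separate partitions if available

def checkpoint_names(base="checkpoint"):
    """Return an endless rotation of checkpoint file names:
    base.0, base.1, base.2, base.0, ..."""
    return itertools.cycle(["%s.%d" % (base, i) for i in range(NFILES)])

# Each checkpoint interval takes the next name from the cycle,
# so after the last name in the series you wrap back to the first.
names = checkpoint_names()
```

Each call to `next(names)` yields the file name for the next checkpoint, wrapping around after the last one, exactly as described above.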

The key to this process is using multiple files for writing
checkpoint data. Also, be sure to compute the md5sum and if
possible copy the checkpoint files from PVFS to another file system
that is backed up.

RAID-1 Within PVFS Itself

Every so often the idea of using RAID-1 (mirroring) within PVFS
itself comes up on the PVFS and PVFS2 mailing lists. The concept
is to split the IO servers in half, create a PVFS file
system from one half, and then mirror it on the other half of the
IO servers. Then, if an IO server goes down, the mirrored PVFS
can take over until the faulty IO server is brought back on line.

There are a couple of downsides to this idea. First, you are only
using half of your IO servers, which means you will only get half
the speed. Second, the RAID-1 operation slows throughput further
because the data must also be copied to the mirrored IO
servers. You can look at this one of two ways: you will get less
than half the speed you could be getting, or you are paying twice
the money for the same speed.

Moreover, remember the intention of PVFS: it is designed to be a
high-speed scratch file system. The key word is scratch.
Therefore, redesigning or adding internal components to make PVFS
more resilient goes against a basic tenet of the PVFS design. Even
though the PVFS developers do their best to add things that help
the resiliency of PVFS, they will normally not do anything that
sacrifices its performance potential.

Parting Comments

This column illustrates several ways you can improve the resilience
and flexibility of PVFS. Some of these options are trade-offs, and
some improve both the throughput and flexibility of PVFS. As always,
your application should dictate how you deploy PVFS.

Dr. Jeff Layton hopes to someday have a 20 TB file system in his home
computer. He lives in the Atlanta area
and can sometimes be found lounging at the nearby Fry's, dreaming of
hardware and drinking coffee (but never during working hours).