A Benchmark for Parallel File Systems


In a previous article I started to explore benchmarks for parallel
file systems. In the article, we learned that benchmarks for serial file
systems are not the best tools to measure the performance of parallel file
systems (big surprise). Five of the most common parallel file system
benchmarks were also mentioned, but their usefulness is limited because
each applies only to certain workloads and/or certain access
patterns -- either memory access or storage access.

In this article we will take a look at a relatively new parallel
file system benchmark suite that was designed to capture the behavior
of several classes of parallel access signatures.

Introduction

A very good starting point for a new synthetic parallel file system
benchmark suite is the
master's thesis of Frank Shorter at the
Parallel Architecture Research Laboratory (
PARL) at Clemson University,
under the direction of Dr. Walt Ligon. [Just to be clear, there is no connection (that we know of) between the
name of our mascot (Walt) and the esteemed Walt Ligon.-- Ed.]

This article will discuss synthetic benchmarks, that is, benchmarks
that are not real applications, but are designed to reflect real
applications. Due to the wide range of I/O requirements of various
codes, developing synthetic benchmarks is the only realistic approach
to benchmarking parallel file systems.

Before jumping into a discussion of the new benchmark suite that
Mr. Shorter suggests, we should ask: what should a parallel file
system benchmark measure, and what should it produce?

Let's start with the premise that parallel file systems are designed
for high performance for large amounts of data. So we're interested in
measuring how long it takes to read and write data to and from the
file system. One can think of this as the bandwidth of the file system
(given the time and the amount of data, you can compute the bandwidth).
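For example (with made-up numbers), the arithmetic is simply the total bytes moved divided by the elapsed time:

```python
# Hypothetical run: 8 processes, each writing 512 MiB, 22.4 s elapsed.
total_bytes = 8 * 512 * 2**20      # total data moved, in bytes
elapsed = 22.4                     # elapsed wall-clock time, in seconds

bandwidth = total_bytes / elapsed / 2**20   # aggregate bandwidth in MiB/s
print(f"{bandwidth:.1f} MiB/s")
```

The interesting questions, as we'll see below, are what counts as the elapsed time when many processes are involved, and how much data is enough.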

However, there are some caveats here. First, what is a "large amount
of data"? Second, how do we effectively measure time for distributed
I/O when the nodes are independent of one another? Hopefully I'll answer
these questions as we discuss some new benchmarks.

MPI-IO Is Here

Before the advent of the MPI 2.0 standard, writing data to and from
parallel file systems was done on an ad hoc basis that many times
relied on non-standard methods. MPI 2.0 brought MPI-based functions for
reading and writing data to the file system -- in our case, a parallel
file system. This change provides portability to MPI applications
needing parallel I/O. Some MPI implementations provide the MPI-IO
functions in their 1.x versions as well.

To simplify coding and help improve performance, MPI-IO was designed
to abstract the I/O operations. It allows the code to set what is called
a file view, which defines which part of a file is visible to
a process. MPI-IO also defines collective I/O functions that let
non-contiguous file accesses be expressed in just a few calls. By
defining a file view and using collective I/O functions, the code can
perform its I/O in a handful of function calls.

MPI-IO also allows complex derived datatypes to be constructed for
data of any type and size. The combination of a file view and derived
datatypes allows virtually any access pattern to be expressed. In
a generic sense, the data, whether in memory or in a file, can be
laid out either contiguously or non-contiguously, in any combination:
non-contiguous in memory and contiguous in the file, contiguous in
memory and non-contiguous in the file, or non-contiguous in both.
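To make the file-view idea concrete, here is a small sketch of the offset arithmetic behind a strided file view, in plain Python rather than actual MPI calls; the function and parameter names are mine, invented purely for illustration:

```python
def view_offsets(rank, nprocs, block, count, etype_size=8):
    """Byte offsets in the file that `rank` sees under a strided view.

    The view starts at a displacement of rank*block etypes and exposes
    `count` blocks of `block` etypes, one block every nprocs*block
    etypes -- so the ranks' views interleave to tile the whole file.
    (Illustrative arithmetic only, not the MPI-IO API.)
    """
    stride = nprocs * block                      # etypes between block starts
    offsets = []
    for rep in range(count):
        start = (rank * block + rep * stride) * etype_size
        offsets.extend(range(start, start + block * etype_size, etype_size))
    return offsets

# Rank 1 of 4 processes, blocks of 2 doubles, 3 blocks each:
print(view_offsets(1, 4, 2, 3))   # [16, 24, 80, 88, 144, 152]
```

With a view like this in place, each process can hand the I/O layer a single contiguous memory buffer, and the library scatters it to the process's non-contiguous file locations -- which is exactly the convenience the file-view abstraction buys you.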

For the rest of the discussion, I'll focus on benchmarks that use
MPI-IO: first, because it is a standard that allows codes to be
portable; second, because it allows different access patterns to
be easily coded into the benchmark.

Work Loads

Before a synthetic benchmark can actually be written, we must choose
the access patterns we want to simulate. These patterns come
from examining the typical work loads that people see in their codes.

As discussed in Mr. Shorter's thesis, past efforts at defining work
loads provide a very good summary of the dominant ones. One of
the dominant patterns, found by examining a large number of codes,
is what is called strided access. A stride is the distance within
the file from the beginning of one part of a data block to the
beginning of the next part.

Mr. Shorter describes some past work from the CHARISMA project
showing that there are two types of strided access. The first, termed
simple strided, was found in many codes, with and without
serial access patterns. In other words, the data may be accessed in
a serial manner, but with some fixed distance between successive
accesses. The CHARISMA researchers went on to note that serial access
can be mathematically decomposed into a series of simple strides.

The second popular stride pattern is called "nested stride." In this
case, the code is accessing a stripe of data, and within the stripe
the data is accessed in a set of simple strides -- hence the term
nested stride.
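The two patterns are easy to pin down with a little arithmetic. Below is a sketch; the helper names and parameters are my own, not anything from pio-bench or CHARISMA:

```python
def simple_stride(first, stride, count):
    """Start offsets of a simple strided pattern: `count` accesses whose
    starts are separated by `stride` bytes, beginning at `first`."""
    return [first + i * stride for i in range(count)]

def nested_stride(first, inner_stride, inner_count, outer_stride, outer_count):
    """A nested stride: a simple stride pattern repeated at a larger
    outer stride (e.g. simple strides within each stripe of a file)."""
    offsets = []
    for j in range(outer_count):
        offsets += simple_stride(first + j * outer_stride,
                                 inner_stride, inner_count)
    return offsets

print(simple_stride(0, 16, 4))          # [0, 16, 32, 48]
print(nested_stride(0, 16, 2, 64, 3))   # [0, 16, 64, 80, 128, 144]
```

Setting `outer_count` to 1 collapses the nested pattern back to a simple stride, which matches the observation that one is just a composition of the other.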

In addition to how the data is actually accessed (the spatial access
pattern), past efforts at examining the I/O patterns of programs have
shown that there is also a temporal access pattern. As you would guess,
researchers found that codes sometimes have access patterns that
reflect the various procedures in the code. Such patterns vary with
time, and so are called temporal access patterns.

Introducing pio-bench

The result of Mr. Shorter's thesis was a new benchmark suite,
named pio-bench. In this section, I'll go over some of the design
aspects that he considered. While it may seem a bit dry, understanding
the critical issues and how they were implemented in the benchmark
suite will help you understand the benchmark results.

Finding The Time

In his work, Mr. Shorter has taken great pains to develop a framework
that provides accurate timing. The first step was to divide the actual
benchmark into three pieces, a setup phase, an operations phase (reads
or writes), and a cleanup phase. While this division may sound
simplistic, it lets the timings focus on whichever pieces of the whole
I/O process are of interest, and it standardizes the timing reports
across all of the benchmarks in the suite.


As I mentioned in the previous column, measuring the time it takes
to perform I/O on a distributed system is a difficult proposition.
The clocks on the nodes are skewed relative to one another, and the
nodes will finish their various portions of the I/O at different times.
So, how does one measure time for a parallel file system benchmark?

To resolve this issue, the pio-bench code uses aggregate time. That is,
the time from when the earliest process starts its I/O, to the time
that the last process finishes its I/O.

The general flow of the benchmarks begins with the setup phase. After
the setup phase, an MPI_Barrier is called. This step ensures that the
various MPI processes are synchronized. Then the operations take place,
where the I/O is actually performed, followed by another MPI_Barrier.
This aggregate time is written as:

Aggregate Time = S + T + B

Where the term S is the time it takes between the end of the first
MPI_Barrier and the beginning of the first file access by any process.
The term T is the total amount of time that all accesses take from
beginning to end. Note that some processes may be done before T
is reached. At the end of the operations phase, another MPI_Barrier is
called. The term B is the time that this second barrier takes after
each access has completed. This second barrier ensures that all
processes have finished executing their operations.
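With made-up per-process timestamps, the decomposition looks like this (the variable names are mine, and pio-bench's internals may differ):

```python
# Hypothetical timestamps (seconds) for 4 processes, e.g. from MPI_Wtime().
barrier_release = 9.98                   # first MPI_Barrier completes
starts = [10.02, 10.00, 10.05, 10.01]    # each process's first file access
ends   = [19.70, 19.95, 19.40, 20.00]    # each process's last access completes
barrier_done = 20.03                     # second MPI_Barrier completes

S = min(starts) - barrier_release        # startup: barrier exit to first access
T = max(ends) - min(starts)              # span of all file accesses
B = barrier_done - max(ends)             # closing barrier after the last access
aggregate = S + T + B                    # equals barrier-to-barrier time
print(round(aggregate, 2))               # 10.05
```

Note that S and B are overheads of the measurement framework itself: if the operations phase runs long enough, T dominates and the aggregate time approaches the pure I/O time.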

If the amount of data processed is small, including B in the timings
can skew the results. So, the time for the I/O processing must be long
enough so that B is insignificant to the aggregate time. This
requirement helps define the minimum amount of data the benchmark needs
to run for reproducible results.