Raw Data Storage in HDF5

This document describes the various ways that raw data is
stored in an HDF5 file and the object header messages which
contain the parameters for the storage.

Raw data storage has three components: the mapping from some
logical multi-dimensional element space to the linear address
space of a file, compression of the raw data on disk, and
striping of raw data across multiple files. These components
are orthogonal.

Some goals of the storage mechanism are to be able to
efficently store data which is:

Small

Small pieces of raw data can be treated as meta data and
stored in the object header. This will be achieved by storing
the raw data in the object header with message 0x0006.
Compression and striping are not supported in this case.

Complete Large

The library should be able to store large arrays
contiguously in the file provided the user knows the final
array size a priori. The array can then be read/written in a
single I/O request. This is accomplished by describing the
storage with object header message 0x0005. Compression and
striping are not supported in this case.

Sparse Large

A large sparse raw data array should be stored in a manner
that is space-efficient but one in which any element can still
be accessed in a reasonable amount of time. Implementation
details are below.

Dynamic Size

One often doesn't have prior knowledge of the size of an
array. It would be nice to allow arrays to grow dynamically in
any dimension. It might also be nice to allow the array to
grow in the negative dimension directions if convenient to
implement. Implementation details are below.

Subslab Access

Some multi-dimensional arrays are almost always accessed by
subslabs. For instance, a 2-d array of pixels might always be
accessed as smaller 1k-by-1k 2-d arrays always aligned on 1k
index values. We should be able to store the array in such a
way that striding though the entire array is not necessary.
Subslab access might also be useful with compression
algorithms where each storage slab can be compressed
independently of the others. Implementation details are below.

Compressed

Various compression algorithms can be applied to the entire
array. We're not planning to support separate algorithms (or a
single algorithm with separate parameters) for each chunk
although it would be possible to implement that in a manner
similar to the way striping across files is
implemented.

Striped Across Files

The array access functions should support arrays stored
discontiguously across a set of files.

Implementation of Indexed Storage

The Sparse Large, Dynamic Size, and Subslab Access methods
share so much code that they can be described with a single
message. The new Indexed Storage Message (0x0008)
will replace the old Chunked Object (0x0009) and
Sparse Object (0x000A) Messages.

The Format of the Indexed Storage Message

byte

byte

byte

byte

Address of B-tree

Number of Dimensions

Reserved

Reserved

Reserved

Reserved (4 bytes)

Alignment for Dimension 0 (4 bytes)

Alignment for Dimension 1 (4 bytes)

...

Alignment for Dimension N (4 bytes)

The alignment fields indicate the alignment in logical space to
use when allocating new storage areas on disk. For instance,
writing every other element of a 100-element one-dimensional
array (using one HDF5 I/O partial write operation per element)
that has unit storage alignment would result in 50
single-element, discontiguous storage segments. However, using
an alignment of 25 would result in only four discontiguous
segments. The size of the message varies with the number of
dimensions.

A B-tree is used to point to the discontiguous portions of
storage which has been allocated for the object. All keys of a
particular B-tree are the same size and are a function of the
number of dimensions. It is therefore not possible to change the
dimensionality of an indexed storage array after its B-tree is
created.

The Format of a B-Tree Key

byte

byte

byte

byte

External File Number or Zero (4 bytes)

Chunk Offset in Dimension 0 (4 bytes)

Chunk Offset in Dimension 1 (4 bytes)

...

Chunk Offset in Dimension N (4 bytes)

The keys within a B-tree obey an ordering based on the chunk
offsets. If the offsets in dimension-0 are equal, then
dimension-1 is used, etc. The External File Number field
contains a 1-origin offset into the External File List message
which contains the name of the external file in which that chunk
is stored.

Implementation of Striping

The indexed storage will support arbitrary striping at the
chunk level; each chunk can be stored in any file. This is
accomplished by using the External File Number field of an
indexed storage B-tree key as a 1-origin offset into an External
File List Message (0x0009) which takes the form:

The Format of the External File List Message

byte

byte

byte

byte

Name Heap Address

Number of Slots Allocated (4 bytes)

Number of File Names (4 bytes)

Byte Offset of Name 1 in Heap (4 bytes)

Byte Offset of Name 2 in Heap (4 bytes)

...

Unused Slot(s)

Each indexed storage array that has all or part of its data
stored in external files will contain a single external file
list message. The size of the messages is determined when the
message is created, but it may be possible to enlarge the
message on demand by moving it. At this time, it's not possible
for multiple arrays to share a single external file list
message.

Adds a new, empty external file list message to an object
header and returns a pointer to that message. The message
acts as a cache for file descriptors of external files that
are open.

intn H5O_efl_index (H5O_efl_t *efl, const char *filename)

Gets the external file index number for a particular file name.
If the name isn't in the external file list then it's added to
the H5O_efl_t struct and immediately written to the object
header to which the external file list message belongs. Name
comparison is textual. Each name should be relative to the
directory which contains the HDF5 file.

H5F_low_t *H5O_efl_open (H5O_efl_t *efl, intn index, uintn mode)

Gets a low-level file descriptor for an external file. The
external file list caches file descriptors because we might
have many more external files than there are file descriptors
available to this process. The caller should not close this file.

herr_t H5O_efl_release (H5O_efl_t *efl)

Releases an external file list, closes all files
associated with that list, and if the list has been modified
since the call to H5O_efl_new flushes the message
to disk.