Things that can affect performance

HDF5 performance, such as speed, memory usage, and storage efficiency, can be affected by how an HDF5 file is accessed or stored. Listed below are performance issues that can occur and how to avoid them.

Excessive Memory Usage

Open Objects

Open objects use up memory. The amount of memory used may be substantial when many objects are left open. You should:

Check for all open HDF5 object identifiers.

Delay opening files and datasets until as close to their actual use as is feasible.

Close files and datasets as soon as their use is completed.

If writing to a portion of a dataset in a loop, be sure to close the dataspace with each iteration; leaving dataspaces open can cause a large temporary "memory leak".

There are APIs to determine whether datasets and groups are left open: H5F_GET_OBJ_COUNT will get the number of open objects in the file, and H5F_GET_OBJ_IDS will return a list of the open object identifiers.

Metadata Cache

The metadata cache can also affect memory usage. Modify the metadata cache settings to minimize the size and growth of the cache as much as possible without decreasing performance.

By default the metadata cache is 2 MB in size, and it can be allowed to increase to a maximum of 32 MB per file. The metadata cache can be disabled or modified. Memory used for the cache is not released until the datasets or file are closed.

Memory and Storage Issues Caused by Chunking

There can be a number of issues caused by using chunking inefficiently. Please see the advanced topic, Chunking in HDF5, for detailed information regarding the use of chunking. Some things that may help are listed below:

There is a chunk cache for each open dataset. The default size for this chunk cache is 1 MB. If there are a lot of chunked datasets left open, a large amount of memory may be used. It can help to reduce the size of the chunk cache if this 1 MB default is not needed.

For best performance, the chunk cache size should be equal to or greater than the chunk size for a dataset. If the chunk cache size is smaller than the dataset's chunk size, it will be ignored and the chunks read directly from disk. This can cause spectacularly and unnecessarily poor performance in cases where an application repeatedly reads small sections of the same chunk, since each of those reads requires reading the entire chunk from disk. If the chunk is compressed, the performance problem is compounded because the entire chunk must be decompressed for each read. The chunk cache size can be modified with the H5P_SET_CHUNK_CACHE call.

Also be aware that if a dataset is read by whole chunks and there is no need to access the chunks more than once on disk, the chunk cache is not needed and can be set to 0 if there is a shortage of memory.

Avoid using a very small chunk size. A small chunk size can carry a lot of overhead, which can affect performance, in addition to making the file a lot larger.

Match the amount read from a chunked dataset to the chunk size, particularly if reading a dataset once from beginning to end in a loop.

Other Issues

If you have a large number of small datasets (smaller than 64 KB), consider storing them as compact datasets. If a dataset's raw data can fit in the dataset's object header, there will be less I/O and less storage used.

The following datatypes can cause memory or performance issues.

Variable Length: Datasets with variable-length datatypes cannot be compressed. Also, frequently editing datasets with variable-length datatypes while closing the file between edits can leave holes in the file. A workaround is to leave the file open while editing the datasets.

A fixed-length dataset that is compressed can be used as an alternative to a variable-length datatype.

Compound Datatypes in Fortran and Java:

Compound datatypes work well with C, but they are slow when used with Fortran or Java. They are also cumbersome, because you can only read/write data by field in F90 and Java. [It is not possible to pass an array of Fortran structures to a C function in a portable manner. In any case, the Fortran layer has to repack the Fortran array into an array of C structures. The main problem is that Fortran enforces type checking at compile time, and it is impossible to overload the h5dread_f/h5dwrite_f functions with a user-defined datatype.]