In part
1, we explained what data chunking is about in the
context of scientific data access libraries such as netCDF-4 and HDF5,
presented a 38 GB 3-dimensional dataset as a motivating example,
discussed benefits of chunking, and showed with some benchmarks
what a huge difference chunk shapes can make in balancing read times
for data that will be accessed in multiple ways.

In this post, I'll continue looking at that example dataset to see how
we derived good chunk shapes, look at
how long it can take to rechunk a multidimensional dataset, and consider
the use of Solid State Disk (SSD) for both accessing and rechunking
data.

Chunk shapes and sizes: You can't always get what you want

Tiling of two-dimensional data is often used to explain chunking,
because it's easy to understand and to illustrate with simple
figures. Common patterns of subgrid access for a
2D array include:

accessing data by rows

accessing data by columns

accessing a rectangular subgrid of data from somewhere in the middle of the array

If data is stored contiguously by rows, then accessing by rows is very fast, accessing by columns is slow, and the speed of accessing subgrids depends on details of how many rows and columns are involved and the shape of the subgrid. That's all still true if you interchange the words "row" and "column".

For 2-dimensional data, if you want to support equally frequent access by either rows or columns,
then a natural solution is chunking the data into rectangular chunks (also known as tiles) so that reading a row
requires the same number of disk accesses as reading a column. You can treat
each chunk as if it were a single disk block that must be read
completely to access any of its data values. An optimum
solution is to make the chunks similar in shape to the entire array,
so that the same number of chunks are required to read an entire
row or an entire column.

For example, if you have a 277 x 349 array of values (the shape of a
horizontal slice in the example dataset in part 1), you could use
chunks of size 28 x 35, so that 10 chunks would be adequate to read
all the values in any row or any column. Notice there's a little overhang in the last column and last row of chunks, which really cover 280 x 350 values, but that's typically handled in the library and not something you have to worry about. To get all the values in a row, you're really reading more data than you want, since each chunk has values from 28 rows, but that's OK because it's typically the number of disk accesses that make I/O slow, not the number of values read. Also, if you're reading successive rows, caching the 35 chunks for a row in memory means you only have to read each chunk once.

If your chunks are smaller than a
disk block you will still have to read a whole disk block, so it
doesn't make much sense to choose chunks significantly smaller than a
disk block. The number of bytes in a 28 x 35 block of floats (4 bytes
each), is 3920 bytes, which is actually pretty close to the 4096 byte
blocks used in many desktop file systems, so 28 x 35 is a pretty good
shape and size for the parameters of this problem (we're ignoring compressed chunks for now).

Two-dimensional examples are often where the discussion of chunk shapes and sizes
ends. However, the 2D case is too simple
to generalize to 3D or higher, because
equalizing access time along each axis is not necessarily the most common access pattern to be optimized in higher dimensions. The real-world example presented in part 1 involved a (time, y, x) floating-point variable of shape 98128 x 277 x 349, in which we wanted to balance access to time series of shape 98128 x 1 x 1 and to
horizontal slices of shape 1 x 277 x 349 so that neither type of access was ridiculously slow. If we just use chunk shapes of similar shape to the 3D variable, we might try 9813 x 28 x 35 chunks. For that chunk shape, a time series will need 10 chunks but a horizontal slice will
need 100 chunks, and take 10 times longer if the number of disk accesses
are the predominant time (as they usually are).

A little algebra can be applied to this 3D case, setting the number of chunks
accessed for a 1D time series equal to the number of chunks
accessed for a 2D horizontal slice, leading to chunks of shape

98128/N2 by 277/N by 349/N

where N4 is the total
number of chunks used to partition the array. It turns out not to
matter how the chunks are distributed along the x and y axes, as long
as their product, the number of chunks in a horizontal slice, is the
same as the number of chunks in a time series. So the most general
formula for optimal chunk shapes for this access pattern has an arbitrary positive constant C,
with chunk shapes

98128/N2 by C*277/N by (1/C)*349/N

Here's source code for a little Python function, chunk_shape_3D, that computes a good shape
for this access pattern of equally frequent 1D and 2D accesses of a 3D
variable. You provide a variable's shape as a list of dimension
sizes, a desired chunk size that the resulting chunks should be close
to without exceeding, and the external size in bytes of each element
of the variable. The function returns a "good" chunk shape. The function handles some details not described here, such as converting ideal shapes with fractional dimensions to practical shapes with whole-number dimensions.

Examples of chunk shapes returned by the function are given in the following table for our 98128 x 277 x 349 variable, for various power-of-two chunk sizes that
include the common cases of size 4 KB, 8KB, 1MB, and 4 MB,
and assuming the variable values are 4 bytes each.

Desired chunk size

(bytes)

Actual chunk size

(bytes)

Chunk shape

(values)

Number of chunks per access

(time series, spatial slice)

4096

3960

33 x 5 x 6

2974, 3304

8192

7728

46 x 6 x 7

2134, 2850

16384

16384

64 x 8 x 8

1533, 1540

1048576

1032000

516 x 20 x 25

191, 196

4194304

4189920

1032 x 29 x 35

96, 100

Rechunking? How Long Will That Take?

If you have some important datasets that are heavily accessed in
complementary ways, such as the 1D and (n-1)D pattern presented here, and
you're convinced that rechunking the data might be a good idea, the
good news is that tools for rechunking are available: nccopy for netCDF-4 data and
h5repack, which works for both HDF5 and netCDF-4 data. Note that you can specify chunking (and compression) for classic-model netCDF data, so you don't have to make use of features of the netCDF-4 data model to get the benefits of chunking.

The bad news is that rechunking BigData, or even only PrettyBigData,
can take a lot of time. The problem is analogous to transposing a
matrix that's too big to fit in memory all at once. But don't
despair. The time it takes to rechunk a big dataset is often only a
small multiple of time to copy the data from one disk
file to another, especially when you have a lot of memory. Maybe some
day rechunking datasets will be a cloud service, which would be
especially convenient if your big datasets are already in the cloud.

Another issue to consider is compression. If your data is
compressed in netCDF-4 or HDF5, it has to be chunked, because a
chunk is the atomic unit of compression as well as disk access.
Rechunking compressed data means reading each compressed chunk,
uncompressing it, rechunking the uncompressed data, recompressing the
new chunks, and writing them out to disk. Believe it or not, that can
be faster than doing the same rechunking with uncompressed data, due to savings in disk I/O for data that is very compressible.

Here are some benchmarks for rechunking our example dataset, using nccopy or h5repack:

Source chunks

Destination chunks

nccopy: disk, SSD (minutes)

h5repack: disk, SSD (minutes)

1032 x 29 x 35

516 x 20 x 25

7, 4

99, 38

1032 x 29 x 35

64 x 8 x 8

10, 10

134, 43

1032 x 29 x 35

46 x 6 x 7

11, 10

144, 46

1032 x 29 x 35

33 x 5 x 6

12, 14

147, 49

Though nccopy is faster than h5repack for netCDF files, it could probably still be sped up significantly. However, the 4-minute time for rechunking to 1 MB chunks using SSD is about the same as copying the 38 GB file using SSD, so it's unlikely that could be sped up much. But optimizing rechunking to smaller chunks might be a good project for a summer intern ...

If you're going to be a frequent rechunker, you'd be wise to get a machine with lots of memory and maybe lots of SSD instead of spinning disk. But there are some surprises with SSD, as we'll see in the next section.

Running these tests have helped to provide some tips on rechunking that weren't obvious to me. First, if you want to rechunk data quickly, what form of source is best, in terms of chunking and compression? Possibilities include unchunked contiguous data, a few large chunks, or lots of small chunks, and in each case, is it better to use compressed or uncompressed data as the source? Benchmarks with the 38 GB example dataset suggest a few answers.

First, rechunking from a contiguous layout is slow unless you can read the entire input file into memory. Typically that's not practical, but you may have to start with contiguous data to get a chunked dataset with a few large chunks that you can use as source for experimenting with creating files with better chunk shapes.

It doesn't seem to matter much whether the input or output data is compressed, as I/O time will dominate the rechunking, as long as you have enough memory dedicated to chunk cache to hold both the compressed input and compressed output in memory. If you lack that much memory, use of chunk cache may have to be tuned for optimum rechunking. This is still somewhat of a dark art, but nccopy lets you specify how much memory to use for chunk caches as a command-line option.

Even with SSD, rechunking data takes a relatively long time. Is it ever worth it? I think it is, for important datasets that will be written once but read many times. It's similar to justification for developing the new zlib-compatible zopfli compression algorithm, which is 100x slower than zlib for compression, but compresses 5% better, so it saves time on every access and is a win after about 20 accesses. With the huge bias in access times discussed in part 1, rechunking is a win if it replaces only a few accesses in the "slow" order.

If Memory Serves: What's Going On with SSD?

If you do I/O intensive manipulations such as rechunking and if you can afford it, equip your machines with SSD in addition to spinning disk (but make sure your SSDs are designed to deal with power faults). How
does using SSD compare with using conventional spinning disks? We've tried them for:

accessing contiguous data

accessing chunked data

rechunking data

Serial access with SSD can easily be 4 or 5 times faster than spinning disks, but that's not the only speedup SSD provides.
Using SSD with traditional contiguous storage can make chunking the
data unnecessary, because
random access is so much faster in SSD than spinning disk. Here's an
example of average time to access time series and horizontal slabs on
our example dataset:

Storage layout,
chunk shapes

Read time
series (sec)

Read spatial
slice (sec)

Performance bias
(slowest / fastest)

Contiguous favoring time range

0.00003

0.00004

1.3

Contiguous favoring spatial slice

53

0.003

18000

1032 x 29 x 35

1.2

1.0

1.2

64 x 8 x 8

0.5

0.3

1.5

46 x 6 x 7

0.6

0.2

2.4

33 x 5 x 6

0.6

0.3

2.4

There's a bit of a mystery here. Why does using SSD make only one form of
access (in red) to unchunked data slow, when spinning disk is slow for 2 forms of access to contiguous layout. And how can SSD access possibly make accessing a spatial slice as fast as a time range for data organized to favor time series access (also in red). Stay tuned, to see if this is just a bug or if there's some logical explanation.

Different chunk shapes are optimal for different patterns of access. For example, if you know that the data will primarily be accessed by fixed slices of the first dimension, optimal chunks would be in the shape 1 x 37 x 256 x 512 or a similar shape that fits within one physical disk block, such as 1 x 5 x 32 x 64, which for 4-byte values fits within a 64K disk block.

If there is no most common access pattern, let D be the number of values you want in a chunk (preferably <= what one disk access can read), then let c = (D/(25256 * 37 * 256 * 512)) ^ (1/4), and use a chunk shape resulting from multiplying each dimension size by c and truncating to an integer.

Interesting post. Unfortunately I have a different situation and I don't know how to optimize data I/O. I have a huge array, 10 dimensions like this: 3x72x73x42x32x23x46x26x83x73
I have to write millions of values at a time on the array, but these data are not contiguous in any dimension! Furthermore, the way I have to write data is quite wired because first I have to read the value of the variable stored in the file, then update it (it's just a sum)and finally I write the new value at the same position on the array. I know the coordinates of all values to update before I had to write them. So at the moment I have this for loop where for each point in the array, I read value, update it and then write it on file. And this for millions of time. Is there any technique to optimize this ?

rank = 3 # this is a special case of n-dimensional function chunk_shape
chunkVals = chunkSize/valSize # ideal number of values in a chunk
numChunks = last(cumprod(varShape))/chunkVals # ideal number of chunks
axisChunks = numChunks**(1/valSize) # ideal number of chunks along each 2D axis
#axisChunks = numChunks**0.25 # ideal number of chunks along each 2D axis

}
# cFloor is typically too small, (numVals(cFloor) < chunkSize).
# Adding 1 to each shape dim results in chunks that are too large, (numVals(cCeil) > chunkSize).
# Want to just add 1 to some of the axes to get as close as possible to chunkSize without exceeding it.

# Here we use brute force, compute numVals(cCand) for all 2**rank candidates and return the one closest to chunkSize without exceeding it.