Stack and Concatenate

We often store large arrays on disk across many different files. We
want to stack or concatenate these arrays together into one logical array.
Dask solves this problem with the stack and concatenate functions, which
stitch many arrays together into a single array, either by creating a new
dimension with stack or by extending an existing dimension with concatenate.

Stack

We stack many existing dask arrays into a new array, creating a new dimension
as we go.

>>> import numpy as np
>>> import dask.array as da
>>> arrays = [da.from_array(np.ones((4, 4)), blockshape=(2, 2))
...           for i in range(3)]  # A small stack of dask arrays

>>> da.stack(arrays, axis=0).shape
(3, 4, 4)

>>> da.stack(arrays, axis=1).shape
(4, 3, 4)

>>> da.stack(arrays, axis=2).shape
(4, 4, 3)

This creates a new dimension with length equal to the number of arrays stacked.
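As a quick check on what stack does with the blocks, each input array becomes
a single chunk along the new axis. A small sketch, using the current dask API
in which the blockshape= keyword above has since been renamed chunks=:

>>> import numpy as np
>>> import dask.array as da
>>> arrays = [da.from_array(np.ones((4, 4)), chunks=(2, 2))
...           for i in range(3)]
>>> da.stack(arrays, axis=0).chunks  # one length-1 chunk per input along axis 0
((1, 1, 1), (2, 2), (2, 2))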

Concatenate

We concatenate existing arrays into a new array, extending them along an
existing dimension.

>>> import numpy as np
>>> import dask.array as da
>>> arrays = [da.from_array(np.ones((4, 4)), blockshape=(2, 2))
...           for i in range(3)]  # A small stack of dask arrays

>>> da.concatenate(arrays, axis=0).shape
(12, 4)

>>> da.concatenate(arrays, axis=1).shape
(4, 12)
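Concatenate, by contrast, keeps the existing blocks and lines them up end to
end along the chosen axis. Continuing from the arrays defined in the sketch
above (again with the modern chunks= spelling):

>>> da.concatenate(arrays, axis=0).chunks  # existing blocks lined up along axis 0
((2, 2, 2, 2, 2, 2), (2, 2))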

(The array x used below was built by concatenating a temperature variable out
of many daily netCDF files; the exact steps are spelled out in the What just
happened section.) I suspect that the temperature scale is in Kelvin. It
looks like the random day is taken during Northern Hemisphere summer. For
another fun one, let's look at the difference between the temperatures at
00:00 and at 12:00:

>>> imshow(x[::4].mean(axis=0) - x[2::4].mean(axis=0), cmap='RdBu_r')

Even though this looks and feels like NumPy, we're actually operating off of
disk using blocked algorithms. We execute these operations using only a small
amount of memory. If these operations were computationally intense (they
aren't) then we would also benefit from multiple cores.

What just happened

To be totally clear, the following steps just occurred (a code sketch putting
them all together follows the list):

Open up a bunch of netCDF files and locate a temperature variable
within each file. This is cheap.

For each of those temperature variables create a da.Array object,
specifying how we want to block up that variable. This is also cheap.

Make a new da.Array by concatenating all of our da.Arrays for each
day. This, like the other steps, is just book-keeping. We haven’t loaded
data or computed anything yet.

Write numpy-style code x[::4].mean(axis=0) - x[2::4].mean(axis=0).
This creates yet another da.Array with a more complex task graph. It
takes a few hundred milliseconds to create this dictionary.

Call imshow on our da.Array object.

imshow calls np.array on its input; this starts the multi-core task
scheduler.

A flurry of chunks fly out of all the netCDF files. These chunks meet
various NumPy functions and create new chunks. Well-organized magic occurs
and an np.ndarray emerges.

Matplotlib makes a pretty graph.
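Put together, the whole pipeline looks roughly like the sketch below. The
file pattern and the variable name 't2m' are hypothetical stand-ins for
whatever your netCDF files actually contain, the chunk shape is an arbitrary
choice, and chunks= is the modern spelling of this post's blockshape=
keyword:

>>> from glob import glob
>>> import netCDF4
>>> import numpy as np
>>> import dask.array as da
>>> from matplotlib.pyplot import imshow

>>> datasets = [netCDF4.Dataset(fn) for fn in sorted(glob('2014-*.nc'))]  # step 1: cheap
>>> temps = [d.variables['t2m'] for d in datasets]     # hypothetical variable name
>>> arrays = [da.from_array(t, chunks=(4, 200, 200))   # step 2: also cheap
...           for t in temps]
>>> x = da.concatenate(arrays, axis=0)                 # step 3: book-keeping only
>>> diff = x[::4].mean(axis=0) - x[2::4].mean(axis=0)  # step 4: builds a task graph
>>> imshow(np.array(diff), cmap='RdBu_r')              # steps 5-8: computation happens here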

Problems that Popped Up

The threaded scheduler is introducing significant overhead in its planning.
For this workflow the single-threaded naive scheduler is actually significantly
faster. We’ll have to find better solutions to reduce scheduling overhead.
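If the planning overhead bites, you can opt into the single-threaded
scheduler yourself. At the time of writing this was selected with a get=
keyword; the sketch below uses the configuration API from later dask
releases, reusing diff from the sketch above:

>>> import dask
>>> with dask.config.set(scheduler='synchronous'):  # run tasks in the calling thread
...     result = diff.compute()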

Conclusion

I hope that this shows off how dask.array can be useful when dealing with
collections of on-disk arrays. As always I’m very happy to hear how we can
make this project more useful for your work. If you have large n-dimensional
datasets I’d love to hear about what you do and how dask.array can help. I
can be reached either in the comments below or at [email protected].

Acknowledgements

First, other projects can already do this. In particular, if this seemed
useful for your work then you should probably also know about Biggus,
produced by the UK Met Office, which has been around for much longer than
dask.array and is used in production.