We now give performance numbers on out-of-core matrix-matrix multiplication.

Matrix-Matrix Multiplication

Dense matrix-matrix multiplication is compute-bound, not I/O-bound.
We spend most of our time doing arithmetic and relatively little time shuffling
data around. As a result we may be able to stream large arrays from disk
without losing performance.

When multiplying two $n\times n$ matrices we read $O(n^2)$ bytes but perform
$O(n^3)$ arithmetic operations. That works out to $O(n)$ computations per byte
so, relatively speaking, I/O is cheap.
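To make that ratio concrete, here is a quick back-of-the-envelope check (the matrix size is an arbitrary choice of mine):

```python
n = 10000                    # arbitrary example size
bytes_read = 2 * n**2 * 8    # two input matrices of 8-byte floats
flops = 2 * n**3             # one multiply and one add per inner-loop step
print(flops / bytes_read)    # 1250.0 floating point operations per byte
```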

We normally measure speed for single CPUs in Giga Floating Point Operations
Per Second (GFLOPS). Let's look at how my laptop does on single-threaded
in-memory matrix-matrix multiplication using NumPy.
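A benchmark along these lines reproduces the measurement; the matrix size and timing harness below are my own sketch, not the exact code from the original run:

```python
import time
import numpy as np

n = 2000
x = np.ones((n, n), dtype='f8')

start = time.time()
x.dot(x)
elapsed = time.time() - start

# A dense n-by-n multiply costs about 2 * n**3 floating point operations
print('GFLOPS:', 2 * n**3 / elapsed / 1e9)
```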

OK, so NumPy's matrix-matrix multiplication clocks in at roughly 6 GFLOPS. The
np.dot function ties into the GEMM operation in the BLAS library on my
machine. Currently my NumPy just uses the reference BLAS implementation. (You
can check this with np.show_config().)
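For reference, that check is a one-liner:

```python
import numpy as np
np.show_config()   # reports the BLAS/LAPACK libraries this NumPy build links against
```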

Matrix-Matrix Multiply From Disk

For matrices too large to fit in memory we compute the solution one part at a
time, loading blocks from disk when necessary. We parallelize this with
multiple threads. Our last post demonstrates how NumPy+Blaze+Dask automates
this for us.
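By hand, the blocked algorithm looks roughly like this plain-NumPy sketch (the block size is arbitrary, and for simplicity the output stays in memory):

```python
import numpy as np

def blocked_matmul(A, B, block=1000):
    # A and B can be any array-likes that support slicing (e.g. h5py
    # datasets), so only small blocks need to live in memory at once.
    n = A.shape[0]
    C = np.zeros((n, n), dtype='f8')
    for i in range(0, n, block):
        for j in range(0, n, block):
            for k in range(0, n, block):
                C[i:i+block, j:j+block] += A[i:i+block, k:k+block].dot(
                                           B[k:k+block, j:j+block])
    return C
```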

We perform a simple numerical experiment, using HDF5 as our on-disk store.
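A minimal version of such an experiment might look like the following; the file names, matrix size, and chunk shape are illustrative assumptions, not the exact benchmark code:

```python
import h5py
import numpy as np
import dask.array as da

n, chunk = 20000, 1000

# Build a large on-disk dataset, filling it one slab at a time
with h5py.File('myfile.hdf5', 'w') as f:
    dset = f.create_dataset('/x', shape=(n, n), dtype='f8',
                            chunks=(chunk, chunk))
    for i in range(0, n, chunk):
        dset[i:i+chunk, :] = np.random.random((chunk, n))

# Wrap the dataset in a dask array and multiply it by itself;
# dask schedules the blocked multiply across multiple threads
with h5py.File('myfile.hdf5', 'r') as f:
    x = da.from_array(f['/x'], chunks=(chunk, chunk))
    da.to_hdf5('result.hdf5', '/y', x.dot(x))  # streams result blocks to disk
```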

18.9 GFLOPS, roughly three times faster than the in-memory solution. At first
glance this is confusing: shouldn't we be slower coming from disk? The
speedup is due to our use of four cores in parallel. This is good news; we
don't experience much slowdown coming from disk.

It’s as if all of our hard drive just became memory.

OpenBLAS

Reference BLAS is slow; it was written long ago. OpenBLAS is a modern
implementation. I installed OpenBLAS with my system installer (apt-get) and
then reconfigured and rebuilt NumPy. OpenBLAS supports many cores. We'll show
timings with one and with four threads.
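One common way to pin the BLAS thread count is to set an environment variable before NumPy is imported; which variable is honored depends on how OpenBLAS was built, but OMP_NUM_THREADS usually works:

```python
import os

# Must be set before numpy is imported, or the setting is ignored.
# OPENBLAS_NUM_THREADS is an alternative on some builds.
os.environ['OMP_NUM_THREADS'] = '1'   # or '4' for the multi-threaded run

import numpy as np

x = np.ones((2000, 2000), dtype='f8')
x.dot(x)   # now runs on a single OpenBLAS thread
```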

Sadly, the out-of-core solution doesn't improve much with OpenBLAS. Actually,
when both OpenBLAS and dask try to parallelize at the same time we lose
performance.

Results

| Performance (GFLOPS)   | In-Memory | On-Disk |
|------------------------|-----------|---------|
| Reference BLAS         | 6         | 18      |
| OpenBLAS one thread    | 11        | 23      |
| OpenBLAS four threads  | 22        | 11      |

tl;dr: When doing compute-intensive work, don't worry about using disk; just
don't use two mechanisms of parallelism at the same time.

Main Take-Aways

- We don't lose much by operating from disk on compute-intensive tasks.

- Actually, we can improve performance when an optimized BLAS isn't available.

- Dask doesn't benefit much from an optimized BLAS. This is sad and surprising. I expected performance to scale with single-core in-memory performance. Perhaps this is indicative of some other limiting factor.

- One shouldn't extrapolate too far from these numbers. They're only relevant for highly compute-bound operations like matrix-matrix multiply.

Also, thanks to Wesley Emeneker for finding where we were leaking memory,
making results like these possible.