I’ve done a bit of coding in the past with SSE instructions and achieved quite a significant speedup. I’ve also been playing recently with OpenCL as a means of implementing a fast, cross-device version of the Dual-Tree Complex Wavelet Transform (DT-CWT).

Since I lack a GPU capable of supporting OpenCL, I’m rather limited to targeting the CPU. This is a useful task in itself, as I would like a DT-CWT library that is fast on any platform, making use of whatever hardware is available.

OpenCL implements a vector datatype, which the CPU compilers will try to map to SSE instructions. If no vector instructions are available, the compiler should just unroll the vector and perform scalar operations on each element (as would likely happen on most GPUs).

At the core of the DT-CWT is a convolution. The faster the convolution can be done, the faster the DT-CWT is done. This means a good starting point for a DT-CWT implementation is producing an efficient implementation of a convolution.

Given the difficulty of writing and debugging OpenCL, and the need to initially target a CPU, it made sense to me to create, in the first instance, a pure C version of an efficient convolution.
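A minimal pure-C version might look something like this (my own sketch; the function and variable names are illustrative, not those of the original code). The kernel is applied as a sliding dot product without time reversal, which differs from a strict convolution only by a flip of the kernel:

```c
#include <stddef.h>

/* Direct "valid" convolution: for each output sample, accumulate the
 * dot product of the kernel with a window of the input. Produces
 * in_length - kernel_length + 1 outputs. */
void convolve_naive(const float *in, size_t in_length,
                    const float *kernel, size_t kernel_length,
                    float *out)
{
    for (size_t i = 0; i + kernel_length <= in_length; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < kernel_length; k++)
            acc += in[i + k] * kernel[k];
        out[i] = acc;
    }
}
```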

This is compiled with the following options:
gcc -std=c99 -Wall -O3 -msse3 -mtune=core2

Running this with a 1024 sample input vector and a 16 sample kernel, it takes about 24.5μs per loop (on my rather old Core2 Duo).

The first modification is to reimplement this with SSE instructions, computing 4 output values in parallel. If the kernel is of length N, we create N 4-vectors and broadcast each element of the kernel into every element of the corresponding 4-vector. Each step of the output loop then multiplies 4 consecutive input samples by each broadcast kernel vector and accumulates, producing 4 outputs at once. For this, we require that the length of the input vector is a multiple of 4.

Note that we need to compute the last value in the output separately, as the output length is not a multiple of 4. We could also extend this principle to inputs whose length is not a multiple of 4 (but I don’t here!).
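As a sketch of that scheme (my own reconstruction, not the original code), assuming the input length is a multiple of 4:

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics, _mm_malloc/_mm_free */

/* SSE version: each kernel tap is broadcast across a 4-vector, then
 * 4 output samples are accumulated in parallel. The loads here are
 * unaligned (_mm_loadu_ps), since the window advances one sample per
 * tap. Any leftover outputs are handled as scalars. */
void convolve_sse(const float *in, size_t in_length,
                  const float *kernel, size_t kernel_length,
                  float *out)
{
    /* Broadcast each kernel element into its own 4-vector. */
    __m128 *kernel_vecs = _mm_malloc(kernel_length * sizeof(__m128), 16);
    for (size_t k = 0; k < kernel_length; k++)
        kernel_vecs[k] = _mm_set1_ps(kernel[k]);

    size_t out_length = in_length - kernel_length + 1;

    /* Main loop: 4 outputs per iteration. */
    size_t i;
    for (i = 0; i + 4 <= out_length; i += 4) {
        __m128 acc = _mm_setzero_ps();
        for (size_t k = 0; k < kernel_length; k++) {
            __m128 data = _mm_loadu_ps(in + i + k);
            acc = _mm_add_ps(acc, _mm_mul_ps(data, kernel_vecs[k]));
        }
        _mm_storeu_ps(out + i, acc);
    }

    /* Tail: remaining outputs computed as scalars. */
    for (; i < out_length; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < kernel_length; k++)
            acc += in[i + k] * kernel[k];
        out[i] = acc;
    }

    _mm_free(kernel_vecs);
}
```

The intrinsics used here (_mm_set1_ps, _mm_loadu_ps, _mm_mul_ps, _mm_add_ps, _mm_storeu_ps) are all plain SSE; the SSE3 parts of the real code are not reproduced in this sketch.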

I use the SSE intrinsics (specifically, SSE3), included with:
#include <pmmintrin.h>
#include <xmmintrin.h>

The __m128 type is used to denote an SSE 4-vector that can be used with the intrinsics.

We’re still not done though! Each load from the dataset is an unaligned load (_mm_loadu_ps). Aligned loads using SSE are significantly faster than unaligned loads. To make the load aligned, the address of the data to be copied in must be a multiple of 16 (with AVX, Intel’s latest vector instruction set incarnation, I believe this has been increased to 32).

Unaligned loads are a necessity in the code so far, as we step over every sample, meaning we never know what the byte-alignment of any given block of 4 floats will be. By somehow persuading every load to be aligned, we might be able to get more speed.

We can do this by making 4 copies (well, almost whole copies) of the input array, each with a differing alignment, such that every possible set of 4 floats can be found on the crucial 16-byte alignment.

If we start with the original data, with the numbers showing the sample numbers:
original: [0, 1, 2, 3, 4, 5, …]

then the four copies begin at samples 0, 1, 2 and 3 respectively, each placed on a 16-byte boundary:
copy 0: [0, 1, 2, 3, 4, 5, …]
copy 1: [1, 2, 3, 4, 5, 6, …]
copy 2: [2, 3, 4, 5, 6, 7, …]
copy 3: [3, 4, 5, 6, 7, 8, …]

Now the block of 4 floats beginning at any sample i sits on a 16-byte boundary in copy i mod 4.
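A sketch of how those copies might be built, using _mm_malloc for the 16-byte alignment (my own reconstruction; the function name is illustrative):

```c
#include <stddef.h>
#include <string.h>
#include <xmmintrin.h>  /* _mm_malloc / _mm_free */

/* Build 4 copies of the input, copy j starting at sample j, each at a
 * 16-byte-aligned base and zero-padded up to a multiple of 4 floats.
 * The 4 floats beginning at sample i can then be read with an aligned
 * load from copy i % 4. The caller frees each copy with _mm_free. */
void make_aligned_copies(const float *in, size_t in_length,
                         float *copies[4])
{
    for (int j = 0; j < 4; j++) {
        size_t n = in_length - j;
        size_t padded = (n + 3) & ~(size_t)3;  /* round up to 4 floats */
        copies[j] = _mm_malloc(padded * sizeof(float), 16);
        memset(copies[j], 0, padded * sizeof(float));
        memcpy(copies[j], in + j, n * sizeof(float));
    }
}
```

With this layout, when the output index i is a multiple of 4 the inner loop can replace the unaligned _mm_loadu_ps(in + i + k) with an aligned _mm_load_ps(copies[k % 4] + i + k - (k % 4)), since that offset is always a multiple of 4 floats from the aligned base.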

A quick test demonstrates that the speed up from the aligned load overwhelms any increase in time due to extra copies being made.

The time with the same input and kernel is now about 5.4μs.

One final tweak is to fix the length of the kernel loop, replacing kernel_length with a constant. This allows the compiler to unroll the whole of the kernel loop and yields a time of about 4.0μs (code here).
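The change itself is tiny; a sketch with the kernel length baked in as a compile-time constant (names are mine, not from the original code):

```c
#include <stddef.h>

/* With a constant trip count, -O3 can fully unroll the inner loop,
 * removing the loop overhead entirely. 16 matches the kernel length
 * used in the timing runs above. */
#define KERNEL_LENGTH 16

void convolve_fixed(const float *in, size_t in_length,
                    const float *kernel, float *out)
{
    for (size_t i = 0; i + KERNEL_LENGTH <= in_length; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < KERNEL_LENGTH; k++)  /* constant bound */
            acc += in[i + k] * kernel[k];
        out[i] = acc;
    }
}
```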

Inspecting the yielded assembly of the inner loop suggests that we’ve pretty much reached the limit of what is possible, with most of the time being spent doing an aligned load, followed by a 4-vector multiplication, followed by a 4-vector add, the essence of a convolution.

Running on larger arrays suggests that the speedup scales well (I tested up to 32,768 samples). In the repository is a python script that you can play with to convince yourself.


About Henry Gomersall

I'm an engineer wanting to make things actually useful. I'm someone that wants to drive technology and ideas to be helpful for everyone. I'm someone that realises the disconnect between technology and utility and I hope to correct that...