Thursday, February 3, 2011

OpenCL simple memory manipulation tests

Investigations for the 2nd homework assignment in the OpenCL-U (2011 iVEC @ UWA summer school). The aim of the assignment was to write an OpenCL kernel that reverses a buffer of bytes. It's a trivial task; however, there are many ways to achieve it in OpenCL, with varying results.

Using an NVIDIA GT 9600 I was able to make significant performance gains by optimising the size of the kernel's memory reads. Surprisingly, however, on an NVIDIA GTS 360M the performance got worse as I optimised the kernel:

The NVIDIA GT 9600 was only able to handle a 16MB memory buffer. To create a 16MB random test file:

dd if=/dev/urandom of=random.bin bs=1M count=16
The "reverse" program source is a modified version of the template Derek provided here. Sample output from a run:

Each work-item loads a single byte from global memory, calculates its destination index and then writes it. On the GT 9600 this fails to use the full width of the memory data bus when performing each memory read/write.
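The naive version might look something like this. This is a sketch of the approach described above, not the actual assignment code; the kernel and argument names are my own:

```c
// Naive version: each work-item moves a single byte.
// (Hypothetical sketch; names are assumptions, not the assignment code.)
__kernel void reverse_char(__global const char *in,
                           __global char *out,
                           const unsigned int n)   // buffer length in bytes
{
    size_t i = get_global_id(0);       // this work-item's source byte
    if (i < n)
        out[n - 1 - i] = in[i];        // mirror it to its destination
}
```

Each read and write touches only one byte, so on hardware that serves each work-item's transaction separately most of the bus width is wasted.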

We need to modify the data types to use the full width of the memory data bus. But how many bytes should we try accessing per read/write? GPUs are designed for rapidly calculating pixel values. The shader units work with RGBA float values, i.e. 4 × 4 = 16 bytes per pixel. Intuitively, a char16 may be the optimal data width to request.
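A char16 version can reverse the 16 bytes inside the vector with a single swizzle, then write the vector to the mirrored position. Again this is a sketch under assumed names, not the assignment source:

```c
// char16 version: each work-item moves 16 bytes.
// (Hypothetical sketch; names are assumptions.)
__kernel void reverse_char16(__global const char16 *in,
                             __global char16 *out,
                             const unsigned int nvecs)  // buffer length / 16
{
    size_t i = get_global_id(0);
    char16 v = in[i];
    // One swizzle reverses the byte order within the vector.
    out[nvecs - 1 - i] = v.sfedcba9876543210;
}
```

The alternative is sixteen separate component assignments (out.s0 = v.sf; and so on); the swizzle expresses the same permutation in one statement.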

Using the swizzle instead of multiple assignment steps gave a 1.1% speed improvement. However, this may not be optimal, as 2.9GB/sec of data throughput represents only 7% of the GPU's memory bandwidth (see below). Next I increased the data request size using uint16s (4 × 16 = 64-byte reads).

By type-casting each uint to a char4 vector, swizzling the chars, casting back to uints and rearranging them into an output uint16 vector, 64 bytes can be copied by each work-item. This is a 3.7x speed improvement over the simple char16 vector case. This result was surprising.
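The uint16 approach can be sketched as follows. The helper reverses the bytes within one uint via the char4 cast-and-swizzle described above; the kernel then reverses the order of the 16 uints. As before, this is my own reconstruction with assumed names:

```c
// Reverse the 4 bytes inside one uint: cast to char4, swizzle, cast back.
// (Hypothetical sketch of the approach described above.)
inline uint byte_reverse(uint u)
{
    char4 c = as_char4(u);
    return as_uint(c.s3210);
}

// uint16 version: each work-item moves 64 bytes.
__kernel void reverse_uint16(__global const uint16 *in,
                             __global uint16 *out,
                             const unsigned int nvecs)  // buffer length / 64
{
    size_t i = get_global_id(0);
    uint16 v = in[i];
    uint16 r;
    // Reverse the order of the 16 uints while byte-reversing each one.
    r.s0 = byte_reverse(v.sf); r.s1 = byte_reverse(v.se);
    r.s2 = byte_reverse(v.sd); r.s3 = byte_reverse(v.sc);
    r.s4 = byte_reverse(v.sb); r.s5 = byte_reverse(v.sa);
    r.s6 = byte_reverse(v.s9); r.s7 = byte_reverse(v.s8);
    r.s8 = byte_reverse(v.s7); r.s9 = byte_reverse(v.s6);
    r.sa = byte_reverse(v.s5); r.sb = byte_reverse(v.s4);
    r.sc = byte_reverse(v.s3); r.sd = byte_reverse(v.s2);
    r.se = byte_reverse(v.s1); r.sf = byte_reverse(v.s0);
    out[nvecs - 1 - i] = r;
}
```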

While trying to increase vector sizes beyond 64 bytes, the work-group sizes started dropping from the maximum of 512. Perhaps this was due to limits on the size of the register file?

Summary:

Processing large 64-byte vectors provided a 17.6x speed-up over the trivial solution. It also copies/reverses at 10.5GB/sec data throughput, 27% of the GT 9600's memory bandwidth.

Not bad. However, when testing the same kernels on the 360M, the simple one-char-per-work-item solution was fastest! At 5.37GB/sec the simplest solution used 20.3% of the available memory bandwidth, not far from the best result on the GT 9600. The vector optimisations made the 360M slower, an unexpected result.

Update: the 360M has NVIDIA's CL_DEVICE_COMPUTE_CAPABILITY_NV: v1.2. On this device the memory controller coalesces multiple single-word memory transactions, as described in section G.3.2 here (thanks Derek).