SSE2 vs CUDA in parallel bit stream transposition / similarity

Anatoliy Kuznetsov, Igor Tolstoy. Aug 16, 2009.

Introduction

One of the hottest trends today is the adoption of parallel programming techniques for
GPGPU. GPUs have grown into powerful parallel machines claiming Teraflop capabilities at
a commodity price. Graphics cards have proven very capable at floating-point tasks, and there
are attempts to use the GPU for high-bandwidth integer and logical computations applicable
to databases and data mining. We decided to experiment with CUDA on a commodity nVidia 9500 GT card and
create a mini application dealing with a lot of integer bitwise arithmetic.
For comparison we took a commodity Core2 Quad 2.4GHz system and an SSE2-optimized algorithm.

Bit Transposition and Similarity Algorithm

The goal of this algorithm is to take a block of 32-bit integers and do bit slicing, transposing
the array into an equivalent matrix representation, where the N-th row of the matrix corresponds to
the N-th bit of the input integers. Each row of the transposition matrix holds one bit per input
integer: for N input words the row is N bits long (N/32 words).

Figure is taken from [1]

After the transposition is accomplished we compute the self-similarity of the data.
We need to:

For each row of the matrix, compute the number of 1 bits (population count)

For each pair of rows, compute the Hamming distance: the population count of the XOR product of the two bit rows

The result of the similarity stage is an upper triangular 32x32 distance matrix,
where diagonal elements hold the population count of row N and elements U[i,j] hold the Hamming
distance between rows i and j.
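
A minimal scalar sketch of the similarity stage (the names, the 2048-word input block size and the
portable bit counter are our illustrative assumptions, not the library code):

    #include <cstdint>

    const unsigned ROWS = 32;      // one row per bit of the 32-bit input integers
    const unsigned ROW_WORDS = 64; // 64 words = 2048 bits: one bit per integer of a 2048-word block

    // Portable population count of a single 32-bit word.
    static unsigned popcount32(uint32_t w)
    {
        unsigned cnt = 0;
        for (; w; w &= w - 1)      // clears the lowest set bit each iteration
            ++cnt;
        return cnt;
    }

    // dist[i][j] for j >= i: the diagonal holds the population count of row i,
    // off-diagonal elements hold the Hamming distance between rows i and j.
    void similarity(const uint32_t tmatrix[ROWS][ROW_WORDS],
                    unsigned dist[ROWS][ROWS])
    {
        for (unsigned i = 0; i < ROWS; ++i)
        {
            unsigned pc = 0;
            for (unsigned w = 0; w < ROW_WORDS; ++w)
                pc += popcount32(tmatrix[i][w]);
            dist[i][i] = pc;

            for (unsigned j = i + 1; j < ROWS; ++j)
            {
                unsigned hd = 0;
                for (unsigned w = 0; w < ROW_WORDS; ++w)
                    hd += popcount32(tmatrix[i][w] ^ tmatrix[j][w]);
                dist[i][j] = hd;   // upper triangle only
            }
        }
    }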

Why do we think parallel bit streams are important?

Parallel bit streams are an equivalent but alternative representation of a bit-vector.
The transformation allows the same bit-stream operations (AND, NOT, OR, XOR and SUB),
plus it allows random access: any 32-bit word can be gathered in constant time with
a controlled, predictable CPU penalty.
At the same time, plenty of data does not use the full 32-bit capacity, so a lightweight
compression algorithm becomes possible if we know that some of the bit slices are all zero (or all one)
bits or sufficiently similar to each other.
We think this data representation has a lot of potential for data-mining applications manipulating
sparse vectors of all sorts.
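
To illustrate the constant-time random access, here is a hedged sketch of gathering one original
32-bit value back from the slices (the indexing scheme is assumed for illustration):

    #include <cstdint>

    // Gather element number 'idx' back from the bit-sliced matrix.
    // tmatrix[bit] is the slice holding bit 'bit' of every original integer;
    // element 'idx' occupies word idx/32, bit position idx%32 of each slice.
    uint32_t gather_word(const uint32_t* tmatrix[32], unsigned idx)
    {
        unsigned word = idx >> 5;   // idx / 32
        unsigned pos  = idx & 31;   // idx % 32
        uint32_t v = 0;
        for (unsigned bit = 0; bit < 32; ++bit)
            v |= ((tmatrix[bit][word] >> pos) & 1u) << bit;
        return v;                   // a fixed number of operations per access
    }

Slices known to be all zeros (or all ones) can simply be skipped in such a gather, which is where
the lightweight compression mentioned above comes from.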

This algorithm implements 3 nested loops with the innermost loop completely unrolled:
it looks ahead 32 input words, gathers one bit from each, and writes the result into the rows word by word.
We think this algorithm offers both readability and sufficient performance.
The simple 3-loop method proved to be slower, and more complex SSE2 implementations are not
available to us yet. :-) Loop unrolling with many local variables works well because
current generations of x86 processors actively use out-of-order execution and register
renaming to avoid costly memory accesses for local variables. A simplified version of the loop
structure is sketched below.
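
A simplified sketch of that loop structure (the real code unrolls the innermost 32-step gather into
straight-line code with separate accumulator variables; names and the 2048-word block size are illustrative):

    #include <cstdint>

    const unsigned BLOCK_WORDS = 2048;             // input block: 2048 x 32-bit words
    const unsigned ROW_WORDS   = BLOCK_WORDS / 32; // each of the 32 rows: 64 words

    void transpose(const uint32_t* block, uint32_t tmatrix[32][ROW_WORDS])
    {
        for (unsigned bit = 0; bit < 32; ++bit)        // loop 1: destination bit row
        {
            for (unsigned w = 0; w < ROW_WORDS; ++w)   // loop 2: destination word
            {
                const uint32_t* src = block + w * 32;  // forward 32-word lookup
                uint32_t acc = 0;
                for (unsigned k = 0; k < 32; ++k)      // loop 3: fully unrolled in practice
                    acc |= ((src[k] >> bit) & 1u) << k;
                tmatrix[bit][w] = acc;                 // write the row word by word
            }
        }
    }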

The Hamming distance problem is embarrassingly parallel and can easily be coded using SSE2 SIMD
(and probably any other SIMD system). SSE2 optimizations are implemented in the BitMagic
library and described here: "128-bit SSE2 optimization".
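
The idea in a nutshell (a hedged sketch, not the BitMagic code itself): XOR two rows 128 bits at a
time, then bit-count the XOR product; since SSE2 has no POPCNT, the counting stays scalar (or table-based):

    #include <emmintrin.h>   // SSE2 intrinsics
    #include <cstdint>

    // Hamming distance between two bit rows of 'size' 32-bit words
    // ('size' is assumed to be a multiple of 4).
    unsigned hamming_sse2(const uint32_t* row1, const uint32_t* row2, unsigned size)
    {
        unsigned dist = 0;
        for (unsigned i = 0; i < size; i += 4)
        {
            __m128i a = _mm_loadu_si128((const __m128i*)(row1 + i));
            __m128i b = _mm_loadu_si128((const __m128i*)(row2 + i));
            __m128i x = _mm_xor_si128(a, b);          // 128-bit wide XOR

            // No SIMD popcount in SSE2: count the XOR product word by word.
            uint32_t buf[4];
            _mm_storeu_si128((__m128i*)buf, x);
            for (unsigned j = 0; j < 4; ++j)
                for (uint32_t w = buf[j]; w; w &= w - 1)  // clear the lowest set bit
                    ++dist;
        }
        return dist;
    }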

32-bit CPU vs. SSE2

Long story short: the SSE2 parallel algorithm wins by a wide margin.
We tested on different configurations; SSE2-optimized code is typically 2 times faster than 32-bit
code (with lookup-table based bit counting). (The matrix transposition code was exactly the same.)
An interesting note: on the Intel Atom processor the SSE2 code wins by an even higher margin;
maybe in-order execution makes the 32-bit code slow, or maybe the SIMD unit on Atom is very good...

Intel Core2 Quad CPU. (c) Intel Corp.

SSE2 vs SSE4.2

We have no data at this point. The Nehalem microarchitecture implements
hardware POPCNT, so it should be faster. Unfortunately it looks like Intel chose not to
implement a true SIMD 128-bit version of POPCNT, so the final performance of an SSE + POPCNT mix
is an open question...

CUDA optimization notes

We decided to resist the temptation to print here all 256 variants of the CUDA kernel we tried.
But we definitely want to outline the final tricks we used (a combined sketch follows the list):

Pass the input data array as a GPU constant. Constant memory is fast and is roughly the equivalent of a CPU cache,
but unlike a CPU cache, CUDA constant memory can be controlled programmatically.

Use shared memory to store the transposed matrix and compute the Hamming distances.
Again, shared memory here is just another form of close-to-ALU cache, much faster than main GPU memory.

Unroll loops to use more registers. The GPU offers plenty of registers to the programmer (compiler).
Typically you have to find a good balance between starting more threads and running portions of code
sequentially on registers. There is no silver bullet here - you have to experiment.

Use hardware bit counting; CUDA offers it.

Use device memory mapping. It seems to be faster than cudaMemcpy.
(We don't completely understand how nVidia implements memory mapping over PCIe, but it seems like an
interesting feature.)

Optimize for latency. A CUDA kernel call with all its memory transfers is a high-latency
operation. The preparations and staging can actually take longer than the call itself.
This is especially important for relatively simple, high-speed integer operations.
The staging overhead for 4 transposition blocks on CUDA is approximately the same as for 1 block,
so effectively you get 4 blocks for the price of one.
And of course you cannot partition the GPU so that one CPU thread takes one Stream Multiprocessor (SM)
and another thread takes another. One GPU device cannot be used concurrently - so use it all or lose
the remaining SM units.
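
A compact CUDA C++ sketch that combines several of these tricks: constant memory for the input block,
shared memory for the transposed matrix, __popc() for hardware bit counting, and unrolled loops.
The kernel and host names are ours, the 2048-word block size is illustrative, and a real setup would
batch many blocks per launch to amortize the latency discussed above:

    #include <cuda_runtime.h>
    #include <cstdint>

    #define BLOCK_WORDS 2048                  // one transposition block: 2048 x 32-bit words
    #define ROW_WORDS   (BLOCK_WORDS / 32)    // 64 words per bit-slice row

    // Trick 1: the input block travels through fast constant memory.
    __constant__ uint32_t c_block[BLOCK_WORDS];

    // One thread block of 32 threads, one thread per destination bit row.
    __global__ void transpose_similarity(unsigned* dist /* 32x32, row-major */)
    {
        // Trick 2: keep the transposed matrix in shared memory (8 KB here).
        __shared__ uint32_t s_tmatrix[32][ROW_WORDS];

        unsigned bit = threadIdx.x;           // this thread owns bit row 'bit'

        // Transposition: gather bit 'bit' from 32 consecutive input words.
        #pragma unroll                        // Trick 3: trade registers for speed
        for (unsigned w = 0; w < ROW_WORDS; ++w)
        {
            uint32_t acc = 0;
            for (unsigned k = 0; k < 32; ++k)
                acc |= ((c_block[w * 32 + k] >> bit) & 1u) << k;
            s_tmatrix[bit][w] = acc;
        }
        __syncthreads();

        // Similarity: this thread fills row 'bit' of the triangular matrix.
        for (unsigned j = bit; j < 32; ++j)
        {
            unsigned d = 0;
            for (unsigned w = 0; w < ROW_WORDS; ++w)
            {
                uint32_t x = (j == bit) ? s_tmatrix[bit][w]
                                        : (s_tmatrix[bit][w] ^ s_tmatrix[j][w]);
                d += __popc(x);               // Trick 4: hardware population count
            }
            dist[bit * 32 + j] = d;
        }
    }

    // Host side: copy one input block into constant memory and launch 32 threads.
    void run_block(const uint32_t* host_block, unsigned* dev_dist)
    {
        cudaMemcpyToSymbol(c_block, host_block, BLOCK_WORDS * sizeof(uint32_t));
        transpose_similarity<<<1, 32>>>(dev_dist);
    }

The memory mapping mentioned above (the cudaHostAlloc / cudaHostGetDevicePointer route) and the
batching of several blocks per launch are left out of the sketch for brevity.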

Benchmark results

As you can see, on integer operations one GT 9500 Stream Multiprocessor loses to one SSE-aware CPU core,
while 4 SMs start operating on par with 1 CPU core. The raw integer performance of the GPU seems to be higher than the CPU's,
but data transfer certainly takes its lion's share. The Core 2 Quad features 4 CPU cores, so it is also capable of running
4 blocks at a time. The nVidia GT200 has 30 SMs(?).

Is CUDA worth it?

The economics of CUDA development are not bad, but not particularly good either.
The CUDA language pretends to be C, but really looks like a good macro assembler mixed with a declarative approach to compute grid
configuration. The graphics card has no operating system or complex task planner/scheduler of its own, which is both good and bad.
Direct access to the hardware is good when your task scales well. If you need to partition your program into non-uniform small
subtasks, you cannot run too many different tasks asynchronously in parallel without writing an "uber-shader" or "uber-CUDA-kernel".
Uber-shaders are predictably complex and vulnerable to a combinatorial explosion of subalgorithm variants in the GPU kernel code.
So the real-life scalability of a GPU solution is limited by the affordable complexity of the software tools:
the CPU-GPU middleware needed to combine and execute batches of small tasks on the GPU.
A lot of GPU programming reminds us of functional or declarative programming, so a functional high-level language should be very helpful
for auto-parallelization of GPU tasks.

Larrabee?

The economics of vector SIMD programming (SSE) also requires skills; the main issue is that the current generation of
SSE is NOT Turing complete (not even close). From the excellent Michael Abrash article
"A First Look at the Larrabee New Instructions (LRBni)"
it seems that LRBni (the Larrabee vector instruction set) is going to be better, because it is at least vector complete
(it provides vector gather/scatter), which makes it incredibly awesome (but still not Turing complete).
Looking at both cases, SSE2 and CUDA, we can extrapolate (well, speculate about) both the performance and the programming complexity
of the future Intel Larrabee hardware. Larrabee will have 512-bit vector registers. This should be pretty fast if your loop is branchless.
If you need a branch, LRBni provides "vector compare instructions"
(with an open question of how to do actual branching after a VECTOR comparison?).

The upcoming GP-GPU battle of CUDA 3 vs. Larrabee is going to be pretty interesting.
If the raw performance numbers are close to equal(?), the winning factor becomes
compilers, languages, libraries, profilers and debuggers: tools (and (manufacturing) costs).

Conclusion

GP-GPU integer optimization starts making sense when we want to combine a lot of computational resources in one box,
growing a data-mining super-server (or a cluster of super-servers). Combining CPUs + GPUs offers an unprecedented concentration of
computational resources otherwise available only to distributed systems like MPI, Google's Map-Reduce, Hadoop, and other cluster-ware.
Once our picture factors in distributed cloud computing on the net, GP-GPU problems like PCIe latency suddenly become very affordable
(network-based cluster-ware cannot compete with a realtime GPU).
This allows the creation of database super-nodes, where multiple CPU cores and GPUs are combined with fast
random access Solid State Drives, RAM drives, etc.
Concentration of various silicon-based (computational) resources can save bandwidth to slow devices
(read: Hard Drives and Network Storage).

Another consideration is that "The Gigahertz Race" is now over and is shaping into "The Parallel Race".
It means algorithms and systems should be adapted, revisited or re-architected to meet the new reality of hundreds of
parallel threads (not necessarily pthreads), where data needs to be restructured to minimize collisions and facilitate parallel access.

We hope this article is useful for other developers. We would be happy to see your comments, suggestions and objections at the
BitMagic library BLOG.