New example code: TEA encryption with CUDA

I’ve written some more CUDA demonstration code: the Tiny Encryption Algorithm (TEA) implemented in CUDA.
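For readers who have not met TEA before, here is a minimal CPU-side sketch of the standard reference algorithm in plain C. It is not taken from tea_cuda.cu; the function names tea_encrypt and tea_decrypt are my own, and the 32-round count and delta constant follow the commonly published reference version. Each 64-bit block is processed independently of the others, which is presumably what makes the cipher easy to spread across GPU threads.

```c
#include <stdint.h>

/* Golden-ratio-derived constant used by the reference TEA. */
#define TEA_DELTA  0x9E3779B9u
#define TEA_ROUNDS 32

/* Encrypt one 64-bit block v[0..1] in place with a 128-bit key k[0..3]. */
void tea_encrypt(uint32_t v[2], const uint32_t k[4])
{
    uint32_t v0 = v[0], v1 = v[1], sum = 0;
    for (int i = 0; i < TEA_ROUNDS; i++) {
        sum += TEA_DELTA;
        v0 += ((v1 << 4) + k[0]) ^ (v1 + sum) ^ ((v1 >> 5) + k[1]);
        v1 += ((v0 << 4) + k[2]) ^ (v0 + sum) ^ ((v0 >> 5) + k[3]);
    }
    v[0] = v0;
    v[1] = v1;
}

/* Decrypt one 64-bit block in place by running the rounds in reverse. */
void tea_decrypt(uint32_t v[2], const uint32_t k[4])
{
    uint32_t v0 = v[0], v1 = v[1];
    uint32_t sum = TEA_DELTA * TEA_ROUNDS; /* wraps modulo 2^32 */
    for (int i = 0; i < TEA_ROUNDS; i++) {
        v1 -= ((v0 << 4) + k[2]) ^ (v0 + sum) ^ ((v0 >> 5) + k[3]);
        v0 -= ((v1 << 4) + k[0]) ^ (v1 + sum) ^ ((v1 >> 5) + k[1]);
        sum -= TEA_DELTA;
    }
    v[0] = v0;
    v[1] = v1;
}
```

To use it, fill a uint32_t[2] with the plaintext block and a uint32_t[4] with the key, call tea_encrypt, and call tea_decrypt with the same key to get the original block back. The inner loop uses only shifts, adds and XORs on 32-bit words, so it needs very few registers per thread.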

The code demonstrates 100% occupancy, 100% coalesced 128-bit memory transactions, and the use of page-locked memory. It ran at around 380 MB/s on a GTX 260, compared with 40 MB/s on a 2×2.5 GHz Core 2 Duo (without using SSE).

Beware some pitfalls when playing with the execution parameters. Especially beware those implicit memory/threadblock alignment requirements from hell!

Get it here and compile with 'nvcc -Xptxas "-v" -maxrregcount=10 tea_cuda.cu'