Share your bug fixes

The GPU is an unruly beast. Please share your most annoying bugs and how you eventually managed to fix them, so that others won’t have to go through that same horror.

Here are a few of my own, in no particular order:

- Number of parameters in kernel calls: a kernel's stack/parameter space is limited, and the limit may vary with the hardware. As a rule of thumb, try to keep it below 128 bytes. NVIDIA claims it can go up to 256, but with memory alignment and other issues, it can rarely reach that threshold safely. If you're unsure how to calculate your kernel's stack size, assume that in a worst-case scenario each scalar argument takes 8 bytes, while each pointer (i.e., array argument) should be counted as 16 bytes. Exceed the limit, and you'll see crazy errors.

- Launching kernels from a different context: if you're getting "Invalid handle" while launching your kernel in a multi-GPU setting, it usually means that the "gpu" instance you're using to launch the kernel is bound to a different context than the current one. Either use CUDAfy's multi-threading management tools (check CUDAfy's unit tests for examples), or manually make the gpu's context current by calling mygpu.SetCurrentContext() before you launch the kernel.
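Under the hood, SetCurrentContext() corresponds to the CUDA driver API's cuCtxSetCurrent(). As a rough sketch of what has to happen on the launching thread (error handling elided; ctxForGpu is a hypothetical variable holding the context the target device was created with):

```cuda
#include <cuda.h>

// Hypothetical helper: bind the right context, then launch.
void launchOnGpu(CUcontext ctxForGpu, CUfunction kernel)
{
    CUcontext current;
    cuCtxGetCurrent(&current);        // what this thread currently has bound
    if (current != ctxForGpu)
        cuCtxSetCurrent(ctxForGpu);   // roughly what SetCurrentContext() does

    // Launch: 1 block of 256 threads, no shared memory, no arguments
    // (kept minimal for brevity).
    cuLaunchKernel(kernel, 1, 1, 1, 256, 1, 1, 0, NULL, NULL, NULL);
}
```

Launching with a mismatched context is exactly what produces the "Invalid handle" error, since the CUfunction handle is only valid inside the context it was loaded into.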

- If you're getting an ErrorUnknown either right after launching your kernel or (if launching it asynchronously) on the next CUDAfy instruction, it means your kernel aborted unexpectedly. There is a plethora of possible reasons, but the most frequent is a memory access violation. Use emulation mode to pinpoint your problem; the CUDA Toolkit's "cuda-memcheck.exe" can also help you there. Other frequent causes of ErrorUnknown are calling "return" from only some of the threads within a block, dividing by zero, or launching a kernel with an excessively large blockDim * gridDim.
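Since CUDAfy compiles C# kernels down to CUDA C, the "return from only some threads" failure looks, in CUDA terms, like this hypothetical kernel: threads that fail the bounds check exit before the barrier, so the remaining threads of the block wait at a barrier that can never be satisfied:

```cuda
__global__ void divergentReturn(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;                 // some threads in the last block bail out here...

    __shared__ float tile[256];
    tile[threadIdx.x] = in[i];
    __syncthreads();            // ...so this barrier is never reached by ALL
                                // threads of the block: hang / ErrorUnknown

    out[i] = tile[threadIdx.x] * 2.0f;
}
```

The usual fix is to keep all threads alive through the barrier and guard only the memory accesses with the bounds check, instead of returning early.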

- If you're getting inconsistent numerical results that change on every run, most likely you are accessing uninitialized memory somewhere in your computations. Another common source is failing to call SynchThreads() after manipulations of shared memory where more than one thread contributes to a common shared result. Try placing SynchThreads() everywhere, and if that fixes your problem, selectively remove the calls again. This has fixed some vexing problems I had where different hardware & architectures would produce inconsistent results.
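In CUDA terms (CUDAfy's SynchThreads() maps onto __syncthreads()), the classic case is a shared-memory reduction, sketched below as a hypothetical block-sum kernel. Drop either barrier and threads read partial sums their neighbours haven't written yet, so the answer changes from run to run, and differently on different hardware:

```cuda
// Assumes it is launched with blockDim.x == 256 (a power of two).
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float s[256];
    int tid = threadIdx.x;

    s[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();              // every write must be visible before any read

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();          // omit this and the result becomes a race
    }

    if (tid == 0)
        out[blockIdx.x] = s[0];
}
```

Note the barrier sits outside the `if`, so every thread of the block reaches it; combining this bullet with the previous one, a barrier inside a divergent branch would be a bug of its own.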