General Questions

Should I use PyOpenCL or PyCUDA?

Good question. I put together a page that presents arguments that help you decide. The answer will likely depend on your particular situation. In most cases, "it doesn't matter" is probably the correct answer.

Build Questions

How do I make PyCUDA rebuild itself from scratch?

Just delete the build subdirectory created during compilation:

rm -Rf build

Then restart compilation:

python setup.py install

I have <insert random compilation problem> with gcc 4.1 or older. Help!

Try adding:

CXXFLAGS = ['-DBOOST_PYTHON_NO_PY_SIGNATURES']

to your pycuda/siteconf.py or $HOME/.aksetup-defaults.py.

It can also help to add -DBOOST_NO_INCLASS_MEMBER_INITIALIZATION to the same CXXFLAGS list.
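As a sketch, the combined setting might look as follows. It simply merges the two flags named above into one list and assumes that no other CXXFLAGS are already configured in that file:

```python
# pycuda/siteconf.py (or $HOME/.aksetup-defaults.py)
# Assumption: no other compiler flags are set here. If some are,
# extend the existing list instead of overwriting it.
CXXFLAGS = [
    "-DBOOST_PYTHON_NO_PY_SIGNATURES",
    "-DBOOST_NO_INCLASS_MEMBER_INITIALIZATION",
]
```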

I'm getting funny setuptools errors, like KeyError: '_driver'. Why?

Usage Questions

How about multiple GPUs?

Two ways:

Allocate two contexts and juggle them (with pycuda.driver.Context.push and pycuda.driver.Context.pop) from a single process.

Work with several processes or threads, using MPI, multiprocessing or threading. As of version 0.90.2, PyCUDA releases the Global Interpreter Lock while it waits for CUDA operations to finish. As of version 0.93, PyCUDA actually works when used together with threads. Also see the threading question below.
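The first approach can be sketched with a small context manager that keeps pushes and pops balanced. The names `activated` and `run_on_each` are hypothetical helpers, not part of PyCUDA; they only rely on the `push` and `pop` methods mentioned above:

```python
from contextlib import contextmanager


@contextmanager
def activated(ctx):
    """Make ctx the current CUDA context for the duration of the block."""
    ctx.push()
    try:
        yield ctx
    finally:
        # pop removes whichever context is on top of the stack, which is
        # ctx as long as pushes and pops stay balanced.
        ctx.pop()


def run_on_each(contexts, work):
    """Call work(ctx) once per context, with that context active."""
    results = []
    for ctx in contexts:
        with activated(ctx):
            results.append(work(ctx))
    return results
```

With real devices, the contexts could come from `pycuda.driver.Device(i).make_context()` after `pycuda.driver.init()`. Since `make_context` leaves the new context active, each one would need a `pop` before being handed to `run_on_each`.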

My program terminates after a launch failure. Why?

This should not be an issue any more with 0.93 and later, where cleanup failures have been downgraded to warnings.

What's going on here? First of all, recall that launch failures in CUDA are asynchronous. So the actual traceback does not point to the failed kernel launch, it points to the next CUDA request after the failed kernel.

Next, as far as I can tell, a CUDA context becomes invalid after a launch failure, and all following CUDA calls in that context fail. Now, that includes cleanup (see the cuMemFree in the traceback?) that PyCUDA tries to perform automatically. Here, a bit of PyCUDA's C++ heritage shows through. While performing cleanup, we are processing an exception (the launch failure reported by cuMemcpyDtoH). If another exception occurs during exception processing, C++ gives up and aborts the program with a message.

In principle, this could be handled better. If you're willing to dedicate time to this, I'll likely take your patch.
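Because the failure only surfaces at the next CUDA request, one way to narrow it down while debugging is to synchronize right after each launch, so that the exception is raised next to the launch that caused it. A minimal sketch, in which `launch_and_sync` is a hypothetical helper (not part of PyCUDA) and `synchronize` defaults to `pycuda.driver.Context.synchronize`:

```python
def launch_and_sync(kernel, *args, synchronize=None, **launch_kwargs):
    """Launch kernel, then block until the GPU has finished.

    Any asynchronous launch failure is reported by the synchronize
    call, i.e. directly after the launch that triggered it.
    """
    if synchronize is None:
        # Imported lazily so the helper can be defined without a GPU.
        import pycuda.driver as cuda
        synchronize = cuda.Context.synchronize
    kernel(*args, **launch_kwargs)
    synchronize()
```

Synchronizing after every launch removes any overlap between host and GPU work, so this is meant for debugging only.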

Are the CUBLAS APIs available via PyCUDA?

No. I would be more than happy to make them available, but that would be mostly either-or with the rest of PyCUDA, because of the following sentence in the CUDA programming guide:

[CUDA] is composed of two APIs:

A low-level API called the CUDA driver API,

A higher-level API called the CUDA runtime API that is implemented on top of the CUDA driver API. These APIs are mutually exclusive: An application should use either one or the other.

PyCUDA is based on the driver API. CUBLAS uses the high-level API. One can violate this rule without crashing immediately, but sketchy stuff does happen. Instead, for BLAS-1 operations, PyCUDA comes with a class called GPUArray that essentially reimplements that part of CUBLAS.

If you dig into the history of PyCUDA, you'll find that, at one point, I did have rudimentary CUBLAS wrappers. I removed them because of the above issue. If you would like to make CUBLAS wrappers, feel free to use these rudiments as a starting point. That said, Arno Pähler's python-cuda has complete ctypes-based wrappers for CUBLAS. I don't think they interact natively with numpy, though.

I've found some nice undocumented function in PyCUDA. Can I use it?

Of course you can. But don't come whining if it breaks or goes away in a future release. Being open-source, neither of these two should be show-stoppers anyway, and we welcome fixes for any functionality, documented or not.

The rule is that if something is documented, we will in general make every effort to keep future versions backward compatible with the present interface. If it isn't, there's no such guarantee.

Does PyCUDA automatically activate the right context for the object I'm talking to?

No. It does know which context each object belongs to, and it does implicitly activate contexts for cleanup purposes. Since I'm not entirely sure how costly context activation is supposed to be, PyCUDA will not juggle contexts for you if you're talking to an object from a context that's not currently active. Here's a rule of thumb: As long as you have control over invocation order, you have to manage contexts yourself. Since you mostly don't have control over cleanup, PyCUDA manages contexts for you in this case. To make this transparent to you, the user, PyCUDA will automatically restore the previous context once it's done cleaning up.

How does PyCUDA handle threading?

As of version 0.93, PyCUDA supports threading. There is an example of how this can be done in examples/multiple_threads.py in the PyCUDA distribution. (The current git repo does not have examples/multiple_threads.py. Here is a direct link to a previous version: http://git.tiker.net/pycuda.git/blob_plain/72aae2da98d034203ebb243ec621247ab8a60341:/examples/multiple_threads.py) When you use threading in PyCUDA, you should be aware of one peculiarity, though. Contexts in CUDA are a per-thread affair, and as such all contexts associated with a thread as well as GPU memory, arrays and other resources in that context will be automatically freed when the thread exits. PyCUDA will notice this and will not try to free the corresponding resource--it's already gone after all.

There is another, less intended consequence, though: If Python's garbage collector finds a PyCUDA object it wishes to dispose of, and PyCUDA, upon trying to free it, determines that the object was allocated outside of the current thread of execution, then that object is quietly leaked. This properly handles the above situation, but it mishandles a situation where:

You use reference cycles in a GPU driver thread, necessitating the GC (over just regular reference counts).

You require cleanup to be performed before thread exit.

You rely on PyCUDA to perform this cleanup.

To entirely avoid the problem, do one of the following:

Use multiprocessing instead of threading.

Explicitly call free on the objects you want cleaned up.
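The second option can be wrapped so that the explicit free calls are not forgotten. A minimal sketch: `freed_on_exit` is a hypothetical helper (not part of PyCUDA) that works with any objects exposing a `free()` method, such as the allocations returned by `pycuda.driver.mem_alloc`:

```python
from contextlib import contextmanager


@contextmanager
def freed_on_exit(*allocations):
    """Yield the allocations, then free each one explicitly.

    Freeing happens in the thread that runs the with-block, so the
    cleanup does not depend on the garbage collector.
    """
    try:
        yield allocations
    finally:
        for alloc in reversed(allocations):
            alloc.free()
```

Inside a worker thread this might be used as `with freed_on_exit(cuda.mem_alloc(nbytes)) as (buf,): ...`, before the thread's context is popped.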

How do I specify the correct types when calling and preparing PyCUDA functions?

When calling a CUDA kernel directly (via __call__) or when "preparing" the function, the following mapping between C data types and CUDA types holds:

Handles returned by device memory allocation functions can be cast to numpy.intp. PyCUDA does not distinguish between pointed-to types, so passing a handle to a float * for an argument that expects an int * is not caught and results in undefined kernel behavior.

Where the first tuple is the dimensions of your block grid, and the remaining arguments are your kernel arguments. Observe that there's no need to explicitly cast the arguments to a prepared invocation.

On the other hand, if you'd like to go with direct (i.e. unprepared) invocation, your call should include explicit casts:
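A minimal sketch of such a direct call, assuming a hypothetical kernel with the signature `(float *data, int n, float scale)`. Here `func` stands for the object returned by `SourceModule.get_function`, and the block and grid sizes are placeholders:

```python
import numpy as np


def launch_direct(func, data_gpu, n, scale):
    """Launch func unprepared, casting each scalar to its C type."""
    func(
        data_gpu,            # device pointer: passed through unchanged
        np.int32(n),         # C int
        np.float32(scale),   # C float
        block=(256, 1, 1),   # placeholder launch configuration
        grid=(1, 1),
    )
```

The casts are what tell PyCUDA how many bytes to pack for each scalar; a plain Python int or float does not say which C type it is meant to be.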

Is it possible to use cuda-gdb with PyCUDA?

Yes! As of version 0.94.1, support for this is built right into PyCUDA. This transcript shows what you need to do to debug the demo.py script in the PyCUDA examples folder. Note that you need to start Python with the extra switch -m pycuda.debug.

We are now debugging inside the kernel, where we may step, examine data, or set further breakpoints. See the gdb manual and Nvidia's documentation for details.

Note that you cannot debug the host-side code using gdb. I recommend pudb for that job.

Is it possible to profile CUDA code with PyCUDA?

Yes! When you set the environment variable CUDA_PROFILE to 1, CUDA creates log files called cuda_profile_NN.log which contain performance information about the kernels that were run. You must run the Python script in the following way:

CUDA_PROFILE=1 python scriptname.py
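The same can be done from a Python launcher. A minimal sketch using only the standard library; `profiling_env` and `run_with_cuda_profile` are hypothetical helpers, and the script path is whatever you would otherwise pass on the command line:

```python
import os
import subprocess
import sys


def profiling_env(base=None):
    """Return a copy of the environment with CUDA_PROFILE set to 1."""
    env = dict(os.environ if base is None else base)
    env["CUDA_PROFILE"] = "1"
    return env


def run_with_cuda_profile(script, *script_args):
    """Run script in a child Python process with profiling enabled."""
    cmd = [sys.executable, script, *script_args]
    return subprocess.run(cmd, env=profiling_env(), check=True)
```

Setting the variable for a child process keeps it out of the launcher's own environment, so only the profiled run is affected.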

After running this, the current directory will contain a file cuda_profile_0.log that lists the kernels that were run and the time each one took, both on the CPU and on the GPU: