I'm trying to access the memory associated with a GPUArray from within a
compiled extension built using Cython's memoryview feature. According to the
Cython documentation, it is possible to access C arrays using this feature;
however, when I attempt to do so using a GPUArray's pointer with a GPU (and
version of CUDA) that supports UVA, the extension segfaults. Does anyone have
any thoughts as to why this might be occurring?
Here is the Cython code for a sample extension that uses memoryviews:
cdef class Wrapper:
    cdef long n
    cdef void *ptr
    cdef double[:] view

    def __cinit__(self, unsigned long long addr, long n):
        self.n = n
        self.ptr = <void *>addr
        # Reinterpret the raw address as a typed memoryview of n doubles:
        self.view = <double[:n]>self.ptr

    def __getitem__(self, int i):
        return self.view[i]

    def __setitem__(self, int i, double val):
        self.view[i] = val
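For reference, I build the extension with a minimal setup.py along these
lines (assuming the source file is named my_ext.pyx):

# setup.py -- minimal build script; run with: python setup.py build_ext --inplace
from distutils.core import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("my_ext.pyx"))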
When the resulting extension (built with Cython 0.19.1 and Python 2.7.4 on
Linux) is passed the pointer associated with a numpy array of doubles as
follows, I can use it to access the array's contents:
import numpy as np
import my_ext
x = np.arange(10, dtype=np.double)
w = my_ext.Wrapper(x.ctypes.data, x.size)
print w[0] # prints 0.0
w[0] = 100.0
print w[0] # prints 100.0
Attempting to do so with a GPUArray results in a segfault:
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
x_gpu = gpuarray.arange(10, dtype=np.double)
w = my_ext.Wrapper(x_gpu.ptr, x_gpu.size)
print w[0] # segfaults
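Accessing the same data through PyCUDA's own API works fine, so the pointer
itself appears to be valid:

print x_gpu.get()[0]  # prints 0.0 (get() copies device memory to the host)
print hex(x_gpu.ptr)  # the raw device pointer handed to the wrapper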
I'm using pycuda 2013.1.1 built with CUDA 5.0.35.
--
Lev Givon
http://www.columbia.edu/~lev/
http://lebedov.github.com/

Looking through the pycuda source, it seems like the GIL is being held when
performing some of the memcpy operations: memcpy_htod, memcpy_dtoh, and
their async counterparts.
We noticed that host copies were taking quite a bit of time and were
hoping to run some background operations while they are in flight.
Is this on purpose, or could we safely change them to release the GIL? I
was alternatively thinking of doing:
x = ndarray(...)
GPUArray.to_gpu_async(x, stream)
while not stream.is_done():
    time.sleep(0)
But this is a bit convoluted.
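For concreteness, here is roughly the overlap we are after (a sketch; I'm
assuming pagelocked host memory here, since device copies are only truly
asynchronous from pinned buffers):

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray

# Pagelocked (pinned) host memory, so the H2D copy can overlap host work.
x = cuda.pagelocked_empty(10 ** 7, dtype=np.double)
x[:] = 1.0

stream = cuda.Stream()
x_gpu = gpuarray.to_gpu_async(x, stream=stream)

# ... host-side work would go here while the copy is in flight,
# which only helps if the copy call itself doesn't hold the GIL ...

stream.synchronize()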
Thanks,
--
R

Andreas Baumbach <healther.astro(a)gmail.com> writes:
> Hi,
>
> after I finally managed to subscribe to the mailing list, I ran into
> another issue. I'm still trying to implement a conjugate gradient method.
> It already works, but the speedup vs. SciPy with CUBLAS optimisation is
> only a factor of 4.
>
> Basically I need to store a scalar value (one double) on the GPU (as
> opposed to in main RAM, as it is now) and pass this value as an argument
> to the mul_add function.
> I tried using one-entry GPUArrays, but the only way I got this to work is
> via gpuarray.get(), which transfers the value to the CPU only to put it
> back on the GPU, giving no speedup at all. Is there any way to get this
> working?
https://github.com/inducer/pycuda/blob/master/pycuda/sparse/cg.py
:)
Note how it gets around the problem you encountered by making its own
custom "lc2" (=linear combination of two vectors) kernel. It also binds
the scalar coefficients to texture references, which on GT200 and
earlier was a reasonable way of doing scalar broadcast. (Fermi and newer
have "ldu" instructions ("load uniform") that should make this
redundant).
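If you'd rather avoid textures, here is a rough, untested sketch of the
same idea using an ElementwiseKernel that reads the coefficient through a
device pointer (all names here are illustrative):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElementwiseKernel

# Computes out[i] = x[i] + a[0]*y[i]; the coefficient 'a' never leaves the GPU.
lc2_like = ElementwiseKernel(
    "double *out, double *x, double *y, double *a",
    "out[i] = x[i] + a[0] * y[i]",
    "lc2_like")

x = gpuarray.arange(10, dtype=np.double)
y = gpuarray.arange(10, dtype=np.double)
a = gpuarray.to_gpu(np.array([2.0]))  # one-entry GPUArray holding the scalar
out = gpuarray.empty_like(x)
lc2_like(out, x, y, a)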
HTH,
Andreas

Conjugate gradient---great application!
How are you currently providing this scalar to your kernel invocation?
Why doesn't just passing it in as a scalar value work (casting it to a
numpy.float32() or float64() if needed)? I mean, do you explicitly need the
scalar to live in GPU memory where all threads can see it and update it? If
this is the case (i.e., your scalar needs to be writeable by any thread),
you could always mem_alloc() four or eight bytes and provide your kernels
with a pointer to this---is there a problem with this approach too?
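Roughly like this (an untested sketch; your kernel would then take a
double* argument):

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda

# Allocate 8 bytes of device memory for a single double and initialize it.
scalar_ptr = cuda.mem_alloc(np.dtype(np.float64).itemsize)
cuda.memcpy_htod(scalar_ptr, np.array([3.14], dtype=np.float64))

# scalar_ptr can now be passed wherever a kernel expects a "double *";
# any thread can read (and write) the value through it.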
Best,
Ahmed
On Tue, Jul 16, 2013 at 3:18 PM, Andreas Baumbach
<healther.astro(a)gmail.com> wrote:
> Hi,
>
> after I finally managed to subscribe to the mailing list, I ran into
> another issue. I'm still trying to implement a conjugate gradient method.
> It already works, but the speedup vs. SciPy with CUBLAS optimisation is
> only a factor of 4.
>
> Basically I need to store a scalar value (one double) on the GPU (as
> opposed to in main RAM, as it is now) and pass this value as an argument
> to the mul_add function.
> I tried using one-entry GPUArrays, but the only way I got this to work is
> via gpuarray.get(), which transfers the value to the CPU only to put it
> back on the GPU, giving no speedup at all. Is there any way to get this
> working?
>
> Cheers,
> Andi

Hi,
after I finally managed to subscribe to the mailing list, I ran into
another issue. I'm still trying to implement a conjugate gradient method.
It already works, but the speedup vs. SciPy with CUBLAS optimisation is
only a factor of 4.
Basically I need to store a scalar value (one double) on the GPU (as
opposed to in main RAM, as it is now) and pass this value as an argument
to the mul_add function.
I tried using one-entry GPUArrays, but the only way I got this to work is
via gpuarray.get(), which transfers the value to the CPU only to put it
back on the GPU, giving no speedup at all. Is there any way to get this
working?
Cheers,
Andi