Cuda grid size limitations

Are there limitations as to what I can set the grid size of a CUDA kernel to be? I ran into a problem where kernels were not launching with a grid size of 33 x 33 but were able to launch when the grid size was 32 x 32. Is there any reason for this to occur? Or is it likely that changing the number of blocks from 32 x 32 to 33 x 33 broke some other constraint?

Because of this bug you are actually trying to launch a grid of size blockSize, and blocks of size gridSize. It would appear that the maximum size of a block on your GPU is 1024 threads, so launching blocks of 33x33 fails.

Firstly, be really sure this is what you want to do. Without describing the manipulations you want to do, it's hard to comment on this, but be aware that matrix multiplication is an n-cubed operation. If your manipulations are not the same complexity, chances are you'll do better simply using...

So, here's the solution I came up with after simplifying my case. There was a problem with memory usage - I tried to store / read different amount of data than I claimed to use when allocating it. I hope it will be helpful for anyone in the future: #include...

Going through your package there are multiple aspects that need to be changed. You shouldn't use a 'Makefile' but a 'Makevars' file instead to improve compatibility for multiple architecture builds. Try to follow the standard variable names (e.g. CPPC should be CXX), this makes everything play together much better. Don't...

The performance difference you observe is mostly due to the increased instruction overhead in the pitched memory indexing scheme. Because your array size is a large power of two in the major direction, it is very likely that the pitched array allocated with cudaMalloc3D is the same size as the...

This is not valid CUDA code: extern "C" void someCUDAcode() { int a; CUDA_CALL(cudaMalloc((void**) &a, sizeof(int))); mykernel<<<1, 1>>>(1); // CUDA_CALL(cudaFree(&a)); } When we want to do a cudaMalloc operation, we use pointers in C, not ordinary variables, like this: int *a; CUDA_CALL(cudaMalloc((void**) &a, sizeof(int))); When we want to free a...

OK, I think I fixed the problem. The conftest.c seems to be looking for cuda.h in /usr/include, instead of the supposed /usr/local/cuda/include. The problem is solved once I created a soft link of cuda.h and cuda_runtime_api.h.

Here val deviceOutput = new CUdeviceptr() cuMemAlloc(deviceOutput, SI) you are allocating SI bytes - which is 4 bytes, as the size of one int. Writing more than 4 bytes to this device pointer will mess up things. It should be cuMemAlloc(deviceOutput, SI * numElements) And similarly, I think that the...

The problem is that f_obj within struct FuncEvalF is a const FunctionObj<T>&. It is instantiated as a temporary on the host FunctionObj<float>(), but the reference to it is not valid anymore later on. One way to fix this is to create a copy of it instead of holding a reference...

As @JaredHoberock pointed out, probably the key issue is that you are trying to compile a .cpp file. You need to rename that file to .cu and also make sure it is being compiled by nvcc. After you fix that, you will probably run into another issue. This is not...

The example implicitly declares a binary floating point format, which has an arbitrary precision exponent, but only 2 bits in the mantissa. All numbers are of format 1.xx * 2^n. When one performs floating point addition, one must de-normalize or scale the arguments to have the same exponent. 0.25 =...

Thrust doesn't resize vectors as part of any algorithm call. The size of vectors going into a thrust algorithm will be exactly the size of those vectors coming out of the algorithm. shrink_to_fit also has no impact on a vector's size, but it may impact the capacity, which has to...

The pointer has to be created (i.e. allocated) with cudaHostAlloc, even on integrated systems like Jetson. The reason for this is that the GPU requires (zero-copy) memory to be pinned, i.e. removed from the host demand-paging system. Ordinary allocations are subject to demand-paging, and may not be used as zero-copy...

I'm afraid MVAPICH does not support yet using multiple GPUs in the same process (source: mailing list). Advanced memory transfer operations require storing device-specific structures, so unless there is explicit support for multiple devices, I'm afraid there is no way to make your code run. On the other side, you...

I believe the details are that for each __device__ variable, cudafe creates a normal global variable as in C and also a CUDA-specific PTX variable. The global C variable is used so that the host program can refer to the variable by its address, and the PTX variable is used...

304.xx is a driver that will support CUDA 5 and previous (does not support newer CUDA versions.) If you want to reinstall ubuntu to create a clean setup, the linux getting started guide has all the instructions needed to set up CUDA 7 if that is your intent. I believe...

This is normal, when the driver packaged in the CUDA installer is "older" than your GPU. You should retain your current GPU driver, and go ahead with the CUDA toolkit installation, but de-select the option to install the GPU driver. Your existing driver should work fine. ...

Is that better to create grids with blocks, containing 128 threads each? Will such code run faster? Optimal block size depends on the problem. It's a idea for your block size to be a multiple of the warp size. Other factors are occupancy considerations, and shared memory usage. Does...

q1- lets say i have copy one array onto device through stream1 using cudaMemCpyAsync; would i be able to access the values of that array in different stream say 2? Yes, the array da is accessible in both kernels you have shown. However, an important question is whether or...

As hinted by the commenter, I’ve tried creating a single instance of CudaDirectXInteropResource along with the D3D texture. It worked. It’s counter-intuitive and undocumented, but it looks like cuGraphicsUnregisterResource destroys the newly written data. At least on my machine with GeForce GTX 960, Cuda 7.0 and Windows 8.1 x64. So,...

What you are observing is probably an artifact of running the code on a Windows WDDM platform. The WDDM subsystem has a lot of latency which other platforms are not hampered by, so to improve overall performance, the CUDA WDDM driver performs command batching. This can interfere with the expect...

Thrust interprets ordinary pointers as pointing to data on the host: thrust::reduce_by_key(d_list, d_list+n, d_ones, C, D,cmp); Therefore thrust will call the host path for the above algorithm, and it will seg fault when it attempts to dereference those pointers in host code. This is covered in the thrust getting started...

The reason the error doesn't occur on this line: REAL tmp = unew_row[j]; // no error on this line is because the compiler is optimizing that line out. It doesn't do anything useful, and so the compiler completely eliminates it. The compiler warning: xxx.cu(87): warning: variable "tmp" was declared but...

This is the answer given by njuffa in the comments: ...The content of GPU memory doesn't change between invocations of the application. In case of a program failure, we would want to avoid picking up good data from a previous run, which may lead (erroneously) to a belief that the...

We have seen three different codes in this question in the first 24 hours of its existence. This answer addresses the final evolution. The underlying problem you are having is with this type of operation: cudaMalloc(&p_gpu, sizeof(Pass)); cudaMalloc(&p_gpu -> pass, 5 * sizeof(int)); The second cudaMalloc is illegal. This is...

I'm pretty sure this is not correct: fftKernel<<<BLOCK_SIZE, gridSize>>>(...); The first kernel configuration parameter should be the grid dimensions. The second should be the block dimensions. I was able to reproduce the error you indicate in the code you have posted. When I reverse those two parameters: fftKernel<<<gridSize, BLOCK_SIZE>>>(...); the...

What you have appeared to have overlooked is that copy_if returns an iterator which points to the end of the copied data from the stream compaction operation. So all that is required is this: //copies to device thrust::device_vector<int> d_src = h_src; //Result vector thrust::device_vector<int> d_res(d_src.size()); //Copy non-zero elements from d_src...

Say I launch a child grid from one thread in a block at threadIdx.x==0. Can I assume that all other threads in the parent grid have finished executing up to the point I launched the child grid as well? No. You can make no assumptions about the state of...

You can use a simple loop, and specify the threads you want to do the work in each iteration. Something like: for (int z = 0; z < zmax; z++) { if (threadIdx.z == z) { //do whatever with x and y } __syncthreads(); } In each iteration, threads with...

(The author of JCuda here (not "JCUDA", please)) As mentioned in the forum post linked from the comment: It is not impossible to use structs in CUDA kernels and fill them from JCuda side. It is just very complicated, and rarely beneficial. For the reason of why it is rarely...

As you've already pointed out, this is not correct: template<typename Op> I believe it should be: template<Op op> The following code seems to work correctly for me (removing the texturing since it's extraneous to the issue): $ cat t755.cu #include <stdio.h> #include <math.h> typedef float(*Op)(float, float); __device__ float mymax(float a,...

This code will work for very specific dimensions but not for others. It will work for square matrix multiplication when width is exactly equal to the product of your block dimension (number of threads - 20 in the code you have shown) and your grid dimension (number of blocks -...

It seems that you've made at least 1 error in transcribing the code from the GPU Gems 3 chapter into your kernel. This line is incorrect: temp[bi] += g_idata[ai]; it should be: temp[bi] += temp[ai]; When I make that one change to the code you have now posted, it seems...

That depends on the CUDA Version used I think. Compute capability(version) V1.0 V1.2 V2.x V3.0-X.X Maximum number of resident threads per multiprocessor 768 1024 1536 2048 Amount of local memory per thread 16 KB 512 KB Maximum number of threads per block 512 1024 If found this peace of Information...

The PyCUDA implementation directly follows the CUDA driver API, so you can use any driver API code you can find as a model, but there are two things required to make this work: Use the module function module.get_global() to retrieve the device pointer to the symbol within the compiled source...

This is the standard CUDA idiom for determining the minimum number of blocks in each dimension (the "grid") that completely cover the desired input. This could be expressed as ceil(nx/block.x), that is, figure out how many blocks are needed to cover the desired size, then round up. But full floating...

There are various atomic functions which support atomic operations on unsigned long long int (ie. a 64-bit unsigned integer), such as atomicCAS, atomicExch and atomicAdd. And if you have a cc3.5 or higher GPU you have even more options. Referring to the documentation on clock64(): long long int clock64(); when...

1.cusolverSpScsrlsvchol returns wrong results for x: 1 3.33333 2.33333 1 3.33333 2.33333 1.33333 1 2.33333 1.33333 0.666667 1 1 1 1 1 You said: A is positive definite and symmetric. No, it is not. It is not symmetric. cusolverSpcsrlsvqr() has no requirement that the A matrix be symmetric. cusolverSpcsrlsvchol()...

CUBLAS library functions can be called from device code. Thrust algorithms can be called from device code. Various CURAND functions can be called from device code. Other libraries that are part of the CUDA toolkit at this time (i.e. CUDA 7) -- CUFFT, CUSPARSE, CUSOLVER -- can only be used...

Consider one warp of the thread block finishing the first iteration and starting the next one, while other warps are still working on the first iteration. If you don't have __syncthreads at label sync2, you will end up with this warp writing to shared memory while others are reading from...

C++11 support is added officially in CUDA 7.0. And you need GCC version 4.7 or later to have C++11 support. See details here: http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-compiler-new-features

Effective load throughput is not the only metric that determines the performance of your kernel! A kernel with perfectly coalesced loads will always have a lower effective load throughput than the equivalent, non coalesced kernel, but that alone says nothing about its execution time: in the end, the one metric...

CUB is an evolving library, which means that as new features are introduced in CUDA, CUB may evolve in newer releases to take advantage of those. If you then attempt to use a newer CUB release with an older CUDA version, you may run into compatibility issues. This usage by...

The only actual problem in your code is here: cudaMalloc( &d_x,sizeof(d_x) ); sizeof(d_x) is just the size of a pointer. You can fix it like this: cudaMalloc( &d_x,sizeof(x) ); If you want to find out if a CUBLAS API call is failing, then you should check the return code of...

I created a simple example based on thrust. It uses column-major order to store the matrices in a thrust::device_vector. It should scale well with larger row/column counts. Another approach could be based off the thrust strided_range example. This example does what you want (fill the diagonals based on the input...

No, this won't be possible. K20m can be used (with some effort) with OpenGL graphics on Linux, but at least up through windows 8.x, you won't be able to use K20m as a D3D device in Windows. The K20m does not publish a VGA classcode in PCI configuration space, which...