This is probably a noob question, as I'm new to OpenCL programming. I'm currently using the Ruby bindings and they're working great so far. I'm writing an IFS fractal generator and having some problems with it. I'm using a GeForce 9400 GT for the calculations, and the card is also shared with X on a Fedora 10 box. The program runs fine about 10 to 12 times, then I start getting enqueue errors: -5, CL_OUT_OF_RESOURCES. X then starts having problems and random pixels on my display start dropping out. Moving a window is a cool effect because it just sort of dissolves. :) Eventually X crashes and I have to reboot. :(

I was thinking the problem was either the card overheating, which doesn't appear to be the case because the errors are still there even after I let the card sit for a while, or that I'm filling the card up with my queued events. Would using clReleaseCommandQueue help my problem? I think my biggest problem is that I've been spoiled by Ruby, which takes care of all these issues, and it's been a while since I programmed in C. :)

Thanks,

Grimm

dbs2

02-24-2010, 05:39 AM

The most likely problems are 1) a bug in the allocation code in your video card's driver, or 2) a bug in your code that is writing out of bounds of your OpenCL objects and corrupting other data. Neither of these is particularly easy to debug, unfortunately.

grimm

02-24-2010, 02:34 PM

Thanks, I will double-check my code and make sure that nothing is writing where it shouldn't be. I did change the local_work_size to 32 (I had it set to 16) and now it appears to be much better behaved. I don't know if that should make any difference, though?

Grimm

grimm

02-24-2010, 05:39 PM

OK, I think I know why it's crashing. I was setting the global_work_size too large (32000) and the card was running out of memory. I have changed the global size to 3200 and increased the number of internal loops that each kernel does. Now it works with no problems. :)

Grimm

dbs2

03-01-2010, 01:50 PM

The global work size should be basically unlimited. I believe there is a bug/feature in the Nvidia drivers that limits it to 65k or something ridiculously small, though. (The Mac OS X driver does not have this limit on Nvidia hardware, so it's clearly not a hardware limitation.)

Kratzy974

03-02-2010, 01:20 AM

I work on Windows 7, and I get -5 when I have many loops (big loops inside big loops) on NVidia hardware. My current code (very small, with deep loops) can run on 4 cores (of 16), but if I use more I get the -5.
I don't use __local, just __global, and only small memory areas (only one reaches 1.1 MB; all the others together are less than 200k).
When I removed some inner loops, it also became possible to run on more cores.

Does anyone have an idea where the problem might lie? Or is this a known problem on NVidia hardware? (9600 GT and 9600 GX2)

grimm

03-02-2010, 04:11 PM

I have noticed the same issue. I thought that the private memory space on the card was 32 bits, so you would think you could have loops up to that size? Other ideas I had were that the card is timing out kernels that run too long, or that because I'm sharing the card with the OS there are limits to how much of it I can use. I'm going to double-check my Nvidia driver and make sure I have the latest. ?!?

Grimm

Kratzy974

03-02-2010, 11:41 PM

I have the latest driver installed (196 Beta), and also the current Nexus.
The loops are 100 * 12000 * 500 = 60 million iterations within one core.
When using more cores, each one just gets a part of this loop size; the first loop (100) is split up across the cores.
The card is not shared (the desktop is on my ATI card).
All variables are initialized beforehand, even the loop values, so there should be no memory allocation inside a loop.

I wonder about the time needed. On the CPU this takes 1.1 sec; on the NVidia GPU it took 400 sec.

grimm

03-03-2010, 02:12 PM

Those are some pretty big loops. :) The individual cores on your GPU are almost always going to be slower than your CPU's. It sounds like you might want to try breaking your code into smaller chunks to get better performance. The difference in my loops is very small: a loop of 3200 will work, anything larger dies with a -5. :( The more I think about it, the more I feel it's a timeout issue. The GPU by design needs to support as many programs as you have running on the system, any of which could call on the GPU to do work. One way to guarantee this is to not allow any one program to hog the GPU, and to kill anything that tries. I suspect that even though you are not using your GPU for other stuff, the driver works the same in both cases.

Grimm

Kratzy974

03-03-2010, 02:40 PM

That would be pretty bad. I could break it down (throw away the 100x loop), but the other loops are really needed, and I need to run this often. It's a small calculation that needs to be done very many times (the first 100x loop is just an example and needs to be replaced with 15k and more).

When I use 1 core, I don't get a timeout, even though it takes 400 secs (6-7 min).
I expected the CPU to be 2-5 times faster than the GPU on one core, but not 400 times.

Still, the first -5 comes up when using clWaitForEvents. It happens very quickly after the call (directly). After that I can't call the kernel again without getting -5.