The Khronos Group - a non-profit industry consortium that develops, publishes, and promotes open, royalty-free media authoring and acceleration standards for desktop and handheld devices, combined with conformance qualification programs for platform and device interoperability.


Strange issue when dealing with Bitmasks

EDIT:
Solved

I am in the process of porting a graph-based genetic algorithm and I keep coming across a strange problem. I generate my chromosomes on the CPU and offload them to the GPU. One of the steps of my fitness function is to determine how many bits are set to 1 (which would indicate the inclusion of a node). When I try to verify the results on the CPU, the numbers do not match up. First I figured my local caching was the issue, so I switched to using global memory, to no avail. Next I figured maybe the integer modulus and division were the problem, so I tried re-implementing them using floating-point operations and casts. Still not working. It seems to me that the chromosome isn't being copied properly for gid > 0. Has anyone used bitmasks effectively on the GPU?
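For reference, the counting step described above can be sketched as a minimal CPU-side reference implementation. This assumes the chromosome is stored as an array of uchar bytes; the function name count_set_bits is illustrative, not the poster's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference: count the bits set to 1 in a chromosome stored as a
 * contiguous array of bytes, by shifting each byte down and testing
 * its low bit. The GPU result should match this exactly. */
static int count_set_bits(const unsigned char *chrom, size_t num_bytes)
{
    int count = 0;
    for (size_t i = 0; i < num_bytes; i++) {
        unsigned char b = chrom[i];
        while (b) {
            count += b & 1u;
            b >>= 1;
        }
    }
    return count;
}
```

Comparing the kernel's per-chromosome count against this host-side function for every gid (not just gid = 0) is one way to localize where the copy goes wrong.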

As a note: this is my first attempt at using OpenCL and I love the power; I just need to learn all the tricks of the trade.
If you have any questions or need more information please let me know!

Note: it seems that the first work-item in the group calculates all of its sizes properly, but every other work-item is off. Might this have to do with memory access?

Also, I've tried copying the chromosomes back after they are written and recalculating; everything is copied correctly. So the problem lies either in the conditional being executed incorrectly, for whatever reason, for gid > 0, or in chrome_local not holding the correct data for gid > 0. The address gets calculated properly as far as I know. I'll try eliminating the conditional using a lookup table; if that doesn't fix it, chrome_local must not be copied correctly. Otherwise I guess I'm just crazy.

Okay, now I've changed the code to use a lookup table and it still doesn't work...

I don't see the point in copying InputChroms array into a local or global buffer. You can directly access InputChroms in your loop since it is already in global memory.
You also don't give information about the way you split your kernel execution into work-groups. This can explain potential synchronization troubles.

My apologies. I copy the chromosome to local memory because I access it a lot in subsequent calculations (hundreds of times for each chromosome). I have tried, however, to use global memory (InputChroms) for this calculation, and I get the same faulty result for gid > 0.

Now this is the part I am unsure of. My current global NDRange is set to 2 (eventually I will increase this once I figure out the issue) and I omit the local range (which I guess means the driver will decide how to split up the job locally).

So might this be an issue with my memory access?
If there are any good resources (besides the API documentation, which didn't seem too informative to me), let me know.
Thanks for the help by the way. I've been tearing my hair out trying to figure this out.

Each input has the following format (most are 2D or 3D arrays compressed into a 1D piece of contiguous memory):

InputGraph[GRAPH_SIZE, maxDegree]
-This holds an adjacency list with edges for each vertex. The first index is the vertex id, and the second index is the edge
-I access this as InputGraph[vert*maxDegree+edge], where vert and edge are the indexes

InputChroms[NUM_WORK_ITEMS, POP_SIZE, CHROM_SIZE_BYTES]
-This array stores contiguous chromosomes (each CHROM_SIZE_BYTES uchars long)
-Each work item is allocated POP_SIZE number of chromosomes
-I access this as InputChroms[(gid*POP_SIZE+cur_chrom)*CHROM_SIZE_BYTES+byteOffset]
**where gid is the id of the work item
**(gid*POP_SIZE*CHROM_SIZE_BYTES) is the offset for the beginning of the current work item's block of chromosomes
**(cur_chrom*CHROM_SIZE_BYTES) is the offset into the work item's block of chromosomes for the chromosome of interest
**I factored this to get ((gid*POP_SIZE+cur_chrom)*CHROM_SIZE_BYTES) as the offset to the beginning of the chromosome
**byteOffset is the particular byte I want to access (if I want the 30th bit in a chromosome, it is in byte 30/8 = 3, zero-based)
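The two flattened-index computations above can be written out as small helpers. This is a sketch only: the constant values and the helper names graph_index / chrom_index are illustrative, not the poster's actual configuration.

```c
#include <assert.h>

/* Illustrative sizes, not the poster's real values. */
enum { MAX_DEGREE = 4, POP_SIZE = 8, CHROM_SIZE_BYTES = 16 };

/* InputGraph[GRAPH_SIZE, maxDegree] flattened to 1D:
 * row = vertex id, column = edge slot. */
static int graph_index(int vert, int edge)
{
    return vert * MAX_DEGREE + edge;
}

/* InputChroms[NUM_WORK_ITEMS, POP_SIZE, CHROM_SIZE_BYTES] flattened to 1D.
 * gid*POP_SIZE*CHROM_SIZE_BYTES skips to this work-item's block,
 * cur_chrom*CHROM_SIZE_BYTES skips to the chromosome of interest,
 * and byteOffset picks the byte (bit n lives in byte n/8, zero-based). */
static int chrom_index(int gid, int cur_chrom, int byteOffset)
{
    return (gid * POP_SIZE + cur_chrom) * CHROM_SIZE_BYTES + byteOffset;
}
```

Printing these host-computed offsets next to the offsets the kernel computes for gid > 0 would confirm whether the address arithmetic really is the same on both sides.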

InputQueues[NUM_WORK_ITEMS], which I access using InputQueues[gid]
-This is an array of queues, one for each work-item (used later to perform a BFS)

OutputFitness[NUM_WORK_ITEMS, POP_SIZE]
-This is the array I write back to after evaluating the fitness (and I am currently using it to get the GPU-calculated sSize that is causing me issues)
-OutputFitness[gid*POP_SIZE+cur_chrom]

LookupTable[256]
-This is something I just added; it stores the number of 1 bits for every number from 0 to 255, so I can count 8 bits at once in the chromosome. All my tests show that the counting still happens correctly (for the first work-item, gid = 0), no matter how many chromosomes this one work-item actually counts.
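The lookup-table approach described above can be sketched as follows. This is a CPU-side sketch, assuming the table is built on the host and the names build_lookup / count_with_table are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* 256-entry table: lookup[i] = number of 1 bits in byte value i. */
static unsigned char lookup[256];

static void build_lookup(void)
{
    for (int i = 0; i < 256; i++) {
        int c = 0;
        for (int b = i; b; b >>= 1)
            c += b & 1;
        lookup[i] = (unsigned char)c;
    }
}

/* Count the set bits of a chromosome 8 bits at a time, replacing the
 * per-bit conditional with a table read. */
static int count_with_table(const unsigned char *chrom, size_t num_bytes)
{
    int count = 0;
    for (size_t i = 0; i < num_bytes; i++)
        count += lookup[chrom[i]];
    return count;
}
```

On the device the table would typically be passed in as a __constant buffer so that all work-items share one read-only copy.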

This is extremely dubious because each work-item will write to the same memory location.

Generally speaking, when you use local memory in a kernel, you have to explicitly set the local work-size in clEnqueueNDRangeKernel() because it has to match the amount of local memory allocated in the kernel.
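A minimal sketch of that constraint, assuming a kernel that allocates one chromosome's worth of local memory per work-item (all names and sizes here are illustrative, not the poster's real configuration):

```c
#include <assert.h>
#include <stddef.h>

/* The global work-size must be a whole multiple of the local work-size,
 * and the local size must match what the kernel's __local allocation
 * was sized for. */
static int valid_work_sizes(size_t global_size, size_t local_size)
{
    return local_size != 0 && global_size % local_size == 0;
}

/* If the kernel declares something like
 *     __local uchar chrome_local[LOCAL_SIZE * CHROM_SIZE_BYTES];
 * the host-side enqueue would then pass an explicit local size
 * instead of NULL, e.g.:
 *     size_t global = 2, local = 2;
 *     clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
 *                            &global, &local, 0, NULL, NULL);
 */
```

Passing NULL for the local size lets the implementation pick a work-group size on its own, which can silently disagree with the local-memory layout the kernel assumes.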