The Khronos Group is a non-profit industry consortium that develops, publishes, and promotes open-standard, royalty-free media authoring and acceleration standards for desktop and handheld devices, together with conformance qualification programs for platform and device interoperability.


Re: Do I need atomic_add in that case?

This sounds very much like a common histogram calculation. In OpenCL this is a non-trivial operation to do efficiently, but I think a few of the SDKs include examples of it in their samples. An atomic add should work in your case, but it won't give very high performance.
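For reference, a minimal kernel along those lines might look as follows. This is a sketch, not code from any particular SDK sample; it assumes every arr[id] is a valid bin index and that histo was zeroed beforehand:

```c
// Hypothetical OpenCL kernel: one work-item per input element.
// atomic_inc makes the read-modify-write on the shared bin indivisible,
// at the cost of serializing work-items that hit the same bin.
__kernel void histogram(__global const int *arr,
                        __global volatile int *histo,
                        const int n)
{
    int id = get_global_id(0);
    if (id < n)
        atomic_inc(&histo[arr[id]]);
}
```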

Re: Do I need atomic_add in that case?

Originally Posted by linyufly

Thanks ibbles!

Is it correct not to use atomic_add in this case?

No. You have to use some atomic mechanism to synchronize the reads and writes to histo[].

The operation "histo[arr[id]]++;" (or any increment of global memory) is actually composed of three steps: read the value from global memory, increment it, and write it back. On any multi-core device (even a CPU) there is no guarantee that these three steps are performed atomically.

Using a global atomic_inc in this case is feasible, but not the best way. As mentioned, there are examples in the SDKs that show optimized ways to do histogram calculations (build partial histograms in local memory, then do a reduction at the end).

Re: Do I need atomic_add in that case?

Originally Posted by ofer_rose

No. You have to use some atomic mechanism to synchronize the reads and writes to histo[].

The operation "histo[arr[id]]++;" (or any increment of global memory) is actually composed of three steps: read the value from global memory, increment it, and write it back. On any multi-core device (even a CPU) there is no guarantee that these three steps are performed atomically.

Using a global atomic_inc in this case is feasible, but not the best way. As mentioned, there are examples in the SDKs that show optimized ways to do histogram calculations (build partial histograms in local memory, then do a reduction at the end).

Thanks ofer_rose!

I have a further question. If two threads write different values, say 3 and 4, to one global memory address at the same time, can the result ever be anything other than 3 or 4?

Re: Do I need atomic_add in that case?

In theory you're right, but the "problem" is that GPUs are vector machines (as are CPUs, if you use SSE/AVX).

This means threads are not executed individually but in SIMD fashion (SIMT, if you insist on NVIDIA terminology): a "pack" of threads, equal in size to the vector width, runs the same instruction for N threads simultaneously. On NVIDIA this pack is called a warp and its size is 32; on AMD it is called a wavefront and its size is 64.

So the sequence I described (read/increment/write) is done by 32 or 64 threads simultaneously, and atomicity is preserved only at warp/wavefront granularity (the read is done simultaneously for all lanes, then the increment for all, and so on). The result is that the range of possible "jumps" in a histogram bin's value is more than 1 or 2, depending on how many threads from the same pack hit the same bin.

Re: Do I need atomic_add in that case?

Originally Posted by ofer_rose

In theory you're right, but the "problem" is that GPUs are vector machines (as are CPUs, if you use SSE/AVX).

This means threads are not executed individually but in SIMD fashion (SIMT, if you insist on NVIDIA terminology): a "pack" of threads, equal in size to the vector width, runs the same instruction for N threads simultaneously. On NVIDIA this pack is called a warp and its size is 32; on AMD it is called a wavefront and its size is 64.

So the sequence I described (read/increment/write) is done by 32 or 64 threads simultaneously, and atomicity is preserved only at warp/wavefront granularity (the read is done simultaneously for all lanes, then the increment for all, and so on). The result is that the range of possible "jumps" in a histogram bin's value is more than 1 or 2, depending on how many threads from the same pack hit the same bin.

Hi ofer_rose,

Thanks!

I should have described my application directly. There are N properties, numbered 0..N-1, and M objects. I have an int array A[0..N-1] for the properties, and I use one thread per object.

At the end of each thread I know which property the object has, say k, and I then set A[k] to the thread id (which equals the object id). If more than one object has the same property k, then A[k] can be any of them.

My original question should have been: for correctness, do I need atomic_add when setting A[k] to the current thread id?