This contains _atomic_xchg for long (actually it takes a pointer to int3, where the first two ints hold the value and the third is used for locking) and a simple kernel to test it. However, the kernel hangs and the driver restarts. This happens even for small global work sizes, like 64, and I don't see how to fix it; atomic_xchg is the only function I need.

A work item is NOT a thread. There is no independent program counter. You are programming for a SIMD architecture, so you have to realise that if one work item is spinning in the while loop, 64 of them are spinning together. If the work item that grabbed the lock is in the same wavefront as one that is spinning (which will always be the case unless only one work item enters that function), then it cannot progress and you have a deadlock.

Try moving the while loop out, something like this:

    bool done = false;
    while (!done) {
        a = atomic_cmpxchg(...);
        if (a == 0) {   // old value 0 means we got the lock
            ... do the operation because you own the lock
            ... release the lock so other work items can take it
            done = true;
        }
    }

I think the logic there works... but maybe not. You should be able to do something along those lines, though. Once all lanes have performed the atomic, the wavefront can move on.

Hm, I thought that when a wavefront executes instructions that are not needed for the current work item (because of a conditional expression), that work item just does nops. So actually they all do the same operations, but the results are discarded for the masked-off lanes?

Yes, similar logic seems to work for me. After asking the question, I've found another similar discussion and modified the function:

A no-op or a discarded output is the same thing, really. How masked operations are implemented depends on the architecture. If you think about how you'd do it on SSE, where you can't mask arithmetic operations, you'd have a masked copy (an AND, say) into an output register at the end of the divergent block. The GPU may do that, or it may switch lanes off for lower power consumption.

I doubt you can get rid of the lock, because without it you have no way to guarantee that the two sub-operations execute atomically. Actually, in the code above you may still not get the right answer, I think. The compiler or hardware may not enforce visibility of operations on p_val in the way you expect with respect to the atomic operations. OpenCL's memory model is very weak, like C's, and offers little guarantee of ordering across work items.

Your updates to global p_val can't be forced to be seen outside the work-group without using atomics. And they would only be valid within the work-group if you use global barriers. And barriers can't be conditional, which suddenly makes your loop a whole lot more complicated and slower.

So although the code might guarantee atomicity of execution within and across multiple compute units, there can't be a guarantee of atomicity of data unless the data operations themselves are implemented using atomics.

I've managed to avoid any global atomic usage outside of atomic counters - which can be, and are, implemented in hardware. These let one implement lockless, batch-oriented queues without needing any serialisation at all. It won't fit every problem, but then again GPUs don't fit many problems. The only other tool of global synchronisation I use is kernel invocation - this is the only way to synchronise non-atomic updates to global data across compute units.

This is incorrect: GCN-based devices (most HD 7xxx devices) have support for 64-bit atomics. If they are not enabled by default (I'm not sure which release enables them), they can be enabled with the environment variable GPU_64BIT_ATOMICS=1.
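For anyone who wants to try this, the variable presumably has to be in the environment of the host process before the OpenCL context is created (the binary name below is made up for illustration):

```shell
# hypothetical host binary; set the variable before launching it
export GPU_64BIT_ATOMICS=1
./my_opencl_host_program
```

Where the driver exposes the cl_khr_int64_base_atomics extension, requesting it with the standard pragma in the kernel source is the portable route.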

Oh, I didn't know this. However, I've read the AMD APP OpenCL Programming Guide and other materials, and none of them mention that some ATI GPUs support 64-bit atomics. Also, the only occurrence of GPU_64BIT_ATOMICS I could find on Google is in this discussion.