Most modern architectures have a Weakly-Ordered Memory Model, meaning that memory is not necessarily accessed in the order in which we specify it in the program. Therefore, to guarantee program correctness, we must explicitly enforce a certain ordering. This can be achieved with Memory Fences and Synchronization Barriers.

In GPGPU, Barriers are very familiar and widely used in the community. Memory Fences, however, tend to be ignored.

A Synchronization Barrier acts as a point at which the threads of a block wait until all of them have reached it.

A Memory Fence ensures that all writes made by a thread before the fence are visible to all threads of the block after the fence. Not all threads are forced to execute it, though.
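To make the distinction concrete, here is a minimal CUDA sketch (OpenCL exposes the analogous barrier() and mem_fence()); the kernel name and the payload are purely illustrative:

```cuda
__global__ void ordering_demo(int* data, volatile int* ready)
{
    int tid = threadIdx.x;

    // Synchronization Barrier: every thread of the block waits here until
    // all of them have arrived (it also fences shared/global writes).
    __syncthreads();

    // Memory Fence: no thread waits. It only guarantees that the writes
    // this thread issued before the fence become visible to the rest of
    // the block before its writes issued after the fence.
    data[tid] = 42;          // payload written before the fence...
    __threadfence_block();   // ...is made visible before the flag below
    ready[tid] = 1;          // publish: an observer seeing ready[tid] == 1
                             // is guaranteed to also see data[tid] == 42
}
```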

Let's take the case of a kernel in which we launch thread blocks whose data mapping is decided at runtime: for instance, the algorithm randomly selects X threads that will perform global and/or local memory loads and stores for later processing. The REST of the threads, however, perform non-writing work and do not affect the kernel instructions that follow the barrier / fence (cf. drawing).

Without thread ordering, the kernel would obviously turn into a mess because of race conditions. Synchronizing with a barrier is the standard way to guarantee correctness, but since only X threads perform writes, it would be wasteful to make the REST of the threads wait at a barrier. By fencing instead, only the writing threads pay the ordering cost, allowing us to save time!

To put it bluntly, I would say that synchronization is a brute-force way of ordering threads at block scale.

Note that this example is only one of several situations where we could leverage Memory Fences over standard synchronization.
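Another classic one appears in the CUDA C Programming Guide: a single-pass reduction where the last block to finish sums the partial results of all the others. The fence guarantees that each block's partial sum is visible in global memory before the block announces completion; calculatePartialSum and calculateTotalSum are placeholders for the per-block and final reductions:

```cuda
__device__ unsigned int count = 0;  // how many blocks have finished

__global__ void sum(const float* array, unsigned int N, volatile float* result)
{
    __shared__ bool isLastBlockDone;

    // Each block computes a partial sum over its slice of the input
    // (calculatePartialSum is a placeholder for the per-block reduction).
    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {
        result[blockIdx.x] = partialSum;

        // Fence: the partial sum must be visible to the whole grid
        // BEFORE the counter increment below can be observed.
        __threadfence();

        // Signal that this block is done; atomicInc returns the old value.
        unsigned int value = atomicInc(&count, gridDim.x);
        isLastBlockDone = (value == (gridDim.x - 1));
    }

    // Barrier: every thread must read the correct isLastBlockDone.
    __syncthreads();

    if (isLastBlockDone) {
        // The last block sums the gridDim.x partial sums
        // (calculateTotalSum is a placeholder for that final pass).
        float totalSum = calculateTotalSum(result);
        if (threadIdx.x == 0) {
            result[0] = totalSum;
            count = 0;  // reset for the next launch
        }
    }
}
```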

Nevertheless, I must admit that these functions are intended for experienced GPGPU devs who already know what they're doing; that is, devs able to find the best tradeoffs between branch divergence, streams, global/local access patterns, and thread ordering when the mapping of data to the grid is not controllable.

So, in what kind of situations would you use memory fences?

Feel free to share your thoughts in the comment section below.


The GPU is one of the most powerful vector processors: you can have hundreds of threads operating concurrently on different data. In CS344, Dr. David Luebke and Dr. John Owens from NVIDIA illustrated this very nicely by putting GPUs in the bus category and CPUs among the sports cars. Indeed, the amount of time and energy spent per data element is considerably lower with GPUs than with CPUs, which makes them greener and more efficient.

To get the full benefit, the most crucial thing is to maximize your GPU bandwidth:

When Peter (a CUDA/OpenCL dev) starts designing his code for parallelization, he first has to manage GPU resources accordingly. Since the graphics card consists of several compute units (Streaming Multiprocessors) with limited resources, Peter has to make sure, while writing his kernel, that a maximum number of blocks can be scheduled for execution per SM.
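The CUDA runtime can report this directly; here is a minimal sketch using the occupancy API, where myKernel and the block size of 256 are just stand-ins for Peter's actual kernel and configuration:

```cuda
#include <cstdio>

__global__ void myKernel(const float* in, float* out)
{
    // Placeholder body standing in for the real kernel.
}

int main()
{
    int blockSize = 256;        // assumed block size
    int blocksPerSM = 0;

    // How many blocks of myKernel can reside on one SM at once?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  blockSize, 0 /* dyn. smem */);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Blocks per SM: %d (max threads per SM: %d)\n",
           blocksPerSM, prop.maxThreadsPerMultiProcessor);
    return 0;
}
```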

In the previous post, we saw the benefits of skipping synchronization at warp scale. Certainly, this trick provided a significant latency boost, but it didn't actually hide the famous thread idleness present in reduction problems.
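As a reminder, that trick unrolls the last steps of the tree reduction and drops __syncthreads() inside a single warp, relying on pre-Volta warp-synchronous execution (volatile keeps the compiler from caching the values in registers):

```cuda
// Warp-scale reduction without barriers (pre-Volta idiom; on Volta and
// newer, __syncwarp() or warp shuffles should be used instead).
__device__ void warpReduce(volatile float* sdata, int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8];
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1];
}
```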

To deal with that, we could think of increasing the work per thread block - allowing us to launch the kernel with a smaller grid. Remember that a smaller grid means fewer blocks to schedule and, as long as the GPU stays busy, a shorter execution time.
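A common way to do this is a grid-stride loop: each thread serially accumulates several elements before the tree reduction, so fewer blocks cover the same input. A sketch, assuming a launch with 256 threads per block:

```cuda
__global__ void reduceSum(const float* in, float* blockSums, int n)
{
    __shared__ float sdata[256];   // assumes blockDim.x == 256
    int tid = threadIdx.x;

    // Grid-stride loop: more work per thread, fewer idle threads later.
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        sum += in[i];
    sdata[tid] = sum;
    __syncthreads();

    // Standard tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = sdata[0];
}
```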

Let's consider a program computing a 40-bin histogram of a random array of float elements. The naive approach would obviously be to use atomics at block and grid scale with atomicAdd (atomic_add for OpenCL), incrementing the target bin by one for every element that falls into it. Since many atomic operations on the same addresses serialize and create a bandwidth bottleneck, we could think of another approach that gets rid of atomics at block scale:
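For reference, a minimal sketch of the naive version, assuming input values in [0, 1) mapped uniformly over the 40 bins:

```cuda
#define NUM_BINS 40

// Naive version: one global atomicAdd per input element.
__global__ void histogramNaive(const float* in, int n, unsigned int* bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Assumed binning: values in [0, 1) spread uniformly over 40 bins.
        int bin = min(NUM_BINS - 1, (int)(in[i] * NUM_BINS));
        atomicAdd(&bins[bin], 1u);
    }
}
```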

We can achieve that by computing local histograms in shared memory (32 of them per block, for instance) in every thread block, then performing a reduction over them, and finally updating the global histogram buffer (with atomics at grid scale).

Kernel with explicit synchronization
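A minimal sketch of such a kernel, assuming 32 threads per block (so each thread owns one private sub-histogram and block-scale atomics disappear) and, as before, input values in [0, 1):

```cuda
#define NUM_BINS 40
#define BLOCK    32   // assumed block size: one private sub-histogram per thread

__global__ void histogramPrivatized(const float* in, int n, unsigned int* bins)
{
    // One private 40-bin histogram per thread: no atomics at block scale.
    __shared__ unsigned int local[BLOCK][NUM_BINS];
    int tid = threadIdx.x;

    // Each thread zeroes and fills only its own copy: no race, no atomics.
    for (int b = 0; b < NUM_BINS; ++b)
        local[tid][b] = 0;

    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x) {
        int bin = min(NUM_BINS - 1, (int)(in[i] * NUM_BINS)); // assumed input in [0, 1)
        local[tid][bin]++;
    }

    // Explicit synchronization: every private copy must be complete before
    // any thread starts reading the other copies in the reduction below.
    __syncthreads();

    // Reduction: each thread sums one (or two) bins across the 32 copies,
    // then merges the block's result into the global histogram
    // (atomics at grid scale only).
    for (int bin = tid; bin < NUM_BINS; bin += blockDim.x) {
        unsigned int s = 0;
        for (int c = 0; c < BLOCK; ++c)
            s += local[c][bin];
        atomicAdd(&bins[bin], s);
    }
}
```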

However, during the reduction step, we explicitly synchronize on shared memory to avoid race conditions. Keep in mind that we are doing that in every block! Therefore, we could think of removing the barriers to boost performance once again. But would that affect the correctness of the output?