The left half of the screen represents the optimized tiled version and the right half represents the plesiochronous version. Each half is divided into two parts:

Top) A view of the Y/Z plane, with the X dimension running into the screen. Each pixel in the top portion of each side changes color when the computation of a column along X completes. The color changes indicate the rate of computation; the position of each change indicates where and when in the Y/Z plane the computation occurred.

Bottom) Each thread is drawn as an individual line that progresses in time from left to right and wraps around (raster-like), using two colors: green while the thread is computing, red while it waits at a barrier. (The red “ticks” may appear dark rather than red.)
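The raster-like wrapping of one thread's line can be sketched as follows. This is only an illustration, not the code used to produce the video: `ThreadState`, `rasterRows`, the fixed-rate sampling, and the use of the characters 'G' and 'R' in place of green and red pixels are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-sample state of one thread (not from the original programs).
enum class ThreadState { Computing, BarrierWait };

// Lay out one thread's samples left to right, wrapping to a new row every
// `width` samples. 'G' stands for green (computing), 'R' for red (barrier wait).
std::vector<std::string> rasterRows(const std::vector<ThreadState>& samples,
                                    std::size_t width) {
    assert(width > 0);
    std::vector<std::string> rows;
    for (std::size_t i = 0; i < samples.size(); ++i) {
        if (i % width == 0) {
            rows.emplace_back();  // start a new raster row
        }
        rows.back() += (samples[i] == ThreadState::Computing) ? 'G' : 'R';
    }
    return rows;
}
```

With a row width of 4, seven samples (three computing, one waiting, two computing, one waiting) wrap into two rows, the second one shorter than the first.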

In the left half (traditional tiled), note that the columns of X shown in the Y/Z view are in at most two colors (time phases) at any one time. The bottom of the left half shows that the traditional tiled method runs well until the threads start completing their designated tile(s) and reach the barrier. It looks like a cascade of cars reaching a traffic jam, which doesn’t clear until all threads reach the barrier.

In the right half (plesiochronous), note that the columns of X shown in the Y/Z view are in at most three colors (time phases) at any one time. The bottom half shows that the barrier waits of the individual threads are, for the most part, not synchronized. You may notice that four threads appear to be synchronized, and they are: these are the threads of the same core, and the plesiochronous barrier scheme uses core barriers. These threads are not adjacent in the display because of KMP_AFFINITY=scatter. You may also note that each thread computes its X columns along the Y direction; in effect, a thread's tile is not rectangular. The edge in the time domain is ragged, which indicates the time skew between threads. Occasionally you will also notice threads getting delayed, presumably by worst-case memory latencies due to evictions.
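To see why the threads of one core are not adjacent in the display, here is a minimal sketch of a scatter-style placement. It assumes an idealized round-robin assignment in which thread t lands on core t mod nCores, which simplifies what KMP_AFFINITY=scatter does. The function names `coreOfThread` and `coreSiblings` are hypothetical, and the core count is a parameter because the text does not state it.

```cpp
#include <cassert>
#include <vector>

// Idealized scatter placement: threads are dealt out to cores round-robin.
// ASSUMPTION: a simplification of KMP_AFFINITY=scatter, for illustration only.
int coreOfThread(int thread, int nCores) {
    assert(nCores > 0 && thread >= 0);
    return thread % nCores;
}

// All thread IDs that share a core with `thread`, in ascending order.
std::vector<int> coreSiblings(int thread, int nCores, int threadsPerCore) {
    assert(threadsPerCore > 0);
    std::vector<int> siblings;
    const int core = coreOfThread(thread, nCores);
    for (int slot = 0; slot < threadsPerCore; ++slot) {
        siblings.push_back(core + slot * nCores);  // one stride of nCores per slot
    }
    return siblings;
}
```

Under this model, with a hypothetical 60 cores and 4 threads per core, thread 0 shares its core with threads 60, 120, and 180, so the four lines that move in lockstep are spread far apart on the screen.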

The programs were instrumented to collect time stamp counter (RDTSC) readings for each thread as it entered and left a computational region. The time interval between computational regions is the barrier wait time.
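The instrumentation described above might look roughly like the following sketch. It is not the original instrumentation: `RegionStamp`, `ThreadLog`, `readTimestamp`, and `barrierWaits` are hypothetical names, and the fallback to `std::chrono::steady_clock` on non-x86 compilers is an assumption added so the example builds anywhere. Each thread would own one `ThreadLog`, so no locking is needed.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>  // __rdtsc() on GCC and Clang
#endif

// Read the time stamp counter; fall back to a steady clock elsewhere (assumption).
inline std::uint64_t readTimestamp() {
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Entry and exit stamps of one computational region.
struct RegionStamp {
    std::uint64_t enter;
    std::uint64_t leave;
};

// Per-thread log: call enterRegion()/leaveRegion() around each compute phase.
struct ThreadLog {
    std::vector<RegionStamp> regions;
    void enterRegion() { regions.push_back({readTimestamp(), 0}); }
    void leaveRegion() {
        if (!regions.empty()) regions.back().leave = readTimestamp();
    }
};

// Barrier wait = gap between leaving one region and entering the next.
std::vector<std::uint64_t> barrierWaits(const std::vector<RegionStamp>& regions) {
    std::vector<std::uint64_t> waits;
    for (std::size_t i = 1; i < regions.size(); ++i) {
        assert(regions[i].enter >= regions[i - 1].leave);
        waits.push_back(regions[i].enter - regions[i - 1].leave);
    }
    return waits;
}
```

For example, regions stamped (100, 250), (300, 420), and (450, 600) yield barrier waits of 50 and 30 ticks; these are the spans that would be drawn in red on the thread's line.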

You may click on the video to bring it up full size (double the width and height from that shown here).