The most likely schedule is a 2-D block (gang) using strip-mined k and j loops, and a 3-D thread block (vector) from the k, j, and i loops. However, this depends heavily on what the body of the loop looks like and how the data is accessed.
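As a rough sketch of the kind of loop nest being discussed, here is a k/j/i nest where the schedule is left entirely to the compiler. The function name `scale3d`, the loop body, and the data clauses are assumptions made for illustration, since the original loop isn't shown in this thread. Building with something like `pgcc -acc -Minfo=accel` should report which gang/vector schedule was actually chosen.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 3-D loop nest; the real loop body is not shown in the thread.
   No gang/vector clauses are given, so the compiler picks the schedule. */
void scale3d(int nk, int nj, int ni,
             const float *restrict in, float *restrict out, float factor)
{
    assert(nk > 0 && nj > 0 && ni > 0);
    size_t n = (size_t)nk * (size_t)nj * (size_t)ni;

    #pragma acc kernels copyin(in[0:n]) copyout(out[0:n])
    {
        /* "independent" asserts there are no loop-carried dependences,
           which holds for this assumed body. */
        #pragma acc loop independent
        for (int k = 0; k < nk; ++k) {
            #pragma acc loop independent
            for (int j = 0; j < nj; ++j) {
                #pragma acc loop independent
                for (int i = 0; i < ni; ++i) {
                    /* i is the fastest-varying index, so adjacent
                       vector lanes touch adjacent memory. */
                    size_t idx = ((size_t)k * nj + j) * ni + i;
                    out[idx] = factor * in[idx];
                }
            }
        }
    }
}
```

Without an OpenACC-enabled compiler the pragmas are simply ignored and the loops run serially, which makes it easy to check results against the accelerated build.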

Thanks Mat - could you comment in general on whether this is a good way to use OpenACC (in terms of performance)? We actually observe different performance when we build this code block with different compilers. So I wanted to ask if I should explicitly use gangs and vector clauses in order to tune my code.

So I wanted to ask if I should explicitly use gangs and vector clauses in order to tune my code.

Personally, I don't find explicit schedule tuning helps much. The PGI compiler picks a good schedule in the vast majority of cases, and I'd rather not tie my program to a particular schedule since it may not be optimal for other devices.

However, since you're tuning for the compiler, not the device, it may be worth it to you to set the schedule yourself. Granted, there's more to performance than the schedule, so fixing the schedule may still yield varying performance. Worth a try though.
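If you do want to pin the schedule, one possible way to write it is sketched below. The function name `add3d`, the loop body, and the `vector_length(128)` value are assumptions for illustration only, not a recommendation for your code. The collapsed k and j loops are spread across gangs and the i loop is mapped to vector lanes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example of an explicit schedule; the real loop body is
   not shown in the thread and 128 is only an assumed starting value. */
void add3d(int nk, int nj, int ni,
           const float *restrict a, const float *restrict b,
           float *restrict c)
{
    assert(nk > 0 && nj > 0 && ni > 0);
    size_t n = (size_t)nk * (size_t)nj * (size_t)ni;

    /* gang: the nk*nj collapsed iterations are distributed across gangs. */
    #pragma acc parallel loop gang collapse(2) vector_length(128) \
        copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int k = 0; k < nk; ++k) {
        for (int j = 0; j < nj; ++j) {
            /* vector: the contiguous i dimension goes to vector lanes. */
            #pragma acc loop vector
            for (int i = 0; i < ni; ++i) {
                size_t idx = ((size_t)k * nj + j) * ni + i;
                c[idx] = a[idx] + b[idx];
            }
        }
    }
}
```

Because every compiler then sees the same gang/vector mapping, any remaining performance difference would come from other factors such as data movement or code generation rather than the schedule.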