When you privatize an array, you are creating a temporary copy of it for each thread, which can dramatically increase your memory usage. Also, since the private arrays are temporary, their values are not stored back to the host. Full details on the private clause can be found in section 2.4.4 of the Accelerator model guide: http://www.pgroup.com/resources/accel.htm.

Going back to the original code, the reason the outer loop won't parallelize is that every value of n (i.e. every thread) needs to access the same i, j, and k elements of the arrays. The final values in the arrays would then depend on the order in which the threads store their results, so the output would be non-deterministic.

Instead of having the "n" loop as the outermost loop, could it be moved to be the innermost loop? This would allow you to parallelize the i, j, and k loops, have the n loop run as the body of the kernel, reduce the data movement, and increase your compute intensity.
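
Since I don't have your full code in front of me, here is only a rough sketch of the loop interchange in C. The array shape, the weights array "w", the loop bounds, and both function names are made up for illustration, and the directive placement is shown as a comment rather than as tested accelerator code. The point is that once n is innermost, each (i, j, k) element is owned by exactly one thread, which accumulates into a scalar and stores the result once.

```c
#include <stddef.h>

/* Hypothetical sizes, chosen only for illustration. */
#define NI 4
#define NJ 4
#define NK 4
#define NN 8

/* Original ordering (assumed): n is the outermost loop. Every n iteration
   updates the same a[i][j][k] elements, so running the n iterations in
   parallel would make the result depend on the order of the stores. */
void accumulate_n_outer(float a[NI][NJ][NK], const float w[NN])
{
    for (size_t n = 0; n < NN; ++n)
        for (size_t i = 0; i < NI; ++i)
            for (size_t j = 0; j < NJ; ++j)
                for (size_t k = 0; k < NK; ++k)
                    a[i][j][k] += w[n] * (float)(i + j + k);
}

/* Interchanged ordering: n is the innermost loop. The i, j, and k
   iterations are independent of each other, so they can be parallelized,
   and the n loop runs sequentially inside each one. */
void accumulate_n_inner(float a[NI][NJ][NK], const float w[NN])
{
    /* An accelerator compute region directive (e.g. "#pragma acc region"
       in the PGI Accelerator model) would wrap the loops below. */
    for (size_t i = 0; i < NI; ++i)
        for (size_t j = 0; j < NJ; ++j)
            for (size_t k = 0; k < NK; ++k) {
                float sum = a[i][j][k];      /* one load per element */
                for (size_t n = 0; n < NN; ++n)
                    sum += w[n] * (float)(i + j + k);
                a[i][j][k] = sum;            /* one store per element */
            }
}
```

Both versions add the n contributions to each element in the same order, so they should give the same answer on the host. The second form does NN times more arithmetic per array element loaded and stored, which is where the higher compute intensity comes from.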