I am solving the linear advection equation in order to learn parallel programming. I am running this code on a Xeon Phi coprocessor. Below is a plot of how it scales with an increasing number of threads. I am still learning, and this is my first "parallel" code. I am also not very familiar with architecture-specific optimization.

Comments on how to improve the performance of this code, and on what else I should look into, would be great. Links to write-ups or further tutorials would also be really helpful.

2 Answers

You should be more explicit about affinity, and plot the performance against the number of cores, with separate lines for 1, 2, 3 and 4 threads per core (t/c). You will gain more performance from adding a whole core than from adding another hardware thread to a core you're already using, so "2 threads" can correspond to two different performance values (2 cores x 1 thread, or 1 core x 2 threads).

By using KMP_PLACE_THREADS, you can easily select the appropriate configurations (and don't also set OMP_NUM_THREADS!).
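To make the set of configurations concrete, here is a minimal sketch in C that formats the KMP_PLACE_THREADS value for one (cores, threads per core) pair. The helper names place_threads_value and total_threads are hypothetical; the "<cores>c,<threads>t" form is the syntax the Intel OpenMP runtime accepts for this variable, and a benchmark driver would export that value before each run.

```c
#include <stdio.h>

/* Hypothetical helper: writes a value such as "2c,1t" (2 cores, 1 thread
 * on each) into buf. Returns whatever snprintf returns. */
static int place_threads_value(char *buf, size_t size,
                               int cores, int threads_per_core)
{
    return snprintf(buf, size, "%dc,%dt", cores, threads_per_core);
}

/* Total number of OpenMP threads implied by one configuration. */
static int total_threads(int cores, int threads_per_core)
{
    return cores * threads_per_core;
}
```

Note that "2c,1t" and "1c,2t" both give two threads in total, which is why a plot indexed only by thread count hides the difference between them.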

You should cite a source or give a reason in your answer as to why you should not use OMP_NUM_THREADS. Do that and you will have my upvote.
– syb0rg, Jun 19 '14 at 20:11

Hey Jim, I tried various affinities. compact gives the worst performance, and I think that is also the reason behind the poor performance when the number of threads exceeds the number of processors. Could you advise on how I should modify this code to make it more efficient with compact affinity?
– maverick, Jun 21 '14 at 8:55

You should first check for the number of command line arguments via argc. If there are too few, print an error message and then terminate the program.

if (argc < 3) {
    // print an error message...
    return EXIT_FAILURE;
}

In addition, your first existing check in main() should also return EXIT_FAILURE instead of calling exit(1). The literal status 1 is not guaranteed to be portable, whereas EXIT_FAILURE is, and it is safe to simply return from main() when the program cannot continue execution. More info about that here.

Your error messages shouldn't be printed with printf():

printf("ERROR: Number of cells should be integral multiple of number of threads \n");

They should instead be written to stderr with fprintf():

fprintf(stderr, "Number of cells should be integral multiple of number of threads \n");
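Putting these three points together, a minimal sketch of the argument handling might look like the following. The helper name parse_args, the usage string, and the argument order (number of cells, then number of threads) are assumptions for illustration, not taken from the reviewed code.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: validates the command line and fills in both values.
 * The argument order (cells, then threads) is an assumption. */
static int parse_args(int argc, char **argv, long *ncells, long *nthreads)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <ncells> <nthreads>\n",
                argc > 0 ? argv[0] : "advect");
        return EXIT_FAILURE;
    }

    *ncells = strtol(argv[1], NULL, 10);
    *nthreads = strtol(argv[2], NULL, 10);

    if (*ncells <= 0 || *nthreads <= 0) {
        fprintf(stderr, "ERROR: both arguments must be positive integers\n");
        return EXIT_FAILURE;
    }
    if (*ncells % *nthreads != 0) {
        fprintf(stderr, "ERROR: Number of cells should be integral multiple "
                        "of number of threads\n");
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}
```

In main(), this could be used as `if (parse_args(argc, argv, &ncells, &nthreads) != EXIT_SUCCESS) return EXIT_FAILURE;`, so every failure path reports on stderr and exits with a portable status.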

They're not actually being initialized; they're being assigned after declaration. Initialization means giving a variable its initial value at the point where it is declared. You should do that here; it keeps each variable's scope as narrow as possible, which is generally easier to maintain.

Either way, you don't really need that comment. It's already clear what's being done (even after making this correction), and comments shouldn't state the obvious.

Also regarding the narrowest possible variable scope: you have i declared towards the top of main(), but you don't use it until the for loop towards the end. If you don't have C99 (and thus cannot declare i within the loop statement itself), declare i right before the loop.
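As a small illustration of both points, here is a sketch with a variable initialized at its declaration and a C99 loop counter scoped to the loop. The function and variable names (grid_length, ncells, dx) are hypothetical and not taken from the reviewed code.

```c
#include <stddef.h>

/* Hypothetical example: sums ncells cell widths of size dx. */
static double grid_length(size_t ncells, double dx)
{
    /* Initialized at the declaration, not declared first and assigned later. */
    double length = 0.0;

    /* C99: the counter i exists only inside the loop. */
    for (size_t i = 0; i < ncells; ++i) {
        length += dx;
    }
    return length;
}
```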

This line doesn't look right:

td.maxthreads = omp_get_num_threads();

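omp_get_num_threads() returns the size of the current team, which is 1 if it is called outside a parallel region. It appears that you're looking for omp_get_max_threads():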

td.maxthreads = omp_get_max_threads();

This returns an upper bound on the number of threads that a new parallel region would use. If this is not really your intent, then rename maxthreads to something more accurate.
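The difference between the two calls can be seen in a short sketch. The helper names are hypothetical, and the serial fallbacks are an addition so that the snippet also builds when OpenMP is not enabled (for example, without -fopenmp on GCC).

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks (an assumption for illustration) so the sketch also
 * builds without OpenMP enabled. */
static int omp_get_num_threads(void) { return 1; }
static int omp_get_max_threads(void) { return 1; }
#endif

/* Called from serial code, the current team has exactly one thread. */
static int team_size_outside_parallel(void)
{
    return omp_get_num_threads();
}

/* Inside a parallel region, the same call reports the actual team size. */
static int team_size_inside_parallel(void)
{
    int n = 0;
    #pragma omp parallel
    {
        /* One thread records the team size; the others wait at the
         * implicit barrier that ends the single construct. */
        #pragma omp single
        n = omp_get_num_threads();
    }
    return n;
}
```

Comparing both results with omp_get_max_threads() shows why the value stored in maxthreads depends on where the original call is made.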