I have written a C++ finite volume solver that simulates 2D compressible flows on unstructured meshes, and I have parallelised it using MPI (Open MPI 1.8.1). I partition the initial mesh into N parts (equal to the number of processors being used) with METIS via Gmsh. Within the solver, there is a function that calculates the numerical flux across each local face in the various partitions. This function takes the left/right values and reconstructed states (evaluated prior to the function call) as input, and returns the corresponding flux. During this function call there is no inter-processor communication, since all the input data is available locally. I use MPI_Wtime to find the time taken for each such function call. With 6 processors (Intel Core i7-3770), I get the following results:

Processor 1: 1406599932 calls in 127.467 minutes

Processor 2: 1478383662 calls in 18.5758 minutes

Processor 3: 1422943146 calls in 65.3507 minutes

Processor 4: 1439105772 calls in 40.379 minutes

Processor 5: 1451746932 calls in 23.9294 minutes

Processor 6: 1467187206 calls in 32.5326 minutes
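For reference, here is a minimal sketch of how the per-call time is accumulated on each rank. None of these names come from my solver: compute_flux() is a dummy stand-in for the actual flux routine, and the loop simply imitates the loop over local faces.

```cpp
#include <mpi.h>
#include <cstdio>

// Hypothetical stand-in for the purely local flux evaluation (no communication).
static double compute_flux(double uL, double uR) {
    return 0.5 * (uL + uR);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double    flux_seconds = 0.0;   // cumulative time inside compute_flux on this rank
    long long flux_calls   = 0;

    for (long long i = 0; i < 1000000; ++i) {    // stands in for the loop over local faces
        double t0 = MPI_Wtime();
        volatile double phi = compute_flux(1.0, 2.0);
        (void)phi;                               // keep the call from being optimised away
        flux_seconds += MPI_Wtime() - t0;
        ++flux_calls;
    }

    std::printf("Processor %d: %lld calls in %g minutes\n",
                rank + 1, flux_calls, flux_seconds / 60.0);
    MPI_Finalize();
    return 0;
}
```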

I am really surprised by these timings, especially those from processors 1 and 2. Processor 2 makes roughly 72 million more calls than processor 1, yet takes about one seventh of the time. I reiterate that there is no inter-processor communication taking place in this function. Could the following cause this large a variation in time?

1. Conditional (if) branches inside the function
2. The magnitude of the input values; for instance, if a majority of the values on a given processor are close to zero (see the sketch after this list).
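If point 2 matters, it would most likely be through subnormal (denormal) arithmetic: on x86 hardware, operations whose operands or results are subnormal can be dramatically slower than operations on normal values. The standalone test below is only a sketch of how this could be checked in isolation; the kernel, constants, and iteration count are purely illustrative and have nothing to do with my solver. If the timing gap vanishes once subnormals are flushed to zero, denormal handling would be a likely suspect.

```cpp
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE (FTZ)
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE (DAZ, needs SSE3)
#include <chrono>
#include <cstdio>

// Loop-carried multiply-add; if x is subnormal, the running value stays subnormal.
static double kernel(double x, long n) {
    double s = x;
    for (long i = 0; i < n; ++i)
        s = s * 0.5 + x;
    return s;
}

static double seconds(double x, long n) {
    auto t0 = std::chrono::steady_clock::now();
    volatile double r = kernel(x, n);   // volatile keeps the work from being removed
    (void)r;
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    const long n = 50000000;
    std::printf("normal operands    : %.3f s\n", seconds(1.0, n));
    std::printf("subnormal operands : %.3f s\n", seconds(1.0e-310, n));

    // Flush subnormals to zero in hardware; if the gap disappears, denormals were the cause.
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    std::printf("subnormal, FTZ+DAZ : %.3f s\n", seconds(1.0e-310, n));
    return 0;
}
```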