Numeric differences are to be expected in parallel applications. The basic
reason is that on many architectures floating-point operations are performed
with higher internal precision than that of the arguments, and only the
final result is rounded back to the lower output precision. When the same
operation is performed in parallel, intermediate results are communicated at
the lower precision, so the final result can differ. How much it differs
depends on the numerical stability of the algorithm: it could be a slight
difference in the last 1-2 significant bits, or it could be a completely
different result (e.g. when integrating chaotic dynamical systems).

In your particular case, with one process the MPI_Reduce is effectively a
no-op and the summation is done entirely in the preceding loop, possibly at
the higher internal precision. With two processes the sum is broken into two
parts, each computed at the higher precision but converted to float before
being communicated.

You could try to "cure" this (non-)problem by telling your compiler not to
use higher precision for intermediate results, e.g. with GCC on x86 the
-ffloat-store and -fexcess-precision=standard options.