Doing the output in one big block is nearly twice as fast--just 29.8ms. Man! Even thinking
about running in parallel sped up this program substantially! I
think this is mostly due to the fact that the inner loop now has no
function calls, so the compiler doesn't have to save and restore
registers across the I/O function call.

OpenMP

Parallelizing with OpenMP is extremely easy, so we'll do that
first. We need to give each thread a different piece of the
output array to work on, which is easy--just parallelize the y loop:

OK, easily done, but this still takes 16.6ms. That's less than a 2x speedup, even on a quad-core machine.

One serious problem with Mandelbrot, like a lot of real applications,
is that the number of iterations in the "count" loop varies
dramatically across the image--in the middle, there are lots of set
points that go the whole loop, while the outside has points that escape
quickly. This "load imbalance" means some processors still have
lots of work to do (many iterations to go), while the other processors
just sit there waiting.

OpenMP's "schedule" directive is handy for dealing with load
imbalance. For example, this version causes the threads to check
in after each row, looking for more unfinished work:

Surprisingly, again, this speeds up the program! Now we're at 7ms.

SSE

We have two options for SSE:

1. Put zr and zi into a single vector, and try to extract a little
parallelism within the operations on it, like the duplicate
multiplies. Most of our work would be in shuffling values around.

2. Put four different zr's into one vector, and four different zi's
into another, then do the *exact* operations above for four separate
pixels at once.

Generally, 1 is a little easier (for example, only one pixel at a time
going around the loop), but 2 usually gives much higher
performance. To start with, we can just unroll the x loop 4
times, load everything over into 4-vectors (named in ALL CAPS), and use
the handy little instruction _mm_movemask_ps to extract the sign bits from the 4 floats to figure out when to stop the loop:

This runs, and it's really super fast, at just 2.8 ms, but it gets the
wrong answer. The trouble is that all four floats are chained
together in the loop until they're all
done. What we need is for only the floats that are less than 4.0
to keep going around the loop, and the other floats to stop
changing. We can use the standard (yet weird) SSE branch trick to
fix this, essentially doing the loop work only if we should keep
branching, like "if (tZR*tZR+tZI*tZI<FOUR) {ZR=tZR; ZI=tZI;}".

This gets the right answer! It is a tad slower, 3.8ms. But getting the right answer is non-negotiable!

MPI

The main question here is: which pixels should each rank render?
The obvious answer is rank i renders a block of rows starting at
i*n_rows/size.

That's probably the wrong answer.

The trouble, again, is load imbalance: some rows are harder to render
than others. OpenMP has that handy "schedule(dynamic,1)" option
where processors check in to figure out which rows to use; but that's
expensive over the network (40us messages), plus there's a bottleneck on
the processor everybody "checks in" to.

One decent option on MPI is to render rows where y%size==rank.
This "round-robin" scheduling means a localized block of high-cost rows
will be split up between as many processors as possible. It makes
reassembling the image a bit more expensive (non-contiguous data per
CPU), but reassembling the image will be a pain any way you do it.

for (int y=rank;y<ht;y+=size) ...

The easy, inefficient way to reassemble the image is just to bitwise-OR
all the pixels together (most of which are zeros on each processor!)
with an MPI_Reduce.