There is a strong correspondence between the portions of Shor's
algorithm that would be performed on a classical computer and the
portions of the simulation that do not parallelize, and likewise
between the portions that would be performed on a quantum computer and
the portions that do.

The pre- and post-processing stages of Shor's algorithm, which take
place on a classical computer, are poor candidates for
parallelization. In contrast, the portions that the quantum computer
would perform are easily parallelized.

The basis for parallelization in the simulation of Shor's algorithm is
that at several points in the sequential code there are for loops that
iterate over large arrays, and modify each element in some uniform
manner.

Given this type of parallelism, each of the Charm++, pthreads, and MPI
paradigms has some appeal and some disadvantages. Charm++'s object
paradigm coincides with the sequential code's object-oriented
approach. However, Charm++'s load-balancing features are of little
utility, since the workload is already very even across equal array
portions, and it is not clear that the overhead imposed by Charm++
would be recovered through superior load balancing. MPI seems
reasonable, as each process element iterates over its own exclusive
region in the parallelized code; however, each portion must then be
communicated to a manager process, which performs various operations.

Pthreads seemed the most natural choice: since each thread iterates
over its own unique set of array locations, there is no need for
locks, and no danger of deadlock or data corruption. If the array
portions are suitably large, there is very little false sharing from
different threads' array portions landing in the same cache lines.
Having decided on the pthreads paradigm, two hurdles remain: we must
choose a synchronization method, and we must decide how to split the
work.

Splitting the work is handled in range.h, which, given the values of
n, q, and num_pthreads, assigns values to each of num_pthreads array
locations, such that the i'th entries of q_range_lower, q_range_upper,
n_range_lower, and n_range_upper contain the array locations where the
i'th process element should begin and end processing.

Synchronization is handled in barrier.h, where we implement a simple
barrier with pthread locks. These barriers are placed before and after
each parallel section. They are necessary because we frequently
perform an operation involving some result computed over the entire
array immediately before or after these parallelized sections.

In accordance with Amdahl's law, our speedup is limited by the speedup
of the parallelized sections; but because those sections account for
such a large fraction of the total running time, the overall speedup
is nearly linear.