Distributed memory MIMD (Multiple Instruction Multiple Data)
machines are emerging as a cost-effective means of speeding up
numerically intensive programs.
They scale more easily than shared-memory parallel machines.
Writing explicitly parallel programs for these machines, however,
is both difficult and error-prone.
Compilers for languages such as High Performance Fortran (HPF)
ease the task by generating the necessary inter-processor
communication from data-distribution directives
supplied by the programmer.
This dissertation shows that for a large class of array-style programs
automatic data distribution can produce a significant speedup
on a distributed memory MIMD machine.
Array-style programs use array primitives to manipulate entire arrays,
rather than looping explicitly over the array elements.
APL programs are typically array-style.
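As an illustrative contrast (our example, not taken from the
dissertation), the APL one-liner (+/V)÷⍴V computes the mean of a
vector with whole-array primitives, whereas a scalar-style C program
must loop over the elements:

    #include <stddef.h>

    /* Scalar-style mean: an explicit loop over the elements.
       The array-style APL equivalent is the one-liner (+/V) ÷ ⍴V,
       where +/ sums the whole vector and ⍴ yields its length. */
    double mean(const double *v, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += v[i];
        return sum / (double)n;
    }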

We show how to apply automated data distribution to APL programs,
which treat arrays and the operations on them as atomic.
Automated data distribution derives the necessary
inter-processor communication from the way APL primitives manipulate
entire arrays, rather than from the complex algebraic analysis
of array-subscript patterns performed by more conventional compilers.
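For instance (an illustration of ours, assuming a block distribution;
rotate_plan and its fields are hypothetical names), for the APL
rotation k⌽V the compiler knows that every element shifts k positions,
so each processor ships exactly its first k elements to its left
neighbor, with no subscript analysis at all:

    /* Sketch: plan the communication for the APL left rotation k⌽V,
       with n elements block-distributed over p processors, b = n/p
       elements per processor, and 0 < k <= b. */
    typedef struct { int first, count, dest; } SendPlan;

    SendPlan rotate_plan(int me, int p, int b, int k)
    {
        SendPlan plan;
        plan.first = me * b;            /* first global index held locally */
        plan.count = k;                 /* only k boundary elements leave  */
        plan.dest  = (me - 1 + p) % p;  /* left neighbor, with wraparound  */
        return plan;
    }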
A simple distribution and alignment scheme
automatically distributes arrays across available processors.
Arrays can be dynamic, with sizes that vary during program execution,
so data distribution is guided by estimates of array sizes.
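One plausible block scheme can be sketched as follows (our
illustration; the dissertation's actual scheme may differ): an array
with an estimated n elements is split into contiguous slices that
differ in size by at most one, with processor me owning the half-open
range [lo, hi):

    /* Hypothetical block distribution of n elements over p processors. */
    void block_range(long n, int p, int me, long *lo, long *hi)
    {
        *lo = (n * (long)me) / p;       /* start of this processor's slice */
        *hi = (n * (long)(me + 1)) / p; /* one past the end of the slice   */
    }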
Distribution trade-off analysis attempts to optimize
the initial distribution by comparing the estimated
communication and computation times, and replicating arrays whose
partitioning results in excessive communication.
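The flavor of that decision can be sketched as a simple cost
comparison (the cost model and all names below are our assumptions,
not the dissertation's):

    /* Replicate an array when the communication its partitioning
       would induce is estimated to outweigh the benefit of computing
       on distributed pieces. */
    typedef struct {
        double comm_time;   /* estimated inter-processor data movement */
        double comp_time;   /* estimated per-processor computation     */
    } CostEstimate;

    int should_replicate(CostEstimate partitioned, double replicated_comp)
    {
        /* Keep the partitioning only if distributed computation plus
           the communication it needs beats working on a replicated
           copy with no communication at all. */
        return partitioned.comm_time + partitioned.comp_time > replicated_comp;
    }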

Building on the APL-to-C compiler developed by W.-M. Ching,
we produce explicitly parallelized C from APL source programs.
We describe parallel implementations of most of the APL primitives;
several of them rely on the monotonic data movement algorithm.
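Reading "monotonic" as movement in which destination indices are
non-decreasing in source index (our interpretation; the dissertation
defines the algorithm precisely), APL's compression B/V is a natural
example: kept elements preserve their order, so each processor's
outgoing data occupies consecutive global positions whose starting
offsets follow from a prefix sum of the per-processor kept-counts:

    /* Sketch: the j-th element kept on processor i lands at global
       index offset[i] + j, where offset[i] counts the elements kept
       on all lower-numbered processors (an exclusive prefix sum). */
    void kept_offsets(const long *kept_per_proc, int p, long *offset)
    {
        long running = 0;
        for (int i = 0; i < p; i++) {
            offset[i] = running;          /* start of processor i's data */
            running += kept_per_proc[i];  /* total kept so far           */
        }
    }

In a message-passing setting these offsets would typically come from
a collective scan such as MPI's MPI_Exscan.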
The ideas developed are demonstrated with eight APL programs
of varying complexity.
We show the speedup and efficiency obtained when running these programs
on 2 to 32 processors.
The speedups achieved on 32 processors, which range from 7 to 30,
show the technique to be applicable to a wide range of programs.