There are two major dynamics algorithms in the CCM2 code,
the spectral transform method [11,23,27] and the semi-Lagrangian transport
method [41].
The process models for radiation, clouds, surface moisture and temperature
share the common feature that they are coupled
horizontally only through the dynamics. We lump all these processes
under the general term ``physics'' and note that the physics
calculations are independent for each vertical column of grid cells in the
model.

The independence of the physics calculations for each horizontal grid
point is the primary source of parallelism in PCCM2.
By partitioning the horizontal grid points into blocks and assigning
them to processors, a decomposition of the three dimensional,
physical space data structures
is defined. This decomposition allows the physics calculation for
each vertical column of
grid points to be performed without the need for
interprocessor communication.
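The column independence described above can be illustrated with a minimal sketch (Python rather than the model's Fortran; the dimensions, the block layout, and the `column_physics` stand-in are all hypothetical, not taken from CCM2):

```python
import numpy as np

# Hypothetical dimensions: nlon x nlat horizontal grid, nlev vertical
# levels, partitioned into a 2x2 grid of processor blocks.
nlon, nlat, nlev = 8, 8, 4
field = np.random.rand(nlon, nlat, nlev)

def column_physics(column):
    """Stand-in for a physics update on one vertical column.

    The actual physics (radiation, clouds, surface fluxes) reads and
    writes data only within a single column, which is what makes the
    loop below embarrassingly parallel.
    """
    return column + 0.1 * np.gradient(column)

# Each "processor" owns a block of horizontal points and can update
# its columns with no interprocessor communication.
px, py = 2, 2
for bi in range(px):
    for bj in range(py):
        i0, i1 = bi * nlon // px, (bi + 1) * nlon // px
        j0, j1 = bj * nlat // py, (bj + 1) * nlat // py
        for i in range(i0, i1):
            for j in range(j0, j1):
                field[i, j, :] = column_physics(field[i, j, :])
```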

The dynamics calculations make use of the spectral transform method
for the approximation of all horizontal derivatives in the equations
except those of the
advective term in the moisture transport equation. The spectral
transform consists of two stages, applied as two separate transforms:
the fast Fourier transform (FFT) and the Legendre transform.
The Fourier transform integrates information along each east-west
grid line in the longitudinal direction. In the spectral transform
from grid space to spectral space, this is followed by a Legendre
transform integrating the results of the FFT in the north-south, or
latitudinal, direction. Thus, the spectral transform operates on data
``globally'' in that information from each horizontal grid point, and from each
processor, contributes to each spectral coefficient.
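The two-stage transform can be sketched as follows (an illustrative Python fragment, not the PCCM2 code; the dimensions are invented, and the Legendre stage is shown only for zonal wavenumber m = 0, whereas the full transform uses associated Legendre functions at every wavenumber):

```python
import numpy as np

nlon, nlat = 16, 8

# Gaussian latitudes and weights for the Legendre quadrature.
mu, w = np.polynomial.legendre.leggauss(nlat)   # mu = sin(latitude)

f = np.random.rand(nlat, nlon)                  # grid-point field

# Stage 1: FFT along each east-west grid line -> Fourier coefficients.
fm = np.fft.rfft(f, axis=1) / nlon

# Stage 2: Legendre transform, integrating the Fourier coefficients in
# the north-south direction (shown only for m = 0, with ordinary
# Legendre polynomials P_n).
nmax = nlat - 1
coeffs = np.zeros(nmax + 1, dtype=complex)
for n in range(nmax + 1):
    Pn = np.polynomial.legendre.Legendre.basis(n)(mu)
    # Each spectral coefficient sums contributions from every latitude,
    # which is why the transform couples all grid points (and processors).
    coeffs[n] = np.sum(w * Pn * fm[:, 0])
```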

The FFT can be performed effectively in parallel [10] by
exploiting the fact that there are multiple grid lines to be transformed at any
one time.
Two options for parallel FFTs are currently implemented in PCCM2.
One is based on transposing the FFT data so that entire longitude lines
are contained within a processor. Since there are multiple such lines
(one for each latitude and vertical level), the parallel FFTs can be
spread nearly equally among the processors. At the conclusion of the
FFT, the data are transposed back to the original processor distribution.
The other option is a distributed parallel FFT, where each longitude line
is divided across a number of processors.
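The transpose-based option can be sketched serially in Python (no MPI; the gather and latitude-dealing steps below stand in for the actual interprocessor transposes, and all dimensions are invented):

```python
import numpy as np

nproc, nlat, nlon = 4, 8, 16

# Initial layout: each "processor" owns a longitude slab of every latitude.
slabs = [np.random.rand(nlat, nlon // nproc) for _ in range(nproc)]

# "Transpose": reassemble full longitude lines, dealing complete lines
# out to processors so each holds a subset of latitudes.
full = np.hstack(slabs)                            # (nlat, nlon)
lines = [full[p::nproc, :] for p in range(nproc)]  # full lines per proc

# Local FFTs: each processor transforms its own complete longitude
# lines, so the work is spread nearly evenly; afterwards the data
# would be transposed back to the original distribution.
spec = [np.fft.rfft(chunk, axis=1) for chunk in lines]
```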
A vector sum algorithm
is used to calculate the Legendre transform in PCCM2.
The parallelization of the spectral transforms in PCCM2 has driven
most of the design decisions adopted for the organization of the
data structures.
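The vector-sum idea can be sketched as follows (a serial Python simulation with invented dimensions; in the real code the final sum would be a global reduction across processors):

```python
import numpy as np

nproc, nlat, nspec = 4, 8, 5
mu, w = np.polynomial.legendre.leggauss(nlat)
f = np.random.rand(nlat)              # Fourier coefficients for one m

# Legendre basis evaluated at the Gaussian latitudes (m = 0 shown).
P = np.array([np.polynomial.legendre.Legendre.basis(n)(mu)
              for n in range(nspec)])            # (nspec, nlat)

# Each processor forms partial sums over its local latitudes ...
partial = []
for p in range(nproc):
    j = slice(p * nlat // nproc, (p + 1) * nlat // nproc)
    partial.append(P[:, j] @ (w[j] * f[j]))

# ... and a global vector sum completes every spectral coefficient.
coeffs = np.sum(partial, axis=0)

# The distributed result matches the undistributed quadrature.
assert np.allclose(coeffs, P @ (w * f))
```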

The advective terms in the moisture equation are approximated
using a semi-Lagrangian transport method.
The method updates the value of the moisture field at
a grid point (the arrival point, A) by first establishing
a trajectory through which the particle arriving at A has moved during
the current timestep. From this trajectory the
departure point, D, is calculated and the moisture field
is interpolated at D using shape-preserving interpolation.
All the calculations involve physical space (grid point) data,
and are decomposed over the processors with the same mesh
decomposition used to parallelize the physics and the spectral transform.
The parallel implementation of this algorithm uses the fact that timestep
constraints imposed by the Eulerian dynamics limit the distance between
the departure and arrival points in the latitude direction.
By extending the arrays in each processor,
thus ``overlapping'' regions assigned to neighboring processors, and updating the
overlapped portion of the array prior to each timestep via interprocessor
communication, the calculations in the different processors can proceed
independently.
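A one-dimensional sketch of the overlap idea (Python, with invented sizes; periodic wraparound stands in for the halo update via interprocessor communication, and linear interpolation stands in for the model's shape-preserving interpolant):

```python
import numpy as np

# The Eulerian timestep limit bounds how far a departure point can lie
# from its arrival point, so each processor needs only a halo of that
# width from its neighbors.
n, halo = 16, 2                    # halo width >= max displacement
q = np.sin(2 * np.pi * np.arange(n) / n)     # moisture-like field
u_dt = 1.3                         # displacement in grid units (< halo)

# Extend the local array with "overlap" copies of neighbor data.
qx = np.concatenate([q[-halo:], q, q[:halo]])

qi = np.empty(n)
for a in range(n):                 # a: arrival point
    d = a + halo - u_dt            # departure point, extended indexing
    k = int(np.floor(d))
    frac = d - k
    # Interpolate the field at the departure point; once the halo is
    # filled, every arrival point is updated without communication.
    qi[a] = (1 - frac) * qx[k] + frac * qx[k + 1]
```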

The rest of this section describes in more detail the data decomposition and parallel
algorithms for the Legendre transform and the semi-Lagrangian transport.
A detailed description and comparison of parallel algorithms
can be found in a series of papers [14,46,10,15].
More details on the parallel CCM2, and a description of load balancing
techniques used in the physics component of the model, can be found in other
papers [21,47].
A data parallel implementation of PCCM2 is described in [18]
and a modestly parallel implementation for shared memory and distributed
memory (1-D decomposition) is described in [17].