Lattice Boltzmann Methods

The lattice Boltzmann method is a powerful technique for
the computational modeling of a wide variety of complex
fluid flow problems including single and multiphase flow
in complex geometries. It is a discrete computational
method based upon the Boltzmann equation.
It considers a typical volume element
of fluid to be composed of a collection of particles that are
represented by a particle velocity distribution function for each
fluid component at each grid point. The time is counted in discrete
time steps and the fluid particles can collide with each other as they
move, possibly under applied forces. The rules governing the collisions are
designed such that the time-average motion of the particles is
consistent with the Navier-Stokes equation.

This method naturally accomodates
a variety of boundary conditions such as the pressure drop across
the interface between two fluids and wetting effects at a fluid-solid
interface. It is an approach that bridges microscopic phenomena
with the continuum macroscopic equations. Further, it can
model the time evolution of systems.

The Lattice Boltzmann Method is resource intensive.
In general, running simulations on large systems (greater
than 100x100x100 grid points) is not practical due to the lack of memory
resources and long processing times. Because of these extreme demands
on memory and computation, and the fact that the LB method generally
needs only nearest neighbor information, the algorithm is an ideal
candidate for parallel computing.

How is the Parallelization Realized?

The code was implemented in C with MPI for portability.
There are multiple included features that enable large problems to
be run quickly.

Single-Program Multiple-Data (SPMD) Model:

The data volume is divided into spatially contiguous blocks
along one axis; multiple copies of the same program run
simultaneously, each operating on its own block of data. Each copy of the
program runs as an independent process and typically each process runs
on its own processor. At the end of each iteration, data for the
planes that lie on the boundaries between blocks are passed between
the appropriate processes and the iteration is completed.

Memory Management:

In order to run large problems, we use multiple techniques to keep
within-processor memory needs as small as possible.

Since only nearest neighbor information is needed, all computation
within a process is performed with only three temporary planes, Thus
temporary memory requirement grows at a much smaller rate than
problem size.

For fluid flow in complex geometries, we have both active
sites (that hold fluid) and
inactive sites (that consist of material such as sandstone). For
efficient use of memory we use an indirect addressing approach where
the active sites point to fluid data and the inactive sites point to
NULL. Hence only minimal memory needs to be devoted to inactive
sites. At each active site we point to the necessary velocity and
mass data for each fluid component. Over the course of an iteration
we visit each active cell in the data volume and calculate the
distribution of each fluid component to be streamed to neighboring
cells. New mass and velocity values are accumulated at each active
cell as its neighbors make their contributions.

What is the Performance of the Parallel Code?

For Modeling Fluid Flow in Complex Geometries:

We ran a series of timing tests on multiple machines.
We found in all cases that the non-parallelizable part of the computation
accounts for between 0.7% and 3% of the total computational load.
In one of the test cases the performance data from the
SGI Origin 2000 closely matches this formula (T is the total
time in seconds for an iteration; N is the number of processors): T = 0.090 + 11.98/N.
The non-parallizable part of the computation is 0.090 seconds,
while the parallelizable portion of the computation uses 11.98 seconds.
So, for example, a single iteration took 12.08 seconds on one processor
but only 1.11 seconds on 12 processors.

Other timing tests indicate that the time for the parallelizable
portion of the code is roughly proportional to the number of active
sites over the entire volume, while interprocess communication time is
roughly proportional to the size of a cross-section of the
volume. So as we process larger systems, the time for the
parallelizable portion of the code should increase proportionally with
the cube of the linear size of the system, while the
non-parallelizable portion should increase with the square of the
linear size of the system. This means that for larger systems, a
larger proportion of the time is in the parallelizable computation and
greater benefits can be derived from running on multiple processors.