Computational Simulation of Multi-Body Interactions with O(n) Scaling

Abstract

Molecular dynamics calculations have historically required significant amounts of computer time since they require integration of the equations of motion over significant periods of time for large numbers of particles. This research describes a new class of algorithms which not only speeds up these computations, but provides for much slower growth in the time required for doing even larger problems. Both the underlying theory and implementation details are provided.

Introduction

Molecular dynamics solves Hamilton's equation of motion for a set of generalized coordinates qi and momenta pi

⋅

q

i

=

∂H

∂pi

⋅

p

i

= −

∂H

∂qi

For a potential V, which is conservative, Cartesian coordinates xi and velocities vi are used, yielding a Hamiltonian

H({pi,xi}) =

∑i

pi2/2mi + V({xi})

From Newton's equations, the forces Fi are given by

Fi = −

∂

∂xi

V({xi})

which are a function of the interparticle distance only.

Put in pragmatic terms, the problem is to calculate these forces efficiently. The majority of the time required is in computing the non-bonded interactions (for N atoms there are approximately N2 interactions, so the complexity of the algorithm is denoted O(N2)). In fact, the computation of bonded forces is even more expensive O(N3), but since there are far fewer bonded interactions this cost turns out to be minor. For certain classes of problems, this has been reduced to O(N). Since this work was done specifically for the domain of structural mechanics where there exist a wide range of constraints for which there is no molecular analog, simplification to the chemical world remains a formidable task.

There have been previous efforts to speed up the direct method of calculating potentials [Hockney 81,Appel 85]. These methods use both heuristics and approximations. For example, many molecular simulations disregard any potential due to particles more than a certain distance from a particular atom. Arnold [Arnold 90] and Rycerz [Rycerz 91] discuss an O(N) algorithm for molecular dynamics simulation that approximates the potential by restricting the pairwise potentials to the nearest neighbors of each atom. Typically these methods have no known a priori estimates of the accuracy of the approximation nor any a posteriori metric of precision.

The most straightforward technique for decreasing the number of non-bonded interactions is to consider only interactions within a preset (or perhaps dynamically adjustable) distance. The computational overhead in doing this becomes significant for larger systems. Lists of "nearby" atoms can be maintained and updated regularly, saving some additional time, but none of these changes the fundamental scaling of the problem - it grows with the square of the number of particles.

When use of neighbor lists becomes impractical, there are two general classes of solutions:

Use an infinite summation method such as the Ewald summation
[Kittel 56] or Fourier methods [Hockney 81]. The latter are quite
efficient, but also quite complicated.

Use two interaction ranges, a short one and a long one. In the
short range list the forces due to particles in the shell between
the short and long range are evaluated and stored on a per particle
basis.

The work described here is essentially a hybrid of these two techniques and provides a significant advantage over each individually while at the same time yielding a scaling of computational work that is a linear function of the number of atoms, not a quadratic one.

We have implemented a multipole algorithm that solves the interparticle gravitational potential (1/r) of N particles to within a prescribable round-off error and whose computational effort grows only linearly with increasing N, i.e. O(N). This makes possible simulations of a large number of atoms on the order of 500,000 or more. Typically, thousands to millions of time steps are required for a realistic dynamic simulation. The linear nature of the cost of computing the potential with the sequential multipole algorithm allows many more time-steps, in less time, than previously practical.

The multipole algorithm can be implemented on a parallel (SIMD - Single Instruction, Multiple Datapath) computer and the computational effort asymptotically approaches O(logN) as the number of separate processors approaches N. Katzenelson [Katzenelson 88] has estimated 24 seconds would be required per time step for a million equally distributed particles on the CM-2 Connection machine (65,536 processors, costing $6MM) using a parallel multipole method. He has shown that even with a smaller array of processors, e.g. 2x2x2, it is possible to get impressive performance improvements.

We plan to extend our initial implementation of the O(N) N-body algorithm to include a Lennard-Jones potential and to implement the parallel multipole potential algorithm for both the gravitational and the Lennard-Jones potential. The net effect of our algorithm is to decrease the number of actual function evaluations. Since direct computation of 1/r is much less time consuming then the computation of 1/r6 − 1/r12, the speed-ups that we will achieve with Lennard-Jones will be far in excess of what are reported for gravitational potentials.

In this work we review the method and its computational structure, and then discuss the efficient implementation of the algorithm on a sequential computer. We numerically verify the algorithm, and compare its speed with that of an O(N2) direct method. We conclude by presenting ideas on how the research concerning this algorithm should proceed.

We must stress that, while all algorithmic analysis uses the capital O to parameterize complexity, the constant of proportionality are not the same for all the algorithms. There is considerable computational overhead in setting up the data structures to permit the faster evaluation, so it is incorrect to assume an N-fold speed-up for all N with this new algorithm.

Multipole Expansion of the Potential

Greengard and Rokhlin [Greengard 86] discovered an algorithm with complexity O(N). Their algorithm computes the potentials (or the forces) by partitioning them into two parts:

φtotal = φnear−field+φfar−field

where φnear−field is the potential due to nearby particles and φfar−field is the potential due to faraway particles. φnear−field is computed by direct evaluation and φfar−field is represented as an absolutely convergent Taylor series. Given a required precision ε, the number of terms p in the Taylor series can be determined in advance to obtain a particular accuracy.

Zhao [Zhao 87] expands the various functions to a series of the derivatives of 1/R where R is the distance from the center of a sphere (whose function is being computed) to the point at which the potential is desired. Unlike Greengard [Greengard 88], who uses spherical coordinates, Zhao's algorithm avoids the evaluation of (direction) cosines by use of Cartesian coordinates (x1,x2,x3).

The algorithm relies on the representation of the potential fields of sources by multipole expansions, and the ability to shift these expansions to common points of reference so they can be summed into single expansions. These expansions represent the potential fields of a region of space. They are evaluated at the position of each source to find the resulting potentials. The ability to choose how many terms of the expansion are computed gives control over the accuracy of the algorithm.

In the next few sections we give a brief overview of Zhao's formulation of the O(N) many body algorithm. For a more detailed explanation of the algorithm see [Zhao 87], [Katzenelson 88], or [Simon 90].

Theorems and Lemmas

The theorems and lemmas derived by Zhao are summarized below. They are essential in understanding the multipole algorithm in the next section. For the detailed statement and proof of these theorems, see [Zhao 87] or [Greengard 88].

The gravitational potential at a point x ∈ R3 due to a point mass μ located at y ∈ R3

φy(x) = −

μ

|| x − y ||

Figure 1 illustrates the quantities needed in the multipole theorems. The potential at point x ∈ R3 is due to mass μ at point y ∈ R3. y(0) is the reference point, and γ is the angle between vectors [(y(0)y)] and [(y(0)x)] . Also there are quantities r = ||x − y ||, r0 = ||x −y(0) ||, and R = ||x − y(0) ||.

Figure 1: Gravitational field.

The multipole theorems produce an expansion for 1/R in a region of analysis outside a sphere. The theorems are:

Multipole Expansion: Provide a multipole series
representation valid outside a local sphere.

Local Expansion: Provide a multipole series
representation valid inside a local sphere.

The ability to sum individual expansions to obtain the multipole expansion of a potential depends on these expansions having a common reference point and a common region of convergence. If they do not, we must shift the series to a common reference point using the lemmas that follow.

Translation of a Multipole Expansion: Shift the coefficients
of a multipole series representation valid outside a local sphere
when the reference point is shifted within the sphere.

Conversion of Multipole into a Local Expansion: Shift the
coefficients of a multipole series representation valid outside a
local sphere when the reference point is shifted outside the
sphere and convert the multipole coefficients to local expansion
coefficients.

Translation of a Local Expansion: Shift the coefficients of a
multipole series representation valid inside a local sphere
when the reference point is shifted within the sphere.

Predicted Error

From the Zhao's Multipole Expansion theorem, the error estimate for the multipole method is:

|ε| ≤ C



r0

R



p+1

where p is the number of terms in the Taylor series expansion. We choose to set the ratio |r0/R| in the error estimate to 1/2 . This restricts the applicability of the expansion to within twice the radius of the circumscribed sphere. Therefore, for a required accuracy of ε, p = log2 ε, we can then expect a factor of 2 decrease in error for each additional term in in the expansion.

As we retain more and more terms in the multipole expansions, the accuracy of the algorithm improves. When p = 20, the error of the computation is at the level of machine round-off precision (or beyond).

In a Cartesian coordinate system, we can define a computational data structure made up of many cubes that represent the three-dimensional space under consideration. We will use this data structure to implement the multipole-expansion method to compute the potential within the predicted error in O(N). The details of this computational data structure are discussed next.

Computational Data Structure

Three-dimensional space may be regarded as a cube which may be divided into eight subcubes, each of which may further be divided into eight subcubes, etc. This decomposition is represented using a tree data structure in which each cube is represented by a node. This data structure is known as an eight-way tree or "oct-tree" [Knuth 81].

Each level of the tree corresponds to one level of the three-dimensional spatial decomposition. Level 0, the root, is equivalent to the entire cube, while level l + 1, the children, is obtained from level l, the parent, by subdivision of each region into eight equal parts. The number of distinct cubes at level l is equal to 8l. Figure 2 illustrates the decomposition of the N-body cube.

Figure 2:

The N-body cube and the first two levels of refinement, indicated by the solid and dashed lines, respectively.

The cube is continually subdivided until there are only a few particles associated with each node. The nodes of a level that are not further decomposed are leaves. Given a uniform spatial distribution of particles, the level of decomposition grows logarithmically with the number of particles to enforce the constraint that the number of particles in each leaf cube is bounded by a predetermined constant. The level of decomposition, or height of the tree, has been determined to be log8 N. Choosing fewer levels will force more particles per leaf node. If we define the average number of particles for a leaf node to be n, then the condition n << N must be met to maintain the linear nature of the multipole algorithm. Conversely, if we choose more levels, then too much time is spent calculating the far-field potential.

For each cube we define its near field, far field, and interactive field. The near-field of a cube consists of neighboring cubes at the same level of decomposition such that the distance from the center of the cube to the center of the near-field neighbor is within 2√3 times the cube edge length. This value is the body diagonal of the cube and the diameter of the circumscribed sphere as required by the multipole theorems. The number of nodes in the near-field is bounded by 81. The far-field of a cube is the exterior of the near-field. Finally, the interactive-field of a cube is the part of its far-field that is in the near-field of the cube's parent. That is, all cubes in the parent's near-field that are not in the near-field of the current cube under consideration are members of the interactive field. The number of nodes in the interactive-field is bounded by 567. There are 8 interactive-field nodes for each of the 81 near-field nodes of the parent, minus the 81 near-field nodes of the current node (8*81−81 = 567).

The near, far, and interactive fields in two dimensions are illustrated in figure 3.

Now we have defined the computational data structure of the O(N) N-body algorithm. The next section describes how Zhao's algorithm can utilize this structure to compute the potential of N particles in three-dimensions.

Algorithm

A formal description of the multipole-expansion algorithm follows. It comes directly from [Zhao 87]. The notation used to describe the multipole-expansion algorithm includes:

φil the p-term multipole expansion about the
center of cube i at level l , describing the contribution
to the potential field in the far-field region of cube i at level
l due to the particles in cube i.

ψil the p-term local expansion about the
center of cube i at level l , which represents the
contribution to the potential field in cube i at level l due to
particles in the far-field region of cube i.

Initial expansion: for each particle i in a leaf
node, compute φin(x) .

Compute for each of the particles in the cube i the
series expansion for the potential due to the particle by using
the Multipole Expansion theorem.

Sum these expansions for particles in cube i to form φin(x) for the cube i .

Upward path: for each of the cubes i at level l ,
compute φil(x) . φil(x) is a multipole expansion
about the center of a cube for all intermediate levels representing
the potential field due to all particles contained in the cube.

Shift φil+1(x) of cube i 's eight subcubes
to the center of cube i by using the
Translation of a Multipole Expansion lemma.

Sum the shifted φil+1(x) of cube i 's eight
subcubes to form φil(x) of cube i .

Downward path: for each cube i at level l ,
compute ψil(x) . ψil(x) , a local expansion,
describes the field due to all particles in the system that are not
contained in the current cube or its nearest neighbors.

Shift φil(x) of the interactive-field to the center of
cube i using the Conversion of Multipole to a Local Expansion
lemma.

Shift ψil−1(x) of cube i 's parent cube to the
center of the cube i , by the Translation of a Local Expansion
lemma.

Sum (1) and (2) to form ψil(x) for the cube i .

Final Evaluation: for each of the particles in the leaf
nodes, compute the potential at the particle. Evaluate ψin(x) to generate the potential or force due to all
particles outside the nearest neighbor cubes at the leaf level.
Include the potential due to nearest neighbors by direct evaluation.

For the leaf cube i that contains the particle p ,
evaluate ψin(x) at the position of p .

Compute direct interactions with particles in cube i and
its near-field.

Sum (1) and (2) as the desired potential for the particle p .

Force is obtained directly from ψin(x) by taking
the partial derivatives.

Each local expansion is described by the coefficients of a p-term polynomial. Direct evaluation of this polynomial at a particle yields the potential. The force is available analytically and therefore there is no need for numerical differentiation.

Zhao's O(N) algorithm distinguishes between the computation of the potential due to the near field particles ("near-field potentials") and far field particles ("far-field potentials"). The near-field potential is computed using direct evaluation. The far-field potential is obtained by evaluating the p-term multipole expansion at the particle position.

The goal of the multipole computation is to compute φfar−field with O(N) work. This is achieved by propagating values, first from the leaves to the root through the computational tree, then from the root to the leaves in two distinct phases. In the first phase the far-field interactions φil are computed by shifting the multipole expansions of the child nodes to the center of the parent node and adding the results. φil is calculated for all cubes at all levels, beginning at the finest level of refinement (the leaf nodes). The multipole-expansion of the field due to each particle is calculated relative to the center of each leaf node and summed. Proceeding upward, the expansions of the eight child nodes are shifted relative to the parent's center and summed. This computation is performed at each level propagating values upward through the tree. At the end of this phase of the algorithm, each node at every level has a multipole-expansion representation of the field due to all particles within its boundaries. These expansions are only valid outside the region defined by the near field of each cube.

In the second phase, we form the local expansions ψil for all cubes at all levels beginning at the coarsest level (root node). The far-field multipole-expansions of all nodes in the interactive field of the node under consideration are converted into local expansions relative to the node's center and are summed. Next, shift the expansion of the parent cube to the center of the current cube under consideration and sum with the local expansions computed from the interactive field. Note, the root node does not have a parent or interactive field so it's local expansion is defined to be zero. The local expansions are propagated down the tree. They describe the field due to all particles in the system that are not contained in the current node or its nearest neighbors.

Finally, for each cube i at the finest level n, we evaluate the local expansion φin for each particle in cube i. Each local expansion is described by the coefficients of a p-term Taylor series. Direct evaluation of this series at a particle yields the potential. Summing the local expansion at each leaf node forms φfar−field. The interactions of each particle in cube i are computed directly and summed to form φnear−field. Finally summing φfar−field and φnear−field gives us the desired potential.

Multipole-expansion Computer Program

A computer program using this multipole-expansion algorithm has been implemented in the C computer language. The efficient implementation of the algorithm requires a language that supports data structures and dynamic memory. C provides both and is also portable across many computer architectures. The source code is in appendix B. The code has been compiled and executed on many computers including the Cray Y-MP, IBM RS/6000, Stardent Titan, Sun 4/280, Decstation 3100, MIPS 3420, and a VAX 8800.

There are significant optimization opportunities for the multipole-expansion algorithm. In the next section, we discuss the optimizations (both implemented and potential) of the algorithm. Next, we discuss the memory requirements of our method. Finally we present results from the execution of the algorithm in which we compare speed and error of the multipole-expansion method to that of the direct evaluation method.

Algorithm Optimizations

Reviewing the algorithm and monitoring of the algorithm execution revealed opportunities for both machine independent and machine dependent optimizations.

The machine independent optimizations we discovered follow. Some of the optimizations are discussed in [Zhao 87] and [Katzenelson 88]. This fact is noted in the relevant optimization discussion.

Use super-nodes to reduce the number of interactive-field
nodes from 567 to 140.

Shifting expansions due to interactive-field potentials are very expensive. This is where most of the work of the algorithm occurs. For each node of the tree, we shift all members of the interactive-field to the current node. There are 567 nodes in the interactive field. We can reduce the shifting on expansions in the interactive-field from 567 to 140 by using super-nodes. That is, we replace eight child nodes by their common parent node if all eight nodes are in the interaction-field but not in the near-field of the node under consideration. This modification speeds up the algorithm by a factor of about 8.

The Translation of a Multipole Expansion lemma translates an expansion from one center of a cube to another. The coefficients of the translated expansion, cijk are

cijk =

i∑α = 0

j∑β = 0

k∑γ = 0

aαβγa′i−α,j−β,k−γ

Here a′αβγ are the coefficients in the expansion for 1/R, measured from the old cube center with respect to the new center. Additionally, the Conversion of Multipole into a Local Expansion lemma converts a multipole into a local expansion. The coefficients of the converted expansion, cijk, are

cijk =

1

i!j!k!

∞∑α,β,γ = 0

aαβγbi+α,j+β,k+γ(i+α)!(j+β)!(k+γ)!

bijk are the coefficents of the local expansions for 1/R, measured from the old cube center with respect to the new center. Finally, the Translation of a Local Expansion lemma shifts an expansion from one center of a cube to another. The coefficients of the shifted expansion, dijk are

dijk =

∞∑α,β,γ = 0

ci+α,j+β,k+γ




i + α

α







j + β

β







k + γ

γ




∆xα1 ∆xβ2 ∆xγ3

Here (∆x1,∆x2,∆x3) are the coordinates of the old cube center with respect to the new center.

a′αβγ, bijk, and ∆xi are constants and can be precomputed. The number of different sets of the a′αβγ and ∆xi constants are the number of children of node, i.e. 8. These constants at different levels are derived by multiplying by the appropriate power of 2. The number of different sets of constants, bijk , is the size of the interactive field of a node, i.e. 140. These constants at different levels are also derived by multiplying by the appropriate power of 2.

The precomputation of these constants speeds up the algorithm by avoiding repeated calculations. Simon [Simon 90] achieved a factor of 2 speed-up by using these precomputed constants.

M

This optimization was discovered by the author (Cristy) and was subsequently confirmed by Katzenelson.

This optimization is not yet implemented.

Store expansions as a tetrahedral array to reduce the memory requirements of the algorithm.

If we retain only terms i + j + k ≤ p of each multipole and local expansions, then the total number of terms is

np =

p∑k=0

p−k∑j=0

p−k−j∑i=0

=

1

6

(p+1)(p+2)(p+3)

This formulation only requires the upper tetrahedral portion of a cubic array to store the expansions. Allocating memory for the full extent of the expansions in all three dimensions is wasteful. Instead we store the coefficients in lexicographic order [Knuth 81,Merrifield 91]. This reduces the memory requirements from p3 to

1

6

(p+1)(p+2)(p+3)

To solve the potential to within machine precision we require that p = 20. The multipole-expansion of p = 20 requires 1771 terms. A cubic array would require 8000 memory locations, whereas storing the terms in lexicographical order in the upper tetrahedral portion of a cubic array reduces the memory requirement to 1771 memory locations.

Using Newton's third law, we need only compute half of the direct pairwise interactions.

Given two sources p and q, we need only compute the potential between p and q once. To assure that the pairwise potential between any two sources is only calculated once, we introduce this simple heuristic:

if ¯ (address(p) < address(q)) { ¯
compute φpq
φp += φpq
φq += φpq}

where φpq is the potential between the pth and qth particle.

The above heuristic works equally well when directly evaluating the
force. The force is obtained by taking the gradient of the the potential
field

where [F]pq is the force between the pth and qth particle.
As expected, this yielded a 2x speed-up.

Simplify factors in the formulas to avoid many calculations.

Many calculations are repeated at each level of the decomposition. By simplifying common factors in the formulas, they can be removed and applied only at the leaf level. This avoids many calculations at the intermediate levels.

Specifically, the Conversion of Multipole into a Local Expansion
lemma converts a multipole into a local expansion. The coefficients of
the converted expansion, cijk, are

cijk

=

1

i!j!k!

∞∑α,β,γ = 0

aαβγbi+α,j+β,k+γ (i+α)!(j+β)!(k+γ)!

bijk

=

(−1)i+j+k

ri+j+k+10

k/2 ∑γ = 0

j/2 ∑β = 0

i/2 ∑α = 0

AB

A

=




−

1

2

i+j+k−α−β−γ




(i+j+k−α−β−γ)!

(i−α)!(j−β)!(k−γ)!

B

=




i − α

α







j − β

β







k − γ

γ




(2ν1)i−2α (2ν2)j−2β (2ν3)k−2γ

Simplifying the formula results in this optimization:

cijk

=

1

i!j!k!

∞∑α,β,γ = 0

aαβγbi+α,j+β,k+γ

bijk

=

(−1)i+j+k i!j!k!

ri+j+k+10

k/2 ∑γ = 0

j/2 ∑β = 0

i/2 ∑α = 0

AB

A

=




−

1

2

i+j+k−α−β−γ




(i+j+k−α−β−γ)!

(i−α)!(j−β)!(k−γ)!

B

=




i − α

α







j − β

β







k − γ

γ




(2ν1)i−2α (2ν2)j−2β (2ν3)k−2γ

The Translation of a Local Expansion lemma shifts an expansion from one center of a cube to another. The coefficients of the shifted expansion, dijk are

Store factorials and binomial coefficients in a table to avoid repeated calculations.

The factorials and binomial coefficients required by the multipole lemmas are redundant. These values can be stored in a table to avoid repeated calculations.

Specifically

i!j!k!

and




−

1

2

i+j+k−α−β−γ




(i+j+k−α−β−γ)!

(i−α)!(j−β)!(k−γ)!

of the multipole lemmas can be precomputed to speed up the algorithm.

This optimization was discovered by the author (Cristy) and was subsequently confirmed by Zhao.

The following machine dependent optimizations were included in the code found in appendix B. Each was discovered by the author (Cristy).

Store expansions as a tetrahedral array to speed-up referencing of the
coefficients.

The indexing of three-dimensional arrays on most computer architectures is expensive. Memory is stored on computers as one-dimensional areas and higher dimensional arrays require a mapping to the one-dimensional area. We can reduce a three-dimensional array to a one-dimensional array by using lexicographical ordering of a tetrahedral array. In most cases, the multipole and local expansions are accessed in lexicographic order, thus eliminating the expensive mappings. In a few cases, the expansions are accessed randomly. Merrifield [Merrifield 91] derived a formula for randomly accessing our expansions:

Np(i,j,k) =




p+3

3




−




p−i+2

3




−




p−i−j+2

2




+ k

Convert divisions in formulas to multiplications to speed-up the calculations.

Divisions on most computer architectures generally cost 2-5 times more in computer time than multiplication. Additionally, division loses one bit of precision due to the finite precision of a computer. We changed all but one division into multiplication in the algorithm by computing the inverse value of any constants. As an example, i!j!k! is a common divisor in the multipole theorems and lemmas. We first compute 1/(i!j!k!) then use this value as a multiplier instead.

Allocate many nodes for the decomposition tree at one time.

Dynamically allocating memory on computers can be expensive in computer time. To avoid excessive calls to allocate memory, a large queue of nodes is allocated at the beginning of the program execution. The nodes of the decomposition tree are then allocated from this queue. This technique also helps prevent memory thrashing by inducing spatial coherence. That is, nodes that are "close-by" in the tree decomposition are generally close-by in memory.

Memory Requirements

In addition to reducing the execution time of the multipole-expansion algorithm, we are also concerned about its computer memory requirements. The multipole-expansion method requires maintaining the decomposition tree in memory and therefore requires more memory than the naive direct evaluation method. Computers that are memory constrained may inhibit good running times of the algorithm due to excessive paging. In some cases the algorithm may not run at all due to shortages of memory. The memory requirements of the multipole-expansion algorithm are discussed here.

The particles themselves require storage. Each particle has a position associated with it. Other information may also be associated with each particle such as velocity, acceleration, mass, etc.

The total number of cubes at all levels of the decomposition in the multipole-expansion tree is

80 + 81 + ... + 8n =

8n+1−1

7

=

8N − 1

7

Each node of the decomposition tree stores the coordinate of the node center, pointers to its children and its parent, and pointers to its near-field and interactive-field. Each node also stores the multipole and local expansions.

Each node has these memory requirements in computer words:

3 - x, y, and z coordinate of the node center.

1 - parent pointer.

8 - children pointers.

80 - near-field pointers.

140 - interactive-field pointers.

φin(x) and ψin(x) -

1

6

(p+1)(p+2)(p+3)

coefficients.

For a large number of particles, the memory required can potentially reach many gigabytes. Reducing this memory requirement is highly desirable.

A tree of depth four, for example, has 4673 (1+64+511+4095) nodes. With p = 3 we need 20 terms in each expansion. The total memory words required just to store the expansions is about 1,275 kilobytes.

On some computer architectures (i.e. Connection Machine) it is possible to address the near and interactive fields directly. This improvement can save us over 220 words of memory per node. Addressing the near and interactive fields directly is discussed in section 6.

Execution Time of the Algorithm

In this section we compare the execution time of the O(N) multipole-expansion computer program to that of direct evaluation. The results were obtained from the program running on an IBM RS/6000 model 550. For each of our tests, we use the super-node heuristic (section 5.1) and require the average accuracy to be at least 10−3. Note, that in these tests we did not precompute constants in the lemmas to avoid repeated calculations. A significant speed-up is expected when this is implemented.

For the speed comparison tests, the positions of the particles were randomly distributed within a cube and the resulting density was roughly uniform. Table 1 compares the execution of the multipole-expansion algorithm to that of direct evaluation. The number of terms in the expansion, p, was chosen as 3 and the calculated potentials are accurate in average to about 10−4. The running time excludes time spent executing in system mode and is accurate to the nearest second. Figure 4 shows the results of the test on the running time of the algorithm as a plot. In the plot, both the running-time axis and number-of-particles axis are scaled logarithmically.

From table 1 and figure 4 one can see that the CPU time requirements of the multipole algorithm grow linearly with number of particles. The improvement in speed, as expected, increases with the number of particles. Note that the 256,000 particle case executes over 30 times faster than the direct method.

Note, that due to the overhead of creating the decomposition tree for the multipole method, the direct method is faster for systems of fewer than 512 particles.

Number of

Execution Time (seconds)

Particles

Multipole-expansion

Direct

Speed-up

511

1

1

1.00

1023

1

1

1.00

2047

3

4

1.33

4095

11

17

1.55

8191

18

67

3.72

32767

165

1084

6.57

262143

2307

69218

30.00

Table 1: Execution time of the multipole-expansion method with p = 3 and 80 near-field nodes compared to that of the direct evaluation method.

Figure 4: Execution time growth rate of the multipole-expansion
method compared to that of the direct evaluation method.

In the remaining tests, the number of terms in the expansions are noted in the table captions as p. They show what effect changing p or the near-field size has on the speed and accuracy of the algorithm. Table 2 shows how the speed-up of the algorithm is reduced as p is increased from 3 to 6. Increased p means improved accuracy at the cost of longer running times.

Number of

Execution Time (seconds)

Particles

Multipole-expansion

Direct

Speed-up

511

2

1

0.50

1023

2

1

0.50

2047

4

4

1.00

4095

13

17

1.30

8191

53

67

1.26

32767

203

1084

5.34

262143

2790

69218

24.80

Table 2: Execution time of the multipole-expansion method with p = 6 and 80 near-field nodes compared to that of the direct evaluation method.

The highest cost per node is associated with converting the multipole to a local expansion for nodes in the interactive field. By reducing the number of near-field nodes, and thus the interactive-field nodes, the program can be sped up at the cost of introducing more error. A smaller near-field also reduces the number of direct calculations. The accuracy of the algorithm improves as we retain more and more terms in the multipole expansion. The error can therefore be compensated by increasing p. If great accuracy is not required, we can reduce the near field size and retain just a few terms in the expansions. The combination of these heuristics can significantly speed-up the algorithm. The effects of this trade-off are shown in tables 3 and 4.

Number of

Execution Time (seconds)

Particles

Multipole-expansion

Direct

Speed-up

511

1

1

1.00

1023

1

1

1.00

2047

3

4

1.33

4095

9

17

1.89

8191

22

67

3.05

32767

133

1084

8.15

262143

1785

69218

38.78

Table 3: Execution time of the multipole-expansion method with p = 4 and 56 near-field nodes compared to that of the direct evaluation method.

Number of

Execution Time (seconds)

Particles

Multipole-expansion

Direct

Speed-up

511

1

1

1.00

1023

1

1

1.00

2047

3

4

1.33

4095

6

17

2.83

8191

18

67

3.72

32767

78

1084

13.90

262143

975

69218

70.99

Table 4: Execution time of the multipole-expansion method with p = 5 and 26 near-field nodes compared to that of the direct evaluation method.

In addition to good running-times, we are interested in getting accurate results. The accuracy of the computer program is discussed in the next section.

Accuracy of the Computer Program

The accuracy of the multipole-expansion method can be measured by comparing the results of multipole-expansion method to the exact potentials (within machine precision) computed with the direct method for the same set of particles. The near-field of each node is defined so that the ratio in the error bounds, described in section 4.2, is 1/2. Table 5 confirms the accuracies of the results obtained by the multipole algorithm as in agreement with the error bounds. p is the highest degree of harmonics retained in an expansion. As p increases the accuracy of the multipole-expansion method improves.

The error due to truncating a multipole expansion is a mixed-sign series. Consequently, the error from truncating the multipole expansion does not necessarily decrease monotonically when retaining more terms. While the theoretical bounds on the maximum error decreases exponentially with an increase of precision, the actual error decays irregularly. However, the error will always be less than the predicted error as p increases.

Table 6 shows the effect of different values of p and near-field sizes. We expect, when p is increased from 3 to 6, the difference in maximum error between the multipole-expansion and the direct method to be reduced. This greater accuracy requires more computation. From the previous section we know that we can speed up the algorithm by reducing the near-field size. Table 6 shows that the error introduced by this heuristic is still within our bounds of required accuracy.

Number of

Particles

Near-field

Precision

Potential Energy

εpotential−energy

511

80

3

1.91

0

511

80

6

1.91

0

511

56

4

1.91

0

511

26

5

1.91

0

1023

80

3

7.67

1.6 x 10−5

1023

80

6

7.67

3.8 x 10−7

1023

56

4

7.67

1.0 x 10−5

1023

26

5

7.67

5.4 x 10−6

2047

80

3

30.64

7.0 x 10−5

2047

80

6

30.64

5.4 x 10−6

2047

56

4

30.64

3.2 x 10−5

2047

26

5

30.64

4.8 x 10−5

4095

80

3

122.79

1.7 x 10−4

4095

80

6

122.79

2.0 x 10−5

4095

56

4

122.79

4.3 x 10−5

4095

26

5

122.79

3.4 x 10−4

8191

80

3

491.47

5.8 x 10−5

8191

80

6

491.47

1.0 x 10−4

8191

56

4

491.47

1.1 x 10−5

8191

26

5

491.47

1.6 x 10−3

32767

80

3

7897.16

8.3 x 10−4

32767

80

6

7897.16

1.6 x 10−3

32767

56

4

7897.16

3.6 x 10−4

32767

26

5

7897.16

0.03

262143

80

3

504945.13

0.15

262143

80

6

504945.13

0.11

262143

56

4

504945.13

4.5 x 10−3

262143

26

5

504945.13

2.2 x 10−4

Table 6: Accuracy test of the multipole-expansion method for various near-field and p configurations.

Future Research

The algorithm as implemented demonstrates that the gravitational potential can be calculated in O(N). There are still many areas that need to be investigated concerning the N-body algorithm. These areas include

Implement the Lennard-Jones potential for molecular dynamics.

Lustig [Lustig 90] has derived the multipole theorems required to
apply the O(N) N-body algorithm to the Lennard-Jones potential:

φ(r) = 4ε



Cr

r12

−

Ca

r6



Only Zhao's two multipole theorems differ substantially between the gravitational potential and Lennard-Jones potential. The lemmas concern shifting expansions to a common reference point and remain unchanged for the Lennard-Jones potential. Therefore, we expect that modifying the existing code for the Lennard-Jones potential will be straightforward.

Greengard [Greengard 88] shows that the O(N) multipole algorithm remains valid with periodic boundary conditions found in molecular dynamics. However, the current C code will need to be enhanced to account for periodic boundary conditions.

Implement the O(logN) parallel multipole algorithm.

Zhao [Zhao 87] proves that the parallel multipole algorithm computational effort grows only logarithmically with increasing N as long as the number of processors is greater than the number of particles. Zhao and Johnsson [Zhao 89] show that the multipole calculations for each node at a particular level can be computed concurrently. Additionally, all near-field calculations can be computed in one parallel step. Katzenelson [Katzenelson 88] shows that the parallel multipole algorithm has significant speed-up for a number of different processor configurations of the SIMD computer. Zhao and Katzenelson both have implementations that solve for the gravitational potential of over a million particles in less than a minute of computer processor time on a 65,536 processor machine.

Molecular dynamics simulations follow the trajectories of a number of atoms over some time interval. The parallel multipole method makes possible the evaluation of the interatomic potential of a large number of atoms, over many time-steps, in a reasonable amount of time.

Implement an adaptive multipole algorithm.

The multipole algorithm assumes a near-uniform distribution of particles to achieve a computational effort that grows only linearly. For distributions with clumps of particles, the performance of the algorithm will degrade. Zhao [Zhao 87] and Greengard [Greengard 88] both presented adaptive algorithms that are independent of the statistics of the distribution of particles. In the adaptive algorithm we specify a value s > 0; at every level of decomposition we subdivide only those cubes that contain more than s particles.

Simon [Simon 90] shows that if the number of terms in the p-term expansion exceeds 8, it is beneficial to use spherical rather than Cartesian coordinates. A simulation requiring extreme accuracy may benefit from an implementation of Greengard's algorithm.

Develop sophisticated methods of defining near and interactive fields.

On sequential computers, the near-field and interactive-field can be computed directly by assigning node addresses that reflect the implicit relationship of the decomposition tree. Given the address i of a node, we know that its parent node has an address of (i−1)/8, and its child nodes have addresses from 8i+1 to 8i+8. By representing the tree in this way, we avoid having to search the tree at each node for its near-field and interactive-field. We can precompute these fields and store them. We can avoid storing the near and interactive field pointers altogether by computing their addresses each time we visit a node.

This relationship can be extended to parallel computers where a node is represented by the address of an individual processor.

Modify the program so it can be vectorized.

Vectorization of algorithms can provide significant speed-ups for sequential algorithms. The sequential multipole-expansion algorithm is already many times faster for evaluating the potential for a large number of particles as compared to the direct method. Vectorizing the multipole-expansion algorithm will further improve this speed-up over direct evaluation.

Acknowledgements

Steve Lustig spent many hours introducing me to numerical simulations and helping me understand the mathematical foundation of the multipole algorithm. Steve also derived the theorems for the multipole Lennard-Jones potential and helped me debug the C code. Discussions with Feng Zhao helped me overcome some initial misunderstanding about the algorithm.

Richard Merrifield derived the lexicographical ordering of a tetrahedral array, a useful optimization of the algorithm.

Steve Lustig, Richard Merrifield, and Karen Bloch reviewed this paper and made suggestions for its improvement.

S. Lustig, Toward a Molecular Dynamics Algorithm with O(logN) Efficiency:
Multipole Series Representations and Shifting Lemmas for the Lennard-Jones
12-6 Potential, du Pont Central Research and Development Memo, November
1990.

Multipole-expansion C Code

This appendix contains sequential code implementing the three-dimensional N-body algorithm in the C language. The numbers to the left of the code is the number of times the statement has been executed for 1023 uniformly distributed particles and 3 terms in the expansions. Lines which have not been executed are prefixed with #####.