N-Body Simulation on Heterogeneous Cluster

1. The N-body algorithm.

The N-body simulation models the evolution of many particles interacting
under gravitational force in a universe. Each particle has seven values
representing its status:

P_i = ( m x y z vx vy vz )

This means that the particle P_i, with mass m at location (x, y, z), is
traveling at velocity (vx, vy, vz). For N particles, the index i runs
from 0 to N-1. The gravitational force between P_i and all other
particles in the universe exerts a force, and hence an acceleration
a_i = (ax, ay, az), on it.

The simulation proceeds in discrete time steps of duration dt. In each
step, the velocity is used to update the position of the particle, and
the newly computed acceleration is used to update the velocity. Thus all
particles have a new status after each simulation step.

The acceleration is computed using the following procedure, where EPS is
a small constant that prevents extreme force values when two particles
are too close to each other. In practice, EPS = 0.
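
A minimal C sketch of this procedure is given below. The function and
variable names are illustrative rather than taken from the benchmark
source, and EPS is applied here as a softening term added to the squared
distance; the reference code may arrange the operations differently.

    #include <math.h>

    /* Inner loop: acceleration of particle i due to all other particles. */
    void accel_one(int i, int n, float eps,
                   const float *m, const float *x,
                   const float *y, const float *z,
                   float *ax, float *ay, float *az)
    {
        float axi = 0.0f, ayi = 0.0f, azi = 0.0f;

        for (int j = 0; j < n; j++) {
            if (j == i)                       /* skip self-interaction      */
                continue;
            float dx = x[j] - x[i];           /* 3 subtractions             */
            float dy = y[j] - y[i];
            float dz = z[j] - z[i];
            float r2 = dx * dx + dy * dy
                     + dz * dz + eps;         /* 3 multiplications, 3 adds  */
            float inv = 1.0f / sqrtf(r2);     /* 1 square root, 1 inverse   */
            float s = m[j] * inv * inv * inv; /* 3 multiplications          */
            axi += dx * s;                    /* 3 accumulations             */
            ayi += dy * s;
            azi += dz * s;
        }
        ax[i] = axi;
        ay[i] = ayi;
        az[i] = azi;
    }

    /* Outer loop: visit every particle P_i. */
    void compute_accelerations(int n, float eps,
                               const float *m, const float *x,
                               const float *y, const float *z,
                               float *ax, float *ay, float *az)
    {
        for (int i = 0; i < n; i++)
            accel_one(i, n, eps, m, x, y, z, ax, ay, az);
    }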

In this example, we choose the duration of each time step as the unit
time for the velocity and acceleration computation. Updating a
particle's position and velocity therefore amounts to simply adding the
previous velocity and the current acceleration, respectively.
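
Under this convention the per-step update is a pair of additions per
component, as in the following sketch (names follow the code above):

    /* Position advances by the previous velocity, then velocity advances
     * by the newly computed acceleration (dt = 1). */
    void update_particles(int n,
                          float *x, float *y, float *z,
                          float *vx, float *vy, float *vz,
                          const float *ax, const float *ay, const float *az)
    {
        for (int i = 0; i < n; i++) {
            x[i]  += vx[i];
            y[i]  += vy[i];
            z[i]  += vz[i];
            vx[i] += ax[i];
            vy[i] += ay[i];
            vz[i] += az[i];
        }
    }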

All data, including input, output and internal computation, are in
32-bit IEEE-754 single precision format. To compute the acceleration for
a particle, the position and mass of every other particle must be read.
Thus 16 bytes (4 bytes x 4) of data are read in each iteration of the
inner loop of the above algorithm. In the inner loop, 17 floating point
operations are performed: 3 subtractions, 3 additions, 6
multiplications, 1 square root, 1 inverse and 3 accumulations. So the
computation to data ratio is

Rcd = 17 / 16 = 1.0625 FOPB (floating point operations per byte)

Both the inner and outer loop iterate over all particles. Thus there are
a total of N*(N-1)*17 operations in each simulation step.
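For example, with N = 81920 (the data set used below), this is
81920 * 81919 * 17, or about 1.14 * 10^11 floating point operations per
simulation step.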

2. Input and Output.

The original input source of this example is the Dubinski 1995 data set
from the N-body Data Archive web site.

The file contains 81920 particles in ASCII format. To simplify and speed
up the benchmark programs, the file is reformatted into a binary format.
The binary data file contains a sequence of 32-bit floating point
numbers following the appearance order of the original Dubinski data
file. The simulation data file for the FPGA implementation contains
hexadecimal numbers in ASCII format.

The output file is in the same format as the input file and contains the
updated status of the particles after the simulation.
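
A possible way to read the binary file in C, assuming the seven
per-particle values appear consecutively in the order
(m, x, y, z, vx, vy, vz); the struct and function names are
illustrative:

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        float m, x, y, z, vx, vy, vz;   /* 7 x 32-bit single precision */
    } particle_t;

    /* Load n particles from a binary data file; returns NULL on error. */
    particle_t *load_particles(const char *path, int n)
    {
        FILE *fp = fopen(path, "rb");
        if (!fp)
            return NULL;

        particle_t *p = malloc((size_t)n * sizeof(particle_t));
        if (p && fread(p, sizeof(particle_t), (size_t)n, fp) != (size_t)n) {
            free(p);
            p = NULL;
        }
        fclose(fp);
        return p;
    }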

3. Single thread CPU implementation.

The reference implementation is a single thread C program targeting the
Intel x86 architecture. It has been compiled using GCC and tested on a
Linux platform. To run the reference design, type the following command:

./nbody ../dat/nbody_81920.dat 81920 0 1

The first command line argument is the path and file name of the input
data. The second is the number of particles to be simulated. The third
argument is the value of EPS and the last one is the number of steps to
be run in the simulation.

The reference implementation displays performance information. One
example is shown below.

4. Multi-threaded CPU implementation.

OpenMP is used as the framework to parallelize the computation on a
multi-core system. The parallel for #pragma is used to parallelize the
main loop that computes the acceleration of all particles in the current
time step. The particle data, marked as shared variables, are
partitioned evenly according to the loop index. There is no data
dependency between the acceleration computations of different particles.
The results are written to the corresponding locations in the
acceleration array, which is also marked as shared. No write conflicts
exist in this stage.

The update of the position by the current velocity and the update of the
velocity by the computed acceleration are performed outside the OpenMP
parallel construct, as in the sketch below.
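
A sketch of this parallelization; accel_one stands for the per-particle
inner loop from the Section 1 sketch and the names remain illustrative:

    #include <omp.h>

    /* Per-particle inner loop, as sketched in Section 1. */
    void accel_one(int i, int n, float eps,
                   const float *m, const float *x,
                   const float *y, const float *z,
                   float *ax, float *ay, float *az);

    void compute_accelerations_omp(int n, float eps,
                                   const float *m, const float *x,
                                   const float *y, const float *z,
                                   float *ax, float *ay, float *az)
    {
        int i;

        /* Shared particle data; the loop index is partitioned evenly and
         * each thread writes only its own ax[i], ay[i], az[i], so there
         * are no write conflicts. */
        #pragma omp parallel for schedule(static)
        for (i = 0; i < n; i++)
            accel_one(i, n, eps, m, x, y, z, ax, ay, az);

        /* The position and velocity updates are performed afterwards,
         * outside the OpenMP parallel construct. */
    }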

In the AXEL cluster, we run the OpenMP version of the N-body simulation
on a dual-CPU platform with 4 cores per CPU. The program is instructed
to run with 4 parallel threads so that the side effects of system
loading are minimized. The performance is measured in the same way as in
the single thread reference design and is shown below.

5. GPU implementation.

An implementation targeting a many-core GPU is created for the N-body
simulation. The computation of the accelerations of all particles in the
current time step is offloaded to the GPU, while the CPU updates the
positions and velocities afterwards.

The particle information needed for the acceleration computation is
copied to GPU memory at the beginning of each simulation time step. This
includes the mass and position in the format (m, x, y, z). The
acceleration vectors (ax, ay, az) are read back from GPU memory after
the kernel finishes.

In the kernel, we dedicate one thread per particle. Each thread loops
through all other particles to generate the acceleration value and
writes it to GPU memory. The computation in the loop is the same as in
the source code of the CPU reference design. No special optimization is
applied to the kernel, and the data are stored in GPU main memory with
the same layout as in CPU main memory.

The performance of this design is not as good as the highly optimized
version of the N-body simulation in the CUDA SDK. Here, we want to
provide a fair comparison of processing power by ensuring that the same
type and amount of computation is performed on each processor.

For the same set of input vectors, the results of the GPU program differ
from those of the CPU counterpart. This is due to differences in the
implementation of the floating point operators on each platform.

6. FPGA implementation.

This implementation assumes there are four independent memory banks
connected to the FPGA, each with a 32-bit data bus. The host program
puts the values of m, x, y and z in separate memory banks so that this
information can be loaded into the FPGA in the same memory access cycle,
as sketched below. The host program also sends auxiliary information to
the core, including the number of particles in memory, N, and the
starting and ending index of i in the outer loop.
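
A sketch of how the host could stage the particle data into the four
banks; the buffer names are illustrative and particle_t is the layout
from the Section 2 sketch:

    typedef struct {
        float m, x, y, z, vx, vy, vz;   /* as in the Section 2 sketch */
    } particle_t;

    /* One value stream per bank, so m, x, y and z of the same particle
     * can be fetched in a single memory access cycle. */
    void stage_banks(int n, const particle_t *p,
                     float *bank_m, float *bank_x,
                     float *bank_y, float *bank_z)
    {
        for (int i = 0; i < n; i++) {
            bank_m[i] = p[i].m;
            bank_x[i] = p[i].x;
            bank_y[i] = p[i].y;
            bank_z[i] = p[i].z;
        }
    }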

A deeply pipelined data path is created for the acceleration
computation. All floating point operations are performed by macros
generated from the Xilinx Core Generator. The latency of the pipeline is
132 clock cycles and it can operate at up to 400MHz.

First, the (x, y, z) values of P_i are read from memory and stored in
internal registers. Then the values of all particles are read as the P_j
inputs. The P_j data are fed to the pipeline one per clock cycle
whenever the data from the memory read operation are valid. After all
P_j are read, the values held inside the pipeline are drained. Extra
clock cycles (56 in this example) are needed to perform the final
reduction and output the internal values of the accumulator before
starting the next iteration of the outer loop with a new P_i.

Further optimization is possible by overlapping the draining and
reduction phase of the current iteration with the feeding phase of the
next iteration. However, this costs extra control logic resources and
affects the achievable maximum working frequency. We consider the
current time overhead (192 clock cycles per outer iteration) negligible
since the number of particles, N, is sufficiently large. This overhead
is roughly 0.2% when 81920 particles are processed as in the example.

The final acceleration vectors (ax, ay, az) are stored in a FIFO which
the host program can read through the register interface of the FPGA
(uint32_t fpgaReg[USER_REG]). The index of the current P_i is also
output so that the host program knows how many data items should be
read by comparing it with its local index. The host program updates the
position and velocity vectors after all acceleration vectors are read.
A polling scheme with a 3us interval is used to minimize both host CPU
utilization and FPGA idle time due to a full FIFO. The host-FPGA
communication time overlaps with the FPGA computation time, so no extra
time overhead is introduced.
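
A hypothetical sketch of the host-side polling loop is shown below. Only
the fpgaReg[] register window and the 3us polling interval come from the
description above; the register indices, helper name and read semantics
are invented for illustration:

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    #define IDX_REG  0   /* hypothetical: index of the last completed P_i */
    #define FIFO_REG 1   /* hypothetical: FIFO read port for ax, ay, az   */

    extern volatile uint32_t fpgaReg[];  /* USER_REG-sized register window */

    /* Drain acceleration vectors for n particles from the FPGA FIFO. */
    void read_accelerations(int n, float *ax, float *ay, float *az)
    {
        int done = 0;                    /* local index of vectors read */

        while (done < n) {
            usleep(3);                   /* 3 us polling interval */
            uint32_t avail = fpgaReg[IDX_REG];
            while ((uint32_t)done < avail && done < n) {
                uint32_t w;
                w = fpgaReg[FIFO_REG]; memcpy(&ax[done], &w, sizeof w);
                w = fpgaReg[FIFO_REG]; memcpy(&ay[done], &w, sizeof w);
                w = fpgaReg[FIFO_REG]; memcpy(&az[done], &w, sizeof w);
                done++;
            }
        }
    }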

The theoretical performance of the N-body core without I/O constraints
at 400MHz (one P_j fed per 2.5 ns clock cycle, plus 192 overhead cycles
per outer iteration) is

T_t = (81920 + 192) * 81920 * 2.5 ns = 16.82 s

For the 333MHz DDR2 SDRAM in the target platform, with 171MBps DMA write
speed and 230MBps DMA read speed, the expected performance of the FPGA
implementation is

The resource utilization of the design is as follows:

Number of Slice Registers:  32,483 out of 207,360   15%
Number of Slice LUTs:       27,674 out of 207,360   13%
Number of DSP48Es:              18 out of     192    9%

The difference between the expected and measured performance is due to
the memory reading overhead. We store all particle data in external DDR2
memory, which has an effective 4 x 32-bit wide interface. In computing
the expected performance, we assume that all data are ready when needed,
i.e. an uninterrupted, continuous read operation on the external DDR2
memory delivering four 32-bit words per clock cycle. Due to the nature
of the DDR2 memory IC and also the implementation of the memory
controller (mainly the command buffer size), this assumption is not
valid in real world operation.

For the same set of input vectors, the results of the FPGA program
differ from those of the CPU counterpart. Two factors contribute to this
deviation, and both are related to the pipelined accumulator. First, the
sequential CPU program accumulates the acceleration vectors as soon as
they are computed from P_i and P_j. In the FPGA design, however, these
vectors are first accumulated into 12 different partial sums matching
the pipeline stages of the floating point adder. These 12 partial sums
are then summed together in the final reduction stage. This difference
in accumulation order introduces differences in the final results. The
factor is confirmed and analyzed with a special CPU version which
recreates the 12 partial sums as in the FPGA version. The second factor
is the stall cycles created by the external memory interface. During
clock cycles when the data from the external DDR2 memory are not yet
ready, the N-Body pipeline must be stalled. Using a single shared enable
signal to stall the pipeline is not realistic since its fan-out would
grow beyond 10,000 and the timing constraints would never be met. A
delay pipe and input selection at the accumulators are used instead to
avoid a global enable signal. This scheme changes the summing order of
the acceleration vectors when invalid data are skipped at the
accumulator input, so the final result differs, as previously discussed.
This factor is affected non-deterministically by the state of the
external DDR2 memory chip. It is analyzed by inserting invalid stages in
RTL simulation.
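
The first effect can be reproduced in software: accumulating into 12
partial sums (one per adder pipeline stage) and reducing them at the end
gives a different summation order than a single sequential accumulator.
The round-robin assignment and the names below are illustrative:

    #define ADDER_STAGES 12

    /* Sum 'count' values the way the pipelined accumulator does: each
     * incoming value lands in one of 12 partial sums, which are added
     * together only in the final reduction stage. */
    float pipelined_sum(int count, const float *values)
    {
        float partial[ADDER_STAGES] = {0.0f};

        for (int k = 0; k < count; k++)
            partial[k % ADDER_STAGES] += values[k];

        float sum = 0.0f;
        for (int s = 0; s < ADDER_STAGES; s++)
            sum += partial[s];
        return sum;
    }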

Since we use only 3% of the logic resources and 9% of the DSP blocks in
the target FPGA for the pipelined data path, we can easily place 10
copies of the pipelined data path in the FPGA core. This effectively
unrolls the outer loop by a factor of 10. Apart from a few negligible
extra memory read operations for the additional P_i values, the speedup
is linear, as all pipelines share the same P_j values read from memory.
Overheads also include the extra control logic for sequencing the P_i
accesses and longer traces for connecting the memory ports to multiple
pipelined data paths. Thus the maximum working frequency of the
multi-core design is lower than that of the single core counterpart.

At 266.67MHz, the multi-core FPGA implementation is measured to be 17.7
times faster than the single thread CPU implementation.