Review of FFT Software

The Fast Fourier Transform (FFT) provides the basis of many scientific algorithms. There are hundreds of FFT software packages available. In this section, only those closely related to this project are reviewed.

Distributed FFT Packages

FFTW is one of the most popular FFT packages available. It is open-source and portable, supports arbitrary input sizes, and delivers good performance thanks to its self-tuning design (planning before execution). There are two major versions of FFTW. Version 2.x has a reliable MPI interface to transform distributed data. However, it internally uses a 1D (slab) decomposition which, as discussed earlier, limits the scalability of large applications. Its serial performance is also inferior to that of version 3.x, which has a brand-new design offering better support for SIMD instructions on modern CPUs. The MPI interface of version 3.x, reintroduced in version 3.3, is still based on 1D decomposition.

CRAFFT (CRay Adaptive FFT) is available on Cray systems, for example, as part of the xt-libsci library on XT/XE systems. It provides a simplified interface to delegate computations to other FFT kernels (including FFTW). Being 'adaptive' means that it can dynamically select the fastest FFT kernel available on a system. However, it supports only a very limited set of distributed routines (only 2D/3D complex-to-complex transforms as of version xt-libsci/10.5.0), and those routines are based on an evenly distributed 1D decomposition.

Intel MKL contains a number of cluster FFT functions; some IBM platforms support distributed FFTs via the Parallel ESSL library. These are all based on 1D decomposition.
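To make the scalability limit of 1D decomposition concrete, the following minimal sketch lists the local block sizes for a cubic n*n*n grid under both decompositions. The function names are hypothetical, and the even-as-possible split used here is an assumption made for illustration; it is not necessarily how any of the packages above distribute their data.

```python
def split(n, p):
    """Sizes of the p contiguous chunks that n points are divided into."""
    base, extra = divmod(n, p)
    return [base + 1 if r < extra else base for r in range(p)]


def slab_local_shapes(n, p):
    """1D (slab) decomposition: only the last dimension is split over p ranks."""
    if p > n:
        # Some ranks would own no plane at all, so they could do no work.
        raise ValueError("slab decomposition cannot use more than n ranks")
    return [(n, n, nz) for nz in split(n, p)]


def pencil_local_shapes(n, p_row, p_col):
    """2D (pencil) decomposition: the last two dimensions are split
    over a p_row x p_col processor grid."""
    if p_row > n or p_col > n:
        raise ValueError("each grid dimension is limited to n ranks")
    return [(n, ny, nz) for ny in split(n, p_row) for nz in split(n, p_col)]
```

For an 8*8*8 grid, slab_local_shapes(8, 16) raises an error because only 8 ranks can be used, whereas pencil_local_shapes(8, 4, 4) gives each of 16 ranks an (8, 2, 2) block; the pencil layout can use up to n*n ranks.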

There are several open-source packages available which implement 2D-decomposition-based distributed FFTs:

Plimpton's parallel FFT package provides a set of C routines to perform 2D and 3D complex-to-complex FFTs together with very flexible data remapping routines for data transpositions. The communications are implemented using MPI_SEND and MPI_IRECV.

Takahashi's FFTE package in Fortran contains both serial and distributed versions of complex-to-complex FFT routines. It supports transform lengths with small prime factors only and uses MPI_ALLTOALL to transpose evenly distributed data. There are no user-callable communication routines.

The best-known open-source distributed FFT library is P3DFFT. This project was initiated at the San Diego Supercomputer Center at UCSD by Dmitry Pekurovsky. It is highly efficient and has been widely adopted by scientists doing large-scale simulations, including cutting-edge turbulence studies.

PFFT is a package developed at TU Chemnitz that performs parallel FFTs on massively parallel architectures. It is written in C and built upon FFTW 3.3. Its API emulates the design of the FFTW API: very powerful and flexible, but a bit hard to use. It supports N-dimensional transforms distributed across a processor grid of N-1 dimensions.
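Several of the packages above rely on a global transposition of evenly distributed data, which FFTE performs with MPI_ALLTOALL. The sketch below is a serial model of that exchange for an n x n matrix distributed by rows. It uses plain Python lists instead of MPI, and the function names are hypothetical; it illustrates the communication pattern only and is not code from any of these packages.

```python
def alltoall(send):
    """Serial model of MPI_ALLTOALL: send[i][j] is what rank i sends to rank j.

    Returns recv such that recv[j][i] == send[i][j].
    """
    p = len(send)
    return [[send[i][j] for i in range(p)] for j in range(p)]


def distributed_transpose(local_rows, n):
    """Transpose an n x n matrix distributed by rows over p ranks.

    local_rows[r] holds the n // p consecutive rows owned by rank r.
    Returns, for each rank, the n // p rows of the transposed matrix it owns.
    """
    p = len(local_rows)
    if n % p:
        raise ValueError("this sketch requires n to be divisible by p")
    b = n // p
    # Pack: rank i cuts its rows into p column blocks; block j goes to rank j.
    send = [[[row[j * b:(j + 1) * b] for row in local_rows[i]]
             for j in range(p)]
            for i in range(p)]
    recv = alltoall(send)
    result = []
    for j in range(p):
        # Unpack: stacking the received blocks gives all n rows restricted
        # to the b columns owned by rank j (n rows x b columns).
        stacked = [row for i in range(p) for row in recv[j][i]]
        # Local transpose: each original column becomes a row.
        result.append([list(col) for col in zip(*stacked)])
    return result
```

The even-distribution check mirrors the restriction noted for FFTE: every rank sends and receives blocks of identical size, which is what a plain all-to-all call expects.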

Serial FFT Implementations

Although capable of doing actual FFT computations, 2DECOMP&FFT is mainly designed to perform data management and communications. The actual computations of 1D FFTs are delegated to a third-party FFT library, which is assumed to be fully optimised to run on a single CPU core. 2DECOMP&FFT interfaces with almost every popular FFT implementation, so users have the freedom to choose their favourite package.
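The division of labour described above can be illustrated with a minimal serial sketch: a multi-dimensional transform is assembled purely from 1D transforms supplied by an interchangeable engine, with a transposition between the passes. This is a 2D example in plain Python, not 2DECOMP&FFT code; the names fft2d and naive_dft are hypothetical, and the naive O(n^2) DFT merely stands in for a real third-party engine.

```python
import cmath


def naive_dft(x):
    """O(n^2) reference 1D DFT; stands in for a third-party FFT engine."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]


def transpose(a):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*a)]


def fft2d(a, fft1d=naive_dft):
    """2D transform built only from calls to a 1D engine.

    In a distributed setting the two transpose steps would be global
    data transpositions; here they are purely local.
    """
    rows_done = [fft1d(row) for row in a]       # 1D transforms along rows
    cols = transpose(rows_done)                 # make columns contiguous
    cols_done = [fft1d(col) for col in cols]    # 1D transforms along columns
    return transpose(cols_done)                 # restore original layout
```

Any function that maps a sequence of n values to its n-point transform can be passed as fft1d, which is the property that lets the data-management layer stay independent of the engine.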

Here is the list of FFT engines:

Generic - This is 2DECOMP&FFT's own FFT implementation. It is based on an algorithm that is attributed to Glassman (refer to Glassman's general N Fast Fourier Transform). It is not particularly efficient, but serves two purposes:

It makes the package independent of external libraries, aiding portability.

It takes over the computation if the external FFT engine fails to work (for example, if a user mistakenly passes in an input with length not supported by the underlying FFT engine).

This FFT engine is not recommended for production work, as it may lose 2 digits of accuracy when running in double-precision mode.

FFTW - FFTW is the most popular open-source FFT implementation and it is likely to work reasonably well on all hardware due to its auto-tuning feature. Version 3.x of the FFTW API is used (see footnote 1). It is officially supported on Cray supercomputers.

ACML - The AMD Math Core Library is the vendor library for AMD CPUs. ACML's FFT kernel is tuned at a very low level (assembly language) to work well on AMD CPUs.

MKL - The Intel Math Kernel Library implementation helps port the code onto systems using Intel CPUs. Thanks to a wrapper provided by Intel, which translates FFTW calls to MKL calls, it is also possible to link the FFTW implementation above directly to MKL (see footnote 2).

FFTPACK - FFTPACK is widely used in legacy applications. The implementation is based on a Fortran 90 variant of version 5.0 of the package. In benchmarks it is slower than the vendor libraries above. Only use this engine if applications rely on other FFTPACK functions as well.

FFTE - FFTE generally performs quite well when run multi-threaded. The plan is to implement a hybrid version of 2DECOMP&FFT in the future using FFTE as its base engine.

CUFFT - There is an experimental implementation (source not released, contact us if interested) using CUFFT to compute the 1D FFTs on NVIDIA GPUs. This enables 2DECOMP&FFT to run on GPU clusters.

Additional FFT engines can be supported when required. In fact, any third-party FFT library offering a simple serial 1D FFT function can be wrapped.
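The fallback role of the Generic engine described above, and the idea of wrapping an arbitrary 1D FFT function, can be sketched as follows. Everything here is an illustrative assumption: radix2_fft stands in for an external engine that accepts only power-of-two lengths, generic_dft is a naive DFT rather than Glassman's algorithm, and UnsupportedLength and fft_with_fallback are hypothetical names, not part of the 2DECOMP&FFT API.

```python
import cmath


class UnsupportedLength(Exception):
    """Raised by the stand-in external engine for lengths it cannot handle."""


def generic_dft(x):
    """Naive O(n^2) DFT for any length (stand-in for the Generic engine)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]


def radix2_fft(x):
    """Stand-in external engine: recursive radix-2 FFT, power-of-two lengths only."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise UnsupportedLength(n)
    if n == 1:
        return list(x)
    even = radix2_fft(x[0::2])
    odd = radix2_fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft_with_fallback(x, engine=radix2_fft, fallback=generic_dft):
    """Try the external engine first; hand over to the fallback if it refuses.

    Returns the transform together with a label naming the engine used.
    """
    try:
        return engine(x), "engine"
    except UnsupportedLength:
        return fallback(x), "generic"
```

With these definitions, a length-4 input is handled by the stand-in external engine, while a length-3 input is silently routed to the generic routine.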

Footnotes

1. Version 3.3 of FFTW contains a brand-new Fortran 2003 interface, as well as the old legacy Fortran interface. A second FFTW implementation based on the new interface is also available in 2DECOMP&FFT. One benefit of the F2003 interface is the guaranteed memory alignment, although this does not seem to make any practical difference.

2. The FFTW 3.x wrapper is distributed as part of the MKL and it is ready to use. There is also a version 2.x wrapper, which is an open-source package that users can compile themselves. The wrapper actually works! So it was a mistake for the author not to try it first. On the positive side, it is a great pleasure to play with the MKL API, which has an almost object-oriented design.