GROMACS 4.6 and GPUs

The native implementation of GPU support in GROMACS 4.6 and later is discussed on a separate page, with further details in the "Acceleration and parallelization" section and in section A.6 of the 4.6 manual. The information below pertains only to the GPU support in the GROMACS 4.5 series, which is based on the OpenMM library. The OpenMM functionality is nominally still present in the 4.6 series, but it is not supported (nor fully tested) and is marked for deprecation.

GROMACS 4.5 and GPUs

In version 4.5, GROMACS provides support for GPU-accelerated MD simulations through the OpenMM library and a collaboration with the Simbios NIH Center for Biomedical Computation at Stanford. This development of freely available open source software (both the Gromacs and OpenMM parts) would not have been possible without generous support both in the EU (ERC, SSF, VR) and in the US (NIH, NSF), all of which is gratefully acknowledged.

When trying out the accelerated GROMACS-GPU binaries, please be aware that you might have to change some settings to get the best performance from GPUs. In some areas (e.g. implicit solvent) the CUDA version is already far ahead of the CPU version, and in others (in particular PME) you can improve performance by adjusting your settings.

Limitations

The following should be noted before using the GPU accelerated mdrun-gpu:

The current release runs only on modern NVIDIA GPU hardware with CUDA support. Make sure that the necessary CUDA drivers and libraries for your operating system are already installed.

Multiple GPU cards are not supported.

Only a fairly small subset of the GROMACS features and options is supported on the GPUs. See the "Supported features" section for a detailed list.

Consumer level GPU cards are known to often have problems with faulty memory. It is recommended that a full memory check of the cards is done at least once (for example, using the memtest=full option). A partial memory check (for example, memtest=15) before and after the simulation run would help spot problems resulting from overheating of the graphics card.

The maximum size of the simulated systems depends on the available GPU memory; for example, a GTX280 with 1GB memory has been tested with systems of up to about 100,000 atoms.

In order to take full advantage of the GPU platform features, many algorithms have been implemented in a very different way than they are on the CPUs. Therefore, numerical correspondence between some properties of the systems' state should not be expected. Moreover, the values will likely vary when simulations are done on different GPU hardware. However, sufficiently long trajectories should produce comparable statistical averages.

Some energy terms will be missing from the output if you compare the same .tpr on CPU and GPU. This is an unavoidable consequence of the nature of the hardware+software stack used.

Frequent retrieval of system state information such as trajectory coordinates and energies can greatly influence the performance of the program due to slow CPU<–>GPU memory transfer speed.

MD algorithms are complex, and although the Gromacs code is highly tuned for them, they often do not translate very well onto streaming architectures. Realistic expectations about the achievable speed-up, from tests with a GTX280: for small protein systems in implicit solvent using all-vs-all kernels the acceleration can be as high as 20 times, but in most other setups involving cutoffs and PME the acceleration is usually only about 5 times relative to a 3GHz CPU core.
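As a rough illustration of the point about state retrieval, the sketch below appends sparse output settings to an .mdp file. The option names are standard GROMACS 4.5 .mdp output-control options; the function name and the interval of 5000 steps are assumptions chosen only for illustration, not tested recommendations.

```shell
# Append output-control settings that keep CPU<->GPU transfers infrequent.
# $1: path of the .mdp file to append to.
# The interval of 5000 steps is an illustrative assumption.
write_sparse_output_settings() {
    cat >> "$1" <<'EOF'
; output control: retrieve state from the GPU only every 5000 steps
nstxout    = 5000   ; coordinates
nstvout    = 5000   ; velocities
nstenergy  = 5000   ; energies
nstlog     = 5000   ; log file
nstxtcout  = 0      ; no compressed trajectory
EOF
}
```

Calling write_sparse_output_settings gpu_run.mdp (gpu_run.mdp being a placeholder name) adds the block to an existing parameter file before running grompp.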

Supported features

Integrators: md/md-vv/md-vv-avek, sd/sd1 and bd. OpenMM implements only the velocity-verlet algorithm for MD simulations. Option md is accepted but keep in mind that the actual algorithm is not leap-frog. Thus all three options md, md-vv and md-vv-avek are equivalent. Similarly, options sd and sd1 are also equivalent.

For Ewald summation only 3D geometry is supported, while dipole correction is not.

The cut-off method is supported only for implicit solvent simulations.

Temperature control: Supported only with the sd/sd1, bd, md/md-vv/md-vv-avek integrators. OpenMM implements only the Andersen thermostat. All values for tcoupl are thus accepted and equivalent to andersen. Multiple temperature coupling groups are not supported; only tc-grps=System will work.

Force Fields: Supported force fields are Amber and CHARMM. GROMOS and OPLS-AA are not supported.

CMAP dihedrals in CHARMM are not supported, so use the -nocmap option with pdb2gmx.

Implicit solvent: Supported only with reaction-field electrostatics. The only supported algorithm for GB is OBC, and the default Gromacs values for the scale factors are hardcoded in OpenMM, i.e. obc alpha=1, obc beta=0.8 and obc gamma=4.85.

Constraints: Constraints in OpenMM are done by a combination of SHAKE, SETTLE and CCMA. Accuracy is based on the SHAKE tolerance as set by the shake_tol option.

Periodic Boundary Conditions: Only pbc=xyz and pbc=no in rectangular cells (boxes) are supported.

Restraints: Distance, orientation, angle and dihedral restraints are not supported in the current implementation.

Free energy calculations: Not supported in the current implementation.

Walls: Not supported.

Non-equilibrium MD: Option acc_grps is not supported.

Electric Fields: Not supported.

QMMM: Not supported.
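Some of the restrictions listed above can be checked mechanically before a run. The following is a minimal sketch, assuming a plain "key = value" .mdp file; the function name check_gpu_mdp is hypothetical, and only a handful of the options named in the list are covered, so a clean result does not guarantee that a run is supported.

```shell
# Scan an .mdp file for settings that the feature list marks as unsupported.
# Prints one line per problem; exit status is 1 if anything was flagged, else 0.
# Only a few options are checked; this is not a complete validator.
check_gpu_mdp() {
    awk -F'=' '
        /^[ \t]*;/ { next }                     # skip comment lines
        NF < 2     { next }                     # skip lines without "key = value"
        {
            key = $1; val = $2
            sub(/;.*/, "", val)                 # strip trailing comment
            gsub(/[ \t]/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            gsub(/-/, "_", key)                 # accept tc-grps and tc_grps alike
            key = tolower(key); low = tolower(val)
            if (low == "") next                 # option left empty
            if (key == "tc_grps" && low != "system")          flag(key, val)
            if (key == "pbc" && low != "xyz" && low != "no")  flag(key, val)
            if (key == "ewald_geometry" && low != "3d")       flag(key, val)
            if (key == "free_energy" && low == "yes")         flag(key, val)
            if (key == "acc_grps")                            flag(key, val)
        }
        function flag(k, v) { print "not supported on GPU: " k " = " v; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}
```

A typical use would be check_gpu_mdp md.mdp && grompp -f md.mdp, where md.mdp is a placeholder file name.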

Installing and running GROMACS-GPU

Using precompiled binaries

Gromacs-GPU can be installed either from out-of-date beta-quality binary packages, or the officially distributed source packages. Using the latter is recommended.

Prerequisites

The current GROMACS-GPU release uses OpenMM acceleration; the necessary libraries and plugins for OpenMM version 2.0 are included in the binary packages. Both the OpenMM library and Gromacs-GPU require version 3.x of the CUDA libraries and a compatible NVIDIA driver (i.e. version >= 256). Last but not least, to run GPU-accelerated simulations, a CUDA-enabled graphics card is necessary. Molecular dynamics algorithms are very demanding and, unlike in other application areas, only high-end graphics cards are capable of providing performance comparable to or higher than modern CPUs. For this reason, mdrun-gpu is compatible with only a subset of CUDA-enabled GPUs (for a detailed list see section 6.9.3), and by default it does not run if it detects incompatible hardware. For details about compatibility of NVIDIA drivers with the CUDA library and devices, consult the NVIDIA developer page.

Downloads

Note that the binaries below are outdated beta versions! Building the final version from the main source release is straightforward and strongly encouraged on *NIX systems. For Windows we are considering building precompiled binaries.

Note: For Linux distributions with older glibc, such as CentOS 5.4, the binaries must be recompiled from source (see below).

Installing an outdated precompiled binary

Download and unpack the binary package for the respective OS and architecture. Copy the content of the package to your normal Gromacs installation directory (or to a custom location). Note that the distributed Gromacs-GPU packages do not contain the entire set of tools and utilities included in a full Gromacs installation. Therefore, it is recommended to have a standard Gromacs installation (version 4.5 or later) alongside the GPU-accelerated one.

Add the openmm/lib directory to your library path, e.g. in bash:

export LD_LIBRARY_PATH=path_to_gromacs/openmm/lib:$LD_LIBRARY_PATH

If there are other OpenMM versions installed, make sure that the supplied libraries have preference when running mdrun-gpu. Also, make sure that the CUDA libraries installed match the version of CUDA with which Gromacs-GPU is compiled.

Set the OPENMM_PLUGIN_DIR environment variable to contain the path to the openmm/lib/plugins directory, e.g. in bash:

export OPENMM_PLUGIN_DIR=path_to_gromacs/openmm/lib/plugins

At this point, running the command path_to_gromacs/bin/mdrun-gpu -h should display the standard mdrun help which means that the binary runs and all the necessary libraries are accessible.
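The environment setup described above can be wrapped in a small helper. This is a sketch only: the function name setup_gromacs_gpu_env is hypothetical, and its argument stands in for path_to_gromacs. It checks that the plugin directory exists before exporting anything, and prepends the bundled library directory so that it takes precedence over other OpenMM installations.

```shell
# Export the library and plugin paths for a Gromacs-GPU installation.
# $1: installation directory (stands in for path_to_gromacs).
setup_gromacs_gpu_env() {
    gmx_gpu_dir="$1"
    if [ ! -d "$gmx_gpu_dir/openmm/lib/plugins" ]; then
        echo "error: $gmx_gpu_dir/openmm/lib/plugins not found" >&2
        return 1
    fi
    # Prepend so the bundled OpenMM libraries win over other installed versions.
    export LD_LIBRARY_PATH="$gmx_gpu_dir/openmm/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export OPENMM_PLUGIN_DIR="$gmx_gpu_dir/openmm/lib/plugins"
}
```

For example, with an installation under the placeholder path /opt/gromacs-gpu, one would run setup_gromacs_gpu_env /opt/gromacs-gpu and then /opt/gromacs-gpu/bin/mdrun-gpu -h to confirm that the binary starts.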

Compiling and installation of GROMACS-GPU from source

The GPU-accelerated mdrun can be compiled on Linux, Mac OS and Windows operating systems, for both 32-bit and 64-bit architectures. Besides the prerequisites discussed above, the following additional software is required to compile mdrun-gpu:

CMake version ≥ 2.6.4

CUDA-compatible compiler:

MSVC 8 or 9 on Windows

gcc version ≥ 4.1 on Linux and Mac OS

CUDA toolkit 3.x

OpenMM-2.0 libraries and header files

NB: this version has a bug in the CUDA platform where the velocities are half the value of what they should be but only in the first integration step. This has been fixed in the OpenMM-svn repository and will be available in future releases of the library.

Note that the current Gromacs-GPU release is compatible with OpenMM version 2.0. While future versions might be compatible, using the officially supported and tested OpenMM versions is strongly encouraged. OpenMM binaries as well as source code can be obtained from the project's homepage, and you can read more about the underlying idea in the paper. Also note that it is essential that the same version of CUDA is used to compile both mdrun-gpu and the OpenMM libraries.

To compile mdrun-gpu, change to the top-level directory of the source tree and execute the following commands:
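The command listing itself is not reproduced on this page. As a hedged sketch, assuming the CMake-based build of the GROMACS 4.5 series with its GMX_OPENMM option, the OPENMM_ROOT_DIR variable, and the mdrun and install-mdrun make targets, the sequence would look roughly as follows. The helper names run and build_mdrun_gpu are hypothetical, and the option and target names should be checked against the installation guide for your release. By default the sketch only prints the commands.

```shell
# Sketch only: prints the build commands; set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"

# Echo a command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1: OpenMM 2.0 installation directory (placeholder path).
# Must be called from the top-level directory of the source tree.
build_mdrun_gpu() {
    OPENMM_ROOT_DIR="$1"
    export OPENMM_ROOT_DIR
    run cmake -DGMX_OPENMM=ON . &&
    run make mdrun &&
    run make install-mdrun
}
```

Invoked as build_mdrun_gpu /path/to/openmm, this lists the three steps; with DRY_RUN=0 it configures with OpenMM enabled, builds only mdrun, and installs it.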

Gromacs-GPU specific mdrun features

Besides the usual command line options, mdrun-gpu also supports a set of “device options”, that are meant to give control over acceleration related functionalities. These options can be used in the following form:

mdrun-gpu -device "ACCELERATION:[DEV_OPTION=VALUE,]... [OPTION].."

The option-list prefix ACCELERATION specifies which acceleration library should be used. At the moment, the only supported value is OpenMM. This is followed by the list of comma-separated DEV_OPTION=VALUE option-value pairs which define parameters for the selected acceleration platform. The entire device option string is case insensitive. Below we summarize the available options (of the OpenMM acceleration library) and their possible values.

Platform: Selects the GPGPU platform to be used; currently the only supported value is CUDA (OpenCL support will be added in the future).

DeviceID: The numeric identifier of the CUDA device on which the simulation will be carried out. The default value is 0, i.e. the first device.

Memtest: GPUs, especially consumer-level devices, are prone to memory errors. There might be various reasons for "soft errors" to happen, including (factory) overclocking, overheating, faulty hardware etc., but the result is always the same: unreliable, possibly incorrect results. Therefore, gromacs-gpu has a built-in mechanism for testing the GPU memory in order to catch obviously faulty hardware. A set of tests is performed before and after each simulation, and if errors are detected, the execution is aborted. Accepted values for this option are any integer ≥15 with an optional "s" suffix, representing the approximate amount of time in seconds that should be spent on testing; the default value is memtest=15s. It is possible to turn off memory testing completely by setting memtest=off; however, this is not advisable.

Force-device: Option that enables running mdrun-gpu on devices that are CUDA-capable but not supported. Using this option might result in very low performance or even crashes, and therefore it is not encouraged.

Note that both the option names and the values are case-insensitive.
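To make the option syntax concrete, the sketch below assembles a device string from the options described above and prints the resulting command line instead of running it. The helper name device_options is hypothetical, and topol.tpr is a placeholder run input file passed with the usual mdrun -s option.

```shell
# Build the -device option string from individual settings.
# $1: CUDA device id, $2: memtest value (seconds, "full", or "off")
device_options() {
    printf 'OpenMM:platform=Cuda,deviceid=%s,memtest=%s\n' "$1" "$2"
}

# Print (rather than run) the command line for device 0 with a 15 s memory test.
echo mdrun-gpu -device "\"$(device_options 0 15)\"" -s topol.tpr
```

Since the device string is case insensitive, the capitalization of the option names and values used here is a matter of taste.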

GPU Benchmarks

Apart from interest in new technology and algorithms, the obvious reason to do simulations on GPUs is to improve performance, and you are most likely interested in speedup relative to the CPU version. This is of course our target too, but it is important to understand that the heavily accelerated/tuned assembly kernels we have developed for x86 over the last decade make this relative speedup quite a difficult challenge! Thus, rather than looking at relative speedup you should compare raw absolute performance for matching settings. Relative speedup is meaningless unless you use the same comparison baseline!

In general, the first important step towards good performance is to understand that GPUs are different from CPUs. While you can try to just run your present simulation, you might get significantly better performance with slightly different settings, and if you are willing to make more fundamental trade-offs you can sometimes even get order-of-magnitude speedups.

Due to the different algorithms used some of the parameters in the input mdp files are interpreted slightly differently for the GPU, so below we have created a set of benchmarks that try to create settings that are as close to equivalent as possible for a CPU and GPU. This is not to say they will be ideal for you, but by explaining some of the differences we hope to help you make an informed decision, and hopefully use the hardware in a better way.

General advantages of the GPU version

The algorithms used on the GPU will automatically guarantee that all interactions inside the cutoff are calculated every step, which effectively is equivalent to a slightly longer cutoff.

Due to the higher nonbonded kernel performance, it is quite efficient to use longer cutoffs (which is also useful for implicit solvent)

The accuracy of the PME solver is slightly higher than the default Gromacs values. The kernels are quite conservative in this regard, and never resort to linear interpolation or other lower-accuracy alternatives.

It beats the CPU version in most cases, even when compared to using 8 cores on a cluster node (the CPU version is automatically threaded from v. 4.5).

General disadvantages of the current GPU version

Parallel runs don't work yet. We're still working on this, but to make a long story short it's very challenging to achieve performance that actually beats multiple nodes using CPUs. This will be supported in a future version, though.

Not all Gromacs features are supported yet, such as triclinic unit cells or virtual interaction sites required for virtual hydrogens.

Force fields that do not use combination rules for Lennard-Jones interactions are not supported yet.

File I/O is more expensive relative to the CPU version, so be careful not to write coordinates every 100 steps!

Benchmark systems

To help you evaluate hardware and provide settings you can copy into your own simulations, we have created a package with several comparison systems. They are all based on the 159-residue protein dihydrofolate reductase with either ~7000 waters or implicit solvent. After unpacking the file you will have a tree with both CPU and GPU subdirectories, and below those six different systems/settings. All settings are available in the mdp files, but these simulations are intended to represent high-quality production runs: all interactions are calculated every step, bonds are constrained every step, proper constraint algorithms are used for the water, and there are no other shortcuts taken just to get better performance numbers. If you would like to play around yourself, you could for instance test 2.5fs timesteps when all bonds are constrained.
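For the 2.5fs experiment mentioned above, a variant parameter file can be derived from one of the benchmark .mdp files. This is a minimal sketch: the function name make_long_timestep_mdp is hypothetical, and it relies only on the standard dt (in ps) and constraints .mdp options. Whether the resulting run is stable for a given system has to be checked separately.

```shell
# Copy an .mdp file, replacing the time step and the constraint setting.
# $1: input .mdp, $2: output .mdp (both names are placeholders)
make_long_timestep_mdp() {
    # Drop any existing dt / constraints lines, then append the new values.
    sed -E '/^[[:space:]]*(dt|constraints)[[:space:]]*=/d' "$1" > "$2"
    cat >> "$2" <<'EOF'
dt          = 0.0025     ; 2.5 fs, given in ps
constraints = all-bonds  ; constrain all bonds
EOF
}
```

All other settings in the input file are carried over unchanged, so the two runs differ only in the time step and constraints.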

The implicit solvent simulations use the OBC (Onufriev-Bashford-Case) model and update the Born radii every single step. This type of simulation was not supported on CPUs either prior to Gromacs-4.5, but it is warmly recommended when your goal is to refine a protein or other structure using a single workstation in a limited amount of time, or likewise if you want to run a large number of simulations. While implicit solvent is a tradeoff compared to an explicit representation of the water, you have to balance this against the possibility of getting an order of magnitude more data.

For implicit solvent it is common to employ longer cutoffs, so for this reason we also include versions with 2nm or infinite cutoffs. Even the Gromacs CPU version now has special x86 assembly kernels for infinite-cutoff simulations, but even when using 8-16 cores they cannot even get close to the GPU.

The reaction-field simulations also provide a clear speedup, and this is an extremely useful alternative for free energy calculations.

Presently, the PME performance for a single GPU with ECC enabled roughly matches using Gromacs on all 8 cores of a cluster node. The primary bottleneck here is simply the reciprocal space grid algorithm which is highly accelerated on CPUs in Gromacs, while it is fundamentally harder to implement efficiently on GPUs - this is currently an area of intensive work. The GPU code actually uses 5th order interpolation internally, so you can usually improve performance a bit further by extending the grid/fourierspacing option.

It is ultimately up to you as a user to decide which simulation setups to use, but we would like to emphasize the simply amazing implicit solvent performance provided by GPUs. If money and resources are completely unimportant you can always get quite good parallel performance from a CPU cluster too, but by adding a GPU to a workstation it is suddenly possible to reach the microsecond/day range, e.g. for protein model refinement. Similarly, GPUs excel at throughput-style computations even in clusters, including many explicit-solvent simulations (and here we always compare to using all x86 cores on the cluster nodes). We're certainly interested in your own experiences and recommendations/tips/tricks!