VASP runs up to about 10X faster using NVIDIA Tesla P100s compared to CPU-only systems, enabling the use of more computationally demanding and more accurate methods in the same amount of time.

Installation

System requirements

VASP is distributed as source code and has a couple of compile-time and run-time dependencies. For this guide, we will assume that the following software packages are already installed on your Linux system and that their respective environment variables are set:

Intel Compiler Suite (especially Fortran, C/C++ and MKL)

Intel MPI

NVIDIA CUDA 8.0

In most cases, these packages have been installed on your supercomputer by its administrators and can be loaded via a module system. Installing the mentioned packages is beyond the scope of this guide, so please contact your cluster support team if you need further assistance.

The latest revision of the VASP GPU port can be compiled with the PGI compiler suite, of which a community edition is available at no cost. However, because many VASP users traditionally use the Intel compiler, we will stick with it for this tutorial as well.

Download and Compilation

DOWNLOADING ALL REQUIRED SOURCES

VASP is commercial software, and as a regular VASP licensee you can download the most current version of the GPU port. To acquire a license, see this page. Enter your login credentials on the right under "Community Portal" and click on "Login" to gain access to the download area. Click on "VASP5" and select the "src" folder. At the time of writing, you need to download the following file:

vasp.5.4.4.tar.gz

Make sure to check the VASP site regularly to get the latest patches or new versions.

EXTRACTING AND PATCHING

First, extract the VASP source code that you have just downloaded:

tar xfz vasp.5.4.4.tar.gz

Now switch into the freshly extracted directory containing the sources and apply any patches available for your version:

cd vasp.5.4.4

The VASP makefile requires some modifications to reflect your local software environment. VASP comes with a selection of makefile templates for different setups, which are located in the arch/ subfolder. Copy an appropriate makefile.include from the arch/ folder (in this guide, we are using Intel's Fortran compiler and NVIDIA CUDA under Linux):

cp arch/makefile.include.linux_intel makefile.include

If you need to adapt the makefile.include, please refer to the section Troubleshooting below.

Most of the options in makefile.include are set to work out of the box by detecting the necessary values from your environment variables, but it is highly recommended to set the GENCODE_ARCH variable in the file you just copied appropriately for your GPUs. Please check the compute capabilities (CC) of your GPU card(s) and edit makefile.include with an editor (e.g. nano, vim or emacs are available on many systems by default):

nano makefile.include

We are using NVIDIA P100 cards and as such compile for compute capability 6.0 to yield best performance. Hence, we are ok with the default GENCODE_ARCH line that looks like this:
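For reference, in the makefile.include templates shipped with VASP 5.4.4 this line resembles the following (treat this as an illustration and check against your own copy, as the exact flags may differ between releases):

```makefile
GENCODE_ARCH  := -gencode=arch=compute_30,code=\"sm_30,compute_30\" \
                 -gencode=arch=compute_35,code=\"sm_35,compute_35\" \
                 -gencode=arch=compute_60,code=\"sm_60,compute_60\"
```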

Leaving unused compute-capability flags (e.g. 3.5) in won’t hurt, but it enables the resulting binary to run on other GPU architectures as well. If your target GPU has a different compute capability, make sure to adapt the line accordingly. For example, if you want to target a V100 as well, make it look like this (and use CUDA 9):
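A variant extended for compute capability 7.0 (V100) could then look like this (again, check against your own makefile.include):

```makefile
GENCODE_ARCH  := -gencode=arch=compute_30,code=\"sm_30,compute_30\" \
                 -gencode=arch=compute_35,code=\"sm_35,compute_35\" \
                 -gencode=arch=compute_60,code=\"sm_60,compute_60\" \
                 -gencode=arch=compute_70,code=\"sm_70,compute_70\"
```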

First, build the GPU port of VASP using the gpu target:

make gpu

After a successful build, it’s time to build the GPU port of VASP that allows for non-collinear calculations (when LNONCOLLINEAR=.TRUE. or LSORBIT=.TRUE. in the INCAR) using the gpu_ncl target. Note that the Γ-point flavor of VASP is not yet supported on the GPU.

You can build the CPU-only versions (std, ncl, gam) just as well:

make gpu_ncl std ncl gam

This will give you the following list of binaries, but for this tutorial only vasp_gpu and, optionally, vasp_std will be used:

Table 1 Overview of the different executables built for VASP.

vasp_std       Default version of VASP
vasp_ncl       Special version required to run calculations with LNONCOLLINEAR=.TRUE. or LSORBIT=.TRUE. in the INCAR
vasp_gam       Special version that saves memory and computations for calculations at the Γ point only
vasp_gpu       Same as vasp_std, but with GPU acceleration
vasp_gpu_ncl   Same as vasp_ncl, but with GPU acceleration

We recommend installing the VASP binaries into a place outside your build directory, e.g., into ~/bin to avoid accidental overwriting with future versions:

mkdir -p ~/bin
cp bin/vasp* ~/bin

INSTALLING THE POTCAR DATABASE

VASP relies on tabulated data used for smoothing the all-electron wavefunctions, often called pseudopotentials. You can download those pseudopotentials from the same download area.

Enter your login credentials on the right under “Community Portal” and click on “Login” to gain access to the download area. Then, click on “Potentials” and start with “LDA”. Download all files offered there, and proceed in the same manner for the “PBE” and “PW91” folders.

Running Jobs

First GPU accelerated VASP calculation

There are a few options in the main control file INCAR that need special consideration for the GPU port of VASP. GPU VASP will print error and warning messages if settings in the INCAR file are unsupported or discouraged.

Do not ignore GPU-related messages and act accordingly! This section explains INCAR settings that are relevant for the GPU.

Limit yourself to the following options for the ALGO flag:

Normal

Fast

Veryfast

Other algorithms available in VASP have not been extensively tested on the GPU and are not guaranteed to perform well; some may even produce incorrect results. Besides that, you must use the following settings in the INCAR file:

LREAL = .TRUE. or LREAL = A

NCORE = 1

To get started, we offer a few example calculations that we will use later to show how to reach better performance compared to simple setups. You can find some exemplary input files in the git repository. Go to the corresponding directory and take a quick look at the INCAR file; you can see that it is in accordance with the options mentioned above:

cd gpu-vasp-files/benchmarks

For copyright reasons, you must generate the required POTCAR files on your own. We assume that you have downloaded and extracted the pseudopotential database as shown above and that it resides in ~/vasp/potcars/. The exemplary input files come with a script that takes care of the generation automatically, but it needs to know the path to your POTCAR database:

cd siHugeShort
bash generatePOTCAR.sh ~/vasp/potcars

Then, you are ready to start your first GPU-accelerated VASP calculation:

~/bin/vasp_gpu

This will start only one process, which will utilize only one GPU and one CPU core, regardless of how many are available in your system. Running it this way may take relatively long, but it shows that everything is working. To confirm that the GPU is actively used, enter

nvidia-smi -l

in a terminal connected to the same node where your process is running. You should see your VASP process listed and see to what extent your GPU is utilized. You can stop watching by pressing CTRL+c.

Using a single compute node

Just like the standard version of VASP, the GPU port is parallelized with MPI and can distribute the computational workload across multiple CPUs, GPUs and nodes. We will use Intel MPI in this guide, but all techniques described herein work with other MPI implementations just as well. Please refer to the documentation of your concrete MPI implementation to find the equivalent command line options.

VASP supports a variety of features and algorithms causing its computational profile to be just as diverse. Therefore, depending on your specific calculations, you might need different parameters to yield the quickest possible execution times. These aspects propagate to the GPU port just as well.

In this tutorial, we will present various techniques that can help speed up your GPU runs. However, as there is no single optimal setup, you need to benchmark your cases individually to find the settings with the best performance.

First, let’s see how many (and which) GPUs your node offers:

nvidia-smi -L

The output of the command tells us that we have 4 Tesla P100 GPUs available and lists their unique identifiers (UUIDs), which we will use later on.

Typically, GPUs need to transfer data between their own memory and main memory. On multi-socket systems, the transfer performance depends on the path the data needs to travel. In the best case, there is a direct bus between the two separate memory regions. In the worst case, the CPU process needs to access memory that is physically located in a RAM module associated with the other CPU socket and then copy it to GPU memory that is (yet again) only accessible via a PCI-E lane controlled by the other CPU socket. Information about the bus topology can be displayed with:

nvidia-smi topo -m

Because GPU-accelerated VASP does not (yet) support direct GPU-to-GPU communication, we can ignore most of the output, which tells us which pairs of GPUs could communicate fastest (PIX or even NV#) and which slowest (SOC) among each other:

        GPU0    GPU1    GPU2    GPU3    mlx5_0    CPU Affinity
GPU0    X       SOC     SOC     SOC     SOC       0-15
GPU1    SOC     X       PHB     PHB     PHB       16-31
GPU2    SOC     PHB     X       PIX     PHB       16-31
GPU3    SOC     PHB     PIX     X       PHB       16-31
mlx5_0  SOC     PHB     PHB     PHB     X

The last column, labeled “CPU Affinity”, is important because it tells us on which CPU cores the MPI ranks should ideally run if they communicate with a certain GPU. We see that all CPU cores of the first socket (0-15) can directly communicate with GPU0, whereas the CPUs of the second socket (16-31) are expected to show the best performance when combined with GPU1, GPU2, or GPU3.

Benchmarks

Expected Performance

Whenever you want to compare execution times of runs in various configurations, it is essential to avoid unforeseen deviations. NVIDIA GPUs feature techniques that temporarily raise and lower clock rates based on the current thermal situation and compute load. While this is good for saving power, for benchmarking it can give misleading numbers due to a slightly higher variance in execution times between runs. Therefore, for comparative benchmarking we turn this behavior off for all cards in the system:
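The original command block was not reproduced here; a plausible sketch follows (root privileges are required, and not all GPU models support disabling autoboost; on cards that do not, fixed application clocks can be set via nvidia-smi -ac instead):

```shell
# Enable persistence mode so the settings survive between runs, then
# disable autoboost on all cards (support varies by GPU model).
sudo nvidia-smi -pm 1
sudo nvidia-smi --auto-boost-default=0
```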

Though the numbers representing performance have been generated on production systems, they are only meant to serve as a guideline demonstrating the methods presented in the following. Note that performance on your system might differ because there are many aspects influencing CPU and GPU performance.

The Easiest Method: One Process per GPU

The easiest method to use all 4 GPUs present in our system is just to start 4 MPI processes of VASP, and have the mapping, i.e. on which CPU cores your processes will run, taken care of automatically:

mpirun -n 4 ~/bin/vasp_gpu

The Intel MPI environment automatically pins processes to certain CPU cores so that the operating system cannot move them to other cores during the execution of the job, which prevents some disadvantageous data-movement scenarios. Yet, this may still lead to a suboptimal solution, because the MPI implementation is not aware of the GPU topology. We can investigate the process pinning by increasing the verbosity:

mpirun -n 4 -genv I_MPI_DEBUG=4 ~/bin/vasp_gpu

Looking at the output and comparing it to our findings about the interconnect topology, it seems that things are not ideal:

Rank 0 uses GPU0 but is bound to the more distant CPU cores 16-23. The same problem applies to ranks 2 and 3. Only rank 1 uses GPU1 and is pinned to cores 24-31, which offer the best transfer performance.

Let’s look at some actual performance numbers now. Using all 32 cores of the two Intel® Xeon® E5-2698 v3 CPUs present in our system without any GPU acceleration, it took 607.142 s to complete the siHugeShort benchmark.¹ Using 4 GPUs in this default but suboptimal way results in an execution time of 273.320 s, a speedup of 2.22x. Use the following command to quickly find out how long your calculation ran:

¹ If you have built the CPU-only version of VASP before, you can use the following command to see how long it takes on your system: mpirun -n 32 -env I_MPI_PIN_PROCESSOR_LIST=allcores:map=scatter ~/bin/vasp_std

grep Elapsed\ time OUTCAR

VASP maps GPUs to MPI ranks consecutively, skipping GPUs with insufficient compute capabilities (if there are any). Knowing that, we can use the following syntax to manually control process placement on the CPU and distribute the ranks so that every process uses the GPU with the shortest memory-transfer path:
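The concrete command was not reproduced above; with Intel MPI, an explicit core list can be passed via I_MPI_PIN_PROCESSOR_LIST. Based on the topology output (GPU0 on socket 0, GPU1-3 on socket 1), a placement along these lines is plausible (the exact core list is an assumption):

```shell
# Rank 0 -> core 0 (socket 0, near GPU0); ranks 1-3 -> cores 16-18 (socket 1, near GPU1-3)
mpirun -n 4 -genv I_MPI_PIN_PROCESSOR_LIST=0,16,17,18 ~/bin/vasp_gpu
```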

This does not yield an improvement on our system (the runtime was 273.370 s), which is probably caused by an imbalanced use of shared CPU resources like memory bandwidth and caches (3 processes sharing one CPU). As a compromise, one can spread the ranks evenly across the CPU sockets so that only one rank must use the slower memory path to its GPU:
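A sketch of such a compromise placement (the core list is again an assumption): two ranks per socket, so only rank 1 drives its GPU (GPU1) across the slower inter-socket path:

```shell
# Ranks 0,1 -> cores 0,1 (socket 0); ranks 2,3 -> cores 16,17 (socket 1)
mpirun -n 4 -genv I_MPI_PIN_PROCESSOR_LIST=0,1,16,17 ~/bin/vasp_gpu
```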

With a runtime of 268.939 s, this is a slight improvement of about 3% for this benchmark, but if your workload is heavier on memory transfers, you might gain more.

Especially for larger numbers of ranks, manually selecting the distribution can be tedious, or you might decide that equally sharing CPU resources is more important than memory transfers on your system. The following command maps the ranks consecutively but avoids sharing common resources as much as possible:
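This matches the scatter-style pinning already used for the CPU-only run in the footnote above:

```shell
# Distribute the 4 ranks across sockets ("scatter") instead of filling one socket first
mpirun -n 4 -genv I_MPI_PIN_PROCESSOR_LIST=allcores:map=scatter ~/bin/vasp_gpu
```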

This gave us a runtime of 276.299 s and can be especially helpful if some CPU cores remain idle. You may want to leave cores idle on purpose if a single process per GPU already saturates a GPU resource that limits performance; overloading the GPU even further would then impair performance. This is the case in the siHugeShort benchmark, so on our system this is as good as it gets (feel free to try out the upcoming options here anyway!). However, it is generally a bad idea to waste available CPU cores as long as you are not overloading the GPUs, so do your own testing!

The Second Easiest Method: Multiple Processes per GPU

To demonstrate how to use more CPU cores than there are GPUs available, we will switch to a different benchmark called silicaIFPEN. It takes 710.156 s to execute using the 32 CPU cores only. Using 4 P100 GPUs with one MPI rank per GPU and the compromise process placement, it takes 241.674 s to complete (2.9x faster). NVIDIA GPUs can be shared between multiple processes. To use this feature, we must ensure that all GPUs are set to the “default” compute mode:
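A sketch of the required steps (compute mode 0 is DEFAULT; changing it requires root privileges, and the rank count below is illustrative):

```shell
# Allow multiple processes per GPU, then oversubscribe:
# 12 ranks sharing 4 GPUs means 3 ranks per GPU.
sudo nvidia-smi -c 0
mpirun -n 12 ~/bin/vasp_gpu
```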

For the silicaIFPEN benchmark on our system, the runtime improved to 211.576 s with 12 processes sharing 4 P100s (i.e., 3 processes per GPU), which raised the speedup to 3.36x. Going to 4 or more processes per GPU has an adverse effect on the runtime, though. The comparison in the table below shows that manually placed processes no longer give an advantage there either.

Table 2 Comparison between elapsed times for the silicaIFPEN benchmark varying the number of MPI processes per GPU

After reaching the sweet spot, adding more processes per GPU impairs performance even more. But why? Whenever a GPU needs to switch contexts, i.e., allow another process to take over, it introduces a hard synchronization point. Consequently, there is no possibility for instructions of different processes to overlap on the GPU and overusing this feature can in fact slow things down again. Please also see the illustration below.

In conclusion, it seems to be a good idea to test how much oversubscription is beneficial for your type of calculations. Of course, very large calculations will more easily fill a GPU with a single process than smaller ones, but we can’t encourage you enough to do your own testing!

NVIDIA MPS: enabling overlapping while sharing GPUs

This method is closely related to the previous one, but it remedies the problem that the instructions of multiple processes may not overlap on the GPU, as shown in the second row of the illustration. It is recommended to set the GPUs to the exclusive-process compute mode when using MPS:
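A sketch of the two steps (root privileges are required for changing the compute mode):

```shell
# Restrict each GPU to a single CUDA context, then start the MPS control daemon,
# which funnels all client processes through that one context.
sudo nvidia-smi -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d
```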

The nvidia-cuda-mps-control -d command starts the MPS server in the background (daemon mode). While it is running, it intercepts instructions issued by processes sharing a GPU and puts them into the same context before sending them to the GPU. The difference from the previous section is that, from the GPU’s perspective, the instructions now belong to a single process and context and as such can overlap, just as if you were using streams within a CUDA application. You can check with nvidia-smi -l that only a single process is accessing the GPUs.

This mode of running the GPU port of VASP can help to increase GPU utilization, when a single process does not saturate GPU resources. To demonstrate this, we employ our third example B.hR105, a calculation using exact exchange within the HSE06 functional. We have run it with different amounts of MPI ranks per GPU each time with and without MPS enabled.

Table 3 Comparison between elapsed times for the B.hR105 benchmark varying the number of MPI processes per GPU each with and without MPS for NSIM=4

MPI ranks per GPU    Total MPI ranks    Elapsed time without MPS    Elapsed time with MPS
0                    32 (CPU only)      1027.525 s                  -
1                    4                  213.903 s                   327.835 s
2                    8                  260.170 s                   248.563 s
4                    16                 221.159 s                   158.465 s
7                    28                 241.594 s                   169.441 s
8                    32                 246.893 s                   168.592 s

Most importantly, MPS improves the execution time by 55.4 s (26%) to 158.465 s when compared to the best result without MPS (213.903 s). While there is a sweet spot at 4 ranks per GPU without MPS, starting as many processes as there are CPU cores available yields the best performance with MPS.

We skipped calculations with 3, 5, and 6 ranks per GPU on purpose because the number of bands (224) used in this example is not divisible by the resulting numbers of ranks; VASP would then increase the band count automatically, which increases the workload. If you are mainly interested in the time-to-solution, we suggest you experiment a little with the NSIM parameter: by setting it to 32 and using just 1 process per GPU (hence no MPS), we were able to push the calculation time down to 108.193 s, which is roughly a 10x speedup.

Advanced: One MPS instance per GPU

For certain setups, especially on older versions of CUDA, it can be beneficial to start multiple instances of the MPS daemon, e.g., one MPS server per GPU. Doing so is a little more involved, because every MPI process must be told which MPS server to use, and every MPS instance must be bound to a different GPU. Especially on P100 with CUDA 8.0 we discourage this method, but you may still find it useful in some situations. The following script can be used to start the MPS instances:
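The original script is not reproduced here; a minimal sketch (directory names are illustrative) that starts one daemon per GPU, each with its own pipe and log directory, might look like this:

```shell
#!/bin/bash
# Start one MPS daemon per GPU. Each daemon sees exactly one device and gets
# its own pipe/log directory, so clients can later pick an instance by
# setting CUDA_MPS_PIPE_DIRECTORY accordingly.
nGPUs=4
for (( i = 0; i < nGPUs; i++ )); do
    mkdir -p /tmp/mps_${i} /tmp/mps_log_${i}
    CUDA_VISIBLE_DEVICES=${i} \
    CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_${i} \
    CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_${i} \
    nvidia-cuda-mps-control -d
done
```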

For this method to work, the GPU port of VASP must be started via a detour using a wrapper script. This will set the environment variables so that each process will use the correct instance of the MPS servers which we have just started: runWithMultipleMPSservers-RR.sh.

This script basically generates a list of the paths for setting the environment variables that decide which MPS instance is used. The fourth-to-last line (myMpsInstance=...) then selects this instance depending on the local MPI process ID. We decided to go with a round-robin fashion, distributing processes 1 to 4 to GPU0 through GPU3. Process 5 uses GPU0 again, as would process 9, whereas processes 6 and 10 are mapped to GPU1, and so on. If you used a different path to install your GPU VASP binary, make sure you adapt the line starting with runCommand accordingly. Then, let’s start the calculation:
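The round-robin selection described above can be sketched as follows (variable and directory names are illustrative; Intel MPI exports the local rank as MPI_LOCALRANKID):

```shell
# Map each local MPI rank to an MPS instance in round-robin fashion:
# ranks 0..3 -> instances 0..3, rank 4 -> instance 0 again, rank 5 -> 1, ...
localRank=${MPI_LOCALRANKID:-0}
nGPUs=4
myMpsInstance=$(( localRank % nGPUs ))
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_${myMpsInstance}
```

The calculation itself would then be launched via mpirun, calling the wrapper script runWithMultipleMPSservers-RR.sh instead of the VASP binary.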

When using MPS, please keep in mind that the MPS server(s) themselves use the CPU. Hence, if you start as many processes as there are CPU cores available, you might overload your CPU. It can therefore be a good idea to reserve a core or two for MPS.

When the calculation is finished, the following script cleans up the MPS instances:
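The original cleanup script is not shown here; a matching sketch (assuming the pipe directories from the start-up sketch above) sends quit to each daemon:

```shell
#!/bin/bash
# Tell each per-GPU MPS daemon to shut down by sending "quit" to its pipe.
nGPUs=4
for (( i = 0; i < nGPUs; i++ )); do
    echo quit | CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_${i} nvidia-cuda-mps-control
done
```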

Using multiple compute nodes

ONE PROCESS PER GPU

Basically, everything that was said about using GPUs housed in a single node applies to multiple nodes as well. Whatever you found worked best for your systems concerning process mapping will probably work well on more nodes, too. In the following, we assume that you have set up a hostfile listing all the nodes associated with your job. In our case, we used two nodes, and the hostfile looks like this:

hsw224
hsw225

If you go with manual selection of the process mapping, there is just a small difference from the command given in the previous chapter:
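With Intel MPI, the difference amounts to pointing mpirun at the hostfile and doubling the rank count; a plausible sketch (using the per-node core list assumed earlier):

```shell
# 8 ranks over 2 nodes, 4 per node (-perhost), same per-node core list as before
mpirun -f hostfile -perhost 4 -n 8 -genv I_MPI_PIN_PROCESSOR_LIST=0,1,16,17 ~/bin/vasp_gpu
```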

The siHugeShort benchmark gets a little faster at 258.917 s, but compared to its single-node runtime of 268.939 s this by no means justifies the use of a second node. The silicaIFPEN benchmark, on the other hand, improves notably from 241.674 s to 153.401 s when going from 4 to 8 P100 GPUs with one MPI process per GPU.

MULTIPLE PROCESSES PER GPU

Regarding the previous sections, going to multiple nodes with multiple processes per GPU is straightforward:
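A sketch for two ranks per GPU on both nodes (the rank counts are illustrative):

```shell
# 2 ranks per GPU x 4 GPUs = 8 ranks per node, 16 ranks in total on 2 nodes
mpirun -f hostfile -perhost 8 -n 16 ~/bin/vasp_gpu
```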

When increasing the number of ranks per GPU, the silicaIFPEN benchmark behaves a little differently from the single-node case (where the fastest configuration took 211.576 s): using two processes per GPU improves the runtime only insignificantly, to 149.818 s, compared with 153.401 s for one process per GPU. Further overloading the GPUs again has an adverse effect: already at 3 processes per GPU the runtime increases to 153.516 s, and with 64 ranks in total it takes 231.015 s. So apparently, 1 or 2 processes per GPU on each node are enough in this case.

NVIDIA MPS ON MULTIPLE NODES

Using a single instance of MPS per node is trivial if the instances are started for you. Some job schedulers offer submission options that do exactly that; e.g., some SLURM installations offer --cuda-mps. If anything like that is available on your cluster, we strongly advise you to use it and to proceed just as described in the previous section.

But what can you do if your scheduler does not offer such an elegant solution? You must make sure that on each node one (and only one) MPS instance is started prior to VASP being launched. We provide another script that takes care of this: runWithOneMPSPerNode.sh.

Yet again, if you installed the GPU-accelerated VASP binary in an alternative location, please adapt the runCommand variable at the beginning of the script. The variables that follow calculate the local rank on each node, because Intel’s MPI implementation does not expose this information easily. The script starts an MPS server on the first rank of each node, making sure that the MPS process is not bound to the same core that a VASP process will later be bound to. This step is crucial, because otherwise MPS would be limited to a single core (it can use more than that) and, even worse, would compete with VASP for CPU cycles on that core. The script then executes VASP and stops MPS afterwards.

The script must be called via the mpirun command, as you might have seen in the advanced section already. The mpirun invocation works just like when running without MPS, except that we call the script instead of the VASP binary:
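A sketch of such an invocation (the rank counts are illustrative):

```shell
# 4 ranks per GPU x 4 GPUs = 16 ranks per node, 32 in total on 2 nodes;
# the wrapper starts one MPS daemon per node and then launches VASP.
mpirun -f hostfile -perhost 16 -n 32 ./runWithOneMPSPerNode.sh
```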

Regarding the B.hR105 benchmark: MPS improved the runtime on a single node, and this holds on two nodes as well. Enabling MPS speeds up the calculation, and using more ranks per GPU is beneficial up to a certain point of (over-)saturation. The sweet spot on our system was 4 ranks per GPU, which resulted in a runtime of 104.052 s. Compared to the single-node Haswell baseline this is a speedup of 9.05x, and compared to all 64 CPU cores it is still faster by a factor of 6.61x.

If we use NSIM=32 with 4 ranks per GPU on each of the 2 nodes and do not use MPS, the calculation took only 71.222 s.

Table 4 Comparison between elapsed times for the B.hR105 benchmark varying the number of MPI processes per GPU each with and without MPS for NSIM=4 using 2 nodes

MPI ranks per GPU    Total MPI ranks          Elapsed time without MPS    Elapsed time with MPS
0                    32 (CPU only, 1 node)    1027.525 s                  -
0                    64 (CPU only)            763.939 s *                 -
1                    8                        127.945 s                   186.853 s
2                    16                       117.471 s                   110.158 s
4                    32                       130.454 s                   104.052 s
7                    56                       191.211 s                   148.662 s
8                    64                       234.307 s *                 182.260 s

* Here 256 bands were used, which increases the workload.

Recommended System Configurations

Hardware Configuration

Workstation

Parameter           Specs
CPU Architecture    x86_64
System Memory       32-64 GB
CPUs                8 cores, 3+ GHz; 10 cores, 2.2+ GHz; or 16 cores, 2+ GHz
GPU Model           NVIDIA Quadro GP100
GPUs                2-4

Servers

Parameter           Specs
CPU Architecture    x86_64
System Memory       64-128 GB
CPUs                16+ cores, 2.7+ GHz
GPU Model           NVIDIA Tesla P100, V100
GPUs per Node       2-4

Software Configuration

Software stack

Parameter       Version
OS              Linux 64-bit
GPU Driver      352.79 or newer
CUDA Toolkit    8.0 or newer
Compiler        PGI Compiler 16.10 or Intel Compiler Suite 16
MPI             OpenMPI or Intel MPI

Troubleshooting

ADAPTING BUILD VARIABLES (optional)

Your local software environment might deviate from what the VASP build system can automatically handle. In this case, the build will fail and you will need to make minor adjustments to makefile.include. Open makefile.include with your favorite editor (e.g. nano, vim or emacs are available on many systems by default) and make the necessary changes (see below):

nano makefile.include

Whenever you have made changes to any file, make sure to execute the following command to start building from scratch:

make veryclean

In the following, we list a few typical error messages and how-to work around them:

mpiifort: Command not found

This error message simply tells you that, on your system, the MPI-aware Intel Fortran compiler has a different name than we guessed. In makefile.include, please change all occurrences of mpiifort to whatever it is called on your system (e.g., mpif90).

# error "This Intel <math.h> is for use with only the Intel compilers!"

To get around this error, edit makefile.include and add -ccbin=icc to the NVCC variable, so that the line reads:
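Assuming the NVCC variable follows the pattern of the shipped templates, the adapted line could look like this:

```makefile
NVCC := $(CUDA_ROOT)/bin/nvcc -ccbin=icc
```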

/usr/local/cuda//bin/nvcc: Command not found

That message tells you that make cannot find the NVIDIA CUDA compiler nvcc. You can either correct the path in the line

CUDA_ROOT := /usr/local/cuda/

or even comment it out (using a # as first symbol of the line) if CUDA_ROOT is set as an environment variable.

No rule to make target `/cm/shared/apps/intel/composer_xe/2015.5.223/mkl/interfaces/fftw3xf/libfftw3xf_intel.a', needed by `vasp'. Stop.

Probably, your local MKL was installed without support for the FFTW3 interface as a static library. If you comment out the line referencing that static library by inserting a # at its very beginning, the linker will pull in the dynamic analogue instead. Make sure to comment out the line associated with (and following) OBJECTS_GPU, not just the one after OBJECTS.
