QMMM for biochemical systems

This tutorial will illustrate the setup process of a QM/MM protein - substrate system. The model system we will be using is chorismate mutase, an enzyme that catalyzes the conversion of chorismate to prephenate, which is a critical reaction in the shikimate pathway. One attractive feature of this enzyme is that it catalyzes the reaction in an electrostatic manner, without forming covalent bonds to the substrate. This allows us to treat the complete enzyme at the computationally cheap MM level, saving us the headache of setting up bonds across the QM/MM border. In the production simulations we will simulate the substrate at the QM level (using the semi-empirical AM1 method), but we will first equilibrate our system with all components at the MM level.

The enzyme Chorismate Mutase with its native substrate

Before we can get started on the simulations, however, we need to provide CP2K with the proper input files. The coordinates for the chorismate mutase system can be obtained from the protein data bank under PDB ID 2CHT. The active complex consists of three protein subunits, forming three active sites, each of which catalyzes the reaction separately from the others. The PDB structure, however, contains twelve units with twelve inhibitors bound. In order to generate one functional complex, we need to remove the excess protein subunits and their inhibitors. In addition, the inhibitors bound to the crystal structure have to be replaced with copies of the substrate, using your favourite molecular editor. For this tutorial, all input files have been prepared for you.

Generating a topology

Now that we have our coordinates, we need to use them to generate a topology that CP2K can use. CP2K supports CHARMM-type PSF files and Amber-type prmtop files. I will be using Ambertools to generate the input. Small molecule parametrization for the GAFF2 forcefield is performed with Antechamber and parmchk2. Charges are derived with the bcc method. Antechamber generates MM parameters for small molecules while parmchk2 verifies this topology, reports missing parameters and outputs a best guess for these in an .frcmod file. For the purposes of this tutorial, we will take the parameters straight from antechamber/parmchk2, but for production simulations these should be examined cautiously. As the active complex has three ligands, we will perform this parameterization for each of them in a bash loop.

We will then use tleap to load in all the parameters from antechamber and parmchk2, generate parameters for the protein according to the Amber14 forcefield and combine everything into a complex. The commands for tleap can be placed in an input file and ran using tleap -f script.in.

Now that we have assembled our complex, we have to solvate it. For simplicity's sake, we will use a cubic box with water extending out 14 A from the protein and neutralize with Na+ ions. The distance is determined by the cutoffs for nonbonded interactions, plus some buffer to account for cell shrinking. For production simulations you may want to switch to a dodecahedron or octahedron for increased efficiency.

These two files contain everything we need to get started. Complex.prmtop contains the topology information and complex.inpcrd contains the coordinates. Using your favourite visualisation tool, check that the Na+ ions did not get placed right next to your ligand - their placement is random. The box size can be found at the bottom of the inpcrd file, followed by the vectors. These dimensions should be specified in the FORCE_EVAL/SUBSYS/CELL subsection of your input file.

Equilibration at MM level

We can now move on to actually running CP2K. We will first run 1000 steps of energy minimization (or fewer, depending on the convergence) with the input file em.inp to eliminate bad contacts. For quantities like cell size, for instance, CP2K allows you to specify your units by placing them in brackets, for instance MOTION/MD#TIMESTEP [fs] 0.5.

cp2k can be run using a single process:
cp2k.sopt em.inp > em.out
or using several processes, where N is the number of processes:
mpirun -np N cp2k.popt em.inp > em.out

Visualize your simulation to check for any abnormalities. Take a look at the output file, paying extra attention to any warnings you might receive. Note: CP2K will warn you about missing forcefield terms. You can have CP2K output these using FORCE_EVAL/MM/PRINT/FF_INFO.In this case, the missing terms are Urey-Bradley interactions. This is normal, as the Amber force field we are employing here does not use Urey-Bradley terms. A second warning states that our CRD file lacks velocities and box information will not be read. This is not a problem either, as we do not have velocities at this moment and box parameters are provided in the input file.

In the next step we will perform 5 ps of dynamics in the NVT ensemble, to equilibrate the temperature of our simulation using this nvt.inp. We are using the EXT_RESTART section to restart with the last coordinates of the energy minimization procedure. Once the simulation has finished, take a look at the temperature file NVT-1.PARTICLES.temp to check that the temperature is oscillating around the correct value (in this case, 298K). Note that we are using the CSVR thermostat with a low time constant of 10 fs to efficiently and quickly control the system temperature. For production simulations a higher time constant such as 100 fs should be chosen to avoid interfering with the dynamics of the system.

The next step is simulating our system in the constant pressure NPT ensemble to get the correct dimensions for our cell. npt.inp runs 50 ps of NPT, outputting the cell dimensions into a separate file every 100 timesteps. A C-C distance in of the chorismate molecules is constrained, in order to keep it near the reactive conformation to facilitate our later simulations in the QM/MM phase. Pressure control is achieved by adding a barostat with a reference pressure of 1 bar and a coupling time of 100fs, and changing MOTION/MD#ENSEMBLE from NVT to NPT_I, indicating that the box will be scaled isotropically. We expect our cell to shrink at first, as the solvation procedure cannot fill up the cell completely due to overlap with the protein and the edges of the box. After a while the cell size should stabilize. Take a look at the cell dimensions in NPT.cell; we will be using these equilibrated cell dimensions for the next tutorial.

Congratulations, you should now have a fully equilibrated MM system. We will now move on to QM/MM to simulate the bond-breaking process.We will also bias our simulation to ensure that we can see the conversion of chorismate into prephenate occur in a reasonable timeframe.

Moving on to QMMM

Chorismate(left) is converted into prephenate(right) by Chorismate Mutase

As chorismate mutase performs its catalytic function through noncovalent interactions, we can afford to treat the entire enzyme at the MM level. For most biochemistry problems, the boundary will not be so clear and key residues will need to be treated at the QM level as well. This means the QM - MM border will cut through covalent bonds, necessitating the use of link atoms. For our simple system however, this is not necessary. We first have to specify which atoms should be treated at the QM level using the FORCE_EVAL/QMMM/QM_KIND section:

Now that the QM atoms are defined, CP2K needs to know how to treat these. We will use the semi-empirical AM1 method as our QM method, electrostatically embedded in the MM system: FORCE_EVAL/QMMM#E_COUPL COULOMB. This allows for the QM region to be polarized by the MM environment. Mechanical embedding, where the QM region only interacts with the MM region as point charges and no polarization occurs, can be selected through E_COUPL NONE. In DFT calculations, the highly efficient GEEP method can be used as well. For stability reasons, we need to make some slight modifications to our prmtop file. In the TIP3P water model, the hydrogens do not have a separate Lennard-Jones site. The same is true for the amber serine, threonine and tyrosine HO atoms. This can cause unphysical interactions with the QM system: the hydrogen atoms strongly interact with the QM electron clouds, crashing the system. In order to prevent this we will add LJ sites to water hydrogens and serine/threonine/tyrosine hydrogens with the ambertools utility parmed. We will change their LJ parameters to those of a GAFF2 type alcohol using the changeLJSingleType command. Take care when using this, as it changes the LJ type for all of the atoms that share the LJ type, not just the one you selected. For instance, changing the serine HO also alters all the threonine and tyrosine HO atoms in the system. In this case, this behaviour is useful, but it is strongly advised to use printLJTypes to get a good idea of what exactly you are modifying. For a more selective manner of doing this, the addLJType command is useful. This adds an LJ type for your selected atom or group of atoms, rather than modifying all occurrences of a specific LJ type. Confirm that @5724 corresponds to a water hydrogen and that @4843 is a serine HO using printDetails @(atomnumber) in parmed. If not, use printDetails to locate the correct atom numbers and modify the changeLJSingleType commands accordingly.

The input file monitor.inp will perform 5 ps of simulation in the NVT ensemble. Note that the parameter file should be changed to our modified file with LJ sites on the OH hydrogens. We have also defined a collective variable (CV), i.e. a variable that that describes our process of interest well and allows us to monitor this variable as a surrogate for the process. The CV for this system is the difference in distance between the C-O bond that we will be breaking and the C-C bond that we will be forming. Start the simulation and observe the collective variable output in the MONITOR-COLVAR.metadynLog file.

Metadynamics

We have now simulated a full QM/MM system for 5 ps. Unfortunately, as we can see in the metadynamics logfile and in the .xyz trajectory file, no reaction occurred. This makes sense, as there is a significant energy barrier to overcome. One option would be to run this simulation for months on end, hoping for the reaction to occur spontaneously. A second, more practical option would be to bias the system. We will use metadynamics to bias our collective variables (CVs). The metadynamics procedure adds new a gaussian potential at the current location after a number of MD steps, controlled by MOTION/FREE_ENERGY/METADYN#NT_HILLS, encouraging the system to explore new regions of CV space. The height of the gaussian can be controlled with the keyword MOTION/FREE_ENERGY/METADYN#WW, while the width can be controlled with MOTION/FREE_ENERGY/METADYN/METAVAR#SCALE. These three parameters determine how the energy surface will be explored. Exploring with very wide and tall gaussians will explore various regions of your system quickly, but with poor accuracy. On the other hand, if you use tiny gaussians, it will take a very long time to fully explore your CV space. An issue to be aware of is hill surfing: when gaussians are stacked on too quickly, the simulation can be pushed along by a growing wave of hills. Choose the MOTION/FREE_ENERGY/METADYN#NT_HILLS parameter large enough to allow the simulation to equilibrate after each hill addition.

You can now run the metadynamics simulation using metadynamics.inp. The evolution of the collective variables over time can be observed in the METADYN-COLVAR.metadynLog file. The substrate can be found near the 3 to 8 bohr CV region, whereas the product can be found near the -3 to -8 bohr region. For this tutorial, we will stop the simulation once we have sampled the forward and reverse reaction once. In a production setting you should observe several barrier crossings and observe the change in free energy profile as a function of simulation time. The free energy landscape for the reaction can be obtained by integrating all the gaussians used to bias the system. Fortunately, CP2K comes with the graph.sopt tool that does this all for you. You will need to compile this program separately if you don't have it already. Run the tool passing the restart file as input.

graph.sopt -cp2k -file METADYN-1.restart -out fes.dat

This outputs the fes.dat file, containing a one-dimensional energy landscape. Plot it to obtain a graph charting the energy as a function of the distance difference.

Observe the two energy minima for substrate and product, and the maximum associated with the transition state. From this graph we can estimate the activation energy in the enzyme, amounting to roughly 44 kcal/mol. Results from literature generally estimate the energy to be around 20 kcal/mol. While we are in the right order of magnitude, our estimate is not very accurate. One contributing factor might be our choice of QM/MM subsystems. Up to this point we have only treated our substrate with QM and everything else with MM. This is a significant simplification, which could affect our energy barrier.

QM for enzyme residues

High-level calculations in literature have shown Arg90 (take care, this is residue 89 in parmed) to be the most important residue for catalysis, so we will treat this residue at the QM level. This presents us with a bit of trouble. Earlier, we could just add all of the chorismate atoms to the QM system, because it is not covalently bound to the protein. This situation is of course different for enzyme sidechains, which are bound to the main chain. We thus have to choose where to cut the residue into a QM part and an MM part. Try to cut across the most boring aliphatic C-C bond in your residue, and definitely avoid cutting across heavily polarized bonds. In our case, we will cut the Cα - Cβ bond. The Cα will be a part of the MM subsystem while the Cβ will be a part of the QM subsystem. To do this, replace the FORCE_EVAL/QMMM#QM_KIND section from the previous input file with the following:

Note the FORCE_EVAL/QMMM#LINKsubsection. This tells cp2k where the two subsystems are linked, in this case through a bond between MM atom 1411 (Cα) and QM atom 1413 (Cβ), as well as how to treat the link. We are using the IMOMM method for the link. The arginine sidechain has now been moved to the QM region. This means we need to adjust the QM charge to -1, but this also affects the MM region. The partial charges on the atoms of the arginine residue sum up to +1. When we move the sidechain atoms to the QM region, their charges are no longer counted in the MM region, leading to a residual partial charge. In our case, the total charge amounts to -0.0362. We can neutralize the system by distributing an equal but opposite charge across all 6 mainchain atoms of the residue. For example, the charge on the N is -0.3479. Subtracting -0.0362/6 from that gives us the new charge, -0.3419. Do the same calculation for the 5 remaining mainchain atoms and check that the modified charges for the six atoms add up to zero. Then, apply the changes using parmed’s change command:

When we compare the two energy profiles, we can see that the activation energy for the reaction is around 35 kcal/mol, which is quite a bit closer to the literature values. Of course, there are a number of ways we can improve our estimate of the barrier, for instance sampling several transitions, decreasing the height and width of the gaussians for more fine-grained energy profiles, switching to a more accurate QM method and treating more residues at the QM level. This tutorial is intended as a starting point for further experimentation, and the interested reader is encouraged to explore this system further using more accurate techniques.