....
AlQuraishi and McAdams (3) demonstrate an impressive advance in predictive capability: when applied to the prediction of the binding specificity of proteins to DNA, they found approximately 90% accuracy, compared with approximately 60% for the best-performing alternative computational methods. It will be exciting to see future applications of this method to other areas. With all its strengths, it is important to stress that this model does not completely solve the problem of transferability. Although these models are sufficiently regularized to avoid overfitting, they are still limited by the fundamental nature of the data used as input. This differs from a physics-based approach that, in principle, does not suffer from this issue of transferability, if all the relevant degrees of freedom are included in the model (9).

This suggests that a fusion of both approaches is particularly appealing, whereby physical properties are used as prior knowledge (i.e., as a prior in a Bayesian formulation) but then the model is derived from existing data. The ability to fuse data-driven and physics-based approaches could push these types of models even further.
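One way to read the "prior in a Bayesian formulation" idea is as a Gaussian prior centered on physics-derived potentials, which turns the fit into a regularized least-squares problem. The sketch below illustrates that reading only; it is not the method of the paper or of the commentary. The names `map_with_physics_prior`, `w_physics`, and `tau`, and the random noise-free data, are all hypothetical.

```python
import numpy as np

def map_with_physics_prior(A, y, w_physics, tau):
    """Closed-form minimizer of ||A w - y||^2 + tau * ||w - w_physics||^2.

    Under Gaussian noise and a Gaussian prior centered on w_physics this is
    the MAP estimate; tau sets how strongly the prior is trusted.
    """
    n = A.shape[1]
    lhs = A.T @ A + tau * np.eye(n)
    rhs = A.T @ y + tau * w_physics
    return np.linalg.solve(lhs, rhs)

# Hypothetical toy problem: fewer measurements than unknown potentials.
rng = np.random.default_rng(1)
n_measurements, n_params = 10, 30
A = rng.normal(size=(n_measurements, n_params))        # assumed feature matrix
w_true = rng.normal(size=n_params)                     # assumed "true" potentials
y = A @ w_true                                         # noise-free for simplicity
w_physics = w_true + 0.5 * rng.normal(size=n_params)   # imperfect physics-based guess

w_map = map_with_physics_prior(A, y, w_physics, tau=1.0)
err_prior = np.linalg.norm(w_physics - w_true)
err_map = np.linalg.norm(w_map - w_true)
print(f"error of physics prior alone: {err_prior:.3f}")
print(f"error after fitting to data:  {err_map:.3f}")
```

A large `tau` returns essentially the physics-based guess, while a small `tau` lets the data dominate; in this noise-free setting the fitted estimate is never further from `w_true` than the prior alone.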

AlQuraishi’s “aha” moment came when he realized that determining atomic-level energy potentials in protein–DNA complexes could be treated mathematically as a signal acquisition problem. By using the crystal structures of the complexes as the “camera” or sensors, and using the experimentally determined binding affinities of the complexes as the compressive measurements, he could determine the “signal”—the attraction between specific pairs of atoms in the protein and DNA. The crystal structures are typically obtained by X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy and are archived in the Protein Data Bank, a community resource for biological research.

“Measurements that we take for granted as being one kind of measurement, a structural measurement, can actually be seen as a different kind of measurement,” AlQuraishi explains. “In a sense it’s two different mathematical formulations, one that’s of use for the structural stuff, and one that’s used for the statistical compressed sensing stuff. On face value they look different. But I had been staring at each individually for a long time, and it occurred to me that with a simple transformation, you could get one to look like the other.”

AlQuraishi’s advisor and collaborator at Stanford, Harley McAdams, a research professor of developmental biology, explains why this transformation works: “You have to ask the question, Why does a protein bind to DNA? It’s because there’s a set of atoms in the protein and a set of atoms in the DNA that have an attraction to each other—that’s what we call a potential. The sum of these individual atomic attractions gives you the net binding energy of the protein to the DNA. If you were able to figure out the set of all atomic relationships or proximities within this structure—which you can from the crystal structures, and that’s where a lot of computation comes in—and you did that for a lot of different cases, then you could statistically determine which of these interactions are important.”

That’s exactly what the compressed sensing computation does, says McAdams: “Given what we know about the set of atomic proximities between the protein and the DNA from these different cases, and what we know about their binding energies, we can take that information and infer what are the atom-to-atom potentials.”

AlQuraishi sums it up by saying, “The idea is that these protein–DNA complexes serve as natural experimental apparatus, because we know where the two pieces, the protein and DNA, are; and we also know the energy of that interaction. So each complex is effectively a probe into the underlying biophysics of protein–DNA interactions at the atomic level.”

“And it’s a direct measurement of the atomic-level interactions,” McAdams adds. “All other previous methods have not been that direct—they’ve been just purely statistical correlations, or based upon some hypothesized physical theory that would be applicable. But this method doesn’t use any hypothetical theory, it says OK, let’s just go in and measure it. That’s its power.”

The protein-DNA element is the (compressive) measurement. Wow, just wow, a beautiful aha moment. The other thing I like about this paper is that they define these new de novo potentials while substantially removing themselves from the traditional van der Waals deterministic approaches. My bet is that only computational chemistry outsiders would take this risk.

Compressed sensing has revolutionized signal acquisition, by enabling complex signals to be measured with remarkable fidelity using a small number of so-called incoherent sensors. We show that molecular interactions, e.g., protein–DNA interactions, can be analyzed in a directly analogous manner and with similarly remarkable results. Specifically, mesoscopic molecular interactions act as incoherent sensors that measure the energies of microscopic interactions between atoms. We combine concepts from compressed sensing and statistical mechanics to determine the interatomic interaction energies of a molecular system exclusively from experimental measurements, resulting in a “de novo” energy potential. In contrast, conventional methods for estimating energy potentials are based on theoretical models premised on a priori assumptions and extensive domain knowledge. We determine the de novo energy potential for pairwise interactions between protein and DNA atoms from (i) experimental measurements of the binding affinity of protein–DNA complexes and (ii) crystal structures of the complexes. We show that the de novo energy potential can be used to predict the binding specificity of proteins to DNA with approximately 90% accuracy, compared to approximately 60% for the best performing alternative computational methods applied to this fundamental problem. This de novo potential method is directly extendable to other biomolecule interaction domains (enzymes and signaling molecule interactions) and to other classes of molecular interactions.

SI Methods describe the general methodology for de novo potential determination using compressed sensing, the determination of protein–DNA potentials, and the prediction of protein–DNA-binding sites and testing methodology. There are three sections. Section I contains the derivation of the general methodology for de novo potential determination using compressed sensing. The formulation in Section I is not specific to a particular application, but is applicable to a range of biological and chemical systems. The mathematical notation and terminology used throughout are introduced in Section I.

This work brings together disparate concepts from information theory, statistical mechanics, and structural biology. The following background references may be useful to the reader:

• Linear and logistic regression (1).

• Compressed sensing (2, 3).

• Statistical mechanical ensembles (4, 5).

• Structural basis of protein–DNA interactions (6, 7).

Section II describes the application of the general methodology from Section I to the determination of a de novo protein–DNA potential. We reformulate the abstract constructs of the general methodology to the specifics of protein–DNA interactions and introduce several modifications that exploit the unique properties of protein–DNA interactions. Our choices for metaparameters and implementation details are also described in Section II.

Section III contains a description of the use of the protein–DNA potentials described in Section II to predict protein–DNA-binding sites. We detail our structure-based approach to protein–DNA-binding site prediction, the dataset used for training and testing, and the quantitative metrics used to compare results between our de novo potential method and previously published methods.