2013

MOTIVATION: Most proteins interact with small-molecule ligands such as metabolites or drug compounds. Over the past several decades, many of these interactions have been captured in high-resolution atomic structures. From a geometric point of view, most interaction sites for grasping these small-molecule ligands, as revealed in these structures, form concave shapes, or 'pockets', on the protein's surface. An efficient method for comparing these pockets could greatly assist the classification of ligand-binding sites, prediction of protein molecular function and design of novel drug compounds.

Molecular structures and functions of the majority of proteins across different species are yet to be identified. Much needed functional annotation of these gene products often benefits from the knowledge of protein-ligand interactions. Towards this goal, we developed eFindSite, an improved version of FINDSITE, designed to more efficiently identify ligand binding sites and residues using only weakly homologous templates. It employs a collection of effective algorithms, including highly sensitive meta-threading approaches, improved clustering techniques, advanced machine learning methods and reliable confidence estimation systems. Depending on the quality of target protein structures, eFindSite outperforms geometric pocket detection algorithms by 15-40% in binding site detection and by 5-35% in binding residue prediction. Moreover, compared to FINDSITE, it identifies 14% more binding residues in the most difficult cases. When multiple putative binding pockets are identified, the ranking accuracy is 75-78%, which can be further improved by 3-4% by including auxiliary information on binding ligands extracted from biomedical literature. As a first across-genome application, we describe structure modeling and binding site prediction for the entire proteome of Escherichia coli. Carefully calibrated confidence estimates strongly indicate that highly reliable ligand binding predictions are made for the majority of gene products; thus, eFindSite holds significant promise for large-scale genome annotation and drug development projects. eFindSite is freely available to the academic community at http://www.brylinski.org/efindsite .

The webPDBinder (http://pdbinder.bio.uniroma2.it/PDBinder) is a web server for the identification of small ligand-binding sites in a protein structure. webPDBinder searches a protein structure against a library of known binding sites and a collection of control non-binding pockets. The number of similarities identified with the residues in the two sets is then used to derive a propensity value for each residue of the query protein, associated with the likelihood that the residue is part of a ligand binding site. The predicted binding residues can be further refined using conservation scores derived from the multiple alignment of the PFAM protein family. webPDBinder correctly identifies residues belonging to the binding site in 77% of the cases and is able to identify binding pockets starting from holo or apo structures with comparable performance. This is important for all the real world cases where the query protein has been crystallized without a ligand and it is also difficult to obtain clear similarities with bound pockets from holo pocket libraries. The input is either a PDB code or a user-submitted structure. The output is a list of predicted binding pocket residues with propensity and conservation values both in text and graphical format.

LISE is a web server for a novel method for predicting small molecule binding sites on proteins. It differs from a number of servers currently available for such predictions in two aspects. First, rather than relying on knowledge of similar protein structures, identification of surface cavities or estimation of binding energy, LISE computes a score by counting geometric motifs extracted from sub-structures of interaction networks connecting protein and ligand atoms. These network motifs take into account spatial and physicochemical properties of ligand-interacting protein surface atoms. Second, LISE has now been more thoroughly tested, as, in addition to the evaluation we previously reported using two commonly used small benchmark test sets and targets of two community-based experiments on ligand-binding site predictions, we now report an evaluation using a large non-redundant data set containing >2000 protein-ligand complexes. This unprecedented test, the largest ever reported to our knowledge, demonstrates LISE's overall accuracy and robustness. Furthermore, we have identified some hard to predict protein classes and provided an estimate of the performance that can be expected from a state-of-the-art binding site prediction server, such as LISE, on a proteome scale. The server is freely available at http://lise.ibms.sinica.edu.tw.

Nucleos is a web server for the identification of nucleotide-binding sites in protein structures. Nucleos compares the structure of a query protein against a set of known template 3D binding sites representing nucleotide modules, namely the nucleobase, carbohydrate and phosphate. Structural features, clustering and conservation are used to filter and score the predictions. The predicted nucleotide modules are then joined to build whole nucleotide-binding sites, which are ranked by their score. The server takes as input either the PDB code of the query protein structure or a user-submitted structure in PDB format. The output of Nucleos is composed of ranked lists of predicted nucleotide-binding sites divided by nucleotide type (e.g. ATP-like). For each ranked prediction, Nucleos provides detailed information about the score, the template structure and the structural match for each nucleotide module composing the nucleotide-binding site. The predictions on the query structure and the template-binding sites can be viewed directly on the web through a graphical applet. In 98% of the cases, the modules composing correct predictions belong to proteins with no homology relationship between each other, meaning that the identification of brand-new nucleotide-binding sites is possible using information from non-homologous proteins. Nucleos is available at http://nucleos.bio.uniroma2.it/nucleos/.

Understanding molecular recognition is one major requirement for drug discovery and design. Physicochemical and shape complementarity between two binding partners is the driving force during complex formation. In this study, the impact of shape within this process is analyzed. Protein binding pockets and co-crystallized ligands are represented by normalized principal moments of inertia ratios (NPRs). The corresponding descriptor space is triangular, with its corners occupied by spherical, discoid, and elongated shapes. An analysis of a selected set of sc-PDB complexes suggests that pockets and bound ligands avoid spherical shapes, which are, however, prevalent in small unoccupied pockets. Furthermore, a direct shape comparison confirms previous studies that on average only one third of a pocket is filled by its bound ligand, supplemented by a 50% subpocket coverage. In this study, we found that shape complementarity is expressed by low pairwise shape distances in NPR space, short distances between the centers-of-mass, and small deviations in the angle between the first principal ellipsoid axes. Furthermore, it is assessed how different binding pocket parameters are related to bioactivity and binding efficiency of the co-crystallized ligand. In addition, the performance of different shape and size parameters of pockets and ligands is evaluated in a virtual screening scenario performed on four representative targets.
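As an illustration of the NPR descriptor mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes unit-mass points (atom or pocket grid coordinates) and the common convention NPR1 = I1/I3, NPR2 = I2/I3 with principal moments I1 <= I2 <= I3; the function names are hypothetical. Under this convention the triangle corners are approximately (0, 1) for elongated, (0.5, 0.5) for discoid and (1, 1) for spherical shapes.

```python
import math

def inertia_tensor(points):
    """Inertia tensor of unit-mass points about their centroid."""
    n = len(points)
    c = [sum(p[k] for p in points) / n for k in range(3)]
    t = [[0.0] * 3 for _ in range(3)]
    for p in points:
        d = [p[k] - c[k] for k in range(3)]
        r2 = d[0] ** 2 + d[1] ** 2 + d[2] ** 2
        for i in range(3):
            for j in range(3):
                t[i][j] += (r2 if i == j else 0.0) - d[i] * d[j]
    return t

def sym3_eigenvalues(a):
    """Ascending eigenvalues of a symmetric 3x3 matrix (trigonometric closed form)."""
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0.0:  # already diagonal
        return sorted(a[i][i] for i in range(3))
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    det = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
           - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
           + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    r = max(-1.0, min(1.0, det / 2.0))  # clamp against rounding error
    phi = math.acos(r) / 3.0
    e_max = q + 2.0 * p * math.cos(phi)
    e_min = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    return [e_min, 3.0 * q - e_max - e_min, e_max]

def npr(points):
    """Return (NPR1, NPR2) = (I1/I3, I2/I3); requires non-coincident points."""
    i1, i2, i3 = sym3_eigenvalues(inertia_tensor(points))
    return i1 / i3, i2 / i3

rod = [(-1, -1, -1), (0, 0, 0), (1, 1, 1)]
disc = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]
sphere = disc + [(0, 0, 1), (0, 0, -1)]
```

A shape distance between a pocket and its ligand can then be taken as the Euclidean distance between their two NPR points, which is one plausible reading of the "pairwise shape distances in NPR space" above.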

Due to the rising number of solved protein structures, computer-based techniques for automatic protein functional annotation and classification into families are of high scientific interest. DoGSiteScorer automatically calculates global descriptors for self-predicted pockets based on the 3D structure of a protein. Protein function predictors on three levels with increasing granularity are built by use of a support vector machine (SVM), based on descriptors of 26632 pockets from enzymes with known structure and EC classification. The SVM models represent a generalization of the available descriptor space for each enzyme class, subclass, and substrate-specific sub-subclass. Cross-validation studies show accuracies of 68.2% for predicting the correct main class and accuracies between 62.8% and 80.9% for the six subclasses. Substrate-specific recall rates for a kinase subset are 53.8%. Furthermore, application studies show the ability of the method for predicting the function of unknown proteins and gaining valuable information for the function prediction field. Proteins 2012.

We present TRAPP (TRAnsient Pockets in Proteins), a new automated software platform for tracking, analysis, and visualization of binding pocket variations along a protein motion trajectory or within an ensemble of protein structures that may encompass conformational changes ranging from local side chain fluctuations to global backbone motions. TRAPP performs accurate grid-based calculations of the shape and physicochemical characteristics of a binding pocket for each structure and detects the conserved and transient regions of the pocket in an ensemble of protein conformations. It also provides tools for tracing the opening of a particular subpocket and residues that contribute to the binding site. TRAPP thus enables an assessment of the druggability of a disease-related target protein taking its flexibility into account.

Prediction of the protein residues most likely to be involved in ligand recognition is of substantial value in structure-based drug design. Considering multiple ligand binding modes is of potential relevance to studying ligand recognition, but is generally ignored by currently available techniques. We have previously presented the site mapping technique, which considers multiple ligand binding modes in its analysis of protein-ligand recognition. AutoMap is a partially automated implementation of our previously developed site mapping procedure. It consists of a series of Perl scripts that utilize the output of molecular docking to generate "site maps" of a protein binding site. AutoMap determines the hydrogen bonding and van der Waals interactions taking place between a target protein and each pose of a ligand ensemble. It tallies these interactions according to the protein residues with which they occur, then normalizes the tallies and maps these to the surface of the protein. The residues involved in interactions are selected according to specific cutoffs. The procedure has been demonstrated to perform well in studying carbohydrate-protein and peptide-antibody recognition. An automated procedure to optimize cutoff selection is demonstrated to rapidly identify the appropriate cutoffs for these previously studied systems. The prediction of key ligand binding residues is compared between AutoMap using automatically optimized cutoffs, AutoMap using a previously selected cutoff, the top ranked pose from docking and the predictions supplied by FTMap. AutoMap using automatically optimized cutoffs is demonstrated to provide improved predictions, compared to other methods, in a set of immunologically relevant test cases. The automated implementation of the site mapping technique provides the opportunity for rapid optimization and deployment of the technique for investigating a broad range of protein-ligand systems.
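The tally, normalize and cutoff steps described above can be sketched as follows. This is a hypothetical illustration in Python rather than the AutoMap Perl scripts: the input layout (one list of `(residue, interaction_kind)` contacts per docked pose), the normalization by the largest tally and the default cutoff of 0.5 are all assumptions made for the example.

```python
from collections import Counter

def site_map(poses, cutoff=0.5):
    """Tally contacts per residue over all poses, normalize, apply a cutoff.

    poses  -- iterable of poses; each pose is a list of (residue, kind) pairs,
              where kind is e.g. "hbond" or "vdw" (assumed input format)
    cutoff -- minimum normalized tally for a residue to be selected
    """
    tally = Counter()
    for contacts in poses:
        for residue, _kind in contacts:
            tally[residue] += 1
    top = max(tally.values(), default=0)
    if top == 0:
        return {}, []
    # Assumed normalization: divide by the largest tally so values lie in (0, 1].
    normalized = {res: n / top for res, n in tally.items()}
    selected = sorted(res for res, v in normalized.items() if v >= cutoff)
    return normalized, selected

poses = [
    [("TYR33", "hbond"), ("TRP91", "vdw")],
    [("TYR33", "vdw")],
    [("TYR33", "hbond"), ("ASP52", "hbond")],
]
normalized, selected = site_map(poses, cutoff=0.5)
```

Sweeping `cutoff` over a grid and scoring each selection against known key residues would be one simple way to mimic the automated cutoff optimization the abstract refers to.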

We present a method to identify small molecule ligand binding sites and poses within a given protein crystal structure using GPU-accelerated Hamiltonian replica exchange molecular dynamics simulations. The Hamiltonians used vary from the physical end state of protein interacting with the ligand to an unphysical end state where the ligand does not interact with the protein. As replicas explore the space of Hamiltonians interpolating between these states, the ligand can rapidly escape local minima and explore potential binding sites. Geometric restraints keep the ligands from leaving the vicinity of the protein and an alchemical pathway designed to increase phase space overlap between intermediates ensures good mixing. Because of the rigorous statistical mechanical nature of the Hamiltonian exchange framework, we can also extract binding free energy estimates for all putative binding sites. We present results of this methodology applied to the T4 lysozyme L99A model system for three known ligands and one non-binder as a control, using an implicit solvent. We find that our methodology identifies known crystallographic binding sites consistently and accurately for the small number of ligands considered here and gives free energies consistent with experiment. We are also able to analyze the contribution of individual binding sites to the overall binding affinity. Our methodology points to near term potential applications in early-stage structure-guided drug discovery.

We propose a new molecular dynamics (MD) protocol to identify the binding site of a guest within a host. The method utilizes a four-dimensional (4D) spatial representation of the ligand, allowing for rapid and efficient sampling within the receptor. We applied the method to two different model receptors characterized by diverse structural features of the binding site and different ligand binding affinities. The Abl kinase domain has a deep binding pocket and displays high affinity for the two ligands examined here. The PDZ1 domain of PSD-95 has a shallow binding pocket that accommodates a peptide ligand involving far fewer interactions and a micromolar affinity. To ensure completely unbiased searching, the ligands were placed in the direct center of the protein receptors, away from the binding site, at the start of the 4D MD protocol. In both cases, the ligands were successfully docked into the binding site as identified in the published structures. The 4D MD protocol is able to overcome local energy barriers in locating the lowest energy binding pocket and will aid in the discovery of guest binding pockets in the absence of a priori knowledge of the site of interaction.

Accurate determination of potential ligand binding sites (BS) is a key step for protein function characterization and structure-based drug design. Despite promising results of template-based BS prediction methods using global structure alignment (GSA), there is room to improve the performance by properly incorporating local structure alignment (LSA) because BS are local structures and often similar for proteins with dissimilar global folds. We present a template-based ligand BS prediction method using G-LoSA, our LSA tool. A large benchmark set validation shows that G-LoSA predicts drug-like ligands' positions in single-chain protein targets more precisely than TM-align, a GSA-based method, while the overall success rate of TM-align is better. G-LoSA is particularly efficient for accurate detection of local structures conserved across proteins with diverse global topologies. Recognizing the performance complementarity of G-LoSA to TM-align and a nontemplate geometry-based method, fpocket, a robust consensus scoring method, CMCS-BSP (Complementary Methods and Consensus Scoring for ligand Binding Site Prediction), is developed and shows improvement on prediction accuracy.

Computational solvent mapping finds binding hot spots, determines their druggability and provides information for drug design. While mapping of a ligand-bound structure yields more accurate results, usually the apo structure serves as the starting point in design. The FTFlex algorithm, implemented as a server, can modify an apo structure to yield mapping results that are similar to those of the respective bound structure. Thus, FTFlex is an extension of our FTMap server, which only considers rigid structures. FTFlex identifies flexible residues within the binding site and determines alternative conformations using a rotamer library. In cases where the mapping results of the apo structure were in poor agreement with those of the bound structure, FTFlex was able to yield a modified apo structure, which led to improved FTMap results. In cases where the mapping results of the apo and bound structures were in good agreement, no new structure was predicted. AVAILABILITY: FTFlex is freely available as a web-based server at http://ftflex.bu.edu/. CONTACT: vajda@bu.edu or midas@bu.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

The ProBiS algorithm performs a local structural comparison of the query protein surface against the nonredundant database of protein structures. It finds proteins that have binding sites in common with the query protein. Here, we present a new parallelized algorithm, Parallel-ProBiS, for detecting similar binding sites on clusters of computers. The obtained speedups of the parallel ProBiS scale almost ideally with the number of computing cores up to about 64 computing cores. Scaling is better for larger than for smaller query proteins. For a protein with almost 600 amino acids, the maximum speedup of 180 was achieved on two interconnected clusters with 248 computing cores. Source code of Parallel-ProBiS is available for download free for academic users at http://probis.cmm.ki.si/download.
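To put the reported scaling figures in context, the short sketch below computes two standard quantities from a measured speedup: parallel efficiency (speedup divided by core count) and the Karp-Flatt experimentally determined serial fraction. These metrics are not taken from the abstract; they are added here only as a worked calculation, and the function names are hypothetical.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cores

def karp_flatt_serial_fraction(speedup, cores):
    """Karp-Flatt metric: e = (1/S - 1/p) / (1 - 1/p), defined for p > 1."""
    if cores <= 1:
        raise ValueError("Karp-Flatt metric needs more than one core")
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)

# Figures quoted in the abstract: speedup of 180 on 248 cores.
eff = parallel_efficiency(180, 248)          # about 0.73
serial = karp_flatt_serial_fraction(180, 248)
```

An efficiency of roughly 0.73 at 248 cores is consistent with the statement that scaling is close to ideal only up to about 64 cores.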

To understand the activity and cross reactivity of ligands and G protein-coupled receptors, we take stock of relevant existing receptor mutation, sequence, and structural data to develop a statistically robust and transparent scoring system. Our method evaluates the viability of binding of any ligand for any GPCR sequence of amino acids. This enabled us to explore the binding repertoire of both receptors and ligands, relying solely on correlations between carefully identified receptor features and without requiring any chemical information about ligands. This study suggests that sequence similarity at specific binding pockets can predict relative affinity of ligands; enabling recovery of over 80% of known ligands for a withheld receptor and almost 80% of known receptors for a ligand. The method enables qualitative prediction of ligand binding for all nonredundant human G protein-coupled receptors.

The estimation of prediction quality is important because without quality measures, it is difficult to determine the usefulness of a prediction. Currently, methods for ligand binding site residue predictions are assessed in the function prediction category of the biennial Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiment, utilizing the Matthews Correlation Coefficient (MCC) and Binding-site Distance Test (BDT) metrics. However, the assessment of ligand binding site predictions using such metrics requires the availability of solved structures with bound ligands. Thus, we have developed a ligand binding site quality assessment tool, FunFOLDQA, which utilizes protein feature analysis to predict ligand binding site quality prior to the experimental solution of the protein structures and their ligand interactions. The FunFOLDQA feature scores were combined using: simple linear combinations, multiple linear regression and a neural network. The neural network produced significantly better results for correlations to both the MCC and BDT scores, according to Kendall's $\tau$, Spearman's $\rho$ and Pearson's r correlation coefficients, when tested on both the CASP8 and CASP9 datasets. The neural network also produced the largest Area Under the Curve score (AUC) when Receiver Operator Characteristic (ROC) analysis was undertaken for the CASP8 dataset. Furthermore, the FunFOLDQA algorithm incorporating the neural network, is shown to add value to FunFOLD, when both methods are employed in combination. This results in a statistically significant improvement over all of the best server methods, the FunFOLD method (6.43%), and one of the top manual groups (FN293) tested on the CASP8 dataset. The FunFOLDQA method was also found to be competitive with the top server methods when tested on the CASP9 dataset. 
To the best of our knowledge, FunFOLDQA is the first attempt to develop a method that can be used to assess ligand binding site prediction quality, in the absence of experimental data.
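For reference, the Matthews Correlation Coefficient used to score binding site residue predictions can be computed from predicted and observed residue sets as in the sketch below. It is a generic MCC calculation, not FunFOLDQA code; the function name and the convention of returning 0.0 when the denominator vanishes are assumptions.

```python
import math

def binding_site_mcc(predicted, observed, all_residues):
    """MCC for a binding-residue prediction.

    predicted    -- residue identifiers predicted to bind the ligand
    observed     -- residue identifiers that bind in the solved structure
    all_residues -- every residue identifier in the protein
    """
    predicted, observed = set(predicted), set(observed)
    all_residues = set(all_residues)
    tp = len(predicted & observed)
    fp = len(predicted - observed)
    fn = len(observed - predicted)
    tn = len(all_residues - predicted - observed)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

residues = range(1, 11)
mcc = binding_site_mcc({1, 2, 3, 5}, {1, 2, 3, 4}, residues)
```

Because this calculation needs the observed binding residues, it can only be applied after a ligand-bound structure is solved, which is the gap a quality predictor such as FunFOLDQA is meant to fill.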

This chapter describes a method for analyzing the allosteric influence of molecular interactions on protein conformational distributions. The method, called Dynamics Perturbation Analysis (DPA), generally yields insights into allosteric effects in proteins and is especially useful for predicting ligand-binding sites. The use of DPA for binding site prediction is motivated by the following allosteric regulation hypothesis: interactions in native binding sites cause a large change in protein conformational distributions. Here, we review the reasoning behind this hypothesis, describe the math behind the method, and present a recipe for predicting binding sites using DPA.

A computational pipeline, PocketAnnotate, for functional annotation of proteins at the level of binding sites is proposed in this study. The pipeline integrates three in-house algorithms for site-based function annotation: PocketDepth, for prediction of binding sites in protein structures; PocketMatch, for rapid comparison of binding sites; and PocketAlign, to obtain detailed alignment between pairs of binding sites. A novel scheme has been developed to rapidly generate a database of non-redundant binding sites. For a given input protein structure, putative ligand-binding sites are identified, matched in real time against the database and the query substructure aligned with the promising hits, to obtain a set of possible ligands that the given protein could bind to. The input can be either whole protein structures or merely the substructures corresponding to possible binding sites. Structure-based function annotation at the level of binding sites thus achieved could prove very useful for cases where no obvious functional inference can be obtained based purely on sequence or fold-level analyses. An attempt has also been made to analyse proteins of no known function from the Protein Data Bank. PocketAnnotate would be a valuable tool for the scientific community and contribute towards structure-based functional inference. The web server can be freely accessed at http://proline.biochem.iisc.ernet.in/pocketannotate/.

Computational drug repositioning offers promise for discovering new uses of existing drugs, as drug related molecular, chemical, and clinical information has increased over the past decade and become broadly accessible. In this study, we present a new computational approach for identifying potential new indications of an existing drug through its relation to similar drugs in a disease-drug-target network. When measuring pairwise drug similarity, we used a bipartite-graph based method which combined similarity of drug compound structures, similarity of target protein profiles, and interaction between target proteins. In evaluation, our method compared favorably to the state of the art, achieving an AUC of 0.888. The results indicated that our method is able to identify drug repositioning opportunities by exploring complex relationships in the disease-drug-target network.

Predicting druggability and prioritizing certain disease modifying targets for the drug development process is of high practical relevance in pharmaceutical research. DoGSiteScorer is a fully automatic algorithm for pocket and druggability prediction. Besides global properties of the pocket, local similarities shared between pockets are also considered. Druggability scores are predicted by means of a support vector machine (SVM), trained and tested on the druggability data set (DD) and its nonredundant version (NRDD). The DD consists of 1069 targets with assigned druggable, difficult, and undruggable classes. In 90% of the NRDD, the SVM model based on global descriptors correctly classifies a target as either druggable or undruggable. Nevertheless, global properties suffer from binding site changes due to ligand binding and from the pocket boundary definition. Therefore, local pocket properties are additionally investigated in terms of a nearest neighbor search. Local similarities are described by distance dependent histograms between atom pairs. In 88% of the DD pocket set, the nearest neighbor and the structure itself conform with their druggability type. A discriminant feature between druggable and undruggable pockets is having fewer short-range hydrophilic-hydrophilic pairs and more short-range lipophilic-lipophilic pairs. Our findings for global pocket descriptors coincide with previously published methods affirming that size, shape, and hydrophobicity are important global pocket descriptors for automatic druggability prediction. Nevertheless, the variety of pocket shapes and their flexibility upon ligand binding limit the automatic projection of druggable features onto descriptors. Incorporating local pocket properties is another step toward a reliable descriptor-based druggability prediction.
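The distance dependent atom-pair histograms mentioned above can be illustrated with the following minimal sketch. It is not DoGSiteScorer code: the two-class atom typing ("hydrophilic", "lipophilic"), the 1 Å bin width, the 6 Å distance limit and the function name are assumptions chosen for the example.

```python
import math
from itertools import combinations

def pair_distance_histograms(atoms, bin_width=1.0, max_dist=6.0):
    """Histogram pairwise atom distances, one histogram per atom-type pair.

    atoms -- list of ((x, y, z), atom_type) tuples for one pocket
    Returns {(type_a, type_b): [count per distance bin]} with sorted type keys.
    """
    n_bins = int(max_dist / bin_width)
    hists = {}
    for (xyz_a, type_a), (xyz_b, type_b) in combinations(atoms, 2):
        d = math.dist(xyz_a, xyz_b)
        if d >= max_dist:
            continue  # ignore long-range pairs
        key = tuple(sorted((type_a, type_b)))
        index = min(int(d // bin_width), n_bins - 1)
        hists.setdefault(key, [0] * n_bins)[index] += 1
    return hists

pocket = [
    ((0.0, 0.0, 0.0), "lipophilic"),
    ((1.5, 0.0, 0.0), "lipophilic"),
    ((0.0, 4.2, 0.0), "hydrophilic"),
    ((0.0, 0.0, 10.0), "hydrophilic"),
]
hists = pair_distance_histograms(pocket)
```

Two pockets could then be compared by a distance between their concatenated histograms, and a nearest neighbor search over a labeled pocket set would give a druggability guess in the spirit of the local approach described above.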

Protein-protein interfaces are considered difficult targets for small-molecule protein-protein interaction modulators (PPIMs). Here, we present for the first time a computational strategy that simultaneously considers aspects of energetics and plasticity in the context of PPIM binding to a protein interface. The strategy aims at identifying the determinants of small-molecule binding, hot spots, and transient pockets, in a protein-protein interface in order to make use of this knowledge for predicting binding modes of and ranking PPIMs with respect to their affinity. When applied to interleukin-2 (IL-2), the computationally inexpensive constrained geometric simulation method FRODA outperforms molecular dynamics simulations in sampling hydrophobic transient pockets. We introduce the PPIAnalyzer approach for identifying transient pockets on the basis of geometrical criteria only. A sequence of docking to identified transient pockets, starting structure selection based on hot spot information, RMSD clustering and intermolecular docking energies, and MM-PBSA calculations allows one to enrich IL-2 PPIMs from a set of decoys and to discriminate between subgroups of IL-2 PPIMs with low and high affinity. Our strategy will be applicable in a prospective manner where nothing other than a protein-protein complex structure is known; hence, it can well be the first step in a structure-based endeavor to identify PPIMs.

We have developed FINDSITEX, an extension of FINDSITE, a protein threading based algorithm for the inference of protein binding sites, biochemical function and virtual ligand screening, that removes the limitation that holo protein structures (those containing bound ligands) of a sufficiently large set of distant evolutionarily related proteins to the target be solved; rather, predicted protein structures and experimental ligand binding information are employed. To provide the predicted protein structures, a fast and accurate version of our recently developed TASSERVMT, TASSERVMT-lite, for template-based protein structural modeling applicable up to 1000 residues is developed and tested, with comparable performance to the top CASP9 servers. Then, a hybrid approach that combines structure alignments with an evolutionary similarity score for identifying functional relationships between target and proteins with binding data has been developed. By way of illustration, FINDSITEX is applied to 998 identified human G-protein coupled receptors (GPCRs). First, TASSERVMT-lite provides updates of all human GPCR structures previously modeled in our lab. We then use these structures and the new function similarity detection algorithm to screen all human GPCRs against the ZINC8 nonredundant (TC < 0.7) ligand set combined with ligands from the GLIDA database (a total of 88,949 compounds). Testing (excluding GPCRs whose sequence identity > 30% to the target from the binding data library) on a 168 human GPCR set with known binding data, the average enrichment factor in the top 1% of the compound library (EF0.01) is 22.7, whereas EF0.01 by FINDSITE is 7.1. For virtual screening when just the target and its native ligands are excluded, the average EF0.01 reaches 41.4. We also analyze off-target interactions for the 168 protein test set. 
All predicted structures, virtual screening data and off-target interactions for the 998 human GPCRs are available at http://cssb.biology.gatech.edu/skolnick/webservice/gpcr/index.html.
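The enrichment factor in the top 1% of the library (EF0.01) quoted above follows the usual virtual screening definition: the hit rate among the top-ranked fraction divided by the hit rate of the whole library. The sketch below is a generic calculation under that definition, not code from FINDSITEX; the function name and the boolean-list input are assumptions.

```python
def enrichment_factor(ranked_is_active, fraction=0.01):
    """Enrichment factor for the top `fraction` of a ranked compound library.

    ranked_is_active -- list of booleans, best-scored compound first,
                        True where the compound is a known active
    """
    n = len(ranked_is_active)
    total_actives = sum(ranked_is_active)
    if n == 0 or total_actives == 0:
        return 0.0
    n_top = max(1, round(n * fraction))
    hits = sum(ranked_is_active[:n_top])
    return (hits / n_top) / (total_actives / n)

# Toy library: 1000 compounds, 10 actives, 5 of them ranked in the top 10.
ranking = [True] * 5 + [False] * 5 + [True] * 5 + [False] * 985
ef = enrichment_factor(ranking, fraction=0.01)
```

With a 1% cutoff the largest possible value is 100, so an average EF0.01 of 22.7 means the top 1% of the ranked library is about 23 times richer in known binders than a random selection.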

MOTIVATION: Finding geometrically similar protein binding sites is crucial for understanding protein functions and can provide valuable information for protein-protein docking and drug discovery. As the number of known protein-protein interaction structures has dramatically increased, a high-throughput and accurate protein binding site comparison method is essential. Traditional alignment-based methods can provide accurate correspondence between the binding sites but are computationally expensive.

Proteins perform functions through interacting with other molecules. However, structural details for most of the protein-ligand interactions are unknown. We present a comparative approach (COFACTOR) to recognize functional sites of protein-ligand interactions using low-resolution protein structural models, based on a global-to-local sequence and structural comparison algorithm. COFACTOR was tested on 501 proteins, which harbor 582 natural and drug-like ligand molecules. Starting from I-TASSER structure predictions, the method successfully identifies ligand-binding pocket locations for 65% of apo receptors with an average distance error of 2 Å. The average precision of binding-residue assignments is 46% and 137% higher than that by FINDSITE and ConCavity. In CASP9, COFACTOR achieved a binding-site prediction precision of 72% and a Matthews correlation coefficient of 0.69 for 31 blind test proteins, which was significantly higher than all other participating methods. These data demonstrate the power of structure-based approaches to protein-ligand interaction predictions applicable for genome-wide structural and functional annotations.

Complex biological functions emerge through intricate protein-protein interaction networks. An important class of protein-protein interaction corresponds to peptide-mediated interactions, in which a short peptide stretch from one partner interacts with a large protein surface from the other partner. Protein-peptide interactions are typically of low affinity and involved in regulatory mechanisms, dynamically reshaping protein interaction networks. Due to the relatively small interaction surface, modulation of protein-peptide interactions is feasible and highly attractive for therapeutic purposes. Unfortunately, the number of available 3D structures of protein-peptide interfaces is very limited. For typical cases where a protein-peptide structure of interest is not available, the PepSite web server can be used to predict peptide-binding spots from protein surfaces alone. The PepSite method relies on preferred peptide-binding environments calculated from a set of known protein-peptide 3D structures, combined with distance constraints derived from known peptides. We present an updated version of the web server that is orders of magnitude faster than the original implementation, returning results in seconds instead of minutes or hours. The PepSite web server is available at http://pepsite2.russelllab.org.

Analyzing protein binding sites provides detailed insights into the biological processes proteins are involved in, e.g., into drug-target interactions, and so is of crucial importance in drug discovery. Herein, we present novel alignment-independent binding site descriptors based on DrugScore potential fields. The potential fields are transformed to a set of information-rich descriptors using a series expansion in 3D Zernike polynomials. The resulting Zernike descriptors show a promising performance in detecting similarities among proteins with low pairwise sequence identities that bind identical ligands, as well as within subfamilies of one target class. Furthermore, the Zernike descriptors are robust against structural variations among protein binding sites. Finally, the Zernike descriptors show a high data compression power, and computing similarities between binding sites based on these descriptors is highly efficient. Consequently, the Zernike descriptors are a useful tool for computational binding site analysis, e.g., to predict the function of novel proteins, off-targets for drug candidates, or novel targets for known drugs.

The accurate identification of cavities that can bind ligands on the surface of proteins is of major importance for characterizing the function of proteins based on their structure. In addition, it can be helpful for rational structure-based drug design on target proteins of medical relevance and for evaluating the tendency of proteins to aggregate or oligomerize. A new approach termed dPredGB to detect and evaluate putative binding cavities on protein surfaces has been developed. In contrast to existing prediction methods that are based on purely geometric features of binding sites or on possible direct interactions with a putative binding partner, the dPredGB approach combines rapid geometric detection with an evaluation of the desolvation properties of the putative binding pocket. It has been tested on a variety of proteins known to bind ligands in bound and unbound conformations. The approach outperforms most available methods and also offers a spatial characterization of the desolvation properties of a binding region. On a test set of proteins, the method identifies the known ligand binding cavity as the top-ranking prediction in 69% of the unbound cases and 85% of the bound cases. Possibilities to improve the prediction performance even further are also discussed.

MOTIVATION: Knowledge about the site at which a ligand binds provides an important clue for predicting the function of a protein and is also often a prerequisite for performing docking computations in virtual drug design and screening. We have previously shown that certain ligand interacting triangles of protein atoms, called protein triangles, tend to occur more frequently at ligand binding sites than at other parts of the protein. RESULTS: In this work, we describe a new ligand binding site prediction method that was developed based on binding site-enriched protein triangles. The new method was tested on two benchmark datasets and also on 19 targets from two recent community-based studies of such predictions, and excellent results were obtained. Where comparisons were made, the success rates of the new method for the first predicted site were significantly better than those of methods that are not meta-predictors. Further examination showed that, for most of the unsuccessful predictions, the pocket of the ligand binding site was identified, but not the site itself, while, for some others, the failure was not due to the method itself, but to the use of an incorrect biological unit in the structure examined, although using correct biological units would not necessarily improve the prediction success rates. These results suggest that the new method is a valuable new addition to a suite of existing structure-based bioinformatics tools for studies of molecular recognition and related functions of proteins in post-genomics research. AVAILABILITY: The executable binaries and a web server for our method are available from http://sourceforge.net/projects/msdock/ and http://lise.ibms.sinica.edu.tw, respectively, free for academic users. CONTACT: mjhwang@ibms.sinica.edu.tw SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

ABSTRACT: BACKGROUND: Protein structures provide a valuable resource for rational drug design. For a protein with no known ligand, computational tools can predict surface pockets that are of suitable size and shape to accommodate a complementary small-molecule drug. However, pocket prediction against single static structures may miss features of pockets that arise from proteins' dynamic behaviour. In particular, ligand-binding conformations can be observed as transiently populated states of the apo protein, so it is possible to gain insight into ligand-bound forms by considering conformational variation in apo proteins. This variation can be explored by considering sets of related structures: computationally generated conformers, solution NMR ensembles, multiple crystal structures, homologues or homology models. It is non-trivial to compare pockets, either from different programs or across sets of structures. For a single structure, difficulties arise in defining a particular pocket's boundaries. For a set of conformationally distinct structures the challenge is how to make reasonable comparisons between them given that a perfect structural alignment is not possible. RESULTS: We have developed a computational method, Provar, that provides a consistent representation of predicted binding pockets across sets of related protein structures. The outputs are probabilities that each atom or residue of the protein borders a predicted pocket. These probabilities can be readily visualised on a protein using existing molecular graphics software. We show how Provar simplifies comparison of the outputs of different pocket prediction algorithms, of pockets across multiple simulated conformations and between homologous structures. 
We demonstrate the benefits of use of multiple structures for protein-ligand and protein-protein interface analysis on a set of complexes and consider three case studies in detail: i) analysis of a kinase superfamily highlights the conserved occurrence of surface pockets at the active and regulatory sites; ii) a simulated ensemble of unliganded Bcl-2 structures reveals extensions of a known ligand-binding pocket not apparent in the apo crystal structure; iii) visualisations of interleukin-2 and its homologues highlight conserved pockets at the known receptor interfaces and regions whose conformation is known to change on inhibitor binding. CONCLUSIONS: Through post-processing of the output of a variety of pocket prediction software, Provar provides a flexible approach to visualization of the persistence or variability of pockets in sets of related protein structures.
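
The per-residue probabilities described in the Provar abstract above can be illustrated with a minimal sketch. It assumes the simplest possible estimate: the fraction of structures in a set in which a residue lines any predicted pocket. The function name, the residue labels and the toy ensemble are hypothetical, and Provar's actual calculation may differ.

```python
from collections import Counter

def pocket_lining_probability(lining_sets):
    """Fraction of structures in which each residue lines a predicted pocket.

    lining_sets: one set of residue identifiers per structure (conformer,
    homologue, ...), as reported by any pocket prediction program.
    This frequency estimate is an assumption made for illustration.
    """
    n_structures = len(lining_sets)
    if n_structures == 0:
        raise ValueError("need at least one structure")
    counts = Counter()
    for residues in lining_sets:
        counts.update(set(residues))  # count each residue once per structure
    return {res: c / n_structures for res, c in counts.items()}

# Hypothetical pocket-lining residues from four conformers of one protein.
ensemble = [
    {"A10", "A11", "A45"},
    {"A10", "A45"},
    {"A10", "A46"},
    {"A10", "A11"},
]
probs = pocket_lining_probability(ensemble)
```

Residues with a value near 1 line a pocket in nearly every structure, while low values flag pockets that appear only in some conformations.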

MOTIVATION: Computational characterization of ligand binding sites in proteins provides preliminary information for functional annotation, protein design and ligand optimization. SiteComp implements binding site analysis for comparison of binding sites, evaluation of residue contribution to binding sites, and identification of sub-sites with distinct molecular interaction properties. AVAILABILITY: The SiteComp server and tutorials are freely available at http://sitecomp.sanchezlab.org. CONTACT: roberto@sanchezlab.org or roberto.sanchez@mssm.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

The distribution of ligand-binding pockets around protein-protein interfaces suggests a general mechanism for pocket formation.
Gao, Mu and Skolnick, Jeffrey
Proceedings of the National Academy of Sciences of the United States of America, 2012, 109(10), 3784-3789
PMID: 22355140
doi: 10.1073/pnas.1117768109

Protein-protein and protein-ligand interactions are ubiquitous in a biological cell. Here, we report a comprehensive study of the distribution of protein-ligand interaction sites, namely ligand-binding pockets, around protein-protein interfaces where protein-protein interactions occur. We inspected a representative set of 1,611 protein-protein complexes and identified pockets with a potential for binding small molecule ligands. The majority of these pockets are within a 6 Å distance from protein interfaces. Accordingly, in about half of ligand-bound protein-protein complexes, amino acids from both sides of a protein interface are involved in direct contacts with at least one ligand. Statistically, ligands are closer to a protein-protein interface than a random surface patch of the same solvent accessible surface area. Similar results are obtained in an analysis of the ligand distribution around domain-domain interfaces of 1,416 nonredundant, two-domain protein structures. Furthermore, pockets comparable in size to those observed in experimental structures are present in artificially generated protein complexes, suggesting that the prominent appearance of pockets around protein interfaces is mainly a structural consequence of protein packing and thus is an intrinsic geometric feature of protein structure. Nature may take advantage of such a structural feature by selecting and further optimizing for biological function. We propose that packing nearby protein-protein or domain-domain interfaces is a major route to the formation of ligand-binding pockets.

Estimating the pairwise similarity of protein-ligand binding sites is a fast and efficient way of predicting cross-reactivity and putative side effects of drug candidates. Among the many tools available, three-dimensional (3D) alignment-dependent methods are usually slow and based on simplified representations of binding site atoms or surfaces. On the other hand, fast and efficient alignment-free methods have recently been described but suffer from a lack of interpretability. We herewith present a novel binding site description (VolSite), coupled to an alignment and comparison tool (Shaper), combining the speed of alignment-free methods with the interpretability of alignment-dependent approaches. It is based on the comparison of negative images of binding cavities encoding both shape and pharmacophoric properties at regularly spaced grid points. Shaper approximates the resulting molecular shape with a smooth Gaussian function and aligns protein binding sites by optimizing their volume overlap. VolSite and Shaper were successfully applied to compare protein-ligand binding sites and to predict their structural druggability.
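
The Gaussian volume overlap mentioned above can be sketched as follows. This is a minimal illustration, not Shaper's implementation: it assumes every cavity grid point carries an identical spherical Gaussian, uses only first-order pairwise overlaps, and scores a fixed superposition without optimizing the alignment. The width `alpha`, the `amplitude`, the function names and the toy coordinates are all hypothetical. It relies on the standard closed-form overlap integral of two spherical Gaussians of equal width, which is proportional to exp(-alpha * d^2 / 2) for centres a distance d apart.

```python
import math

def gaussian_overlap(points_a, points_b, alpha=1.0, amplitude=1.0):
    """Sum of pairwise overlap integrals between spherical Gaussians.

    Each point carries amplitude * exp(-alpha * |r - R|^2). The values of
    alpha and amplitude are illustrative, not taken from Shaper.
    """
    prefactor = amplitude ** 2 * (math.pi / (2.0 * alpha)) ** 1.5
    total = 0.0
    for a in points_a:
        for b in points_b:
            d2 = sum((x - y) ** 2 for x, y in zip(a, b))
            total += prefactor * math.exp(-0.5 * alpha * d2)
    return total

def shape_tanimoto(points_a, points_b, alpha=1.0):
    """Tanimoto-style similarity built from the overlap volumes."""
    o_ab = gaussian_overlap(points_a, points_b, alpha)
    o_aa = gaussian_overlap(points_a, points_a, alpha)
    o_bb = gaussian_overlap(points_b, points_b, alpha)
    return o_ab / (o_aa + o_bb - o_ab)

# Hypothetical cavity grid points (in angstroms) and a translated copy.
cavity = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 1.5, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in cavity]

same_score = shape_tanimoto(cavity, cavity)
shifted_score = shape_tanimoto(cavity, shifted)
```

An alignment procedure would search over rotations and translations of one point set to maximize `o_ab`; here the score simply drops below 1 once the copy is displaced.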

Pharmacophore Fingerprint-Based Approach to Binding Site Subpocket Similarity and Its Application to Bioisostere Replacement.
Wood, David J and Vlieg, Jacob de and Wagener, Markus and Ritschel, Tina
Journal of chemical information and modeling, 2012, 52(8), 2031-2043
PMID: 22830492
doi: 10.1021/ci3000776

Bioisosteres have been defined as structurally different molecules or substructures that can form comparable intermolecular interactions, and therefore, fragments that bind to similar protein structures exhibit a degree of bioisosterism. We present KRIPO (Key Representation of Interaction in POckets): a new method for quantifying the similarities of binding site subpockets based on pharmacophore fingerprints. The binding site fingerprints have been optimized to improve their performance for both intra- and interprotein family comparisons. A range of attributes of the fingerprints was considered in the optimization, including the placement of pharmacophore features, whether or not the fingerprints are fuzzified, and the resolution and complexity of the pharmacophore fingerprints (2-, 3-, and 4-point fingerprints). Fuzzy 3-point pharmacophore fingerprints were found to represent the optimal balance between computational resource requirements and the identification of potential replacements. The complete PDB was converted into a database comprising almost 300 000 optimized fingerprints of local binding sites together with their associated ligand fragments. The value of the approach is demonstrated by application to two crystal structures from the Protein Data Bank: (1) a MAP kinase P38 structure in complex with a pyridinylimidazole inhibitor (1A9U) and (2) a complex of thrombin with melagatran (1K22). Potentially valuable bioisosteric replacements for all subpockets of the two studied proteins are identified.

MOTIVATION: Many drug discovery projects fail because the underlying target is finally found to be undruggable. Progress in structure elucidation of proteins now opens up a route to automatic structure-based target assessment. DoGSiteScorer is a newly developed automatic tool combining pocket prediction, characterization and druggability estimation and is now available through a web server. AVAILABILITY: The DoGSiteScorer web server is freely available for academic use at http://dogsite.zbh.uni-hamburg.de CONTACT: rarey@zbh.uni-hamburg.de.

BACKGROUND: Identifying the location of binding sites on proteins is of fundamental importance for a wide range of applications including molecular docking, de novo drug design, structure identification and comparison of functional sites. Structural genomic projects are beginning to produce protein structures with unknown functions. Therefore, efficient methods are required if all these structures are to be properly annotated. Many methods for finding binding sites involve 3D structure comparison. Here we design a method to find protein binding sites by direct comparison of protein 3D structures.

Empty space in a protein structure can provide valuable insight into protein properties such as internal hydration, structure stabilization, substrate translocation, storage compartments or binding sites. This information can be visualized by means of cavity analysis. Numerous tools are available depicting cavities directly or identifying lining residues. So far, all available techniques are based on a single conformation, neglecting any form of protein and cavity dynamics. Here we report a novel, grid-based cavity detection method that uses protein and solvent residence probabilities derived from molecular dynamics simulations to identify (I) internal cavities, (II) tunnels or (III) clefts on the protein surface. Driven by a graphical user interface, output can be exported in PDB format where cavities are described as individually selectable groups of adjacent voxels representing regions of high solvent residence probability. Cavities can be analyzed in terms of solvent density, cavity volume and cross-sectional area along a principal axis. To assess dxTuber performance we performed test runs on a set of six example proteins representing the three main classes of protein cavities and compared our findings to results obtained with SURFNET, CAVER and PyMol.
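
The grouping of adjacent high-probability voxels described above can be sketched as a connected-component search on a grid. This is an illustration only, not dxTuber's code: the probability threshold, the face-sharing (6-neighbour) definition of adjacency, the function name and the toy grid are assumptions, and the sketch does not classify groups as internal cavities, tunnels or clefts.

```python
from collections import deque

# Offsets to the six face-sharing neighbours of a voxel (assumed adjacency).
NEIGHBOUR_OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                     (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def cluster_voxels(residence_prob, threshold=0.5):
    """Group adjacent voxels whose solvent residence probability is high.

    residence_prob: dict mapping (i, j, k) grid indices to a probability.
    threshold: hypothetical cut-off for "high" residence probability.
    Returns a list of groups, each a sorted list of voxel indices.
    """
    selected = {v for v, p in residence_prob.items() if p >= threshold}
    seen = set()
    groups = []
    for start in sorted(selected):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        group = []
        while queue:  # breadth-first flood fill from the seed voxel
            voxel = queue.popleft()
            group.append(voxel)
            for di, dj, dk in NEIGHBOUR_OFFSETS:
                nxt = (voxel[0] + di, voxel[1] + dj, voxel[2] + dk)
                if nxt in selected and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        groups.append(sorted(group))
    return groups

# Toy grid: two high-probability regions separated by a low-probability voxel.
grid = {(0, 0, 0): 0.9, (1, 0, 0): 0.8, (2, 0, 0): 0.2,
        (3, 0, 0): 0.7, (3, 1, 0): 0.95}
groups = cluster_voxels(grid)
```

Each returned group corresponds to one individually selectable set of voxels; its size multiplied by the voxel volume gives a simple cavity volume estimate.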

Depth measures the extent of atom/residue burial within a protein. It correlates with properties such as protein stability, hydrogen exchange rate, protein-protein interaction hot spots, post-translational modification sites and sequence variability. Our server, DEPTH, accurately computes depth and solvent-accessible surface area (SASA) values. We show that depth can be used to predict small molecule ligand binding cavities in proteins. Often, some of the residues lining a ligand binding cavity are both deep and solvent exposed. Using the depth-SASA pair values for a residue, its likelihood to form part of a small molecule binding cavity is estimated. The parameters of the method were calibrated over a training set of 900 high-resolution X-ray crystal structures of single-domain proteins bound to small molecules (molecular weight <1.5 kDa). The prediction accuracy of DEPTH is comparable to that of other geometry-based prediction methods including LIGSITE, SURFNET and Pocket-Finder (all with Matthews correlation coefficient of ∼0.4) over a testing set of 225 single and multi-chain protein structures. Users have the option of tuning several parameters to detect cavities of different sizes, for example, geometrically flat binding sites. The input to the server is a protein 3D structure in PDB format. The users have the option of tuning the values of four parameters associated with the computation of residue depth and the prediction of binding cavities. The computed depths, SASA and binding cavity predictions are displayed in 2D plots and mapped onto 3D representations of the protein structure using Jmol. Links are provided to download the outputs. Our server is useful for all structural analysis based on residue depth and SASA, such as guiding site-directed mutagenesis experiments and small molecule docking exercises, in the context of protein functional annotation and drug discovery.
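
For readers unfamiliar with the Matthews correlation coefficient quoted above, a short sketch of its standard definition follows. The confusion-matrix counts are hypothetical and are not taken from the DEPTH benchmark; returning 0.0 for an undefined denominator is a common convention adopted here as an assumption.

```python
import math

def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    tp/fn: binding residues predicted correctly / missed.
    tn/fp: non-binding residues predicted correctly / wrongly flagged.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when any marginal total is zero
    return (tp * tn - fp * fn) / denominator

# Hypothetical counts for one protein with 50 true binding residues.
score = matthews_cc(tp=30, tn=150, fp=20, fn=20)
```

The coefficient ranges from -1 to 1, with 0 corresponding to random prediction, which makes it more informative than raw accuracy when binding residues are a small minority.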

Thermodynamic analysis of water molecules at the surface of proteins and applications to binding site prediction and characterization
Beuming, Thijs and Che, Ye and Abel, Robert and Kim, Byungchan and Shanmugasundaram, Veerabahu and Sherman, Woody
Proteins, 2011, 80(3), 871-883
PMID: 22223256
doi: 10.1002/prot.23244

Water plays an essential role in determining the structure and function of all biological systems. Recent methodological advances allow for an accurate and efficient estimation of the thermodynamic properties of water molecules at the surface of proteins. In this work, we characterize these thermodynamic properties and relate them to various structural and functional characteristics of the protein. We find that high-energy hydration sites often exist near protein motifs typically characterized as hydrophilic, such as backbone amide groups. We also find that waters around alpha helices and beta sheets tend to be less stable than waters around loops. Furthermore, we find no significant correlation between the hydration site-free energy and the solvent accessible surface area of the site. In addition, we find that the distribution of high-energy hydration sites on the protein surface can be used to identify the location of binding sites and that binding sites of druggable targets tend to have a greater density of thermodynamically unstable hydration sites. Using this information, we characterize the FKBP12 protein and show good agreement between fragment screening hit rates from NMR spectroscopy and hydration site energetics. Finally, we show that water molecules observed in crystal structures are less stable on average than bulk water as a consequence of the high degree of spatial localization, thereby resulting in a significant loss in entropy. These findings should help to better understand the characteristics of waters at the surface of proteins and are expected to lead to insights that can guide structure-based drug design efforts.

MOTIVATION: Binding site identification is a classical problem that is important for a range of applications, including the structure-based prediction of function, the elucidation of functional relationships among proteins, protein engineering, and drug design. We describe an accurate method of binding site identification, namely FTSite. This method is based on experimental evidence that ligand binding sites also bind small organic molecules of various shapes and polarity. The FTSite algorithm does not rely on any evolutionary or statistical information, but achieves near experimental accuracy: it is capable of identifying the binding sites in over 94% of apo proteins from established test sets that have been used to evaluate many other binding site prediction methods. AVAILABILITY: FTSite is freely available as a web-based server at http://ftsite.bu.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. CONTACT: vajda@bu.edu; midas@bu.edu.

We report here a robust automated active site detection, docking, and scoring (AADS) protocol for proteins with known structures. The active site finder identifies all cavities in a protein and scores them based on the physicochemical properties of functional groups lining the cavities in the protein. The accuracy realized on 620 proteins with sizes ranging from 100 to 600 amino acids with known drug active sites is 100% when the top ten cavity points are considered. These top ten cavity points identified are then submitted for an automated docking of an input ligand/candidate molecule. The docking protocol uses an all atom energy based Monte Carlo method. Eight low energy docked structures corresponding to different locations and orientations of the candidate molecule are stored at each cavity point giving 80 docked structures overall which are then ranked using an effective free energy function and top five structures are selected. The predicted structure and energetics of the complexes agree quite well with experiment when tested on a data set of 170 protein-ligand complexes with known structures and binding affinities. The AADS methodology is implemented on an 80 processor cluster and presented as a freely accessible, easy to use tool at http://www.scfbio-iitd.res.in/dock/ActiveSite_new.jsp .

Background: The accurate prediction of ligand binding residues from amino acid sequences is important for the automated functional annotation of novel proteins. In the previous two CASP experiments, the most successful methods in the function prediction category were those which used structural superpositions of 3D models and related templates with bound ligands in order to identify putative contacting residues. However, whilst most of this prediction process can be automated, visual inspection and manual adjustments of parameters, such as the distance thresholds used for each target, have often been required to prevent over prediction. Here we describe a novel method, FunFOLD, which uses an automatic approach for cluster identification and residue selection. The software provided can easily be integrated into existing fold recognition servers, requiring only a 3D model and list of templates as inputs. A simple web interface is also provided allowing access to non-expert users. The method has been benchmarked against the top servers and manual prediction groups tested at both CASP8 and CASP9. Results: The FunFOLD method shows a significant improvement over the best available servers and is shown to be competitive with the top manual prediction groups that were tested at CASP8. The FunFOLD method is also competitive with both the top server and manual methods tested at CASP9. When tested using common subsets of targets, the predictions from FunFOLD are shown to achieve significantly higher mean Matthews Correlation Coefficient (MCC) and Binding-site Distance Test (BDT) scores than all server methods that were tested at CASP8. 
Testing on the CASP9 set showed no statistically significant separation in performance between FunFOLD and the other top server groups tested. Conclusions: The FunFOLD software is freely available as both a standalone package and a prediction server, providing competitive ligand binding site residue predictions for expert and non-expert users alike. The software provides a new fully automated approach for structure based function prediction using 3D models of proteins.

Motivation: Protein-ligand binding sites are the active sites on the protein surface that perform protein functions. Thus, the identification of those binding sites is often the first step in studying protein functions and in structure-based drug design. Many computational algorithms and tools have been developed in recent decades, such as LIGSITE(cs/c), PASS, Q-SiteFinder, SURFNET, and so on. In our previous work, MetaPocket, we showed that it is possible to combine the results of many methods to improve the prediction result. Results: Here, we continue our previous work by adding four more methods, Fpocket, GHECOM, ConCavity and POCASA, to further improve the prediction success rate. The new method MetaPocket 2.0 and the individual approaches are all tested on two datasets of 48 unbound/bound and 210 bound structures as used before. The results show that the average success rate has been raised by 5% at the top 1 prediction compared with previous work. Moreover, we construct a non-redundant dataset of drug-target complexes with known structure from the DrugBank, DrugPort and PDB databases and apply MetaPocket 2.0 to this dataset to predict drug binding sites. As a result, >74% of drug binding sites on protein targets are correctly identified at the top 3 prediction, which is 12% better than the best individual approach.

Location of functional binding pockets of bioactive ligands on protein molecules is essential in structural genomics and drug design projects. If the experimental determination of ligand-protein complex structures is complicated, blind docking (BD) and pocket search (PS) calculations can help in the prediction of atomic resolution binding mode and the location of the pocket of a ligand on the entire protein surface. Whereas the number of successful predictions by these methods is increasing even for the complicated cases of exosites or allosteric binding sites, their reliability has not been fully established. For a critical assessment of reliability, we use a set of ligand-protein complexes, which were found to be problematic in previous studies. The robustness of BD and PS methods is addressed in terms of success of the selection of truly functional pockets from among the many putative ones identified on the surfaces of ligand-bound and ligand-free (holo and apo) protein forms. Issues related to BD such as effect of hydration, existence of multiple pockets, and competition of subsidiary ligands are considered. Practical cases of PS are discussed, categorized and strategies are recommended for handling the different situations. PS can be used in conjunction with BD, as we find that a consensus approach combining the techniques improves predictive power.

Knowledge of protein-ligand binding sites is very important for structure-based drug design. To obtain information on the binding site of a targeted protein with its ligand in a timely way, many scientists have resorted to computational methods. Although several methods have been released in the past few years, their accuracy needs to be improved. In this study, based on the combination of incremental convex hull, traditional geometric algorithm, and solvent accessible surface of proteins, we developed a novel approach for predicting the protein-ligand binding sites. Using the PDBbind database as a benchmark dataset and comparing the new approach with existing methods such as POCKET, Q-SiteFinder, MOE-SiteFinder, and PASS, we found that the new method has the highest accuracy for the Top 2 and Top 3 predictions. Furthermore, our approach can not only successfully predict the protein-ligand binding sites but also provide more detailed information for the interactions between proteins and ligands. It is anticipated that the new method may become a useful tool for drug development, or at least play a complementary role to the other existing methods in this area.

Protein similarity comparisons may be made on a local or global basis and may consider sequence information or differing levels of structural information. We present a local three-dimensional method that compares protein binding site surfaces in full atomic detail. The approach is based on the morphological similarity method which has been widely applied for global comparison of small molecules. We apply the method to all-by-all comparisons of two sets of human protein kinases, a very diverse set of ATP-bound proteins from multiple species, and three heterogeneous benchmark protein binding site data sets. Cases of disagreement between sequence-based similarity and binding site similarity yield informative examples. Where sequence similarity is very low, high pocket similarity can reliably identify important binding motifs. Where sequence similarity is very high, significant differences in pocket similarity are related to ligand binding specificity and similarity. Local protein binding pocket similarity provides qualitatively complementary information to other approaches, and it can yield quantitative information in support of functional annotation. Proteins 2011;

Motivation: Identification of ligand binding pockets on proteins is crucial for the characterization of protein functions. It provides valuable information for protein-ligand docking and rational engineering of small molecules that regulate protein functions. A large number of current prediction algorithms for ligand binding pockets are based on a cubic grid representation of proteins and, thus, the results are often protein orientation dependent. Results: We present the MSPocket program for detecting pockets on the solvent excluded surface of proteins. The core algorithm of the MSPocket approach does not use any cubic grid system to represent proteins and is therefore independent of protein orientations. We demonstrate that MSPocket is able to achieve an accuracy of 75% in predicting ligand binding pockets on a test dataset used for evaluating several existing methods. The accuracy is 92% if the top three predictions are considered. Comparison to one of the recently published best performing methods shows that MSPocket reaches similar performance with the additional feature of being protein orientation independent. Interestingly, some of the predictions are different, meaning that the two methods can be considered complementary and combined to achieve better prediction accuracy. MSPocket also provides a graphical user interface for interactive investigation of the predicted ligand binding pockets. In addition, we show that the overlap criterion is a better strategy for the evaluation of predicted ligand binding pockets than the single point distance criterion.

Evolutionary approach to predicting the binding site residues of a protein from its primary sequence
Tseng, Yan Yuan and Li, Wen-Hsiung
Proceedings of the National Academy of Sciences of the United States of America, 2011, 108(13), 5313-5318
PMID: 21402946
doi: 10.1073/pnas.1102210108

Protein binding site residues, especially catalytic residues, play a central role in protein function. Because more than 99% of the ~12 million protein sequences in the nonredundant protein database have no structural information, it is desirable to develop methods to predict the binding site residues of a protein from its primary sequence. This task is highly challenging, because the binding site residues constitute only a small portion of a protein. However, the binding site residues of a protein are clustered in its functional pocket(s), and their spatial patterns tend to be conserved in evolution. To take advantage of these evolutionary and structural principles, we constructed a database of ~50,000 templates (called the pocket-containing segment database), each of which includes not only a sequence segment that contains a functional pocket but also the structural attributes of the pocket. To use this database, we designed a template-matching technique, termed residue-matching profiling, and established a criterion for selecting templates for a query sequence. Finally, we developed a probabilistic model for assigning spatial scores to matched residues between the template and query sequence in local alignments using a set of selected scoring matrices and for computing the binding likelihood of each matched residue in the query sequence. From the likelihoods, one can predict the binding site residues in the query sequence. An automated computational pipeline was developed for our method. A performance evaluation shows that our method achieves a 70% precision in predicting binding site residues at 60% sensitivity.
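
The precision and sensitivity figures reported above follow the usual definitions, which can be sketched for a single protein as below. The residue numbers are invented for illustration and do not reproduce the reported 70% and 60% values; the function name is hypothetical.

```python
def precision_sensitivity(predicted, actual):
    """Precision and sensitivity (recall) of a binding residue prediction.

    predicted: residue numbers predicted to be binding.
    actual: residue numbers observed to be binding.
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    sensitivity = true_positives / len(actual) if actual else 0.0
    return precision, sensitivity

# Hypothetical example: five true binding residues, four predicted.
actual = {12, 15, 48, 51, 77}
predicted = {12, 15, 48, 90}
precision, sensitivity = precision_sensitivity(predicted, actual)
```

Because binding residues make up only a small portion of a protein, quoting precision at a stated sensitivity says more than overall accuracy, which would be dominated by the non-binding residues.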

A new binding site comparison algorithm using optimal superposition of the continuous pharmacophoric property distributions is reported. The method demonstrates high sensitivity in discovering both distantly homologous and convergent binding sites. Good quality of superposition is also observed on multiple examples. Using the new approach, a measure of site similarity is derived and applied to clustering of ligand binding pockets in the PDB.

MOTIVATION: A variety of pocket detection algorithms are now freely or commercially available to the scientific community for the analysis of static protein structures. However, since proteins are dynamic entities, enhancing the capabilities of these programs for the straightforward detection and characterization of cavities taking into account protein conformational ensembles should be valuable for capturing the plasticity of pockets, and therefore allow gaining insight into structure-function relationships.

The recognition of cryptic small-molecular binding sites in protein structures is important for understanding off-target side effects and for recognizing potential new indications for existing drugs. Current methods focus on the geometry and detailed chemical interactions within putative binding pockets, but may not recognize distant similarities where dynamics or modified interactions allow one ligand to bind apparently divergent binding pockets. In this paper, we introduce an algorithm that seeks similar microenvironments within two binding sites, and assesses overall binding site similarity by the presence of multiple shared microenvironments. The method has relatively weak geometric requirements (to allow for conformational change or dynamics in both the ligand and the pocket) and uses multiple biophysical and biochemical measures to characterize the microenvironments (to allow for diverse modes of ligand binding). We term the algorithm PocketFEATURE, since it focuses on pockets using the FEATURE system for characterizing microenvironments. We validate PocketFEATURE first by showing that it can better discriminate sites that bind similar ligands from those that do not, and by showing that we can recognize FAD-binding sites on a proteome scale with Area Under the Curve (AUC) of 92%. We then apply PocketFEATURE to evolutionarily distant kinases, for which the method recognizes several proven distant relationships, and predicts unexpected shared ligand binding. Using experimental data from ChEMBL and Ambit, we show that at high significance level, 40 kinase pairs are predicted to share ligands. Some of these pairs offer new opportunities for inhibiting two proteins in a single pathway.

Systematic investigation of a protein and its binding site characteristics is crucial for designing small molecules that modulate protein functions. However, fundamental uncertainties in binding site interactions and insufficient knowledge of the properties of even well-defined binding pockets can make it difficult to design optimal drugs. Herein, we report the development and implementation of a cavity detection algorithm built with HINT toolkit functions that we are naming Vectorial Identification of Cavity Extents (VICE). This very efficient algorithm is based on geometric criteria applied to simple integer grid maps. In testing, we carried out a systematic investigation on a very diverse data set of proteins and protein-protein/protein-polynucleotide complexes for locating and characterizing the indentations, cavities, pockets, grooves, channels, and surface regions. Additionally, we evaluated a curated data set of unbound proteins for which ligand-bound protein structures are also known; here the VICE algorithm located the actual ligand in the largest cavity in 83% of the cases and in one of the three largest in 90% of the cases. An interactive front-end provides a quick and simple procedure for locating, displaying and manipulating cavities in these structures. Information describing the cavity, including its volume and surface area metrics, and lists of atoms, residues, and/or chains lining the binding pocket, can be easily obtained and analyzed. For example, the relative cross-sectional surface area (to total surface area) of cavity openings in well-enclosed cavities is 0.06 +/- 0.04 and in surface clefts or crevices is 0.25 +/- 0.09. Proteins 2010. (c) 2009 Wiley-Liss, Inc.

The shape of the protein surface dictates what interactions are possible with other macromolecules, but defining discrete pockets or possible interaction sites remains difficult. First, there is the problem of defining the extent of the pocket. Second, one has to characterize the shape of each pocket. Third, one needs to make quantitative comparisons between pockets on different proteins. An elegant solution to these problems is to sort all surface and solvent points by travel depth and then collect a hierarchical tree of pockets. The connectivity of the tree is determined via the deepest saddle points between each pair of neighboring pockets. The resulting pocket surfaces tessellate the entire protein surface, producing a complete inventory of pockets. This method of identifying pockets also allows one to easily compute important shape metrics, including the problematic pocket volume, surface area, and mouth size. Pockets are also annotated with their lining residue lists and polarity and with other residue-based properties. Using this tree and the various shape metrics pockets can be merged, grouped, or filtered for further analysis. Since this method includes the entire surface, it guarantees that any pocket of interest will be found among the output pockets, unlike all previous methods of pocket identification. The resulting hierarchy of pockets is easy to visualize and aids users in higher level analysis. Comparison of pockets is done by using the shape metrics, avoiding the complex shape alignment problem. Example applications show that the method facilitates pocket comparison along mutational or time-dependent series. Pockets from families of proteins can be examined using multiple pocket tree alignments to see how ligand binding sites or how other pockets have changed with evolution. Our method is called CLIPPERS for complete liberal inventory of protein pockets elucidating and reporting on shape.

The proteome-wide characterization and analysis of protein ligand-binding sites and their interactions with ligands can provide pivotal information in understanding the structure, function and evolution of proteins and for designing safe and efficient therapeutics. The SMAP web service (SMAP-WS) meets this need through parallel computations designed for 3D ligand-binding site comparison and similarity searching on a structural proteome scale. SMAP-WS implements a shape descriptor (the Geometric Potential) that characterizes both local and global topological properties of the protein structure and which can be used to predict the likely ligand-binding pocket [Xie,L. and Bourne,P.E. (2007) A robust and efficient algorithm for the shape description of protein structures and its application in predicting ligand-binding sites. BMC bioinformatics, 8 (Suppl. 4.), S9.]. Subsequently a sequence order independent profile-profile alignment (SOIPPA) algorithm is used to detect and align similar pockets thereby finding protein functional and evolutionary relationships across fold space [Xie, L. and Bourne, P.E. (2008) Detecting evolutionary relationships across existing fold space, using sequence order-independent profile-profile alignments. Proc. Natl Acad. Sci. USA, 105, 5441-5446]. An extreme value distribution model estimates the statistical significance of the match [Xie, L., Xie, L. and Bourne, P.E. (2009) A unified statistical model to support local sequence order independent similarity searching for ligand-binding sites and its application to genome-based drug discovery. Bioinformatics, 25, i305-i312.]. These algorithms have been extensively benchmarked and shown to outperform most existing algorithms. Moreover, several predictions resulting from SMAP-WS have been validated experimentally. Thus far SMAP-WS has been applied to predict drug side effects, and to repurpose existing drugs for new indications. 
SMAP-WS provides both a user-friendly web interface and a programming API for scientists to address a wide range of compute-intensive questions in biology and drug discovery. SMAP-WS is available from the URL http://smap.nbcr.net.

BACKGROUND:With the classical, active-site oriented drug-development approach reaching its limits, protein ligand-binding sites in general and allosteric sites in particular are increasingly attracting the interest of medicinal chemists in the search for new types of targets and strategies for drug development. Given that allostery represents one of the most common and powerful means to regulate protein function, the traditional drug discovery approach of targeting active sites can be extended by targeting allosteric or regulatory protein pockets that may allow the discovery of not only novel drug-like inhibitors, but activators as well. The wealth of available protein structural data can be exploited to further increase our understanding of allosterism, which in turn may have therapeutic applications. A first step in this direction is to identify and characterize putative effector sites that may be present in already available structural data.

3DLigandSite is a web server for the prediction of ligand-binding sites. It is based upon successful manual methods used in the eighth round of the Critical Assessment of techniques for protein Structure Prediction (CASP8). 3DLigandSite utilizes protein-structure prediction to provide structural models for proteins that have not been solved. Ligands bound to structures similar to the query are superimposed onto the model and used to predict the binding site. In benchmarking against the CASP8 targets 3DLigandSite obtains a Matthews correlation coefficient (MCC) of 0.64, and coverage and accuracy of 71 and 60%, respectively, results similar to our manual performance in CASP8. In further benchmarking using a large set of protein structures, 3DLigandSite obtains an MCC of 0.68. The web server enables users to submit either a query sequence or structure. Predictions are visually displayed via an interactive Jmol applet. 3DLigandSite is available for use at http://www.sbg.bio.ic.ac.uk/3dligandsite.
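The Matthews correlation coefficient quoted above can be computed from per-residue confusion counts. The following is a minimal sketch of the standard formula, not code from 3DLigandSite; the function name `matthews_cc` and the convention of returning 0.0 for a zero denominator are assumptions made for illustration.

```python
import math

def matthews_cc(tp, tn, fp, fn):
    """Standard MCC from confusion counts (illustrative helper, not a 3DLigandSite API).

    tp/fn: binding residues predicted as binding / missed.
    tn/fp: non-binding residues predicted as non-binding / wrongly predicted as binding.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: MCC is undefined here, so report 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

MCC ranges from -1 (every residue misclassified) through 0 (no better than chance) to 1 (perfect prediction), which makes it more informative than plain accuracy when binding residues are a small minority of the protein.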

MOTIVATION:Exploitation of locally similar 3D patterns of physicochemical properties on the surface of a protein for detection of binding sites that may lack sequence and global structural conservation.

In this paper, we describe a Monte Carlo method for determining the volume of a molecule. A molecule is considered to consist of hard, overlapping spheres. The surface of the molecule is defined by rolling a probe sphere over the surface of the spheres. To determine the volume of the molecule, random points are placed in a three-dimensional box, which encloses the whole molecule. The volume of the molecule in relation to the volume of the box is estimated by calculating the ratio of the random points placed inside the molecule and the total number of random points that were placed. For computational efficiency, we use a grid-cell based neighbor list to determine whether a random point is placed inside the molecule or not. This method in combination with a graph-theoretical algorithm is used to detect internal cavities and surface clefts of molecules. Since cavities and clefts are potential water binding sites, we place water molecules in the cavities. The potential water positions can be used in molecular dynamics calculations as well as in other molecular calculations. We apply this method to several proteins and demonstrate the usefulness of the program. The described methods are all implemented in the program McVol, which is available free of charge from our website at http://www.bisb.uni-bayreuth.de/software.html.
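The core sampling idea described above can be sketched as follows. This is a simplified illustration, not the McVol implementation: it treats the molecule as a plain union of hard spheres, omits the rolling-probe surface definition, and replaces the grid-cell neighbor list with a brute-force test over all spheres. The function name `monte_carlo_volume` and its parameters are hypothetical.

```python
import math
import random

def monte_carlo_volume(spheres, n_points=100_000, seed=0):
    """Estimate the volume of a union of spheres by random sampling.

    spheres: list of ((x, y, z), radius) tuples (hypothetical input format).
    """
    rng = random.Random(seed)
    # Axis-aligned box that encloses every sphere.
    lo = [min(c[i] - r for c, r in spheres) for i in range(3)]
    hi = [max(c[i] + r for c, r in spheres) for i in range(3)]
    box_volume = math.prod(h - l for l, h in zip(lo, hi))
    inside = 0
    for _ in range(n_points):
        point = [rng.uniform(l, h) for l, h in zip(lo, hi)]
        # Brute-force inside test; a neighbor list would restrict this to nearby spheres.
        if any(math.dist(point, c) <= r for c, r in spheres):
            inside += 1
    # Fraction of points inside the molecule, scaled by the box volume.
    return box_volume * inside / n_points

# Single unit sphere: the estimate should approach 4/3 * pi.
unit_sphere_volume = monte_carlo_volume([((0.0, 0.0, 0.0), 1.0)])
```

The statistical error of such an estimate shrinks roughly with the inverse square root of the number of points, so the brute-force inside test quickly becomes the bottleneck for real proteins, which is where a neighbor list pays off.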

Detection of pockets on protein surfaces is an important step toward finding the binding sites of small molecules. In a previous study, we defined a pocket as a space into which a small spherical probe can enter, but a large probe cannot. The radius of the large probes corresponds to the shallowness of pockets. We showed that each type of binding molecule has a characteristic shallowness distribution. In this study, we introduced fundamental changes to our previous algorithm by using a 3D grid representation of proteins and probes, and the theory of mathematical morphology. We invented an efficient algorithm for calculating deep and shallow pockets (multiscale pockets) simultaneously, using several different sizes of spherical probes (multiscale probes). We implemented our algorithm as a new program, ghecom (grid-based HECOMi finder). The statistics of calculated pockets for the structural dataset showed that our program performed better at detecting binding pockets than four other popular pocket-finding programs proposed previously. ghecom also calculates the shallowness of binding ligands, R(inaccess) (minimum radius of inaccessible spherical probes), which can be obtained from the multiscale molecular volume. We showed that each part of the binding molecule had a bias toward a specific range of shallowness. These findings will be useful for predicting the types of molecules that will be most likely to bind putative binding pockets, as well as the configurations of binding molecules. The program ghecom is available through the Web server (http://biunit.naist.jp/ghecom).
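The two-probe pocket definition above can be illustrated on a small grid. The sketch below is a toy 2D analogue under stated assumptions, not the ghecom algorithm: probes are discrete disks, the region outside the grid is treated as open solvent, and the names `disk`, `probe_accessible` and `pocket_cells` are hypothetical. A cell counts as pocket if some placement of the small probe covers it without touching the protein, while no such placement of the large probe does.

```python
from itertools import product

def disk(radius):
    """Grid offsets covered by a probe of the given radius (in cell units)."""
    r = int(radius)
    return [(dx, dy) for dx, dy in product(range(-r, r + 1), repeat=2)
            if dx * dx + dy * dy <= radius * radius]

def probe_accessible(protein, size, radius):
    """Cells of a size x size grid swept by a probe that never overlaps the protein."""
    offsets = disk(radius)
    r = int(radius)
    covered = set()
    # Probe centers may lie just outside the grid, which is treated as open solvent.
    for cx, cy in product(range(-r, size + r), repeat=2):
        cells = [(cx + dx, cy + dy) for dx, dy in offsets]
        if not any(cell in protein for cell in cells):
            covered.update(c for c in cells if 0 <= c[0] < size and 0 <= c[1] < size)
    return covered

def pocket_cells(protein, size, small_radius, large_radius):
    """Cells reachable by the small probe but not by the large probe."""
    return (probe_accessible(protein, size, small_radius)
            - probe_accessible(protein, size, large_radius))

# Toy protein: a solid block with a one-cell-wide notch opening toward the top.
notch = {(4, 4), (4, 5), (4, 6)}
toy_protein = {(x, y) for x in range(1, 8) for y in range(2, 7)} - notch
toy_pocket = pocket_cells(toy_protein, 9, small_radius=0, large_radius=1)
```

In this toy case the large probe can dip into the mouth of the notch at (4, 6) but not into its two deeper cells, so only those two cells are reported as pocket. Repeating the calculation with several large-probe radii would give a multiscale picture in the spirit of the shallowness measure described above.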

Here, we describe a family of methods based on residue-residue connectivity for characterizing binding sites and apply variants of the method to various types of protein-ligand complexes including proteases, allosteric-binding sites, correctly and incorrectly docked poses, and inhibitors of protein-protein interactions. Residues within ligand-binding sites have about 25% more contact neighbors than surface residues in general; high-connectivity residues are found in contact with the ligand in 84% of all complexes studied. In addition, a k-means algorithm was developed that may be useful for identifying potential binding sites with no obvious geometric or connectivity features. The analysis was primarily carried out on 61 protein-ligand structures from the MEROPS protease database, 250 protein-ligand structures from the PDBSelect (25%), and 30 protein-protein complexes. Analysis of four proteases with crystal structures for multiple bound ligands has shown that residues with high connectivity tend to have less variable side-chain conformation. The relevance to drug design is discussed in terms of identifying allosteric-binding sites, distinguishing between alternative docked poses and designing protein interface inhibitors. Taken together, these data indicate that residue-residue connectivity is highly relevant to medicinal chemistry.
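A residue's connectivity can be approximated by counting how many other residues lie within a contact distance of it. The sketch below is only an illustration of that idea: the abstract does not state the contact definition, so the single representative point per residue, the 8.0 cutoff and the name `contact_counts` are all assumptions.

```python
import math

def contact_counts(coords, cutoff=8.0):
    """Count contact neighbors per residue.

    coords: dict mapping a residue label to one representative (x, y, z) point
            (hypothetical input format; the cutoff value is an assumed placeholder).
    """
    counts = {name: 0 for name in coords}
    names = list(coords)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if math.dist(coords[a], coords[b]) <= cutoff:
                # Contacts are symmetric, so both residues gain a neighbor.
                counts[a] += 1
                counts[b] += 1
    return counts

# Four residues on a line; only adjacent ones at spacing 5 are in contact.
toy_counts = contact_counts({"A": (0, 0, 0), "B": (5, 0, 0),
                             "C": (10, 0, 0), "D": (30, 0, 0)})
```

Ranking residues by such counts gives a simple connectivity score that could then be compared between binding-site and general surface residues.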

Because of the increasing number of structures of unknown function accumulated by ongoing structural genomics projects, there is an urgent need for computational methods for characterizing protein tertiary structures. As functions of many of these proteins are not easily predicted by conventional sequence database searches, a legitimate strategy is to utilize structure information in function characterization. Of particular interest is prediction of ligand binding to a protein, as ligand molecule recognition is a major part of molecular function of proteins. Predicting whether a ligand molecule binds a protein is a complex problem due to the physical nature of protein-ligand interactions and the flexibility of both binding sites and ligand molecules. However, geometric and physicochemical complementarity is observed between the ligand and its binding site in many cases. Therefore, ligand molecules which bind to a local surface site in a protein can be predicted by finding similar local pockets of known binding ligands in the structure database. Here, we present two representations of ligand binding pockets and utilize them for ligand binding prediction by pocket shape comparison. These representations are based on mapping of surface properties of binding pockets, which are compactly described either by the two-dimensional pseudo-Zernike moments or the three-dimensional Zernike descriptors. These compact representations allow a fast real-time pocket searching against a database. Thorough benchmark studies employing two different datasets show that our representations are competitive with the other existing methods. Limitations and potentials of the shape-based methods as well as possible improvements are discussed.

The complex interactions between proteins and small organic molecules (ligands) are intensively studied because they play key roles in biological processes and drug activities. Here, we present a novel approach to characterize and map the ligand-binding cavities of proteins without direct geometric comparison of structures, based on Principal Component Analysis of cavity properties (related mainly to size, polarity, and charge). This approach can provide valuable information on the similarities and dissimilarities of binding cavities due to mutations, between-species differences and flexibility upon ligand binding. The presented results show that information on ligand-binding cavity variations can complement information on protein similarity obtained from sequence comparisons. The predictive aspect of the method is exemplified by successful predictions of serine proteases that were not included in the model construction. The presented strategy to compare ligand-binding cavities of related and unrelated proteins has many potential applications within protein and medicinal chemistry, for example in the characterization and mapping of "orphan structures", selection of protein structures for docking studies in structure-based design, and identification of proteins for selectivity screens in drug design programs.

Patterns of receptor-ligand interaction can be conserved in functionally equivalent proteins even in the absence of sequence homology. Therefore, structural comparison of ligand-binding pockets and their pharmacophoric features allows for the characterization of so-called "orphan" proteins with known three-dimensional structure but unknown function, and for the prediction of ligand promiscuity of binding pockets. We present an algorithm for rapid pocket comparison (PoLiMorph), in which protein pockets are represented by self-organizing graphs that fill the volume of the cavity. Vertices in these three-dimensional frameworks contain information about the local ligand-receptor interaction potential coded by fuzzy property labels. For framework matching, we developed a fast heuristic based on the maximum dispersion problem, as an alternative to techniques utilizing clique detection or geometric hashing algorithms. A sophisticated scoring function was applied that incorporates knowledge about property distributions and ligand-receptor interaction patterns. In an all-against-all virtual screening experiment with 207 pocket frameworks extracted from a subset of PDBbind, PoLiMorph correctly assigned 81% of 69 distinct structural classes and demonstrated sustained ability to group pockets accommodating the same ligand chemotype. We determined a score threshold that indicates "true" pocket similarity with high reliability, which not only supports structure-based drug design but also allows for sequence-independent studies of the proteome.

MOTIVATION:Prediction of ligand binding sites of proteins is significant as it can provide insight into biological functions and reaction mechanisms of proteins. It is also a prerequisite for protein-ligand docking and an important step in structure-based drug design.

BACKGROUND:The study of protein-small molecule interactions is vital for understanding protein function and for practical applications in drug discovery. To benefit from the rapidly increasing structural data, it is essential to improve the tools that enable large scale binding site prediction with greater emphasis on their biological validity.

Druggability predictions are important to avoid intractable targets and to focus drug discovery efforts on sites offering better prospects. However, few druggability prediction tools have been released and none has been extensively tested. Here, a set of druggable and nondruggable cavities has been compiled in a collaborative platform ( http://fpocket.sourceforge.net/dcd ) that can be used, contributed, and curated by the community. Druggable binding sites are often oversimplified as closed, hydrophobic cavities, but data set analysis reveals that polar groups in druggable binding sites have properties that enable them to play a decisive role in ligand recognition. Finally, the data set has been used in conjunction with the open source fpocket suite to train and validate a logistic model. State of the art performance was achieved for predicting druggability on known binding sites and on virtual screening experiments where druggable pockets are retrieved from a pool of decoys. The algorithm is free, extremely fast, and can effectively be used to automatically sieve through massive collections of structures ( http://fpocket.sourceforge.net ).

MOTIVATION:The identification of putative ligand-binding sites on proteins is important for the prediction of protein function. Knowledge-based approaches using structure databases have become interesting, because of the recent increase in structural information. Approaches using binding motif information are particularly effective. However, they can only be applied to well-known ligands that frequently appear in the structure databases.

A key challenge of the post-genomic era is the identification of the function(s) of all the molecules in a given organism. Here, we review the status of sequence and structure-based approaches to protein function inference and ligand screening that can provide functional insights for a significant fraction of the approximately 50% of ORFs of unassigned function in an average proteome. We then describe FINDSITE, a recently developed algorithm for ligand binding site prediction, ligand screening and molecular function prediction, which is based on binding site conservation across evolutionarily distant proteins identified by threading. Importantly, FINDSITE gives comparable results when high-resolution experimental structures as well as predicted protein models are used.

BACKGROUND:The rate of protein structures being deposited in the Protein Data Bank surpasses the capacity to experimentally characterise them and therefore computational methods to analyse these structures have become increasingly important. Identifying the region of the protein most likely to be involved in function is useful in order to gain information about its potential role. There are many available approaches to predict functional sites, but many are not made available via a publicly-accessible application.

Identifying a protein's functional sites is an important step towards characterizing its molecular function. Numerous structure- and sequence-based methods have been developed for this problem. Here we introduce ConCavity, a small molecule binding site prediction algorithm that integrates evolutionary sequence conservation estimates with structure-based methods for identifying protein surface cavities. In large-scale testing on a diverse set of single- and multi-chain protein structures, we show that ConCavity substantially outperforms existing methods for identifying both 3D ligand binding pockets and individual ligand binding residues. As part of our testing, we perform one of the first direct comparisons of conservation-based and structure-based methods. We find that the two approaches provide largely complementary information, which can be combined to improve upon either approach alone. We also demonstrate that ConCavity has state-of-the-art performance in predicting catalytic sites and drug binding pockets. Overall, the algorithms and analysis presented here significantly improve our ability to identify ligand binding sites and further advance our understanding of the relationship between evolutionary sequence conservation and structural and functional attributes of proteins. Data, source code, and prediction visualizations are available on the ConCavity web site (http://compbio.cs.princeton.edu/concavity/).

Identification and characterization of binding sites is key in the process of structure-based drug design. In some cases there may not be any information about the binding site for a target of interest. In other cases, a putative binding site has been identified by computational or experimental means, but the druggability of the target is not known. Even when a site for a given target is known, it may be desirable to find additional sites whose targeting could produce a desired biological response. A new program, called SiteMap, is presented for identifying and analyzing binding sites and for predicting target druggability. In a large-scale validation, SiteMap correctly identifies the known binding site as the top-ranked site in 86% of the cases, with best results (>98%) coming for sites that bind ligands with subnanomolar affinity. In addition, a modified version of the score employed for binding-site identification allows SiteMap to accurately classify the druggability of proteins as measured by their ability to bind passively absorbed small molecules tightly. In characterizing binding sites, SiteMap provides quantitative and graphical information that can help guide efforts to critically assess virtual hits in a lead-discovery application or to modify ligand structure to enhance potency or improve physical properties in a lead-optimization context.

BACKGROUND:Virtual screening methods are becoming well established as effective approaches to identify hits, candidates and leads for drug discovery research. Among those, structure based virtual screening (SBVS) approaches aim at docking collections of small compounds in the target structure to identify potent compounds. For SBVS, the identification of candidate pockets in protein structures is a key feature, and recent years have seen increasing interest in developing methods for pocket and cavity detection on protein surfaces.

Many important protein-protein interactions are mediated by the binding of a short peptide stretch in one protein to a large globular segment in another. Recent efforts have provided hundreds of examples of new peptides binding to proteins for which a three-dimensional structure is available (either known experimentally or readily modeled) but where no structure of the protein-peptide complex is known. To address this gap, we present an approach that can accurately predict peptide binding sites on protein surfaces. For peptides known to bind a particular protein, the method predicts binding sites with great accuracy, and the specificity of the approach means that it can also be used to predict whether or not a putative or predicted peptide partner will bind. We used known protein-peptide complexes to derive preferences, in the form of spatial position specific scoring matrices, which describe the binding-site environment in globular proteins for each type of amino acid in bound peptides. We then scan the surface of a putative binding protein for sites for each of the amino acids present in a peptide partner and search for combinations of high-scoring amino acid sites that satisfy constraints deduced from the peptide sequence. The method performed well in a benchmark and largely agreed with experimental data mapping binding sites for several recently discovered interactions mediated by peptides, including RG-rich proteins with SMN domains, Epstein-Barr virus LMP1 with TRADD domains, DBC1 with Sir2, and the Ago hook with Argonaute PIWI domain. The method, and associated statistics, is an excellent tool for predicting and studying binding sites for newly discovered peptides mediating critical events in biology.

Identification of potential ligand-binding pockets is an initial step in receptor-based drug design. While many geometric or energy-based binding-site prediction methods characterize the size and shape of protein cavities, few of them offer an estimate of the pocket's ability to bind small drug-like molecules. Here, we present a shape-based technique to examine binding-site druggability from the crystal structure of a given protein target. The method includes the PocketPicker algorithm to determine putative binding-site volumes for ligand-interaction. Pocket shape descriptors were calculated for both known ligand binding sites and empty pockets and were then subjected to self-organizing map clustering. Descriptors were calculated for structures derived from a database of representative drug-protein complexes with experimentally determined binding affinities to characterize the "druggable pocketome". The new method provides a means for selecting drug targets and potential ligand-binding pockets based on structural considerations and addresses orphan binding sites.

The identification of ligand-binding sites is often the starting point for protein function annotation and structure-based drug design. Many computational methods for the prediction of ligand-binding sites have been developed in recent decades. Here we present a consensus method, metaPocket, in which the predicted sites from four methods (LIGSITE(cs), PASS, Q-SiteFinder, and SURFNET) are combined to improve the prediction success rate. All these methods are evaluated on two datasets of 48 unbound/bound structures and 210 bound structures. The comparison results show that metaPocket improves the success rate from ~70 to 75% at the top 1 prediction. MetaPocket is available at http://metapocket.eml.org.

SplitPocket (http://pocket.uchicago.edu/) is a web server to identify functional surfaces of proteins from structure coordinates. Using the Alpha Shape Theory, we previously developed an analytical approach to identify protein functional surfaces by the geometric concept of a split pocket, which is a pocket split by a binding ligand. Our geometric approach extracts site-specific spatial information from coordinates of structures. To reduce the search space, probe radii are designed according to the physicochemical textures of molecules. The method uses the weighted Delaunay triangulation and the discrete flow algorithm to obtain geometric measurements and spatial patterns for each predicted pocket. It can also measure the hydrophobicity on a surface patch. Furthermore, we quantify the evolutionary conservation of surface patches by an index derived from the entropy scores in HSSP (homology-derived secondary structure of proteins). We have used the method to examine approximately 1.16 million potential pockets and identified the split pockets in >26,000 structures in the Protein Data Bank. This integrated web server of functional surfaces provides a source of spatial patterns to serve as templates for predicting the functional surfaces of unbound structures involved in binding activities. These spatial patterns should also be useful for protein functional inference, structural evolution and drug design.

SiteHound uses Molecular Interaction Fields (MIFs) produced by EasyMIFs to identify protein structure regions that show a high propensity for interaction with ligands. The type of binding site identified depends on the probe atom used in the MIF calculation. The input to EasyMIFs is a PDB file of a protein structure; the output MIF serves as input to SiteHound, which in turn produces a list of putative binding sites. Extensive testing of SiteHound for the detection of binding sites for drug-like molecules and phosphorylated ligands has been carried out. AVAILABILITY: EasyMIFs and SiteHound executables for Linux, Mac OS X, and MS Windows operating systems are freely available for download from http://sitehound.sanchezlab.org/download.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

MOTIVATION:The binding sites of proteins generally contain smaller regions that provide major contributions to the binding free energy and hence are the prime targets in drug design. Screening libraries of fragment-sized compounds by NMR or X-ray crystallography demonstrates that such 'hot spot' regions bind a large variety of small organic molecules, and that a relatively high 'hit rate' is predictive of target sites that are likely to bind drug-like ligands with high affinity. Our goal is to determine the 'hot spots' computationally rather than experimentally.

MOTIVATION: The ability to predict binding profiles for an arbitrary protein can significantly improve the areas of drug discovery, lead optimization and protein function prediction. At present, there are no successful algorithms capable of predicting binding profiles for novel proteins. Existing methods typically rely on manually curated templates or entire active site comparison. Consequently, they perform best when analyzing proteins sharing significant structural similarity with known proteins (i.e. proteins resulting from divergent evolution). These methods fall short when used to characterize the binding profile of a novel active site or one for which a template is not available. In contrast to previous approaches, our method characterizes the binding preferences of sub-cavities within the active site by exploiting a large set of known protein-ligand complexes. The uniqueness of our approach lies not only in the consideration of sub-cavities, but also in the more complete structural representation of these sub-cavities, their parametrization and the method by which they are compared. By only requiring local structural similarity, we are able to leverage previously unused structural information and perform binding inference for proteins that do not share significant structural similarity with known systems. RESULTS: Our algorithm demonstrates the ability to accurately cluster similar sub-cavities and to predict binding patterns across a diverse set of protein-ligand complexes. When applied to two high-profile drug targets, our algorithm successfully generates a binding profile that is consistent with known inhibitors. The results suggest that our algorithm should be useful in structure-based drug discovery and lead optimization.

Detection of ligand-binding sites in protein structures is a crucial task in structural bioinformatics, and has applications in important areas like drug discovery. Given the knowledge of the site in a particular protein structure that binds to a specific ligand, we can search for similar sites in the other protein structures that the same ligand is likely to bind. In this paper, we propose a new method named "BSAlign" (Binding Site Aligner) for rapid detection of potential binding site(s) in the target protein(s) that is/are similar to the query protein's ligand-binding site. We represent both the binding site and the protein structure as graphs, and employ a subgraph isomorphism algorithm to detect the similarities of the binding sites in a very time-efficient manner. Preliminary experimental results show that the proposed BSAlign binding site detection method is about 14 times faster than a well-known method called SiteEngine, while offering the same level of accuracy. Both BSAlign and SiteEngine achieve 60% search accuracy in finding adenine-binding sites from a data set of 126 proteins. The proposed method can be a useful contribution towards speed-critical applications such as drug discovery in which a large number of proteins need to be processed. The program is available for download at: http://www1.i2r.a-star.edu.sg/~azeyar/BSAlign/.

Characterization of local geometry of protein surfaces with the visibility criterion.
Li, Bin and Turuvekere, Srinivasan and Agrawal, Manish and La, David and Ramani, Karthik and Kihara, Daisuke
Proteins, 2008, 71(2), 670-683
PMID: 17975834
doi: 10.1002/prot.21732

Experimentally determined protein tertiary structures are rapidly accumulating in a database, partly due to the structural genomics projects. Included are proteins of unknown function, whose function has not been investigated by experiments and could not be predicted by conventional sequence-based search. Those uncharacterized protein structures highlight the urgent need for computational methods for annotating proteins from tertiary structures, which include function annotation methods through characterizing protein local surfaces. Toward structure-based protein annotation, we have developed the VisGrid algorithm, which uses the visibility criterion to characterize local geometric features of protein surfaces. Unlike existing methods, which are concerned only with identifying pockets that could be potential ligand-binding sites in proteins, VisGrid also aims to identify large protrusions, hollows, and flat regions, which can characterize geometric features of a protein structure. The visibility used in VisGrid is defined as the fraction of visible directions from a target position on a protein surface. A pocket or a hollow is recognized as a cluster of positions with a small visibility. A large protrusion in a protein structure is recognized as a pocket in the negative image of the structure. VisGrid correctly identified 95.0% of ligand-binding sites as one of the three largest pockets in 5616 benchmark proteins. To examine how natural flexibility of proteins affects pocket identification, VisGrid was tested on structures distorted by molecular dynamics simulation. Sensitivity decreased approximately 20% for structures with a root mean square deviation of 2.0 Å to the original crystal structure, but specificity was not much affected. Because of its intuitiveness and simplicity, the visibility criterion will lay the foundation for characterization and function annotation of local shape of proteins.
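The visibility criterion described above can be sketched on a voxel grid. This is a toy illustration, not the VisGrid code: the 26 lattice directions, the `max_steps` ray length and the function name `visibility` are assumptions chosen for brevity. A direction counts as visible if a ray cast from the target voxel meets no occupied (protein) voxel within `max_steps` steps.

```python
from itertools import product

# The 26 lattice directions around a voxel: an assumed discretisation
# of "directions", chosen only to keep the sketch short.
DIRECTIONS = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]

def visibility(occupied, position, max_steps=10):
    """Fraction of directions with no occupied voxel within max_steps.

    occupied: set of (x, y, z) integer voxels filled by protein atoms.
    position: (x, y, z) voxel at which visibility is evaluated.
    """
    visible = 0
    for dx, dy, dz in DIRECTIONS:
        x, y, z = position
        blocked = False
        for _ in range(max_steps):
            x, y, z = x + dx, y + dy, z + dz
            if (x, y, z) in occupied:
                blocked = True
                break
        if not blocked:
            visible += 1
    return visible / len(DIRECTIONS)
```

In this toy setting, a voxel resting on a flat slab sees every direction that does not point into the slab, while a voxel at the bottom of a cup-shaped depression sees only the direction pointing straight out of it, which is the low-visibility signature the abstract associates with pockets.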

Proteins consist of atoms. Given a protein, the automatic recognition of depressed regions, called pockets, on the surface of proteins is important for protein-ligand docking and facilitates fast development of new drugs. Recently, computational approaches have emerged for recognizing pockets from the geometrical point of view. Presented in this paper is a geometric method for the pocket recognition which is based on the Voronoi diagram for atoms. Given a Voronoi diagram, the proposed algorithm transforms the atomic structure to meshes which contain the information of the proximity among atoms, and then recognizes depressions on the surface of a protein using the meshes.

Predicting functional sites in proteins is important in structural biology for understanding the function and also for structure-based drug design. Here we report a new binding site prediction method PocketDepth, which is geometry based and uses a depth based clustering. Depth is an important parameter considered during protein structure visualisation and analysis but has been used more often intuitively than systematically. Our current implementation of depth reflects how central a given subspace is to a putative pocket. We have tested the algorithm against PDBbind, a large curated set of 1091 proteins. A prediction was considered a true-positive if the predicted pocket had at least 10% overlap with the actual ligand. Two different parameter sets, 'deeper' and 'surface' were used, for wider coverage of different types of binding sites in proteins. With deeper parameters, true-positives were observed for 841 proteins, resulting in a prediction accuracy of 77%, for any ranked prediction. Of these, 55.2% were first ranked predictions, whereas 91.2% and 97.4% were covered in the first 5 and 10 ranks, respectively. With the 'surface' parameters, a prediction rate of 95.8% was observed, albeit with much poorer ranks. The deeper set identified pocket boundaries more precisely and yielded better ranks, while the latter missed fewer predictions and hence had better coverage. The two parameter sets were therefore algorithmically combined, resulting in prediction accuracies of 96.5% for any ranked prediction. About 41.8% of these were in the first rank, 82% and 94% were in top 5 and 10 ranks, respectively. The algorithm is available at http://proline.physics.iisc.ernet.in/pocketdepth.

New Method for the Assessment of All Drug-Like Pockets Across a Structural Genome
Nicola, George and Smith, Colin A and Abagyan, Ruben
2008, 15(3), 231-240
doi: 10.1089/cmb.2007.0178

With the increasing wealth of structural information available for human pathogens, it is now becoming possible to leverage that information to aid in rational selection of targets for inhibitor discovery. We present a methodology for assessing the druggability of all small-molecule binding pockets in a pathogen. Our approach incorporates accurate pocket identification, sequence conservation with a similar organism, sequence conservation with the host, and structure resolution. This novel method is applied to 21 structures from the malarial parasite Plasmodium falciparum. Based on our survey of the structural genome, we selected enoyl-acyl carrier protein reductase (ENR) as a promising candidate for virtual screening based inhibitor discovery.

The identification of ligand binding sites on a protein is an essential step in the selection of inhibitors of protein-ligand or protein-protein interactions via virtual database screening. To facilitate binding site identification, a novel descriptor, the binding response, is proposed in the present paper to quantitatively evaluate putative binding sites on the basis of their response to a test set of probe compounds. The binding response is determined on the basis of contributions from both the ligand-protein interaction energy and the geometry of binding poses for a database of test ligands. A favorable binding response is obtained for binding sites with favorable ligand binding energies and with ligand geometries within the putative site for the majority of compounds in the test set. The utility of this descriptor is illustrated by applying it to a number of known protein-ligand complexes, showing the approach to identify the experimental binding sites as the highest scoring site in 26 out of 29 cases; in the remaining three cases, it was among the top three scoring sites. This method is combined with sphere-based site identification and clustering methods to yield an automated approach for the identification of binding sites on proteins suitable for database screen or de novo drug design.

We present a method for detecting and comparing cavities on protein surfaces that is useful for protein binding site recognition. The method is based on a representation of the protein structures by a collection of spin-images and their associated spin-image profiles. Results of the cavity detection procedure are presented for a large set of non-redundant proteins and compared with SURFNET-ConSurf. Our comparison method is used to find a surface region in one cavity of a protein that is geometrically similar to a surface region in the cavity of another protein. Such a finding would be an indication that the two regions likely bind to the same ligand. Our overall approach for cavity detection and comparison is benchmarked on several pairs of known complexes, obtaining a good coverage of the atoms of the binding sites.

BACKGROUND:Identification and evaluation of surface binding-pockets and occluded cavities are initial steps in protein structure-based drug design. Characterizing the active site's shape as well as the distribution of surrounding residues plays an important role for a variety of applications such as automated ligand docking or in situ modeling. Comparing the shape similarity of binding site geometries of related proteins provides further insights into the mechanisms of ligand binding.

Structure-based drug design seeks to exploit the structure of protein-ligand or protein-protein binding sites, but the site is not always known at the outset. Even when the site is known, the researcher may wish to identify alternative prospective binding sites that may result in different biological effects or new classes of compounds. It is also vital in lead optimization to clearly understand the degree to which known binders or docking hits satisfy or violate complementarity to the receptor. SiteMap is a new technique for identifying potential binding sites and for predicting their druggability in lead-discovery applications and for characterizing binding sites and critically assessing prospective ligands in lead-optimization applications. In large-scale validation tests, SiteMap correctly identifies the known binding site in > 96% of the cases, with best results (> 98%) coming for sites that bind ligands tightly. It also accurately distinguishes between sites that bind ligands and sites that don't. In binding-site analysis, SiteMap provides a wealth of quantitative and graphical information that can help guide efforts to modify ligand structure to enhance potency or improve physical properties. These attributes allow SiteMap to nicely complement techniques such as docking and computational lead optimization in structure-based drug design.

One of the simplest ways to predict ligand binding sites is to identify pocket-shaped regions on the protein surface. Many programs have already been proposed to identify these pocket regions. Examination of their algorithms revealed that a pocket intrinsically has two arbitrary properties, "size" and "depth". We proposed a new definition for pockets using two explicit adjustable parameters that correspond to these two arbitrary properties. A pocket region is defined as a space into which a small probe can enter, but a large probe cannot. The radii of small and large probe spheres are the two parameters that correspond to the "size" and "depth" of the pockets, respectively. These values can be adjusted for individual putative ligand molecules. To determine the optimal value of the large probe sphere radius, we generated pockets for thousands of protein structures in the database, using several sizes of large probe spheres, and examined the correspondence of these pockets with known binding site positions. A new measure of shallowness, a minimum inaccessible radius, R(inaccess), indicated that binding sites of coenzymes are very deep, while those for adenine/guanine mononucleotide have only medium shallowness and those for short peptides and oligosaccharides are shallow. The optimal radius of large probe spheres was 3-4 Å for the coenzymes, 4 Å for adenine/guanine mononucleotides, and 5 Å or more for peptides/oligosaccharides. Comparison of our program with two other popular pocket-finding programs showed that our program had a higher performance in detecting binding pockets, although it required more computational time.
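The two-probe definition above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' program: atoms are hard spheres, large-probe centres are sampled on a coarse integer grid, and the function names, radii and grid bounds are hypothetical. A point counts as a pocket point if a small probe centred there is clash-free, yet no clash-free large probe sphere covers it.

```python
import math
from itertools import product

def clash_free(center, atoms, probe_radius):
    """True if a probe sphere at center overlaps no atom.

    atoms: list of ((x, y, z), radius).
    """
    return all(math.dist(center, xyz) >= r + probe_radius for xyz, r in atoms)

def large_probe_centers(atoms, r_large, lo, hi, step):
    """Grid positions (assumed sampling) where the large probe fits."""
    axis = range(lo, hi + 1, step)
    return [c for c in product(axis, repeat=3) if clash_free(c, atoms, r_large)]

def is_pocket_point(p, atoms, centers, r_small, r_large):
    """Small probe can sit at p, but no allowed large probe sphere reaches p."""
    if not clash_free(p, atoms, r_small):
        return False
    return not any(math.dist(p, c) <= r_large for c in centers)

# Toy system: two atoms of radius 2 Å with a narrow gap between them.
atoms = [((-3, 0, 0), 2.0), ((3, 0, 0), 2.0)]
centers = large_probe_centers(atoms, r_large=4.0, lo=-12, hi=12, step=2)
print(is_pocket_point((0, 0, 0), atoms, centers, 0.5, 4.0))   # gap between atoms
print(is_pocket_point((0, 10, 0), atoms, centers, 0.5, 4.0))  # open solvent
```

Increasing `r_large` in this sketch makes shallower regions qualify as pockets, which mirrors the abstract's observation that larger probe radii suit shallow peptide and oligosaccharide sites.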

Depth is a term frequently applied to the shape and surface of macromolecules, describing for example the grooves in DNA, the shape of an enzyme active site, or the binding site for a small molecule in a protein. Yet depth is a difficult property to define rigorously in a macromolecule, and few computational tools exist to quantify this notion, to visualize it, or analyze the results. We present our notion of travel depth, simply put the physical distance a solvent molecule would have to travel from a surface point to a suitably defined reference surface. To define the reference surface, we use the limiting form of the molecular surface with increasing probe size: the convex hull. We then present a fast, robust approximation algorithm to compute travel depth to every surface point. The travel depth is useful because it works for pockets of any size and complexity. It also works for two interesting special cases. First, it works on the grooves in DNA, which are unbounded in one direction. Second, it works on the case of tunnels, that is pockets that have no "bottom", but go through the entire macromolecule. Our algorithm makes it straightforward to quantify discussions of depth when analyzing structures. High-throughput analysis of macromolecule depth is also enabled by our algorithm. This is demonstrated by analyzing a database of protein-small molecule binding pockets, and the distribution of bound magnesium ions in RNA structures. These analyses show significant, but subtle effects of depth on ligand binding localization and strength.

In this article we introduce a new method for the identification and the accurate characterization of protein surface cavities. The method is encoded in the program SCREEN (Surface Cavity REcognition and EvaluatioN). As a first test of the utility of our approach we used SCREEN to locate and analyze the surface cavities of a nonredundant set of 99 proteins cocrystallized with drugs. We find that this set of proteins has on average about 14 distinct cavities per protein. In all cases, a drug is bound at one (and sometimes more than one) of these cavities. Using cavity size alone as a criterion for predicting drug-binding sites yields a high balanced error rate of 15.7%, with only 71.7% coverage. Here we characterize each surface cavity by computing a comprehensive set of 408 physicochemical, structural, and geometric attributes. By applying modern machine learning techniques (Random Forests) we were able to develop a classifier that can identify drug-binding cavities with a balanced error rate of 7.2% and coverage of 88.9%. Only 18 of the 408 cavity attributes had a statistically significant role in the prediction. Of these 18 important attributes, almost all involved size and shape rather than physicochemical properties of the surface cavity. The implications of these results are discussed. A SCREEN Web server is available at http://interface.bioc.columbia.edu/screen.

The accurate identification of ligand binding sites in protein structures can be valuable in determining protein function. Once the binding site is known, it becomes easier to perform in silico and experimental procedures that may allow the ligand type and the protein function to be determined. For example, binding pocket shape analysis relies heavily on the correct localization of the ligand binding site. We have developed SURFNET-ConSurf, a modular, two-stage method for identifying the location and shape of potential ligand binding pockets in protein structures. In the first stage, the SURFNET program identifies clefts in the protein surface that are potential binding sites. In the second stage, these clefts are trimmed in size by cutting away regions distant from highly conserved residues, as defined by the ConSurf-HSSP database. The largest clefts that remain tend to be those where ligands bind. To test the approach, we analyzed a nonredundant set of 244 protein structures from the PDB and found that SURFNET-ConSurf identifies a ligand binding pocket in 75% of them. The trimming procedure reduces the original cleft volumes by 30% on average, while still encompassing an average 87% of the ligand volume. From the analysis of the results we conclude that for those cases in which the ligands are found in large, highly conserved clefts, the combined SURFNET-ConSurf method gives pockets that are a better match to the ligand shape and location. We also show that this approach works better for enzymes than for nonenzyme proteins.

BACKGROUND:Identifying pockets on protein surfaces is of great importance for many structure-based drug design applications and protein-ligand docking algorithms. Over the last ten years, many geometric methods for the prediction of ligand-binding sites have been developed.

Increasing attention has been dedicated to the characterization of complex networks within the protein world. This work reports how we uncovered networked structures that reflect the structural similarities among protein binding sites. First, a dataset of 211 binding sites was compiled by removing the redundant proteins in the Protein Ligand Database (PLD) (http://www-mitchell.ch.cam.ac.uk/pld/). Using a clique detection algorithm, we performed all-against-all binding site comparisons among the 211 available ones. Within the set of nodes representing each binding site, an edge was added whenever a pair of binding sites had a similarity higher than a threshold value. The generated similarity networks revealed that many nodes had few links and only a few were highly connected, but due to the limited data available it was not possible to definitively prove a scale-free architecture. Within the same dataset, the binding site similarity networks were compared with the sequence and fold similarity networks. In the protein world, indications were found that structure is better conserved than sequence, but on its own, sequence was better conserved than the subset of functional residues forming the binding site. Because a binding site is strongly linked with protein function, the identification of protein binding site similarity networks could accelerate the functional annotation of newly identified genes. In view of this we have discussed several potential applications of binding site similarity networks, such as the construction of novel binding site classification databases, as well as the implications for protein molecular design in general and computational chemogenomics in particular.
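The network construction described above (one node per binding site, one edge per pair whose similarity exceeds a threshold) can be sketched in a few lines. The similarity function is left as a caller-supplied placeholder, since the clique-based score used in the paper is not specified here; the function name and threshold are assumptions.

```python
from collections import Counter
from itertools import combinations

def similarity_network(sites, similarity, threshold):
    """Build an undirected similarity network.

    sites: iterable of hashable site identifiers.
    similarity: callable (a, b) -> float; a stand-in for the
                clique-based score, which is not reproduced here.
    Returns (edges, degree): a set of frozenset pairs and a Counter
    mapping each site to its number of links.
    """
    sites = list(sites)
    edges = {
        frozenset((a, b))
        for a, b in combinations(sites, 2)
        if similarity(a, b) > threshold
    }
    degree = Counter({s: 0 for s in sites})  # keep isolated nodes at 0
    for edge in edges:
        for node in edge:
            degree[node] += 1
    return edges, degree
```

A histogram of the degrees, for example `Counter(degree.values())`, is the quantity one would inspect for the "many nodes with few links, few highly connected nodes" pattern mentioned in the abstract.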

BACKGROUND:The main aim of this study was to develop and implement an algorithm for the rapid, accurate and automated identification of paths leading from buried protein clefts, pockets and cavities in dynamic and static protein structures to the outside solvent.

The sc-PDB is a collection of 6 415 three-dimensional structures of binding sites found in the Protein Data Bank (PDB). Binding sites were extracted from all high-resolution crystal structures in which a complex between a protein cavity and a small-molecular-weight ligand could be identified. Importantly, ligands are considered from a pharmacological and not a structural point of view. Therefore, solvents, detergents, and most metal ions are not stored in the sc-PDB. Ligands are classified into four main categories: nucleotides (< 4-mer), peptides (< 9-mer), cofactors, and organic compounds. The corresponding binding site is formed by all protein residues (including amino acids, cofactors, and important metal ions) with at least one atom within 6.5 angstroms of any ligand atom. The database was carefully annotated by browsing several protein databases (PDB, UniProt, and GO) and storing, for every sc-PDB entry, the following features: protein name, function, source, domain and mutations, ligand name, and structure. The repository of ligands has also been archived by diversity analysis of molecular scaffolds, and several chemoinformatics descriptors were computed to better understand the chemical space covered by stored ligands. The sc-PDB may be used for several purposes: (i) screening a collection of binding sites for predicting the most likely target(s) of any ligand, (ii) analyzing the molecular similarity between different cavities, and (iii) deriving rules that describe the relationship between ligand pharmacophoric points and active-site properties. The database is periodically updated and accessible on the web at http://bioinfo-pharma.u-strasbg.fr/scPDB/.
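The binding-site definition given above (all residues with at least one atom within 6.5 Å of any ligand atom) translates directly into a distance filter. The sketch below assumes a simple in-memory representation, with protein atoms as `(residue_id, (x, y, z))` pairs and ligand atoms as coordinate tuples; the function name and data layout are illustrative and not part of sc-PDB.

```python
import math

CUTOFF = 6.5  # Å, the distance stated in the abstract

def binding_site_residues(protein_atoms, ligand_atoms, cutoff=CUTOFF):
    """Residues with at least one atom within cutoff of any ligand atom.

    protein_atoms: iterable of (residue_id, (x, y, z)).
    ligand_atoms: list of (x, y, z).
    Returns a sorted list of residue identifiers.
    """
    ligand_atoms = list(ligand_atoms)
    site = set()
    for res_id, xyz in protein_atoms:
        if res_id in site:
            continue  # residue already selected by another atom
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_atoms):
            site.add(res_id)
    return sorted(site)
```

For structures with many atoms, a spatial index such as a cell list would avoid the all-pairs comparison, but the selection rule stays the same.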

Protein surface regions with similar physicochemical properties and shapes may perform similar functions and bind similar binding partners. Here we present two web servers and software packages for recognition of the similarity of binding sites and interfaces. Both methods recognize local geometrical and physicochemical similarity, which can be present even in the absence of overall sequence or fold similarity. The first method, SiteEngine (http://bioinfo3d.cs.tau.ac.il/SiteEngine), receives as an input two protein structures and searches the complete surface of one protein for regions similar to the binding site of the other. The second, Interface-to-Interface (I2I)-SiteEngine (http://bioinfo3d.cs.tau.ac.il/I2I-SiteEngine), compares protein-protein interfaces, which are regions of interaction between two protein molecules. It receives as an input two structures of protein-protein complexes, extracts the interfaces and finds the three-dimensional transformation that maximizes the similarity between two pairs of interacting binding sites. The output of both servers consists of a superimposition in PDB file format and a list of physicochemical properties shared by the compared entities. The methods are highly efficient and the freely available software packages are suitable for large-scale database searches of the entire PDB.

MOTIVATION:Identifying the location of ligand binding sites on a protein is of fundamental importance for a range of applications including molecular docking, de novo drug design and structural identification and comparison of functional sites. Here, we describe a new method of ligand binding site prediction called Q-SiteFinder. It uses the interaction energy between the protein and a simple van der Waals probe to locate energetically favourable binding sites. Energetically favourable probe sites are clustered according to their spatial proximity and clusters are then ranked according to the sum of interaction energies for sites within each cluster.
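The cluster-and-rank step described above can be sketched as follows. This is not the Q-SiteFinder code: the energy cutoff, the linking distance and the single-linkage rule used here are assumptions, as is the function name. Probe sites at or below the cutoff are kept, grouped by spatial proximity, and the groups are ordered by their summed interaction energy, most favourable first.

```python
import math

def cluster_probes(probes, energy_cutoff, link_dist):
    """Cluster favourable probe sites and rank clusters by total energy.

    probes: list of ((x, y, z), energy); lower energy is more favourable.
    energy_cutoff, link_dist: assumed parameters for this sketch.
    Returns a list of clusters (lists of probes), most negative sum first.
    """
    favourable = [p for p in probes if p[1] <= energy_cutoff]
    unvisited = set(range(len(favourable)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        members, stack = [seed], [seed]
        while stack:  # grow the cluster by single linkage
            i = stack.pop()
            near = [j for j in unvisited
                    if math.dist(favourable[i][0], favourable[j][0]) <= link_dist]
            for j in near:
                unvisited.remove(j)
            members.extend(near)
            stack.extend(near)
        clusters.append([favourable[k] for k in members])
    return sorted(clusters, key=lambda c: sum(e for _, e in c))
```

Ranking by the summed energy favours clusters that are both large and energetically favourable, so a small group of very strong probe sites can still be outranked by a larger group of moderately favourable ones.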

We developed a new computational algorithm for the accurate identification of ligand binding envelopes rather than surface binding sites. We performed a large scale classification of the identified envelopes according to their shape and physicochemical properties. The predicting algorithm, called PocketFinder, uses a transformation of the Lennard-Jones potential calculated from a three-dimensional protein structure and does not require any knowledge about a potential ligand molecule. We validated this algorithm using two systematically collected data sets of ligand binding pockets from complexed (bound) and uncomplexed (apo) structures from the Protein Data Bank, 5616 and 11,510, respectively. As many as 96.8% of experimental binding sites were predicted at better than 50% overlap level. Furthermore 95.0% of the asserted sites from the apo receptors were predicted at the same level. We demonstrate that conformational differences between the apo and bound pockets do not dramatically affect the prediction results. The algorithm can be used to predict ligand binding pockets of uncharacterized protein structures, suggest new allosteric pockets, evaluate feasibility of protein-protein interaction inhibition, and prioritize molecular targets. Finally the data base of the known and predicted binding pockets for the human proteome structures, the human pocketome, was collected and classified. The pocketome can be used for rapid evaluation of possible binding partners of a given chemical compound.

Recognition of regions on the surface of one protein, that are similar to a binding site of another is crucial for the prediction of molecular interactions and for functional classifications. We first describe a novel method, SiteEngine, that assumes no sequence or fold similarities and is able to recognize proteins that have similar binding sites and may perform similar functions. We achieve high efficiency and speed by introducing a low-resolution surface representation via chemically important surface points, by hashing triangles of physico-chemical properties and by application of hierarchical scoring schemes for a thorough exploration of global and local similarities. We proceed to rigorously apply this method to functional site recognition in three possible ways: first, we search a given functional site on a large set of complete protein structures. Second, a potential functional site on a protein of interest is compared with known binding sites, to recognize similar features. Third, a complete protein structure is searched for the presence of an a priori unknown functional site, similar to known sites. Our method is robust and efficient enough to allow computationally demanding applications such as the first and the third. From the biological standpoint, the first application may identify secondary binding sites of drugs that may lead to side-effects. The third application finds new potential sites on the protein that may provide targets for drug design. Each of the three applications may aid in assigning a function and in classification of binding patterns. We highlight the advantages and disadvantages of each type of search, provide examples of large-scale searches of the entire Protein Data Base and make functional predictions.

We have developed a new computational algorithm for de novo identification of protein-ligand binding pockets and performed a large-scale validation of the algorithm on two systematically collected datasets from all crystallographic structures in the Protein Data Bank (PDB). This algorithm, called DrugSite, takes a three-dimensional protein structure as input and returns the location, volume and shape of the putative small molecule binding sites by using a physical potential and without any knowledge about a potential ligand molecule. We validated this method using 17,126 binding sites from complexes and apo-structures from the PDB. Out of 5,616 binding sites from protein-ligand complexes, 98.8% were identified by predicted pockets. In proteins having known binding sites, 80.9% were predicted by the largest predicted pocket and 92.7% by the first two. The average ratio of predicted contact area to the total surface area of the protein was 4.7% for the predicted pockets. In only 1.2% of the cases, no "pocket density" was found at the ligand location. Further, 98.6% of 11,510 binding sites collected from apo-structures were predicted. The algorithm is accurate and fast enough to predict protein-ligand binding sites of uncharacterized protein structures, suggest new allosteric druggable pockets, evaluate druggability of protein-protein interfaces and prioritize molecular targets by druggability. Furthermore, the known and the predicted binding pockets for the proteome of a particular organism can be clustered into a "pocketome", that can be used for rapid evaluation of possible binding partners of a given chemical compound.

PDBSiteScan: a program for searching for active, binding and posttranslational modification sites in the 3D structures of proteins
Ivanisenko, VA and Pintus, SS and Grigorovich, DA
Nucleic acids..., 2004, 32, W549-W554

PDBSiteScan is a web-accessible program designed for searching three-dimensional (3D) protein fragments similar in structure to known active, binding and posttranslational modification sites. A collection of known sites we designated as PDBSite was set up by automated processing of the PDB database using the data on site localization in the SITE field. Additionally, protein-protein interaction sites were generated by analysis of atom coordinates in heterocomplexes. The total number of collected sites was more than 8100; they were assigned to more than 80 functional groups. PDBSiteScan provides automated search of the 3D protein fragments whose maximum distance mismatch (MDM) between N, Cα and C atoms in a fragment and a functional site is not larger than the MDM threshold defined by the user. PDBSiteScan requires perfect matching of amino acids. PDBSiteScan enables recognition of functional sites in tertiary structures of proteins and allows proteins with functional information to be annotated. The program PDBSiteScan is available at http://wwwmgs.bionet.nsc.ru/mgs/systems/fastprot/pdbsitescan.html.

Computational mapping methods place molecular probes (small molecules or functional groups) on a protein surface in order to identify the most favorable binding positions by calculating an interaction potential. Mapping is an important step in a number of flexible docking and drug design algorithms. We have developed improved algorithms for mapping protein surfaces using small organic molecules as molecular probes. The calculations reproduce the binding of eight organic solvents to lysozyme as observed by NMR, as well as the binding of four solvents to thermolysin, in good agreement with x-ray data. Application to protein tyrosine phosphatase 1B shows that the information provided by the mapping can be very useful for drug design. We also studied why the organic solvents bind in the active site of proteins, in spite of the availability of alternative pockets that can very tightly accommodate some of the probes. A possible explanation is that the binding in the relatively large active site retains a number of rotational states, and hence leads to smaller entropy loss than binding elsewhere. Indeed, the mapping reveals that the clusters of the ligand molecules in the protein's active site contain different rotational-translational conformers, which represent different local minima of the free energy surface. In order to study the transitions between different conformers, reaction path and molecular dynamics calculations were performed. Results show that most of the rotational states are separated by low free energy barriers at the experimental temperature, and hence the entropy of binding in the active site is expected to be high.

An innovative bioinformatic method has been designed and implemented to detect similar three-dimensional (3D) sites in proteins. This approach allows the comparison of protein structures or substructures and detects local spatial similarities: this method is completely independent from the amino acid sequence and from the backbone structure. In contrast to already existing tools, the basis for this method is a representation of the protein structure by a set of stereochemical groups that are defined independently from the notion of amino acid. An efficient heuristic for finding similarities that uses graphs of triangles of chemical groups to represent the protein structures has been developed. The implementation of this heuristic constitutes a software named SuMo (Surfing the Molecules), which allows the dynamic definition of chemical groups, the selection of sites in the proteins, and the management and screening of databases. To show the relevance of this approach, we focused on two extreme examples illustrating convergent and divergent evolution. In two unrelated serine proteases, SuMo detects one common site, which corresponds to the catalytic triad. In the legume lectins family composed of >100 structures that share similar sequences and folds but may have lost their ability to bind a carbohydrate molecule, SuMo discriminates between functional and non-functional lectins with a selectivity of 96%. The time needed for searching a given site in a protein structure is typically 0.1 s on a PIII 800MHz/Linux computer; thus, in further studies, SuMo will be used to screen the PDB.

We present a new shape-based method, LigandFit, for accurately docking ligands into protein active sites. The method employs a cavity detection algorithm for detecting invaginations in the protein as candidate active site regions. A shape comparison filter is combined with a Monte Carlo conformational search for generating ligand poses consistent with the active site shape. Candidate poses are minimized in the context of the active site using a grid-based method for evaluating protein-ligand interaction energies. Errors arising from grid interpolation are dramatically reduced using a new non-linear interpolation scheme. Results are presented for 19 diverse protein-ligand complexes. The method appears quite promising, reproducing the X-ray structure ligand pose within an RMS of 2 Å in 14 out of the 19 complexes. A high-throughput screening study applied to the thymidine kinase receptor is also presented in which LigandFit, when combined with LigScore, an internally developed scoring function [1], yields very good hit rates for a ligand pool seeded with known actives.

2002

Rapid progress in structural biology and whole-genome sequencing technology means that, for many protein families, structural and evolutionary information are readily available. Recent developments demonstrate how this information can be integrated to identify canonical determinants of protein structure and function. Among these determinants, those residues that are on protein surfaces are especially likely to form binding sites and are the logical choice for further mutational analysis and drug targeting.

MOTIVATION: A number of proteins of known three-dimensional (3D) structure exist, with yet unknown function. In light of the recent progress in structure determination methodology, this number is likely to increase rapidly. A novel method is presented here: 'Rate4Site', which maps the rate of evolution among homologous proteins onto the molecular surface of one of the homologues whose 3D-structure is known. Functionally important regions often correspond to surface patches of slowly evolving residues. RESULTS: Rate4Site estimates the rate of evolution of amino acid sites using the maximum likelihood (ML) principle. The ML estimate of the rates considers the topology and branch lengths of the phylogenetic tree, as well as the underlying stochastic process. To demonstrate its potency, we study the Src SH2 domain. Like previously established methods, Rate4Site detected the SH2 peptide-binding groove. Interestingly, it also detected inter-domain interactions between the SH2 domain and the rest of the Src protein that other methods failed to detect.

Molecular surfaces are important because surface-shape complementarity is often a necessary condition in protein-ligand interactions and docking studies. We have previously described a fast and efficient method to obtain triangulated surface-meshes by topologically mapping ellipsoids on molecular surfaces. In this paper, we present an extension of our work to spherical harmonic surfaces in order to approximate molecular surfaces of both ligands and receptor-cavities and to easily check the surface-shape complementarity. The method consists of (1) finding lobes and holes on both ligand and cavity surfaces using contour maps of radius functions with spherical harmonic expansions, and (2) superposing the surfaces around a given binding site by minimizing the distance between their respective expansion coefficients. The capabilities of this docking procedure were demonstrated by application to 35 protein-ligand complexes of known crystal structures. The method can also be used easily and efficiently as a filter to detect, within a large conformational sampling, the conformations that show good complementarity with the receptor site and are therefore good candidates for further, more elaborate docking studies. This "virtual screening" was demonstrated on the platelet thrombin receptor.

Experimental approaches for the identification of functionally important regions on the surface of a protein involve mutagenesis, in which exposed residues are replaced one after another while the change in binding to other proteins or changes in activity are recorded. However, practical considerations limit the use of these methods to small-scale studies, precluding a full mapping of all the functionally important residues on the surface of a protein. We present here an alternative approach involving the use of evolutionary data in the form of multiple-sequence alignment for a protein family to identify hot spots and surface patches that are likely to be in contact with other proteins, domains, peptides, DNA, RNA or ligands. The underlying assumption in this approach is that key residues that are important for binding should be conserved throughout evolution, just like residues that are crucial for maintaining the protein fold, i.e. buried residues. A main limitation in the implementation of this approach is that the sequence space of a protein family may be unevenly sampled, e.g. mammals may be overly represented. Thus, a seemingly conserved position in the alignment may reflect a taxonomically uneven sampling, rather than being indicative of structural or functional importance. To avoid this problem, we present here a novel methodology based on evolutionary relations among proteins as revealed by inferred phylogenetic trees, and demonstrate its capabilities for mapping binding sites in SH2 and PTB signaling domains. A computer program that implements these ideas is available freely at: http://ashtoret.tau.ac.il/~rony

A major problem in genome annotation is whether it is valid to transfer the function from a characterised protein to a homologue of unknown activity. Here, we show that one can employ a strategy that uses a structure-based prediction of protein functional sites to assess the reliability of functional inheritance. We have automated and benchmarked a method based on the evolutionary trace approach. Using a multiple sequence alignment, we identified invariant polar residues, which were then mapped onto the protein structure. Spatial clusters of these invariant residues formed the predicted functional site. For 68 of 86 proteins examined, the method yielded information about the observed functional site. This algorithm for functional site prediction was then used to assess the validity of transferring the function between homologues. This procedure was tested on 18 pairs of homologous proteins with unrelated function and 70 pairs of proteins with related function, and was shown to be 94 % accurate. This automated method could be linked to schemes for genome annotation. Finally, we examined the use of functional site prediction in protein-protein and protein-DNA docking. The use of predicted functional sites was shown to filter putative docked complexes with a discrimination similar to that obtained by manually including biological information about active sites or DNA-binding residues.
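The first step described above, finding invariant polar residues in a multiple sequence alignment, can be illustrated with a minimal sketch. The function name and the set of residues treated as polar are assumptions made here for illustration; the abstract does not list them, and the spatial clustering on the structure is not shown.

```python
# Illustrative sketch only: POLAR is an assumed definition of "polar" residues,
# and invariant_polar_columns is a hypothetical helper, not the authors' code.
POLAR = set("DEHKNQRSTY")

def invariant_polar_columns(alignment):
    """Return 0-based alignment columns where all sequences share one polar residue."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    hits = []
    for col in range(length):
        residues = {seq[col] for seq in alignment}
        # Invariant: exactly one residue type in the column; gaps ('-') are not polar.
        if len(residues) == 1 and next(iter(residues)) in POLAR:
            hits.append(col)
    return hits

# Toy alignment (made-up sequences).
alignment = [
    "MKHDLSG",
    "MRHDVSG",
    "LKHDISA",
]
print(invariant_polar_columns(alignment))
```

In a fuller pipeline, the returned columns would be mapped to residue positions in the structure before looking for spatial clusters.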

2000

PASS (Putative Active Sites with Spheres) is a simple computational tool that uses geometry to characterize regions of buried volume in proteins and to identify positions likely to represent binding sites based upon the size, shape, and burial extent of these volumes. Its utility as a predictive tool for binding site identification is tested by predicting known binding sites of proteins in the PDB using both complexed macromolecules and their corresponding apoprotein structures. The results indicate that PASS can serve as a front-end to fast docking. The main utility of PASS lies in the fact that it can analyze a moderate-size protein (approximately 30 kDa) in under 20 s, which makes it suitable for interactive molecular modeling, protein database analysis, and aggressive virtual screening efforts. As a modeling tool, PASS (i) rapidly identifies favorable regions of the protein surface, (ii) simplifies visualization of residues modulating binding in these regions, and (iii) provides a means of directly visualizing buried volume, which is often inferred indirectly from curvature in a surface representation. PASS produces output in the form of standard PDB files, which are suitable for any modeling package, and provides script files to simplify visualization in Cerius2, InsightII, MOE, Quanta, RasMol, and Sybyl. PASS is freely available to all.

An automated computer-based method for mapping of protein surface cavities was developed and applied to a set of 176 metalloproteinases containing zinc cations in their active sites. With very few exceptions, the cavity search routine detected the active site among the five largest cavities and produced reasonable active site surfaces. Cavities were described by means of solvent-accessible surface patches. For a given protein, these patches were calculated in three steps: (i) definition of cavity atoms forming surface cavities by a grid-based technique; (ii) generation of solvent accessible surfaces; (iii) assignment of an accessibility value and a generalized atom type to each surface point. Topological correlation vectors were generated from the set of surface points forming the cavities, and projected onto the plane by a self-organizing network. The resulting map of 865 enzyme cavities displays clusters of active sites that are clearly separated from the other cavities. It is demonstrated that both fully automated recognition of active sites, and prediction of enzyme class can be performed for novel protein structures at high accuracy.

We report a procedure for the description and comparison of protein surfaces, which is based on a three-dimensional (3D) transposition of the profile method for sensitive protein homology sequence searches. Although the principle of the method can be applied to detect similarities to a single protein surface, the possibility of extending this approach to protein families displaying common structural and/or functional properties makes it a more powerful tool.
In analogy to profiles derived from the multiple alignment of protein sequences, we derive a 3D surface profile from a protein structure or from a multiple structure alignment of several proteins. The 3D profile is used to screen the protein structure database, searching for similar protein surfaces.
The application of the procedure to SH2 and SH3 binding pockets and to the nucleotide binding pocket associated with the p-loop structural motif is described. The SH2 and SH3 3D profiles can identify all the SH2 and SH3 binding regions present in the test dataset; the p-loop 3D profile is able to recognize all the p-loop-containing proteins present in the test dataset. Analysis of the p-loop 3D profile allowed the identification of a positive charge whose position is conserved in space but not in sequence. The best ranking non-p-loop-containing protein is an ADP-forming succinyl coenzyme A synthetase, whose nucleotide-binding region has not yet been identified.

We are developing a new site descriptor for the DOCK molecular modeling program suite. Sphgen, the current site description program for the DOCK suite, describes the pockets of a macromolecule by filling a volume with intersecting spheres. DOCK then identifies possible ligand orientations in the pocket by overlapping the atoms of proposed ligands with the sphere centers. Sphgen limits use of the DOCK program to concave binding regions, but macromolecular binding regions can be solvent-exposed rather than buried pockets. We present a more general site descriptor, based on the surface solid angle, which generates site points by determining the solid angle of exposure for points on the surface of the molecule, then identifying patches of surface with similar solid angle values, which are then built into site points. We find possible ligand orientations by matching shape-based site points on the ligand and protein and demanding complementary solid angle values. Orientations are evaluated using DOCK's force field-based score, which evaluates the Coulombic and van der Waals energy. The surface solid angle descriptor displays the complementary characteristics of the interfaces of our test systems: trypsin/trypsin inhibitor, chymotrypsin/turkey ovomucoid third domain, and subtilisin/chymotrypsin inhibitor. The solid angle site points can be used by DOCK to generate orientations within 1.5 Å r.m.s.d. of the crystal structure orientation.

Identification and size characterization of surface pockets and occluded cavities are initial steps in protein structure-based ligand design. A new program, CAST, for automatically locating and measuring protein pockets and cavities, is based on precise computational geometry methods, including alpha shape and discrete flow theory. CAST identifies and measures pockets and pocket mouth openings, as well as cavities. The program specifies the atoms lining pockets, pocket openings, and buried cavities; the volume and area of pockets and cavities; and the area and circumference of mouth openings. CAST analysis of over 100 proteins has been carried out; proteins examined include a set of 51 monomeric enzyme-ligand structures, several elastase-inhibitor complexes, the FK506 binding protein, 30 HIV-1 protease-inhibitor complexes, and a number of small and large protein inhibitors. Medium-sized globular proteins typically have 10-20 pockets/cavities. Most often, binding sites are pockets with 1-2 mouth openings; much less frequently they are cavities. Ligand binding pockets vary widely in size, most within the range 10^2-10^3 Å^3. Statistical analysis reveals that the number of pockets and cavities is correlated with protein size, but there is no correlation between the size of the protein and the size of binding sites. Most frequently, the largest pocket/cavity is the active site, but there are a number of instructive exceptions. Ligand volume and binding site volume are somewhat correlated when binding site volume is < or

Molecular docking is a popular way to screen for novel drug compounds. The method involves aligning small molecules to a protein structure and estimating their binding affinity. To do this rapidly for tens of thousands of molecules requires an effective representation of the binding region of the target protein. This paper presents an algorithm for representing a protein's binding site in a way that is specifically suited to molecular docking applications. Initially the protein's surface is coated with a collection of molecular fragments that could potentially interact with the protein. Each fragment, or probe, serves as a potential alignment point for atoms in a ligand, and is scored to represent that probe's affinity for the protein. Probes are then clustered by accumulating their affinities, where high affinity clusters are identified as being the "stickiest" portions of the protein surface. The stickiest cluster is used as a computational binding "pocket" for docking. This method of site identification was tested on a number of ligand-protein complexes; in each case the pocket constructed by the algorithm coincided with the known ligand binding site. Successful docking experiments demonstrated the effectiveness of the probe representation.
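The probe-clustering idea in the abstract above can be sketched as follows. This is a rough illustration under stated assumptions, not the published algorithm: the single-linkage clustering, the distance cutoff, the convention that a larger score means higher affinity, and the name stickiest_cluster are all choices made here for the example.

```python
import math

def stickiest_cluster(probes, cutoff):
    """Cluster probes and return (summed affinity, member indices) of the best cluster.

    probes: list of ((x, y, z), affinity) pairs; larger affinity is assumed stickier.
    cutoff: assumed linkage distance; probes closer than this join one cluster.
    """
    parent = list(range(len(probes)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Single-linkage: merge any two probes within the cutoff distance.
    for i in range(len(probes)):
        for j in range(i + 1, len(probes)):
            if math.dist(probes[i][0], probes[j][0]) <= cutoff:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(probes)):
        clusters.setdefault(find(i), []).append(i)

    # Accumulate affinities per cluster and keep the highest total.
    best = max(clusters.values(), key=lambda members: sum(probes[i][1] for i in members))
    return sum(probes[i][1] for i in best), best

# Made-up probe positions and scores.
probes = [((0, 0, 0), 1.0), ((1, 0, 0), 2.0), ((10, 0, 0), 2.5), ((1.5, 0, 0), 0.5)]
print(stickiest_cluster(probes, cutoff=1.2))
```

Summing affinities rather than taking the single best probe means that a dense patch of moderately favorable probes can outrank one isolated strong probe, which matches the "accumulating" wording of the abstract.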

LIGSITE is a new program for the automatic and time-efficient detection of pockets on the surface of proteins that may act as binding sites for small molecule ligands. Pockets are identified with a series of simple operations on a cubic grid. Using a set of receptor-ligand complexes we show that LIGSITE is able to identify the binding sites of small molecule ligands with high precision. The main advantage of LIGSITE is its speed. Typical search times are in the range of 5 to 20 s for medium-sized proteins. LIGSITE is therefore well suited for identification of pockets in large sets of proteins (e.g., protein families) for comparative studies. For graphical display LIGSITE produces VRML representations of the protein-ligand complex and the binding site for display with a VRML viewer such as WebSpace from SGI.
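The abstract above does not spell out which grid operations LIGSITE performs, so the following is only an illustrative sketch of one simple operation of that general kind: for each empty cell of a cubic grid, count along how many coordinate axes the cell is enclosed by protein on both sides. The function name, the boolean grid layout, and the restriction to the three coordinate axes are assumptions made for this example.

```python
def enclosed_axes(grid):
    """Map each solvent cell (x, y, z) to the number of axes (0..3) along which
    protein-occupied cells lie on both sides of it.

    grid[x][y][z] is True for protein-occupied cells (assumed input layout).
    """
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])

    def protein_along(x, y, z, dx, dy, dz):
        # Walk from the cell in one direction until protein or the grid edge is hit.
        x, y, z = x + dx, y + dy, z + dz
        while 0 <= x < nx and 0 <= y < ny and 0 <= z < nz:
            if grid[x][y][z]:
                return True
            x, y, z = x + dx, y + dy, z + dz
        return False

    axes = ((1, 0, 0), (0, 1, 0), (0, 0, 1))
    scores = {}
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                if grid[x][y][z]:
                    continue  # only solvent cells are scored
                scores[(x, y, z)] = sum(
                    protein_along(x, y, z, dx, dy, dz)
                    and protein_along(x, y, z, -dx, -dy, -dz)
                    for dx, dy, dz in axes
                )
    return scores

# Toy 3x3x3 block of protein with a two-cell channel opening on the +z face.
grid = [[[True] * 3 for _ in range(3)] for _ in range(3)]
grid[1][1][1] = False
grid[1][1][2] = False
print(enclosed_axes(grid))
```

Cells with a high count are more buried; a threshold on this count would then separate pocket cells from bulk solvent.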

The biological function of a protein typically depends on the structure of specific binding sites. These sites are located at the surface of the protein molecule and are determined by geometrical arrangements and physico-chemical properties of tens of non-hydrogen atoms. In this paper we describe a new algorithm called APROPOS, based purely on geometric criteria for identifying such binding sites using atomic co-ordinates. For the description of the protein shape we use an alpha-shape algorithm which generates a whole family of shapes with different levels of detail. Comparing shapes of different resolution we find cavities on the surface of the protein responsible for ligand binding. The algorithm correctly locates more than 95% of all binding sites for ligands and prosthetic groups of molecular mass between about 100 and 2000 Da in a representative set of proteins. Only in very few proteins does the method find binding sites of single ions outside the active site of enzymes. With one exception, we observe that interfaces between subunits show different geometric features compared to binding sites of ligands. Our results clearly support the view that protein-protein interactions occur between flat areas of protein surface whereas specific interactions of smaller ligands take place in pockets in the surface.

X-ray or NMR structures of proteins are often derived without their ligands, and even when the structure of a full complex is available, the area of contact that is functionally and energetically significant may be a specialized subset of the geometric interface deduced from the spatial proximity between ligands. Thus, even after a structure is solved, it remains a major theoretical and experimental goal to localize protein functional interfaces and understand the role of their constituent residues. The evolutionary trace method is a systematic, transparent and novel predictive technique that identifies active sites and functional interfaces in proteins with known structure. It is based on the extraction of functionally important residues from sequence conservation patterns in homologous proteins, and on their mapping onto the protein surface to generate clusters identifying functional interfaces. The SH2 and SH3 modular signaling domains and the DNA binding domain of the nuclear hormone receptors provide tests for the accuracy and validity of our method. In each case, the evolutionary trace delineates the functional epitope and identifies residues critical to binding specificity. Based on mutational evolutionary analysis and on the structural homology of protein families, this simple and versatile approach should help focus site-directed mutagenesis studies of structure-function relationships in macromolecules, as well as studies of specificity in molecular recognition. More generally, it provides an evolutionary perspective for judging the functional or structural role of each residue in protein structure.

1995

The biological activity of a protein typically depends on the presence of a small number of functional residues. Identifying these residues from the amino acid sequences alone would be useful. Classically, strictly conserved residues are predicted to be functional but often conservation patterns are more complicated. Here, we present a novel method that exploits such patterns for the prediction of functional residues. The method uses a simple but powerful representation of entire proteins, as well as sequence residues as vectors in a generalised 'sequence space'. Projection of these vectors onto a lower-dimensional space reveals groups of residues specific for particular subfamilies that are predicted to be directly involved in protein function. Based on the method we present testable predictions for sets of functional residues in SH2 domains and in the conserved box of cyclins.

The SURFNET program generates molecular surfaces and gaps between surfaces from 3D coordinates supplied in a PDB-format file. The gap regions can correspond to the voids between two or more molecules, or to the internal cavities and surface grooves within a single molecule. The program is particularly useful in clearly delineating the regions of the active site of a protein. It can also generate 3D contour surfaces of the density distributions of any set of 3D data points. All output surfaces can be viewed interactively, along with the molecules or data points in question, using some of the best-known molecular modeling packages. In addition, PostScript output is available, and the generated surfaces can be rendered using various other graphics packages.

A computer program, VOIDOO, is described which can be employed in the study of cavities such as they occur in macromolecular structures (in particular, in proteins). The program can be used to detect unknown cavities or to delineate known cavities, either of which may be connected to the outside of the molecule or molecular assembly under study. Optionally, output files can be requested that contain a description of the shape of the cavity which can be displayed by the crystallographic modelling program O. Additionally, VOIDOO can be used to calculate the volume of a molecule and to create a file containing data pertaining to the surface of the molecule which can also be displayed using O. Examples of the use of VOIDOO are given for P2 myelin protein, cellular retinol-binding protein and cellobiohydrolase II. Finally, operational definitions to discern different types of cavity are introduced and guidelines for assessing the accuracy and improving the comparability of cavity calculations are given.

1993

A new approach to the automatic identification of candidates for ligand receptor sites in proteins: (I). Search for pocket regions.
Del Carpio, C A and Takahashi, Y and Sasaki, S
Journal of molecular graphics, 1993, 11(1), 23-9, 42
PMID: 8499393

The work presented here is aimed at the topographical analysis of localized regions of receptor proteins leading to the identification of pocket areas (superficial depressions or internal cavities), which may play the role of receptor sites. An algorithm is described that yields complete information about the position of each cavity or superficial depression relative to any point of the protein molecules, as well as detailed information on the atoms constituting it. The applicability of this algorithm to the automatic identification of candidate receptor sites in a receptor protein is also discussed using the typical receptor structure dihydrofolate reductase-methotrexate complex.

1992

A method for solid-filling protein cavities is presented. The method uses a pattern-recognition technique based on cellular logic operations to distinguish between convex and concave regions of a protein. In doing this it solid fills protein cavities and automatically defines a boundary between cavity and exterior free space. The operations used to fill the cavities also can be used to process the filler to filter out small-scale features. So far the main use of the method has been in visualizing protein active sites for docking. The method can be used to find cavities of a given size range and could be used to find novel protein binding sites.

A new interactive graphics program is described that provides a quick and simple procedure for identifying, displaying, and manipulating the indentations, cavities, or holes in a known protein structure. These regions are defined as, e.g., the xo, yo, zo values at which a test sphere of radius r can be placed without touching the centers of any protein atoms, subject to the condition that there is some x < xo and some x > xo where the sphere does touch the protein atoms. The surfaces of these pockets are modeled using a modification of the marching cubes algorithm. This modification provides identification of each closed surface so that by "clicking" on any line of the surface, the entire surface can be selected. The surface can be displayed either as a line grid or as a solid surface. After the desired "pocket" has been selected, the amino acid residues and atoms that surround this pocket can be selected and displayed. The protein database that is input can have more than one protein "segment," allowing identification of the pockets at the interface between proteins. The use of the program is illustrated with several specific examples. The program is written in C and requires Silicon Graphics graphics routines.
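The test-sphere criterion quoted above can be written down directly. The sketch below implements only the x-direction condition given as an example in the abstract (the program itself may apply further conditions), and the function name and data layout are assumptions for illustration. A point qualifies when the sphere centered there contains no atom center, while sliding the sphere along x in both directions would eventually bring an atom center within radius r.

```python
def is_pocket_point(point, atoms, r):
    """Check the quoted criterion for one candidate point.

    point: (x0, y0, z0) candidate sphere center.
    atoms: iterable of (x, y, z) atom centers.
    r: test sphere radius.
    """
    x0, y0, z0 = point
    touch_below = touch_above = False
    for ax, ay, az in atoms:
        perp2 = (ay - y0) ** 2 + (az - z0) ** 2
        if (ax - x0) ** 2 + perp2 <= r * r:
            return False  # the sphere at the point itself touches an atom center
        if perp2 <= r * r:
            # Sliding the sphere along x would touch this atom at some x.
            if ax < x0:
                touch_below = True
            else:
                touch_above = True
    return touch_below and touch_above

# Two made-up atoms on the x axis with a gap between them.
atoms = [(-3.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
print(is_pocket_point((0.0, 0.0, 0.0), atoms, 1.5))  # between the atoms
print(is_pocket_point((5.0, 0.0, 0.0), atoms, 1.5))  # outside both atoms
```

Evaluating this test on a regular grid of candidate points would give the set of points from which a surface could then be built.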

A set of algorithms designed to enhance the display of protein binding cavities is presented. These algorithms, collectively entitled CAVITY SEARCH, allow the user to isolate and fully define the extent of a particular cavity. Solid modeling techniques are employed to produce a detailed cast of the active site region, which can then be color-coded to show electrostatic and steric interactions between the protein cavity and a bound ligand.

1982

Computational drug repositioning offers promise for discovering new uses of existing drugs, as drug-related molecular, chemical, and clinical information has increased over the past decade and become broadly accessible. In this study, we present a new computational approach for identifying potential new indications of an existing drug through its relation to similar drugs in a disease-drug-target network. When measuring pairwise drug similarity, we used a bipartite-graph-based method that combined similarity of drug compound structures, similarity of target protein profiles, and interaction between target proteins. In evaluation, our method compared favorably to the state of the art, achieving an AUC of 0.888. The results indicated that our method is able to identify drug repositioning opportunities by exploring complex relationships in the disease-drug-target network.