Molecular Informatics

The development of high-throughput in vitro assays to study quantitatively the toxicity of chemical compounds on genetically characterized human-derived cell lines paves the way to predictive toxicogenetics, where one would be able to predict the toxicity of any particular compound on any particular individual. In this paper we present a machine learning-based approach for that purpose, kernel multitask regression (KMR), which combines chemical characterizations of molecular compounds with genetic and transcriptomic characterizations of cell lines to predict the toxicity of a given compound on a given cell line. We demonstrate the relevance of the method on the recent DREAM8 Toxicogenetics challenge, where it ranked among the best state-of-the-art models, and discuss the importance of choosing good descriptors for cell lines and chemicals.

Here, we describe an algorithm to visualize chemical structures on a grid-based layout so that similar structures are placed next to each other. It is based on structure reordering with the help of the Hilbert-Schmidt Independence Criterion, an empirical estimate of the Hilbert-Schmidt norm of the cross-covariance operator. The method can be applied to any layout of two- or three-dimensional shape. The approach is demonstrated on a set of dopamine D5 ligands visualized on square, disk and spherical layouts.
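
As a rough illustration of the criterion named above, the following sketch computes the standard biased empirical HSIC estimate, trace(KHLH)/(n-1)^2, between a Gram matrix over structures and a Gram matrix over layout positions. The Gaussian kernel, the toy one-dimensional descriptors and the two candidate layouts are assumptions made for this example only; the reordering search itself is not shown.

```python
import math


def gaussian_gram(points, sigma=1.0):
    """Gram matrix of a Gaussian kernel over a list of coordinate tuples."""
    return [
        [
            math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / (2 * sigma ** 2))
            for q in points
        ]
        for p in points
    ]


def center(gram):
    """Double-center a symmetric Gram matrix, i.e. compute H K H."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    grand_mean = sum(row_means) / n
    return [
        [gram[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def hsic(gram_x, gram_y):
    """Biased empirical HSIC estimate: trace(K H L H) / (n - 1)^2."""
    n = len(gram_x)
    kc, lc = center(gram_x), center(gram_y)
    trace = sum(kc[i][j] * lc[j][i] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2


# Hypothetical toy data: 1D "descriptors" for four structures and two
# candidate assignments of those structures to positions on a 1D layout.
descriptors = [(0.0,), (1.0,), (2.0,), (3.0,)]
aligned_layout = [(0.0,), (1.0,), (2.0,), (3.0,)]
shuffled_layout = [(2.0,), (0.0,), (3.0,), (1.0,)]

k = gaussian_gram(descriptors)
score_aligned = hsic(k, gaussian_gram(aligned_layout))
score_shuffled = hsic(k, gaussian_gram(shuffled_layout))
```

In this toy setting the layout that preserves the descriptor ordering scores higher than the shuffled one, which is the kind of signal a reordering procedure could try to maximize.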

Despite recent advances in Computer Aided Drug Discovery and High Throughput Screening, the attrition rates of drug candidates continue to be high, underscoring the inherent complexity of the drug discovery paradigm. Indeed, a compromise between several objectives is often required to obtain successful clinical drugs. The present manuscript details a multi-objective workflow that integrates the 4D-QSAR and molecular docking methods in the simultaneous modeling of the Rho Kinase inhibitory activity and acute toxicity of benzamide derivatives. To this end, the pIC50/pLD50 ratio is considered as the response variable, permitting the concurrent modeling of both properties and representing a shift from classical step-by-step evaluations. The 4D-QSAR strategy is used to generate the Grid Cell Occupancy Descriptors (GCODs), and the Stochastic Gradient Boosting (SGB) and Partial Least Squares (PLS) methods are used as the model fitting techniques. While the statistical parameters for the PLS model do not meet established criteria for acceptability, the SGB model yields satisfactory performance, with correlation coefficients r2=0.95 and r2pred=0.65 for the training and test set, respectively. Subsequently, the structural interpretation of the most relevant GCODs according to the SGB model is performed, allowing for the proposal of 139 novel benzamide derivatives, which are then screened using the same model. Of these, 9 compounds were predicted to possess pIC50/pLD50 ratio values higher than those for the employed dataset. Finally, in order to corroborate the results obtained with the SGB model, a docking simulation was performed to evaluate the binding affinity of the proposed molecules to the ROCK2 active site, and 3 chemical structures (i.e., p6, p14 and p131) showed higher binding affinity than the most active compound in the training set, while the rest generally demonstrated comparable behavior.
It may therefore be concluded that consensus models that intertwine the 4D-QSAR and molecular docking methods contribute to more reliable virtual screening and compound optimization experiments. Additionally, the use of multi-objective modeling schemes permits the simultaneous evaluation of different chemical and biological profiles, which should contribute to the a priori control of the factors behind the high attrition rates in later drug discovery phases.

In this study, a novel series of phenyl substituted imidazo[2,1-b][1,3,4]thiadiazole derivatives was synthesized, characterized and explored for antibacterial activity against Gram-negative Escherichia coli, Gram-positive Staphylococcus aureus and Bacillus subtilis and antifungal activity against Candida albicans. Most of the synthesized compounds exhibited remarkable antimicrobial activities, some being ten times more potent than the positive controls. The most promising compound showed excellent activity with a MIC value of 0.03 μg/ml against both S. aureus and B. subtilis (the MIC values of the positive control chloramphenicol are 0.4 μg/ml and 0.85 μg/ml, respectively). Furthermore, the structure-activity relationship was investigated with the help of computational tools. Some physicochemical and ADME properties of the compounds were also calculated. The combination of electronic structure calculations performed at the PM6 level and molecular docking simulations using Glide extra-precision mode showed that the hydrophobic nature of the keto aryl ring with no electron-withdrawing substituents at the para position enhances activity, while electron-donating substituents at the second aryl ring are detrimental to activity.

Active molecules among numerous chemical structures in a chemical database can be searched easily by statistical prediction of compound–protein interactions. However, constructing a simple prediction model against one protein does not aid drug design, because detecting chemical structures that act similarly against multiple proteins is necessary for preventing side effects of the potential drug. To tackle this problem, we propose a new method that visualizes chemical and protein spaces.
For simultaneous visualization of both spaces, we employ a counterpropagation neural network (CPNN) and develop a new visualization method named multi-input CPNN (MICPNN). In a case study of the kinase protein family, the MICPNN model accurately predicted the complex relationships between compounds and proteins. The proposed method identified chemical structures with promising activity against kinases. It is also applicable to other protein families, such as G-protein coupled receptors, ion channels and transporters.

The human ether-a-go-go related gene (hERG) K+ channel plays an important role in the cardiac action potential. Blockage of the hERG channel may result in long QT syndrome (LQTS) and even cause sudden cardiac death. Many drugs have been withdrawn from the market because of serious hERG-related cardiotoxicity. Therefore, it is essential to estimate the chemical blockage of hERG in the early stage of drug discovery. In this study, a diverse set of 3721 compounds with hERG inhibition data was assembled from the literature. We then used the Online Chemical Modeling Environment (OCHEM), which supplies a rich set of machine learning methods and descriptor sets, to build a series of classification models for hERG blockage. We also generated two consensus models based on the top-performing individual models. The consensus models performed much better than the individual models on both 5-fold cross validation and external validation. In particular, consensus model II yielded a prediction accuracy of 89.5 % and an MCC of 0.670 on external validation. This result indicated that the predictive power of consensus model II should be stronger than that of most previously reported models. The 17 top-performing individual models, the consensus models and the data sets used for model development are available at https://ochem.eu/article/103592.
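
For readers less familiar with the reported metrics, the sketch below shows how accuracy and the Matthews correlation coefficient (MCC) are computed from a binary confusion matrix, together with a simple majority-vote consensus. The majority vote is an assumption chosen for illustration and is not necessarily how the consensus models above were combined; all counts and labels are hypothetical.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified compounds."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def majority_vote(predictions):
    """Combine per-model 0/1 label lists into one consensus label list.

    `predictions` holds one list of labels per model, all in the same
    compound order. Ties count as 0 (non-blocker) in this sketch.
    """
    return [1 if 2 * sum(votes) > len(votes) else 0 for votes in zip(*predictions)]


# Hypothetical labels from three individual models for four compounds.
consensus = majority_vote([[1, 0, 1, 0], [1, 1, 0, 0], [0, 1, 1, 0]])

# Hypothetical confusion-matrix counts for an external validation set.
acc = accuracy(tp=40, tn=50, fp=5, fn=5)
coef = mcc(tp=40, tn=50, fp=5, fn=5)
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy for unbalanced blocker/non-blocker sets.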

There has been an increasing interest in the study of fluorinated derivatives of gamma-aminobutyric acid (GABA), an acetylcholine (AC) analog. This work reports a theoretical study on the effect of an α-carbonyl fluorination in AC, aiming at understanding the role of a distant fluorine relative to the positively charged nitrogen on the conformational folding of the resulting fluorinated AC. In addition, the chemical and structural changes were evaluated on the basis of ligand-enzyme (acetylcholinesterase) interactions. In an enzyme-free environment, the fluorination yields conformational changes relative to AC due to the appearance of some attractive interactions with fluorine and a weaker steric repulsion between the fluorine substituent and the carboxyl group, rather than to a possible electrostatic interaction F⋅⋅⋅N+. Moreover, the gauche orientation in the N−C−C−O fragment of AC owing to the electrostatic gauche effect is reinforced after fluorination. For instance, the conformational equilibrium in AC is described by a competition between gauche and anti conformers (accounting for the N−C−C−O dihedral angle) in DMSO, while the population for a gauche conformer in the fluorinated AC is almost 100 % in both gas phase and DMSO. However, this arrangement is disrupted in the biological environment even in the fluorinated derivative (whose bioconformation-like geometry shows a ligand-protein interaction of −84.1 kcal mol−1 against −79.5 kcal mol−1 for the most stable enzyme-free conformation), which shows an anti N−C−C−O orientation, because the enzyme induced-fit takes place. Nevertheless, the most likely bioconformation for the fluorinated AC does not match the bioactive AC backbone nor the most stable enzyme-free conformation, thus revealing the role of fluorination on the bioconformational control of AC.

In drug and material design, the activity and property values of the designed chemical structures can be predicted by quantitative structure−activity and structure−property relationship (QSAR/QSPR) models. When a QSAR/QSPR model is applied to chemical structures, its applicability domain (AD) must be considered. The predicted activity/property values are only reliable for chemical structures inside the AD. Chemical structures outside the AD are usually neglected, as the predicted values are unreliable. The purpose of this study is to develop a methodology for obtaining novel chemical structures with the desired activity or property based on a QSAR/QSPR model by making use of the neglected structures. We propose a structure modification strategy for the AD that considers the activity and property simultaneously. The AD is defined by a one-class support vector machine and the structure modification is guided by a partial derivative of the AD model and matched molecular pairs analysis. Three proof-of-concept case studies generate novel chemical structures inside the AD that exhibit preferable activity/property values according to the QSAR/QSPR model.

While recent literature focuses on drug promiscuity, the characterization of promiscuous binding sites (ability to bind several ligands) remains to be explored. Here, we present a proteochemometric modeling approach to analyze diverse ligands and corresponding multiple binding sub-pockets associated with one promiscuous binding site to characterize protein-ligand recognition. We analyze both geometrical and physicochemical profile correspondences. This approach was applied to examine the well-studied druggable urokinase catalytic domain inhibitor binding site, which results in a large number of complex structures bound to various ligands. This approach emphasizes the importance of jointly characterizing pocket and ligand spaces to explore the impact of ligand diversity on sub-pocket properties and to establish their main profile correspondences. This work supports an interest in mining available 3D holo structures associated with a promiscuous binding site to explore its main protein-ligand recognition tendency.

In drug discovery, network-based approaches are expected to advance our understanding of drug action across multiple layers of information. On the one hand, network pharmacology considers the drug response in the context of a cellular or phenotypic network. On the other hand, a chemical-based network is a promising alternative for characterizing the chemical space. Both can provide complementary support for the development of rational drug design and better knowledge of the mechanisms underlying the multiple actions of drugs. Recent progress in both concepts is discussed here. In addition, a network-based approach using drug-target-therapy data is introduced as an example.

Promiscuity is an interesting concept in fragment-based drug design, as fragments with low specificity can be advantageous for finding many screening hits. We present a PDB-wide analysis of multi-target fragments and their binding mode conservation. Focussing on multi-target fragments, we found that the majority show non-conserved binding modes, even if they bind in a similar conformation or to similar protein targets. Surprisingly, fragment properties alone are not able to predict whether a fragment will exhibit a versatile or conserved binding mode, emphasizing the interplay between protein and fragment features during a binding event and the importance of structure-based modelling.

We computed the channels of the 3A4 isoform of cytochrome P450 (CYP) on the basis of 24 crystal structures extracted from the Protein Data Bank (PDB). We identified three major conformations (denoted C, O1 and O2) using an enhanced version of the CCCPP software that we developed for the present work, while only two conformations (C and O2) are considered in the literature. We established a flowchart that defines these three conformations as a function of the structural and physicochemical parameters of the ligand. The channels are characterized with qualitative and quantitative parameters, and not only with their surrounding secondary structures as is usually done in the literature.

In order to obtain a better understanding of why some Jamu formulas can be used to treat a specific disease, we performed metabolomic studies of Jamu by taking into consideration the biologically active compounds existing in plants used as Jamu ingredients. A thorough integration of information from omics is expected to provide solid evidence-based scientific rationales for the development of modern phytomedicines. This study focused on the prediction of Jamu efficacy based on its component metabolites and on the identification of important metabolites related to each efficacy group. Initially, we compared the performance of Support Vector Machines and Random Forest in predicting Jamu efficacy with three different data pre-processing approaches: no filtering, the Single Filtering algorithm, and a combination of the Single Filtering algorithm and feature selection using Regularized Random Forest. Both classifiers performed very well and, according to 5-fold cross-validation results, the mean accuracy of the Support Vector Machine with linear kernel was slightly better than that of Random Forest. It can be concluded that machine learning methods can successfully relate Jamu efficacy to metabolites. In addition, we extended our analysis by identifying important metabolites from the Random Forest model. The inTrees framework was used to extract the rules and to select important metabolites for each efficacy group. Overall, we identified 94 significant metabolites associated with 12 efficacy groups, and many of them were validated by published literature and the KNApSAcK Metabolite Activity database.

Nuclear receptors (NRs) constitute an important class of therapeutic targets. During the last 4 years, we tackled the pharmacological profile assessment of NR ligands, for which we constructed the NRLiSt BDB. We evaluated and compared the performance of different virtual screening approaches: mean of molecular descriptor distribution values, molecular docking and 3D pharmacophore models. The simple comparison of the distribution profiles of 4885 molecular descriptors between the agonist and antagonist datasets did not provide satisfying results. We obtained an overall good performance with the docking method we used, Surflex-Dock, which was able to discriminate agonist from antagonist ligands. However, the availability of PDB structures in the “pharmacological-profile-to-predict-bound-state” (agonist-bound or antagonist-bound) and the availability of enough ligands of both pharmacological profiles limited the generalization of this protocol to all NRs. Finally, the 3D pharmacophore modeling approach allowed us to generate selective agonist pharmacophores and selective antagonist pharmacophores that covered more than 99 % of the whole NRLiSt BDB. This study allowed a better understanding of the pharmacological modulation of NRs with small molecules and could be extended to other therapeutic classes.

In the last decade, many statistical approaches have been developed to improve poor pharmacokinetics (PK) and to reduce the toxicity of lead compounds, which are among the main causes of the high failure rate in drug development. Predictive QSAR models are not always very efficient due to the low number of available biological data and the differences in the experimental protocols. Fortunately, the number of available databases continues to grow every year. However, it remains a challenge to determine the source and the quality of the original data. The main goal is to identify the relevant databases required to generate the most robust predictive models. In this study, an interactive network of databases is proposed to easily find online data sources related to ADME-Tox parameters. In this map, relevant information regarding scope of application, data availability and data redundancy can be obtained for each data source. To illustrate the usage of data mining from the network, a dataset on plasma protein binding (PPB) was assembled from various sources such as the DrugBank, PubChem and ChEMBL databases. A total of 2,606 unique molecules with experimental PPB values were extracted and can constitute a consistent dataset for QSAR modeling.

Catechins and flavonoids are responsible for numerous health benefits. Two of the most representative compounds for their antioxidant and therapeutic effects are epigallocatechin 3-gallate (EGCG), from green tea extracts, and morelloflavone (MF), from Garcinia dulcis. Here we explore, by atomistic Molecular Dynamics simulations, how EGCG and MF interact with lipid bilayers, and we show the influence of salts on their degree of encapsulation in neutral liposomes. We found that EGCG molecules naturally bind to the hydrophilic regions of phospholipids, positioning themselves mostly at the interface between the water and lipid phases. The presence of a salt clearly influences the absorption of EGCG molecules, and the total effect depends strongly on the nature and concentration of the salt. For MF, in contrast, we observed a high stability of the intermolecular MF aggregates in water that strongly penalizes the flavonoid's interaction with the lipid polar heads. However, salts can influence MF's liposomal penetration, even if they are not able to completely promote its absorption inside the bilayer. For both compounds, the increase in penetration is more marked in the presence of magnesium chloride, whilst calcium chloride showed the opposite effect.

In Energy-Based Neural Networks (EBNNs), relationships between variables are captured by means of a scalar function conventionally called “energy”. In this article, we introduce a procedure of “harmony search”, which looks for compounds providing the lowest energies for the EBNNs trained on active compounds. It can be considered as a special kind of similarity search that takes into account regularities in the structures of active compounds. In this paper, we show that harmony search can be used for performing virtual screening. The performance of the harmony search based on two types of EBNNs, the Hopfield Networks (HNs) and the Restricted Boltzmann Machines (RBMs), was compared with the performance of the similarity search based on Tanimoto coefficient with “data fusion”. The AUC measure for ROC curves and 1 %-enrichment rates for 20 targets were used in the benchmarking. Five different scores were computed: the energy for HNs, the free energy and the reconstruction error for RBMs, the mean and the maximum values of Tanimoto coefficients. The performance of the harmony search was shown to be comparable or even superior (significantly for several targets) to the performance of the similarity search. Important advantages of using the harmony search for virtual screening are very high computational efficiency of prediction, the ability to reveal and take into account regularities in active structures, flexibility and interpretability of models, etc.
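
The similarity-search baseline mentioned above can be sketched as follows: a query fingerprint is compared with every known active by the Tanimoto coefficient, and the individual similarities are fused into a single score by taking their mean or their maximum. Representing fingerprints as sets of on-bit indices, and the toy bit sets themselves, are assumptions made for this example.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def fused_scores(query, references):
    """Mean- and max-fusion of Tanimoto similarities to a set of reference actives."""
    sims = [tanimoto(query, ref) for ref in references]
    return {"mean": sum(sims) / len(sims), "max": max(sims)}


# Hypothetical fingerprints: two known actives and one screening compound.
actives = [{1, 2, 3}, {2, 3, 4}]
candidate = {1, 2, 3}
scores = fused_scores(candidate, actives)
```

Ranking a screening library by either fused score gives the kind of baseline against which the energy-based scores are benchmarked.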

Enzyme interactions with ligands are crucial for various biochemical reactions governing life. Over many years, attempts to identify the interacting residues for biotechnological manipulations have been made using experimental and computational techniques. The computational approaches, broadly classified into template-based and de novo methods, have gathered impetus with the growing availability of sequence and structure information. One of the predominant de novo methods using sequence information involves the application of biological properties for supervised machine learning. Here, we propose a support vector machine-based ensemble for the prediction of protein-ligand interacting residues using one of the most important discriminative properties in the interacting residue neighbourhood, i.e., evolutionary information in the form of a position-specific scoring matrix (PSSM). The study has been performed on a non-redundant dataset comprising 9269 interacting and 91773 non-interacting residues for prediction model generation and further evaluation. Of the various PSSM-based models explored, the proposed method named ROBBY (pRediction Of Biologically relevant small molecule Binding residues on enzYmes) shows an accuracy of 84.0 %, a Matthews Correlation Coefficient of 0.343 and an F-measure of 39.0 % on 78 test enzymes. Further, the scope for adding domain knowledge such as pocket information has also been investigated; the results showed a significant enhancement in method precision. We hope these findings will improve the reliability of small-molecule ligand interaction prediction for enzyme applications and drug design.

The enzymatic hydrolysis of chemicals, which is important for in vitro drug metabolism assays, is an important indicator of drug stability profiles during drug discovery and development. Herein, we employed a stepwise feature elimination (SFE) method with nonlinear support vector machine regression (SVR) models to predict the in vitro half-lives of various esters in human plasma/blood. The SVR model was developed using public databases and literature-reported data on the half-lives of esters in human plasma/blood. In particular, the SFE method was developed to prevent overfitting and underfitting in the nonlinear model, and it provided a novel and efficient method of realizing feature combinations and selections to enhance the prediction accuracy. Our final model with 24 features effectively predicted an external validation set obtained with the time-split method, with a reasonably good R2 value (0.6), and also predicted two completely independent validation datasets with R2 values of 0.62 and 0.54; thus, this model performed much better than other prediction models.

Dihydrofolate reductase (DHFR) is an essential enzyme of the folate metabolic pathway in protozoa and it is a validated, potential drug target in many infectious diseases. Information about unique conserved residues of the DHFR enzyme is required to understand residual selectivity of the protozoan DHFR enzyme. Three-dimensional crystal structures are not available for all the protozoan DHFR enzymes. Enzyme-substrate/inhibitor interaction information is required for the binding mode characterization in protozoan DHFR for selective inhibitor design. In this work, multiple sequence analysis was carried out for all the studied species. Homology models were built for the protozoan DHFR enzymes for which 3D structures are not available in the PDB. Molecular docking and Prime-MMGBSA calculations of the natural substrate (dihydrofolate, DHF) and a classical DHFR inhibitor (methotrexate, MTX) were performed in protozoan DHFR enzymes. Comparative sequence analysis showed overall sequence identities between the studied species ranging from 22.94 % (CfDHFR-BgDHFR) to 94.61 % (LdDHFR-LmDHFR). Interestingly, most of the active site residues were conserved in all cases and all the enzymes exhibited similar key binding interactions with DHF and MTX in the molecular docking analysis, but a few key binding residues differ between protozoan species, which makes them suitable for target selectivity. This information can be used to design selective and potent protozoan DHFR enzyme inhibitors.

The response regulator PhoP is part of the PhoP/PhoQ two-component system, which is responsible for regulating the expression of multiple genes involved in controlling virulence, biofilm formation, and resistance to antimicrobial peptides. Therefore, modulating the transcriptional function of the PhoP protein is a promising strategy for developing new antimicrobial agents. There is evidence suggesting that phosphorylation-mediated dimerization in the regulatory domain of PhoP is essential for its transcriptional function. Disruption or stabilization of protein-protein interactions at the dimerization interface may inhibit or enhance the expression of PhoP-dependent genes. In this study, we performed molecular dynamics simulations on the active and inactive dimers and monomers of the PhoP regulatory domains, followed by pocket-detecting screenings and a quantitative hot-spot analysis in order to assess the druggability of the protein. Consistent with the prior hypothesis, the calculation of the binding free energy shows that phosphorylation enhances dimerization of PhoP. Furthermore, we have identified two different putative binding sites at the dimerization active site (the α4-β5-α5 face) with energetic “hot-spot” areas, which could be used to search for modulators of protein-protein interactions. This study delivers insight into the dynamics and druggability of the dimerization interface of the PhoP regulatory domain, and may serve as a basis for the rational identification of new antimicrobial drugs.

This article introduces a new type of structural fragment called a geometrical pattern. Such geometrical patterns are defined as molecular graphs that include a labelling of atoms together with constraints on interatomic distances. The discovery of geometrical patterns in a chemical dataset relies on the induction of multiple decision trees combined in random forests. Each computational step corresponds to a refinement of a preceding set of constraints, extending a previous geometrical pattern. This paper focuses on the mutagenicity of chemicals via the definition of structural alerts in relation to these geometrical patterns. An experimental assessment of the main geometrical patterns follows, showing how they can efficiently give rise to the definition of a chemical feature related to a chemical function or a chemical property. Geometrical patterns provide a valuable and innovative approach that brings new pieces of information for discovering and assessing structural characteristics in relation to a particular biological phenotype.

Over the past decades, virtual screening has proved itself to be a valuable asset for identifying new bioactive compounds. The vast majority of commonly used techniques can be described in three steps: pre-processing the dataset, i.e., small molecules (ligands) and possibly larger ones (receptors), executing the method, and finally analysing the results. Hence, the preparation of ligands is a critical step for the success of commonly used virtual screening approaches such as protein-ligand docking, similarity search or pharmacophore search. We present here a new workflow, VSPrep, for the pre-processing of small molecules; it is based on tools that are freely accessible to academics and is integrated within the KNIME platform. It can be used to perform several chemoinformatics tasks such as molecular database cleaning, tautomer and stereoisomer enumeration, focused library design and conformer generation. Additionally, graphical reports of the results are provided to the user as a convenient analysis tool.

Clustering 16S rRNA sequences into operational taxonomic units (OTUs) is a crucial step in analyzing metagenomic data. Although many methods have been developed, obtaining an appropriate balance between clustering accuracy and computational efficiency is still a major challenge. A novel density-based modularity clustering method, called DMclust, is proposed in this paper to bin 16S rRNA sequences into OTUs with high clustering accuracy. The DMclust algorithm consists of four main phases. It first searches for sequence dense groups, each defined as an n-sequence community in which the distance between any two sequences is less than a threshold. These dense groups are then used to construct a weighted network, where dense groups are viewed as nodes, each pair of dense groups is connected by an edge, and the distance between the two groups represents the weight of the edge. Next, a modularity-based community detection method is employed to generate the preclusters. Finally, the remaining sequences are assigned to their nearest preclusters to form OTUs. Compared with existing widely used methods, the experimental results on several metagenomic datasets show that DMclust achieves higher clustering accuracy with acceptable memory usage.
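
To make the first and last phases more concrete, here is a heavily simplified sketch. It greedily builds dense groups in which every pair of sequences is closer than a threshold, then assigns leftover sequences to the nearest group. The network construction and modularity-based community detection phases are deliberately omitted, so dense groups stand in for preclusters here. The mismatch-fraction distance on pre-aligned, equal-length sequences, the greedy seeding order, the `min_size` parameter and the mean-distance assignment rule are all assumptions of this example, not details taken from DMclust.

```python
def mismatch_distance(a, b):
    """Fraction of differing positions between two equal-length aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def dense_groups(seqs, threshold, min_size):
    """Greedily grow groups in which every pairwise distance is below threshold.

    Returns the groups (lists of sequence indices) and the leftover indices.
    """
    groups, used = [], set()
    for i in range(len(seqs)):
        if i in used:
            continue
        group = [i]
        for j in range(i + 1, len(seqs)):
            if j in used:
                continue
            if all(mismatch_distance(seqs[j], seqs[k]) < threshold for k in group):
                group.append(j)
        if len(group) >= min_size:
            groups.append(group)
            used.update(group)
    remaining = [i for i in range(len(seqs)) if i not in used]
    return groups, remaining


def assign_remaining(seqs, groups, remaining):
    """Attach each leftover sequence to the group with the smallest mean distance."""
    clusters = [list(g) for g in groups]
    for i in remaining:
        nearest = min(
            range(len(groups)),
            key=lambda g: sum(mismatch_distance(seqs[i], seqs[k]) for k in groups[g])
            / len(groups[g]),
        )
        clusters[nearest].append(i)
    return clusters


# Hypothetical toy "sequences" (real 16S reads are far longer).
toy = ["AAAA", "AAAT", "AAAA", "CCCC", "CCCG", "CCCC", "AATT"]
groups, leftover = dense_groups(toy, threshold=0.3, min_size=3)
otus = assign_remaining(toy, groups, leftover)
```

On the toy input, two dense groups are found and the single leftover sequence joins the group it is closest to on average.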

Natural product chemistry began in Reims, France, in a pharmacognosy research laboratory whose main emphasis was the isolation and identification of bioactive molecules, following the guidelines of chemotaxonomy. The structure elucidation of new compounds of steadily increasing complexity favored the emergence of methodological work in nuclear magnetic resonance. As a result, our group was the first to report the use of proton-detected heteronuclear chemical shift correlation spectra for the computer-assisted structure elucidation of small organic molecules driven by atom proximity relationships and without relying on databases. The early detection of known compounds appeared as a necessity in order to deal more efficiently with complex plant extracts. This goal was reached by an original combination of mixture fractionation by centrifugal partition chromatography, analysis by 13C NMR, digital data reduction and alignment, hierarchical data clustering, and computer database search.

Herein, Generative Topographic Mapping (GTM) was challenged to produce planar projections of the high-dimensional conformational space of complex molecules (the 1LE1 peptide). GTM is a probability-based mapping strategy, and its capacity to support property prediction models serves to objectively assess map quality (in terms of regression statistics). The properties to predict were total, non-bonded and contact energies, surface area and fingerprint darkness. Map building and selection was controlled by a previously introduced evolutionary strategy, which was allowed to choose the best-suited conformational descriptors, with options including classical terms and novel atom-centric autocorrelograms. The latter condense interatomic distance patterns into descriptors of rather low dimensionality, yet precise enough to differentiate between close favorable contacts and atom clashes. A subset of 20 K conformers of the 1LE1 peptide, randomly selected from a pool of 2 M geometries (generated by the S4MPLE tool), was employed for map building and cross-validation of property regression models. The GTM build-up challenge reached robust three-fold cross-validated determination coefficients of Q2=0.7…0.8 for all modeled properties. Mapping of the full 2 M conformer set produced intuitive and information-rich property landscapes. Functional and folding subspaces appear as well-separated zones, even though RMSD with respect to the PDB structure was never used as a selection criterion of the maps.

The objective of the present paper is to summarize chemoinformatics-based research, and more precisely the development of quantitative structure-property relationships, performed at IFP Energies nouvelles (IFPEN) during the last decade. A special focus is placed on research activities performed in the “Thermodynamics and Molecular Simulation” department, i. e. the use of multiscale molecular simulation methods in response to projects. Molecular simulation techniques can be used to supplement datasets when experimental information is lacking; the review therefore includes a section dedicated to molecular simulation codes, the development of intermolecular potentials, and some of their possible applications. Know-how and feedback from our experience in applying machine learning to thermophysical property prediction are included in a section dealing with methodological aspects. The generic character of chemoinformatics is emphasized through applications in the fields of energy, transport, and environment, with illustrations for three IFPEN business units: “Transports”, “Energy Resources”, and “Processes”. More precisely, the review focuses on different challenges such as the prediction of properties for alternative fuels, the prediction of fuel compatibility with polymeric materials, the prediction of properties for surfactants usable in chemical enhanced oil recovery, and the prediction of guest-host interactions between gases and nanoporous materials in the context of carbon dioxide capture or gas separation activities.

Quantitative structure-property relationships represent an alternative to experiments for estimating the physico-chemical properties of chemicals, both for screening purposes at the R&D level and for filling data gaps in a regulatory context. In particular, such predictions were encouraged by the REACH regulation for the collection of data, provided that the models are developed in accordance with the rigorous validation principles proposed by the OECD. In this context, a series of organic peroxides, unstable chemicals which can easily decompose and may lead to explosion, were investigated to develop simple QSPR models that can be used in a regulatory framework. Only constitutional and topological descriptors were employed to build QSPR models predicting the heat of decomposition, so that the models could be used without any time-consuming preliminary structure calculations at the quantum chemical level. To validate the models, the original experimental dataset was divided into a training set and a validation set according to two partitioning methods, one based on the property value and the other based on the structure of the molecules by means of PCA. Four QSPR models were developed, depending on the type of descriptors and the partitioning method. The two models derived from the PCA-based method were highlighted because they presented good predictive power and are easier to apply than our previous quantum chemical based model, since they do not need any preliminary calculations.
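
The abstract does not say how the property-based partition was carried out. One common scheme, assumed here purely for illustration, ranks the molecules by property value and sends every k-th ranked molecule to the validation set, so that both sets span a similar range. The function name, the `every` parameter and the toy values are hypothetical.

```python
def split_by_property(values, every=4):
    """Return (training_indices, validation_indices).

    Samples are ranked by property value. Starting from the second-lowest,
    every `every`-th ranked sample goes to the validation set, so the
    lowest value always stays in the training set.
    """
    ranked = sorted(range(len(values)), key=lambda i: values[i])
    validation = [idx for rank, idx in enumerate(ranked) if rank % every == 1]
    training = [idx for rank, idx in enumerate(ranked) if rank % every != 1]
    return training, validation


# Toy property values (hypothetical, not taken from the peroxide dataset).
toy_values = [50, 10, 40, 20, 30, 80, 70, 60]
train_idx, valid_idx = split_by_property(toy_values, every=4)
print(valid_idx)  # [3, 7]
```

A structure-based partition would apply the same idea to coordinates in descriptor space (for instance PCA scores) instead of the property value.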

Some major protein families, such as carbonic anhydrases (CAs), have a conical cavity at the active site. No algorithm was available to compute conical cavities, so we needed to design one. The fast algorithm we designed let us show, on a set of 717 CAs extracted from the PDB database, that γ-CAs are characterized by active site cavity cone angles significantly larger than those of α-CAs and β-CAs: the generatrix-axis angles are greater than 60° for the γ-CAs while they are smaller than 50° for the other CAs. Free binaries of the CONICA software implementing the algorithm are available through a software repository at http://petitjeanmichel.free.fr/itoweb.petitjean.freeware.html
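
To make the generatrix-axis angle concrete, the sketch below computes the widest cone with a fixed apex and fixed axis that contains no atom centre in its interior. This is not the CONICA algorithm, which must also locate the apex and axis. It only illustrates the geometric quantity, under the assumptions that atoms are treated as points (radii ignored) and that no atom coincides with the apex. All names and coordinates are hypothetical.

```python
import math


def angle_deg(u, v):
    """Angle in degrees between two non-zero 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    cos_t = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_t))


def largest_empty_half_angle(apex, axis, atoms):
    """Generatrix-axis angle of the widest atom-free cone for a given apex and axis.

    The cone can open until its surface touches the atom centre whose
    direction from the apex makes the smallest angle with the axis.
    """
    return min(
        angle_deg(axis, tuple(p - a for p, a in zip(atom, apex)))
        for atom in atoms
    )


# Toy coordinates (hypothetical): the first atom lies 45 degrees off the axis.
apex = (0.0, 0.0, 0.0)
axis = (0.0, 0.0, 1.0)
atoms = [(1.0, 0.0, 1.0), (0.0, 2.0, 1.0), (-1.0, 0.0, 0.5)]
print(largest_empty_half_angle(apex, axis, atoms))
```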

Our research and teaching group, MTi (Molécules Thérapeutiques in silico), has developed numerous applications in the fields of chemoinformatics, structural bioinformatics and drug design, available online through the RPBS platform (Ressource Parisienne en Bioinformatique Structurale). Since its opening in 2009, over 200 articles and reviews have been published, covering virtual screening studies, prediction of druggability, analysis of protein-protein interaction inhibitors, development of databases, data mining and knowledge discovery, as well as combined in silico-in vitro work to search for new hits and chemical probes acting on original targets in several therapeutic areas. An international training program in in silico drug design has also been developed. In this review, we present some tools developed in our laboratory, with a special emphasis on the prediction of some ADMET properties, compound collection preparation and 3D-ADMET computations.

3D-QSAR, molecular docking and activity evaluation were used to study the bioactivities of ACE-inhibitory peptides with a C-terminal phenylalanine. Both CoMFA (Q2=0.773, R2=0.992) and CoMSIA (Q2=0.664, R2=0.990) models were constructed. Based on the established models, four novel potent ACE-inhibitory tripeptides, GEF, VEF, VRF, and VKF, were synthesized. Their IC50 values were determined by in vitro evaluation to be 13 μM, 23 μM, 5 μM, and 11 μM, respectively. The results show good agreement with the predicted values. The established models play an important role in revealing the structure-activity relationship of ACE-inhibitory peptides and in designing novel peptides with enhanced biological activity.

Small molecules interact with their protein target at surface cavities known as binding pockets. Pocket-based approaches are very useful in all phases of drug design. Their first step is estimating the binding pocket from the protein structure. The available pocket-estimation methods produce different pockets for the same target. The aim of this work is to investigate the effects of different pocket-estimation methods on the results of pocket-based approaches. We focused on the effect of three pocket-estimation methods on a pocket-ligand (PL) classification. This pocket-based approach is useful for understanding the correspondence between the pocket and ligand spaces and for developing pharmacological profiling models. We found that pocket-estimation methods yield different binding pockets in terms of boundaries and properties. These differences are responsible for the variation in the PL classification results, which can affect the detected correspondence between pocket and ligand profiles. Thus, we highlight the importance of the choice of pocket-estimation method in pocket-based approaches.

Protein interactions (PI) underlie complex biological processes. Protein interaction partners include DNA, RNA, ions, small chemical compounds, and proteins (protein-protein interactions; PPI). Analysis of sequence variants within regions corresponding to experimentally validated PI sites presents novel opportunities for understanding complex diseases. Such information has not been systematically collected because the datasets are dispersed throughout databases and publications. Sequence variants and PI regions were obtained from the UniProt database. The location of the variants was compared to the start and end positions of each PPI. Associations of sequence variants with phenotype were obtained from databases including COSMIC, GAD, PharmGKB, and dbSNP. We developed a catalogue of 603 sequence variants located within regions corresponding to experimentally validated PI sites, mostly PPI regions. These sequence variants were previously associated with risk for cancer, reproduction, ageing, renal, and immune system diseases. The developed catalogue connects information from different research papers and databases, represents a new layer of information, and enables the design of new hypotheses. It provides a baseline for prioritization of sequence variants that may affect protein function and binding sites. The study contributes to the development of the proteogenomics field and provides new insights for understanding molecular mechanisms underlying disease development.
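
The comparison of variant locations with the start and end positions of each interaction region amounts to an interval-containment test. The sketch below shows one way this could look. It assumes 1-based inclusive residue coordinates and requires the variant and the region to belong to the same protein. The record layout, identifiers and positions are hypothetical and are not taken from UniProt.

```python
def variants_in_regions(variants, regions):
    """Return (variant_id, region_id) pairs where the variant lies in the region.

    variants: iterable of (protein, variant_id, position)
    regions:  iterable of (protein, region_id, start, end), 1-based inclusive
    """
    hits = []
    for v_protein, variant_id, position in variants:
        for r_protein, region_id, start, end in regions:
            # Only compare coordinates that refer to the same protein sequence.
            if v_protein == r_protein and start <= position <= end:
                hits.append((variant_id, region_id))
    return hits


# Placeholder records (hypothetical identifiers and coordinates).
toy_variants = [("PROT_A", "v1", 15), ("PROT_A", "v2", 40), ("PROT_B", "v3", 15)]
toy_regions = [("PROT_A", "r1", 10, 20), ("PROT_B", "r2", 30, 50)]
print(variants_in_regions(toy_variants, toy_regions))  # [('v1', 'r1')]
```

The nested loop is quadratic. For large variant sets, grouping regions by protein in a dictionary first would avoid most of the comparisons.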

A data mining approach is proposed as a useful tool for analysing the control parameters of the 3-stage CIGSe photovoltaic cell production process, in order to find the variables that are most relevant to the electrical parameters and efficiency of the cells. The analysed data set consists of stage duration times, heater power values, and temperatures for the element sources and the substrate – there are 14 variables per sample in total. The most relevant variables of the process were identified by random forest analysis using the Boruta algorithm. 118 CIGSe samples, prepared at Institut des Matériaux Jean Rouxel, were analysed. The results are consistent with experimental knowledge of the CIGSe cell production process. They provide new evidence to guide the choice of production parameters for new cells and further research.