Technical Program

With the increasing availability of genomic information and biological expression data, one of the current challenges in drug discovery is linking biological pathway data with small-molecule drug data. How can drug pathway target information and metabolic pathway information be linked to small ligand information? These are some of the issues and questions addressed by the CINF symposium “Informatics and Chemical Biology: Identifying Targets and Biological Pathways”, at the Fall 2017 ACS Meeting in Washington, DC.

David Sheen of NIST (National Institute of Standards and Technology) started off the symposium by discussing incompatibilities among metabolic data reported by different groups, and the need for improved databases and data harmonization methods to address the varying uncertainties in reported experimental metabolic data. This will enable comparison of data from different laboratories and sources. NIST maintains http://qmet.nist.gov, its quality assurance program in metabolomics, to encourage the exchange of spectral data and the comparison of measurement uncertainties, and is conducting interlaboratory comparison studies. Reproducibility analysis for spectral data is an issue as well.

Dr. Karina Martinez Mayorga, Instituto de Química, UNAM, reported using the PLIF (Chemical Computing Group Protein Ligand Interaction Fingerprints) method for screening biased ligands for opioid receptors. Opioid receptors are members of the G-protein coupled receptor (GPCR) family, which comprises approximately 800 receptors, and they are significant targets for pain management. Databases of these interaction fingerprints, combined with methodologies to identify structural traits of selective agonists, will lead to the successful development of drugs with fewer side effects.

Dr. Doug Selinger, Plex (http://www.plexresearch.com), discussed the development of a search engine for chemical biology and drug discovery. The search engine begins with a query molecule and expands to more compounds with similar chemical structures and biological transcriptional profiles. Compound-compound and compound-target relationships are used in the search algorithms to rank compounds and targets. Data sources include Open Targets, PubChem, Entrez Gene, chemical similarity, and ChEMBL bioactivities. As a search engine, Plex searches data (1.7 billion rows), not Web pages. It can search compounds, targets, or pathways; InChIs and SMILES can be entered, and structures drawn, directly in the search bar. The more datasets included in the search engine, the better it gets at providing answers.
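Similarity-based expansion of this kind is commonly built on Tanimoto comparison of fingerprints. The sketch below (plain Python, with invented compound names and fingerprint bits; it is not Plex's algorithm) shows how a query compound can be expanded into a ranked list of similar library compounds:

```python
# Illustrative sketch: expand a query compound to structurally similar
# neighbours by Tanimoto similarity on bit-set fingerprints, then rank.
# Compound names and fingerprint bits are invented.

def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient between two fingerprints stored as bit sets."""
    if not fp_a and not fp_b:
        return 0.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

def expand_query(query_fp: set, library: dict, threshold: float = 0.5):
    """Return library compounds similar to the query, best first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    return sorted(
        [(n, s) for n, s in scored if s >= threshold],
        key=lambda pair: pair[1], reverse=True,
    )

# Toy library: compound name -> set of "on" fingerprint bits
library = {
    "analog-1": {1, 2, 3, 4},
    "analog-2": {1, 2, 7, 8},
    "unrelated": {20, 21, 22},
}
hits = expand_query({1, 2, 3, 5}, library)
```

A real system would use hashed structural fingerprints and would combine this chemical ranking with biological-profile similarity and compound-target relationships.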

Dr. Anne Wassermann, Merck Informatics, discussed chemical probe databases: libraries of small molecules with known targets, which permit the development of correlations between chemical and mechanistic properties. She discussed generating target hypotheses for molecules through the use of biologically annotated libraries. The Chemical Probes Portal (http://chemicalprobes.org) is one example of a publicly available database of probes. Merck is working on Web applications that can be used to relate phenotypes, protein targets, and biological pathways.

Safety and toxicity are among the most significant drug development issues. Matthew Clark, Elsevier, discussed the development of bioassays as predictors of adverse events in clinical trials. FDA submissions, a large number of journals, and Open PHACTS were used as data sources in a search for relationships between bioactivity and toxicity, the goal being the development of methods that corroborate evidence from pathway analysis for the prediction of important targets.

The session concluded with presentations by two groups on deep learning neural network (DNN) applications for small-molecule drug discovery. Dr. Abraham Heifets, Atomwise, http://www.atomwise.com/, gave a presentation on developing predictive models for drug mechanisms of action using deep convolutional neural networks. Deep neural networks are constrained neural networks. AtomNet is a structure-based DNN for molecular bioactivity prediction, which uses a nearest-neighbor structure-based binding algorithm. Atomwise is working on developing these methods, and Abraham presented some results and benchmarks based on their efforts. Antonio de la Vega de Leon, The University of Sheffield, gave a presentation on using deep neural networks to predict a compound's activity in a specific screen, and to suggest which target the compound hits. The machine learning algorithm was based on assay descriptions, biological pathway data, and ChEMBL bioactivity data.

Neural network methods show significant promise in their ability to make extensions and predictions based on learning sets and data. With more data available, improved learning sets, and better algorithms to develop correlations, predictions of phenotype, pathways, and targets with small molecule structure and chemical properties data will greatly improve.

Collaborating for Success: Professional Skills Development for Undergraduates, Graduates, and Post-Docs

Employers in every sector seek to hire employees who have a variety of skills and talents. Though there are not always standard definitions, effective communication, critical thinking, creativity, initiative, and adaptability continually rank high in surveys of the skills employers seek. However, these skills are not always part of the academic experience. With only a small percentage of STEM graduates securing tenure-track positions, expanding training to cover these areas can have a great impact on graduates' careers as STEM professionals.

The symposium “Collaborating for Success: Professional Skills Development for Undergraduates, Graduates and Post-Docs” took place Monday, August 21, 2017. It explored the professional development needs of undergraduates, graduate students, and postdoctoral researchers in chemistry and other STEM fields. It also revolved around ways that institutions, graduate programs, funders, professional societies, and libraries are contributing to their success.

Several talks focused on the professional development needs of graduate students from a programmatic approach. Laura Regassa and Nadeene Riddick of the National Science Foundation (NSF) spoke about the NSF Research Traineeship (NRT) program, which encourages new models in STEM graduate education. They discussed the professional development skills most often addressed in NRT projects: science communication, including oral, written, and digital communication; mentoring (both faculty and student mentoring); career preparation, such as internships, networking, and career paths; and research ethics, including responsible conduct of research and ethics of data acquisition and management.

Also from a broad perspective, David Zwicky described a needs assessment project aimed at understanding how to support graduate students at Purdue University. The overall needs that surfaced were: 1) professional development, including teaching, building professional identity, coding, communicating professionally, and project management; 2) spaces, such as spaces for collaboration and research; and 3) information resources, data services, and software.

Rigoberto Hernandez of Johns Hopkins University addressed the topic of diversity and equity in chemistry departments, and talked about the Open Chemistry Collaborative in Diversity and Equity (OXIDE). OXIDE works with department chairs to reduce inequitable diversity barriers to career advancement through National Diversity Equity Workshops (NDEW) that facilitate discussion between department chairs, federal agency representatives, and diversity policy leaders.

Danielle Watt of the Chemistry at the Space-Time Limit (CASTL) Center, one of the NSF Centers for Chemical Innovation (CCI), discussed how CCIs train STEM students in leadership. Danielle described professional development needs identified by trainees and how they are addressing those needs. Specific examples include training in innovation, collaboration, and effective communication.

The symposium also focused on the professional development needs of undergraduate students. Thomas Wenzel of Bates College, who chairs the ACS Committee on Professional Training (CPT), addressed the importance of skills development in the ACS-certified bachelor's degree in chemistry. The 2015 guidelines state that "programs must provide experiences that go beyond chemistry content knowledge to develop competence in other critical skills necessary for a professional chemist". These skills include problem solving, chemical literature and information management, laboratory safety, teamwork, communication, and ethics.

The symposium balanced these broad, programmatic perspectives by including talks that described examples of developing specific skills. Donna Wrublewski of Caltech discussed her experience organizing Data Carpentry workshops to teach programming skills to scientists and engineers. Ron Kaminecki focused on the development of courses on patent information research and analysis to equip students who have a scientific background with practical skills in patent research. Svetla Baykoucheva of the University of Maryland described the implementation and assessment of a program aimed at helping students develop information literacy skills, including finding, managing, and sharing scientific information. From a database provider's perspective, Mindy Pozenel from Chemical Abstracts Service described their recently created Chemical Class Advantage (CCA) modules for instructors to use in organic chemistry courses. These modules encourage students to use SciFinder to discover the scientific literature, and provide opportunities for students to demonstrate, through content quizzes, that they have read and understood the articles. Megan Sheffield (Clemson University) and Marguerite Savidakis-Dunn (Shippensburg University) spoke about developing data management skills in chemists. Rachel Borchardt of American University described the major metrics available to chemists, including journal Impact Factors, citation distributions, and altmetrics, and discussed the importance of mastering those metrics to influence the research evaluation narrative.
Along the same lines, Antony Williams of EPA talked about the importance of creating an online presence, and the free tools available for that purpose; specific examples include LinkedIn; SlideShare and Google Scholar to track publications and citations; ResearchGate for networking and citation tracking; Publons for getting credit for reviewing papers; Kudos; Figshare; and altmetrics tools such as ImpactStory and Altmetric.

Developing safety skills was addressed in presentations by Joseph Pickel of Oak Ridge National Laboratory and Samuella Sigmann of Appalachian State University. Joe described his experience transitioning to a safety officer position after a career as a scientist, and discussed the skills required in a research operations position. Sammye spoke about embedding safety professionals to engage with faculty and help educate undergraduate students and develop their critical thinking skills.

The importance of communication skills also featured prominently in some of the presentations. Christin Monroe of Princeton University discussed the Science Communication Education Network (SCENe), a collaboration with the NSF-funded communication program Portal to the Public (PoP) National Network. The program seeks to develop the communication skills of scientists, including their ability to engage with different audiences, and to build their confidence as communicators. Kiyomi Deards of the University of Nebraska-Lincoln reported several ways to engage with a wide audience and facilitate broader impacts; high-commitment examples that Kiyomi described include Sci Pop talks and partnering with the Undergraduate Research Council to showcase undergraduates' research and creative work.

Finally, expanding career opportunities for STEM graduates was also a recurrent theme of the sessions. Amy Clobes of the University of Virginia and Natalie Lundsteen of The University of Texas Southwestern Medical Center described the work of the Graduate Career Consortium (GCC), which helps members provide career and professional development for doctoral students and postdoctoral scholars. They discussed several specific resources for occupational exploration, and also ways to incorporate these tools into the work of librarians, including engaging with campus partners or collaborating with the GCC. There were also two talks from professional societies with a careers focus. Shannon O'Reilly of the ACS discussed how the ACS on Campus program has evolved since 2010 to meet the career and professional development needs of students and faculty in the chemical sciences. The program has become increasingly modularized and has expanded to international audiences. Scott Nichols of AAAS gave an overview of AAAS Professional Development & Career Services, including myIDP, an individual development plan to help explore career possibilities and set goals, and the AAAS Career Development Center resources.

To conclude, the symposium offered a nearly comprehensive overview of the many approaches that contribute to the professional development of STEM students and graduates. The combination of programmatic approaches and case studies focused on specific skills was particularly enriching, and encouraged very positive engagement among the speakers and the audience.

What do synthetic chemists want from their reaction systems?

David Evans and I organized a CINF symposium at the fall 2017 ACS national meeting. We had sought talks on progress in reaction searching, reaction planning, synthesis design, retrosynthesis, and reaction prediction. We would really have liked contributions from practicing synthetic chemists on their current needs, both met and unmet, and their frustrations with current systems, but no end users volunteered. Nevertheless, it was an interesting symposium and I have received positive feedback.

Academic research

Connor Coley of MIT was the first speaker. A critical challenge for computer-assisted synthesis design is that the reaction steps proposed may fail when attempted in the laboratory. The true measure of success for any synthesis program is whether the predicted outcome matches what is observed experimentally. Connor and his co-workers have trained a neural network model on experimental data from the USPTO and Reaxys to provide qualitative predictions of organic reaction outcomes in silico. In this method reaction databases are supplemented with chemically plausible negative reaction examples to overcome the literature bias towards successful reactions. Traditional reaction templates are used to generate a list of candidate outcomes for the machine learning model to score, so reactivity rules are implicitly learned rather than encoded. A new, edit-based reaction representation has been developed to focus on the fundamental transformation at the reaction site. In a 5-fold cross-validation, the trained model assigns the major product rank 1 in 71.8% of cases, rank ≤ 3 in 86.7% of cases, and rank ≤ 5 in 90.8% of cases.1 Connor presented some correct and incorrect predictions. Mispredictions are often chemically reasonable or attributable to data quality issues. Extension of the method to condition-dependent predictions achieves similar performance, but conditions are rarely necessary to make the prediction. Multi-step pathway planning remains challenging.
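The pipeline Connor described, enumerate candidate outcomes with templates, score them with a model, then check where the recorded product ranks, can be caricatured in a few lines. The scoring function and SMILES strings below are invented stand-ins for the trained neural network and real reaction data:

```python
# Toy sketch of "enumerate candidates, score, then rank" evaluation.
# The scorer below is a stand-in for the trained neural network.

def rank_of_true_product(candidates, true_product, score):
    """Rank (1-based) of the recorded product among scored candidates."""
    ordered = sorted(candidates, key=score, reverse=True)
    return ordered.index(true_product) + 1

def top_k_accuracy(examples, score, k):
    """Fraction of reactions whose true product ranks in the top k."""
    good = sum(
        1 for cands, true in examples
        if rank_of_true_product(cands, true, score) <= k
    )
    return good / len(examples)

# Invented scorer: pretend longer SMILES strings are more plausible.
score = len
examples = [
    (["CCO", "CC", "C"], "CCO"),   # true product ranked 1
    (["CCCC", "CC", "C"], "CC"),   # true product ranked 2
]
acc1 = top_k_accuracy(examples, score, k=1)
acc3 = top_k_accuracy(examples, score, k=3)
```

The reported rank-1 (71.8%), rank ≤ 3 (86.7%), and rank ≤ 5 (90.8%) figures are exactly this kind of top-k accuracy, computed over the cross-validation folds.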

Mark Waller of the University of Muenster and Shanghai University has also used neural networks, but in this case deep neural networks, in both retrosynthesis and reaction prediction.2,3 The machine is trained with essentially the complete published knowledge of organic chemistry (more than 3.5 million reactions acquired from the Reaxys database). Circular fingerprints are used to represent the structures. Training can be carried out overnight with GPUs, and retraining can be carried out weekly. The approach has a higher than 95% accuracy when allowed to suggest up to 10 different routes for a target molecule on a test set of around one million reactions. Deep learning is 150 times faster than a rule-based approach, so handling multistep syntheses becomes feasible. Furthermore, preliminary studies indicate that coupling the neural networks with Monte Carlo tree search techniques outperforms traditional computational synthesis planning with hand-coded transformations.4,5
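Circular fingerprints of the kind used as the input representation hash atom environments of growing radius into a fixed-length bit vector. The miniature version below, operating on a toy adjacency-list "molecule", is only meant to convey the idea and does not reproduce any production fingerprint:

```python
import hashlib

def circular_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """Hash atom environments of increasing radius into a bit set.

    atoms: list of element symbols; bonds: dict atom index -> neighbours.
    """
    env = list(atoms)                    # radius-0 environments
    bits = set()
    for _ in range(radius + 1):
        for e in env:
            h = int(hashlib.sha256(e.encode()).hexdigest(), 16)
            bits.add(h % n_bits)
        # grow each environment by appending sorted neighbour environments
        env = [
            env[i] + "(" + ",".join(sorted(env[j] for j in bonds[i])) + ")"
            for i in range(len(atoms))
        ]
    return bits

# Ethanol-like toy graph: C-C-O
atoms = ["C", "C", "O"]
bonds = {0: [1], 1: [0, 2], 2: [1]}
fp = circular_fingerprint(atoms, bonds)
```

Because the same environment always hashes to the same bit, identical substructures in different molecules light up the same positions, which is what lets a neural network generalize across them.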

The international chemical identifier for reactions

The next two talks concerned the International Chemical Identifier for Reactions (RInChI). Gerd Blanke of StructurePendium Technologies explained that the RInChI is a single string providing a unique representation of a reaction, independent of how the reaction has been drawn. The Long-RInChIKey is calculated from the IUPAC International Chemical Identifiers (InChIs) of each reactant, product, and agent. The Short-RInChIKey is a fixed-length hash over all reactants, products, and agents. The Web-RInChIKey is a fixed-length hash developed from the reaction components, ignoring each component's specific role within the reaction.

Long-RInChIKeys are valuable for the storage of reactions. They allow uniqueness checks, and the identification of each reaction component by simple text searches based on Standard InChIKeys, but they do not have a fixed length. Short-RInChIKey has a fixed length of 55 letters, plus 8 hyphens as separators. The fixed length of Short-RInChIKey makes it suitable for exact searches of reactions in databases (and on the Web), indexing reactions in databases, and linking identical reactions in different databases. Web-RInChIKey allows for the fact that the depiction of a chemical reaction is not uniquely defined. For Web-RInChIKey, all InChIs of the reaction components are ordered alphabetically. Roles of the components are ignored. The Web-RInChIKey has a fixed length of 47 characters, with 17 letters in the major layer, and 15 letters in the minor layer. It is used for searches over reaction databases with an unknown drawing model, and comparison of reaction databases with different data models. The longer string sets for the major and minor layers make searches over the Web more precise. The first RInChI release was in March 2017. The InChI and RInChI formats and algorithms are non-proprietary, and the software is open source. RInChIs for 4.5 million reactions from the SPRESI database have been generated by InfoChem: only 239 reactions could not be converted.
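The design trade-off among the three keys can be illustrated with ordinary hashing. The sketch below is not the official RInChI algorithm (which builds on Standard InChIKeys and its own layer rules); it only demonstrates the three choices described above: concatenated per-component keys stay text-searchable but have variable length, while a single hash gives a fixed length, either keeping or discarding role information:

```python
import hashlib

def component_key(inchi: str) -> str:
    """Stand-in for a Standard InChIKey: a truncated hash of one InChI."""
    return hashlib.sha256(inchi.encode()).hexdigest()[:14].upper()

def long_key(reactants, products, agents):
    """Concatenate per-component keys; each component stays text-searchable."""
    parts = ["--".join(component_key(c) for c in group)
             for group in (reactants, products, agents)]
    return "Long-" + "<>".join(parts)

def short_key(reactants, products, agents):
    """One fixed-length hash over all components, keeping role information."""
    payload = "|".join("+".join(sorted(g))
                       for g in (reactants, products, agents))
    return "Short-" + hashlib.sha256(payload.encode()).hexdigest()[:20].upper()

def web_key(reactants, products, agents):
    """Fixed-length hash that ignores the role of each component."""
    payload = "+".join(sorted(reactants + products + agents))
    return "Web-" + hashlib.sha256(payload.encode()).hexdigest()[:20].upper()
```

Note that swapping reactants and products changes the role-aware short key but not the role-blind web key, which is why the latter suits searches over databases with unknown drawing models.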

Jonathan Goodman of the University of Cambridge started his talk with an example of an in silico inspired6 total synthesis of (-)-Dolabriferol.7 Synthetic chemists want data that are accessible, comprehensive, and reliable. InChIs are successful because people use them. Can RInChI be useful too? A good synthesis uses cheap, sustainable, and reproducible starting materials; has low hazards; produces low waste products; uses familiar reactions, and chemists’ expertise; has no inseparable by-products; gives high yields and high stereoselectivity; uses convenient processes; makes a product quickly, cheaply, and reproducibly; and is suitable for making analogues.

Jonathan believes that to achieve a good synthesis, we need to understand our reactions, to make best use of our analytical data, to search the literature effectively, and to store our results, so we, and others, can make best use of this knowledge for the next project and the next molecule. The contributions of Jonathan’s team to experimental chemistry, computational chemistry, and chemical informatics have helped advance all of these areas. Jonathan presented some examples of work that his team has done on the automatic generation of diastereomers using InChI strings,8 prediction of stereochemistry,9 the conformational properties of a polypeptide,10 and the risk assessment of chemicals.11 It is desirable to bring these disparate fields together, so that a single reaction system can enable users to benefit from them all. Using RInChI, we can connect diverse data to individual reactions. Jonathan concluded with an amusing vision of the future synthesis machine.

Search and faceting of large reaction databases

The next talk was by John Mayfield of NextMove Software. Synthetic chemists want data, diagrams, classification, and search for their reaction systems. Workers at NextMove have previously described the extraction of reactions from patents: LeadMine and ChemicalTagger convert unstructured text to a structured reaction table. NextMove has also assembled over six million extracted reaction details (connection tables, procedures, quantities, solvents, catalysts, and yields) into a searchable ELN for multiple pharmaceutical companies. Good reaction diagrams are essential in communicating synthetic chemistry, and NextMove has done work in this field as well. In the area of classification, the NameRXN software allows the recognition and categorization of reactions from their connection tables. Using a large rule base of known reaction mechanisms and transformations, NameRXN is able to categorize reactions to a NameRXN code.12 Reactions are classified and assigned to leaves in the RXNO ontology. The ontologies are used to provide organization, faceting, and filtering of results. Pistachio is a reaction dataset interface providing loading, querying, and analytics of chemical reactions. NextMove's Arthor technology is reportedly up to 100 times faster than other "fast search" systems.
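The classify-then-facet workflow can be sketched with hierarchical code strings. The codes and reactions below are invented stand-ins; real NameRXN codes map to leaves of the RXNO ontology:

```python
from collections import Counter

# Invented NameRXN-style codes: "superclass.class.type" strings.
reactions = [
    {"id": "rxn-1", "code": "1.8.7", "yield": 82},
    {"id": "rxn-2", "code": "1.8.7", "yield": 45},
    {"id": "rxn-3", "code": "2.1.1", "yield": 90},
]

def facet_counts(rxns, level):
    """Count reactions per ontology node, truncating codes to `level` parts."""
    return Counter(".".join(r["code"].split(".")[:level]) for r in rxns)

def filter_by_facet(rxns, prefix):
    """Keep reactions whose code sits under the chosen ontology node.

    (A real implementation would match whole code segments, not raw
    string prefixes, so that "1.8" cannot match "1.80".)
    """
    return [r for r in rxns if r["code"].startswith(prefix)]

superclasses = facet_counts(reactions, level=1)
hits = filter_by_facet(reactions, "1.8")
```

Faceting in a results interface is exactly this: count hits per ontology node, show the counts, and filter the hit list when the user picks a node.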

The history of chemical reactivity

Guillermo Restrepo of the University of Leipzig showed that a computational approach to the history of chemical reactions sheds light on the patterns behind the development and use of substances and reaction conditions across two centuries. He and his co-workers have explored more than 45 million reactions in Reaxys and revealed historical patterns for the substances, types of substances, catalysts, solvents, temperatures, and pressures of those reactions. Reaxys was treated as a graph database. Despite the exponential growth in substances and reactions, little variation in catalysts, solvents, and reactants is observed over time. The vast majority of reactions fall into a narrow domain of temperature and pressure. The world wars caused a drop in chemical novelty for substances and reactions: the First World War set production back by around 30 years, and the Second by around 15. After the Second World War, the use of organic solvents skyrocketed. Guillermo anticipates that this study, and especially its methodological approach, will be the starting point for a history of chemical reactivity in which social and economic contexts are integrated.

SciFinderⁿ and ChemPlanner

The next two papers concerned work that CAS is doing to enhance Wiley's ChemPlanner13,14 with additional reaction content and associated references, including reactions from patents. A new version of ChemPlanner, including stereoselective retrosynthetic prediction and customizable relevance ranking, will be delivered exclusively in SciFinderⁿ. Orr Ravitz spoke first, largely concentrating on ChemPlanner itself. Chemists use ChemPlanner to boost creativity, overcome biases, and cover more options. Previous perceptions of retrosynthesis tools have included skepticism, fear of information overload, and concerns about the coverage and currency of the reaction database, and about accuracy and selectivity. Orr discussed automatic rule generation. Deriving selectivity from data requires statistical power, which is not always sufficient with a database such as CIRX. Literature examples, sorted by similarity to predictions, provide insight into experimental conditions and enhance user confidence. Greater coverage is expected by using Chemical Abstracts data instead of CIRX. A nearly exhaustive reaction source will have many variations on the same reaction, or the same reaction with very similar reactants and products. Growth of the rule set will therefore be significantly sublinear. Adding examples to existing rules will address functional group tolerance, give more statistical power for regioselectivity calculations, allow more automation for stereoselective rules, and improve yield prediction. There will be some consolidation of rules.

Jonathan Taylor of CAS started his talk with an introduction to SciFinderⁿ. Everything about SciFinderⁿ is new: the interface, the application, the search architecture, and the data model. User feedback and usability testing were critical in the design. Layout, and which information surfaces first, were users' main priorities. The final design balances surfaced information, aesthetics, browsability, and filter options. In the past, synthetic chemists wanted reaction-finding tools; today they have synthetic planning tools; and in the future they will have help with predictive synthetic routes: SciFinderⁿ will deliver new predictive synthesis planning capabilities through integration of an enhanced ChemPlanner. Having ten times the reaction content will provide ChemPlanner with more synthetic options to build pathways and improve prediction quality. Jonathan concluded with some screen mockups of user input, of how SciFinderⁿ will propose potential synthetic routes, and of how users will know how a prediction was constructed.

Reaction classification

Next, Valentina Eigner-Pitto of InfoChem spoke about the renaissance of reaction classification and visualization. InfoChem's ICMAP reaction mapping software identifies reaction centers. The CLASSIFY15 software automatically categorizes a reaction according to the type of chemical transformation, and it can be used for the organization of large reaction databases and hit lists. It provides unique identifiers (ClassCodes) that can be used in reaction database analysis. This allows companies to study the kind of chemistry performed in-house, to examine the evolution of that chemistry over time, and to compare in-house content with other repositories. Classification can also be used in network graphs, which serve as visualization tools for reaction content. Workers at Merck KGaA, in collaboration with BioSolveIT and InfoChem, have demonstrated a workflow that exploits the chemist's electronic laboratory notebook (ELN) to obtain and refine transforms for existing and novel chemical transformations,16 which in turn are used to enrich existing virtual libraries. The novelty of the added chemical space is assessed through a multitude of descriptors, with a particular focus on three-dimensionality, scaffold diversity, and fingerprint enrichment. Additionally, each added transform is evaluated for its propensity to reconstitute known drugs and chemical probes. Computer-aided synthesis design programs include ChemPlanner and InfoChem's17 ICSYNTH. Prediction of chemical space (forward reaction prediction) is also illustrated in the Merck poster.16

Use of Reaxys and ReaxysTree

Two papers followed from experts at Elsevier. Juergen Swienty-Busch discussed ReaxysTree and the taxonomies used in Reaxys. He began with an exposition of the new Reaxys before turning to the taxonomies. Reaxys has information on documents, substances, reactions, and substance properties, and on bioactivities and targets in the Reaxys Medicinal Chemistry Index. For documents, terms from ReaxysTree, Embase, Compendex, and Geobase make search and analysis possible. For substances, analysis is possible on substance classes and available properties. For reactions, search and analysis is possible on reaction classes, catalyst classes, and solvent classes. For targets, search and analysis is possible on gene and protein taxonomy, organisms, cell lines, and administration route. Substances have been curated by Richter classes, rings, and functional groups. Solvents, reagents, and catalysts have been curated for reactions. ReaxysTree allows concepts and synonyms to be used for search, filtering, analysis, and indexing. ReaxysTree concepts for reactions include name reactions, and classes and types such as cyclization, condensation, and addition. Juergen next outlined how reaction mapping is carried out, a transition state is assigned, and the transform is coded. When searching reactions with ReaxysTree, taxonomy terms are connected with actual Reaxys queries using transforms and other appropriate search terms, such as product substructures.

Matt Clark of Elsevier thinks that medicinal chemists themselves only want to find transformation details for chosen steps in synthesis, while management wants to lower the cost of making compounds, and wants reliable reaction schemes that can be sent to a contract research organization (CRO) for fast turnaround. Reaxys is a treasury of reported chemistry, with a built-in synthesis planning tool and display of experimental procedures. The API allows you to use similarity for compounds and reactions, access some data elements not visible in the user interface, and create your own analytics and reaction networks. Pipeline Pilot and KNIME offer an easy way to use the API, and offer interoperability with other software products.

Matt discussed a reaction graph analysis application to address questions around a specific potential CDK8 inhibitor. What chemistries are known for compounds like this? What conditions and solvents were used by different chemists? Where is this chemistry reported? Ultimately, what are the most efficient and flexible methods to make compounds like this? The application involves searching for reactions with the target compound as the product, plus a similarity search for very similar compounds; then searching for reactions that form those reactants as products; and then repeating to the desired graph depth. An interesting finding was that different chemistries and starting materials have been used for very similar compounds. One tree showed a set of compounds that used a common set of starting materials. Using Cytoscape you can drill down to the references for each edge. You can compare intermediates for similar compounds made by different groups and, by accessing Scopus, examine a network of institutions publishing a specific chemistry.
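The expansion loop can be sketched as a breadth-first walk over a product-to-reactants mapping. The "database" and compound names below are invented, in-memory stand-ins for the Reaxys API calls a real implementation would make at each step:

```python
# Sketch of graph expansion to a chosen depth. known_routes maps a
# product to the reactant lists (one list per known reaction) that
# form it; the names are invented.

known_routes = {
    "target": [["intermediate-1", "amine-1"]],
    "intermediate-1": [["aryl-halide", "boronic-acid"]],
}

def expand_network(product, routes, depth):
    """Collect (reactant, product) edges down to the requested graph depth."""
    edges, frontier = set(), {product}
    for _ in range(depth):
        next_frontier = set()
        for prod in frontier:
            for reactants in routes.get(prod, []):
                for r in reactants:
                    edges.add((r, prod))
                    next_frontier.add(r)
        frontier = next_frontier
    return edges

edges = expand_network("target", known_routes, depth=2)
```

The resulting edge set is exactly what gets handed to a tool such as Cytoscape for visualization, with references attached to each edge.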

Using Reaxys you can also analyze reaction conditions, grouping known transformations at different levels of detail to get the best conditions. Grouping uses reaction similarity, based on Reaxys transformation codes. Searching for “Buchwald-Hartwig Aminations” by keyword produced 4,179 results. These were grouped by transformation codes, from general to specific: level 0 had one group with 4,179 members, level 1 had 99 groups, level 2 had 160 groups, and so on. A summary of solvent and conditions for level 0 showed that toluene is a popular solvent, a temperature of around 110°C is common, reaction time is not very long, and inert atmosphere and microwave use were mentioned. These conditions can be selected based on membership in one of the other groupings.
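The level-by-level grouping described above can be imitated by truncating transformation codes to a chosen depth and summarizing the conditions within each group. The codes, solvents, and temperatures below are invented; real Reaxys transformation codes and records are richer, but the mechanism is the same:

```python
from collections import defaultdict
from statistics import median

# Invented hit list: each reaction carries a transformation code, a
# solvent, and a temperature.
rxn_hits = [
    {"code": "5.1.2", "solvent": "toluene", "temp_c": 110},
    {"code": "5.1.2", "solvent": "toluene", "temp_c": 105},
    {"code": "5.1.9", "solvent": "dioxane", "temp_c": 100},
]

def summarize_conditions(rxns, level):
    """Group by code prefix and report common solvent / median temperature."""
    groups = defaultdict(list)
    for r in rxns:
        groups[".".join(r["code"].split(".")[:level])].append(r)
    summary = {}
    for key, members in groups.items():
        solvents = [m["solvent"] for m in members]
        summary[key] = {
            "n": len(members),
            "top_solvent": max(set(solvents), key=solvents.count),
            "median_temp_c": median(m["temp_c"] for m in members),
        }
    return summary

level0 = summarize_conditions(rxn_hits, level=0)   # one all-inclusive group
level3 = summarize_conditions(rxn_hits, level=3)   # split by full code
```

At level 0 everything falls into one group (like the single 4,179-member group in the Buchwald-Hartwig example); increasing the level splits the hits into progressively more specific groups, each with its own condition summary.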

An expert searcher’s viewpoint

The final speaker was Judith Currano of the University of Pennsylvania. Introducing variable substituents during a reaction search is challenging. A researcher may not have a definite substituent in mind, instead suggesting that a site can be occupied by “any aryl group” or, still worse, “any electron withdrawing group”. Even a researcher who generates an R-group and populates it with specific substituents can run into problems because atom mapping from reactant to product is prohibited within R-group fragments. Judith used the term “specific ambiguity” to talk about a type of attachment without specifying exactly what it is. This includes general classes of attachments, user-defined groups of attachments (variables or R-groups), and stereocenters where you do not care about the identity of all of the attachments. She presented case studies based on troublesome requests from synthetic chemists.

The first examples concerned functional group transformations (plus mapping from reactant to product), and sensitive functional groups. Searchers should understand that sometimes a review source is worth a thousand searches. (Science of Synthesis was good for one example.) Searchers should also use caution when employing mapping. Database vendors should perhaps give users the ability to make mapping less atom-specific and more atom-type-specific. Structure search algorithms should have a way of manually grouping fragments that appear in the same reactant or product, allowing the searcher to specify multiple fragments in one substance while allowing additional substances on that side of the equation. (Old Beilstein Crossfire worked well in one of Judith’s examples.)

The second set of examples involved specific ambiguity of stereocenters or variables. Judith recommends searchers make use of system-defined generics whenever possible. In the case of user-defined generics, it may be necessary to run multiple searches if your generic does not exist. Vendors should note that all structure search algorithms should permit stereocenters containing system- or user-defined variables, and all search algorithms should permit stereo-specific reaction searches.

Finally Judith discussed reaction searches involving both specific transformations and specific ambiguity (mapping R-groups, mapping variables, and including the elusive electron-withdrawing group). She warns users that if it is essential that they map a user-defined R-group from reactant to product, they should be prepared to do multiple searches for the various substances represented. Database vendors should note that adding generics like electron withdrawing groups would make users very, very happy.

Acknowledgments

My thanks to all the speakers for their interesting contributions, and for providing me with copies of their slides, allowing me to study the talks in more depth, and, ultimately, to include more detailed summaries in my meeting report. My thanks also to Matt Clark for handling all the PC and projector issues so that I could concentrate on introducing speakers, on handling questions, and, above all, on being stimulated by the interesting science.

Herman Skolnik Award Symposium

Herman Skolnik Award Symposium 2017
Honoring David Winkler

Introduction

David Winkler, CSIRO Fellow and professor at the La Trobe Institute for Molecular Science and the Monash Institute of Pharmaceutical Sciences, Melbourne, Australia, received the 2017 Herman Skolnik Award for his seminal contributions to chemical information in the development of optimally sparse, robust machine learning methods for QSAR, and in leading the application of cheminformatics methods to biomaterials, nanomaterials, and regenerative medicine. A summary of his achievements has been published in the Chemical Information Bulletin. David was invited to present an award symposium at the Fall 2017 ACS National Meeting in Washington, DC. He invited six speakers:

Tim Clark: Approaching reality - simulating electronic devices


Tim Clark, of the University of Erlangen-Nürnberg, was the first speaker. The impact of modern hardware and software on simulations has not been an issue of doing things faster and faster, but rather one of doing calculations that we could not do before. Ab initio calculations can now be done on compounds with several hundred atoms, density functional theory calculations on a few thousand atoms, and semiempirical molecular orbital (MO) calculations on 100,000 atoms. Simulations of several microseconds are now standard.

Semiempirical (neglect of diatomic differential overlap, NDDO) molecular orbital (MO) calculations without local approximations are now possible for 100,000 atoms or more with the massively parallel semiEMPIRical molEcular-Orbital Program (EMPIRE),1-3 which is freely available to academic groups. Calculation time scales approximately as N^2.5. We are no longer limited to small or homogeneous, perfect systems, but can now include defects, dopants, impurities or domain boundaries in the calculations, or even calculate amorphous systems.

The results of such calculations can be used to simulate charge-transport through disordered monolayers. Clark’s team has studied self-assembled monolayer field-effect transistors (SAMFETs) handling conformational freedom using classical atomistic molecular-dynamics (MD) simulations, electronic properties using very large scale semiempirical MO theory, and conductance by propagating single electrons or using diffusion quantum Monte-Carlo (DQMC) charge-transport simulations.4-9

The molecules that comprise the SAM contain insulating and semiconducting moieties, so that they serve as both gate dielectric and the active transistor channel in a device.

Tim’s team has used simulations to describe and optimize complex systems of self-assembled monolayers on surfaces, not only to explain their morphology, but also to predict molecular compositions and arrangements favorable for improved charge transport.7 In more recent work,10 they have constructed transistors based on SAMs of two molecules that consist of the organic p-type semiconductor benzothieno[3,2-b][1]benzothiophene (BTBT), linked to a C11 or C12 alkylphosphonic acid. Both molecules form ordered SAMs, but the experiments show that the size of the crystalline domains and the charge-transport properties vary considerably in the two systems. Because of the angle of the head groups, one molecule can form crystalline domains and the other cannot. This can be reproduced with simple force field calculations.

The procedure for charge transfer simulations is as follows:

Calculate the neutral system and use local properties as external potentials:

Local electron affinity11,12 for electrons, local ionization energy13 for holes

Propagate single charge carriers on these potentials to determine time scales.

Tim showed an MD simulation of the charge transport paths. For the transport calculations, the team employed a fully quantum mechanical description, namely Landauer transport theory.9 In accord with experiment, they found an improved charge transport across BTBT-C11-PA SAMs compared to BTBT-C12-PA SAMs.

DQMC reproduces voltage/current curves (assuming that the number of Monte Carlo steps correlates with time) and reproduces experimentally observed hysteresis. It also revealed dimeric fullerene electron traps.15 Density functional theory calculations indicate that van der Waals fullerene oligomers can form interstitial electron traps in which the electrons are even more strongly bound than in isolated fullerene radical anions. Spectroelectrochemical measurements on a bis-fullerene-substituted peptide provide experimental support. The proposed deep electron traps are relevant for all organic electronics applications in which non-covalently linked fullerenes in van der Waals contact with one another serve as n-type semiconductors.

Finally Tim showed the results of simulations of hole-transport through a self-assembled monolayer substituted with a p-type organic semiconductor and with crystalline domains (see the work above on BTBT linked to a C11 or C12 alkylphosphonic acid). He illustrated hole transport through the monolayers. Hysteresis is not observed in this case. Tim also illustrated well-defined paths through the crystalline domains of the O2(OH)P(CH2)11-BTBT material. The researchers have shown that structural order is particularly important for the electronic properties of semiconducting self-assembled monolayers, and they predict that semiconducting SAMs with a higher degree of crystallinity and larger crystalline regions will exhibit superior performance.

Alex Tropsha, of the University of North Carolina Chapel Hill, UNC Eshelman School of Pharmacy, is benefiting from the explosive growth of materials data. There are 160,000 entries in the Inorganic Crystal Structure Database (ICSD). There are numerous commercial and open experimental databases (NIST, MatWeb, MatBase etc.), and huge databases such as AFLOWLIB, Materials Project, and Harvard Clean Energy. The chemical space of possible materials is huge: about 10^100 candidates.16 The US government’s Materials Genome Initiative recognizes the need for new high performance materials. The growth of materials databases and emerging informatics approaches offers the opportunity to transform materials discovery into data- and knowledge-driven rational design.

AFLOW is a globally available database of 1,688,245 material compounds, with over 167,136,255 calculated properties. The optimized geometries, symmetries, band structures, and densities of states available in the AFLOWLIB consortium databases have been converted into two distinct types of fingerprints: Band structure fingerprints (B-fingerprints), and Density of States fingerprints (D-fingerprints).17 The framework is employed to query large databases of materials using similarity concepts, to map the connectivity of materials space (as a materials cartogram) for rapidly identifying regions with unique organizations and properties, and to develop predictive quantitative materials structure−property relationship (QMSPR) models for guiding materials design.

To represent the library of materials as a network (a material cartogram), the researchers considered each material, encoded by its fingerprint, as a node. Edges exist between nodes with similarities above certain thresholds (in this case, Tanimoto similarity and a threshold of 0.7). A materials map from B-fingerprints was made from 15,000 materials from ICSD, using DFT PBE calculations from AFLOWLIB. Four big clusters were observed: insulators, ceramics, and complex oxides; bimetals and polymetals; metallic and nonmetallic combinations; and small band gap semiconductors.
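The cartogram construction reduces to a simple rule: compute pairwise Tanimoto similarity between fingerprints and draw an edge wherever it exceeds the 0.7 threshold. A minimal sketch with invented toy fingerprints, stored as sets of on-bit positions:

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two binary fingerprints,
    represented as sets of on-bit indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy fingerprints for four hypothetical materials (bit positions set).
fps = {
    "mat1": {0, 1, 2},
    "mat2": {0, 1, 2, 3},
    "mat3": {3, 4, 5},
    "mat4": {2, 3, 4, 5},
}

# Nodes are materials; an edge exists when similarity exceeds 0.7.
names = list(fps)
edges = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
         if tanimoto(fps[a], fps[b]) > 0.7]
print(edges)  # only the two highly similar pairs are connected
```

Clusters in the resulting network correspond to the material families (insulators, bimetals, and so on) observed in the B-fingerprint map.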

Novel descriptors (property-labeled materials fragments) not requiring prior DFT calculations have also been developed by Voronoi tessellation and neighbors search of crystal structures, followed by infinite periodic graph construction and property labeling, and generation of circular fingerprints.18 Starting from only a crystal structure, regression models can be built to predict band gap energy, and thus electronic properties, or to predict thermo-mechanical properties such as bulk modulus, shear modulus, thermal expansion, heat capacity, and thermal conductivity. All the models are trained based on DFT-computed properties. Heuristic design rules can be extracted.

Material informatics has also been applied to the design of a novel photocathode material for dye-sensitized solar cells (DSSCs).19 By conducting a virtual screening of 50,000 known inorganic compounds, the researchers have identified lead titanate (PbTiO3), as the most promising photocathode material. Notably, lead titanate is significantly different from the traditional base elements or crystal structures used for photocathodes. In experimental validation, the fabricated lead titanate DSSC devices exhibited the best performance in aqueous solution, showing remarkably high fill factors compared to typical photocathode systems. Currently, device performance is low, but it might be improved by designing a new dye.

Next Alex discussed applications of machine learning to designing chemicals with the desired physical and biological properties where compound structure is described only by its SMILES notation, and no other conventional chemical descriptors are used. The new approach developed in his lab is based on concepts from text mining that rely on neural networks to solve the problem of semantic similarity of texts.

The British linguist J. R. Firth is noted for drawing attention to the context-dependent nature of meaning. In particular, he is known for the 1957 quotation: “You shall know a word by the company it keeps”. To define the semantic similarity between two entities, Alex and his colleagues have made use of approaches embedded in Word2Vec, a neural-network-based approach to describe linguistic context of words developed at Google.20 With Word2Vec, a network is trained using each word of a corpus of text and some configurable number of surrounding words. The model can be trained to either predict the surrounding context based on the current word, or to predict the current word from the context. Elena Tutubalina and Alex (manuscript in preparation) have performed drug clustering in semantic similarity space, using webmd.com, patient.info, drugs.com, amazon.com, askapatient.com, and dailystrength.org as sources of user comments, and showed that drugs with similar pharmaceutical action do cluster together in the semantic similarity space.
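Before any neural network training, Word2Vec turns a corpus into (center word, context word) pairs drawn from a sliding window; the network then learns to predict one from the other. A minimal sketch of the skip-gram pair generation (the window size and example sentence are illustrative):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs as in Word2Vec's
    skip-gram variant: each word is paired with its neighbors
    within `window` positions on either side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "you shall know a word by the company it keeps".split()
pairs = skipgram_pairs(sentence, window=2)
print(pairs[:4])  # pairs generated for the first words
```

Words that occur in similar contexts end up with similar learned vectors, which is precisely the property exploited for clustering drugs by the company their names keep in user comments.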

Alex’s team has also experimented with de novo design of molecules with the desired properties using SMILES in Deep Reinforcement Learning.

Structural bias, physical properties, and biological activity have been used in proof of concept case studies of user-biased molecular design. In summary, Alex cited Confucius who said, “Without knowing the force of words, it is impossible to know more”. Alex quipped “And remember: anything you say can, and will be used … for text mining!”.

Yoram Cohen of the University of California Center for Environmental Implications of Nanotechnology gave a talk co-authored by colleagues at the University of California. Nanoinfo.org is a nanoinformatics platform that supports the environmental impact assessment of engineered nanomaterials (ENMs) with a central database of ENM safety data and a toolkit for various exploration and analysis methods.21 These methods include the estimation of environmental exposure levels of ENMs (MendNano), evaluation of environmental releases of ENMs (LearNano), analysis of high throughput toxicity data of ENMs (ToxNano), and predictive toxicity models, and analysis of the environmental impact of ENMs via Bayesian inference (NanoEIA).

NanoDatabank is a data repository of ENM properties, and experimental and simulation datasets of ENM toxicity and environmental fate and transport (F&T). It contains databases that include physicochemical properties; toxicological properties; experimental datasets of ENM toxicity and F&T; and results of model simulations and estimation of ENM toxicity and F&T behavior, and physicochemical properties. It includes data for over 300 nanomaterials, and toxicity data for various cell lines, zebrafish and bacterial strains, from 325 publications. ToxNano is a high-content data analysis tool (HDAT)22,23 offering QSARs using random forest and Bayesian network toxicity models, analysis of knowledge evidence, and data visualization. MendNano (multimedia environmental distribution of nanomaterials) is a Web-based modeling platform.24,25 Nanoinfo.org has 400 users from more than 50 countries.

As an example of work on the toxicity of nanomaterials, Yoram presented unpublished results on evaluating the body of evidence on quantum dots (QDs) via meta-analysis. QDs are very small semiconductor particles, only several nanometers in size, so small that their optical and electronic properties differ from those of larger particles. Many types of quantum dot will emit light of specific frequencies if electricity or light is applied to them, and these frequencies can be precisely tuned by changing the dots’ size, shape, and material.

QD data were collected from 448 publications, reporting 2,703 samples, with 7 core types, 12 shell types, 13 surface modifications, 14 surface ligands, and 20 assay types. In the predictive toxicity model R2 was about 0.81 for cell viability, and about 0.83 for IC50. Yoram and his colleagues studied cause-effect relationships between cellular bioactivity and QD attributes. Median IC50 was ≤ 10 mg/L, for the surface ligands of type amphiphilic polymer, lipid, other hydrophobic, aminothiol, and other amphiphilic. It was uniformly distributed for silica. There was no correlation between surface charge and IC50. The sensitivity distribution of IC50 for cell anatomical type suggests that more differentiated cells are more adversely affected by exposure to QDs. Toxicity is not governed by QD size alone: there is a wide range of IC50 for a given size, and toxicity can be high or low irrespective of the size. Core type affects toxicity, but the wide range of IC50 for a given core type suggests that there are other important attributes.

Bayesian network models can be useful for handling uncertainties, mixed attributes, and hidden conditional relationships since they provide rigorous and simple mathematical means of handling data uncertainty; they integrate graphical representation of the problem with probabilistic evaluation of variable relationships; they can incorporate prior knowledge based on data as well as expert opinion in a convenient representation of probability distributions; and they calculate the likelihood of specific scenarios based on prior knowledge.

Bayesian networks for new explorations of association rules among various biological responses as a result of exposure to manufactured nanomaterials have also been demonstrated in zebrafish toxicity studies. Yoram and his co-workers used a nanomaterial biological interaction knowledge base of zebrafish phenotype data with 1,147 samples, and 11 biological responses (including mortality). The data included exposure to seven material types (carbon, cellulose, dendrimer, metal, (metal) oxide, polymeric, and semiconductor) of 0.8–250 nm average primary size; concentration; number of embryos per experiment; and responses recorded for each exposure scenario.

The Bayesian network model for zebrafish mortality (percentage of dead embryos) had an R2 of about 0.79. Sensitivity analysis of the key material properties and exposure conditions that correlate with zebrafish mortality was carried out, and cause-effect relationships between zebrafish phenotypes and material properties and exposure conditions were investigated. Attribute significance was determined by exhaustive search of 13 attributes using bootstrapping. Mortality at 120 hours post-fertilization correlated with concentration used, core atomic composition, outermost surface, average particle size, surface charge, shell composition and purity. The significant attributes at 24 hours post-fertilization were the same but the ranking of the top four differed slightly.

The responsible development of beneficial manufactured nanomaterials requires a thorough understanding of their potential adverse environmental and human health impacts. This requires predicting the biological response of various receptors when exposed to these materials, along with an understanding of their fate and transport, and their range of likely exposure concentrations. Yoram’s work helps to rank various nanomaterials with respect to their potential environmental impact.

Ceyda Oksel of Imperial College London reported on the PhD work27 she had done at the University of Leeds in collaboration with Xue Wang and David Winkler. Given the ever-increasing use of ENMs, it is essential to assess properly all potential risks that may occur as a result of exposure to ENMs. The distinctive characteristics of ENMs that have made them superior to bulk materials for particular applications might also have a substantial impact on the level of risk they pose. Despite the clear benefits that nanotechnology can bring, there are serious concerns about the potential health risks associated with the production and use of ENMs, intensified by the limited understanding of what makes ENMs toxic and how to make them safe.

The involvement of computational specialists in nano-safety research has become more prominent since Registration, Evaluation, Authorization and restriction of CHemicals (the European Union’s REACH regulation) promoted the use of in silico techniques such as QSAR for toxicity assessment. Data-driven models that decode the relationships between the biological activities of ENMs and their physicochemical characteristics provide an attractive means of maximizing the value of scarce, and expensive, experimental nanotoxicity data.

Nano-QSAR models can be used to predict the properties of new materials and to design safer materials. The genetic-programming-based decision tree (GPTree) approach developed at Leeds27 applies decision tree learning algorithms to identify the best combination of physicochemical properties to predict biological activity of ENMs. The trees are automatically constructed from the data. Decision trees have several advantages. They are able to deal with small, large and noisy datasets; they can detect nonlinear relationships (as well as linear ones); they allow input variables to be selected automatically; they are transparent; and they represent knowledge clearly (i.e., the models are interpretable).

GPTree begins with a random population of solutions and repeatedly attempts to find better solutions by applying genetic operators such as mutation and crossover. The first step is to construct a user-specified number of trees (usually a large number) starting from a random compound and a randomly chosen descriptor. Once the initial population is generated, tournament selection is performed to identify the best tree to be used as a parent tree for genetic operators such as crossover. The best tree from the subset of trees is chosen by its fitness (e.g., accuracy). Genetic operators such as crossover and mutation are used to form the next generation of trees, which are added to or replace the current generation. These steps are repeated until the user-specified number of generations has been created. The decision tree model with the highest accuracy of classification for the training set is selected as the optimal decision tree model.
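The evolutionary loop just described (random initial population, tournament selection, crossover, mutation, repeated for a fixed number of generations) can be sketched in miniature. For brevity the individuals here are single-descriptor threshold rules rather than full decision trees, and the data and operators are invented for illustration:

```python
import random

random.seed(0)

# Toy data: (descriptor values, binary toxicity label).
DATA = [((0.2, 5.0), 0), ((0.8, 4.0), 1), ((0.9, 6.0), 1), ((0.1, 5.5), 0)]

def fitness(rule):
    """Classification accuracy of a (descriptor_index, threshold) rule."""
    idx, thr = rule
    return sum((x[idx] > thr) == y for x, y in DATA) / len(DATA)

def random_rule():
    return (random.randrange(2), random.uniform(0, 6))

def tournament(pop, k=3):
    """Tournament selection: the fittest of k randomly sampled rules."""
    return max(random.sample(pop, k), key=fitness)

def crossover(a, b):
    return (a[0], b[1])  # child takes descriptor from a, threshold from b

def mutate(rule):
    return (rule[0], rule[1] + random.gauss(0, 0.5))

pop = [random_rule() for _ in range(20)]        # random initial population
for _ in range(30):                             # fixed number of generations
    nxt = []
    for _ in range(len(pop)):
        child = crossover(tournament(pop), tournament(pop))
        if random.random() < 0.2:               # occasional mutation
            child = mutate(child)
        nxt.append(child)
    pop = nxt                                   # next generation replaces current

best = max(pop, key=fitness)
print(best, fitness(best))
```

In the real GPTree the individuals are whole decision trees and crossover swaps subtrees, but the selection and iteration structure is the same.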

Ceyda demonstrated the application of genetic-programming-based decision tree construction algorithms to QSAR modeling of ENM toxicity by five case studies. The accuracy of the model predictions was satisfactorily high and clearly highly statistically significant relative to the classification rate due to chance.

In the first case study, a large set of in-house in vitro data (obtained in collaboration with Edinburgh University) was used. The dataset included a panel of 18 ENMs with varying structures (e.g., carbon-based materials and metal oxides), a set of in vitro cytotoxicity assays (e.g., LDH release, apoptosis, necrosis, viability, MTT and hemolytic effects), and several experimentally measured physicochemical properties (e.g., particle size and size distribution, surface area, morphology, metal content, reactivity and free radical generation). After a set of data preparation and scaling steps, a heat map of toxicity data combined with hierarchical clustering was constructed. As a second step, C-Visual Explorer (CVE) was used as a tool to create a parallel coordinate plot of the multivariate toxicity data. Similar to the heat map visualization results, the parallel coordinate plot showed that the aminated polystyrene latex beads and zinc oxide had the highest toxicity values in nearly all assays, followed by nanotubes that had medium to high toxicity values in viability and MTT assays.

Then, a dimensionality reduction technique, principal component analysis, was performed on all the toxicity data and the ENMs were divided into five categories according to their toxicity values. GPTree was used to identify potential descriptors contributing to the toxicity of four particular ENMs that were clearly separated from the main cluster formed by low-toxicity ENMs. It was concluded that high aspect ratio contributed to the toxicity of nanotubes, while the most likely factor driving the toxicity of zinc oxide was its high zinc content.

The second case study concerned the cellular uptake of nanoparticles; 13 descriptors representing the hydrogen-bonding characteristics, functional group counts, molecular shape, composition and polarizability were found to be significant among a larger set of 147 chemically interpretable descriptors. The findings of GPTree analysis regarding the large contribution of lipophilicity, hydrogen bonding and molecular shape descriptors in the cellular uptake behavior of nanoparticles are consistent with earlier studies.

For a cytotoxicity to human keratinocytes dataset (the third case study),28 the descriptors selected by GPTree were the enthalpy of formation of a metal oxide nanocluster representing a fragment of the surface, the Mulliken electronegativity of the cluster, Xc, and the chemical hardness, η. The first two descriptors are consistent with the properties reported to be important for cytotoxicity of metal oxide nanoparticles. In addition, the chemical hardness, corresponding to the reactivity, was found to be an influential parameter on the cytotoxicity of nanoparticles.

The descriptors selected by GPTree were used to develop a regression model which was statistically significant and had good predictivity (R2 = 0.92, Q2 = 0.72). A variable importance plot showed that Xc was twice as important as the enthalpy of formation, which was a little more important than η.

The data used in the fourth case study included a set of 27 descriptors, 23 ENMs, and a set of multi- and single-parameter toxicity screening assays. The descriptors selected by the GPTree model included nanoparticle conduction band energy, EC, and ionic index of metal cation, Z^2/r. This finding is very consistent with past studies that identified these two descriptors as being important for the toxicity of metal oxide nanoparticles.

In the last case study, exocytosis of gold nanoparticles in macrophages, the optimal descriptors for predicting the exocytosis were the charge accumulation, zeta potential and charge density. These findings are in line with previous studies revealing an association between surface characteristics of gold nanoparticles, especially high positive surface charge, and their exocytosis patterns in macrophages.

Ceyda concluded that the genetic-programming-based decision tree construction algorithm shows considerable promise in its ability to identify the relationship between molecular descriptors and biological effects of ENMs. Selected decision tree models yielded (external) prediction accuracies of 86-100%. Another statistical test (Y-randomization) was also performed to demonstrate the robustness of the selected models. This work is a first step in the application of a genetic-programming-based decision tree construction algorithm to nano-QSAR studies.

Johnny Gasteiger: Self-organizing neural networks in chemistry


Johnny Gasteiger of the University of Erlangen-Nürnberg is skeptical about deep neural networks: they are good for getting funding, but they are yet to be proven. Johnny illustrated some of the useful applications of shallow neural networks. Much like the human brain generates two-dimensional sensory maps of the environment, a Kohonen network (a self-organizing map) can generate two-dimensional maps of high-dimensional chemical data. The representation of the chemical data is crucial to the success of studying chemical problems with a self-organizing neural network.

The shape and surface of molecules are very important: the entire electrostatic potential can be seen in a colored 3D model. Johnny has projected the 3D Cartesian coordinates of, for example, 2-chloro-4-hydroxy-2-methylbutane onto a Kohonen net to get a 2D map.

The neurotransmitter acetylcholine binds to two types of receptors, the muscarinic and the nicotinic receptor. Kohonen maps of the van der Waals surface of muscarinic agonists (muscarine, atropine, scopolamine, pilocarpine) and nicotinic agonists (nicotine, (+)-anatoxin a, mecamylamine, pempidine) have also been produced by projecting points of the 3D surface on a 2D space.29 Such maps allowed the total molecular electrostatic potential (MEP) of a compound to be represented in a single picture, instead of requiring a series of pictures as formerly. Johnny showed the maps of the MEPs of the eight compounds with muscarinic agonists in the top row and nicotinic agonists below.

The results showed that the MEP is important for the binding of these compounds to their receptors. The Kohonen maps reflect significant characteristics of the MEPs and can therefore be used in the search for biologically active compounds.

In analytical chemistry, neural networks have been used in the classification of Italian olive oils.30 The classification was performed on a set of 572 Italian olive oils, from nine different regions, on the basis of an analysis of eight fatty acids. Kohonen learning was superior to a network using the back-propagation of errors. There were 250 oils in the training set and 322 in the test set; 312 of the 322 were correctly predicted. The nine Italian regions were nicely differentiated in the Kohonen map. Even more interesting, the Kohonen map reflects the map of Italy. This emphasizes the power of unsupervised learning, discovering information that is hidden in the data. In this case, clearly, the different climates and the different soils are responsible for the separation of the regions of Italy in the self-organizing map.
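A Kohonen map of the kind used for the olive oil data can be trained with a few lines: find the best-matching unit for a sample, then pull that unit and its grid neighbors toward the sample, shrinking the learning rate and neighborhood over time. A minimal sketch with invented two-dimensional data standing in for the eight fatty-acid profiles (grid size, schedules, and data are illustrative):

```python
import math
import random

random.seed(1)

def train_som(data, rows=4, cols=4, epochs=200, lr0=0.5, radius0=2.0):
    """Train a tiny Kohonen self-organizing map (unsupervised)."""
    dim = len(data[0])
    weights = {(r, c): [random.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)                # decaying learning rate
        radius = radius0 * (1 - t / epochs) + 0.5  # shrinking neighborhood
        x = random.choice(data)
        # Best-matching unit: the neuron whose weight vector is closest.
        bmu = min(weights, key=lambda n: sum((w - xi) ** 2
                                             for w, xi in zip(weights[n], x)))
        for n, w in weights.items():
            d = math.dist(n, bmu)  # distance on the 2D neuron grid
            if d <= radius:
                h = math.exp(-d * d / (2 * radius * radius))
                weights[n] = [wi + lr * h * (xi - wi)
                              for wi, xi in zip(w, x)]
    return weights

# Two well-separated clusters (stand-ins for fatty-acid profiles).
data = [(0.1, 0.1), (0.15, 0.05), (0.9, 0.9), (0.85, 0.95)]
som = train_som(data)

def best_unit(x):
    return min(som, key=lambda n: sum((w - xi) ** 2
                                      for w, xi in zip(som[n], x)))

print(best_unit((0.1, 0.1)), best_unit((0.9, 0.9)))
```

After training, samples from different clusters land on different grid units, which is how the nine olive oil regions (and, strikingly, the geography of Italy) emerge on the map.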

Kohonen networks use unsupervised learning. Johnny next discussed examples of supervised learning. In one experiment, electronic properties located on the atoms of a molecule, such as partial atomic charge, electronegativity, and polarizability, were encoded by an autocorrelation vector accounting for the constitution of a molecule.31 Using the 49-dimensional vector of seven properties and seven distances, it is possible to distinguish between 112 dopamine agonists and 60 benzodiazepine receptor agonists even after projection into a Kohonen map. The two types of compounds can still be distinguished if they are buried in a dataset of 8,323 compounds of a chemical supplier catalog comprising a wide structural variety. The method can be used for searching for structural similarity, and, in particular, for finding new lead structures with biological activity.

Gasteiger’s team has also worked on simulation of infrared spectra.32 They developed an empirical approach to the modeling of the relationships between the 3D structure of a molecule and its IR spectrum based on a novel 3D structure representation, and a counterpropagation (CPG) neural network. The 3D coordinates of the atoms of a molecule are transformed into a structure code that has a fixed number of descriptors irrespective of the size of a molecule. The structure coding technique is referred to as radial distribution function (RDF) code.33 3D structures were transformed into radial codes (128 values) and put into a CPG network. IR spectra (128 absorbance values) were also input, and the network was trained. When IR spectra are simulated the fingerprint region is predicted well because of the representation of the 3D structure. A CPG network can be operated in reverse mode,33 enabling the prediction of a structure code. The input of a query infrared spectrum into a trained CPG network provides a structure code vector, which represents the radial distribution function with 128 discrete values. This RDF code is then decoded to provide the Cartesian coordinates of a 3D structure.
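The fixed-length property of the RDF code comes from sampling a Gaussian-smeared distribution of interatomic distances at a set number of radii, so the descriptor length is independent of molecule size. A minimal sketch (the smoothing parameter B, radius range, and coordinates are illustrative; the published code also weights atom pairs by atomic properties such as partial charge):

```python
import math

def rdf_code(coords, n_points=128, r_max=10.0, B=20.0):
    """Radial-distribution-function code of a 3D structure:
    g(r) = sum over atom pairs of exp(-B * (r - r_ij)^2),
    sampled at n_points radii between 0 and r_max."""
    dists = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            dists.append(math.dist(coords[i], coords[j]))
    radii = [r_max * k / (n_points - 1) for k in range(n_points)]
    return [sum(math.exp(-B * (r - d) ** 2) for d in dists) for r in radii]

# Hypothetical 3D coordinates of a bent three-atom fragment (angstroms).
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.2, 1.2, 0.0)]
code = rdf_code(coords)
print(len(code))  # 128 values regardless of molecule size
```

These 128 values are the structure input to the counterpropagation network; the reverse mode then maps a 128-point IR spectrum back to such a code.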

Johnny concluded by mentioning his recent collaboration with David Winkler on dye solubility in carbon dioxide.34 David has also worked on melting points of ionic liquids, fibrinogen adsorption to polymeric surfaces, and normalized metabolic activity of polymeric biomaterials. Johnny encouraged David to continue to do good science.

Tudor Oprea: Understudied proteins. Time to shift the paradigm

644 reads

Tudor Oprea of the University of New Mexico believes that identifying novel targets as a precompetitive endeavor can lead to new therapeutic opportunities if academia and industry work together. Most protein classification schemes are based on structural and functional criteria. For therapeutic development, it is useful to understand how much data and what types of data are available for a given protein, thereby highlighting well-studied and understudied targets. Tudor and his co-workers classify proteins annotated as drug targets as “Tclin”; proteins for which potent small molecules are known as “Tchem”; proteins for which biology is better understood as “Tbio”; and proteins that lack antibodies, publications or National Center for Biotechnology Information (NCBI) Gene References Into Function (GeneRIFs) as “Tdark”.

Tclin proteins are associated with drug mechanism of action (MoA). Tchem proteins have bioactivities in ChEMBL and DrugCentral, plus human curation for some targets. A Tbio protein lacks small-molecule annotation and is above the cutoff criteria for Tdark, or is annotated with a Gene Ontology (GO) molecular function or biological process leaf term(s) with an experimental evidence code, or has confirmed Online Mendelian Inheritance in Man (OMIM) phenotype(s). Tudor and his colleagues used named entity recognition software35 from L. J. Jensen’s lab to evaluate nearly 27 million abstracts and derive a publication score per protein. Tdark proteins (“understudied proteins”) have little information available, and meet at least two of the following three criteria: a PubMed text-mining score of less than five, three or fewer GeneRIFs, and 50 or fewer antibodies available according to Antibodypedia. As external validation, Tdark proteins have statistically significantly lower values than the other three target development levels (TDLs): fewer GO terms, fewer patents, fewer National Institutes of Health (NIH) R01 grants, and fewer searches of the STRING-db database.
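The Tdark cutoff is simple enough to express as a small predicate. The thresholds below are taken from the text; the function name and argument names are illustrative.

```python
def is_tdark(pubmed_score, n_generifs, n_antibodies):
    """Tdark if at least two of three hold: PubMed text-mining score < 5,
    3 or fewer GeneRIFs, 50 or fewer antibodies (per Antibodypedia)."""
    criteria = [
        pubmed_score < 5,      # little literature presence
        n_generifs <= 3,       # few curated functional annotations
        n_antibodies <= 50,    # few commercial antibodies
    ]
    return sum(criteria) >= 2

print(is_tdark(1.2, 0, 10))    # True: all three criteria met
print(is_tdark(40.0, 12, 5))   # False: only the antibody criterion is met
```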

Tudor’s first “take home message” was that there is a knowledge deficit: over 37% of the proteins remain understudied (the Tdark ones) and only about 10% of the proteome (Tclin and Tchem) can be targeted by potent small molecules. Are Tdark proteins underfunded because there is no scientific interest in this category, or is the lack of knowledge perpetuated by lack of funding? It is possible that the absence of high-quality, well-characterized molecular tools (i.e., antibodies or chemical probes) may be a root cause for this situation, but lack of tools leads to lack of interest, and lack of interest diminishes the probability of such tools being developed.

The patent literature is also of interest. Almost half of patent bioactivity data are never published elsewhere, and compounds may appear in patents two to four years before they appear in the literature. The SureChEMBL team has annotated the SureChEMBL patent corpus with gene and disease terms. Looking at patents between 2001 and 2013, they processed a set of 99 granted patents of interest to the Illuminating the Druggable Genome (IDG) consortium. Bioactivity data from these 99 patents were manually extracted: 20,941 activity measurements for 11,358 compounds, and 1,134 assays. These data have already been uploaded into ChEMBL 23. Data for seven IDG Phase 2 targets were uncovered by this patent data extraction exercise, data that progressed the TDLs of two targets (GPR6 and HCAR1) from Tbio to Tchem.

Anne Hersey of ChEMBL has estimated that more than 50% of the data from patents do not end up in peer-reviewed papers. IDG, Open Targets, BindingDB, and others could collectively, in a precompetitive manner, mine data from patents (if necessary, for only terminated projects, or out-of-patent drugs) and upload these data into ChEMBL and Pharos. Pharos36 is the user interface to the Knowledge Management Center (KMC) for the IDG program funded by the NIH.

Approximately one-third of all mammalian genes are essential for life. Phenotypes resulting from knockouts of these genes in mice have provided insight into gene function and congenital disorders. The International Mouse Phenotyping Consortium (IMPC) has published research on the high-throughput discovery of novel developmental phenotypes.37 They identified 2,788 genes with 8,241 significant phenotype calls in 25 major categories. The promise of the IMPC annotations is illustrated by examining the definite and clear links between human neurological and behavioral disorders (191 human genes) and the corresponding gene knockout mouse neurological and behavioral phenotypes. The majority of these links are for schizophrenia, Alzheimer’s disease, epilepsy, and amyotrophic lateral sclerosis. Several rare diseases are also associated with these genes.

Of 119 Tdark genes prioritized by KMC to IMPC, 45 mouse lines were produced, with 41 phenotypes observed. Knockouts of the Tdark kinase Alpk3 have increased embryonic and perinatal lethality, with the surviving adults displaying severe heart defects. Of 482 Tbio genes submitted by KMC, 184 mouse lines were produced, with 145 phenotypes observed. Knockouts of the Tbio GPCR Adgrd1 display reproductive defects. (These are Tdark and Tbio statistics as of April 2017.) Tudor commented: “If you don't know very much to begin with, don't expect to learn a lot quickly.”

Data from Cristian Bologa suggest that, on average, it takes 15-20 years for a Tdark protein to bear fruit as a drug target. The leptin receptor was Tdark in 1995, but led to an approved drug in 2014. The smoothened receptor was Tdark in 1997, and a drug was launched in 2012. Tudor gave several other examples. There is room for improvement in research funding. Text mining of all NIH grants for the period 2000-2015 suggests that 8,858 proteins received zero NIH funding. Of these, 6,051 are Tdark, and 2,616 are Tbio. This is to be expected, but 119 are Tchem and 72 are Tclin. Possible explanations could be old drug targets or research funded elsewhere. (Data from funding sources other than NIH are not available.) Pharma and academia could pay more attention to these 8,858 underfunded proteins.

Tudor’s second take home message was that just because something is ignored it does not mean it lacks importance. Understudied proteins need funding and patience. Based on current evidence, IMPC has the most concerted Tdark exploration approach.

DrugCentral (http://drugcentral.org) is an open-access online drug compendium38 integrating structure, bioactivity, regulatory information, pharmacologic actions, and indications for active pharmaceutical ingredients approved by regulatory agencies. It integrates content for active ingredients with pharmaceutical formulations, indexing drugs and drug label annotations, and complementing similar resources available online. Tudor’s team used it initially to find out how many drugs there are, but they also wanted to know how many drug targets there are. They have also studied innovation patterns per therapeutic area.39

They have also examined the commercial impact of target classes by evaluating data from IMS Health on drug sales from 75 countries, aggregated over a five-year period (2011–2015). After excluding categories such as homeopathic medicines, they identified 51,095 unique products and mapped them to 1,069 active pharmaceutical ingredients (APIs) from DrugCentral, correcting by the number of APIs per product and then by the number of Tclin targets per API. The most lucrative target class from a therapeutic perspective was G-protein coupled receptors (GPCR, 27.42% market share). Tudor also tabulated the top 20 targets by revenue. His third take home message was that there are many unexplored opportunities. By his conservative estimate (about 15,000 disease concepts, and about 2,500 unique drug indications), we address about 15% of human diseases with therapeutic agents.
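The revenue-attribution scheme described above (divide each product's revenue among its APIs, then each API's share among its Tclin targets, then aggregate per target class) can be sketched as follows. All data values, drug names, and gene symbols below are made-up toy inputs, not figures from the study.

```python
from collections import defaultdict

def revenue_by_target_class(products, api_targets, target_class):
    """Attribute product revenue to target classes: split each product's
    revenue equally among its APIs, each API share equally among its
    Tclin targets, then sum per target class."""
    totals = defaultdict(float)
    for revenue, apis in products:
        per_api = revenue / len(apis)
        for api in apis:
            targets = api_targets.get(api, [])
            if not targets:
                continue                      # API with no Tclin target
            per_target = per_api / len(targets)
            for t in targets:
                totals[target_class[t]] += per_target
    return dict(totals)

# Toy example: one single-API product and one two-API combination product.
products = [(100.0, ["drugA"]), (60.0, ["drugA", "drugB"])]
api_targets = {"drugA": ["OPRM1", "OPRD1"], "drugB": ["SCN9A"]}
target_class = {"OPRM1": "GPCR", "OPRD1": "GPCR", "SCN9A": "Ion channel"}
print(revenue_by_target_class(products, api_targets, target_class))
# {'GPCR': 130.0, 'Ion channel': 30.0}
```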

It has been said that the absence of a quantitative language is the flaw of biological research,40 or that “the more facts we learn the less we understand”. Again, when little is known, we should not expect knowledge to accumulate quickly. Separation by organ and cell is a conceptual fallacy. Medicine maintains this separation out of necessity: by organ (e.g., cardiology or ophthalmology), and by disease category (e.g., oncology or infection). NIH Institutes are organized in a similar way. Many pharmaceutical companies are organized by therapeutic area. Yet genes, proteins, and pathways do not observe such separation. The impact of this “mental divide” in science has yet to be understood.

A. B. Jensen et al. have studied disease correlations and temporal disease progression (trajectories)41 on a large scale over 15 years, and grouped 1,171 significant trajectories into temporal patterns centered on a small number of early diagnoses that are central to disease progression. Hence it is important to focus on early diagnoses in order to mitigate the risk of adverse patient outcomes. The authors suggest such trajectory analyses may be useful for predicting and preventing future diseases of individual patients. Using data from the Cerner HealthFacts database, Tudor’s team has found that the top diseases prior to Alzheimer’s (over 5 years or more) are essential hypertension, hyperlipidemia, Type 2 diabetes mellitus, hypercholesterolemia, and coronary atherosclerosis. For renal failure, diseases over the previous five years are essential hypertension, heart failure, angina pectoris, chronic heart disease, and diabetes mellitus.

Diseases are concepts. They lack physical manifestation outside patients, so the search for cures has to be patient-centered.42 Animal models should be combined with mining of patient data. We ought to use electronic health record data to prioritize targets for further drug discovery. For example, we should get genes associated with diseases that precede Alzheimer’s to investigate possible causality. Such priorities could be disease-specific, or phenotype-specific.

It is time to acknowledge that target prioritization for drug discovery is precompetitive knowledge. The pharmaceutical industry reward system is based on patents, which are awarded for drugs, not targets. Finding a good target leads to the “me-too” phenomenon. It is time to pool resources together on targets, team up with Open Targets and create a Target Selection Consortium, partnering industry with academia. “Double blind” studies could be cosponsored, to avoid the reproducibility crisis. IDG KMC is seeking new knowledge.

David Winkler’s award address was co-authored by his colleague Frank Burden, now retired from CSIRO, and by co-workers at Imperial College London, King’s College London, and the University of Nottingham, whose work is acknowledged in the literature references.

David’s research concerns computational chemistry applied to a molecular level understanding of interactions of molecules and materials with biology. He has a strong interdisciplinary, translational research focus. His modeling, design, and optimization of bioactive materials focus on testing model predictions by subsequent experiments. He employs a range of computational tools including quantum chemistry, molecular dynamics and mechanics, molecular graphics, pharmacophore models, protein docking, and, in the case of this talk, quantitative structure-property relationship modeling. He is interested in the design of drugs and materials for therapeutic and regenerative medicine, especially control of stem cell fate, with a particular focus on the application of artificial intelligence (AI), machine learning, pattern recognition, complex systems science, evolutionary algorithms, and adaptive learning.

His work has had commercial impact, including the transfer of neural network modeling technology to BioRAD Corporation; several field trial candidates with Du Pont and Schering Plough; and clinical trials of a radioprotectant drug for cancer radiotherapy patients (with Sirtex and the Peter Mac Cancer Institute). He developed core intellectual property (a novel antibacterial target in the bacterial replisome) for the Betabiotics company spin-off, and discovered a new mechanism for strontium biomaterial-induced differentiation of mesenchymal stem cells to bone. He carried out a large project with Air Liquide Santé on using in silico methods to understand the surprisingly rich biological properties of noble gases. He discovered new antifibrotic and antihypertensive agents for Vectus Biosystems (allowing them to float on the stock market) and a first-in-class drug lead for myelofibrosis, which will be further developed by a new spin-off company soon.

Winkler’s research thinking was greatly influenced by complex systems science, which finds deep mechanistic similarities between areas of science that appear to have nothing in common. Concepts include nonlinear dynamical behavior, networks and their attractor states, self-organized criticality, chaos, and emergent properties. Complex systems science stimulates substantial lateral thinking and novel problem solving. Methods from other areas of science can provide novel solutions to problems in drug discovery; and methods developed for drug discovery can provide novel solutions to problems in other areas of science, such as biomaterials, gene expression, non-biological materials, and regenerative medicine.

QSAR was invented by Toshio Fujita (very recently deceased) and Corwin Hansch, and rapidly evolved into a method for optimization of drugs and agrochemicals. David and Toshio published a recent paper43 on the two forms of QSAR: “explain” and “predict”. Graham Richards’ and Peter Andrews’ seminal commercialization ventures influenced David to make translation a strong focus in his research.

The research for which David received the Skolnik award involved the application of modern computational and mathematical methods to optimizing the QSAR modeling process.44 The first operation is to generate descriptors. Model quality is critically dependent on descriptors: descriptors with low or no relevance to the property modeled degrade the model. Bad descriptors were a problem in early QSAR work, and there is still a major research need for good descriptors for materials. Next, a subset of descriptors is chosen for the model in a context-dependent way. Testing too many descriptor subsets can produce chance correlations. In generating the relationship between the descriptors and the target property, model quality is less dependent on the modeling algorithm than on the descriptors, but there can be issues with overfitting, overtraining, ambiguity in network architecture, and subjective choices. The next operation is validating the performance of the model in predicting properties of new data. Here, cross validation and bootstrapping generate optimistic measures of performance, and an independent test set not used in training is best. The final operation is making new predictions from the model and synthesizing and testing new materials.

Descriptors are the last major research problem for QSAR. Many (such as DRAGON descriptors) are arcane; efficient, interpretable descriptors are needed. Descriptors specific to complex materials are essential, but the field is embryonic. High throughput characterization data can augment computed descriptors.

There are advantages in removing irrelevant features. Least-squares fitting in multiple linear regression (MLR) corresponds to a Gaussian prior on the weights. This can be replaced with a Laplacian prior, which removes uninformative weights by driving them exactly to zero. Sparse Bayesian feature selection methods (feature selection using expectation maximization) identify a small number of relevant features very efficiently.45
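The sparsity-inducing effect of a Laplacian (L1) prior can be seen in a minimal coordinate-descent sketch. This is generic L1-penalized least squares, not the BRANNLP algorithm itself, and the data are made up; the point is that the soft-thresholding update sets weakly relevant weights exactly to zero rather than merely shrinking them.

```python
def lasso(X, y, lam, n_iter=200):
    """Coordinate descent for least squares with an L1 penalty (the MAP
    estimate under a Laplacian prior on the weights)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # residual with feature j's contribution removed
            r = [y[i] - sum(w[k] * X[i][k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            z = sum(X[i][j] ** 2 for i in range(n))
            if z == 0:
                continue
            # soft-thresholding: small correlations are zeroed out exactly
            if rho > lam:
                w[j] = (rho - lam) / z
            elif rho < -lam:
                w[j] = (rho + lam) / z
            else:
                w[j] = 0.0
    return w

# y depends only on the first feature; the irrelevant second feature is pruned.
X = [[1.0, 0.3], [2.0, -0.2], [3.0, 0.1], [4.0, -0.4]]
y = [2.0, 4.1, 5.9, 8.0]
w = lasso(X, y, lam=0.5)
print(w)  # first weight near 2, second weight exactly 0.0
```

With a Gaussian (L2) prior the second weight would shrink toward zero but never reach it; the Laplacian prior is what makes feature removal automatic.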

There are many methods of varying sophistication for finding structure-activity relationships,44 including simple linear statistical regression methods such as multiple linear regression; nonlinear regression methods using polynomials or nonlinear kernels; and nonlinear machine learning and bioinspired methods such as neural nets, support vector machines, and random forests. These have new applications in materials, nanotechnology, and regenerative medicine.

The universal approximation theorem states that neural networks can model any complex relationship given sufficient training data. Neural networks are very well suited to modeling of complex data, but they have problems such as overfitting and overtraining. They raise an ill-posed problem in statistics (instability), and optimum network architecture is ambiguous. The contribution of David and his co-workers is to develop very robust, self-optimizing sparse feature selection and neural network methods that overcome all these problems.46 These methods have been shown to have performance similar to that of deep neural networks.

Sparse Bayesian modeling and feature selection, replacing the Gaussian prior with the Laplacian prior, is a general nonlinear modeling method45,47-49 that automatically optimizes model complexity, prunes neural network weights to avoid overfitting, and prunes irrelevant descriptors to optimize the predictivity of a model. A sparsity-inducing Laplacian prior (LP) was introduced into Winkler’s Bayesian Regularized Artificial Neural Network algorithm (BRANN) creating BRANNLP.47,49 Low relevance weights are set to zero, and descriptors are also pruned from the model if all weights are zero.

From selection and mapping, David turned to validation. Cross validation, bootstrapping, and other methods give an overly optimistic estimate of predictive power because the test set is not independent of the training set. An independent test set never seen by the model is the gold standard. Many measures of predictivity have been proposed. Test set validation is actually a simple problem in statistics; the standard error of prediction for the test set (SEP) is preferred over r2 because it is less dependent on dataset size and model complexity.46,50
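The contrast between SEP and r2 can be illustrated numerically. The helper functions and data below are illustrative, using the common root-mean-square definition of SEP (some authors subtract the mean bias first): two test sets with identical prediction errors give the same SEP, while r2 changes with the spread of the response values.

```python
import math

def sep(y_true, y_pred):
    """Standard error of prediction on an independent test set:
    root-mean-square of the prediction residuals."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2(y_true, y_pred):
    """Coefficient of determination evaluated on the test set."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Two test sets with identical residuals (+/- 0.5) but different y ranges:
wide = ([0.0, 2.0, 4.0, 6.0], [0.5, 1.5, 4.5, 5.5])
narrow = ([0.0, 0.5, 1.0, 1.5], [0.5, 0.0, 1.5, 1.0])
for y_t, y_p in (wide, narrow):
    print(round(sep(y_t, y_p), 3), round(r2(y_t, y_p), 3))
# SEP is 0.5 in both cases, while r2 drops sharply for the narrow-range set.
```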

Methods from other areas of science can provide novel solutions to problems in drug discovery, and methods developed for drug discovery can provide novel solutions to problems in other areas of science. Implantable medical devices are an example. Bacterial adhesion and growth on biomaterial surfaces of joint prostheses, heart valves, shunts, vascular and urinary catheters, and intraocular lenses are serious problems in health care. There is a major unmet medical need for new coating materials for implantable and indwelling medical devices. David and his co-workers from Morgan Alexander’s research team at the University of Nottingham have used machine learning methods to derive quantitative models relating the molecular structure of a polymer to the attachment of the bacteria to that polymer surface. These models can be used to screen large databases of new materials for those with low pathogen attachment.

Hook et al. have detected the attachment of selected bacterial species to 576 polymeric materials in a high-throughput microarray format.51 In work by David and his colleagues, data from a large polymer microarray exposed to three clinical pathogens were used to derive robust and predictive machine learning models of pathogen attachment.52 The BRANN models can predict pathogen attachment for the polymer library quantitatively. The models also successfully predict pathogen attachment for a second-generation library, and identify polymer surface chemistries that enhance or diminish pathogen attachment. A manuscript on work on multiple pathogen attachment models has been submitted.

Sparse feature selection methods have also identified a new mechanism for strontium biomaterial-induced differentiation of mesenchymal stem cells to bone. Strontium ranelate (Protelos) is a drug approved in the European Union for the treatment and prevention of osteoporosis. It reduces risk of vertebral and non-vertebral fractures in post-menopausal women. Although controversial, it is reported to have an anabolic and anti-catabolic effect on bone. Strontium ion’s mechanism of action is not fully understood, but it is thought to up-regulate differentiation of osteoprogenitors or stimulate bone formation.53-55

David and his Imperial College co-workers,56 Molly Stevens, Eileen Gentleman, and Hélène Autefage, have evaluated the global response of human mesenchymal stem cells to strontium-substituted bioactive glasses using a combination of unsupervised biological and physical science techniques. Their objective analyses of whole gene-expression profiles, confirmed by standard molecular biology techniques, revealed that strontium-substituted bioactive glasses up-regulated the isoprenoid pathway, suggesting an influence on both sterol metabolite synthesis and protein prenylation processes.

In future, David hopes to see exploitation of new AI methods such as deep learning; improved descriptors for molecules that are effective and interpretable; exploitation of evolutionary methods of discovery aided by robotics; synergy of AI and evolutionary methods for adaptive evolution; adoption of in silico methods from drug discovery for materials and regeneration; development of autonomous or semiautonomous “closed loop” design methods; and more effective exploration of vast molecular or materials spaces.

Deep learning was predicted to be a breakthrough technology in 2013. Deep neural networks are not necessarily magic. According to the universal approximation theorem, a feed-forward network with a single hidden layer containing a finite number of neurons can approximate any continuous function, under mild assumptions on the activation function. This was first proved by Cybenko in 1989 for sigmoid activation functions. Hornik showed in 1991 that it is not the choice of the activation function, but the multilayer architecture itself which gives neural networks the potential of universal approximators.46
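The constructive intuition behind the universal approximation theorem can be sketched: pairs of steep sigmoids form narrow "bumps" that tile an interval, so a single hidden layer with enough neurons can approximate any continuous function. The network below is hand-built rather than trained, and the bump count and steepness are illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bump_network(f, a, b, n_bumps=50, steep=200.0):
    """Single-hidden-layer sigmoid network approximating a continuous f
    on [a, b]: each pair of steep sigmoids forms a narrow bump, weighted
    by f at the bump centre (a constructive, untrained illustration)."""
    h = (b - a) / n_bumps
    def net(x):
        total = 0.0
        for k in range(n_bumps):
            left, right = a + k * h, a + (k + 1) * h
            centre = 0.5 * (left + right)
            # step up at `left` minus step up at `right` ~ indicator of the bump
            total += f(centre) * (sigmoid(steep * (x - left))
                                  - sigmoid(steep * (x - right)))
        return total
    return net

approx = bump_network(math.sin, 0.0, math.pi)
errs = [abs(approx(x) - math.sin(x)) for x in [0.3, 1.0, 1.57, 2.5]]
print(max(errs))  # small; accuracy improves with more hidden units
```

Each bump uses two hidden neurons, so accuracy is bought with width rather than depth, which is exactly the point of the theorem.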

Deep learning methods have generated impressive improvements in image and voice recognition, and are now being applied to QSAR modeling. A recent publication46 describes the differences in approach between deep and shallow neural networks, compares their abilities to predict the properties of test sets for 15 large drug datasets, discusses the results in terms of the universal approximation theorem for neural networks, and describes how deep neural networks may ameliorate or remove troublesome “activity cliffs” in QSAR datasets. Materials space is vast, and at least in some of its many dimensions the fitness landscape is smooth. This allows adaptation, one step (one mutation) at a time. Evolution and machine learning can be combined in adaptive learning (the Baldwin effect).

A recent review discusses the problems of large materials spaces, the types of evolutionary algorithms employed to identify or optimize materials, and how materials can be represented mathematically as genomes.57 It describes fitness landscapes and mutation operators commonly employed in materials evolution, and provides a comprehensive summary of published research on the use of evolutionary methods to generate new catalysts, phosphors, and a range of other materials. Another recent paper describes the materials genome in action.58

In summary, AI tools developed for therapeutic medicine also work well for regenerative medicine. Neural networks are machine learning methods that are very applicable to (bio)materials design. The universal approximation theorem means that deep learning methods should not be superior to shallow neural networks for molecular design. Bayesian regularized neural networks can generate robust, predictive models of many types of materials and properties. Sparse Bayesian feature selection methods can reduce the dimensionality of problems, improve interpretability, and generate robust models with better predictivity. Evolutionary methods, combined with machine learning (adaptive evolution), can find effective materials quickly and efficiently.

Conclusion

Erin Davis, chair of the ACS Division of Chemical Information, formally presented the Herman Skolnik Award to David Winkler at a reception held in honor of David, following the symposium.