National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States
PubChem is a public repository for information on chemical substances and their biological activities. Since its launch in 2004, PubChem has been a key chemical information resource that serves scientific communities in many areas, including cheminformatics, chemical biology, and medicinal chemistry. Currently, PubChem contains more than 219 million depositor-provided chemical substance descriptions, 88 million unique chemical structures, and 229 million biological activity test results from more than one million biological assay records.
Many PubChem records include depositor-provided cross-references to scientific articles in PubMed. Some PubChem contributors provide bioactivity data extracted from scientific articles, which complement high-throughput screening (HTS) data from the concluded NIH Molecular Libraries Program (MLP) and other HTS projects. Some journals provide PubChem with information on chemicals that appear in their newly published articles, enabling concurrent publication of scientific articles in journals and associated data in public databases. In addition, PubChem provides links to patent information for chemicals, thanks to data contributions from a growing number of organizations, including IBM and SureChEMBL (formerly known as SureChem). Currently, PubChem offers links between about 6 million patent documents and more than 17 million unique chemical structures, with over 345 million chemical substance-patent links covering U.S., European, Japanese, and World Intellectual Property Organization patent documents published since 1800.
Literature and patent information in PubChem can be accessed through PubChem’s web interfaces, allowing users to explore information related to PubChem records beyond typical web search results. This information can also be accessed programmatically, enabling one to build a drug discovery pipeline that automatically checks existing literature and patent information for compounds of interest.
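A minimal sketch of such programmatic access, using PubChem's PUG REST service (the cross-reference endpoint shape follows the PUG REST documentation; verify the response format against the current documentation before relying on it):

```python
import json
import urllib.request

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def xref_url(cid, xref_type):
    """URL for a compound's cross-references, e.g. xref_type='PubMedID' or 'PatentID'."""
    return f"{PUG_REST}/compound/cid/{cid}/xrefs/{xref_type}/JSON"

def parse_xrefs(payload, xref_type):
    """Pull the identifier list out of a parsed PUG REST xrefs response."""
    info = payload["InformationList"]["Information"][0]
    return info.get(xref_type, [])

def fetch_xrefs(cid, xref_type, timeout=30):
    """Fetch and parse cross-references for one compound (network call)."""
    with urllib.request.urlopen(xref_url(cid, xref_type), timeout=timeout) as resp:
        return parse_xrefs(json.load(resp), xref_type)

# Example (requires network access): literature and patents linked to CID 2244 (aspirin).
# pubmed_ids = fetch_xrefs(2244, "PubMedID")
# patent_ids = fetch_xrefs(2244, "PatentID")
```

Looping `fetch_xrefs` over a compound list is the essence of the automated literature/patent check described above.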

Cambridge Crystallographic Data Centre, Piscataway, New Jersey, United States
Crystal structures provide insights into molecular shape and interactions that are relevant to a range of scientific domains. Some 50 years ago, the Cambridge Crystallographic Data Centre (CCDC) began abstracting published crystal structure sets from scientific literature into the Cambridge Structural Database (CSD) and developing software that enabled this to be searched and analyzed. Over time the software around the CSD has evolved into a platform of software applications and services that enables the knowledge embedded in over 820,000 crystal structures to be applied to compound and materials design.

Today, processes for collating and curating data are quite different from those of the early days, owing to the crystallographic community's adoption of robust digital data publishing workflows. Crystallography and the Cambridge Structural Database demonstrate the value of communities coming together to maximize the digital availability and accessibility of the data underpinning the scientific literature. Data access, however, is only as good as how easily the data can be queried. In this talk we present Cross-Miner, a new interactive 3D pharmacophore querying tool that searches across federated crystallographic databases (the CSD, in-house databases, and the PDB) to find not only matches within individual molecules, but also matching protein-ligand complexes.

Gilead Sciences, Foster City, California, United States
Chemogenomics databases capture chemical structures and activity values reported in the medicinal chemistry and patent literature.
While services like Reaxys and Thomson Reuters Integrity/Pharma offer their full content only online, databases like ChEMBL/SureChEMBL from the EBI and GOSTAR from GVK Bio offer their entire content as database dumps, SDF files, etc. The latter are valuable for applications like comprehensive structure-activity relationship (SAR) analysis for an entire target or target class, single- and multi-target activity model building, and idea-generation exercises like fragment/scaffold extraction (“privileged structures”).
GOSTAR covers both literature and patents, which are manually curated by experts; it is limited to 11 major target classes and requires a commercial license. ChEMBL is freely available and manually curated, but covers only the literature and a few other sources, such as PubChem BioAssay. Since its free release in 2010, it has become the most widely used chemogenomics database. More recently, SureChEMBL (formerly SureChem) has become freely available as a collection of automatically extracted patent chemical structures.
In this study we compare GOSTAR and ChEMBL/SureChEMBL in terms of coverage, overlap, and accuracy. What is the value added by the commercial database versus the publicly offered content? How are targets and target classes annotated, how well are the data normalized, and how much clean-up work is required for comprehensive model building and analysis?
Finally, an in-house query and annotation system was developed to access the GOSTAR and ChEMBL data.
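As an illustration of what such a query system does at its core, here is a minimal sketch against an in-memory mock of four ChEMBL-style tables (the column sets are heavily simplified and the rows are toy data, but the molecule-to-activity-to-assay-to-target join path mirrors the real schema):

```python
import sqlite3

# Simplified mock of the ChEMBL join path: molecule -> activity -> assay -> target.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE molecule_dictionary (molregno INTEGER PRIMARY KEY, chembl_id TEXT);
CREATE TABLE activities (molregno INTEGER, assay_id INTEGER,
                         standard_type TEXT, standard_value REAL, standard_units TEXT);
CREATE TABLE assays (assay_id INTEGER PRIMARY KEY, tid INTEGER);
CREATE TABLE target_dictionary (tid INTEGER PRIMARY KEY, pref_name TEXT);
""")
conn.executemany("INSERT INTO molecule_dictionary VALUES (?, ?)",
                 [(1, "CHEMBL25"), (2, "CHEMBL112")])
conn.execute("INSERT INTO target_dictionary VALUES (101, 'Cyclooxygenase-2')")
conn.execute("INSERT INTO assays VALUES (201, 101)")
conn.executemany("INSERT INTO activities VALUES (?, ?, ?, ?, ?)",
                 [(1, 201, "IC50", 120.0, "nM"), (2, 201, "Ki", 50.0, "nM")])

def activities_for_target(conn, target_name, standard_type="IC50"):
    """All measurements of one type against a named target, for SAR analysis."""
    return conn.execute("""
        SELECT m.chembl_id, a.standard_value, a.standard_units
        FROM activities a
        JOIN molecule_dictionary m ON m.molregno = a.molregno
        JOIN assays s ON s.assay_id = a.assay_id
        JOIN target_dictionary t ON t.tid = s.tid
        WHERE t.pref_name = ? AND a.standard_type = ?
        ORDER BY a.standard_value
    """, (target_name, standard_type)).fetchall()

rows = activities_for_target(conn, "Cyclooxygenase-2")
```

Pointing the same query at a loaded ChEMBL or GOSTAR dump, after the normalization and clean-up work discussed above, is the comprehensive-SAR use case these dumps enable.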

10:15am-10:30am

Intermission

10:30am-11:05am

CINF 4: Exploring available compound data with the open PHACTS discovery platform and KNIME

University of Vienna, Vienna, Austria
The Open PHACTS project [1] integrates several public databases, making it possible to answer questions relevant to research in the drug-discovery process [2]. The data collected in the project can be accessed with web tools such as the Open PHACTS Explorer (www.openphacts.org/explorer); in a drug-discovery project, however, this might be just another database in which to search for information. If the aim is to combine data from the Open PHACTS Discovery Platform with in-house data or other specialised data sources, workflow tools are a very convenient option. An example of combining public data with a manually curated dataset was published recently [3].
Here, we present possibilities to access the Open PHACTS Discovery Platform from within a KNIME workflow that can be used at the beginning of a drug-discovery project. It returns available data for a list of compounds and similar molecules, which can be used to prioritize molecules for follow-up.
Information collected for the molecule includes function and toxicity annotations (from DrugBank), the role of the molecule (from ChEBI), biological pathways containing the molecule (from WikiPathways), and patents mentioning the compound (from SureChEMBL). In the next step, proteins against which the molecule is reported in ChEMBL to be active are returned, and the connections of those proteins to biological pathways (from WikiPathways) and to diseases (from DisGeNET) are shown. The data from all these sources are retrieved via the Open PHACTS API, which easily connects the identifiers used in the different databases. Links to the original data sources are retained, allowing manual curation of the collected associations. Additionally, external data sources or in-house data can be added.
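Behind such workflow nodes, each retrieval step reduces to a parameterized HTTP request. The sketch below is an assumption-laden illustration only: the base URL, endpoint path, parameter names, and the app_id/app_key scheme are placeholders that must be checked against the current Open PHACTS API documentation and your registered account.

```python
import urllib.parse

# Placeholder base URL -- take the real value from the Open PHACTS API docs.
BASE = "https://beta.openphacts.org/2.1"

def pharmacology_url(compound_uri, app_id, app_key, page_size=50):
    """Build a request URL for pharmacology records about one compound URI.

    compound_uri identifies the compound in the platform's linked-data terms;
    app_id/app_key are the per-account API credentials.
    """
    params = urllib.parse.urlencode({
        "uri": compound_uri,
        "app_id": app_id,
        "app_key": app_key,
        "pageSize": page_size,
        "_format": "json",
    })
    return f"{BASE}/compound/pharmacology/pages?{params}"
```

A KNIME workflow would issue one such request per compound in the input list and fan the parsed JSON out to the downstream annotation nodes.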

School of Medicine, UCSD, La Jolla, California, United States
NDEx, the Network Data Exchange (www.ndexbio.org), is an open-source project to enable scientists and organizations to share, store, manipulate, and publish biological network knowledge. NDEx can aid in informed compound design as a channel for programmatic access to knowledge about biological mechanisms and their interactions with compounds. It also provides a software framework to enable compound design applications using or producing networks. The NDEx public site is already serving as a publication channel both for compound-protein interaction networks and for biological networks describing molecular- and phenotypic-level information. For practical application in drug development, NDEx was developed with features that promote bridging between the academic and commercial communities, presenting a layered strategy in which users can control access to the networks they store and in which organizations may run private NDEx servers. In this presentation, we will explore examples in which NDEx content and software are used to link compounds to mechanism and phenotype, assembling information relevant to the compound design process.

CINF 6: Learning to find the right information: A survey of chemistry information literacy in the undergraduate classroom

Thibault Geoui, t.geoui@elsevier.com

Marketing, Elsevier, Frankfurt, Hesse, Germany
As part of today's undergraduate training, chemistry students are asked to radically change the way in which they interact with information. Search and use strategies that have served them well in the past are a poor match for the structure of scientific information and hinder their development as scientists. Over the course of three months, we informally discussed needs in information literacy training with librarians, faculty, and teaching staff at various undergraduate institutions. We encountered a range of approaches to building information literacy into department curricula, and just as broad a range of opinions about what makes it so difficult to successfully teach information retrieval and use skills. From these conversations, we have constructed an initial list of best practices, which we aim to improve as we collect more input from successful and unsuccessful experiences in the classroom.

1 Clark Library, Cornell University, Ithaca, New York, United States; 2 Dept of Env Hlth Safety, Keene State College, Keene, New Hampshire, United States
The 2015 edition of the American Chemical Society’s Guidelines and Evaluation Procedures for Bachelor’s Degree Programs identifies six skill sets that undergraduate chemistry programs should instill in their students. In our roles as support staff for chemistry departments at two different institutions, we have been collaboratively studying these requirements and have found significant synergies between two in particular: “Chemical Literature and Information Management Skills” and “Laboratory Safety Skills”. We believe that by integrating emerging tools in the laboratory safety field into information literacy frameworks, a strong foundation can be established for the development of all the skills called out by the ACS. This presentation describes this strategy and provides examples of how these concepts can be implemented in both the chemistry teaching and research laboratory settings.

Chemistry & Biochemistry, Calvin College, Grand Rapids, Michigan, United States
Spectrophotometric titrations are a simple and powerful way to thermodynamically characterize multicomponent systems. The data are relatively easy to obtain, but proper analysis requires appropriate computer programs along with the requisite training. SIVVU is one of several programs capable of such analyses and, after years of development, is now available through the internet. Designed from a chemist’s perspective, it employs singular value decomposition to analyze the mathematical factor structure of pertinent datasets. More importantly, it can then model the data according to user-provided chemical reactions to determine the spectroscopic signatures and binding constants for the system. All data are uploaded through a single Microsoft Excel spreadsheet, and outputs are written back to the same file. Most functionalities are available free of cost, making it ideal for implementation in undergraduate chemistry laboratories.

'The name says it all',

SIVVU takes a spectrophotometric dataset and deconvolutes it into a set of molar absorptivity curves and equilibrium concentration profiles.
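The first step of that deconvolution can be sketched with a synthetic Beer's-law dataset (absorbance = concentration profiles times molar absorptivity curves); the singular values of the absorbance matrix reveal how many independent absorbing species the data support. The curve shapes, noise level, and threshold below are illustrative only, not SIVVU's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
wavelengths = np.linspace(400, 700, 50)

# Two absorbing species with Gaussian molar absorptivity curves (illustrative).
E = np.stack([np.exp(-(((wavelengths - center) / 40.0) ** 2)) for center in (480, 600)])

# A 1:1 titration: species 1 is consumed as species 2 forms.
titrant = np.linspace(0.0, 1.0, 20)
C = np.stack([1.0 - titrant, titrant], axis=1)

# Beer's law in matrix form (20 spectra x 50 wavelengths), plus measurement noise.
A = C @ E + rng.normal(scale=1e-4, size=(20, 50))

# The number of singular values above the noise floor estimates the number of
# independent absorbing species -- the "mathematical factor structure" of the data.
s = np.linalg.svd(A, compute_uv=False)
n_species = int(np.sum(s > 1e-2))
```

Fitting the factors to user-provided chemical reactions, as SIVVU then does, recovers chemically meaningful absorptivity curves and binding constants from this abstract factor structure.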

9:20am-9:30am

Intermission

9:30am-9:50am

CINF 9: Integration of cheminformatics material into the STEMWiki hyperlibrary

1 Department of Chemistry, University of Arkansas at Little Rock, Little Rock, Arkansas, United States; 2 Department of Chemistry, University of California, Davis, Davis, California, United States; 3 Department of Chemistry, University of Arkansas at Little Rock, Little Rock, Arkansas, United States
This presentation will describe a project to contribute cheminformatics educational material to the STEMWiki Hyperlibrary. The Hyperlibrary consists of multiple interconnected and independently operating STEMWiki hypertext applications (ChemWiki, BioWiki, MathWiki, StatWiki, GeoWiki, PhysWiki), and is a collaborative platform that enables dissemination and evaluation of new education developments and approaches, with an emphasis on data-driven assessment of student learning and performance. The contents of these STEMWikis are both horizontally (across multiple fields) and vertically (across multiple levels of complexity) integrated within a massively interconnected network that provides, not just single textbooks, but an infinitely large Hyperlibrary through which interconnected STEM textbooks can be built.

This enables reuse or repurposing of material across the curriculum, and an objective of this project is to create a cheminformatics hypertextbook using material generated in the Fall 2015 Cheminformatics OLCC as the initial nucleus for cheminformatics educational content generation. Much of this material is in the form of Teaching and Learning Objects (TLOs), like YouTube Videos and modular assignments, which by their nature can be directly integrated into other STEM textbooks within the STEMWiki Hyperlibrary. The objective is not only to create a place for a community to contribute to a cheminformatics hypertextbook, but to do so in a way that enables integration of cheminformatics TLOs into other hypertextbooks of the chemistry curriculum. This presentation will describe the Cheminformatics OLCC and the integration of OLCC content into the STEMWiki Hyperlibrary.

Chemistry, Haverford College, Haverford, Pennsylvania, United States
Cheminformatics is an intrinsically interdisciplinary field, and most “cheminformaticians” began as either computer scientists or chemists. The Liberal Arts college environment presents unique opportunities to teach cheminformatics to undergraduate students in a way that mimics the evolution of this discipline. As such, we must also question what is meant by the ‘classroom’; with the inclusion of undergraduate thesis research projects carried out from both computer science and chemical perspectives, which are often notionally components of the same over-arching cheminformatics research, and of ‘self-study’ modules, the line of what constitutes a ‘classroom’, and more specifically a ‘chemistry classroom’, becomes ever more blurred. To illustrate these different aspects, I will discuss different approaches to teaching cheminformatics: the inclusion of aspects of cheminformatics in computer science courses (one, for instance, centered around relational database schema design and implementation), joint supervision of dual-major students, the use of self-study courses to help students enter this vibrant interdisciplinary field, and the use of cheminformatics as a component of more ‘traditional’ chemistry theses. A discussion of the experiences of a sample of the students and staff involved will be included to highlight the experience of cheminformatics in a Liberal Arts environment for undergraduates, and to highlight potential areas of improvement moving forward.

10:10am-10:30am

CINF 11: Cheminformatics education and research at home: the best way to teach graduate chemistry in the professional community

Hao Zhu, hao.zhu99@rutgers.edu

Chemistry Department, Rutgers University, Camden, New Jersey, United States
A major goal of regional universities is to serve a significant number of part-time students. At Rutgers-Camden, for example, 50% of graduate students (764 of 1,509) were part-time as of the end of 2015. Most of these graduate students hold full-time jobs during the day and can only use their free time to take the necessary courses and even carry out research projects. What these students urgently need is not only a flexible course schedule (e.g., courses in the evenings) but also the ability to complete most of their research work off campus. Over the past decades, many cheminformatics tools have been developed, and most of them are publicly available through the internet. Since I started a new cheminformatics class in the graduate school at Rutgers-Camden, more than 40 graduate students have enrolled in the past four years. Although these students are still required to attend the cheminformatics lectures on campus, they are able to finish most of the assignments at home. Furthermore, five students chose to perform research in the cheminformatics area toward their master's degrees. They completed almost all of their research at home, in their free time, using publicly available cheminformatics tools. These efforts resulted in four research papers in peer-reviewed scientific journals. These cheminformatics studies, performed largely at home, greatly advanced the students' careers and also strengthened the newly developed graduate program at Rutgers-Camden.

University of Arkansas at Little Rock, Little Rock, Arkansas, United States
In the Fall of 2015, four campuses participated in a hybrid (face to face and online) intercollegiate course in cheminformatics, the Cheminformatics OLCC. The purpose of this course, which was collaboratively taught with online guest lecturers and onsite faculty facilitators, was to enable the presentation of chemical subjects that are typically not available in the undergraduate curriculum due to their specialized nature. The course was structured around a series of modules covering different topics in cheminformatics, and in addition to module-specific assignments, each student developed their own project. Many of these projects were presented during a symposium of the Spring 2016 ACS National Meeting. This presentation will be by a member of the 2015 class who currently manages an analytical lab and has extended work on his class project into a subject of graduate study. The first part of the presentation will deal with the student’s perspective of the Cheminformatics OLCC as a distributed, collaboratively-taught hybrid course, and the second will focus on the student’s project of using databases to validate information within Wikipedia, specifically, information related to chemical hazards.

Wikipedia is a 21st Century Information Source of the People, by the People and for the People, and is globally the 7th most-visited site on the Internet. It includes a variety of information about chemicals and chemical processes. However, the open-access, crowdsourced nature of Wikipedia leads to new types of information literacy challenges; addressing these challenges fits the meeting theme of “Chemistry of the People, by the People and for the People”. How can the People trust the chemical information within Wikipedia?

This presentation will describe the student’s work in developing electronic systems to validate chemical safety information using the structure of the Wikipedia Chembox, which models an RDF triple, to compare that information to values within more authoritative databases, specifically those collected by PubChem. Currently, it appears that only chemical identifiers are validated and this work seeks to assess the practicality and value of extending Chembox validation to other high value parts of the Chembox, such as the safety and hazard information found there.
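A minimal sketch of such a validation step follows. The PUG REST property endpoint is real, but the property names used and the mapping from Chembox fields are illustrative assumptions; a working validator would need a proper Chembox parser and a curated field-to-property mapping.

```python
import urllib.parse

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def property_url(name, props=("MolecularFormula", "InChIKey")):
    """PUG REST URL for authoritative property values of a compound, by name."""
    return (f"{PUG_REST}/compound/name/{urllib.parse.quote(name)}"
            f"/property/{','.join(props)}/JSON")

def chembox_mismatches(chembox, pubchem_props):
    """Fields present in both records whose values disagree -> flag for review."""
    return {field: (chembox[field], pubchem_props[field])
            for field in chembox
            if field in pubchem_props and chembox[field] != pubchem_props[field]}

# Toy records standing in for a parsed Chembox and a parsed PUG REST response.
chembox = {"MolecularFormula": "C9H8O4", "InChIKey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"}
pubchem = {"MolecularFormula": "C9H8O4", "InChIKey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"}
disagreements = chembox_mismatches(chembox, pubchem)
```

Extending the same comparison to safety and hazard fields, against the authoritative values PubChem collects, is the direction this project proposes.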

Department of Chemistry, Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States
Learning the nomenclature of organic molecules and the chemical reactions they undergo is not a trivial task. Nor is teaching molecular modeling concepts such as structure-based virtual screening and quantitative structure-activity relationships. Meanwhile, cheminformatics software tools that help with understanding and accomplishing all of the aforementioned tasks have never been so easily accessible. Chemistry majors should therefore be exposed to and trained in these concepts and software as early as possible. This presentation will start with a brief review of several cheminformatics software programs used in the Organic Chemistry (CH221) classroom at NC State to help students (i) draw molecules and chemical reactions in 2D, (ii) visualize complex molecules (e.g., stereoisomers, conformers of cyclohexane) in 3D, and (iii) practice their knowledge of molecular nomenclature and chemical reactions. Publicly available tools used in CH221 will be put in perspective with commercial educational platforms such as Sapling and Connect, and their numerous complementarities will be highlighted. Second, we will present the computer-aided molecular design class (CH795), which aims to familiarize graduate students in the chemistry PhD program at NC State with the concepts of molecular descriptors, QSAR modeling, structure-based docking, virtual screening, molecular dynamics simulations, and HTS data analysis. Software tools such as KNIME, AutoDock Vina, PyMOL, Chimera, and NAMD used in CH795 will be briefly reviewed. Finally, new technological trends (e.g., mobile devices, virtual reality) will be introduced as a perspective for improving the way we use and teach cheminformatics in the chemistry classroom.

1 Optibrium Ltd, Cambridge, Massachusetts, United States; 2 MCPHS University, Worcester, Massachusetts, United States
A 5-week long laboratory exercise has been incorporated into the Pharmaceutical Sciences graduate program syllabus at MCPHS University to simulate an early stage hit-to-lead and lead optimization in a drug discovery program. Students use the StarDrop™ cheminformatics software package from Optibrium Ltd., to guide the selection and design of compounds with an optimal balance of properties, together with publicly available datasets downloaded from the European Molecular Biology Laboratory (EMBL) Neglected Tropical Disease website. The laboratory simulation exercise provides a much-needed hands-on experience related to complex topics normally only discussed in theory, including mining primary screening data, predictive modelling and drug metabolism, and provides the students with practical experience utilizing modern cheminformatics software.

1 MedChemica Limited, Macclesfield, United Kingdom; 2 MedChemica Ltd, Macclesfield, United Kingdom; 3 Pharmacy and Biomolecular Sciences, Liverpool John Moores University, Liverpool, United Kingdom
The remorseless increase in the cost of drug discovery requires medicinal chemists to generate compounds with properties acceptable for in vivo testing as efficiently as possible. One approach to this problem is to extract and record medicinal chemistry knowledge from measured data. The vast size of medicinal chemistry space, the global research efforts in compound design, and the intrinsically complex nature of drug-sized molecules make the manual capture of such knowledge increasingly challenging. An automated approach based on advanced matched molecular pair analysis, combining two algorithms and capturing the local chemical environment, will be presented. Case studies showing how such knowledge has been used to solve problems will be shared.
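The core bookkeeping behind any matched-pair analysis can be sketched in a few lines: records of (molecule, invariant context, variable R-group, measured property) are grouped by shared context, and each within-group pair yields a transformation rule with an observed property change. The fragment strings below are toys; a real implementation cuts bonds with a cheminformatics toolkit and, as the abstract notes, also encodes the local chemical environment around the cut.

```python
from collections import defaultdict
from itertools import combinations

def mmp_rules(records):
    """records: (mol_id, context, r_group, prop_value) tuples.
    Returns {(r_from, r_to): [property deltas]} over all matched pairs."""
    by_context = defaultdict(list)
    for mol_id, context, r_group, value in records:
        by_context[context].append((mol_id, r_group, value))
    rules = defaultdict(list)
    for members in by_context.values():
        for (m1, r1, v1), (m2, r2, v2) in combinations(members, 2):
            if r1 != r2:                         # same scaffold, different R-group
                rules[(r1, r2)].append(v2 - v1)  # e.g. delta logD, delta clearance
    return dict(rules)

# Toy data: two scaffolds, each observed with an H -> F swap.
records = [
    ("m1", "scaffoldA", "H", 2.0), ("m2", "scaffoldA", "F", 1.5),
    ("m3", "scaffoldB", "H", 3.0), ("m4", "scaffoldB", "F", 2.5),
]
rules = mmp_rules(records)
```

Statistical testing over each rule's list of deltas then separates transferable medicinal chemistry knowledge from noise.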

Discovery Chemistry, Genentech, South San Francisco, California, United States
Through many years of drug discovery effort, pharmaceutical companies have accumulated large sets of in vitro ADME data. Useful knowledge can be extracted from these data using matched molecular pair (MMP) analysis and statistical testing.
This talk will describe how different pharmaceutical companies share this knowledge without disclosing molecular structures, and how we extracted knowledge from shared data sets using matched molecular pair analysis (MMPA).

We have recently started to make the chemical universe much more navigable by generating a new virtual chemistry space: 58 robust chemical reactions (Hartenfeller et al., 2011), 42 reactions from a previous fragment space, and 21 textbook chemistry reactions were collated and, together with building blocks from trusted vendors, used to generate a virtual chemistry space containing 16,314,207,184,647,693 molecules (more than 16 quadrillion compounds!). All of these virtual molecules have a high likelihood of straightforward synthetic access. Together with this literature-derived collection of compounds, we provide a unique search method capable of handling such vast numbers of molecules relatively easily.
Imagine de novo design of (a) hit-expansion libraries, (b) follow-up series, and (c) fragment evolution from within a fully accessible, gigantic compound space. We will demonstrate how this can become reality.
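To make the combinatorics concrete, a toy enumeration (pure Python, with strings standing in for building blocks and reaction products) shows how the space size grows as the product of building-block pool sizes, summed over all reactions; this is how a few dozen reactions and large vendor catalogs reach quadrillions of virtual products.

```python
from itertools import product

# Schematic only: real building blocks are molecules and each "reaction" is a
# chemical transformation applied to them; here strings stand in for both.
acids  = [f"acid_{i}" for i in range(3)]
amines = [f"amine_{j}" for j in range(4)]

def enumerate_products(reaction, *pools):
    """Lazily enumerate every product a reaction can form from its pools."""
    for combo in product(*pools):
        yield f"{reaction}({', '.join(combo)})"

# One reaction over two pools: 3 * 4 virtual products; the full space is
# this product of pool sizes summed over every reaction in the collection.
library = list(enumerate_products("amide_coupling", acids, amines))
```

Because the enumeration is lazy, a search method can traverse such a space without ever materializing it, which is essential at quadrillion scale.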

Baylor College of Medicine, Houston, Texas, United States
Data integration is essential to overcome the logjam of data and publications. Here, we show a set of approaches that twin networks with evolution.

A first example integrates gene interaction networks in hundreds of species by eliminating redundant evolutionary relationships. This enables novel functional predictions, including for the essential but functionally uncharacterized malarial antigen EXP1. We find that EXP1 is a glutathione S-transferase (GST) that efficiently degrades cytotoxic hematin and is potently inhibited by artesunate. Thus, EXP1 is a possible molecular target of a frontline antimalarial drug.

A different example focuses on reasoning over the literature. A Knowledge Integration Toolkit (KnIT) automatically and scalably mines 25 million public PubMed abstracts to suggest novel protein kinases that phosphorylate the tumor suppressor p53. Focusing on a top candidate of pharmaceutical interest, we found that this protein phosphorylates p53 at Ser315, in vitro and in vivo, and functionally inhibits p53. This study demonstrates that automated reasoning over the entire literature generates falsifiable, novel and useful molecular hypotheses that test true and accelerate scientific discovery.

The last example aims to personalize networks by quantifying the impact of individual mutational variations. This impact depends on the unique context of each mutation, which is complex and often cryptic. Modeling evolution as a continuous and differentiable mapping from genotype to phenotype yields a formal equation for the Evolutionary Action (EA) of coding mutations on fitness, the terms of which are readily computable. Mutational, clinical, and population genetic evidence show, respectively, that this Evolutionary Action equation predicts the effect of point mutations in vivo and in vitro in diverse proteins, correlates disease-causing gene mutations with morbidity, and determines the frequency of human coding polymorphisms.
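In our reading of this formalism (treat the exact notation as an assumption; the symbols follow the published Evolutionary Action work), a mutation is a perturbation of the genotype, and its fitness effect follows from the first-order expansion of the genotype-phenotype mapping:

```latex
% Evolution modeled as a differentiable mapping f from genotype \gamma
% to fitness phenotype \varphi, i.e. \varphi = f(\gamma).
% A point mutation perturbs the genotype by \Delta\gamma, so to first order
% its Evolutionary Action on fitness is
\Delta\varphi \approx \nabla f \cdot \Delta\gamma
% where \nabla f, the evolutionary gradient, is estimated from the
% evolutionary importance of the mutated sequence position, and
% \Delta\gamma from the magnitude of the amino-acid substitution.
```

Both factors are computable from sequence data alone, which is what makes the equation's terms "readily computable" across diverse proteins.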

Together, these early studies suggest a broad integrative network formalism that unifies structured and unstructured data and that can be personalized to an individual's relevant mutational variations.

1 School of Informatics and Computing, Indiana University, Bloomington, Indiana, United States; 2 School of Medicine, University of New Mexico, Albuquerque, New Mexico, United States
Science is illuminated first and foremost by knowing the knowns, traditionally from the peer-reviewed scientific literature. However, this model has been strained and recast by the advent and evolution of the WWW. Informatics and data science have emerged as hybrid disciplines combining library science, computer science, and domain knowledge, as in cheminformatics and bioinformatics. The scale of traditional publication has exceeded the capacity for human consumption, while alternate online publication modes grow, improve, and surpass the old ways. Automated text mining, programmable web standards such as REST APIs, machine learning, community semantics through ontologies and vocabularies, and knowledge-graph-based systems are some of the emerging technologies. In this talk we discuss projects that illustrate such emerging methods for knowledge processing and discovery. Chem2Bio2RDF, from Indiana University, is an integrated system of public datasets converted to RDF, all relevant to chemical biology and drug discovery. Since its release in 2010, several applications have been developed using Chem2Bio2RDF, including SLAP (Semantically Linked Association Prediction) for missing-link prediction of compound-target activity. Chem2Bio2RDF combines 25 datasets from 16 sources, employing formal semantic formats and novel linkages based on domain expertise. An ongoing major upgrade reflects the profound advances in methods and resources over the last few years. The Target Central Research Database (TCRD), developed at the University of New Mexico, is the main repository for the Illuminating the Druggable Genome Knowledge Management Center (IDG-KMC) and serves as the primary source for the related web portal, Pharos (pharos.nih.gov).
TCRD integrates diverse datasets relevant to the druggable genome in a platform for data integration and analytics, with data types including proteins, compounds, text-mined bibliometric associations, gene expression, disease, phenotype and pathway associations, bioactivity, and drug interactions. Chem2Bio2RDF and TCRD are exemplars of a new genre of knowledge resource, harnessing emerging methods and resources to offer unique discovery opportunities in drug discovery.

Syracuse University, Syracuse, New York, United States
Research libraries assist scholars in demonstrating the value of their scholarly output through citation metrics and other measures. As the forms of scholarly communication change, so do the metrics used for assessing them, and the services libraries offer must evolve in concert with these changes. This talk will provide a general overview of the ways in which altmetrics complement traditional citation metrics and will explore how libraries can benefit from engaging with a broader set of metrics to reach a wide range of users. The talk will cover the roles librarians can play in helping researchers and institutions understand the benefits and limitations of these metrics and will discuss how altmetrics are being used in library discovery services, how they are represented in institutional repositories, and how they can drive collection development and other library decisions.

National Information Standards Organization (NISO), Baltimore, Maryland, United States
For decades, the coin of the realm in scholarly assessment has been citations. But as content has moved to digital distribution, the variety of ways in which activity with and related to scholarship can be tracked and reported has grown rapidly. Understanding what these data streams are, how they can be aggregated, and how they correlate with traditional metrics are all important elements of a network of assessment that can be trusted. Over the past three years, the National Information Standards Organization has been exploring these issues and is putting forward a set of recommended practices related to new forms of assessment. This talk will cover the ways in which the community is beginning to use non-traditional metrics and how the NISO recommendations will support a network of trust around these metrics.

Mendeley, Mountain View, California, United States
Managing attention becomes ever more important as the rate at which information is produced grows. One way to do this is to focus on a few well-curated sources; another, which puts the reader and the author more in control, is to use metrics to filter a broader range of sources. This presentation will cover the kinds of metrics that are useful for discovery of scholarly content, using chemistry-focused examples from Mendeley and Scopus, and will also discuss how the scholarly community is dealing with challenges such as the trust and reliability of data sources.

2:35pm-2:55pm

CINF 24: Is that a wart or a beauty mark? An altmetrics analysis of an assistant professor’s scholarly activity

Matthew Hartings1, hartings@american.edu, Rachel Borchardt2

1 Chemistry, American University, Gaithersburg, Maryland, United States; 2 American University, Washington, District of Columbia, United States
In an effort to better quantify the research impact of practicing scientists, a number of alternative metrics (altmetrics) have been developed to complement traditional metrics such as the h-index and journal impact factor. The problem for chemical researchers (magnified for early-career independent chemists) is understanding how and when to employ altmetrics. Specifically, a chemist must know what each metric is supposed to measure and must understand what its analysis can and cannot reveal. Only then can a researcher decide whether a particular altmetric can and should have sway over decisions that affect their career (made by deans, provosts, journal editors, and funding agencies). For this talk, a scholar of alternative metrics (Rachel Borchardt) has performed an analysis of the scholarly activities of a pre-tenure faculty member (Matthew Hartings). Matthew will discuss Rachel’s findings and try to put each metric into a proper and strategic perspective, discussing along the way whether the analysis notices a beauty mark or a blemish (or perhaps a little of both) on his résumé.

Nature Chemistry, Cottenham, United Kingdom
Journals such as Nature Chemistry (and many others) are based on a model in which a large proportion of the submissions that they receive are ultimately declined for publication. The role of an editor working on one of these selective titles is to try to gauge just how interesting or significant (or useful) the papers that it receives are potentially going to be. This is not an easy task (and certainly a thankless one at times), especially when you consider that it is often still difficult to measure what the impact of any particular paper is in the years *after* it has been published. Do editors select papers based on citation potential or the past performance of papers on that topic (or indeed the same set of authors)? Is it even possible to predict if a paper will be a citation blockbuster? And exactly what do citations measure anyway? As editors, we do also look at other metrics to see how much attention papers are getting, but even then, if a paper is all over Twitter, is that a good thing? (Hint: maybe, but maybe not). And surely a large number of page views suggests that a paper is quite popular? (Hint: yes, but have you considered why?). This will be an editor's take on metrics, what their value is, and what they do or don't mean.

US Department of Energy, Advanced Research Projects Agency – Energy (ARPA-E), Washington, District of Columbia, United States
The U.S. Department of Energy’s Advanced Research Projects Agency–Energy (ARPA-E) was established in 2009 to fund early-stage transformational energy technologies that are too risky for private-sector investment alone. ARPA-E’s investment portfolio aims to generate options to address specific energy challenges that could provide dramatic benefits for the nation. The Agency is beginning to see real commercial impact in the areas that contribute to ARPA-E’s mission of promoting a more secure, affordable, and sustainable American energy future.
ARPA-E invests in a range of different technologies across the energy spectrum, such as renewable energy production and storage, energy efficiency, and biomass. The investment topics vary by year as ARPA-E identifies high-potential, high-impact technology “white spaces”. Selection metrics include the transformative nature of the technology, its potential impact on ARPA-E’s energy goals, its potential environmental impact, and the potential for the project to yield commercial applications that benefit U.S. economic and energy security.
Within ARPA-E, the role of the Technology-to-Market group is to maximize the deployment, and ultimately the impact, of funded projects. This includes supporting project teams in their commercialization efforts and positioning them to be attractive to private or public investment that can carry them through to deployment.
Ideally, metrics of success would be directly tied to ARPA-E’s goals, such as a quantified reduction in greenhouse gas emissions or in the amount of imported oil. However, since these metrics are slow to materialize, ARPA-E must use intermediary metrics such as private follow-on funding, new companies formed, and post-project government partnerships to quantify success, on the assumption that they predict impact on ARPA-E’s mission.

Elsevier, Amsterdam, Netherlands
There continues to be much discussion about the responsible use of research metrics, and the value that they can offer. This presentation will cover current best practice, and discuss how a basket of carefully chosen metrics can be used to support human judgment in decision making. We will look at the importance of combining novel metrics, such as altmetrics, with more familiar metrics to enable benchmarking that provides input into a wide range of activities.

National Center for Computational Toxicology, U.S. Environmental Protection Agency, Research Triangle Park, Durham, North Carolina, United States
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program integrates advances in biology, chemistry, and computer science to help prioritize chemicals for further research based on potential human health risks. This work involves computational and data-driven approaches that integrate chemistry, exposure, and biological data. We have delivered public access to terabytes of open data, as well as to a large number of publicly accessible databases and applications, to support the research efforts of a large community of scientists. Many of our contributions to science are summarized in research papers, but to date we have not optimized our contributions to inform the altmetrics statistics associated with our work. Critically missing from altmetrics is access to our numerous software applications and web-service accesses, as well as the growing importance of our experimental data and models (e.g., ToxCast, ExpoCast, DSSTox, and others) to the scientific and regulatory communities. This presentation will provide an overview of our efforts to more fully understand, and quantify, our impact on the environmental sciences using a combination of our own measurement approaches and available altmetrics tools. This abstract does not reflect U.S. EPA policy.

4:30pm-4:50pm

CINF 29: Altmetrics: What has been the impact on ACS Publications?
Jeff Lang, j_lang@acs.org

ACS, Washington, District of Columbia, United States
In the spring of 2016, ACS Publications placed Altmetric badges on individual articles. This talk examines how readers and authors have used and reacted to this new source of information about an article's impact. We will analyze the data and take feedback from the audience on the value of these features on the ACS Publications website.

1 Center for Computational and Integrative Biology, Rutgers University, Camden, New Jersey, United States; 2 Shandong University, Jinan, China; 3 Department of Chemistry, Rutgers University, Camden, New Jersey, United States
Drug delivery using nanomaterials (e.g., nanoparticles) is a promising way to achieve cell recognition. Folate receptors (FRs), which are overexpressed in many human cancer cells, have been used as ideal targets for the treatment of cancer and inflammatory diseases for several decades. In this study, we compiled a dataset of 30 mono-ligand gold nanoparticles (GNPs) and 30 dual-ligand GNPs with cell recognition and uptake data against four human cancer cell lines that express different levels of FR. Quantitative nanostructure-toxicity relationship (QNTR) models were developed using this dataset. Specifically, we simulated the surface chemistry of the GNPs by constructing virtual nanoparticles with various surface ligands. The receptor-binding affinities correlate with important surface properties (e.g., surface shape, electron density, etc.), which can be calculated from the virtual nanoparticles. Various modeling approaches (e.g., random forest, support vector machine, etc.) were applied to the resulting surface chemical descriptor set using a ten-fold cross-validation procedure. The modeling results clearly indicate the relationship between nanostructure (i.e., GNPs with different ligands) and cell recognition. The validated models can be used to design new GNPs with desired cell recognition, and the developed virtual nanoparticle model can be used to evaluate other nanotoxicity endpoints for new nanomaterials.
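The abstract does not include implementation details, but the ten-fold cross-validation protocol it describes can be sketched in plain Python. The fold-splitting logic below is generic and illustrative; in practice a library such as scikit-learn would supply both the splitter and the random forest or support vector machine learners applied to the descriptor set.

```python
import random

def k_fold_splits(n_samples, k=10, seed=42):
    """Shuffle sample indices and yield (train, test) index lists for k-fold CV."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # round-robin assignment keeps folds balanced
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# For a 60-particle dataset of the size described, each of the 10 folds
# would serve once as the held-out test set.
splits = list(k_fold_splits(30, k=10))
```

Every sample appears in exactly one test fold, so aggregated test-fold predictions cover the whole modeling set once.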

The core of this study is the simulation of the surface chemistry of the GNPs (Figure 1) during the modeling procedure. The ligands attached to the Au core have lower accessibility than free ligands. A novel “expose parameter”, ranging from 0 to 1, was designed by calculating the distance of each ligand atom from the Au core and the ligand density.
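The exact formula for the “expose parameter” is not given in the abstract. The sketch below is a hypothetical linear scaling of an atom's distance from the core surface, clamped to [0, 1], purely to illustrate the idea of a 0-to-1 accessibility score; the real parameter also incorporates ligand density.

```python
import math

def expose_parameter(atom_coords, core_center, core_radius, max_reach):
    """Illustrative 0-to-1 accessibility score for a ligand atom on a gold core.

    Atoms at the core surface score 0 (buried); atoms at the ligand's maximum
    reach score 1 (fully exposed). This linear scaling is an assumption, not
    the formula used in the study.
    """
    dist = math.dist(atom_coords, core_center)
    frac = (dist - core_radius) / (max_reach - core_radius)
    return min(1.0, max(0.0, frac))  # clamp to the stated [0, 1] range
```

For a core of radius 1.0 and a maximum ligand reach of 3.0, an atom at distance 2.0 from the center would score 0.5 under this scaling.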

6:30pm-8:30pm

CINF 31: Experimental errors in QSAR modeling sets: What we can do and what we cannot do

1 Rutgers University, Camden, New Jersey, United States; 2 Center for Computational and Integrative Biology, Rutgers University, Camden, New Jersey, United States; 3 Chemistry Department, Rutgers University, Camden, New Jersey, United States; 4 Multicase Inc., Beachwood, Ohio, United States
Numerous data sources have become available for quantitative structure–activity relationship (QSAR) modeling studies. However, the quality of these sources may differ depending on the nature of the experimental protocols. In this study, we explored the relationship between the ratio of questionable data in the modeling sets, obtained by simulating experimental errors, and QSAR modeling performance. To this end, we used eight datasets (four continuous endpoints and four binary endpoints) that have been extensively curated in our lab to create over 1,800 QSAR models. Each dataset was duplicated into seven new modeling sets with different ratios of simulated experimental errors (i.e., randomizing the activities of a portion of the compounds) in the modeling process. Five-fold cross-validation was used to assess model performance, which worsens as the ratio of experimental errors increases. All the resulting models were also used to predict external sets of new compounds that were excluded at the beginning of the modeling process. The modeling results showed that the compounds with relatively large prediction errors in the cross-validation process are more likely to be those with simulated experimental errors. However, after removing a certain number of compounds with large prediction errors in cross-validation, the external predictions of new compounds did not improve. Our conclusion is that QSAR predictions, especially consensus predictions, can flag compounds with potential experimental errors, but removing those compounds will not result in better model performance, due to overfitting. Extra experimental testing is apparently necessary for compounds found to be questionable by QSAR predictions.
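For the binary endpoints, the described error simulation (randomizing the activities of a fraction of compounds) can be sketched as a label-flipping routine. The function name and flipping scheme are illustrative, not the authors' code; continuous endpoints would instead permute or perturb activity values.

```python
import random

def flip_binary_labels(y, error_ratio, seed=0):
    """Return a copy of binary activity labels with a chosen fraction flipped,
    simulating a given ratio of experimental errors in a modeling set."""
    rng = random.Random(seed)
    y = list(y)
    n_noisy = int(round(error_ratio * len(y)))
    for i in rng.sample(range(len(y)), n_noisy):  # distinct compounds to corrupt
        y[i] = 1 - y[i]  # flip 0 <-> 1
    return y

# Example: a balanced 100-compound set with a 20% simulated error ratio.
y_true = [0] * 50 + [1] * 50
y_noisy = flip_binary_labels(y_true, 0.2)
```

Repeating this with ratios from 0 up to some maximum yields the series of corrupted modeling sets whose cross-validation performance can then be compared.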

Research Informatics, John Wiley & Sons, Hoboken, New Jersey, United States
Organic synthesis is a vital component of the drug discovery process, essential to achieving high productivity, efficiency, and novelty along the entire development pipeline. The knowledge that pharmaceutical companies have accumulated in this domain is one of their most valuable assets, and as such requires both protection and means of dissemination. Yet more often than not, the accessibility and utilization of this knowledge is limited: data are frequently scattered across several systems in non-standardized formats, and their discoverability is wanting. ChemPlanner, the state of the art in computer-aided synthesis design (CASD) and typically offered as an online service, will in the next few weeks see the release of a version that can be hosted locally, behind the organization’s firewall. This platform can host users’ proprietary data along with databases of published reactions, and thus enables federated reaction retrieval and retrosynthesis queries within a secured environment.

ChemPlanner is a synthesis planning tool that allows chemists to consider broader sets of synthetic approaches to their target molecules by carrying out retrosynthetic analysis based on large reaction databases and the reaction rules derived from them. ChemPlanner also integrates “traditional” searching capabilities such as structure, substructure, and similarity searches, with metadata fields offering means of refining result sets. The retrosynthetic search can identify a large spectrum of synthetic routes leading to the target from commercially available starting materials, with literature examples supporting each reaction step and giving essential experimental information. Users’ reaction and starting material collections can supplement the provided data sources and offer discovery and process chemists even broader coverage of synthetic know-how.
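The retrosynthetic search described above can be illustrated with a toy recursive routine. The rule representation below (a dictionary mapping a product to candidate precursor sets) is a drastic, purely hypothetical simplification of real reaction rules, which operate on molecular structures rather than labels.

```python
def retrosynthesis(target, rules, available, depth=3):
    """Toy recursive retrosynthetic search: expand a target via transform rules
    until every leaf is a commercially available building block.

    `rules` maps a product to a list of precursor tuples (one per known
    disconnection); returns a nested route [target, subroutes] or None.
    """
    if target in available:
        return [target]          # leaf: purchasable starting material
    if depth == 0:
        return None              # search horizon reached without a full route
    for precursors in rules.get(target, []):
        routes = [retrosynthesis(p, rules, available, depth - 1)
                  for p in precursors]
        if all(routes):          # every precursor has its own complete route
            return [target, routes]
    return None

# Hypothetical example: "target" disconnects to A + B; A disconnects to C.
rules = {"target": [("A", "B")], "A": [("C",)]}
available = {"B", "C"}
route = retrosynthesis("target", rules, available)
```

A production system additionally ranks the many candidate routes and attaches literature precedent to each step, which this sketch omits.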

In the poster we give a general overview of ChemPlanner via test cases. We discuss the main principles of extracting synthetic knowledge from reaction databases, including rules, stereoselectivity, and functional group tolerance, and the means of segregating data sources and extracted knowledge to guarantee the integrity of intellectual property. The requirements on the data source and the available search options will also be shown.

Chemistry & Biochemistry, Calvin College, Grand Rapids, Michigan, United States
Spectrophotometric titrations are a simple and powerful way to characterize multicomponent systems thermodynamically. While an increasing number of researchers rely on factor analysis to deconvolute the data in order to determine binding constants, the non-linear relationship between binding constants and spectroscopic data makes it non-trivial to ascertain the error on the former when modeling the latter. An appraisal of direct methods of error analysis is presented alongside Monte Carlo simulations. Data sources include both real examples and artificially generated spectrophotometric titration data.

Spectrophotometric data, left, upon being deconvoluted into molar absorptivity curves and equilibrium concentration traces, center, still exhibit residuals, right, due to various forms of error in the measurement and the model.
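A Monte Carlo appraisal of the uncertainty on a fitted binding constant can be sketched as follows. This assumes a simplified single-wavelength 1:1 binding isotherm with a known limiting signal; real titration data would be multiwavelength and deconvoluted by factor analysis, and the concentrations, noise level, and grid search are all illustrative choices.

```python
import random
import statistics

def signal(K, conc, a_max=1.0):
    """Simplified 1:1 binding isotherm: fraction bound times the limiting signal."""
    return [a_max * K * c / (1.0 + K * c) for c in conc]

def fit_K(conc, obs, grid):
    """Grid-search the binding constant minimizing the sum of squared residuals."""
    return min(grid, key=lambda K: sum((o - s) ** 2
                                       for o, s in zip(obs, signal(K, conc))))

conc = [c / 10 for c in range(1, 21)]       # titrant concentrations (arbitrary units)
true_K = 5.0
clean = signal(true_K, conc)
grid = [k / 100 for k in range(100, 1001)]  # candidate K values, 1.00 to 10.00

rng = random.Random(1)
fits = []
for _ in range(200):                        # Monte Carlo replicates: perturb, refit
    noisy = [y + rng.gauss(0, 0.01) for y in clean]
    fits.append(fit_K(conc, noisy, grid))

K_mean = statistics.mean(fits)
K_sd = statistics.stdev(fits)               # empirical uncertainty estimate for K
```

The spread of the refitted constants across replicates gives an empirical error bar on K that respects the non-linear model, which is exactly what direct (linearized) error propagation struggles to provide.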

1 Haverford College, Bryn Mawr, Pennsylvania, United States; 2 Chemistry, Haverford College, Haverford, Pennsylvania, United States
Cheminformatics covers a wide remit of problems, including, but by no means limited to, statistical and machine learning models and their validation, consistent descriptions of systems under examination and frequently, software engineering. This poster highlights my work in each of these spheres as aspects of the Dark Reactions Project[1] at Haverford College, an open-source and publicly-available hydrothermal synthesis database with a web-based interface and associated software, which has been used to build models of hydrothermal synthetic reactions and to make predictions and hypotheses about those systems by harnessing big data approaches.
[1] http://darkreactions.haverford.edu

Department of Chemistry, Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States
Human leukocyte antigen (HLA) genes encode cell surface proteins involved in key signaling mechanisms of the immune system.1 Recently, these proteins have been shown to be directly responsible for idiosyncratic adverse drug reactions (ADRs).1–3 Herein, building upon our first proof-of-concept docking study with clozapine,2 we present an analysis of the common HLA-B*57:01 variant that is notably responsible for the abacavir hypersensitivity syndrome. First, we analyzed three crystal structures (PDB codes: 3VRI, 3VRJ, and 3UPR) involving the HLA-B*57:01 protein variant with the anti-HIV drug abacavir and different endogenous peptides co-bound in the antigen-binding cleft.3,4 We superimposed the three structures and showed that abacavir exhibits no significant conformational variation regardless of the co-bound peptide (Figure 1). Second, we used Schrodinger’s Glide software to evaluate the abacavir-HLA binding affinity. The docking scores for abacavir without a peptide in the cleft were as low as -8.27 and -7.99 kcal/mol using the SP and XP scoring functions, respectively. In the presence of an endogenous co-binding peptide, we found a significant increase (~2 kcal/mol) in the docking scores and a key abacavir-peptide hydrogen bond, indicating that the peptide plays a role in stabilizing the HLA-drug complex. Third, we docked a small set of drugs with known ADRs (e.g., allopurinol, fenofibrate, simvastatin) and analyzed their binding affinities toward HLA-B*57:01. Our presentation will focus on these drug-specific interactions with the B*57:01 variant and their match with known HLA-mediated ADRs. This study demonstrates the suitability of molecular docking for evaluating HLA-drug interactions of high importance for precision medicine.

1 Chemical and Biological Engineering, University at Buffalo, Buffalo, New York, United States; 2 New York State Center of Excellence in Materials Informatics, Buffalo, New York, United States
The idea of utilizing modern data science in chemical and materials research has recently gained considerable attention. However, tools and techniques that could facilitate this work have oftentimes not yet been developed or are still in their infancy. Existing expertise tends to be in-house, specialized, or otherwise unavailable to the community at large. Data science is thus in practice beyond the scope and reach of most researchers in the field. Our work aims to address this situation by creating ChemML, a program suite and software toolbox designed to fill the prevalent infrastructure gap and thus make the application of big data analytics in the chemical and materials context – e.g., via machine learning and informatics – a viable and widely accessible proposition. ChemML can be employed for the validation, analysis, mining, and modeling of large-scale data sets. Its primary purpose is to uncover hidden structure-property relationships that govern the behavior of chemical and materials systems. These insights are a prerequisite for rational design and inverse engineering capability as outlined in the White House Materials Genome Initiative.

A key consideration of our work is to make ChemML as comprehensive, black-box, and user-friendly as possible, so that it can be readily employed by interested researchers without the need for excessive expert knowledge. Our presentation will detail the code design and modular structure of ChemML, its capabilities, methodological advances, and initial proof-of-principle applications.

CINF 37: Viewpoint on open access by an editor, author, reviewer, and reader
Jonathan Sweedler, jsweedle@illinois.edu

Chemistry, University of Illinois, Urbana, Illinois, United States
Open access to published articles and to the associated research data supporting them is a requirement of many funders around the globe. Ready access offers value not only to medicine and the life sciences but also to chemistry as a discipline. As Editor-in-Chief of Analytical Chemistry, and as a frequent open access author, reviewer, and reader, I present my perspectives on the value that open access brings to research, to education, and to public awareness of the sciences.

John Wiley and Sons Ltd, Chichester, United Kingdom
The data revolution is underway. Opening access to the world's research data offers huge potential to improve the transparency of research, accelerate the pace of discovery, improve return on investment, and lead to a future in which more research can be independently verified or made reproducible. Publishers, funders, and researchers have a shared responsibility to create an ecosystem that supports the sharing of data. To help address this, Wiley has been implementing a Data Sharing Service that enables authors to transfer or link to data in approved repositories. This service is designed to increase discoverability, encourage innovation, and help authors comply with journal or funder mandates. Topics covered in the presentation will include the implementation of a data sharing policy, data accessibility statements, the licenses associated with data, and guidelines to authors on how to share data. The talk will illustrate how the Publisher supports and promotes data sharing in the community and how it provides guidance to enable effective data sharing.

1 Elsevier RDMS, Elsevier Inc., Jericho, Vermont, United States; 2 Mendeley, Elsevier Inc., Mountain View, California, United States
The main tenet of data science is that new science can be done on old data. To make this possible, however, the data needs to be collected and stored in a way that allows downstream scrutiny, validation, and use. This calls for a connected infrastructure of research data management tools, in which the inputs and outputs of the many tools and parties currently involved in data creation, storage, and access work together. In this talk, we will discuss how different components of a research data ecosystem can work together to address data preservation, curation, archiving, access, comprehension, reproducibility, discoverability, trust, citation, and re-use. We will present a series of initiatives in which Elsevier is partnering with research institutions to improve such ecosystems, and in particular consider the role of chemical data within this framework.

Beilstein Institut, Frankfurt, Germany
We are in the middle of a large technological revolution. The internet of things is knocking on the door – household appliances, electricity grids, medical monitoring and automobiles, for example, are becoming interconnected. Chemistry is also changing, and machines and intelligent control systems are starting to have an impact. To date, one of the technologies most resilient to change has been scientific publishing. The basic functions of sharing research results need to be re-engineered. The current workflows need to be changed to accommodate better data reporting, validation and sharing – and they will change, as will the mindset of the practitioners when the theoretical advantages become practical ones.

9:05am-9:35am

Panel Discussion

9:35am-9:45am

Discussion

9:45am-10:00am

Intermission

10:00am-10:10am

CINF 42: NSF MPS Open Data workshop series: Taking the pulse of the research community on open data issues
Mike Hildreth2, mhildret@nd.edu, Leah McEwen1

1 Clark Library, Cornell University, Ithaca, New York, United States; 2 Department of Physics, University of Notre Dame, Notre Dame, Indiana, United States
We have begun to coordinate a series of two NSF-funded workshops aimed at gauging the needs of NSF MPS researchers in terms of data preservation, the infrastructure required to make research data public and useful, and possible responses to guidelines that might be implemented to ensure the preservation of scientific knowledge, without undue burden on researchers, while making the data available to the public in a useful way. The first workshop was held in Arlington, VA, in November 2015 and resulted in a preliminary report framing the opinions of the MPS community. A second workshop will be held in fall 2016 to revise the report to incorporate feedback from the research community “at large”, collected at venues like this meeting. The presentation will include an overview of the conclusions presented in the report. The report can be accessed at https://mpsopendata.crc.nd.edu/

10:10am-10:20am

CINF 43: Open Data: What the reader wants to know rather than what the author wants to present
Robin Rogers, robin.rogers@mcgill.ca

Department of Chemistry, McGill University, Montreal, Quebec, Canada
The ACS journal Crystal Growth & Design has promoted the concept of ‘open data’ since its inception in 2000. Crystallographic data, together with the current repositories and software to view and analyse the data, offer a glimpse of a future in which all data are available for analysis and use from the perspective of what the reader wants to know or study, rather than what the author wants to present. In this presentation, some of the advantages and disadvantages of open data will be discussed in view of the idealistic goals and the pragmatic realities.

Cambridge Crystallographic Data Centre, Cambridge, United Kingdom
When it comes to open sharing of data, disciplinary repositories have been enabling this for many years. Today, the Cambridge Structural Database (CSD) enables access to over 820,000 crystal structures which may be associated with journal articles but are increasingly being published independently as CSD Communications. Disciplinary repositories can provide streamlined deposition mechanisms that offer value to the researcher who generates the data and lower the barriers to efficient and effective deposition. Crucially, disciplinary repositories provide the domain expertise necessary to make the data discoverable and reusable across a range of subject areas. In the case of crystallography, the CSD enables chemists to discover and use crystallographic data and knowledge in ways that are best suited to their specific application areas and operating environment. This presentation will explain the role that the CSD has in enabling meaningful and applicable publication of research data.

Figshare, Brooklyn, New York, United States
Good data management and infrastructure is at the foundation of reproducible research. This talk will touch on the evidence and challenges for reproducibility we’ve seen at Figshare and will delve deeper into the funder policies and incentives to motivate different stakeholders and communities toward best practices and workflows to achieve transparency in scientific research.

Mestrelab Research SL, Feliciano Barrera, Santiago de Compostela, Spain
Primary raw data is fundamental to the integrity and quality of the publication process in chemistry. All stakeholders in publication (publishers, readers, authors and reviewers) would derive very significant benefits from the inclusion of primary raw data with publications.

Raw data is critical to ensure the integrity of the published materials, but it has many other benefits, such as:
- Reproducibility (other scientists can reproduce the same analyses and the same analysis results)
- Interactivity (allows other scientists to interact with the data generated from the relevant experiments)
- Knowledge building: the community as a whole can build knowledge based on the primary research data and leverage this knowledge for future research.

Mestrelab is presenting here the results of our efforts to support the inclusion of chemistry analytical primary raw data in the publication process. We present results from the following initiatives:
- Automatic generation of formatted primary raw data for inclusion in publications
- Published analytical raw data review tools freely available to the chemistry community
- The role of ELNs in facilitating the sharing of research data, and an example of an ELN with a readily exportable data structure which significantly facilitates the sharing of chemistry data.

National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public archive that contains information on a broad range of chemical entities, including small molecules, lipids, carbohydrates, and (chemically-modified) amino acid and nucleic acid sequences (including siRNA and miRNA). Currently, PubChem contains more than 219 million substance descriptions, 88 million unique chemical structures, and 229 million bioactivity test results from one million bioassays, covering about ten thousand target protein sequences. This vast amount of chemical information is contributed by more than 400 data sources, including government agencies, academic institutions, pharmaceutical companies, chemical vendors, publishers, and other databases.
Chemical data sharing through a public database like PubChem presents some unique challenges. Although funding agencies can mandate sharing of data generated in studies they support, many organizations in the private sector, like publishers and chemical vendors, are not required to submit their data to PubChem or other public databases. Then, why would they share their data with other people? What kind of data would they want to share? What benefits can data sharing provide for these private entities? In addition to these questions, data sharing among more than 400 data sources raises many technical issues to consider. When multiple data sources provide redundant data on the same chemical, how can we extract unique information? What should we do if discrepancies exist in data from different sources? In this presentation, we will discuss how these issues are handled in PubChem.
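One common way to begin extracting unique information from redundant depositor records is to group substance records by a standardized structure key before comparing their annotations. The sketch below is a generic illustration, not PubChem's internal pipeline, and the `structure_key` field name (standing in for something like an InChIKey) is hypothetical.

```python
from collections import defaultdict

def collapse_substances(substances):
    """Group depositor substance records by a standardized structure key so that
    each unique chemical structure lists its contributing sources; conflicting
    annotations can then be compared per key."""
    compounds = defaultdict(list)
    for record in substances:
        compounds[record["structure_key"]].append(record["source"])
    return dict(compounds)

# Illustrative depositor records: two sources describe the same structure K1.
subs = [
    {"structure_key": "K1", "source": "vendor A"},
    {"structure_key": "K1", "source": "journal B"},
    {"structure_key": "K2", "source": "vendor A"},
]
compounds = collapse_substances(subs)
```

Once records are grouped this way, per-structure discrepancies across sources become visible and can be resolved or flagged.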

Chemistry Library, University of Pennsylvania, Philadelphia, Pennsylvania, United States
The expansion of data sharing and archiving requirements by funding agencies has forced the issue of open data in the chemistry community, causing researchers to think more about methods by which they can make their data available to others and providing them with additional data streams for their research. Librarians and other information and data professionals are working to support these new open data initiatives, to help researchers develop and enhance their own best practices for managing and presenting data, and to help researchers integrate good data practices into the research life-cycle, using many of the same techniques that they have historically used to promote the integration of sound information-seeking practices in research. This summary presentation will offer a chemistry librarian's practical reflection on the key issues and takeaways from the preceding panel discussion.

University of New Mexico, Albuquerque, New Mexico, United States
The Illuminating the Druggable Genome Knowledge Management Center (IDGKMC) evaluates, organizes, and distills more than 80 protein-centric and over 20 gene-centric resources, currently focused on G-protein coupled receptors, nuclear receptors, ion channels, and kinases. Data wrangling, coupled with algorithmic processing, text mining of drug labels, patents, and the medical literature, as well as human curation and drug-target ontology development, yields emergent properties and knowledge for target-disease associations. Tissue expression data from GTEx and other sources, disease-centric text mining, and other resources are integrated using a number of specialized ontologies, e.g., the Disease Ontology. Using metrics derived from text mining and gene references into function, as well as the number of antibodies, the IDGKMC catalogs proteins into four categories: ‘Tdark’, proteins that lack functional information and disease relevance; ‘Tclin’, proteins with a confirmed drug mechanism of action; ‘Tchem’, proteins for which potent small molecules are known; and ‘Tbio’, proteins for which literature, functional, and disease annotation data are available (see Figure 1). Data can be mined via the user interface portal, Pharos.
This integrative effort led to the following observations: i) there appears to be a knowledge deficit, i.e., we lack an understanding of protein function for 38% of the human proteome; ii) less than 3% of the human proteome is therapeutically addressed by drugs; iii) given the current understanding of disease (~8,800 disease concepts), as well as all diseases addressed via on-label (~2,000) and off-label (~400) indications, we currently address at most a quarter of all diseases with therapeutic agents.

Figure 1: Target Development Levels for the Human Proteome, including four 'druggable' protein families.

ChEMBL, EMBL-EBI, Hinxton, Cambridge, United Kingdom
The identification of potential drug targets, together with up-to-date knowledge of the extent to which those targets have been studied, is essential information for giving impetus to the next wave of drug discovery efforts.
This talk describes how, as part of the Illuminating the Druggable Genome Program, we used data curation to attach drug-discovery-relevant information to understudied targets in the main drug target families. We will detail our use of publicly available data sources such as the ChEMBL database, the SureChEMBL patent resource, and clinical trials information to collate compound and disease information for understudied protein targets, and present some of the challenges we encountered while doing so. We highlight how this information extraction and curation has proved useful in providing insights into proteins occupying the dark part of the 'druggable genome' space.

1 Center for Computational Science, University of Miami, Miami, Florida, United States; 2 Department of Pharmacology, University of Miami, Miami, Florida, United States
Several research consortia and countless projects in pharmaceutical companies generate, organize, and analyze small-molecule drug screening data. Such consortia supported by the NIH Common Fund include the (now concluded) Molecular Libraries Program (MLP) and, currently, the Illuminating the Druggable Genome (IDG) and Library of Integrated Network-based Cellular Signatures (LINCS) projects. A large component of the MLP was the development of chemical probes to study a wide variety of biological questions; the program generated new assay technologies, huge amounts of chemical biology screening data, and over 350 chemical probes. The observation of an apparently strong bias of drug discovery research and development toward targets that are already well studied motivated the IDG program to prioritize novel drug targets and catalyze the development of chemical entities against understudied proteins, with a focus on four protein families: kinases, GPCRs, nuclear receptors, and ion channels. The LINCS program has a systems biology focus. The project creates a reference 'library' of molecular signatures, such as changes in gene expression and other cellular phenotypes that occur when cells are exposed to a variety of perturbing agents, along with computational tools for data integration, access, and analysis. Dimensions of LINCS signatures include the biological model system (cell type), the perturbation (e.g., small molecules), and the assays that generate diverse phenotypic profiles.
Data integration is a common and critical challenge in these and other projects, and it requires common metadata standards and conventions for data representation and exchange. Toward the goal of creating common data standards for these and other projects that produce drug-discovery-relevant data, and to support software tools that we and others have been building as part of these projects, we have been developing ontologies including the BioAssay Ontology (BAO) and the Drug Target Ontology (DTO). The goal of these ontologies is to enable the knowledge-based classification of diverse and large datasets into categories that facilitate re-use and context-specific integration and querying, for example to develop predictive models or to quickly explore and correlate different datasets.
BAO, DTO and other ontologies provide a robust framework to represent, integrate, model, and query diverse drug discovery data generated in different projects.

1 Cell Signaling Technology, Danvers, Massachusetts, United States; 2 Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai, New York, New York, United States
Kinases are a class of cell signaling proteins that control diverse cellular functions through protein phosphorylation. Dysregulation of kinase activity is common in many cancers, and kinases are effective therapeutic drug targets. However, we still have only a partial view of the human kinome in normal physiology and disease. To address this challenge we constructed mammalian kinome networks from over twenty diverse public resources and developed a web-based tool and database called Kinase Enrichment Analysis version 2 (KEA2). KEA2 can be used to predict kinase activity from proteomics, phosphoproteomics, and genomic data. The different views of the human kinome connect kinases based on their known binding partners, known substrates, co-expression, effects on cancer cell lines when knocked down, effects on gene expression when knocked down, and similar roles in disease. As a case study, we applied KEA2 to an original, unpublished phosphoproteomics dataset collected from 31 non-small cell lung cancer cell lines. The analysis generated unique kinome signatures for the cell lines that agree with previous knowledge and point to potential new drivers in lung cancer. In conclusion, KEA2 is a useful resource for advancing our understanding of the human kinome.
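The enrichment step at the core of a tool like KEA2 can be illustrated with a small overlap test: given a set of hit proteins and a library of kinase-substrate sets, rank kinases by the hypergeometric tail probability of the overlap. The sketch below is illustrative only, not KEA2's actual implementation; the kinase names, substrate sets, and background size are invented.

```python
from math import comb

def hypergeom_tail(overlap, n_substrates, n_hits, universe):
    """P(X >= overlap) for the intersection size of two random sets in the universe."""
    return sum(
        comb(n_substrates, k) * comb(universe - n_substrates, n_hits - k)
        for k in range(overlap, min(n_substrates, n_hits) + 1)
    ) / comb(universe, n_hits)

def kinase_enrichment(hits, kinase_substrates, universe_size):
    """Rank kinases by over-representation of their substrates among the hits."""
    results = []
    for kinase, substrates in kinase_substrates.items():
        k = len(hits & substrates)
        p = hypergeom_tail(k, len(substrates), len(hits), universe_size)
        results.append((kinase, k, p))
    return sorted(results, key=lambda r: r[2])  # smallest p-value first

# Invented toy data: three kinases with known substrates in a 1000-protein background
library = {
    "KIN_A": {"P1", "P2", "P3", "P4"},
    "KIN_B": {"P5", "P6"},
    "KIN_C": {"P7", "P8", "P9"},
}
hits = {"P1", "P2", "P3", "P10"}  # differentially phosphorylated proteins
ranking = kinase_enrichment(hits, library, universe_size=1000)
print(ranking[0][0])  # KIN_A ranks first: 3 of its 4 substrates are hits
```

Ranking by the tail probability rewards kinases whose substrates are over-represented among the hits, which is the kind of signal used to infer upstream kinase activity from phosphoproteomics data.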

The druggable genome corresponds to the set of protein targets that are amenable to small-molecule perturbation. While this set of targets has enormous potential for understanding and treating many disease conditions, the bulk of them are understudied or not studied at all. To address this, the NIH initiated the 'Illuminating the Druggable Genome' program to characterize the dark regions of the druggable genome. As part of this program, a Knowledge Management Center (KMC) was created to aggregate and integrate heterogeneous data sources and data types, creating a centralized location for information about all protein targets identified as part of the druggable genome. In this presentation we describe the design and deployment of Pharos, the user interface for the KMC. Based on modern web design principles, the interface provides facile access to all data types collected by the KMC. We provide an overview of the data sources and types made available via Pharos and then describe the architecture of the system and its integration with KMC and external resources. Given the complexity of the data surrounding any target, efficient and intuitive visualization has been a high priority, enabling users to quickly navigate and summarize search results and rapidly identify patterns. We highlight the approaches we have taken to address this requirement. A critical feature of the interface is the ability to perform flexible searches and subsequent drill-down of search results. We describe the design of a faceted search interface, coupled to the Drug-Target Ontology (DTO), that supports these activities. Underlying the interface is a RESTful API that provides programmatic access to all KMC data, allowing easy consumption in user applications. We conclude by highlighting some workflows on targets of interest to the IDG program.
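The faceted search pattern described here can be sketched generically: each facet is a field, each selection narrows the result set, and facet counts are recomputed over the remaining records to drive drill-down. This is a minimal illustration with invented target records, not Pharos code or its actual API.

```python
from collections import Counter

# Invented toy records mimicking target annotations (family, development level)
targets = [
    {"name": "T1", "family": "Kinase", "level": "Tdark"},
    {"name": "T2", "family": "Kinase", "level": "Tclin"},
    {"name": "T3", "family": "GPCR", "level": "Tdark"},
    {"name": "T4", "family": "Ion Channel", "level": "Tbio"},
]

def facet_search(records, selections):
    """Keep only records matching every selected facet value."""
    return [r for r in records
            if all(r.get(field) == value for field, value in selections.items())]

def facet_counts(records, field):
    """Counts shown next to each facet value, to guide further drill-down."""
    return Counter(r[field] for r in records)

kinases = facet_search(targets, {"family": "Kinase"})
print([r["name"] for r in kinases])    # ['T1', 'T2']
print(facet_counts(kinases, "level"))  # one Tdark, one Tclin among the kinases
```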

10:30am-10:55am

CINF 54: From dark chemical matter to shedding light on the dark genome: How can chemistry and informatics enable biology? Meir Glick, meir.glick@merck.com

Merck Research Laboratories, Boston, Massachusetts, United States
This presentation describes a forward-looking strategy for how the integration of screening, chemical synthesis, and informatics can enable target identification and validation: designing high-quality molecular probes against novel targets and phenotypes, creating the most biologically relevant assay systems, progressing the right perturbagen into in vivo studies, and eventually transforming the organization from data creators into decision makers.

UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States
The human genome encodes 518 protein kinases, collectively referred to as the human kinome. Kinases are among the most important targets for drug discovery and development in the pharmaceutical industry. A large number of protein kinase inhibitors are either in clinical development or have been approved to treat a wide variety of diseases, including cancer, inflammation, diabetes, immunodeficiency, and neurological disorders.
Traditionally, QSAR models have been developed for each target separately, so accurate prediction of a full kinome profile for a molecule is a great challenge for computational drug discovery. Here, we address this challenge using an approach based on recent advances in machine learning: deep convolutional neural networks (CNNs). These networks allow for multi-task learning, the procedure of learning several tasks at the same time for mutual benefit; their architecture allows learned information to be shared across sub-tasks, and therefore joint training for every endpoint. We have applied this approach (termed KinomeNet) to a large dataset of kinase inhibitors extracted from databases (PubChem, ChEMBL), articles, and patents. The dataset includes over 250,000 compounds and 369 kinases. We show that KinomeNet achieves high average specificity (0.90) and sensitivity (0.87), which compares favorably to the same metrics for state-of-the-art Random Forest models (SP 0.83 and SE 0.82) across over 200 kinases. We posit that the success of KinomeNet is due to information sharing, which is especially beneficial for endpoints with small or highly imbalanced datasets, traditionally the most challenging for QSAR methods.
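The multi-task idea behind an approach like KinomeNet can be sketched as a shared trunk with one output head per kinase, so all endpoints predict from a common learned representation. The toy forward pass below is only a NumPy sketch, not the authors' architecture; the fingerprint length and hidden width are invented, the weights are random, and no training loop is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bits, n_hidden, n_kinases = 1024, 64, 369  # fingerprint size and hidden width are invented

# Shared trunk: one weight matrix learned jointly for all kinases
W_shared = rng.normal(scale=0.01, size=(n_bits, n_hidden))
# Per-task heads: one output unit per kinase, sharing the trunk's representation
W_heads = rng.normal(scale=0.01, size=(n_hidden, n_kinases))

def predict(fingerprints):
    """Forward pass: shared ReLU hidden layer, then one sigmoid output per kinase."""
    h = np.maximum(0.0, fingerprints @ W_shared)  # shared representation
    logits = h @ W_heads
    return 1.0 / (1.0 + np.exp(-logits))          # P(active) for each kinase

batch = rng.integers(0, 2, size=(5, n_bits)).astype(float)  # 5 toy binary fingerprints
probs = predict(batch)
print(probs.shape)  # (5, 369): a full kinome activity profile per compound
```

Because gradients from every kinase endpoint would flow through `W_shared` during training, endpoints with little data benefit from representations shaped by data-rich endpoints, which is the information-sharing effect the abstract credits.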

1 BCMB, University of Tennessee Knoxville, Oak Ridge, Tennessee, United States; 2 Biology, Oak Ridge National Laboratory, Oak Ridge, Tennessee, United States
In annotated proteomes from different organisms, a significant fraction of proteins have no homology to proteins with known structures and/or functions, and hence are termed the "dark matter" of the proteomes. Many intrinsically disordered proteins (IDPs), or proteins with a large fraction of residues in intrinsically disordered regions (IDRs), are part of this proteomic "dark matter". Recognition and understanding of IDPs/IDRs have even prompted calls to retire Anfinsen's dogma in molecular biology, i.e., that sequence determines structure determines function. Here, we propose a phylogenetic method covering both protein amino acid sequence length (L) and protein intrinsic disorder (D) of whole proteomes. The phylogeny reconstructed in this two-dimensional (2D) L-D space sets up an intriguing landscape of the evolutionary dynamics of organisms. The approach clearly distinguishes eukaryotes from prokaryotes. The viral and plasmid gene pools, the giant DNA viruses (giruses), and the Archezoa (mitochondrion-free eukaryotes) are all located in the eukaryotic basal zone, or the prokaryote-to-eukaryote transition zone. Moreover, plants and animals, and even plant monocots and eudicots, exhibit different patterns in this L-D space. The method covers all proteins of the proteomes, including IDPs and other proteins in the dark regions. In addition, since the intrinsic disorder of proteins is predictable, and the analogous phylogeny reconstructed from protein length and protein intrinsic disorder can clearly identify the evolutionary status of organisms, we argue that Anfinsen's dogma should stand for now, as disorder itself can be treated as a special case of structure, or order.

1 Cambridge Crystallographic Data Centre, Cambridge, United Kingdom; 2 Clark Library, Cornell University, Ithaca, New York, United States; 3 Department of Chemistry, University of North Florida, Jacksonville, Florida, United States
DIG Chemistry is a global conversation emerging in the chemical information professional community to map the challenges and opportunities for chemistry data across the enterprise. Quoting from the group presence in the open Research Data Alliance: “There is a wealth of chemical data in various heterogeneous formats, distributed across a myriad of systems with endless potential for reuse in chemistry research and many related domains. However, many social, technical and administrative factors have limited the opportunities for open sharing and interoperable exchange. The high reuse value of chemical information has sparked decades of innovative technologies addressing various challenges in handling chemical specific data, but very few approaches have persisted, are extensible beyond specific data types and/or are operable at scale. There is demonstrable need for coordinated development of updated and scaled infrastructures, hard and soft, for enabling chemical data exchange and connecting data providers with data users across sources and applications.” This discussion session will provide the opportunity for those most active in the field to identify the highest priorities and target productive community collaborations.

1 Cambridge Crystallographic Data Centre, Cambridge, United Kingdom; 2 National Center for Biotechnology Information, Bethesda, Maryland, United States; 3 Clark Library, Cornell University, Ithaca, New York, United States
At the last ACS meeting in San Diego, the Division of Chemical Information (CINF) held its first ever Data Summit over the course of five days. As a follow-up, this panel discussion will provide an overview of the challenges surrounding chemistry data representation. Collected from nearly one hundred 'pain points' expressed by chemical information experts, there emerged several crucial themes for managing and working with chemistry information: data access, data quality, chemical structure representation, data description and metadata, curation and management tools, and audience and community engagement. The panel will summarize these points, engage the audience in active discussion to further vet the issues, and surface potential solutions and approaches to improve the state of the art. Outcomes will be published in a form that helps to further programs and publications to reach broader audiences interested in chemistry data.

4:10pm-4:30pm

Concluding Remarks

CINF: Using New Media to Communicate Chemistry to the Public 1:30pm - 4:15pm
Monday, August 22
Room 112B - Pennsylvania Convention Center

American Chemical Society, Washington, District of Columbia, United States
Bill Nye has said that "if you want to teach something, you have to entertain people." This entertaining, educational approach is at the heart of Reactions, a weekly ACS YouTube series highlighting the chemistry of everyday life. Over the course of more than 140 episodes, the series has explained the chemistry of pizza, wet dog smell, tattoos, cookies, bacon, moisturizer and much more. Since its launch in early 2014, Reactions episodes have received more than 20 million views and have been featured on the Today Show, NPR, the Washington Post, and more than 100 other media outlets. In this session, the series' creator, Adam Dylewski, will share how the series approaches communicating chemistry to the public.

Chemical Heritage Foundation, Philadelphia, Pennsylvania, United States
The Distillations podcast explores the human stories behind science and technology, tracing a path through history in order to better understand the present. With thoughtfulness and humor we’ve traveled into the heart of a Silicon Valley asteroid mining company, explored the anti-GMO movement, and examined how new feats in “bloodless medicine” have been propelled in part by Jehovah’s Witnesses. We’ve even explained how DDT is the Britney Spears of chemicals. Distillations uncovers the many ways our daily lives intersect with science, while linking the present to the past and giving us a better grasp of the now.

2:20pm-2:40pm

CINF 61: Got something to say? Engaging with social media in the time you have. David Oppenheimer, oppenhe@ufl.edu, Paris Grey

University of Florida, Gainesville, Florida, United States
After creating the blog Undergrad in the Lab (undergradinthelab.com) to help undergraduates be successful and make meaningful connections with their research, we quickly realized that many of our strategies apply to researchers at all levels. To engage this wider audience we use a variety of social media, namely Facebook, Twitter, and Instagram, as @youinthelab. This talk will cover what's working for us and why we chose these platforms.

Compound Interest, Cambridge, United Kingdom
Online engagement is increasingly driven by images and multimedia. This session will look at how Compound Interest has taken advantage of this by using images and design to communicate chemistry concepts. It will also provide suggestions on how other chemists and science communicators can take advantage of graphic design to communicate chemistry ideas and research.

St. Edward's University, Austin, Texas, United States
Dodging zombies, killing kings, battling aliens, fact-checking cartoons, and sussing out stunt videos? Sounds silly, but it can be a serious science communication opportunity. In this talk, I'll share my adventures and strategies as a pop culture chemist on TV, in podcasts, and at genre cons.

Retired, Silver Spring, Maryland, United States
This presentation will describe how the author has been involved in developing databases and standards in chemistry over the past 45 years, using the NIH/EPA/NIST mass spectrometry database, the NIH/EPA Chemical Information System, the IUPAC InChI chemical structure standard, and others as examples.

Information Technology Branch, Developmental Therapeutics Program, National Cancer Institute, Bethesda, Maryland, United States
The National Cancer Institute has been accepting compounds for testing in anti-cancer assays since 1955, a service run by the Developmental Therapeutics Program (DTP). The systems for keeping track of the structural information follow the history of structural information representation, from ink drawings on 3 x 5 cards to modern computers. The development of the internet and World Wide Web opened the possibility of sharing this information with the research community. In the mid-1990s the first downloadable structure files were made available, followed by the first web pages, including structure search, NCI-60 growth inhibition data, and COMPARE calculations. These tools are useful, and a proof of concept that data can be made available with only modest resources, but there was a need for a more comprehensive way to turn 'available data' into 'useful data'. A major step forward in this regard was the creation of PubChem as part of the Molecular Libraries Initiative. DTP was a major contributor to the early development of PubChem: when PubChem first went live, about a third of the chemical structures and all of the assay data were from DTP. The impressive growth of PubChem in the years that followed has established open chemical data as an important part of chemical research. Over two decades of providing open chemical data has given DTP perspective on how the field has developed and where it might go. This talk will discuss the positive and negative experiences related to open chemical data and the challenges that need to be addressed to make open chemical data not just a useful part of chemical research, but an integral and necessary component of it.

NIST, Gaithersburg, Maryland, United States
The IUPAC International Chemical Identifier (InChI) is a molecular identifier based on the structure of the molecular species. It and its standardized hash (InChIKey) have several properties which make them useful for database management and generating links to databases which contain data relevant to the chemical species. This talk will outline some of these advantages along with some of the challenges in the use of these identifiers.

The NIST Chemistry WebBook is a collection of data for molecular species from various sources. InChI has been shown to be useful for reliably merging such data. Its modular nature allows straightforward identification of geometric and stereoisomers along with isotopologues. This is particularly useful in the case of the Chemistry WebBook, which contains legacy data collections that often do not specify stereogenic bonds.

InChI and InChIKey can also be used to link across data collections from different providers. Some of the features of the InChI string make it difficult (but certainly not impossible) to construct pre-defined web based queries or links. InChIKey, however, was designed to be compatible with Internet search engines and can be readily used for this purpose. While not as modular as InChI, InChIKey does store the hash of the molecule’s connectivity in its first component. This allows identification of similar species and the possible construction of links to “near misses.”
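The "near miss" linking described above relies on the InChIKey layout: the first hyphen-separated block (14 characters) hashes only the skeletal connectivity, so two keys that share it describe the same skeleton. A minimal sketch, using synthetic placeholder keys rather than real InChIKeys:

```python
def skeleton(inchikey: str) -> str:
    """Return the first block of an InChIKey: the 14-character connectivity hash."""
    first, _second, _proton = inchikey.split("-")
    assert len(first) == 14
    return first

# Placeholder keys (not real InChIKeys): two stereoisomers share a skeleton hash
key_a = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
key_b = "AAAAAAAAAAAAAA-CCCCCCCCSA-N"
print(skeleton(key_a) == skeleton(key_b))  # True: same connectivity, different stereo layer
```

Grouping database records by this first block is one simple way to surface "near miss" species, such as stereoisomers or isotopologues of a query structure, across data collections.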

PubChem has become a very useful tool for accessing a wealth of information about chemical species along with links to additional resources for the species. InChI and InChIKey make it easy to link to this resource.

NCI Frederick, Bldg 376, Rm 207, National Institutes of Health, Fort Detrick, Frederick, Maryland, United States
We will touch on the nearly two decades of web-based, freely accessible, small-molecule-related resources that the National Cancer Institute (NCI) Computer-Aided Drug Design (CADD) Group has made available to the scientific public in the fields of CADD and chemoinformatics. These resources build on an even longer history of chemoinformatics work at the NCI spanning nearly 60 years. The NCI CADD Group Chemoinformatics Tools and User Services comprise services such as the Enhanced NCI Database Browser, the Optical Structure Recognition Application, and the Chemical Identifier Resolver. We will present our efforts in the context of, and how they intersect with, the history, current status, and future of other large chemoinformatics projects at NIH such as PubChem.

Royal Society of Chemistry, Rockville, Maryland, United States
Open chemical information is at an exciting juncture. Scientists are beginning to understand their role in providing this content. Publishers are beginning to improve their capture of these data streams. Archives exist where scientists can deposit their information. Challenges remain. This talk will provide a brief overview of open chemical information and the role the Royal Society of Chemistry is playing to foster it. The impact of PubChem and other resources, like ChemSpider, will be considered.

11:10am-11:35am

CINF 69: Open chemical information at the European Bioinformatics Institute. Christoph Steinbeck, steinbeck@ebi.ac.uk

Cheminformatics and Metabolism, European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Hinxton, Cambridge, United Kingdom
The European Bioinformatics Institute has contributed to developing the open chemical information space for more than ten years, starting with the ChEBI ontology and database and later extending to the ChEMBL database, UniChem, and MetaboLights. This talk will highlight some of the primary contributions EBI has made to open chemical information and how they intersect with the PubChem project.

11:35am-12:00pm

CINF 70: History and the future of tools and software components for working with public chemistry data. Wolf-Dietrich Ihlenfeldt, wdi@xemistry.com

Xemistry GmbH, Konigstein, Germany
Over the last 15 years, I have been involved, as an active developer or component supplier, in the development of several important Internet-based chemistry information portals. My company has provided software both for the user interfaces of these sites and for their behind-the-scenes operation. For consumers of Internet-based chemistry information interested in more than looking at web pages, we have developed tools for accessing, matching, and mixing data from such sources. It has been an interesting journey. I will recount experiences from several past, current, and future projects, their specific challenges, which were and are often linked to the state of the art of general Internet and computer technology at the time, and our approaches to addressing them.

IBM Almaden Research Center, IBM, San Jose, California, United States
Without order, a collection is just a heap of stuff. For centuries, from the earliest libraries and archives onward, humankind has worked to bring order to collections. In the same way, without order, databases are just random bits and bytes. This talk will look at how we continuously seek to improve order, in databases and in searching. We are continuously challenged by factors that disturb that order, such as the growing amount of data, globalization, and obfuscation. Integration of big data demands new search technologies. We'll look at promising developments in finding 'dark data', with examples from different technical fields, and discuss the role of computer curation in enhancing the ability to organize and search content, and ultimately to enable predictive analysis. These developments will be viewed in the context of the value of scientific data provided to the scientific community through the efforts of NIH and PubChem.

FDA, Silver Spring, Maryland, United States
Structured Product Labeling (SPL) is an open document markup standard approved by Health Level Seven (HL7) and adopted by the FDA as a mechanism for exchanging product and facility information. The SPL standard has also been used by the FDA for indexing data on chemical and biological substances used as ingredients in medicinal products. Data available in the form of SPL index files, and their integration with openFDA and PubChem, will be discussed.

2:55pm-3:20pm

CINF 73: Building a network of interoperable and independently produced linked and open biomedical data. Michel Dumontier, michel.dumontier@gmail.com

Medicine, Stanford University, Stanford, California, United States
Over 15 years ago, Sir Tim Berners-Lee proclaimed the founding of an exciting new future involving intelligent agents operating over smarter data to perform complex tasks at the behest of their human controllers. At the heart of this vision lies an uneasy alliance between tedious formal knowledge representations and powerful analytics over big, but often messy, data. Bio2RDF, our decade-old open source project to create Linked Data for the life sciences, has woven emergent Semantic Web technologies such as ontologies and Linked Data to generate FAIR (Findable, Accessible, Interoperable, and Reusable) data in the form of billions of machine-accessible statements for use in downstream biomedical discovery.
This revolution in data publication has been strengthened by action from global bioinformatics institutions such as the NCBI, NCBO, EBI, and DBCLS. Notably, NCBI's PubChem has successfully coupled large-scale data integration with community-based standards to offer a remarkable biochemical knowledge resource amenable to data-hungry discovery tools. Yet, in the face of increasing pressure from researchers, funders, and publishers, will these approaches be sufficient for growing and maintaining a comprehensive knowledge graph that is inclusive of all biomedical research?
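The Linked Data model underlying a project like Bio2RDF reduces to subject-predicate-object statements, queried by pattern matching. A minimal in-memory sketch, with invented identifiers rather than Bio2RDF's actual URIs:

```python
# Minimal triple store: every statement is a (subject, predicate, object) triple
triples = {
    ("ex:compound/123", "ex:inhibits", "ex:target/EGFR"),
    ("ex:compound/123", "ex:xref", "ex:pubchem/CID2244"),
    ("ex:target/EGFR", "ex:associatedWith", "ex:disease/NSCLC"),
}

def query(s=None, p=None, o=None):
    """Pattern match over the store: None is a wildcard, like a basic SPARQL triple pattern."""
    return [(ts, tp, to) for ts, tp, to in triples
            if (s is None or ts == s) and (p is None or tp == p) and (o is None or to == o)]

# Everything known about the compound, across data sources, in one query
for t in sorted(query(s="ex:compound/123")):
    print(t)
```

Because statements from independent providers share the same triple shape and reuse identifiers, merging datasets is just a set union, which is the interoperability property the abstract highlights.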

NextMove Software, Cambridge, United Kingdom
For all of the grief that I give Evan, often over corner cases of chemical semantics that only one or two people care about, it is fair to say that PubChem represents the current state of the art in chemical structure representation. Nobody does it better. Under the surface, unseen by most users, are a large number of technical and scientific innovations that have enabled PubChem to scale over the past decade and a half to the point where it now contains nearly 100 million compounds. From simple design decisions such as the substance vs. compound distinction [which allows PubChem to avoid the early mistakes of CAS] to breakthroughs such as canonical Kekule SMILES [which avoid the early mistakes of Daylight Chemical Information Systems], the architecture of PubChem contains a treasure trove of cheminformatics innovations, covering normalization, tautomers, mixtures, 2D fingerprints and similarity, substructure search, biopolymers, text mining, and much more. During this presentation I hope to share some of the cool insights that the remarkable staff at the NCBI often forget to mention or are too modest to point out.
Congratulations Evan and Steve.

4:00pm-4:25pm

CINF 75: iRAMP & PubChem: Of the people, for the people. Leah McEwen, lrm1@cornell.edu

Clark Library, Cornell University, Ithaca, New York, United States

Chemistry and the need for chemical information are ubiquitous to the success of every wet lab. The infrastructure supporting the ecosystem of chemistry data and information, from data quality to description, is impressive in scale and functionality, keeping everything running smoothly under the hood. Narrowly focused innovation is comparatively easy; broad-reaching, publicly accessible, semantically enabled infrastructure is a miracle. PubChem represents that miracle in chemical information daily and further enables other laboratory and research infrastructures, such as chemical safety management, to close the information gap. Setting the stage for these impact-driven collaborations taps chemistry librarian expertise to connect infrastructures, user communities, and technology teams. This talk will discuss the multiplying effect of the PubChem infrastructure in Recognizing, Assessing, Managing and Preparing disparate nodes of information and stakeholders for real-world chemical information problems.

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States
Open chemical information, once a dream, is now commonplace and necessary. A primary achievement has been bridging chemical resources on the internet, whether open or closed, greatly enhancing the ability to find information about chemicals. However, aggregation of chemical information has revealed a number of challenges, including data quality and corruption, data representation, and the need for improved harmonization of software algorithms and knowledge representations. This talk will highlight where improvements can be made in the open chemical information space and how, as a community, we might make a real impact in improving the state of the art in chemistry knowledge representation.

1 Clark Library, Cornell University, Ithaca, New York, United States; 2 Dept of Env Hlth Safety, Keene State College, Keene, New Hampshire, United States; 3 National Center for Biotechnology Information, Bethesda, Maryland, United States
In response to recommendations from a variety of federal organizations, the ACS Committee on Chemical Safety has been developing resources to support risk assessment of chemical processes in the small-scale laboratory. Risk assessment is a scientific, information-centric approach to identifying, prioritizing, and managing laboratory hazards that goes beyond the traditional 'fume hood, eye protection and gloves' approach to laboratory safety. Developing information tools that effectively support this process requires clear definitions of the use case scenarios for this information.

This roundtable discussion will kick off with summary perspectives from a variety of stakeholders to identify characteristics of key use cases, including research, teaching and service laboratories. Researchers and lab supervisors, Environmental Health and Safety staff, chemistry librarians and information professionals will present their requirements for data sources, assessment of information quality and user interfaces. Cross-sector discussion of these aspects of chemical safety information will help inform development of tools made available both in public resources and institution-specific contexts.

1 Marston Science Library, University of Florida, Gainesville, Florida, United States; 2 EH&S, University of Delaware, Newark, Delaware, United States
In the last few years, accidents in academic research laboratories have raised international concern about safe research practices and the safety of lab researchers in academic institutions. When planning research projects or experiments, risk assessment and crisis management are among the most crucial issues a researcher needs to address before conducting any experiment. Beyond the hazards of the chemical or instrument being used, risks should be characterized in terms of the exposure and potential damage resulting from those hazards.
Risk assessment and emergency plans help lab researchers to understand the risk involved in the activities as well as preparing for the unwanted situations. Crisis communication and management is the process delivered at times of high trauma during or after an accident. In this presentation, we are going to discuss risk assessment in a research lab and recognition of the potential hazards using online information available through different resources, and formulation of emergency plans along with crisis management which could be life-saving in case of an emergency.

1 Clark Library, Cornell University, Ithaca, New York, United States; 2 Dept of Env Hlth Safety, Keene State College, Keene, New Hampshire, United States; 3 NLM/NCBI, National Institutes of Health, Bethesda, Maryland, United States
PubChem, a public chemical information database, has served the scientific community for more than a decade. PubChem provides chemical information from many perspectives: in addition to generic chemical information such as chemical structures and chemical and physical property data, PubChem also provides drug information, chemical safety and hazard information, patent information, and more. Chemical safety is a very important topic in the chemical industry, in chemistry labs, and in daily scientific life. PubChem has integrated safety and hazard information from various public sources such as the ILO International Chemical Safety Cards, the NIOSH Pocket Guide to Chemical Hazards, the OSHA Occupational Chemical Database, HSDB, CAMEO Chemicals, and more. In the chemical safety and hazard section, PubChem has added the Globally Harmonized System of Classification and Labeling of Chemicals (GHS) classification [1]. In this presentation, we will discuss PubChem's safety information collection, data integration, and data access.

1 Clark Library, Cornell University, Ithaca, New York, United States; 2 NLM/NCBI, National Institutes of Health, Bethesda, Maryland, United States; 3 Department of Chemistry, University of North Florida, Jacksonville, Florida, United States; 4 University of Southampton, Southampton, United Kingdom; 5 Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, Washington, United States

Semantic Web standards and technologies have emerged as an increasingly important approach to distributing and integrating scientific data. The Resource Description Framework (RDF) constitutes a family of World Wide Web Consortium specifications for data exchange and is a core part of the Semantic Web standards. PubChemRDF integrates the knowledge base across PubChem databases and other biological and biomedical databases at the National Center for Biotechnology Information. The schemaless RDF data content can be queried and analyzed using readily available Semantic Web technologies (namely, the SPARQL query language and logic-based inference). PubChem has aggregated chemical health and safety information from multiple national regulatory agencies that have aligned their hazard regulation standards with the Globally Harmonized System of Classification and Labeling of Chemicals (GHS) established by the United Nations. The available GHS safety and hazard information, together with chemical and physical data from European, Australian, Japanese, and United States regulatory agencies, has been integrated into the Laboratory Chemical Safety Summary (LCSS) for PubChem compounds. In the present work, we demonstrate the semantic annotation of the chemical health and safety information available in the LCSS and show how Semantic Web standards and technologies facilitate data exchange and information retrieval.
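The triple-pattern querying at the heart of SPARQL can be sketched in miniature. The following Python fragment is illustrative only: the URIs, predicates, and GHS class names are invented placeholders, not the actual PubChemRDF vocabulary. It matches a SPARQL-style basic graph pattern against an in-memory triple list.

```python
# Minimal sketch of SPARQL-style basic graph pattern matching over RDF
# triples, in the spirit of querying PubChemRDF for GHS hazard annotations.
# All identifiers below are hypothetical, for illustration only.

TRIPLES = [
    ("compound:CID241", "rdf:type",       "vocab:Compound"),
    ("compound:CID241", "vocab:hasLCSS",  "lcss:241"),
    ("lcss:241",        "vocab:ghsClass", "ghs:FlammableLiquid2"),
    ("compound:CID702", "rdf:type",       "vocab:Compound"),
    ("compound:CID702", "vocab:hasLCSS",  "lcss:702"),
    ("lcss:702",        "vocab:ghsClass", "ghs:FlammableLiquid2"),
    ("lcss:702",        "vocab:ghsClass", "ghs:EyeIrritant2A"),
]

def match(pattern, bindings, triples):
    """Yield extended variable bindings for one triple pattern.
    Variables are strings starting with '?'."""
    for triple in triples:
        b = dict(bindings)
        ok = True
        for p, t in zip(pattern, triple):
            if p.startswith("?"):
                if b.get(p, t) != t:   # already bound to something else
                    ok = False
                    break
                b[p] = t
            elif p != t:
                ok = False
                break
        if ok:
            yield b

def query(patterns, triples):
    """Join several triple patterns, like a SPARQL basic graph pattern."""
    results = [{}]
    for pattern in patterns:
        results = [b2 for b in results for b2 in match(pattern, b, triples)]
    return results

# Which compounds carry the GHS class "FlammableLiquid2"?
hits = query(
    [("?cid", "vocab:hasLCSS", "?lcss"),
     ("?lcss", "vocab:ghsClass", "ghs:FlammableLiquid2")],
    TRIPLES,
)
compounds = sorted({b["?cid"] for b in hits})
```

A real workflow would instead issue the equivalent SPARQL query against a PubChemRDF endpoint; the point here is only the join-over-bindings mechanism.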

NextMove Software, Cambridge, United Kingdom
When handling chemicals in a laboratory, the pictograms and labels found on containers are an immediate and obvious indication of the potential risks associated with their contents. Indeed, it is not uncommon to be misled into believing that access to reactant MSDS or SDS data sheets is the single solution required to eliminate all accidents from laboratories. However, few people pause to think where such data sheets and labels come from, if not handed down on stone tablets from Sigma-Aldrich and other chemical vendors. In theory, hazard classification is based on a rigorously defined set of physical experiments legislated by the United Nations and therefore consistent between vendors. Unfortunately, not only are these tests performed by relatively few organizations (i.e., rarely in academia), but many vendors skip expensive testing by simply erring on the side of caution; labels and data sheets, after all, are legal, not factual, documents. Fortunately, the increasing availability of public databases of experimental data, and of predictive models built on them, enables the estimation of hazard classifications for the many millions of compounds that don't have an SDS, such as the novel reaction products made in academia.

Clark Library, Cornell University, Ithaca, New York, United States
There are several information transfer points critical to managing chemical assets in research. Several questions arise concerning the importance of tracking information provenance as well as chemical description relevant to practical use and experimental planning. What information is needed for communication between stakeholders and systems? Is a digital object identifier type approach applicable to tracking vendor SDSs as documents of record? What definitions are necessary to develop a 'chemistry' identifier that resolves to chemical components, 2D structure, mixture composition, and states and forms designations commonly encountered in chemical sample inventories? What systematic description is needed for richer, experimental chemical property and reactivity information in the systems linked behind these identifiers? What information elements are required for a QR code to facilitate communication and information transfer? This discussion will consider these questions from both scientific and community practice perspectives, and how various existing standards and projects under the auspices of IUPAC and other standards organizations can be used to address these needs.

Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich, Switzerland
Feedback-driven hypothesis refinement is a central pillar of medicinal chemistry. Active machine learning has been successfully transferred from computer science to drug discovery research, putting artificial intelligence in charge of compound selection and allowing fully automated design-and-test processes. As an initial test, we used random forest prediction technology to retrospectively reproduce explorative and exploitive behaviours using various selection functions. We then pursued different active machine learning strategies in a virtual screening framework against the cancer- and HIV-relevant GPCR CXCR4. The compound selection strategy determines the outcome of each iteration in terms of model architecture change and improvement of predictive performance: exploitive strategies retrieved active compounds that did not necessarily provide strong feedback, while explorative selection found inactive compounds that had a large impact on the model architecture. We prospectively validated a balanced approach that identifies informative active compounds to focus model improvement on active regions of chemical space while providing desirable hits.
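The explorative versus exploitive selection functions described above can be sketched with per-tree votes from a (mock) random forest: exploitation picks the compound with the highest mean predicted activity, exploration picks the compound on which the ensemble disagrees most. Compound names and vote values below are invented for illustration; a real workflow would take the votes from a trained forest.

```python
# Hedged sketch of active-learning compound-selection functions operating
# on per-tree "probability of active" votes from a mock random forest.
from statistics import mean, pvariance

votes = {
    "cmpd_A": [0.9, 0.9, 0.8, 0.9],   # confidently active
    "cmpd_B": [0.1, 0.9, 0.2, 0.8],   # trees disagree: informative
    "cmpd_C": [0.1, 0.1, 0.2, 0.1],   # confidently inactive
}

def select_exploitive(votes):
    """Pick the compound the ensemble is most confident is active."""
    return max(votes, key=lambda c: mean(votes[c]))

def select_explorative(votes):
    """Pick the compound with the largest ensemble disagreement."""
    return max(votes, key=lambda c: pvariance(votes[c]))

def select_balanced(votes, w=0.5):
    """Trade off predicted activity against informativeness."""
    return max(votes,
               key=lambda c: w * mean(votes[c]) + (1 - w) * pvariance(votes[c]))
```

The balanced function is one simple way to weight the two objectives; the abstract's balanced strategy is not necessarily this linear combination.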

Structural and Chemical Biology, Icahn School of Medicine at Mount Sinai, New York, New York, United States
Computational technologies are fundamental components of early drug discovery projects, and their application is most notable in the hit identification phase. However, scoring remains a challenge in the lead optimization phase. Structure-based drug design is a crucial step in this phase, where medicinal chemists can make hundreds and even thousands of compounds to generate leads with desirable, drug-like physicochemical properties. Computational approaches are used hand in hand with medicinal chemistry to rank small molecules before they are synthesized. Recently, many advances have been made in predicting the free energy of binding of small molecules to proteins. However, estimating this affinity using physics-based methods is a tedious, slow process that usually requires manual intervention by experts. Thus cheminformatics approaches are valuable, fast tools to address this issue. Protein-ligand interactions are a simplified cheminformatics representation of the enthalpic effects between small molecules and proteins. In this study, we introduce the use of frequencies of protein-ligand interactions as descriptors in QSAR SVM models to predict ligand-binding affinity. SAR datasets for eight different targets were used to validate the methods. The overall cross-validation performance was comparable to a Free Energy Perturbation (FEP) protocol, with average Rp = 0.54 (vs Rp(FEP) = 0.56) and RMSE = 0.65 (vs RMSE(FEP) = 0.86). Predictive models were generated in a concrete structure-based design project involving novel bromodomain inhibitors. Binding affinities for multiple ligands were predicted and validated for the first and second domains of the BRD4 protein, with an average accuracy of 70% in a ligand-ranking scenario for chemical synthesis. Our method is useful for ligand optimization and selectivity assessment with fast predictions, which can lead to fewer molecules requiring synthesis for binding affinity validation.
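The descriptor idea above, turning observed protein-ligand contacts into fixed-length interaction-frequency vectors that a QSAR model can consume, can be sketched as follows. The interaction types, residue names, and contact list are hypothetical examples, not the study's actual featurization.

```python
# Illustrative sketch: map a ligand's protein contacts to a fixed-length
# interaction-frequency descriptor vector for use in a QSAR/SVM model.
INTERACTION_TYPES = ["hbond", "hydrophobic", "pi_stack", "salt_bridge"]

def interaction_descriptor(contacts):
    """Convert a list of (residue, interaction_type) contacts into a
    frequency vector ordered by INTERACTION_TYPES."""
    counts = {t: 0 for t in INTERACTION_TYPES}
    for _residue, itype in contacts:
        if itype in counts:
            counts[itype] += 1
    total = sum(counts.values()) or 1   # avoid division by zero
    return [counts[t] / total for t in INTERACTION_TYPES]

# Hypothetical contacts for one ligand pose
lig1 = [("ASN140", "hbond"), ("TRP81", "pi_stack"),
        ("LEU92", "hydrophobic"), ("LEU94", "hydrophobic")]
desc = interaction_descriptor(lig1)
```

Vectors like `desc`, computed for a series of ligands with measured affinities, would then serve as the feature matrix for SVM regression.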

The Institute of Cancer Research, Sutton, United Kingdom
Multiobjective molecular design methods have come of age and are now being considered actively in many drug design programs. However, the challenge of optimizing ligands in silico against multiple targets given extant data remains, and concerted efforts are being undertaken to address it. Effective multiobjective de novo design is enabled by effective search of the feasible chemistry space, synthetically appropriate designs that can be realized, and appropriate, predictive models for important biological and physicochemical endpoints.

1 B-IT, University of Bonn, Bonn, Germany; 2 Department of Life Science Informatics, University of Bonn, Bonn, Germany; 3 Life Science Informatics, University of Bonn, B-IT, Bonn, Germany
The generation of analogs of active compounds dominates hit-to-lead and lead optimization projects in medicinal chemistry. Most computational approaches applied in the course of chemical optimization attempt to aid in the design of better analogs and/or the exploration of SAR information associated with compound series.
A computational framework is introduced to systematically detect all synthetically accessible analogs of bioactive compounds in databases and determine how their chemical exploration might influence compound promiscuity (i.e., the ability of a compound to interact with multiple targets). For more than a third of all active compounds across 90% of activity classes, no analogs were detected. For the majority of compounds with analogs, chemical exploration had no detectable influence on promiscuity. However, for a subset of ∼26% of active compounds with analog sets, notable increases in promiscuity were observed, mostly due to the presence of single analogs with high degrees of promiscuity.
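The promiscuity comparison described above can be sketched in a few lines: promiscuity is the number of distinct targets a compound is annotated as active against, and the question is whether any analog exceeds the parent. The compound-target annotations below are invented for illustration.

```python
# Toy sketch of the promiscuity analysis: does chemical exploration of a
# parent compound's analogs raise promiscuity, and is the gain driven by
# a single highly promiscuous analog? Data are hypothetical.
activity = {
    "parent":   {"T1"},
    "analog_1": {"T1"},
    "analog_2": {"T1", "T2", "T3", "T4"},   # one highly promiscuous analog
}

def promiscuity(targets_by_compound):
    """Promiscuity = number of distinct annotated targets."""
    return {c: len(t) for c, t in targets_by_compound.items()}

def analog_promiscuity_gain(parent, analogs, targets_by_compound):
    """Maximum promiscuity among analogs minus the parent's promiscuity."""
    p = promiscuity(targets_by_compound)
    return max(p[a] for a in analogs) - p[parent]

gain = analog_promiscuity_gain("parent", ["analog_1", "analog_2"], activity)
```

Here the gain is entirely attributable to `analog_2`, mirroring the abstract's observation that increases were mostly due to single promiscuous analogs.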

1 Department of Life Science Informatics, University of Bonn, Bonn, Germany; 2 Life Science Informatics, University of Bonn, B-IT, Bonn, Germany
The concept of matched molecular pairs (MMPs) has experienced increasing interest in medicinal chemistry. An MMP is defined as a pair of compounds that only differ by a structural change at a single site. MMPs are often used to associate specific structural modifications with changes in molecular properties. The matching molecular series (MMS) was introduced as an extension of the MMP concept and defined as a set of compounds with pairwise MMP relationships. Thus, an MMS represents a series of analogs with modifications at a single site.
We have systematically identified all publicly available MMSs, classified their SAR characteristics, and explored structural relationships between them. The combination of SAR and structural relationship information enabled the identification of structurally related MMSs with similar or distinct SAR characteristics. Such MMSs combine series of analogs with different substitution sites and reveal how structural modifications influence SARs. They can also be used to explore analog pathways that change SAR characteristics and provide additional SAR information.
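The MMP/MMS definitions above lend themselves to a compact sketch: compounds sharing a common core and differing only in the substituent at a single site form one series, and every pair within a series is an MMP. The (core, substituent) decompositions below are hypothetical; in practice they come from systematic fragmentation of real structures.

```python
# Minimal sketch of grouping analogs into matching molecular series (MMSs)
# and enumerating the matched molecular pairs (MMPs) they imply.
from collections import defaultdict
from itertools import combinations

# compound name -> (core scaffold, substituent at the variable site);
# decompositions are invented for illustration.
compounds = {
    "c1": ("phenyl-X", "H"),
    "c2": ("phenyl-X", "Cl"),
    "c3": ("phenyl-X", "OMe"),
    "c4": ("pyridyl-X", "H"),
    "c5": ("pyridyl-X", "F"),
}

def matching_series(compounds):
    """Group compounds with the same core into series."""
    series = defaultdict(list)
    for name, (core, _sub) in sorted(compounds.items()):
        series[core].append(name)
    return dict(series)

def mmps(series):
    """Every within-series pair is an MMP (single-site difference)."""
    return [pair for members in series.values()
            for pair in combinations(members, 2)]

s = matching_series(compounds)
pairs = mmps(s)
```

With SAR data attached to each member, series like these become the units on which SAR characteristics and inter-series structural relationships are analyzed.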

A known problem of hierarchical cluster analysis (HCA), of ample use in chemoinformatics, is that resulting from ties in proximity, which arise when equidistant objects appear in the distance matrix of the objects to classify. It has been shown that this problem is very likely to occur, leading to non-unique classification results (dendrograms). We have shown how big the problem is even when the HCA algorithm is run on a fixed data set with a fixed grouping methodology and similarity measure. We call attention to the widespread disregard of the problem, whereby HCA results are taken as unique and conclusions are based upon them. This is, for example, the case in QSAR studies where HCA is used to select descriptors for the models.

We have introduced four methodologies to quantify cluster frequencies in the presence of ties in proximity; two treat clusters and dendrograms as sets and the other two as graphs. We use a toy example of well-separated clusters and a set of 1,666 molecular descriptors calculated for a group of molecules with hepatotoxic activity. The four methodologies can be used to derive cluster stability measurements on arbitrary sets of dendrograms over the same set of objects.

It was found that ties occurred frequently, some yielding tens of thousands of dendrograms, even for small data sets. Our results highlight the need to evaluate the effect of ties on clustering patterns before classification results can be relied upon.
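How a tie in proximity forces an arbitrary choice can be shown with a tiny distance matrix: when more than one pair attains the minimum distance, agglomerative clustering must pick one to merge first, and different picks can yield different dendrograms. The distances below are toy values.

```python
# Sketch of detecting ties in proximity in a distance matrix: all pairs
# attaining the minimum distance are equally valid first merges, so the
# resulting dendrogram is not unique.
def tied_minimum_pairs(dist):
    """Return every object pair attaining the minimum distance."""
    n = len(dist)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    dmin = min(dist[i][j] for i, j in pairs)
    return [(i, j) for i, j in pairs if dist[i][j] == dmin]

# Symmetric 4x4 toy matrix with a tie: d(0,1) == d(2,3) == 1.0
D = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 6.0],
    [4.0, 3.0, 0.0, 1.0],
    [5.0, 6.0, 1.0, 0.0],
]
ties = tied_minimum_pairs(D)   # both (0, 1) and (2, 3) could merge first
```

An exhaustive treatment would follow every tied choice recursively and enumerate all resulting dendrograms, which is how counts in the tens of thousands arise even for small data sets.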

NextMove Software, Cambridge, United Kingdom
Despite the huge advances made in nucleic acid synthetic chemistry over the last four decades, such as antisense oligonucleotides and siRNAs, the IUPAC/IUBMB recommendations on nucleic acid symbols and abbreviations have not been updated since their original publication in 1970. This creates technical challenges for biopharmaceutical companies and for efforts such as the Pistoia Alliance's HELM that attempt to encode the unusual backbones and bases found in current nucleic acid drug candidates as line notations. In this talk, we review several of the technical challenges in representing and encoding modern nucleic acid molecules, and present possible solutions to several of them. Case studies will include the FDA-approved phosphorothioate-linked antisense therapies fomivirsen and mipomersen. Hopefully, these proposals will help address the 'RNA informatics' gap between biological (activity) databases and the continually expanding chemical space of tractable nucleic acid analogs.

1 Computational Chemistry, CMDBioscience, East Windsor, New Jersey, United States; 2 Rochester Institute of Technology, Rochester, New York, United States
The future of virtual screening is to search through large areas of virtual chemical space. To do this efficiently, one needs to include experimentalists much earlier in the project to better focus on molecules that are more likely to be synthetically accessible. VSviewer3D is a simple open source Java tool for visual exploration of 3D virtual screening data. VSviewer3D brings together the ability to explore numerical data, such as calculated properties and virtual screening scores, structure depiction, interactive topological and 3D similarity searching, and 3D visualization. By doing so, the user is better able to quickly identify outliers, assess the tractability of large numbers of compounds, visualize hits of interest, annotate hits, and mix and match interesting scaffolds. We demonstrate the utility of VSviewer3D by describing a use case in a docking-based virtual screen.

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States
PubChem is an NIH public repository of chemical and biological data providing, along with other NIH public databases, extensive resources for biomedical discovery. However, the tremendous growth in the amount of data, its increasing heterogeneity, and its widening variation in quality demand novel approaches that allow fast retrieval of relevant, non-redundant, and reliable information. Search results should be provided in a meaningful aggregated form, with links to important information in other NIH databases and outside resources.

The problem of biomedical big data management and data mining has been recognized as a major strategic challenge that requires collaborative efforts of biomedical and computer scientists, engineers, programmers and other specialists to innovate and develop new data models and organizational approaches. With the data deluge, we cannot treat all the data equally: the information must be organized hierarchically around the most useful and well annotated data and indexed in a way that allows fast retrieval of the most useful and reliable information. At the same time, the user should be provided with an option to conduct deep refined searches.

In this presentation, we will discuss our recent efforts to improve the reliability of chemical names (also called synonyms) in PubChem. Chemical names in the PubChem Compound database have been compared with those included in Medical Subject Headings (MeSH) as well as those extracted from PubMed abstracts using text-mining programs such as LeadMine and PubTator. This cross-validation between synonyms from multiple sources has resulted in an improved scoring scheme for the synonyms of a given compound, which promotes the most useful and reliable names in our search and retrieval procedures. It also allows us to correct or disallow some PubChem names when inconsistencies between different sources are found.
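The cross-validation idea can be sketched simply: score each (synonym, compound) assignment by how many independent sources agree on it, and flag names claimed for more than one compound. The sources, names, and compound identifiers below are invented for illustration, not PubChem's actual scoring scheme.

```python
# Hedged sketch of synonym scoring by cross-source agreement, in the
# spirit of the approach described above. All data are hypothetical.
from collections import defaultdict

# source -> {synonym: compound id it assigns the name to}
sources = {
    "depositor":   {"aspirin": 2244, "acetylsalicylic acid": 2244, "asprin": 2244},
    "MeSH":        {"aspirin": 2244, "acetylsalicylic acid": 2244},
    "text-mining": {"aspirin": 2244, "acetylsalicylic acid": 2244, "asprin": 702},
}

def score_synonyms(sources):
    """Count, per (synonym, compound) assignment, how many sources agree."""
    score = defaultdict(int)
    for assignments in sources.values():
        for name, cid in assignments.items():
            score[(name, cid)] += 1
    return dict(score)

def conflicting(scores):
    """Synonyms assigned to more than one compound across sources."""
    by_name = defaultdict(set)
    for (name, cid) in scores:
        by_name[name].add(cid)
    return sorted(n for n, cids in by_name.items() if len(cids) > 1)

scores = score_synonyms(sources)
```

High-agreement names would be promoted in search and retrieval, while names returned by `conflicting` would be candidates for correction or disallowal.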

We plan to use the metadata-based relations between data from multiple sources as well as 2-D and 3-D similarity to provide advisory annotation information and web links. Further analysis of associations between records in multiple databases will help improve chemical annotation quality and more reliably link PubChem compounds to information about biological processes, hazards, genes and diseases.

NextMove Software, Cambridge, United Kingdom
Chemical sketches are ubiquitous in the published literature. Unlike connection table formats that precisely capture chemistry for database entry, the primary purpose of a sketch format is to produce a high quality image for conveying information to other chemists. Chemical sketches can be presented in a variety of chemistry-specific formats as well as image formats, with the latter presenting additional challenges to interpretation. Since 2001 the United States Patent Office has redrawn all chemical sketches in ChemDraw, yielding to date over 25 million freely available CDX files.
Correctly extracting chemistry from these files required tackling many issues, including disambiguation of ambiguous labels (e.g. B, D, P, V, Ac), interpretation of abbreviated labels (e.g. COOH), interpretation of free text overlaid on the structure (e.g. brackets for a repeated group), and assignment of reaction roles.
We report our work on extracting chemical structures and reactions from sketches and demonstrate the improvements in quality that tackling the intricacies of sketches provides over more naïve approaches. One notable improvement is the ability to better distinguish between specific compounds, fragments, generic structures and reaction schemes. We compare the chemistry extracted from sketches with the results from text-mining, and show that a large amount of chemistry is only available from one medium or the other. We also explore cases where the combination of the output from sketches and text enables extraction of data that either method in isolation could not, e.g., Markush structures and reactions where the product is given only as a sketch.

1 NCBI, NLM, NIH, Bethesda, Maryland, United States; 2 NCBI/CBB, NIH, Bethesda, Maryland, United States; 3 National Center for Biotechnology Information, Bethesda, Maryland, United States
PubChem is a free chemical database and an open archive of the biological activities of millions of substances. PubChem takes input data from more than 350 data sources worldwide, comprising millions of unique compounds, deposited substance records, and bioactivities. Scientific data in PubChem are rich and refined, including calculated properties, deposited annotations, and cross-links between resources. The need to disseminate the scientific knowledge contained in these data is pressing. However, accessing records numbering in the millions and billions is not straightforward, nor is summarizing the countless pieces of information they contain. Strategies for effective chemical information searching on the internet must consider both the efficiency of a search job and the significance of its results. This includes the infrastructure and databases that resource providers can use to finish a search job in a reasonable period of time, and search results that are collected and organized to reflect the scientific knowledge and information relevant to the questions and interests of users.
This talk describes the novel PubChem search system, which combines the speed of the Sphinx full-text search engine with the flexibility of retrieving diverse data from SQL databases. This Sphinx-SQL search system has been applied to PubChem widgets and PubChem Search. The talk will discuss the system from two perspectives: infrastructure and applications. For researchers interested in searching for chemical information, it will also share ideas on user experience and show how to build custom search engines and web pages using queries against this Sphinx-SQL system.
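The division of labor in such a design, a full-text index for speed paired with a record store for rich retrieval, can be mimicked in miniature with a pure-Python inverted index over mock compound records. This is an illustration of the general pattern only, not PubChem's actual Sphinx-SQL implementation, and all records below are invented.

```python
# Toy sketch of the "full-text index + record store" search pattern:
# an inverted index plays the role of the text engine, a dict of
# records plays the role of the SQL database. Data are hypothetical.
from collections import defaultdict

# "SQL side": id -> full record
records = {
    1: {"name": "aspirin", "synonyms": "acetylsalicylic acid 2-acetoxybenzoic acid"},
    2: {"name": "caffeine", "synonyms": "1,3,7-trimethylxanthine"},
    3: {"name": "salicylic acid", "synonyms": "2-hydroxybenzoic acid"},
}

# "Text-engine side": inverted index over the searchable text
index = defaultdict(set)
for rid, rec in records.items():
    for token in (rec["name"] + " " + rec["synonyms"]).lower().split():
        index[token].add(rid)

def search(query):
    """AND-match all query tokens against the index, then fetch records."""
    token_sets = [index.get(t, set()) for t in query.lower().split()]
    hit_ids = set.intersection(*token_sets) if token_sets else set()
    return [records[rid]["name"] for rid in sorted(hit_ids)]
```

A production system additionally handles ranking, stemming, chemical-name tokenization, and scale, which is precisely where a dedicated engine such as Sphinx earns its keep.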

1 Earth-Life Science Institute, Tokyo Institute of Technology, Meguro, Tokyo, Japan; 2 PRESTO, Japan Science and Technology Agency, Kawaguchi, Saitama, Japan
Huge computational costs are required to undertake ab initio approaches to the quantitative estimation of the chemical kinetics of unknown reactions when simulations run longer than a few microseconds, whereas the kinetics of known reactions can be readily estimated by solving rate equations. Here we introduce a heuristic approach that models a reaction as probabilistic dynamics exploring optimal combinations of the bonding states of atoms, which allows us to estimate unknown kinetics in a semi-quantitative manner while saving computational resources. We extend our previously proposed heuristic algorithm for solving an NP-complete constraint satisfaction problem (SAT) [Aono et al., Langmuir 2013], which was inspired by the spatiotemporal dynamics of a unicellular amoeba that exhibits sophisticated computing capabilities. The extended model consists of numerous fluctuating units, each of which abstracts a pseudopod of the amoeba and represents the bonded or unbonded state between two atoms [Aono et al., Orig. Life Evol. Biosph. 2015]. All units evolve concurrently, while unfavorable evolutions are prohibited probabilistically according to physicochemical constraints defined to reflect empirical laws such as Lewis's octet rule and the electronic theory of organic chemistry. The model discovers unknown bonding combinations when it stabilizes with all constraints satisfied, indicating that stable molecules have formed. The dynamics can be viewed as traveling among metastable combinations that are identified as reactants and/or products. By properly tuning the prohibition probabilities, the difficulty of traveling across transition states can be adjusted to achieve semi-quantitative kinetics estimation. Our future goal is to simulate the emergence of protometabolic networks in the early Earth environment, leading to an understanding of the origin of life.

1 Select-O-Sep, Freeport, Ohio, United States; 2 Wright State University, Dayton, Ohio, United States
Although there are a number of highly sophisticated electronic search services in many areas of science, rediscovering or reinventing concepts that were described years earlier continues to occur. This problem seems to occur more often when the initial idea or concept dates back 30 to 50 years or more. To illustrate this point, today's talk will present four examples of broad-based analytical chemistry techniques and methods where this problem has occurred. Analytical chemistry was selected not because it has more rediscoveries but because the authors are most familiar with this area of chemistry and the important developments that have occurred over the last half century or more.

There are a number of reasons why this happens, which fit into broad categories: 1) there are problems with the databases in not identifying early work, 2) researchers are not adequately using the databases, or 3) a combination of both. In order to consider each of these, the topics to be discussed will be: a) keywords and the use of keywords, b) terminology and changes in terminology, c) impatience with database searching, d) the ordering of items in search engines, e) lack of historical perspective, f) overuse of the Internet as the primary search tool, and g) default settings and direction of the search (e.g., from new to old or from old to new).

Although arguments can be made for all of the above causes, if a good search is carried out, past discoveries should not be missed and hence should not be rediscovered. Likewise, if a reference is missed and goes uncited for several years, the chances of finding it are diminished, though it remains possible to find.