Category Archives: ACS Meetings

Curating and sharing structures and spectra for the environmental community

Presented by Emma Schymanski

The increasing popularity of high mass accuracy non-target mass spectrometry methods has yielded extensive identification efforts based on spectral and chemical compound databases in the environmental community and beyond. Increasingly, new methods are relying on open data resources. Candidate structures are often retrieved with either exact mass or molecular formula from large resources such as PubChem, ChemSpider or the EPA CompTox Chemistry Dashboard. Smaller, selective lists of chemicals (also called “suspect lists”) can be used to perform more efficient annotation. Mass spectral libraries can then be used to increase the confidence in tentative identification. Additional metadata (e.g. exposure and hazard information, reference and data source information) can be extremely useful to prioritize substances of high environmental interest. Exchanging information and “sharing structural linkages” between these resources requires extensive curation to ensure that the correct information is shared correctly, yet many valuable datasets arise from scientists and regulators with little official cheminformatics training. This talk will cover curation efforts undertaken to map spectral libraries (e.g. MassBank.EU, mzCloud) and suspect lists from the NORMAN Suspect Exchange (http://www.norman-network.com/?q=node/236) to unique chemical identifiers associated with the US EPA CompTox Chemistry Dashboard. The curation workflow takes advantage of years of experience, as well as contact with the original data providers, to enable open access to valuable, curated datasets to support environmental scientists and the broader research community (e.g. https://comptox.epa.gov/dashboard/chemical_lists). Note: This abstract does not reflect US EPA policy.
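The suspect-list step described in this abstract can be sketched in a few lines of Python. This is only an illustration, not the workflow used for the NORMAN or Dashboard lists: the three-entry list, the function names, the [M+H]+ adduct and the 5 ppm tolerance are all assumptions chosen for the example.

```python
# Minimal sketch of suspect screening by exact mass (illustrative only).
PROTON_MASS = 1.007276  # Da, added to the neutral mass for an [M+H]+ adduct

# Hypothetical suspect list: (name, molecular formula, monoisotopic mass in Da)
SUSPECTS = [
    ("caffeine", "C8H10N4O2", 194.08038),
    ("atrazine", "C8H14ClN5", 215.09377),
    ("carbamazepine", "C15H12N2O", 236.09496),
]


def ppm_error(observed, theoretical):
    """Relative mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def screen(observed_mz, suspects=SUSPECTS, tol_ppm=5.0):
    """Return suspects whose [M+H]+ m/z lies within tol_ppm of observed_mz."""
    hits = []
    for name, formula, mono_mass in suspects:
        theoretical_mz = mono_mass + PROTON_MASS
        err = ppm_error(observed_mz, theoretical_mz)
        if abs(err) <= tol_ppm:
            hits.append((name, formula, round(err, 2)))
    return hits
```

Calling `screen(195.0877)` returns only the caffeine entry, since no other suspect falls inside the tolerance window. A match of this kind is a tentative annotation; as the abstract notes, spectral library evidence is still needed to raise confidence.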

Automated Structure Annotation and Curation for MassBank: Potential and Pitfalls

Presented by Emma Schymanski

The European MassBank server (www.massbank.eu) was founded in 2012 by the NORMAN Network (www.norman-network.net) to provide open access to mass spectra of substances of environmental interest contributed by NORMAN members. The automated workflow RMassBank was developed as a part of this effort (https://github.com/MassBank/RMassBank/). This workflow included automated processing of the mass spectral data, as well as automated annotation using the SMILES, Names and CAS numbers provided by the user. Cheminformatics toolkits (e.g. Open Babel, rcdk) and web services (e.g. the CACTUS Chemical Identifier Resolver, Chemical Translation Services (CTS), ChemSpider, PubChem) were then used to convert and/or retrieve the remaining information for completion of the MassBank records (additional names, InChIs, InChIKeys, several database identifiers, mol files), to avoid excessive burden on the users and reduce the chance of errors. To date, approximately 16,000 MS/MS spectra (61 % of all open data as of Nov. 2016) corresponding to 1,269 (18 %) unique chemicals have been uploaded to MassBank.EU via RMassBank. Curating the MassBank.EU records, as part of efforts to provide EPA CompTox Dashboard identifiers (DTXSIDs) for each record, revealed several conflicts in the chemical metadata arising from varying sources. In addition, the representation of “ambiguous substances”, for example complex surfactant mixtures of various chain lengths and branching or incompletely defined structures of transformation products, is an ongoing challenge. In this work, we report on proof-of-concept solutions for “ambiguous structure” representation, currently unavailable in the majority of cheminformatics tools.
This presentation reflects on the effectiveness of the original RMassBank concept, and on both the potential and the pitfalls of using automated structure annotation with open resources to streamline spectra contributions from external laboratories and users with widely ranging cheminformatics experience. Note: this work does not necessarily reflect U.S. EPA policy.

The Spring ACS Meeting is coming, and it’s coming quickly. Every time the New Year starts I think I have a long time before I have to assemble posters and write talks for the ACS Meeting. When I worked at the RSC it was easier in some ways as NO ONE reviewed them, no one gave comments on them and there was no clearance process involved. Mostly I was writing the talks on the flight out to the ACS or, more commonly, was writing them the evening before or morning of the presentations. There have been days when I got up in the morning at 4am to write two talks on the day I presented. Quite exhausting but at least I got to show the latest and greatest capabilities.

As an employee at the EPA there are different expectations, especially with regard to the clearance process, where the presentations are reviewed and signed off, pushed through our internal repository and, post-presentation, released to the community via Science Inventory. Some, not all, of the presentations and papers I have been involved with since joining EPA are here.

I will be going to the ACS meeting with a number of colleagues and chairing a session on Thursday, all day, with Chris Grulke for the Division of Environmental Chemistry. I will be presenting a number of posters and presentations as listed below. A number of my colleagues will also be presenting. Andrew McEachran, a recent postdoc with the center, will be presenting on a lot of the work that has been done on the use of the Chemistry Dashboard to facilitate structure identification. The recent publication “Identifying known unknowns using the US EPA’s CompTox Chemistry Dashboard” (http://link.springer.com/article/10.1007%2Fs00216-016-0139-z) reported on a comparison of the dashboard versus ChemSpider. Since then we have rolled out a lot of new functionality to support structure identification and Andrew will report on that.

PAPER ID: 2624963
PAPER TITLE: Twenty five years in cheminformatics: A career path through a diverse series of roles and responsibilities

This presentation was given as a two-hour hands-on training course at the Frontier Building in the Research Triangle Park in NC, funded by an Industry Award Grant from the ACS and matching financial support from the Research Triangle Institute.

Abstract: “Many of us nowadays invest significant amounts of time in sharing our activities and opinions with friends and family via social networking tools such as Facebook, Twitter or other related websites. However, despite the availability of many platforms for scientists to connect and share with their peers in the scientific community, the majority do not make use of these tools, despite their promise and potential impact and influence on our careers. We are already being indexed and exposed on the internet via our publications, presentations and data, and new “AltMetric scores” are being assigned to scientific publications as measures of popularity and, supposedly, of impact. We now have even more ways to contribute to science, to annotate and curate data, to “publish” in new ways, and many of these activities are part of a growing crowdsourcing network. This presentation provides an overview of the various types of networking and collaborative sites available to scientists and ways to expose your scientific activities online. It will discuss the new world of AltMetrics that is on an explosive growth curve and will help you understand how to influence and leverage some of these new measures. Whether you participate online simply for career advancement or for wider exposure of your research, there is now a series of web applications that can provide a great opportunity to develop a scientific profile within the community.”

A new paper that came out of a collaboration initiated at an ACS Meeting, maybe three years ago, has finally gone online. My recollection is that at an ACS CINF reception I started chatting with Vincent Scalfani. At that time I was involved with ChemSpider and he bounced an idea about 3D printing of crystal structures. I reported that we were going to host the Crystal Structures on ChemSpider (here) and Vincent even presented on it at the ACS (here, with >2000 views). But as happened on a fairly regular basis a great idea never came to fruition and the data were not put onto ChemSpider, and I left to join the EPA over eighteen months ago.

But it was still great work, and when it was made clear that the data would not see the light of day, the original article, written 2 years ago give or take, was adjusted to simply communicate that the data were available on Figshare here (https://dx.doi.org/10.6084/m9.figshare.c.3302859.v6). The peer review process gave good feedback and pretty much said “Why aren’t they on a searchable database?” Well, we tried, but Bob Hanson, Jmol-hero, got to work and produced this site in a few days! Bob is incredibly productive.

Well then the paper was accepted, all is good, the data are open and the world has access to tens of thousands of crystal structures ready for printing.

The EPA iCSS Chemistry Dashboard to Support Compound Identification Using High Resolution Mass Spectrometry Data

There is a growing need for rapid chemical screening and prioritization to inform regulatory decision-making on thousands of chemicals in the environment. We have previously used high-resolution mass spectrometry to examine household vacuum dust samples using liquid chromatography time-of-flight mass spectrometry (LC-TOF/MS). Using a combination of exact mass, isotope distribution, and isotope spacing, molecular features were matched with a list of chemical formulas from the EPA’s Distributed Structure-Searchable Toxicity (DSSTox) database. This has further developed our understanding of how openly available chemical databases, together with the appropriate searches, could be used for the purpose of compound identification. We report here on the utility of the EPA’s iCSS Chemistry Dashboard for the purpose of compound identification using searches against a database of over 720,000 chemicals. We also examine the benefits of QSAR prediction for the purpose of retention time prediction to allow for alignment of both chromatographic and mass spectral properties. This abstract does not reflect U.S. EPA policy.
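As a rough illustration of the exact-mass and isotope-spacing matching mentioned in this abstract, the sketch below computes a monoisotopic mass from a molecular formula and checks whether an M/M+2 peak pair is spaced as expected for chlorine. It is not the procedure used in the dust study: the function names, the small element table and the 0.0005 Da spacing tolerance are assumptions made for the example.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotope for a few elements.
# Any element missing from this table raises a KeyError.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307401,
    "O": 15.99491462,
    "S": 31.97207117,
    "Cl": 34.96885268,
}

# Mass difference between 37Cl and 35Cl in Da.
CL_SPACING = 1.99704991


def monoisotopic_mass(formula):
    """Monoisotopic mass of a simple formula such as 'C8H14ClN5'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def spacing_consistent_with_chlorine(mz_m, mz_m2, tol=0.0005):
    """True if the M to M+2 spacing matches the 35Cl/37Cl mass difference."""
    return abs((mz_m2 - mz_m) - CL_SPACING) <= tol
```

For atrazine, `monoisotopic_mass("C8H14ClN5")` gives about 215.0938 Da. The tolerance is deliberately tight because the 32S/34S spacing (about 1.9958 Da) lies close to the chlorine value, so a looser window would not separate the two.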

Last night I was honored to receive an award from the North Carolina Local Section of the American Chemical Society. I had the chance to review the past 20 years of my career with the attendees. I assembled a slide deck from about ten years of slides stored on Slideshare (I am glad I have been storing them there as it’s a great online storage place!). I appreciate the recognition from the Local Section. THANKS!

Cheminformatics and computational chemistry have had an enormous impact in providing environmental chemists and toxicologists with access to data, information and knowledge. With an overwhelming array of online resources and an increasingly rich collection of software tools, the ability to source information continues to expand. Scientists typically seek chemical data in the form of chemical properties, their function and use, as well as information regarding their exposure potential, persistence in the environment and their transformation in environmental and biological systems. Commonly, the most pressing concern regarding chemicals is their potential as environmental toxicants. The increasing rate of production and release of new chemicals into commerce requires improved access to historical data and information to assist in hazard and risk assessment. High-throughput in vitro and in silico analyses are increasingly being brought to bear to rapidly screen chemicals for their potential impacts. Interweaving this information with more traditional in vivo toxicity data and exposure estimation, to provide integrated insight into chemical risk, is a burgeoning frontier on the cusp of cheminformatics and environmental sciences.

This symposium will bring together a series of talks to provide an overview of the present state of data, tools, databases and approaches available to environmental chemists. The session will include the various modeling approaches and platforms, will examine the issues of data quality and curation, and intends to provide the attendees with details regarding availability, utility and applications of these systems. We will focus especially on the availability of Open systems, data and code to ensure no limitations to access and reuse.

Topics covered in this session include, but are not limited to:

Standards for data exchange and integration in environmental chemistry

Implementations of Read-across prediction

Adverse Outcome Pathway data and delivery

Please submit your abstracts using the ACS Meeting Abstracts Programming System (MAPS) at https://maps.acs.org. General information about the conference can be found at www.acs.org/meetings. Any other inquiries should be directed to the symposium organizers:

As part of our efforts to develop a public platform to provide access to predictive models we have attempted to disentangle the influence of the quality versus quantity of data available to develop and validate QSAR models. Using a thorough manual review of the data underlying the well-known EPI Suite software, we developed automated processes for the validation of the data using a KNIME workflow. This includes approaches to validate different chemical structure representations (e.g. molfile and SMILES) and identifiers (chemical names and registry numbers), and methods to standardize the data into QSAR-consumable formats for modeling. Our efforts to quantify and segregate data into various quality categories have allowed us to thoroughly investigate the resulting models developed from these data slices, and to examine whether or not efforts to develop large high-quality datasets have the expected pay-off in terms of prediction performance. Machine-learning approaches have been applied to create a series of models that have been used to generate predicted physicochemical and environmental parameters for over 700,000 chemicals. These data are available online via the EPA’s iCSS Chemistry Dashboard. This abstract does not reflect U.S. EPA policy.
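One simple, automatable identifier check of the kind this abstract alludes to is verifying the check digit of a CAS Registry Number. The sketch below is written in Python purely for illustration and is not taken from the KNIME workflow described above; the function name `is_valid_casrn` is an assumption.

```python
import re

# A CAS Registry Number has 2-7 digits, then 2 digits, then 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_casrn(casrn):
    """Check the format and check digit of a CAS Registry Number string."""
    match = CAS_PATTERN.match(casrn.strip())
    if not match:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit.
    weighted_sum = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return weighted_sum % 10 == check_digit
```

For example, `is_valid_casrn("7732-18-5")` (water) passes, while a mistyped `"58-08-3"` fails. A valid check digit only shows that the number is well formed; it does not confirm that the number belongs to the structure or name it is paired with, which is why cross-checking the different representations still matters.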

Title: Investigating Impact Metrics for Performance for the US-EPA National Center for Computational Toxicology

The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program integrates advances in biology, chemistry, and computer science to help prioritize chemicals for further research based on potential human health risks. This work involves computational and data-driven approaches that integrate chemistry, exposure and biological data. We have delivered public access to terabytes of open data, as well as to a large number of publicly accessible databases and applications, to support the research efforts of a large community of scientists. Many of our contributions to science are summarily described in research papers but to date we have not optimized our contributions to inform altmetrics statistics associated with our work. Critically missing from altmetrics is access to our numerous software applications and web service accesses, as well as the growing importance of our experimental data and models (e.g. ToxCast, ExpoCast, DSSTox and others) to the scientific and regulatory communities. This presentation will provide an overview of our efforts to more fully understand, and quantify, our impact on the environmental sciences using a combination of our measurement approaches and available altmetrics tools. This abstract does not reflect U.S. EPA policy.