Data management

Details

ABOUT THIS PROJECT

Data Application Programming Interfaces

We are making toxicological data (and in principle any life science data) organised and easily accessible. Whenever we do that, we also make sure that data access is reproducible, be it in private or public environments. The basis for this accessibility is a set of well-defined Application Programming Interfaces (APIs). These APIs allow for automatic data discovery, exploration and use of data in a wide array of processing tools.

And while APIs are great, we also realize that it is often people, not machines, who first need to understand and explore the data. For this, we are developing an accompanying web-based user interface that scientists use to quickly explore, search, compare and finally select and export the data they need. And yes, we provide the means to use the selected data anywhere, from Excel to piping it downstream (via APIs) to various modelling and processing tools. One very exciting moment for us is when we close the circle and use the output of the downstream processing tools to create new datasets in our data platform.

So how exactly do we expose hundreds of millions of data points in a way that allows users to quickly drill down to exactly what they need? Instead of relying on a black-box search box approach, we heavily aggregate all the data, which in turn allows us to construct dynamic user interfaces in which users filter the data by selecting values from the calculated aggregations. We always combine this with a powerful search box as well.
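To make the aggregation idea concrete, here is a minimal Python sketch of faceted filtering. The records, the field names (`assay`, `species`) and the helper functions are illustrative assumptions, not the platform's actual schema or implementation; a production system would typically compute these counts in a search backend rather than in memory.

```python
from collections import Counter

# Hypothetical records; field names and values are invented for illustration.
RECORDS = [
    {"compound": "A", "assay": "cytotoxicity", "species": "human"},
    {"compound": "B", "assay": "cytotoxicity", "species": "rat"},
    {"compound": "C", "assay": "receptor binding", "species": "human"},
    {"compound": "D", "assay": "receptor binding", "species": "human"},
]


def apply_filters(records, selected):
    """Keep records whose fields match every selected facet value."""
    return [
        r for r in records
        if all(r.get(field) == value for field, value in selected.items())
    ]


def facet_counts(records, fields):
    """Count how often each value occurs per field (the 'aggregations')."""
    return {
        field: Counter(r[field] for r in records if field in r)
        for field in fields
    }


# Step 1: aggregate everything to build the initial filter controls.
facets = facet_counts(RECORDS, ["assay", "species"])

# Step 2: the user picks a value; filter, then re-aggregate what remains.
subset = apply_filters(RECORDS, {"species": "human"})
narrowed = facet_counts(subset, ["assay", "species"])
```

Each selection shrinks the record set and the counts are recomputed, so the interface only ever offers values that still lead to results.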

Current Work

Douglas Connect is undertaking a major effort to simplify and modernize access to scientific data sources, currently focusing on toxicology. Three of the most popular toxicological data sources are already publicly available for consumption: the US EPA's in vitro ToxCast/Tox21 database, the EPA's in vivo ToxRefDB database and the NIBIOHN's toxicogenomics Open TG-GATEs database. Further resources are under development.

For all of these data sources, our aim was to make the data accessible online in real time through a REST-style API. Such APIs can easily be consumed by a wide range of workflow tools (e.g. KNIME, Garuda) and programming languages (e.g. R, Python or JavaScript). Our guiding principles are:

• Expose the data faithfully and completely (no missing information, no transformations or edits to the data, reuse existing fieldnames, ...)
• Provide means for easy data access and exploration (provide powerful mechanisms for filtering/searching of data and aggregations)
• Open source the implementation so the results can be audited, from downloading the official release to the data arriving at the client
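As a rough illustration of how a client might consume a REST-style API of this kind from Python, consider the sketch below. The base URL, the `compounds` resource, the query parameters and the top-level `data` key are all placeholders chosen for this example; the real endpoints and response layout are described in the technical documentation and may differ.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder base URL; not a real endpoint of the APIs described here.
BASE_URL = "https://api.example.org/v1"


def build_url(resource, **params):
    """Build a GET URL with sorted query parameters for reproducible requests."""
    query = urlencode(sorted(params.items()))
    url = f"{BASE_URL}/{resource}"
    return f"{url}?{query}" if query else url


def fetch_json(url, timeout=30):
    """Perform the GET request and decode the JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def extract_field(payload, field):
    """Collect one field from each record, assuming a top-level 'data' list."""
    return [record[field] for record in payload.get("data", [])]


url = build_url("compounds", limit=2, species="human")

# Canned response standing in for fetch_json(url), so the sketch runs offline.
sample = json.loads('{"data": [{"name": "compound A"}, {"name": "compound B"}]}')
names = extract_field(sample, "name")
```

Sorting the query parameters means the same selection always yields the same URL, which makes a request easy to record and replay.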

In addition to adding more data sources, we focus on two major developments for this year:
First, to enable a degree of automated discovery and unification of data sources, we are working on creating standards to annotate API descriptions with ontology terms.
Second, we are working on a set of tools that will allow non-programmers to turn CSV-style datasets into rich APIs.

Our vision is to make finding and using data joyful. If you want to learn more, please read our recent blog post on our data APIs (https://douglasconnect.com/blog/opentox-data-apis) or go straight to the technical documentation on GitHub (https://github.com/DouglasConnect/opentox-toxcast).
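For readers wondering what turning a CSV-style dataset into something queryable involves at its simplest, the following Python sketch may help. It is only an assumption-laden illustration: the CSV content and column names are invented, and the `load_rows` and `query` helpers are hypothetical stand-ins, not part of the tools mentioned above.

```python
import csv
import io

# Invented CSV content; column names and values are placeholders.
CSV_TEXT = """compound,assay,value
A,assay_1,120
B,assay_1,45
A,assay_2,10
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, reusing the header as field names."""
    return list(csv.DictReader(io.StringIO(text)))


def query(rows, **filters):
    """Return rows matching every filter, much like query parameters would."""
    return [
        row for row in rows
        if all(row.get(key) == value for key, value in filters.items())
    ]


rows = load_rows(CSV_TEXT)
assay_1 = query(rows, assay="assay_1")
```

Note that `csv.DictReader` returns every value as a string, so a real tool would also need to infer or declare column types before offering numeric filters.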