ETL: The Dirty Little Secret of Data Science

Clinical research environments contain a vast diversity of data in various systems and formats. From large clinical data warehouses to a researcher’s spreadsheet of hand-annotated bio-specimen samples, all data are important and potentially useful in multiple studies. Many assume that because we have an abundance of data at our fingertips, answering questions gets easier. In reality, having more data increases the precision of our answers, but a substantial, often unacknowledged amount of work goes into preprocessing the data before any questions are answered.

When integrating data in this kind of environment, developers must test, validate, and debug their assumptions about the data prior to implementing any extract, transform, and load (ETL) steps. This process can be laborious since most of this data does not come with a convenient REST API or robust documentation of how the data is structured. Developers typically record the results of their inspection in project or code documentation, in flow diagrams, and implicitly in ETL scripts. These formats are not easily accessible to others and are generally not portable between projects. Further, they are static and cannot be annotated with information about the data itself, which becomes available only when the ETL processes run. What is needed is a way to represent data elements that can be enriched with information and connected to other data elements prior to and during ETL.

My team and I developed a model and implementation for representing metadata from data models and systems in a uniform way, making it possible to annotate these metadata as new information is learned. The initial implementation of this model is called Origins, an open source Python library for extracting structural and statistical metadata from data stores and representing them in graph structures. The graph data structure provides a simple and uniform way of representing data elements (via nodes) and the relationships between elements (via edges), such as how birth_date in source A is related to date_of_birth in source B. Origins currently supports extracting metadata from relational database systems, delimited files (e.g. CSV), Excel spreadsheets, file systems, document stores, and REDCap projects. For data persistence, Origins supports exporting to Neo4j, an open source graph database.
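To make the node-and-edge idea concrete, here is a minimal sketch of how data elements from two sources might be modeled as graph nodes with a relating edge. This is an illustration only, not the Origins API; the identifiers and property names are invented for the example.

```python
# Illustrative sketch only -- NOT the Origins API. Data elements become
# nodes (keyed by a source-qualified id) and cross-source relationships
# become edges carrying their own properties.

nodes = {
    "A/birth_date":    {"source": "A", "name": "birth_date",    "type": "date"},
    "B/date_of_birth": {"source": "B", "name": "date_of_birth", "type": "date"},
}

edges = [
    # birth_date in source A represents the same concept as
    # date_of_birth in source B
    ("A/birth_date", "B/date_of_birth", {"relation": "same_as"}),
]

def related(node_id):
    """Return the ids of all nodes connected to node_id by any edge."""
    out = []
    for src, dst, _props in edges:
        if src == node_id:
            out.append(dst)
        elif dst == node_id:
            out.append(src)
    return out

print(related("A/birth_date"))  # ['B/date_of_birth']
```

Because both nodes and edges carry property dictionaries, new information learned during ETL (statistics, provenance, notes) can be attached to either without changing the overall structure.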

In this talk I will cover:

A short background on the environment I work in and why this topic is so relevant there

The inception of and introduction to Origins

A demo of how to extract metadata from data stores, use the API to connect data elements, export the graph to Neo4j, and ask questions about the data with Cypher queries

A focused description of my current use of Origins for robust ETL and change logging for integrating data from multiple institutions into the next version of the AudGenDB project
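As a taste of the Neo4j export and Cypher querying mentioned above, the sketch below generates Cypher MERGE statements that could load a small metadata graph into a Neo4j instance. The node label (Element), relationship type (SAME_AS), and property names are hypothetical choices for this example, not the schema Origins produces.

```python
# Illustrative sketch -- labels, relationship types, and properties are
# hypothetical, not the schema Origins emits. Generates Cypher that
# could be run against Neo4j to persist a small metadata graph.

def element_cypher(source, name):
    """Cypher to create (or match) a node for one data element."""
    return f"MERGE (:Element {{source: '{source}', name: '{name}'}})"

def relation_cypher(a_source, a_name, b_source, b_name, rel="SAME_AS"):
    """Cypher to connect two existing element nodes with a relationship."""
    return (
        f"MATCH (a:Element {{source: '{a_source}', name: '{a_name}'}}), "
        f"(b:Element {{source: '{b_source}', name: '{b_name}'}}) "
        f"MERGE (a)-[:{rel}]->(b)"
    )

statements = [
    element_cypher("A", "birth_date"),
    element_cypher("B", "date_of_birth"),
    relation_cypher("A", "birth_date", "B", "date_of_birth"),
]
for stmt in statements:
    print(stmt)
```

Once loaded, a question like "what does birth_date map to elsewhere?" becomes a one-line Cypher query, e.g. `MATCH (a:Element {name: 'birth_date'})-[:SAME_AS]-(b) RETURN b.source, b.name`.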


Byron Ruth

The Children's Hospital of Philadelphia

Byron Ruth is a Lead Analyst/Programmer in the Center for Biomedical Informatics at The Children’s Hospital of Philadelphia. Byron’s skills in advanced web programming environments, APIs, and architectural software design have enabled him to lead a variety of projects at CHOP, including the development of a highly integrated audiology research database, an electronic health record-mediated clinical decision support engine for the care of premature infants, and a data management system that helps to discover relationships between genetic markers of congenital heart defects and clinical outcomes.