Research

Citation is an essential part of scientific publishing and, more generally, of scholarship. It is used to gauge the trust placed in published information and, for better or for worse, is an important factor in judging academic reputation. Now that so much scientific publishing involves data and takes place through a database rather than conventional journals, how is some part of a database to be cited? More generally, how should data stored in a repository that has complex internal structure and that is subject to change be cited?

In many Big Data applications today, such as Next-Generation Sequencing, data processing pipelines are highly complex, span multiple institutions, and include many human and computational steps. The pipelines evolve over time and vary across institutions, so it is difficult to track and reason about them to ensure consistent and correct results. Provenance-enabled scientific workflow systems promise to help here, yet such systems are often avoided due to perceived inflexibility, a lack of good provenance-analytics tools, and an emphasis on supporting the data consumer rather than the producer.

Identifying Relationships in Collections of Scientific Datasets

Scientific datasets associated with a research project proliferate over time as a result of activities such as sharing datasets among collaborators, extending existing datasets with new measurements, and extracting subsets of data for analysis. As such datasets accumulate, it becomes increasingly difficult for a scientist to
keep track of their derivation history, which complicates data sharing, provenance tracking, and scientific reproducibility. Understanding
what relationships exist between datasets can help scientists recall their original derivation connection. For instance, if dataset A is contained in dataset B, then the connection could be that A was extended to create B.
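To make these relationships concrete, the checks below sketch naive set-based tests for three of them: duplicate, row containment, and a shared template (identical column headers). This is an illustrative sketch only, not ReConnect's actual algorithms, which are designed for efficiency; the function names and the tuple-based row representation are assumptions for the example.

```python
def row_set(rows):
    """Represent a dataset's rows as a set of tuples (ignores order and repeats)."""
    return set(map(tuple, rows))

def is_row_contained(a, b):
    """True if every row of dataset `a` also appears in dataset `b`."""
    return row_set(a) <= row_set(b)

def is_duplicate(a, b):
    """Two datasets are duplicates if each row-contains the other."""
    return row_set(a) == row_set(b)

def share_template(header_a, header_b):
    """A template relationship: same column structure, regardless of content."""
    return list(header_a) == list(header_b)

# Hypothetical example: dataset B extends dataset A with one new measurement.
header = ["sample_id", "value"]
a = [("s1", 0.5), ("s2", 0.7)]
b = [("s1", 0.5), ("s2", 0.7), ("s3", 0.9)]

print(is_row_contained(a, b))   # A is contained in B
print(is_row_contained(b, a))   # but not vice versa
print(share_template(header, header))
```

Under this naive formulation, discovering that A is contained in B (but not the reverse) supports the inference that B was derived by extending A.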

We introduce a set of relevant relationships, propose a relationship-identification methodology for testing relationships between pairs of datasets, develop a set of algorithms for efficiently discovering these relationships, and organize these algorithms into a new system, ReConnect, that assists scientists in relationship discovery.

While ReConnect helps identify the relationship between two given datasets, it is infeasible for scientists to use it to test all possible pairs in a collection of datasets. We therefore introduce an end-to-end prototype system, ReDiscover, that identifies, from a collection of datasets, the pairs that are most likely related. Our preliminary evaluation shows that ReDiscover predicted duplicate, row-containment, and template relationships with F1 scores of 80%, 57%, and 80%, respectively.