Do You Know Where Your Data Came From?

Unless you’ve been living under a rock the past few years (and if you have, no judgement), you know there has been growing cultural concern over data privacy and legitimacy. This is the age of fake news and consumer data monetization. Just last month, Google bought credit card information from Mastercard, inciting a flurry of privacy and compensation concerns; and of course, we all remember spring’s Cambridge Analytica debacle.

But the landscape is changing for the private sector as well. Fujitsu Laboratories, a Japanese research and development firm, just announced the release of a Blockchain-supported Data Provenance solution called ChainedLineage. The technology “tracks the provenance history of data collected from multiple companies or individuals, including consent,” meeting corporate demand for an efficient means of complying with the GDPR and other data regulations.

Though Data Privacy and Data Legitimacy may seem like separate issues, they are linked through Data Provenance. If we know where our data came from and how it got to us, we also know whether it was released with consent and by whom, giving us the ability to assess its credibility.

There may be no shortage of data in this information age, but credibility comes at a premium. Stating something as fact does not legitimize it as such. At best, we ignore dubious claims; at worst, we become convinced of them. And in the immortal words of Artemus Ward, “It ain’t so much the things we don’t know that get us in trouble. It’s the things we know that ain’t so.” It has become more important than ever to verify our sources before making data-driven decisions.

Data Provenance and Data Lineage are often used interchangeably, although some distinguish between the two. According to Datajigsaw, an informational resource by London-based Data Management firm Ortecha, Data Lineage is essentially a record showing the data’s transit from one point to another whereas Data Provenance “is the documentation of data in sufficient detail to allow reproducibility of a specific dataset.” In other words, Data Lineage explains where the data came from, and Data Provenance is a recipe for its recreation.

This conflicts somewhat with the World Wide Web Consortium’s (W3C) definitions, however. Their Data on the Web Best Practices report defines Data Provenance as “metadata that allows data providers to pass details about the data history to data users” and does not refer to Data Lineage at all. For the purposes of this article, we will use the more authoritative W3C definition as well as its more widely accepted use of the term metadata, meaning data about other data.

The W3C recommends that any entity publishing data on the Web provide metadata appropriate to the context in both human- and machine-readable formats. The report gives explicit instructions on how to supply provenance metadata, as its authors maintain that provenance “helps one determine whether to trust the data and provides important interpretive context.” Provenance metadata should include the data’s publisher, publication date, and individual creator. Those unconvinced that the W3C’s best practices are warranted can link from the report to its notes section, which lists examples of data publishers who would benefit the global community by supplying certain categories of metadata with their data sets.
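As a rough illustration of what such a record might look like, the sketch below builds a minimal provenance entry using Dublin Core terms, one of the vocabularies the W3C report points to, and serializes it as JSON-LD so the same record is both human- and machine-readable. The data set URL, publisher, and creator names are hypothetical placeholders, not drawn from the report.

```python
import json

# A minimal, hypothetical provenance record for a published data set,
# covering the three elements discussed above: publisher, publication
# date, and individual creator. "dct:" is the Dublin Core terms
# namespace; all values here are illustrative placeholders.
provenance = {
    "@context": {"dct": "http://purl.org/dc/terms/"},
    "@id": "https://example.org/dataset/quarterly-sales",  # hypothetical
    "dct:publisher": "Example Data Co.",
    "dct:issued": "2018-09-01",
    "dct:creator": "Jane Analyst",
}

# Serializing to JSON-LD keeps the record machine-readable while
# remaining easy for a person to inspect.
print(json.dumps(provenance, indent=2))
```

A data consumer receiving this file alongside the data set itself could check the publisher and creator fields against known sources before trusting the data.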

Visualization depicting how data and its concomitant metadata should be published on the Web. From “Context” in Data on the Web Best Practices.

Because the W3C is more an international community than a regulatory body, it’s up to the companies and organizations interested in using web-based data to demand that data publishers comply with an accepted standard. Of course, provenance alone will not validate Data Quality, but it’s an important first step toward that end.

If and when Blockchain technology becomes ubiquitous, metadata files may be the best means of giving web-based data sets the pedigree they need to inform enterprise business decisions. As effective as regulators are at using punitive measures to incentivize good data practices, the real incentive lies in the monetary value of reliable data from a legitimate source. Financial firms rely on information gatekeepers like the Association of National Numbering Agencies (ANNA) and Bloomberg not to appease regulators but to capitalize on the data they provide. One day there may be such organizations for web-based information exchange; but until then, any company utilizing web-based data should strongly consider pushing for these metadata files and incorporating those standards into its Data Management processes.

About the author

Mike Brody is the co-founder and CEO of Exago Inc., a web-based solution for software companies looking to provide ad hoc reporting, dashboards, and analytics to their internal and external customers. Since the company’s inception in 2006, Mike has led Exago to its position as a self-funded and profitable player in the business intelligence software market.