It should not come as a surprise that in a big data environment, much like any environment, the end users might have concerns about the believability of analytical results. This is particularly true when there is limited visibility into the trustworthiness of the data sources. One added challenge is that even if the producers of the data sources are known, the actual derivation of the acquired datasets may still remain opaque. Striving for data trustworthiness has driven the continued development and maturation of processes and tools for data quality assurance, data standardization, and data cleansing. Data quality is generally seen as a mature discipline, particularly when the focus is evaluating datasets and applying remedial or corrective actions to ensure that the datasets are fit for the purposes for which they were originally intended.

5.1 THE EVOLUTION OF DATA GOVERNANCE

In the past 5 years or so, there have been a number of realizations that have, to some extent, disrupted this perception of data quality maturity, namely:

Correct versus correction: In many environments, tools are used to fix data, not to ensure that the data is valid or correct. What was once considered to be the cutting edge in terms of identifying and then fixing data errors has, to some extent, fallen out of favor, replaced by process-oriented validation, root cause analysis, and remediation.

Data repurposing: More organizational stakeholders recognize that datasets created for one functional purpose within the enterprise (such as sales, marketing, accounts payable, or procurement, to name a few) are used multiple times in different contexts, particularly for reporting and analysis. The implication is that data quality can no longer be measured in terms of fitness for purpose, but instead must be evaluated in terms of fitness for purposes, taking all downstream uses and quality requirements into account.

The need for oversight: This realization, which might be considered a follow-on to the first, is that ensuring the usability of data for all purposes requires more comprehensive oversight. Such oversight should include monitored controls incorporated into the system development life cycle and across the application infrastructure.

These realizations lead to the discipline called data governance. Data governance encompasses the processes for defining corporate data policies, the processes for operationalizing observance of those policies, and the organizational structures, including data governance councils and data stewards, put in place to monitor, and hopefully ensure, compliance with those data policies.

Stated simply, the objective of data governance is to institute the right levels of control to achieve three outcomes:

1. Alert: Identify data issues that might have negative business impact.

2. Triage: Prioritize those issues in relation to their corresponding business value drivers.

3. Remediate: Have data stewards take the proper actions when alerted to the existence of those issues.
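As a sketch, the three outcomes above can be wired together as a simple pipeline. The record structure, the shape of a validation rule, and the business impact scores here are all hypothetical illustrations, not a prescribed implementation:

```python
from dataclasses import dataclass

@dataclass
class DataIssue:
    description: str
    business_impact: float  # hypothetical score tied to a business value driver

def alert(records, rules):
    """Alert: run validation rules over records and collect flagged issues."""
    issues = []
    for record in records:
        for rule in rules:
            problem = rule(record)  # each rule returns a DataIssue or None
            if problem:
                issues.append(problem)
    return issues

def triage(issues):
    """Triage: rank issues so the highest business impact comes first."""
    return sorted(issues, key=lambda i: i.business_impact, reverse=True)

def remediate(issues, steward_action):
    """Remediate: route each prioritized issue to a data steward's action."""
    for issue in issues:
        steward_action(issue)
```

A steward's action might be as simple as appending the issue to a review queue; the point of the sketch is the control flow from detection to prioritized human action.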

When focused internally, data governance not only enables a degree of control for data created and shared within an organization, it empowers the data stewards to take corrective action, either through communication with the original data owners or by direct data intervention (i.e., correcting bad data) when necessary.

5.2 BIG DATA AND DATA GOVERNANCE

Naturally, concomitant with the desire for measurably high quality information in a big data environment is the inclination to institute big data governance. It is naive, however, to assert that when it comes to big data governance one should adopt the traditional approaches to data quality. Furthermore, one cannot assume that just because vendors, system integrators, and consultants stake their claims over big data by stressing the need for big data quality, the same methods and tools can be used to monitor, review, and correct data streaming into a big data platform.

40 Big Data Analytics

Upon examination, the key characteristics of big data analytics are not universally adaptable to the conventional approaches to data quality and data governance. For example, in a traditional approach to data quality, levels of data usability are measured based on the idea of data quality dimensions, such as:

Accuracy, referring to the degree to which the data values are correct.

Completeness, which specifies the data elements that must have values.

Consistency of related data values across different data instances.

Currency, which looks at the freshness of the data and whether the values are up to date.

Uniqueness, which specifies that each real-world item is represented once and only once within the dataset.

These types of measures are generally intended to validate data using defined rules, catch any errors when the input does not conform to those rules, and correct recognized errors when the situations allow it. This approach typically targets moderately sized datasets from known sources, with structured data and a relatively small set of rules. Operational and analytical applications of limited size can integrate data quality controls, alerts, and corrections, and those corrections will reduce the downstream negative impacts.
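As an illustration of that validate / catch / correct cycle, here is a minimal sketch. The `state` field and both of its rules are hypothetical examples of "defined rules," not part of any particular toolset:

```python
# A minimal sketch of the traditional validate / catch / correct cycle.

def valid_state(value):
    # Validate: expect a two-letter, uppercase state abbreviation.
    return isinstance(value, str) and len(value) == 2 and value.isupper()

def correct_state(value):
    # Correct: normalization is only possible when the error is
    # recognizable, e.g., a padded or lowercase abbreviation.
    cleaned = str(value).strip().upper()
    return cleaned if len(cleaned) == 2 else None

def apply_quality_control(record, validators, correctors):
    errors = []
    for field, validate in validators.items():
        if not validate(record.get(field)):
            fixed = correctors.get(field, lambda v: None)(record.get(field))
            if fixed is not None:
                record[field] = fixed   # correct the recognized error
            else:
                errors.append(field)    # catch: flag for manual remediation
    return record, errors
```

When the corrector cannot repair a value (e.g., a spelled-out state name), the error is caught and surfaced rather than silently fixed, which matches the remediation-oriented stance described earlier.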

5.3 THE DIFFERENCE WITH BIG DATASETS

On the other hand, big datasets neither exhibit these characteristics, nor do they have similar types of business impacts. Big data analytics is generally centered on consuming massive amounts of a combination of structured and unstructured data from both machine-generated and human sources. Much of the analysis is done without considering the business impacts of errors or inconsistencies across the different sources, where the data originated, or how frequently it is acquired.

When the acquired datasets and data streams originate outside the organization, there is little facility for control over the input. The original sources are often so obfuscated that there is little capacity to even know who created the data in the first place, let alone enable any type of oversight over data creation.

Another issue involves the development and execution model for big data applications. Data analysts are prone to develop their own models in their private sandbox environments. In these cases, the developers often bypass traditional IT and data management channels, opening greater possibilities for inconsistencies with sanctioned IT projects. This is complicated further as datasets are tapped into or downloaded directly without IT's intervention.

Consistency (or the lack thereof) is probably the most difficult issue. When datasets are created internally and a downstream user recognizes a potential error, that issue can be communicated to the originating system's owners. The owners then have the opportunity to find the root cause of the problems and then correct the processes that led to the errors.

But with big data systems that absorb massive volumes of data, some of which originates externally, there are limited opportunities to engage process owners to influence modifications or corrections to the source. On the other hand, if you opt to correct the recognized data error, you are introducing an inconsistency with the original source, which at worst can lead to incorrect conclusions and flawed decision making.

5.4 BIG DATA OVERSIGHT: FIVE KEY CONCEPTS

The conclusion is that the standard approach to data governance, in which data policies defined by an internal governance council direct control of the usability of datasets, cannot be universally applied to big data applications. And yet there is definitely a need for some type of oversight that can ensure that the datasets are usable and that the analytic results are trustworthy. One way to address the need for data quality and consistency is to leverage the concept of data policies based on the information quality characteristics that are important to the big data project.


This means considering the intended uses of the results of the analyses and how the inability to exercise any kind of control on the original sources of the information production flow can be mitigated by the users on the consumption side. This approach requires a number of key concepts for data practitioners and business process owners to keep in mind:

managing consumer data expectations; identifying the critical dimensions of data quality; consistency of metadata and reference data for entity extraction; repurposing and reinterpretation of data; and data enrichment and enhancement when possible.

5.4.1 Managing Consumer Data Expectations

There may be a wide variety of users consuming the results of the spectrum of big data analytics applications. Many of these applications use an intersection of available datasets. Analytics applications are supposed to be designed to provide actionable knowledge to create or improve value. The quality of information must be directly related to the ways the business processes are either expected to be improved by better quality data or how ignoring data problems leads to undesired negative impacts, and there may be varied levels of interest in asserting levels of usability and acceptability for acquired datasets by different parties.

This means, for the scope of the different big data analytics projects, you must ascertain these collective user expectations by engaging the different consumers of big data analytics to discuss how quality aspects of the input data might affect the computed results. Some examples include:

datasets that are out of sync from a time perspective (e.g., one dataset refers to today's transactions being compared to pricing data from yesterday);

not having all the datasets available that are necessary to execute the analysis;

not knowing if the data element values that feed the algorithms, taken from different datasets, share the same precision (e.g., sales per minute vs sales per hour);

not knowing if the values assigned to similarly named data attributes truly share the same underlying meaning (e.g., is a customer the person who pays for our products or the person who is entitled to customer support?).
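Two of the concerns above, temporal alignment and precision consistency, lend themselves to simple automated checks. The following sketch assumes hypothetical dataset descriptors that carry an `as_of` reference date and the `unit` in which a shared measure is reported:

```python
from datetime import date

# Hypothetical descriptors for two datasets feeding one analysis,
# echoing the transactions-vs-pricing example above.
transactions = {"as_of": date(2024, 5, 2), "unit": "sales_per_hour"}
pricing = {"as_of": date(2024, 5, 1), "unit": "sales_per_minute"}

def temporally_aligned(datasets, max_lag_days=0):
    """Check whether all reference dates fall within an acceptable lag."""
    dates = [d["as_of"] for d in datasets]
    return (max(dates) - min(dates)).days <= max_lag_days

def precision_consistent(datasets):
    """Check whether every dataset reports the measure at the same granularity."""
    return len({d["unit"] for d in datasets}) == 1
```

Running such checks before the analysis executes turns an implicit consumer expectation into an explicit, monitorable control.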

Engaging the consumers for requirements is a process of discussions with the known end users, coupled with some degree of speculation and anticipation of who the pool of potential end users are, what they might want to do with a dataset, and correspondingly, what their levels of expectation are. Then, it is important to establish how those expectations can be measured and monitored, as well as the realistic remedial actions that can be taken.

5.4.2 Identifying the Critical Dimensions of Data Quality

An important step is to determine the dimensions of data quality that are relevant to the business and then distinguish those that are only measurable from those that are both measurable and controllable. This distinction is important, since you can use the measures to assess usability when you cannot exert control and to make corrections or updates when you do have control. In either case, here are some dimensions for measuring the quality of information used for big data analytics:

Temporal consistency: Measuring the timing characteristics of datasets used in big data analytics to see whether they are aligned from a temporal perspective.

Timeliness: Measuring if the data streams are delivered according to end-consumer expectations.

Currency: Measuring whether the datasets are up to date. Completeness: Measuring that all the data is available. Precision consistency: Assessing if the units of measure associated

with each data source share the same precision and if those units areproperly harmonized if not.

Unique identifiability: Focusing on the ability to uniquely identify entities within datasets and data streams and link those entities to known system-of-record information.

Semantic consistency: This metadata activity may incorporate a glossary of business terms, hierarchies and taxonomies for business concepts, and relationships across concept taxonomies for standardizing the ways that entities identified in structured and unstructured data are tagged in preparation for data use.
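Several of these dimensions can be expressed as dataset-level scores. The sketch below, with hypothetical field names, illustrates completeness, currency, and unique identifiability as fractions between 0 and 1:

```python
from collections import Counter
from datetime import datetime, timedelta

def completeness(records, required_fields):
    """Fraction of records in which every required field has a value."""
    if not records:
        return 1.0
    complete = sum(
        all(r.get(f) not in (None, "") for f in required_fields) for r in records
    )
    return complete / len(records)

def currency(records, timestamp_field, max_age, now=None):
    """Fraction of records whose timestamp is within the allowed age."""
    if not records:
        return 1.0
    now = now or datetime.now()
    fresh = sum(now - r[timestamp_field] <= max_age for r in records)
    return fresh / len(records)

def unique_identifiability(records, key_field):
    """Fraction of records whose key occurs exactly once in the dataset."""
    if not records:
        return 1.0
    counts = Counter(r.get(key_field) for r in records)
    return sum(1 for r in records if counts[r.get(key_field)] == 1) / len(records)
```

For dimensions that are measurable but not controllable, scores like these can feed monitoring dashboards rather than correction routines.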


5.4.3 Consistency of Metadata and Reference Data for Entity Extraction

Big data analytics is often closely coupled with the concept of text analytics, which depends on contextual semantic analysis of streaming text and consequent entity concept identification and extraction. But before you can aspire to this kind of analysis, you need to ground your definitions within clear semantics for commonly used reference data and units of measure, as well as identifying aliases used to refer to the same or similar ideas.

Analyzing relationships and connectivity in text data is key to entity identification in unstructured text. But because of the variety of types of data that span both structured and unstructured sources, one must be aware of the degree to which unstructured text is replete with nuances, variation, and double meanings. There are many examples of this ambiguity, such as references to a car, a minivan, an SUV, a truck, a roadster, as well as the manufacturer's company name, make, or model, all referring to an automobile.

These concepts are embedded in the value within a context, and are manifested as metadata tags, keywords, and categories that are often recognized as the terms that drive how search engine optimization algorithms associate concepts with content. Entity identification and extraction depend on differentiating the words and phrases that carry high levels of meaning (such as person names, business names, locations, or quantities) from those that are used to establish connections and relationships, mostly embedded within the language of the text.

As data volumes expand, there must be some process for definition (and therefore control) over concept variation in source data streams. Introducing conceptual domains and hierarchies can help with semantic consistency, especially when comparing data coming from multiple source data streams.
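A conceptual domain with an alias lookup might look like the following sketch, reusing the automobile example from above; the taxonomy structure and the names in it are illustrative assumptions, not a standard vocabulary:

```python
# A sketch of a conceptual hierarchy for harmonizing entity references
# across source streams. TAXONOMY maps a governing concept to the
# surface terms (aliases) that may stand for it in text.
TAXONOMY = {
    "automobile": {"car", "minivan", "suv", "truck", "roadster"},
}

# Invert the hierarchy into an alias lookup used when tagging tokens.
ALIASES = {
    alias: concept
    for concept, aliases in TAXONOMY.items()
    for alias in aliases
}

def tag_concepts(tokens):
    """Map surface terms in a token stream to their governing concepts."""
    return [(token, ALIASES.get(token.lower())) for token in tokens]
```

Tagging every stream against the same shared taxonomy is what makes entity references from different sources comparable downstream.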

Be aware that context carries meaning; there are different inferences about data concepts and relationships that you can make based on the identification of concept entiti...