The Data Science Laboratory

Let’s look at Data Science in a different way. The Data Science Team, or Data Analytics team as it used to be called, is a natural part of the R&D activity of an organization. The goal of the research it carries out, no matter what data is being examined, is to improve the operation of the organization. Maybe the team can find a way to reduce costs or sell more product or hire better staff; whatever. The point is that it is searching for exploitable data patterns in some area of an organization’s operation.

So think of the Data Science Team as a scientific research team. We should not restrict the data that these researchers may wish to mine, but should instead give them access to as wide a population of data as we can. In the past, data analysts have tended to work on data collected in a data warehouse and drawn from operational systems. They explored the structured internal data of the organization. Such restrictions are no longer sensible, given the wealth of external data and unstructured data that can be added to the mix.

Hadoop, along with its various open source and commercial components, enables work on much greater volumes of data than was previously feasible or economic. The data analysts should go at it. No doubt there is gold buried in some of this data.

The Research Process

Nevertheless, there are attendant data issues that need to be considered and are worth discussing. Think about industrial mining. If a mining company suspects that a particular area contains mineable deposits, a small team is sent to the location to collect samples from the rock, assay the samples and estimate the potential for mining. This can be done at a relatively low cost before making a much bigger investment. Many such potential areas may be examined before a mine is established.

If a much larger pool of data is available to explore, then data analytics will happen in a similar way. Consider, for example, a consumer products company that frequently markets new products and wishes to track the popularity of such products over time. As a data analyst you may be given this task, and you may suspect that there is some merit in gathering all the tweets that mention your products and then analyzing these in conjunction with data about advertising campaigns and mentions on the web.

You sample the data and build some kind of statistical model to investigate the situation. Say that you discover some interesting patterns that are worth further investigation on a larger sample of data, but in the end you find nothing truly significant. The question now is: “What do you do with that data?”
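To make the exploration concrete, here is a minimal sketch of the kind of first-pass analysis described above. Everything in it is hypothetical: the weekly mention counts and ad-spend figures are invented for illustration, and a real analysis would draw on an actual tweet sample and campaign records.

```python
import statistics

# Hypothetical weekly counts of tweets mentioning the product, and
# advertising spend (in $ thousands) for the same weeks. Illustrative only.
mentions = [120, 180, 260, 310, 150, 140]
ad_spend = [10, 20, 35, 40, 12, 11]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(mentions, ad_spend)
print(f"Correlation between mentions and ad spend: {r:.2f}")
```

A strong correlation in a sample like this would justify pulling a larger sample; a weak one might end the investigation, at which point the question of what to do with the sampled data arises.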

That’s what I’m interested in discussing in this blog post.

The Dividing Line

There is a distinct difference between exploratory data analytics – exploring data, building hypotheses and testing them – and the software engineering that follows when you discover a data pattern that the company wants to exploit. The dividing line lies between data exploration and operational implementation. Once some useful analytical algorithm has been implemented, the data involved naturally comes under the normal data governance rules that apply to other analytical systems. The data used has become corporate data, and the rules for the governance of corporate data naturally apply even when some of the data is from an external source.

But what rules apply to the data that is used in the exploration phase?

The Data Governance Situation

The governance of data may be handled well by some organizations, but in most organizations that I’m aware of it is not, partly because governance is difficult to implement. It is easy enough to formulate rules, for example:

There should be an audit trail of all data from its capture at source through to its final archiving or deletion.

A “master data model” must be created and maintained that includes all data used in any way by the company.

Specific data security must be applied in terms of who can view data items, who can copy them, who can change them, and so on.

Making rules is easy; implementing them can be tough, or damn near impossible. This situation has been exacerbated by the advent of Big Data. If governance was difficult to implement on relatively well defined corporate data, it will be still more awkward to impose on Big Data (with all that volume, velocity, variety, etc.). In my opinion it is made more difficult by the deployment of Hadoop, and its common use as a “data reservoir” or, in some cases, a “data swamp.”

Few companies I’ve talked to apply full data governance to such data – preferring to think of it as outside the boundary of data governance until it is brought into action and used in some way. Data volume and variety are disrupting factors for global data governance.

The Data Science Laboratory

So what exactly is the status of data used, perhaps in a cursory way, by the data analytics team in data exploration, i.e. the data used in the “Data Science Laboratory”?

Can we discard that data if nothing useful is found in it?

Can we discard the data samples we used to discover that there were useful patterns in the data?

In fact, can we discard all the data we used in data exploration, as not falling within our definition of corporate data?

Clearly it is a grey area. There is an economic factor here. Such data volumes could be very large, and thus retaining the data, even if only as an archive, may not be affordable. Applying all the rules of data governance to it will be even less affordable. It may simply be a matter of: “can’t pay, won’t pay.”

Nevertheless, ultimately, this is a decision for the Director of the Data Laboratory. In theory he may want to keep the data as evidence for any result, positive or negative, that it led to. Only then will there be a complete audit trail of what was done and whether the conclusions drawn were sound. In theory it should be archived and accessible for review.

It is easy to argue logically for such an approach, but there are costs and they may be weighty. And if we insist on such a soundly governed approach to this data, maybe the compliance police will start asking us about the data audit trails for all the other systems that we run.

About Robin Bloor

Robin Bloor, founder of Bloor Research and co-founder of The Bloor Group, is a well-known and widely published author on all aspects of data management, from SOA to cloud integration to his innovative Information Oriented Architecture. Co-author of a number of Wiley books on data management and cloud computing, Bloor authored a column in DBMS Magazine (in the US) for 4 years and also provided regular content for Computer Weekly in the UK. He has also regularly written individual articles for many magazines in the UK and Europe over a decade. In addition, he’s authored numerous white papers, product reviews and IT research papers.