Friday, 29 August 2014

Big Data and the Enterprise

Businesses that have defined a data strategy know that data is an integral part of the enterprise. There is a slew of enterprise standards for all data to adhere to, irrespective of whether the data is small or big, structured or unstructured, comes from sensors, websites or transactions, is housed in the holiest of data centres with the strictest of controls, or is stored out in the wild, wild west of freshest-on-GitHub software. A successful data strategy not only meets the business needs but also incorporates the enterprise standards in a holistic manner.

With big data at the peak of its hype (see Gartner's latest hype cycle for emerging technologies), many companies have unfortunately jumped eagerly into its adoption without adequate consideration. Of course, the prime motivating factors for considering big data - including those discussed in an earlier post in these pages, Data as a Strategy - are usually in place and are not the subject of this post. What gets glossed over in the excitement over the "new" is the set of enterprise standards and requirements that typically sit in the background but serve the crucial purpose of keeping the house in order. In this post, let us look at the top requirements under the headline of enterprise standards, which apply to big data just as they do to business-critical data. Mature organisations with a thriving data strategy will hopefully not find any of the below surprising. For the rest of this post, I'll assume that the enterprise already has a data-focused implementation in place - in the form of a data warehouse or an all-purpose database, for instance - and also has enterprise standards to follow.

First, the question: "Why should an enterprise care about applying its corporate standards to new data that is not critical to the business?" In customer conversations, clickstream data arises frequently as an example of this new data. Suppose that clickstream data is collected and analysed at large scale with a big data technology in an R&D/labs environment. Incidentally, the reasons why a classic database would not be suited for such analysis are that the data is of unknown value to the business, that it has variable structure, and that popular big data technologies, including Hadoop, allow such data to be stored at a significantly lower cost. In the course of analysis, suppose that insights of significant value about online customer buying behaviour are discovered and the clickstream data consequently becomes critical for the business. Extracting that value repeatedly requires that both the clickstream data and the extraction process be put into production. The new data and the associated processes need to be "operationalised" (to use a coined term), and therefore have to adhere to the standards set by the enterprise. In other words, even if big data exploitation starts off as a project in an R&D/labs setting, once it starts adding value to the business it needs to be turned into an operational/production platform, so that the data hygiene the enterprise already has is extended to encompass this new data. In a later post, we'll see how to make this transition from labs to production effectively.

Let us get back to the enterprise standards and their requirements of big data.

Topmost amongst the requirements is governance around this new data. Irrespective of the nature of the data, data governance is a critical requirement in all mature enterprises, since a lack of (adequate amounts of) governance runs the risk of breaking businesses. Ungoverned data, when analysed, can produce dubious results, leading to questionable business decisions that, in the worst case, can be catastrophic. Most companies that start down the road of big data without due consideration get this critical requirement awfully wrong. From our experience, one reason for this seems to be the misconception that governance processes require time and effort and introduce latencies, and that the time is better spent on the more "cool" activity of data science. Unless the big data project is intended to remain in the labs environment forever, this is seriously faulty thinking. By the time a labs experiment is well and truly on its way, the governance cat (if I may) is already out of the bag. There is already some unknown amount of data duplication that has happened internally (and, heaven forbid, externally too), some unrecorded number of unauthorised accesses (e.g. data scientist to outside-of-work friend: "see this awesome analysis I did on average spends by HNIs"), and a potpourri of ad-hoc scripts and execution logs that serve as the only documentation of how the data got in.

Security and the related topics of data audit and access audit are equally critical requirements for big data. Like governance, security demands a clear plan in place even before the first access to data happens. Otherwise, the risks to the enterprise are too great, especially in the world of big data, where there could be more to the data than meets the eye. Access audits demand a record of every interaction of every user with the data, along with the steps followed before and after the data access. In most industries and countries, access audits are mandatory for legal compliance. Data audits, on the other hand, are required for compliance in some industries, such as finance; in other industries, though not legally required, they are strongly recommended in order to maintain good hygiene. A data audit is the record of how a particular piece of data came to be: it captures all the steps in the data processing logic that produced it, alongside the prior representations of that data, tracing all the way back to the source. In other words, a data audit requires that a lineage be available for each and every portion of the data.
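To make the distinction between the two kinds of audit concrete, here is a minimal in-memory sketch in Python. It is only an illustration under stated assumptions: the class and method names (AuditLog, record_step, record_access, trace) and the dataset names are hypothetical, and a real platform would keep these records in durable, tamper-evident storage rather than in a Python object.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageRecord:
    """Data audit entry: the step that produced a dataset and its direct inputs."""
    dataset: str
    step: str
    inputs: tuple = ()  # an empty tuple marks a source dataset


@dataclass(frozen=True)
class AccessRecord:
    """Access audit entry: who did what to which dataset, and when."""
    user: str
    dataset: str
    action: str
    at: datetime


class AuditLog:
    """Hypothetical, in-memory stand-in for an enterprise audit store."""

    def __init__(self):
        self._lineage = {}
        self._accesses = []

    def record_step(self, dataset, step, inputs=()):
        self._lineage[dataset] = LineageRecord(dataset, step, tuple(inputs))

    def record_access(self, user, dataset, action):
        stamp = datetime.now(timezone.utc)
        self._accesses.append(AccessRecord(user, dataset, action, stamp))

    def trace(self, dataset):
        """Return lineage records ordered from the sources down to `dataset`.

        Raises KeyError if any dataset on the path has no recorded lineage,
        which is exactly the gap a data audit is meant to expose.
        """
        ordered, seen = [], set()

        def visit(name):
            if name in seen:
                return
            seen.add(name)
            record = self._lineage[name]
            for parent in record.inputs:
                visit(parent)
            ordered.append(record)

        visit(dataset)
        return ordered

    def accesses_for(self, dataset):
        return [a for a in self._accesses if a.dataset == dataset]


# Usage with made-up clickstream datasets.
log = AuditLog()
log.record_step("raw_clicks", "ingest from web servers")
log.record_step("sessions", "group clicks into sessions", ["raw_clicks"])
log.record_step("buying_behaviour", "aggregate per customer", ["sessions"])
log.record_access("analyst_a", "buying_behaviour", "read")

for record in log.trace("buying_behaviour"):
    print(record.dataset, "<-", record.step)
```

The point of the sketch is that lineage is recorded at the moment each processing step runs and access is recorded at the moment it happens; neither can be reliably reconstructed afterwards from ad-hoc scripts and execution logs.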

The last, but by no means the least, of the critical requirements for big data is the integration of the new data, and of the big data technologies in use, into the existing data ecosystem of the enterprise. Technology integration into the ecosystem involves making sure that the existing interfaces are supported, that the upstream and downstream tools are tested for compatibility and seamless execution with the new big data technologies, that applications on this new data can be implemented using existing tools, and that management and administration happen in the same way for all technologies. Preferably, none of these tasks would require the procurement of yet more new technologies. Data integration into the existing ecosystem refers to rationalisation with existing metadata repositories, quality control, and the creation of new metadata repositories as required. Note that this requirement, coupled with the previous ones, implies that aspects like traceability need to be designed and implemented in a holistic manner that includes, but is not limited to, the big data technologies.

The phrase "operational big data platform", bringing together the two apparently conflicting terms "operational" and "big data", would probably have elicited a few smirks some time ago, but that is no longer the case. The enterprise should carefully plan and orchestrate big data projects with the same emphasis on standards relating to governance, security and ecosystem integration that it has in place for its mission-critical data, preparing for the eventuality that, with the right use of big data and data science, its new data becomes critical to the continued success of the business.