Tag: Datalake

In 2018 we are rapidly entering what I would like to call ‘Big Data 3.0’. This is the age of ‘Converged Big Data’, where its various complementary technologies – Data Science, DevOps, and Business Automation – begin to come together to solve complex industry challenges in areas as diverse as Manufacturing, Insurance, IoT, Smart Cities and Banking.

(Image Credit – Simplilearn)

First, we had Big Data 1.0…

In the first pass of the Big Data era, Hadoop was the low-cost storage solution. Companies saved tens of billions of dollars from costly and inflexible enterprise data warehouse (EDW) projects. Nearly every large organization has begun deploying Hadoop as an Enterprise Landing Zone (ELZ) to augment an EDW. The early corporate movers working with the leading vendors more or less figured out the kinks in the technology as applied to their business challenges.

Trend #1 Big Data 3.0 – where Data fuels Digital Transformation…

Fortune 5000 companies process large amounts of customer information daily. This is especially true in areas touched by IoT – power and utilities, manufacturing and the connected car. However, they have been sorely lacking in their capacity to interpret this information in a form that is meaningful to their customers and their business. In areas such as Banking & Insurance, this capacity can greatly help arrive at a real-time understanding of not just the risks posed by a customer/partner relationship (from a credit risk/AML standpoint) but also of ways to increase the returns per client relationship. Digital Transformation can only be fueled by data assets. In 2018, more companies will tie these trends together, moving projects from POC to production.

I have written extensively about efforts to infuse business processes with machine learning. Predictive analytics have typically resembled a line of business project or initiative. The benefits of the learning from localized application initiatives are largely lost to the larger organization if one doesn’t allow multiple applications and business initiatives to access the models built. In 2018, machine learning expands across more use cases, from the mundane (fraud detection, customer churn prediction, customer journey) to the new age (virtual reality, conversational interfaces, chatbots, customer behavior analysis, video/facial recognition). Demand for data scientists will increase.

In areas around Industrie 4.0, Oil & Energy, Utilities – billions of endpoints will send data over to edge nodes and backend AI services which will lead to better business planning, real-time decisions and a higher degree of automation & efficiency across a range of processes. The underpinning data capability around these will be a Data Lake.

This is an area both Big Data and AI have begun to influence in a huge way. 2018 will be the year in which every large and medium-sized company will have an AI strategy built on Big Data techniques. Companies will begin exposing their AI models over the cloud through APIs, using a Models as a Service architecture.

Infrastructure vendors have been aiming to first augment and then replace EDW systems. As projects that provide SQL-on-Hadoop, data governance and audit mature, Hadoop will slowly begin replacing the EDW footprint. The key capabilities that Data Lakes usually lack from an EDW standpoint – around OLAP and performance reporting – will be augmented by niche technology partners. While this is a change that will easily take years, 2018 is when it begins. Expect migrations where clients have not really been using the full power of EDWs beyond simple relational schemas, log data etc. to be the first candidates for this migration.

Trend #4 Cybersecurity pivots into Big Data…

Big Data is now the standard by which forward-looking companies will perform their Cybersecurity and threat modeling. Let us take an example to understand what this means from an industry standpoint. For instance, in Banking, in addition to general network level security, we can categorize business level security considerations into four specific buckets – general fraud, credit card fraud, AML compliance, and cybersecurity. The current best practice in the banking industry is to encourage a certain amount of convergence in the back-end data silos/infrastructure across all of the fraud types – literally in the tens. Forward-looking enterprises are now building cybersecurity data lakes to aggregate & consolidate all digital banking information, wire data, payment data, credit card swipes, other telemetry data (ATM & POS) etc in one place to do security analytics. This pivot to a Data Lake & Big Data can pay off in a big way.

The reason this convergence is helpful is that across all of these different fraud types, the common thread is that the fraud is increasingly digital (or internet based) and the fraudster rings are becoming more sophisticated every day. To detect these infinitesimally small patterns, an analytic approach beyond the existing rules-based approach is key. It needs to understand, for instance, location-based patterns in terms of where transactions took place, Social Graph-based patterns, and patterns which commingle real-time & historical data to derive insights. This capability is only possible via a Big Data-enabled stack.
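As a rough illustration of a location-based pattern that commingles real-time and historical data, the sketch below compares an incoming transaction against the places where a customer has transacted before. The function names, the record layout and the 500 km threshold are all assumptions made for this example; a real deployment would tune these and combine many more signals.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def is_location_anomaly(history, txn, max_km=500.0):
    """Flag a live transaction that is far from every historical location.

    history: list of (lat, lon) pairs from past transactions (historical data).
    txn: dict with "lat" and "lon" keys (the real-time event).
    max_km: hypothetical threshold, chosen only for illustration.
    """
    if not history:
        # No baseline yet; a real system needs its own policy for new customers.
        return False
    nearest = min(haversine_km(lat, lon, txn["lat"], txn["lon"])
                  for lat, lon in history)
    return nearest > max_km


# Illustrative data: a customer who normally transacts around New York.
HISTORY = [(40.71, -74.01), (40.73, -73.99)]
print(is_location_anomaly(HISTORY, {"lat": 40.74, "lon": -74.17}))  # nearby swipe
print(is_location_anomaly(HISTORY, {"lat": 51.51, "lon": -0.13}))   # swipe in London
```

A rule like this is only one input; the point is that the check needs both the streaming event and the customer's history in the same place, which is what the data lake provides.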

Trend #5 Regulators Demand Big Data – PSD2, GDPR et al…

The common thread across a wide range of business processes in verticals such as Banking, Insurance, and Retail is the fact that they are regulated by a national or supranational authority. In Banking, across the front, mid and back office, processes ranging from risk data aggregation/reporting, customer onboarding, loan approvals, financial crimes compliance (AML, KYC, CRS & FATCA), enterprise financial reporting & Cyber Security etc. all need to produce verifiable, high fidelity and auditable reports. Regulators have woken up to the fact that all of these areas can benefit from universal access to accurate, cleansed and well-governed cross-organization data from a range of Book Of Record systems.

Further, by applying data processing techniques such as in-memory processing, the process of scenario analysis, computing, & reporting on this data (reg reports/risk scorecards/dashboards etc.) can be vastly enhanced. These processes can be made more real-time, using data about market movements to understand granular risk concentrations. Finally, model management techniques can be clearly defined and standardized across a large organization. RegTechs, or startups focused on the risk and compliance space, are already leveraging these techniques across a host of the areas identified above.

Trend #6 Data Monetization begins to take off…

The simplest and easiest way to monetize data is to begin collecting disparate data generated during the course of regular operations. An example in Retail Banking is to collect data on customer branch visits, online banking usage logs, clickstreams etc. Once collected, the newer data needs to be fused with existing Book of Record Transaction (BORT) data to then obtain added intelligence on branch utilization, branch design & optimization, customer service improvements etc. It is very important to ensure that the right business metrics are agreed upon and tracked across the monetization journey. Expect Data Monetization projects to take off in 2018, with verticals like Telecom, Banking, and Insurance taking the lead on these initiatives.
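To make the "fuse newly collected data with BORT data" step concrete, here is a minimal sketch that joins branch visit logs with book-of-record transactions to derive a simple utilization metric. The field names, the sample records and the `txns_per_visit` metric are hypothetical and exist only for this illustration.

```python
from collections import defaultdict


def branch_utilization(visits, transactions):
    """Combine visit logs with book-of-record transactions, per branch."""
    stats = defaultdict(lambda: {"visits": 0, "transactions": 0})
    for visit in visits:
        stats[visit["branch"]]["visits"] += 1
    for txn in transactions:
        stats[txn["branch"]]["transactions"] += 1
    result = {}
    for branch, s in stats.items():
        # Transactions per visit; undefined when no visits were logged.
        ratio = s["transactions"] / s["visits"] if s["visits"] else None
        result[branch] = {**s, "txns_per_visit": ratio}
    return result


# Newly collected operational data (branch visits) ...
VISITS = [{"branch": "B1"}, {"branch": "B1"}, {"branch": "B2"}]
# ... fused with existing book-of-record transactions.
TRANSACTIONS = [{"branch": "B1", "amount": 250.0},
                {"branch": "B3", "amount": 40.0}]

for branch, metrics in sorted(branch_utilization(VISITS, TRANSACTIONS).items()):
    print(branch, metrics)
```

A branch with transactions but no logged visits (B3 here) is itself a useful signal: it usually points at a gap in data collection rather than at real customer behavior.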

Most Cloud Native Architectures are designed in response to Digital Business initiatives – where it is important to personalize and to track minute customer interactions. The main components of a Cloud Native Platform are shown below and the vast majority of these leverage a microservices based design. Given all this, it is important to note that a Big Data stack based on Hadoop (Gen 2) is not just a data processing platform. It has multiple personas – a real-time, streaming data, interactive platform that can perform any kind of data processing (batch, analytical, in memory & graph based) while providing search, messaging & governance capabilities. Thus, Hadoop provides not just massive data storage capabilities but also provides multiple frameworks to process the data resulting in response times of milliseconds with the utmost reliability whether that be real-time data or historical processing of backend data. My bet on 2018 is that these capabilities will increasingly be harnessed as part of a DevOps process to develop a microservices based deployment.

Conclusion…

Big Data will continue to expand exponentially across global businesses in 2018. As with most disruptive innovation, it will also create layers of complexity and opportunity for Enterprise IT. Whatever be the kind of business model – tracking user behavior or location sensitive pricing or business process automation etc – the end goal of IT architecture should be to create enterprise business applications that are heavily data insight and analytics-driven.

This is the third in a series of blogs on Data Science that I am jointly authoring with Maleeha Qazi, (https://www.linkedin.com/in/maleehaqazi/). We have previously covered some of the inefficiencies that result from a siloed data science process @ http://www.vamsitalkstech.com/?p=5046 & the ideal way Data Scientists would like their models deployed for the maximal benefit and use – as a Service @ http://www.vamsitalkstech.com/?p=5321. As the name of this third blog post suggests, the success of a data science initiative depends on data. If the data going into the process is “bad” then the results cannot be relied upon. Our goal is to also suggest some practical steps that enterprises can take from a data quality & governance process standpoint.

“However, under the strong influence of the current AI hype, people try to plug in data that’s dirty & full of gaps, that spans years while changing in format and meaning, that’s not understood yet, that’s structured in ways that don’t make sense, and expect those tools to magically handle it.” – Monica Rogati (Data Science Advisor and ex-VP Jawbone – 2017) [1]

Image Credit – The Daily Omnivore

Introduction

Different posts in this blog have discussed Data Science and other Analytical approaches to some degree of depth. What is apparent is that whatever the kind of analytics – descriptive, predictive, or prescriptive – the availability of a wide range of quality data sources is key. However, along with volume and variety of data, the veracity, or the truth, in the data is as important. This blog post discusses the main factors that determine the quality of data from a Data Scientist’s perspective.

The Top Issues of Data Quality

As highlighted in the above illustration, the top quality issues that data assets typically face are the following:

Incomplete Data: The data provided for analysis should span the entire cross-section of known data about how the organization views its customers and products. This would include data generated from various applications that belong to the business, and external data bought from various vendors to enrich the knowledge base. The completeness criterion measures whether all of the information about the entities under consideration is available and usable.

Inconsistent & Inaccurate Data: Consistency measures which data values give conflicting information and must be fixed. It also measures whether all the data elements conform to specific and uniform formats and are stored in a consistent manner. Inaccurate data has duplicate, missing or erroneous values. It also does not reflect an accurate picture of the state of the business at the point in time it was pulled.

Lack of Data Lineage & Auditability: The data framework needs to support auditability, i.e., provide an audit trail of how the data values were derived from source to analysis point, including the various transformations performed to arrive at the data set being considered for analysis.

Lack of Contextuality: Data needs to be accompanied by meaningful metadata – data that describes the concepts within the dataset.

Temporally Inconsistent: This measures if the data was temporally consistent and meaningful given the time it was recorded.
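Several of the issues listed above can be expressed as simple, measurable checks. The following is a minimal sketch rather than a production framework; the record layout (`customer_id`, `email`, `country`, `opened_on`, `closed_on`), the list of valid country codes and the function name `quality_report` are assumptions made purely for illustration. Lineage and contextuality are not covered here, since they depend on metadata rather than on the values themselves.

```python
from datetime import date

# Hypothetical customer records; the field names are illustrative only.
RECORDS = [
    {"customer_id": "C1", "email": "a@example.com", "country": "US",
     "opened_on": date(2015, 3, 1), "closed_on": None},
    {"customer_id": "C2", "email": None, "country": "usa",
     "opened_on": date(2016, 7, 9), "closed_on": date(2016, 1, 1)},
    {"customer_id": "C1", "email": "a@example.com", "country": "US",
     "opened_on": date(2015, 3, 1), "closed_on": None},
]

REQUIRED_FIELDS = ("customer_id", "email", "country", "opened_on")
VALID_COUNTRIES = {"US", "GB", "IN"}  # assumed reference list


def quality_report(records):
    """Count the records that fail each data quality check."""
    seen_ids = set()
    report = {"incomplete": 0, "inconsistent": 0,
              "duplicate": 0, "temporally_inconsistent": 0}
    for rec in records:
        # Completeness: every required field must be present and non-empty.
        if any(rec.get(field) in (None, "") for field in REQUIRED_FIELDS):
            report["incomplete"] += 1
        # Consistency: values must conform to one uniform format.
        if rec.get("country") not in VALID_COUNTRIES:
            report["inconsistent"] += 1
        # Accuracy: the same entity should not appear more than once.
        if rec.get("customer_id") in seen_ids:
            report["duplicate"] += 1
        seen_ids.add(rec.get("customer_id"))
        # Temporal consistency: an account cannot close before it opens.
        opened, closed = rec.get("opened_on"), rec.get("closed_on")
        if opened and closed and closed < opened:
            report["temporally_inconsistent"] += 1
    return report


print(quality_report(RECORDS))
```

On the sample data each check fires exactly once: the second record has a missing email, a non-standard country code and a close date before its open date, and the third record repeats the first.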

What Business Challenges does Poor Data Quality Cause…

Image Credit – DataMartist

Poor data quality causes the following business challenges in enterprises:

Customer dissatisfaction: Across industries like Banking, Insurance, Telecom & Manufacturing, the ability to get a unified view of the customer & their journey is at the heart of the enterprise’s ability to promote relevant offerings & detect customer dissatisfaction. Currently, most industry players are woeful at putting together this comprehensive Single View of their Customers (SVC). Due to operational silos, each department possesses its own siloed & limited view of the customer across multiple channels. These views are typically inconsistent, lack synchronization with other departments, & miss a high amount of potential cross-sell and upsell opportunities. This is a data quality challenge at its core.

Lost revenue: The Customer Journey problem has been an age-old issue which has gotten exponentially more complicated over the last five years as the staggering rise of mobile technology and the Internet of Things (IoT) have vastly increased the number of enterprise touch points that customers are exposed to in terms of being able to discover and purchase new products/services. In an OmniChannel world, an increasing number of transactions are being conducted online. In verticals like the Retail industry and Banking & Insurance industries, the number of online transactions conducted approaches an average of 40%. Adding to the problem, more and more consumers are posting product reviews and feedback online. Companies thus need to react in real-time to piece together the source of consumer dissatisfaction.

Time and cost in data reconciliation: Every large enterprise nowadays runs expensive data re-engineering projects due to their data quality challenges. These are an inevitable first step in other digital projects which cause huge cost and time overheads.

Increased time to market for key projects: Poor data quality causes poor data agility, which increases the time to market for key projects.

Poor data means suboptimal analytics: Poor data quality causes the analytics done using it to be suboptimal – algorithms will end up giving wrong conclusions because the input provided to them is incorrect at best & inconsistent at worst.

Why is Data Quality a Challenge in Enterprises

Image Credit – DataMartist

The top reasons why data quality has been a huge challenge in the industry are:

Prioritization conflicts: For most enterprises, the focus of their business is the product(s)/service(s) being provided; book-keeping is a mandatory but secondary concern. And since keeping the business running is the most important priority, keeping the books accurate for financial matters is the only data aspect that gets the technical attention it deserves. Other data aspects are usually ignored.

Organic growth of systems: Most enterprises have gone through a series of book-keeping methods and applications, most of which have no compatibility with one another. Warehousing data from various systems as they are deprecated, merging in data streams from new systems, and fixing data issues as these processes happen is not prioritized till something on the business end fundamentally breaks. Band-aids are usually cheaper and easier to apply than to try and think ahead to what the business will need in the future, build it, and back-fill it with all the previous systems’ data in an organized fashion.

Lack of time/energy/resources: Nobody has infinite time, energy, or resources. Doing the work of making all the systems an enterprise chooses to use at any point in time talk to one another, share information between applications, and keep a single consistent view of the business is a near-impossible task. Many well-trained resources, and much time & energy, are required to make sure this can be set up and successfully orchestrated on a daily basis. But how much is a business willing to pay for this? Most do not see short-term ROI and hence lose sight of the long-term problems that could be caused by ignoring the quality of data collected.

What do you want to optimize?: There are only so many balls an enterprise can have up in the air to focus on without dropping one, and prioritizing those can be a challenge. Do you want to optimize the performance of the applications that need to use, gather and update the data, OR do you want to make sure data accuracy/consistency (one consistent view of the data for all applications in near real-time) is maintained regardless? One will have to suffer for the other.

How to Tackle Data Quality

Image Credit – DataMartist

With the advent of Big Data and the need to derive value from ever increasing volumes and a variety of data, data quality becomes an important strategic capability. While every enterprise is different, certain common themes emerge as we consider the quality of data:

The sheer number of transaction systems found in a large enterprise causes multiple challenges across the data quality dimensions. Organizations need to have valid frameworks and governance models to ensure the data’s quality.

Data quality has typically been thought of as just data cleansing and fixing missing fields. However, it is very important to address the originating business processes that cause this data to take multiple dimensions of truth. For example, centralize customer onboarding in one system across channels rather than having every system do its own onboarding.

It is clear from the above that data quality and its management is not a one-time or siloed application exercise. As part of a structured governance process, it is very important to adopt data profiling and other capabilities to ensure high-quality data.
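Data profiling, in its simplest form, means summarizing each column of a data set so that gaps and oddities become visible before any analysis is run. The sketch below is one hedged way to do that with only the Python standard library; the `profile` function and the sample rows are hypothetical, and dedicated profiling tools report far more than these three statistics.

```python
from collections import Counter


def profile(rows):
    """Return the null rate, distinct count and most common value per column."""
    columns = sorted({key for row in rows for key in row})
    total = len(rows)
    summary = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and empty strings as missing values.
        present = [v for v in values if v not in (None, "")]
        counts = Counter(present)
        summary[col] = {
            "null_rate": (total - len(present)) / total if total else 0.0,
            "distinct": len(counts),
            "most_common": counts.most_common(1)[0][0] if counts else None,
        }
    return summary


# Illustrative rows; the column names are assumptions.
ROWS = [
    {"branch": "NYC-01", "channel": "mobile", "amount": 120.0},
    {"branch": "NYC-01", "channel": "", "amount": 80.5},
    {"branch": "CHI-07", "channel": "mobile", "amount": None},
    {"branch": "NYC-01", "channel": "web", "amount": 80.5},
]

for column, stats in profile(ROWS).items():
    print(column, stats)
```

Run periodically as part of the governance process, a profile like this gives an early warning when a feed starts arriving with more missing values or unexpected new categories.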

Conclusion

Enterprises need to define both quantitative and qualitative metrics to ensure that data quality goals are captured across the organization. Once this is done, an iterative process needs to be followed to ensure that a set of capabilities dealing with data governance, auditing, profiling, and cleansing is applied to continuously ensure that data is brought up to, and kept at, a high standard. Doing so can have salubrious effects on customer satisfaction, product growth, and regulatory compliance.

“Any enterprise CEO really ought to be able to ask a question that involves connecting data across the organization, be able to run a company effectively, and especially to be able to respond to unexpected events. Most organizations are missing this ability to connect all the data together.” – Tim Berners-Lee (English computer scientist, best known as the inventor of the World Wide Web)

Image Credit – Device42

We have discussed vertical industry business challenges across sectors like Banking, Insurance, Retail and Manufacturing in some level of detail over the last two years. Though enterprise business models vary depending on the industry, there is a common Digital theme raging across all industries in 2017. Every industry is witnessing an upswing in the numbers of younger and digitally aware customers. Estimates of this influential population are as high as 40% in areas such as Banking and Telecommunications. They represent a tremendous source of revenue but can also defect just as easily if the services offered aren’t compelling or easy to use – as the below illustration of the Banking industry shows.

These customers are Digital Natives, i.e., they are highly comfortable with technology and use services such as Google, Facebook, Uber, Netflix, Amazon etc. almost hourly in their daily lives. As a consequence, they expect a similar seamless & contextual experience while engaging with Banks, Telcos, Retailers and Insurance companies over (primarily) digital channels. Enterprises then have a twofold challenge – to store all this data as well as harness it for real-time insights in a way that is connected with internal marketing & sales.

As many studies have shown, companies that constantly harness data about their customers and perform speedy advanced analytics outshine their competition. Does that seem a bombastic statement? Not when you consider that almost half of all online dollars spent in the United States in 2016 were spent on Amazon and almost all digital advertising revenue growth in 2016 was accounted for by two biggies – Google and Facebook. [1]

According to The Economist, the world’s most valuable commodity is no longer Oil, but Data. The few large companies depicted in the picture are now virtual monopolies[2] (Image Credit – David Parkins)

Let us now return to the average Enterprise. The vast majority of industrial applications (numbering around an average of 1000+ applications at large enterprises according to research firm NetSkope) generally lag the innovation cycle. This is because they’re created using archaic technology platforms by teams that conform to rigid development practices. The Fab Four (Facebook, Amazon, Google, Netflix) and others have shown that Enterprise Architecture is a business differentiator, but the Fortune 500 have not gotten that message as yet. Hence they largely predicate their software development on vendor-provided technology instead of open approaches. This anti-pattern is further exacerbated by legacy organizational structures, which ultimately leads to these applications holding a very parochial view of customer data. These applications can typically be classified in one of the following buckets – ERP, Billing Systems, Payment Processors, Core Banking Systems, Service Management Systems, General Ledger, Accounting Systems, CRM, Corporate Email, Salesforce, Customer On-boarding etc.

These enterprise applications are then typically managed by disparate IT groups scattered across the globe. They often serve different stakeholders who seem to have broad overlapping interests but have conflicting organizational priorities for various reasons. These applications then produce data in silos – localized by geography, department, line of business, or channel.

Organizational barriers only serve to impede data sharing for various reasons – ranging from competitive dynamics around who owns the customer relationship, regulatory reasons to internal politics etc. You get the idea, it is all a giant mishmash.

Before we get any further, we need to define that dreaded word – Silo.

What Is a Silo?

A mind-set present in some companies when certain departments or sectors do not wish to share information with others in the same company. This type of mentality will reduce the efficiency of the overall operation, reduce morale, and may contribute to the demise of a productive company culture. (Source- Business Dictionary -[2])

Data is the Core Asset in Every Industry Vertical but most of it is siloed in Departments, Lines of Business across Geographies..

Let us be clear, most Industries do not suffer from a shortage of data assets. Consider a few of the major industry verticals and a smattering of the kinds of data that players in these areas commonly possess –

DATA IN HEALTHCARE–

DATA IN MANUFACTURING–

Supply chain data

Demand data

Pricing data

Operational data from the shop floor

Sensor & telemetry data

Sales campaign data

The typical flow of data in an enterprise follows a familiar path –

Data is captured in large quantities as a result of business operations (customer orders, e-commerce transactions, supply chain activities, Partner integration, Clinical notes et al). These feeds are captured using a combination of techniques – mostly ESB (Enterprise Service Bus) and Message Brokers.

The raw data streams then flow into respective application-owned silos where over time a great amount of data movement (via copying, replication and transformation operations – the dreaded ETL) occurs using proprietary vendor-developed systems. Vendors in this space have not only developed shrink-wrapped products that make them tens of billions of dollars annually but have also imposed massive human capital requirements on enterprises to program & maintain these data flows.

Once all of the relevant data has been normalized, transformed and then processed, it is then copied over into business reporting systems where it is used to perform a range of functions – typically for reporting for use cases such as Customer Analytics, Risk Reporting, Business Reporting, Operational improvements etc.

Rinse and repeat..

Due to this old school methodology of working with customer and operational data, most organizations have no real-time data processing capabilities in place & they thus live in a largely reactive world. What that means is that their view of a given customer’s world is typically a week to 10 days old.

Another factor to consider is that the data sources described above are what can be described as structured data or traditional data. However, organizations are now on-boarding large volumes of unstructured data as has been captured in the below blogpost. Oftentimes, it is easier for Business Analysts, Data Scientists and Data Architects to get access to external data faster than internal data.

Getting access to internal data typically means jumping through multiple hoops, from which department is paying for the feeds, to the format of the feeds, regulatory issues, cyber security policy approvals, SOX/PCI compliance et al. The list is long and impedes the ability of business to get things done quickly.

Data and Technical Debt…

Since Ward Cunningham coined the term ‘Technical Debt’, it has typically been used in an IT – DevOps – Containers – Data Center context. However, technology areas like DevOps, PaaS, Cloud Computing with IaaS, Application Middleware, Data centers etc. in and of themselves add no direct economic value to customers unless they are able to intelligently process Data. Data is the most important technology asset compared to other IT infrastructure considerations. You do not have to take my word for that. It so happens that The Economist just published an article where they discuss the fact that the likes of Google, Facebook, Amazon et al are now virtual data monopolies and that global corporations are way behind in the competitive race to own Data [1].

Thus, it is ironic that while the majority of traditional Fortune 500 companies are still stuck in silos, Silicon Valley companies are not just fast becoming the biggest owners of global data but are also monetizing it on the way to record profits. Alphabet (Google’s corporate parent), Amazon, Apple, Facebook and Microsoft are the five most valuable listed firms in the world. Case in point – their combined profits were around $25bn in the first quarter of 2017 and together they make up more than half the value of the NASDAQ composite index. [1]

The Five Business Challenges that Data Fragmentation causes (or) Death by Silo …

How intelligently a company harnesses its data assets determines its overall competitive position. This truth is being evidenced in sectors like Banking and Retail as we have seen in previous posts.

What is interesting is that in some countries which are concerned about the pace of technological innovation, national regulatory authorities are creating legislation to force slow-moving incumbent corporations to unlock their data assets. For example, in the European Union, as a result of regulatory mandates – the PSD2 & Open Bank Standard – a range of agile players across the value chain (e.g., FinTechs) will soon be able to obtain seamless access to a variety of retail bank customer data using standard & secure APIs.

Once obtained, the data can be reimagined by these companies in manifold ways to offer new products & services that the banks themselves cannot. A simple use case is that they can provide personal finance planning platforms (PFMs) that help consumers make better personal financial decisions, at the expense of the Banks owning the data. Certainly, FinTechs have generally been able to make more productive use of client data than have banks. They do this by providing clients with intuitive access to cross-asset data, tailoring algorithms based on behavioral characteristics and by providing clients with a more engaging and unified experience.

Why cannot the slow moving established Banks do this? They suffer from a lack of data agility due to the silos that have been built up over years of operations and acquisitions. None of these are challenges for the FinTechs which can build off of a greenfield technology environment.

To recap, let us consider the five ways in which Data Fragmentation hurts enterprises –

#1 Data Silos Cause Missed Top line Sales Growth –

Data produced by disparate applications, which use scattered silos to store it, causes challenges in enabling a Single View of a customer across channels, products and lines of business. This then makes everything across the customer lifecycle a pain – ranging from smooth on-boarding, to customer service, to marketing analytics. Thus, it impedes the ability to segment customers intelligently and to perform cross-sell & up-sell. This sheer inability to understand customer journeys (across different target personas) also leads to customer retention issues. When underlying data sources are fragmented, communication between business teams moves over to other internal mechanisms such as email, chat and phone calls. This is a recipe for delayed business decisions which are ultimately ineffective as they depend more on intuition than on data.

#2 Data Silos are the Root Cause of Poor Customer Service –

Across industries like Banking, Insurance, Telecom & Manufacturing, the ability to get a unified view of the customer & their journey is at the heart of the enterprise’s ability to understand its customers’ preferences & needs. This is also crucial in promoting relevant offerings and in detecting customer dissatisfaction. Currently most enterprises are woefully inadequate at putting together this comprehensive Single View of their Customers (SVC). Due to operational silos, each department possesses a siloed & limited view of the customer across other silos (or channels). These views are typically inconsistent in and of themselves as they lack synchronization with other departments. The net result is that companies typically miss a high number of potential cross-sell and up-sell opportunities.
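To illustrate what assembling a Single View of the Customer involves at the data level, here is a small sketch that merges records from two departmental silos on a shared customer identifier and records where the silos disagree. The silo names, fields and the `build_single_view` helper are hypothetical; real implementations also have to match customers who do not share a clean common key.

```python
def build_single_view(*silos):
    """Merge per-department records into one record per customer_id.

    The first non-empty value seen for a field wins; differing values are
    collected in `conflicts` for review instead of being silently overwritten.
    """
    view = {}
    conflicts = {}
    for silo in silos:
        for rec in silo:
            cid = rec["customer_id"]
            merged = view.setdefault(cid, {"customer_id": cid})
            for field, value in rec.items():
                if value in (None, ""):
                    continue  # nothing to contribute for this field
                if field not in merged:
                    merged[field] = value
                elif merged[field] != value:
                    seen = conflicts.setdefault(cid, {}).setdefault(
                        field, {merged[field]})
                    seen.add(value)
    return view, conflicts


# Two illustrative silos holding partial, partly inconsistent customer data.
CRM = [{"customer_id": "C1", "name": "Ana Diaz", "email": "ana@example.com"}]
BILLING = [
    {"customer_id": "C1", "email": "adiaz@example.com", "plan": "gold"},
    {"customer_id": "C2", "plan": "basic"},
]

single_view, conflicts = build_single_view(CRM, BILLING)
print(single_view["C1"])
print(conflicts)
```

Keeping the conflicts explicit matters: the inconsistent views described above do not disappear when data is centralized, they just become visible and can then be resolved by a governance process.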

#3 – Data Silos produce Inaccurate Analytics –

First off, most Analysts need to wait a long time to acquire the relevant data they need to test their hypotheses. And since the data they work on is of poor quality as a result of fragmentation, so are the analytics that operate on that data.

Let us take an example from Banking. Mortgage Lending, an already complex business process, has been made even more so by the data silos built around Core Banking, Loan Portfolio, and Consumer Lending applications. Qualifying borrowers for Mortgages needs to be based on not just historical data that is used as part of the origination & underwriting process (credit reports, employment & income history etc.) but also data that was not mined hitherto (social media data, financial purchasing patterns). It is a well-known fact that there are huge segments of the population (especially the millennials) who are broadly eligible but under-banked as they do not satisfy some of the classical business rules needed to obtain approvals on mortgages. Each of the silos stores partial customer data. Thus, Banks do not possess an accurate and holistic picture of a customer’s financial status and are unable to qualify the customer for a mortgage in quick time with the best available custom rate.

#4 – Data Silos hinder the creation of new Business Models –

The abundance of data created over the last decade is changing the nature of business. If enterprise businesses are being increasingly built around data assets, then it must naturally follow that data as a commodity can be traded or re-imagined to create revenue streams off it. As an example, pioneering payment providers now offer retailers analytical services to help them understand which products perform best and how to improve the micro-targeting of customers. Thus, data is the critical prong of any digital initiative. This has led to efforts to monetize data by creating platforms that support ecosystems of capabilities. To vastly oversimplify this discussion, the ability to monetize data needs two prongs – to centralize it in the first place and then to perform strong predictive modeling at large scale, where systems need to constantly learn and optimize their interactions, responsiveness & services based on client needs & preferences. Thus, Data Silos hurt this overall effort more than the typical enterprise can imagine.

It must naturally follow that as more and more information assets are stored across the organization, it is a manifold headache to deal with securing each and every silo from a range of bad actors – extremely well funded and sophisticated adversaries ranging from criminals to cyber thieves to hacktivists. On the business compliance front, sectors like Banking & Insurance need to maintain large AML and Risk Data Aggregation programs – silos are the bane of both. Every industry needs fraud detection capabilities as well, which need access to unified data.

Conclusion

My intention for this post is clearly to raise more questions than provide answers. There is no question Digital Platforms are a massive business differentiator but they need to have access to an underlying store of high quality, curated, and unified data to perform their magic. Industry leaders need to begin treating high quality Data as the most important business asset they have & to work across the organization to rid it of Silos.

The first post in this four part series on Data lakes will focus on the business reasons to create one. The second post will delve deeper into the technology considerations & choices around data ingest & processing in the lake to satisfy myriad business requirements. The third will tackle the critical topic of metadata management, data cleanliness & governance. The fourth & final post in the series will focus on the business justification to build out a Big Data Center of Excellence (COE).

“Business owners at the C level are saying, ‘Hey guys, look. It’s no longer inordinately expensive for us to store all of our data. I want all of you to make copies. OK, your systems are busy. Find the time, get an extract, and dump it in Hadoop.’” – Mike Lang, CEO of Revelytix

The onset of Digital Architectures in enterprise businesses implies the ability to drive continuous online interactions with global consumers/customers/clients or patients. The goal is not just to provide engaging visualization but also to personalize the services clients care about across multiple modes of interaction. Mobile applications first began forcing enterprises to support multiple channels of interaction with their consumers. For example, Banking now requires an ability to engage consumers in a seamless experience across an average of four to five channels – Mobile, eBanking, Call Center, Kiosk etc. Healthcare is a close second, where caregivers expect patient, medication & disease data at their fingertips with a few finger swipes on an iPad app.

Big Data has been the chief catalyst in this disruption. The Data Lake architectural & deployment pattern makes it possible to first store all this data & then enables the panoply of Hadoop ecosystem projects & technologies to operate on it to produce business results.

Let us consider a few of the major industry verticals and the sheer data variety that players in these areas commonly possess –

The Healthcare & Life Sciences industry possesses some of the most diverse data across the spectrum, ranging from –

Structured Clinical data e.g. Patient ADT information

Free hand notes

Patient Insurance information

Device Telemetry

Medication data

Patient Trial Data

Medical Images – e.g. CAT Scans, MRIs, CT images etc

The Manufacturing industry players are leveraging the below datasets and many others to derive new insights in a highly process oriented industry –

Supply chain data

Demand data

Pricing data

Operational data from the shop floor

Sensor & telemetry data

Sales campaign data

Data In Banking – Corporate IT organizations in the financial industry have for many years been tackling data challenges caused by strict silo based approaches that inhibit data agility. Consider some of the traditional sources of data in banking –

Industries have changed around us since the advent of relational databases & enterprise data warehouses. Relational Databases (RDBMS) & Enterprise Data Warehouses (EDW) were built with very different purposes in mind. RDBMS systems excel at online transaction processing (OLTP) use cases where massive volumes of structured data need to be processed quickly. EDWs, on the other hand, perform online analytical processing (OLAP) functions where data extracts are taken from OLTP systems, loaded & sliced in different ways. Neither kind of system is suited to handling both immense volumes of data and highly variable data structures.

Let us consider the main reasons why legacy data storage & processing techniques are unsuited to today’s new business realities.

Legacy data technology enforces a vertical scaling method that is sorely unsuited to handling massive volumes of data in a scale up/scale down manner

The structure of the data needs to be modeled in a paradigm called ‘schema on write’ which sorely inhibits time to market for new business projects

Traditional data systems suffer bottlenecks when large amounts of high variety data are processed using them

Limits on the types of analytics that can be performed. In industries like Retail, Financial Services & Telecommunications, enterprises need to build detailed models of customer accounts to predict overall service level satisfaction in realtime. These models are predictive in nature and use data science techniques as an integral component. The higher volumes of data and the attribute richness that can be provided to them (e.g. transaction data, social network data, transcribed customer call data) help make the models highly accurate & able to provide an enormous amount of value to the business. Legacy systems are not a great fit here.
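The ‘schema on write’ constraint from the list above can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in for a relational store; the `customers` table, its columns and the sample rows are hypothetical. The point is only that a new attribute cannot be stored until the schema has been changed first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The schema is fixed up front, before any data is written.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customers (id, name) VALUES (?, ?)", (1, "Ada"))


def try_insert_with_new_attribute(connection):
    """Attempt to store a record that carries an attribute the table may not know."""
    try:
        connection.execute(
            "INSERT INTO customers (id, name, social_handle) VALUES (?, ?, ?)",
            (2, "Grace", "@grace"),
        )
        return True
    except sqlite3.OperationalError:
        # The store rejects the write because the column is not in the schema.
        return False


accepted_before = try_insert_with_new_attribute(conn)

# Someone has to remodel the table before the new attribute can land.
conn.execute("ALTER TABLE customers ADD COLUMN social_handle TEXT")
accepted_after = try_insert_with_new_attribute(conn)

print(accepted_before, accepted_after)
```

In a large enterprise the `ALTER TABLE` step is rarely a one-liner. It tends to involve data modeling, change approval and downstream ETL changes, which is where the time to market is lost.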

Given all of the above data complexity and the need to adopt agile analytical methods – what is the first step that enterprises must adopt?

The answer is the adoption of the Data Lake as an overarching data architecture pattern. Let’s define the term first. A data lake is two things – a data storage repository, whether small or massive, and a data processing engine. A data lake provides “massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs”.[1] Data lakes are created to ingest, transform, process, analyze & finally archive large amounts of any kind of data – structured, semi-structured and unstructured.

Illustration – The Data Lake Architecture Pattern

What Big Data brings to the equation beyond its strength in data ingest & processing is a unified architecture. For instance, MapReduce is the original framework for writing applications that process large amounts of structured and unstructured data stored in the Hadoop Distributed File System (HDFS). Apache Hadoop YARN opened Hadoop to other data processing engines (e.g. Apache Spark/Storm) that can now run alongside existing MapReduce jobs to process data in many different ways at the same time. The result is that ANY kind of application processing can be run inside a Hadoop runtime – batch, realtime, interactive or streaming.
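For readers who have not seen the MapReduce programming model, the following is a minimal sketch of its three stages (map, shuffle, reduce) using the classic word count. It is plain single-process Python that only mimics the pattern and does not use the Hadoop API; the function names and sample input are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all the values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["data lake", "data warehouse", "lake of data"])
print(counts)
```

On a real cluster the map and reduce functions run in parallel on many nodes, each working on the blocks of data stored locally in HDFS, while the framework handles the shuffle, scheduling and failure recovery.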

Visualization – Mobile applications first began forcing enterprises to support multiple channels of interaction with their consumers. For example, Banking now requires an ability to engage consumers in a seamless experience across an average of four to five channels – Mobile, eBanking, Call Center, Kiosk etc. The average enterprise user is also familiar with BYOD in the age of self service. The Digital Mesh only exacerbates this gap in user experiences, as information consumers navigate applications and consume services across a mesh that is multi-channel and also provides Customer 360 across all these engagement points. While information management technology has grown at a blistering pace, the human ability to process and comprehend numerical data has not. Applications being developed in 2016 are beginning to adopt intelligent visualization approaches that are easy to use, highly interactive and enable the user to manipulate corporate & business data using their fingertips – much like an iPad app. Tools such as intelligent dashboards, scorecards, mashups etc. are helping change visualization paradigms that were based on histograms, pie charts and tons of numbers. Big Data improvements in data lineage and quality are greatly helping the visualization space.

The ability to store enormous amounts of data with a high degree of agility & low cost: The Schema On Read architecture makes it trivial to ingest any kind of raw data into Hadoop in a manner that preserves its structure. Business analysts can then explore this data and define a schema to suit the needs of their particular application.

The ability to run any kind of Analytics on the data: Hadoop supports multiple access methods (batch, real-time, streaming, in-memory, etc.) to a common data set. You are only restricted by your use case.

The ability to analyze, process & archive data while dramatically cutting cost: Since Hadoop was designed to work on low-cost commodity servers with direct attached storage, it helps dramatically lower the overall cost of storage. Enterprises are thus able to retain source data for long periods, providing business applications with far greater historical context.

The ability to augment & optimize Data Warehouses: Data lakes & Hadoop technology are not a ‘rip & replace’ proposition. While they provide a much lower cost environment than data warehouses, they can also be used as the compute layer to augment these systems. Data can be stored, extracted and transformed in Hadoop. Then a subset of the data, i.e. the results, is loaded into the data warehouse. This frees the EDW’s compute cycles and storage to perform truly high value analytics.
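Two of the points above, Schema On Read and warehouse augmentation, can be sketched together in a few lines of Python. This is an illustration under stated assumptions rather than a Hadoop implementation: the raw JSON events, their field names and the `customer_spend` table are hypothetical, an in-memory list stands in for files in the lake, and SQLite stands in for the data warehouse.

```python
import json
import sqlite3
from collections import defaultdict

# Raw events land in the "lake" untouched; records need not share a structure.
raw_events = [
    '{"customer": "C1", "amount": "120.50", "channel": "mobile"}',
    '{"customer": "C2", "amount": "80", "device": {"os": "ios"}}',
    '{"customer": "C1", "amount": "19.50"}',
]


def read_with_schema(lines):
    """Apply a schema at read time: pick and type only the fields this use case needs."""
    for line in lines:
        record = json.loads(line)
        yield {
            "customer": record["customer"],
            "amount": float(record["amount"]),
            "channel": record.get("channel", "unknown"),
        }


def total_by_customer(records):
    """The heavy transformation step that would run in the lake, not the warehouse."""
    totals = defaultdict(float)
    for record in records:
        totals[record["customer"]] += record["amount"]
    return dict(totals)


totals = total_by_customer(read_with_schema(raw_events))

# Only the small result set is loaded into the warehouse (SQLite as a stand-in).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customer_spend (customer TEXT PRIMARY KEY, total REAL)")
warehouse.executemany("INSERT INTO customer_spend VALUES (?, ?)", sorted(totals.items()))
rows = warehouse.execute(
    "SELECT customer, total FROM customer_spend ORDER BY customer"
).fetchall()
print(rows)
```

A second application could read the same raw events with a different schema, for instance one that keeps the nested `device` field, without anything having to be re-ingested.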

The next post of the series will dive deeper into the architectural choices one needs to make while creating a high fidelity & business centric enterprise data lake.