The Data Lake is Dead; Long Live the Data Lake!

You probably already know that leading analyst firms have been quoting data lake failure rates of 85% for some time now.

You may not be aware that one of those same leading analyst firms is now also forecasting that, by 2020, 30% of data lakes will be built on standard relational DBMS (database management system) technology “at equal or lower cost than Hadoop” because – and I quote – “application performance is superior” and “most data going into data lakes is relational.”

Put those two things together and you start to understand why MapR has recently gone to the wall. And why Cloudera is under so much financial stress.

With many organisations having invested tens and even hundreds of millions of dollars in data lakes that deliver little or no business value, it’s way past time for some brutal self-assessment in the technology industry.

Many data lakes have failed because they were IT-led vanity projects, with no clear linkage to business objectives and operational processes. If the strategy for your failing data lake is to lift-and-shift it lock-stock-and-barrel from Hadoop to an object store, then you are about to flush more millions down the pan – to say nothing of the opportunity cost associated with several more wasted years. Unfortunately, I know from personal experience that this is absolutely the plan in several large organisations that really ought to know better.

Failed data lakes often represent a toxic combination of both poor technology choices and an inadequate approach to data management and integration. If you think that data management begins and ends with ACID (Atomicity, Consistency, Isolation, Durability) compliance – as at least one of the cool kid vendors that e-mails me regularly seems to – then pick any technology platform you like, so long as you do it quickly. If you are going to fail anyway, you may as well fail fast.

Better yet, develop a data strategy that includes a layered data architecture, a minimum viable product approach to data integration (we call that “Light Integration”) – and an agile, incremental approach to the more robust integration of the data that matter most. That gives you a fighting chance of optimising end-to-end business processes and delivering real business value.

Much of the complex, multi-structured data that today sits unloved and unqueried in Hadoop-based data lakes will ultimately reside in object storage. At Teradata, we recognize this – hence our focus on enabling robust access to object stores. But much of your structured and semi-structured interaction data belongs in your existing data and analytics platform, where it can be seamlessly integrated with the transaction data you already manage there. Don’t just take my word for it; ask the analysts.

Not every data lake is a data swamp – and like all technologies, the Hadoop stack has a sweet spot. But the tide of history is now running against data silos masquerading as integrated data stores, just because they are co-located on the same hardware cluster. And that same tide is running against a distributed file system and lowest-common-denominator SQL engine masquerading as a fully-fledged analytic DBMS.

If you are doubling down on your investment in Hadoop, you are swimming against that tide. And if you are betting on a fashionable-but-unproven technology to get you out of a data management hole, then you aren’t learning from recent history – you are condemning yourself to repeat it. But if you are ready to move on and look forward, talk to us about the industry’s leading integrated data and analytic platform, Teradata Vantage.

(Author):

Martin Willcox

Martin leads Teradata’s EMEA technology pre-sales function and organisation and is jointly responsible for driving sales and consumption of Teradata solutions and services throughout Europe, the Middle East and Africa. Prior to taking up his current appointment, Martin ran Teradata’s Global Data Foundation practice and led efforts to modernise Teradata’s delivery methodology and associated tool-sets. In this position, Martin also led Teradata’s International Practices organisation and was charged with supporting the delivery of the full suite of consulting engagements delivered by Teradata Consulting – from Data Integration and Management to Data Science, via Business Intelligence, Cognitive Design and Software Development.

Martin was formerly responsible for leading Teradata’s Big Data Centre of Excellence – a team of data scientists, technologists and architecture consultants charged with supporting Field teams in enabling Teradata customers to realise value from their Analytic data assets. In this role Martin was also responsible for articulating Teradata’s Big Data strategy to prospective customers, analysts and media organisations outside of the Americas. During his tenure in this position, Martin was listed in dataIQ’s “Big Data 100” as one of the most influential people in UK data-driven business in 2016. His Strata (UK) 2016 keynote can be found at: www.oreilly.com/ideas/the-internet-of-things-its-the-sensor-data-stupid; a selection of his Teradata Voice Forbes blogs can be found online here; and more recently, Martin co-authored a series of blogs on Data Science and Machine Learning – see, for example, Discovery, Truth and Utility: Defining ‘Data Science’.

Martin holds a BSc (Hons) in Physics & Astronomy from the University of Sheffield and a Postgraduate Certificate in Computing for Commerce and Industry from the Open University. He is married with three children and is a solo glider pilot, supporter of Sheffield Wednesday Football Club, very amateur photographer – and an even more amateur guitarist.
