Intelligent Data Lakes — Reimagined

Thomas Hazel, July 11, 2017

Preface

Back in 2015, I began a series of articles on big data, its key trends, and its likely future. These articles were published on the popular data management site DATAVERSITY. The general theme was that a major change was afoot in data analytics, initiated by large and inexpensive on-prem and cloud storage, which fostered the deluge of data we see today. Due to this tsunami of data and the continued increase in computational power, analytics and machine learning began a unique partnership. In my inaugural column, I talked about how this explosion of data, along with the advent of cloud computing and progress in machine learning, would ignite innovation and reinvent business, changing how databases specifically, and big data generally, are used. Now, several years later, big data, machine learning, and business almost seem synonymous.

In my last DATAVERSITY column, I compared and contrasted traditional data warehouse solutions with up-and-coming data lake platforms such as Hadoop, asking whether they would replace or augment traditional architectures. Since then, data lakes, particularly Hadoop, have taken a bit of a hit due to the complexity of designing, building, and even hiring for such solutions. In other words, storing data under a “schema on read” philosophy is relatively easy and fast (the good part). However, structuring purposely disjoined, disparate, and often schema-less data after the fact has turned the big data analytic dream into a chaotic nightmare. Today it is more common to hear how data lakes have turned into data swamps. Discovering and organizing what is in your data lake, manually structuring it, and ultimately analyzing it has proven darn near impossible (the bad part).

This is where I’d like to begin my next series of blogs, as well as introduce what I (and an amazing team of data nerds, computer scientists, and entrepreneurs) have been up to this past year. These blogs will continue the conversation of data management and its associated chaos, but more importantly, address current data lake failings and ultimately deliver on their powerful premise.

History

To reimagine data lakes, a quick history of analytics is in order. Before there were lakes, there were warehouses storing structured data, mostly geared toward optimizing business operations. However, with the big bang of big data, these architectures have proven terribly inadequate. Prior to the Googles and Facebooks of the world, who store and process exabytes of data daily, the term “big” typically referred to transactional data that was naturally structured and in the terabyte, or maybe petabyte, range. Such data sources and their sizes were in the wheelhouse of warehousing solutions, but as data became more unstructured and storing anything and everything for advanced analytics became vital, cutting-edge technologies and procedures had to be invented and employed. Today, more than ever, these new data sources are as important as, and maybe even more important than, the classic transactional data. And yet, new solutions up to now still have not solved how to simply, efficiently, and cost-effectively intersect all forms of data (structured, unstructured, semi-structured), all in one lake (think silo unification).

So where does this leave us? In principle, building data lakes on a Hadoop platform should be the answer: store anything and everything in cheap and elastic storage (Hadoop HDFS) to be structured and queried programmatically (Hadoop MapReduce). However, in practice this has been anything but true. First off, actually designing and building out a Hadoop cluster for storage, processing, or searching is far from simple. Probably the simplest aspect is setting up distributed storage, but even that can be a configuration nightmare. And if you thought HDFS was difficult, wait until you try to build a MapReduce project to answer the most basic analytic questions.

There has been work to simplify MapReduce. For instance, Pig, Hive, HBase, and even Spark have all set out to make the Hadoop platform simpler. Yet data lakes still turn swampy over time, eventually taking way too long to implement and most certainly costing way too much. Now, one might wonder why this continues to be the case. There are many reasons, but the common thread is that each new solution is a patch rather than a complete, unified answer and, more often than not, pushes the problem or problems elsewhere. And here is where I’d like to start on how we, at CHAOSSEARCH, reimagined data lakes.

Reimagine

This reimagination of data lakes touches several aspects of the architecture, each addressing limitations experienced in today’s solutions. When all of these are resolved as one service, data lakes can be called a “go to” solution for big (and don’t forget small) analytic needs.

Data Storage

Like any data platform, ours revolves around big data storage, specifically object storage. Hadoop HDFS could have been used, but it is not simple, not a service, and not trending as well as object storage. In other words, object storage (such as Amazon S3) is becoming, if it is not already, the de facto place to store anything and everything. The reasons are obvious. It’s even in its name for goodness sake: Simple Storage Service (S3). Simple because it is based on a RESTful API with three basic functions: PUT, GET, LIST. And here is the magic: “anyone” can use it. It is also wonderfully inexpensive and infinitely elastic, with zero configuration. Object storage, particularly S3, is a clear storage winner.
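To make the PUT, GET, LIST idea concrete, here is a minimal sketch of that three-function contract. It is an in-memory toy, not the S3 API: the class name `InMemoryObjectStore`, its method names, and the sample keys are all assumptions made up for illustration.

```python
from typing import Dict, List


class InMemoryObjectStore:
    """Hypothetical toy stand-in for an object store (not the real S3 API)."""

    def __init__(self) -> None:
        # Objects live in a flat key space; "folders" are just key prefixes.
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, body: bytes) -> None:
        # PUT: store (or overwrite) the bytes under the given key.
        self._objects[key] = body

    def get(self, key: str) -> bytes:
        # GET: return the stored bytes; raises KeyError if the key is absent.
        return self._objects[key]

    def list_keys(self, prefix: str = "") -> List[str]:
        # LIST: return all keys that start with the prefix, sorted.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = InMemoryObjectStore()
store.put("logs/2017/07/11/app.json", b'{"level": "info"}')
store.put("logs/2017/07/12/app.json", b'{"level": "error"}')
store.put("images/logo.png", b"\x89PNG")

july_keys = store.list_keys(prefix="logs/2017/07/")
first_log = store.get(july_keys[0])
```

The point of the sketch is how little surface area there is: anything can be stored as bytes under a key, and a prefix listing is enough to find it again.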

Data Format

Another aspect, critical to the success of a data lake architecture, is data representation. In other words, data formats are where most of the problems begin and where most problems can be solved. Unlike Hadoop, where each data source has to be transformed into a specific format for any real analytic use case, CHAOSSEARCH uses a universal format we call Data Edge. Our format can universally represent any other format, removing critical pain points Hadoop deployments struggle with. As with any big data solution, data grows in three key areas: Volume, Velocity, and most notably Variety. Variety, along with ever-changing analytic demands, is the major reason data lakes become data swamps. The physical manipulation of data into different formats for different use cases is the proverbial Achilles’ heel, not only for data lakes but for all big data analytic implementations. Data Edging has several key aspects, and a significant one is its ability to transform, aggregate, and correlate other Data Edge sources instantly, without the need for physical Extract, Transform, Load (ETL) data processing.

Data Processing

There are several ways to define what data processing is. In the case of both warehouses and lakes, data processing is, in essence, the same. In other words, data “is” stored, but any real data processing within the system is mostly an egress request. In the case of warehousing, ETL ingress (structuring the data) is not considered part of the warehouse. And in the case of lakes, ingress has no ETL (well, in theory, since any useful Hadoop deployment ends up with schema on write). The point is that ingress and egress data flows are not connected; there is always a write operation that is independent of a read. What is needed, and what would greatly reduce the scaffolding typically built around such solutions, is to connect ingress with egress so that transforming, aggregating, and correlating data are seen as one data pipeline. For CHAOSSEARCH, that is what we did. The ability to create virtual data pipelines that can be triggered and queried is a major aspect of our solution, and a major capability missing from traditional data lakes.
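One generic way to picture a “virtual” pipeline is a chain of lazy steps that only touch the raw data when a query is asked, so nothing is rewritten into a second physical format along the way. The sketch below is only that general idea in plain Python generators; it is not how CHAOSSEARCH is implemented, and `RAW_OBJECTS`, `parse`, `only_errors`, and `count_by_service` are hypothetical names and sample data.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Iterator

# Hypothetical raw records as they were written at ingress, untouched.
RAW_OBJECTS = [
    '{"service": "api", "status": 200}',
    '{"service": "api", "status": 500}',
    '{"service": "web", "status": 200}',
    "not valid json",
]


def parse(lines: Iterable[str]) -> Iterator[Dict]:
    # Transform step: apply structure at read time, skipping unparseable lines.
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def only_errors(records: Iterable[Dict]) -> Iterator[Dict]:
    # Filter step: keep records whose status is 500 or above.
    return (r for r in records if r.get("status", 0) >= 500)


def count_by_service(records: Iterable[Dict]) -> Counter:
    # Aggregate step: this is the only point where data is actually pulled through.
    return Counter(r["service"] for r in records)


# Composing the steps does no work yet; the pipeline is just a description.
pipeline = only_errors(parse(RAW_OBJECTS))
errors_by_service = count_by_service(pipeline)
```

Because each step is lazy, the write side (raw objects) and the read side (the aggregate) are connected by one definition rather than by an intermediate ETL job and a second copy of the data.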

Data Scaling

Any good big data architecture has scale as part of its vernacular. The problems begin when scale requires manual reconfiguration or, worse, an architectural redesign. Hadoop scale by definition requires both configuration and design as part of its basic constructs. CHAOSSEARCH, in contrast, scales just like S3: seamlessly, without configuration or complexity. CHAOSSEARCH is a serverless, always-on, elastic service. We achieved this by building a naturally distributed architecture from the ground up, leveraging technologies such as Docker Swarm and Scala/Akka, and wrapping it in an Angular2 console. By combining Data Edge technology with an Akka framework, CHAOSSEARCH has built a service that has capabilities and performance like no other.

Data Analytics

The final aspect all data lakes need is the ability to perform useful queries on the data in the lake. Now there are many good solutions out there, and the actual querying of data is not typically a problem one would think data lakes would have. In the case of Hadoop, manual coding “was” a problem, but now that higher-level components such as Hive offer SQL-like querying (and Pig offers its own scripting language), this really is not a problem any more. However, SQL is not typically thought of as “easy” for the average programmer. In other words, simple is not an adjective for such a language. And here is where CHAOSSEARCH saw another opportunity to make querying a whole lot simpler. Our idea was to extend the object storage interface and add search functionality, both relational and text, in keeping with RESTful API services such as those found in S3. We believe such an interface will promote adoption by anyone and everyone and ensure data lakes don’t become data swamps.
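As a rough picture of what “LIST plus search” could look like, the sketch below adds one call that filters objects by key prefix, by field values (the relational part), and by a text match. The function `search`, its parameters, and the sample objects are hypothetical; this is not the CHAOSSEARCH interface, only an illustration of the shape such a call might take.

```python
import json
from typing import Dict, List, Optional

# Hypothetical stored objects: key -> JSON document.
OBJECTS: Dict[str, str] = {
    "logs/a.json": '{"service": "api", "status": 500, "message": "timeout calling db"}',
    "logs/b.json": '{"service": "web", "status": 200, "message": "page rendered"}',
    "logs/c.json": '{"service": "api", "status": 200, "message": "db query ok"}',
}


def search(
    prefix: str = "",
    where: Optional[Dict[str, object]] = None,
    text: Optional[str] = None,
) -> List[str]:
    """Return keys under `prefix` whose fields equal `where` and whose values contain `text`."""
    where = where or {}
    hits: List[str] = []
    for key in sorted(OBJECTS):
        if not key.startswith(prefix):
            continue
        record = json.loads(OBJECTS[key])
        # Relational part: every requested field must match exactly.
        if any(record.get(field) != value for field, value in where.items()):
            continue
        # Text part: case-insensitive substring match over the record's values.
        if text is not None and not any(
            text.lower() in str(value).lower() for value in record.values()
        ):
            continue
        hits.append(key)
    return hits


api_hits = search(prefix="logs/", where={"service": "api"})
db_hits = search(text="db")
api_db_errors = search(where={"service": "api", "status": 500}, text="db")
```

The caller never writes SQL: the query is a prefix plus a couple of optional filters, which is about as close to a LIST call as a search can get.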

Intelligence

In conclusion, and as the lead-in to the next several blogs, there is the role machine learning can play in big data analytics. In the case of CHAOSSEARCH, machine learning is the engine that continually wrestles the data chaos into a streamlined service. It drives everything from automatically discovering and analyzing sources with Data Edging to automatically organizing Data Edge sources so they can be easily refined and queried. In my blogs, I will go into details on Data Edging, our distributed architecture, and how machine learning brings it all together for a simple and seamless data platform we call a “Smart Data Lake”, built around “Smart Object Storage”, all on top of S3.