The Down Low on Data Lakes

While data lakes can help democratize data, many experts feel they are not yet ready for prime time.

Data lakes are the latest incarnation of the "democratize data" theme in business. The idea is simple: Pool data from enterprise applications and other systems into a single repository. Voila! Business users have unfettered access to enterprise data.

Business leaders immediately grasp the benefits, said Andrew Oliver, president and founder of Open Software Integrators, a Big Data consulting firm.

"It's really part of this data liberalization, which is rather than having some IT gnome guarding the bridge in front of the castle and saying 'Business user, I think your use case is stupid, we're not going to do that,'" Oliver said in a recent interview. "You get the data in a place where it's accessible and bring it to the people who are actually using it."

Oliver has written numerous columns about the benefits of data lakes, making him one of the most outspoken advocates for this new concept. His North Carolina firm consults on Big Data and business intelligence projects and has built several data lakes for large enterprise clients, so he's seen how data lakes can ultimately liberate data from enterprise applications and data warehouses.

Data warehouses require structure, which in turn means you need to know in advance how you'll use the data. By using Hadoop, whose file system imposes no structure on the data it stores, you eliminate that onerous step. Architecturally, a data lake acts as a buffer between the enterprise applications and your BI tools, so it should reduce system crashes and sluggish performance by offloading queries from the applications themselves. It also requires less integration work for future projects. All of this appeals to business leaders, according to data lake advocates.
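The schema difference is easiest to see in miniature. The sketch below (all field names hypothetical) contrasts a warehouse-style load, which rejects records that don't match a predeclared schema, with a lake-style store, which lands raw records as-is and imposes structure only when someone reads them:

```python
import json

# Schema-on-write (warehouse-style): the structure must be known up front.
WAREHOUSE_SCHEMA = {"user_id", "page", "timestamp"}

def load_into_warehouse(record: dict) -> dict:
    missing = WAREHOUSE_SCHEMA - record.keys()
    if missing:
        raise ValueError(f"rejected: missing fields {missing}")
    return {k: record[k] for k in WAREHOUSE_SCHEMA}

# Schema-on-read (lake-style): store the raw bytes now, decide structure later.
lake = []

def land_in_lake(raw_line: str) -> None:
    lake.append(raw_line)  # no validation, no schema

def read_from_lake(wanted_field: str):
    # Structure is imposed only when a consumer asks a question.
    for line in lake:
        record = json.loads(line)
        if wanted_field in record:
            yield record[wanted_field]

land_in_lake('{"user_id": 1, "page": "/home", "timestamp": 1700000000}')
land_in_lake('{"user_id": 2, "referrer": "news.example"}')  # unknown shape, still kept
print(list(read_from_lake("page")))  # -> ['/home']
```

The second record would have been rejected by the warehouse load, but the lake keeps it anyway, which is the point: nobody had to decide its use in advance.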

With such clear benefits, you'd think a data lake would be an obvious win. Yet among technologists, it is extremely controversial. Experts disagree over almost every aspect of data lakes, from how to define the term to whether they are practical, or even possible.

The debate came to a head in July, when IT research firm Gartner struck out at what it deemed the "Data Lake Fallacy" and accused vendors of "exploiting the hype with no intent to resolve the lack of programming, analytical and data manipulation skills necessary to improve specific business outcomes."

"It's still so new there's more worst practices than there are best practices right now," said Graham of data warehousing vendor Teradata. "There's just not enough repeatable implementations. In fact, the vision of the data lake is not exactly harmonious across the vendors and the customers."

Many of Teradata's customers are exploring Hadoop, and he estimates the top 50 have an implementation running. When it comes specifically to data lakes, however, he suspects there is more vision than reality. He predicts it will be another five years before the concept firms up, complete with best practices and success stories.

"We're seeing some emerging consensus, but most of the consensus here are the land mines with the data lake; don't step there, do something else, and so the land mines are almost more valuable than the best practices," he said.

Data Lake Use Cases

Still, experts agree on some situations in which data lakes should do more good than harm. Data lakes give you the option of retaining data that would otherwise be deleted, Oliver points out. Once, web logs were only seen as a diagnostic tool and were not retained, but now those logs are used for statistical reports and analytics. Data lakes could be used to store that data for longer periods, giving your company's analytics capabilities time to catch up.

Data lakes could also be leveraged to solve problems that we haven't been able to resolve with legacy systems, he added. As an example, he said data lakes could be coupled with decision support systems to automate low-level decisions (re-ordering a supply, for instance) that currently require a person.

"Credit card companies have been doing this for years. You don't really need a person to do some of those things," Oliver said. "What you need is algorithms, rules engines."
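The kind of low-level automation Oliver describes can be sketched as a simple rules engine. This is illustrative only; the item names and thresholds are made up, and a real decision-support system would draw these inputs from data in the lake:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    """A named condition paired with the action to take when it fires."""
    name: str
    condition: Callable[[dict], bool]
    action: Callable[[dict], str]

# One routine decision a person shouldn't need to make: re-order when stock runs low.
rules = [
    Rule(
        name="reorder-when-low",
        condition=lambda item: item["on_hand"] < item["reorder_point"],
        action=lambda item: f"reorder {item['reorder_qty']} x {item['sku']}",
    ),
]

def evaluate(item: dict) -> list[str]:
    """Fire every matching rule; no human in the loop for routine cases."""
    return [rule.action(item) for rule in rules if rule.condition(item)]

stock = {"sku": "WIDGET-9", "on_hand": 3, "reorder_point": 10, "reorder_qty": 50}
print(evaluate(stock))  # -> ['reorder 50 x WIDGET-9']
```

Adding a new automated decision means appending a rule, not writing a new application, which is what makes the pattern attractive for the low-level decisions Oliver has in mind.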

Data lakes are worth considering if you're dealing with logs from machines such as devices, servers or sensors, according to Theo Vassilakis. He is a former principal engineer at Google, where he developed Dremel, a tool that let Google's employees analyze data stored across thousands of machines at interactive speeds. For Google and Facebook, where systems were born and died in rapid succession, data lakes made sense, he said, because the data was always the same -- and the business was the data.

"I don't think any one system is 'the answer.' You need to think about how computation informs what you're actually trying to achieve in your organization," Vassilakis said.

Data Lakes and Enterprise Apps

When it comes to enterprise applications, data lakes may create more problems than they solve.

"Conceptually it looks easy, but logically it doesn't make sense," said BlueData CEO Kumar Sreekanti, a computer scientist who helped create iSCSI and VSAN, and served as vice president of R&D at VMware.

Sreekanti compared it to historic efforts in Silicon Valley to build a single computing architecture to handle every application. That has never worked, because each system solves different problems. Single data repositories are similar: the concept seems sound, but it creates unintended consequences, he explained.

"We have a few customers who clearly do not want to create a single source repository because of various regulatory reasons or concerns about the origins of the data and all that, so they like to leave the data where it is and they still like to do an analysis on the data," Sreekanti said, adding it's also a major no-no in companies concerned about security.

Vassilakis and Sreekanti now run software startups that address very different problems, but share a common view that data can stay where it is and still be useful.

Vassilakis is CEO of Metanautix, a company he co-founded and launched with former Facebook engineer Toli Lerios. Metanautix offers a middleware-style product called Quest, which Lerios and Vassilakis describe as a "data compute engine" that sits between the two extremes of "move the data" and "never move the data." It allows companies to manipulate and analyze Big Data by supporting both ETL and virtualized data.

Sreekanti's company, BlueData, launched Epic Enterprise 1.0 this fall. It creates an Amazon-like cloud experience built on one premise: Rather than being constrained to data on specific physical machines, the user can point to any storage that exists within the enterprise.

"Your data can be anywhere, and we bring it in during runtime," Sreekanti said.