Data Lakes: Emerging Pros and Cons

When it comes to big data and cloud computing, it looks like more and more pipelines will funnel into so-called data lakes. Generally speaking, a data lake is a storage repository that holds large volumes of data in its native format until it's needed. But what problems do data lakes solve -- and what new challenges might they introduce for data scientists and business analysts?

First, the good news. Data lakes aim to solve two problems -- one old and one new, according to Gartner:

Addressing an old problem -- information silos. "Rather than having dozens of independently managed collections of data, you can combine these sources in the unmanaged data lake. The consolidation theoretically results in increased information use and sharing, while cutting costs through server and license reduction," Gartner states.

Addressing a new problem -- big data initiatives. "Big data projects require a large amount of varied information. The information is so varied that it's not clear what it is when it is received, and constraining it in something as structured as a data warehouse or relational database management system (RDBMS) constrains future analysis," Gartner adds.
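
To make the contrast with a warehouse concrete, here is a minimal Python sketch of the "native format until it's needed" idea, often called schema-on-read. It is an illustration under stated assumptions, not a description of any vendor's product: the in-memory `lake` list, the `ingest`, `read_records` and `query_patient_ids` helpers, and the sample records are all hypothetical.

```python
import csv
import io
import json

# Hypothetical in-memory "lake": each entry keeps the raw payload untouched,
# plus a minimal format tag recorded at ingest time.
lake = []


def ingest(raw, fmt):
    """Store the payload in its native format; no schema is enforced here."""
    lake.append({"format": fmt, "raw": raw})


def read_records(entry):
    """Interpret a raw payload only when it is needed (schema-on-read)."""
    if entry["format"] == "json":
        # One JSON object per line.
        return [json.loads(line) for line in entry["raw"].splitlines() if line.strip()]
    if entry["format"] == "csv":
        return list(csv.DictReader(io.StringIO(entry["raw"])))
    raise ValueError("no reader for format %r" % entry["format"])


def query_patient_ids():
    """Collect patient IDs across every source, whatever its original format."""
    ids = []
    for entry in lake:
        for record in read_records(entry):
            if "patient_id" in record:
                ids.append(str(record["patient_id"]))
    return ids


# Made-up sample data in two different native formats.
ingest('{"patient_id": 17, "note": "follow-up"}\n{"patient_id": 42}', "json")
ingest("patient_id,scan_type\n42,MRI\n", "csv")
```

In this sketch, ingest never rejects a record for failing to fit a predefined table. The cost shows up at read time: every query depends on the format tag being present and correct, which is the kind of metadata that has to be managed closely.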

Vertical Market Adoption

Data lake deployments have started across multiple industries. A few examples:

Healthcare: UC Irvine Medical Center leverages a Hadoop-based data lake to maintain millions of records for more than 1 million patients. The lake includes radiology images and other semi-structured reports, according to PricewaterhouseCoopers, which was involved in the build-out. Moreover, Boston's Partners Healthcare is working with EMC on a data lake to speed clinically relevant research.

Insurance: The insurance industry is embracing the data lake concept, which agents hope will let them perform a "single lookup" rather than making multiple individual queries across numerous storage systems.

Financial Services: Westpac, the first bank established in Australia, has been building a data lake that will make all data centrally available and accessible as a customer service hub, CIO Dave Curran has said.

New Data Lake Solutions

Meanwhile, big data, storage and cloud companies have introduced a growing number of data lake offerings.

The latest is Microsoft Azure Data Lake, which emerged at the Microsoft Build conference this week. The Microsoft offering is compatible with the Hadoop Distributed File System (HDFS). It has no fixed limits on file size or account size, and is nearing a public preview stage.

Data lakes will likely take center stage at EMC World (May 4-7), where David Dietrich, director of big data solutions, and other pundits are set to offer Information Management multiple updates. EMC announced a range of data lake offerings in March 2015, and will surely expand on those efforts at the conference.

Lingering Challenges

Still, Gartner has warned data scientists and analytics professionals about data lake hype and three potential risks. The first is poor data quality if customers don't closely manage the metadata. The second involves security and access control. Finally, Gartner warns that general-purpose infrastructure may not scale as well as purpose-built infrastructure designed specifically for data lakes.

In theory, purpose-built cloud systems can solve the security, access control and scalability challenges. Data quality, however, remains a challenge that businesses have been navigating since they turned on their first data-gathering and management systems.