The Pros and Cons of Data Lakes

As organizations finally get a handle on big data and analytics, the concept of a data lake is gaining momentum. A data lake is a large storage infrastructure capable of holding enormous amounts of raw data. By definition, data lakes store data in its native format. They are typically object storage repositories that hold unstructured and semi-structured data within a “flat” architecture. The data can then be queried for relevant information, and analysis can be conducted on the smaller data sets that each query returns.

In the ideal scenario, the data lake would replace the data warehouse, serving as a large and vastly scalable collection of unstructured data in its native format, ripe for all sorts of analysis and experimentation. That’s the ideal. In reality, a data lake can go horribly wrong if it isn’t designed properly and its data isn’t managed well. Here are the pros and cons of data lakes, and how to determine whether one is what your organization really needs.

The Benefits of a Data Lake: Storing Raw Data

The number one benefit of a data lake is that you don’t have to know what you will use the data for before it is stored. Since it’s stored in raw format, you can essentially store any and all data for indefinite periods of time, pulling it as analytical needs arise. This flexibility is the main reason data lakes are probably the future of data management. It’s also the reason that data lakes can quickly become problematic and useless if not properly managed.
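To make the "store now, decide later" idea concrete, here is a minimal sketch in Python. It assumes the raw data arrives as one JSON document per line; the sample events, field names, and the helper functions read_orders and units_by_sku are all hypothetical and exist only for illustration. The point is that nothing about the eventual analysis is decided when the data is stored. The structure is applied only when a question comes up.

```python
import json

# Hypothetical raw events, kept exactly as they arrived (one JSON document per line).
RAW_EVENTS = [
    '{"type": "order", "sku": "A-100", "qty": 2, "ts": "2024-01-05T10:00:00"}',
    '{"type": "click", "page": "/home", "ts": "2024-01-05T10:01:00"}',
    '{"type": "order", "sku": "B-200", "qty": 1, "ts": "2024-01-05T10:02:00"}',
]


def read_orders(raw_lines):
    """Apply an 'order' shape at read time; other records are skipped, not deleted."""
    for line in raw_lines:
        record = json.loads(line)
        if record.get("type") == "order":
            yield {"sku": record["sku"], "qty": int(record["qty"])}


def units_by_sku(raw_lines):
    """Answer one analytical question (units ordered per SKU) from the raw data."""
    totals = {}
    for order in read_orders(raw_lines):
        totals[order["sku"]] = totals.get(order["sku"], 0) + order["qty"]
    return totals


print(units_by_sku(RAW_EVENTS))
```

The click event stays in the raw store untouched, so a later question about page views could be answered from the same data without reloading anything.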

Data lakes can become even richer when the data carries metadata on its history, or “chain of possession” (often called data lineage or provenance), but this is proving technologically difficult. A few vendors offer solutions, but most are still in their infancy.

Data lakes are excellent for applications in which it is useful to combine new or streaming data with historical data, for instance comparing Web transactions with historical inventory data, or analyzing point-of-sale (POS) data alongside historical sales data. With a well-designed, smartly executed data lake, big data can become a watershed technology for the business, enabling it to put every shred of data from every potential source to a purpose.
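As a rough illustration of combining new data with historical data, the Python sketch below applies a handful of incoming Web transactions to a historical inventory snapshot and flags items that are running low. The inventory figures, the transaction format, the threshold, and the function name flag_low_stock are assumptions made up for this example; a real pipeline would read both sides from the lake rather than from in-memory literals.

```python
# Hypothetical historical inventory snapshot: units on hand per SKU.
HISTORICAL_INVENTORY = {"A-100": 5, "B-200": 1}

# Hypothetical new Web transactions, in arrival order.
NEW_TRANSACTIONS = [
    {"sku": "A-100", "qty": 2},
    {"sku": "B-200", "qty": 1},
    {"sku": "A-100", "qty": 2},
]


def flag_low_stock(transactions, inventory, threshold=2):
    """Return SKUs whose remaining stock drops below `threshold`
    after applying the incoming transactions in order."""
    remaining = dict(inventory)  # copy so the historical snapshot is not mutated
    flagged = []
    for tx in transactions:
        sku = tx["sku"]
        remaining[sku] = remaining.get(sku, 0) - tx["qty"]
        if remaining[sku] < threshold and sku not in flagged:
            flagged.append(sku)
    return flagged


print(flag_low_stock(NEW_TRANSACTIONS, HISTORICAL_INVENTORY))
```

Working on a copy of the snapshot keeps the historical data intact, which matters if the same snapshot feeds other analyses.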

The Drawbacks of a Data Lake: Rethinking Data Management

If you know much about databases, you’ve probably already identified the primary quandary of a data lake: how to store data in such a way that it can be retrieved again by query. That capability must be built into the data lake through unique and rich metadata tags. Without these tags, the data lake quickly devolves into what industry insiders have dubbed the data swamp.
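The role of metadata tags can be sketched with a toy example in Python. The DataLake class below is purely hypothetical and lives in memory: it models a flat key-to-object store alongside a tag index, and it is not any vendor's API. The tag names (source, year) and object keys are also invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataLake:
    """Toy in-memory lake: a flat key -> payload store plus a tag index."""
    objects: dict = field(default_factory=dict)
    tags: dict = field(default_factory=dict)

    def put(self, key, payload, **metadata):
        if not metadata:
            # Untagged objects cannot be found again by query.
            raise ValueError("refusing to store an object without metadata tags")
        self.objects[key] = payload
        self.tags[key] = metadata

    def find(self, **criteria):
        """Return the sorted keys whose tags match every criterion."""
        return sorted(
            key for key, meta in self.tags.items()
            if all(meta.get(name) == value for name, value in criteria.items())
        )


lake = DataLake()
lake.put("obj-001", b"raw bytes", source="pos", year=2023)
lake.put("obj-002", b"raw bytes", source="web", year=2023)
lake.put("obj-003", b"raw bytes", source="pos", year=2024)

print(lake.find(source="pos"))
print(lake.find(source="pos", year=2024))
```

Rejecting untagged objects at write time is one possible design choice; it trades some ingestion convenience for the guarantee that everything in the store remains discoverable.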

Another disadvantage of the data lake is that it won’t work with traditional data storage and analytical technologies. For example, you can’t build a real data lake in a conventional SQL database, which expects data to fit a predefined table structure before it is loaded. You must be able to handle data in a flat, non-hierarchical fashion.

Finally, data lakes are resource hogs. They take enormous quantities of storage, analyzing the data eats up processing power, and they can also squander the valuable time and energy of the data scientist. Since data scientists are some of the hardest-to-find and most expensive talent on the market today, you don’t really want them spending the majority of their time on data preparation before the analytics can even begin. It’s better to build a team of analytics specialists to handle these mundane issues, and reserve your highly skilled IT and data professionals for things like designing the infrastructure and conducting the actual data analysis.