Look Before You Leap into the Data Lake

If the concept of a data lake is confusing to you, don't worry: you're not alone. A primary reason for the confusion is that the definition of a data lake seems to change depending on which constituency you ask. The big data community will define it as a central location for all your disparate data sources, stored in their native format in Hadoop. Even within the big data community, it may go by a different name, such as enterprise data hub, depending on the vendor you're speaking with. The Business Intelligence community defines a data lake as a staging area, or landing area, for your source system data, and makes less of a distinction about where the data is stored.

The two questions I'm asked most often are:

1. If I build a data lake, does it need to be in Hadoop?
2. Is there any value in building a data lake?

1. If I build a data lake, does it need to be in Hadoop?

The first question can be generalized to whether data lakes are best built on a relational or a non-relational platform. Like any architectural decision, it depends on a number of criteria, such as total data volume, incremental data volume, volatility of the underlying systems, cost, ease of access and speed to delivery. Because Hadoop provides schema-on-read, it is often faster to build the data lake there: a schema doesn't need to be defined before data can be loaded. However, the cost of defining that schema isn't eliminated but rather delayed, since the data still needs to be structured when it is read. While the choice of where to build a data lake could be an article on its own, the truth is that you can build a data lake on either a relational or a non-relational platform. It truly comes down to your current and future requirements.
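To make the schema-on-read trade-off concrete, here is a minimal sketch in plain Python rather than Hadoop. The sample records, the ORDER_SCHEMA mapping and the read_with_schema helper are all hypothetical names invented for illustration. The raw lines are stored exactly as they arrived, and the work of typing them and handling bad values only shows up when someone reads them.

```python
import json

# Hypothetical raw records, landed in the lake exactly as the source emitted them.
# Nothing was validated or typed at load time.
raw_lines = [
    '{"order_id": "1001", "amount": "19.99", "region": "EMEA"}',
    '{"order_id": "1002", "amount": "5.00"}',
    '{"order_id": "1003", "amount": "n/a", "region": "APAC"}',
]

# Hypothetical schema, defined only once someone needs to query the data.
ORDER_SCHEMA = {"order_id": int, "amount": float, "region": str}


def read_with_schema(lines, schema):
    """Parse raw lines and coerce each field; missing or unparseable values become None."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            try:
                row[field] = cast(record[field])
            except (KeyError, ValueError, TypeError):
                # The structuring cost that was skipped at load time is paid here.
                row[field] = None
        rows.append(row)
    return rows


orders = read_with_schema(raw_lines, ORDER_SCHEMA)
for order in orders:
    print(order)
```

Loading was trivial, but every reader now has to decide what a missing region or an amount of "n/a" means. A relational platform would have forced that decision before the first row was loaded.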

2. Is there any value in building a data lake?

The second question, regarding value, depends on what data is being loaded into the data lake and whether it can be leveraged, now or in the future, to generate revenue or reduce costs in the organization. I often hear that companies should spend time and consume hardware resources loading all data in the enterprise into the data lake. I find that message usually comes from the big data 'experts', who, in my experience, have more background in software development than in data management. There is also the question of data quality, context and governance. While there may be value in having the data, if it isn't accurate, isn't usable and isn't managed, the value of the data lake diminishes greatly.

If you're considering building a data lake, it's best to do your research, understand your requirements and ensure you have a viable business case. The questions of what technology to build it with and whether there is value will then answer themselves.