Performance Tuning and Optimization by Dmitry Tolpeko

What Is the Data Lake Concept

The data lake concept assumes that you build an enterprise-wide data warehouse by storing data from various sources in its original format. You do not transform the data when loading it, and you no longer try to build a consistent data store.

Once data is loaded, it is immediately available for analysis. Users decide which analysis they want to perform and which data they want to access, and they are not limited by a predefined data model.

Of course, data manipulation and analysis become more complex, because there is no metadata and no guarantee of data consistency. There are many different data formats, both structured and unstructured, and working with them requires users to have advanced data processing skills.
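To make the point about mixed formats concrete, here is a minimal sketch in Python of what "structure applied at read time" can look like. The file names, their contents, and the helper functions are all made up for illustration; a real data lake would hold files in a distributed store rather than an in-memory dictionary.

```python
import csv
import io
import json

# Hypothetical raw files kept in their original formats.
# Names and contents are invented for this sketch.
RAW_STORE = {
    "orders.csv": "order_id,amount\n1,10.50\n2,4.25\n",
    "clicks.jsonl": '{"user": "a", "amount": "3.00"}\n{"user": "b"}\n',
}


def read_records(name, raw_text):
    """Parse a raw file into dicts at read time, chosen by file extension."""
    if name.endswith(".csv"):
        return list(csv.DictReader(io.StringIO(raw_text)))
    if name.endswith(".jsonl"):
        return [json.loads(line) for line in raw_text.splitlines() if line.strip()]
    raise ValueError(f"no reader for {name}")


def total_amount(store):
    """One ad hoc analysis: sum the 'amount' field wherever it is present."""
    total = 0.0
    for name, raw_text in store.items():
        for record in read_records(name, raw_text):
            value = record.get("amount")
            # Nothing enforced this field at load time, so it may be missing.
            if value is not None:
                total += float(value)
    return total


print(total_amount(RAW_STORE))  # prints 17.75
```

Note that every analysis has to deal with parsing, missing fields, and type conversion itself, which is the extra complexity that falls on the user.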

But what drives the data lake concept is the agility and accessibility of data analysis.

As soon as data is loaded, users can start their analysis and choose whatever approaches and algorithms they need. They do not have to wait for all the ETL processes that transform data into a consistent format to complete.

Still, data lakes involve significant risks. You cannot guarantee data quality or data governance, and there are issues with security and access control, all of which are very important for enterprises.

For now, the data lake is just a concept. It is unlikely to be implemented as is, but it can influence data modeling and data warehousing principles, especially for data warehouses implemented on Hadoop.