My thoughts on Azure Data Lake

I have been very curious about Azure Data Lake, so I started experimenting with it. In this post I share my thoughts about it.

What is a data lake?

A data lake is a concept that stands in contrast to the data mart. Where a data mart is a silo with structured and cleansed data, a data lake is a huge collection of unstructured, raw data. You could also say that a data mart is a bottle of clean water, whereas the data lake is the lake itself, with its not-so-clean water. 🙂

Now why would you want a data lake? Imagine you are generating huge log files, for example in airplanes: sensors that track air pressure, temperature and so on. If something goes wrong, you definitely want to be alerted. That is event-driven: "if A and B happen, alert the pilot, or do C", and there are tools for dealing with that kind of streaming data. But what if the plane landed safely? What do you do with all that data? You do not need it anymore, right?

Well, some people would say: "Wrong". You might need that data later, for reasons you do not know today. Google, Microsoft and Facebook are all hoarding data, even data they are not sure they will ever need. This data could later prove to be valuable for AI, machine learning or something else entirely.

In the past you would only collect data if you had a business case:

find a business case

collect the right data

model the data

clean the data

analyze the data

Nowadays, data is coming in so fast, you want to store it cheaply and in its raw form, just to be sure you don't miss anything valuable:

collect all the data

a business case comes along

apply logic/cleansing later (schema on read)

analyze the data

So a data lake is just for cheap storage?

That is not all of it. The cool thing about Azure Data Lake is that you can dump your huge, petabyte-sized log files into the cloud and analyze them with U-SQL, a language that merges SQL with C#. That sounds pretty powerful, and indeed it is. Coming from SQL Server, I quickly found my way around the data lake. Upload any txt or csv file and you can start querying it. Since flat files have no inherent structure, you have to apply a schema while querying the data. This is called "schema on read", and it is great for ad hoc analysis. What if you want to find all occurrences in huge chat logs where a person said a certain word? Data Lake is a great tool for analyzing those huge sets of data.
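To give an idea of what schema on read looks like, here is a minimal U-SQL sketch. The file path, column names and schema are hypothetical, just assumptions for illustration: the EXTRACT statement imposes a schema on the flat file at query time, and the WHERE clause uses a plain C# expression.

```
// Hypothetical input: a tab-separated chat log with no schema of its own.
// The schema below is applied at read time ("schema on read").
@chatlog =
    EXTRACT UserName string,
            Message string
    FROM "/input/chatlog.tsv"
    USING Extractors.Tsv();

// C# expressions are allowed in U-SQL, e.g. string.Contains.
@hits =
    SELECT UserName, Message
    FROM @chatlog
    WHERE Message.Contains("azure");

OUTPUT @hits
TO "/output/azure_mentions.csv"
USING Outputters.Csv();
```

Note that the script does not return results to your screen; it writes them to another file, which brings me to the next point.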

How does it work?

Unlike querying a database in SQL Server Management Studio, you cannot run a query on the fly and see the results directly: you have to write your result to another flat file if you want to see it. This can be a bit cumbersome, because you are writing code and executing it over and over to see whether you get the desired result. As a developer I often work this way on SQL Server databases: write a query, try it out, it doesn't work, adjust the query, execute again, and so on.

The result is another flat file. The idea is that with Azure Data Lake you process really big log files (and lots of them) to aggregate a desired result. For example: what if Facebook wants to find its top 10 users who typed the most words across all their chats combined? The input is the entire chat log of Facebook; the end result would be a csv with 10 rows containing those users. Azure Data Lake is ideal for these kinds of workloads.
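A sketch of that top 10 aggregation in U-SQL could look like this (again, the path and column names are made up for the example):

```
// Hypothetical chat log: one row per message.
@chatlog =
    EXTRACT UserName string,
            Message string
    FROM "/input/chatlog.tsv"
    USING Extractors.Tsv();

// Count words per message with a C# expression, then sum per user.
@wordsPerUser =
    SELECT UserName,
           SUM(Message.Split(' ').Length) AS TotalWords
    FROM @chatlog
    GROUP BY UserName;

// In U-SQL, ORDER BY in a SELECT requires a FETCH clause.
@top10 =
    SELECT UserName, TotalWords
    FROM @wordsPerUser
    ORDER BY TotalWords DESC
    FETCH 10 ROWS;

OUTPUT @top10
TO "/output/top10_users.csv"
USING Outputters.Csv(outputHeader: true);
```

The entire chat history goes in, and a tiny 10-row csv comes out: exactly the kind of big-in, small-out aggregation the data lake is built for.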

Where does the Data lake fit into the cloud architecture?

Data lake as an extra layer

In this scenario, the data lake is a layer just for analyzing and preparing unstructured, raw, big data. The large log files get aggregated into smaller results that are fed to the data warehouse, which is a SQL Server in Azure (or an Azure SQL DW). For example, the data lake extracts just the transactions that are suspicious and passes those on to the data warehouse. The structured data sources are fed directly to the SQL data warehouse (preferably via staging tables or a staging database). The advantage is that you use every tool to its own strength: let Data Lake handle the big files, and let SQL Server handle the structured data.

Data lake as the staging area for all

Another option is to load all the data into the data lake, structured or not. Now you have one repository where you can easily build your historical staging area. All the data is in one place, available to a powerful query engine that can handle big workloads. The advantage here is that you can compute and analyze over all the data at the lowest granularity.