Integrating Azure Data Lake Store with Apache Spark

Spark is a general-purpose processing engine for big data that runs on top of the Hadoop ecosystem. Because it processes data in memory, Spark can be 10x-100x faster than Hadoop MapReduce. Spark applications can run integrated with Hadoop or on their own.

When Spark applications are integrated with Hadoop, they can use HDFS as their data store and YARN as their cluster manager.

Spark can also be integrated with Microsoft Azure Data Lake Store, so that your Spark applications can use your Azure Data Lake Store account as their data store.

After reading this blog, you will have a clear understanding of how to integrate an Apache Spark cluster with Azure Data Lake Store and how to use the store as the data store for your Spark applications.

We hope you have followed our previous blogs in the Azure Data Lake series and have already integrated Hadoop with Azure Data Lake Store. From that article, you should have the following configuration values at hand:

Application ID — Client ID

OAuth 2.0 Token Endpoint — OAuth 2.0 Refresh URL

Key value — OAuth 2.0 Credential or Client secret
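These three values map onto the ADLS OAuth properties in Hadoop's `core-site.xml`. As a sketch, assuming the standard `hadoop-azure-datalake` connector (the placeholder values below are yours to fill in from the list above):

```xml
<property>
  <name>dfs.adls.oauth2.access.token.provider.type</name>
  <value>ClientCredential</value>
</property>
<property>
  <name>dfs.adls.oauth2.client.id</name>
  <value>YOUR_APPLICATION_ID</value>
</property>
<property>
  <name>dfs.adls.oauth2.credential</name>
  <value>YOUR_CLIENT_SECRET</value>
</property>
<property>
  <name>dfs.adls.oauth2.refresh.url</name>
  <value>YOUR_OAUTH2_TOKEN_ENDPOINT</value>
</property>
```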

Now we will perform YouTube data analysis using Spark in Azure. First, we put our YouTube data into Azure Data Lake Store using Hadoop, as shown below.
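The copy step above can be sketched with the Hadoop filesystem shell, assuming Hadoop is already configured for ADLS as in our previous blog. The local path and the account name are placeholders:

```shell
# Copy the local YouTube dataset into Azure Data Lake Store via the adl:// scheme
hadoop fs -put /home/user/youtubedata.txt adl://<your-account>.azuredatalakestore.net/

# Verify that the file has landed in the store
hadoop fs -ls adl://<your-account>.azuredatalakestore.net/
```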

You can see in the above screenshot that we have successfully copied the YouTube data from our local system to Azure Data Lake Store using Hadoop. Let us confirm the same using the Azure web UI.

In the below screenshot, you can see that we have the youtubedata.txt file in our Azure Data Lake Store account.

Let us now start analysing the data using Spark. To integrate Spark with Azure Data Lake Store, you need to do the following.

Download the Azure dependency JARs from here and copy them into $SPARK_HOME.
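With the JARs in place and the OAuth properties configured, the analysis itself can be run from `spark-shell`. The following is a minimal sketch; the account name is a placeholder, and the column layout of youtubedata.txt (tab-separated, with the video ID in column 0 and the rating in column 6) is an assumption for illustration:

```scala
// Read the YouTube data directly from Azure Data Lake Store
val input = sc.textFile("adl://<your-account>.azuredatalakestore.net/youtubedata.txt")

// Assumed schema: tab-separated, column 0 = video ID, column 6 = rating
val ratings = input.map(_.split("\t"))
                   .filter(cols => cols.length > 6)
                   .map(cols => (cols(0), cols(6).toDouble))

// Pick the 10 highest-rated videos and write the result back to the store
val top10 = ratings.sortBy(-_._2).take(10)
sc.parallelize(top10)
  .saveAsTextFile("adl://<your-account>.azuredatalakestore.net/top10_rated_videos")
```

Note that `saveAsTextFile` writes a directory containing one part file per partition, which is what we will look for next.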

Let’s check for the top10_rated_videos output in the Azure Data Lake Store.
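You can verify the output either in the Azure web UI or from the command line, again assuming a placeholder account name:

```shell
# List the output directory created by Spark
hadoop fs -ls adl://<your-account>.azuredatalakestore.net/top10_rated_videos

# Print the contents of the first part file
hadoop fs -cat adl://<your-account>.azuredatalakestore.net/top10_rated_videos/part-00000
```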

Yes! The files were created successfully in the Azure Data Lake Store, as you can see in the below screenshot.

If you click on the part file, you can see the output, as shown in the below screenshot.

So, we have successfully loaded data from Azure Data Lake Store, processed it using Spark, and stored the results back into Azure Data Lake Store. In doing so, we have integrated Azure Data Lake Store with Spark and used the store as Spark's data store.

We hope this blog helped you understand how to integrate Spark with your Azure Data Lake Store. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.