
This week, Microsoft announced the public preview of a new and expanded Azure Data Lake making big data processing and analytics simpler and more accessible. The expanded Azure Data Lake includes Azure Data Lake Store, Azure Data Lake Analytics and Azure HDInsight.

The Azure Data Lake Store provides a single repository where you can easily capture data of any size, type and speed without forcing changes to your application as data scales. Azure Data Lake Analytics is a new service built on Apache YARN and includes U-SQL, a language that unifies the benefits of SQL with the expressive power of user code. The service dynamically scales and allows you to do analytics on any kind of data with enterprise-grade security through Azure Active Directory so you can focus on your business goals.

In the first week of October, we announced you will be able to create and operationalize big data pipelines (aka workflows) using Azure Data Lake and Azure Data Factory in addition to using our existing support for Azure HDInsight. Today, we are announcing the public preview of these newly added capabilities. The Azure Data Lake and Azure Data Factory integration allows you to do the following:

Easily move data to Azure Data Lake Store

As of today, Azure Data Factory supports moving data from the following sources to Azure Data Lake Store:

Azure Blob

Azure SQL Database

Azure Table

On-premises SQL Server Database

Azure DocumentDB

Azure SQL DW

On-premises File System

On-premises Oracle Database

On-premises MySQL Database

On-premises DB2 Database

On-premises Teradata Database

On-premises Sybase Database

On-premises PostgreSQL Database

On-premises HDFS

Generic OData (Coming soon!)

Generic ODBC (Coming soon!)

You can also move data from Azure Data Lake Store to a number of sinks such as Azure Blob, Azure SQL Database, on-premises file system, etc. Follow the steps below to move data from Azure Blob Storage to Azure Data Lake Store.

Note: You need to have a valid Azure Data Lake Store account before following the steps below. Click here to create a new account if you don’t have one.

Create an Azure Data Factory

Log in to the Azure portal and navigate to Azure Data Factory. Enter a name and select the subscription, resource group name, and region name. Let’s name the data factory AzureDataLakeStoreAnalyticsSample.

Once created, navigate to your data factory and click Author and deploy.

Create ADF Linked Services

Create Azure Storage Linked Service: This is the Azure Blob Storage account (the source) from which you want to move the data.

Note: Before hitting Deploy, delete any rows marked Optional in the JSON for which you are not specifying values.
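As a rough illustration of the note above, the Python sketch below builds two linked service definitions and drops the optional rows that were left unset. The property names (type AzureStorage with connectionString; type AzureDataLakeStore with dataLakeStoreUri, authorization and sessionId) reflect our understanding of the ADF JSON format and should be checked against the templates in the editor. The helper names, the linked service names and all placeholder values are assumptions made for this example.

```python
import json

# Rows assumed to be marked Optional in the Azure Data Lake Store template.
OPTIONAL_KEYS = {"accountName", "subscriptionId", "resourceGroupName"}


def drop_unset_optional(type_properties):
    """Return a copy without optional keys that were left empty."""
    return {
        key: value
        for key, value in type_properties.items()
        if value or key not in OPTIONAL_KEYS
    }


def linked_service(name, service_type, type_properties):
    """Wrap type properties in the assumed linked service envelope."""
    return {
        "name": name,
        "properties": {
            "type": service_type,
            "typeProperties": drop_unset_optional(type_properties),
        },
    }


storage = linked_service(
    "StorageLinkedService",  # hypothetical name
    "AzureStorage",
    {"connectionString": "<storage connection string>"},
)

data_lake_store = linked_service(
    "AzureDataLakeStoreLinkedService",  # hypothetical name
    "AzureDataLakeStore",
    {
        "dataLakeStoreUri": "<data lake store URI>",
        "authorization": "<authorization value>",
        "sessionId": "<session id>",
        "subscriptionId": "",       # optional and unset, so it is dropped
        "resourceGroupName": None,  # optional and unset, so it is dropped
    },
)

print(json.dumps(data_lake_store, indent=2))
```

Required rows are always kept, even when empty, so a missing value still surfaces as a deployment error rather than being silently removed.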

Create ADF DataSets

Create Azure Blob Storage source dataset:

Click New Dataset -> Azure Blob storage.

This will bring in the template for the Azure Blob storage dataset, where you can fill in any values. Check out the Azure Blob storage dataset below as an example. For simplicity, we are not using the partitioned by clause for time-based partitions and are using a static folder instead. The dataset below specifies that the data being copied (SearchLog.tsv) is in the rawdatasample/data/ folder in Azure Storage.
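A source dataset along these lines could be sketched as follows. The folder and file name come from the text above and the daily availability matches the frequency mentioned later in this post. The dataset name RawBlobDemoTable, the linked service name, the tab delimiter setting and the exact property names are assumptions based on our reading of the ADF dataset JSON format.

```python
import json


def blob_dataset(name, linked_service_name, folder_path, file_name):
    """Build an assumed Azure Blob dataset definition with a static folder."""
    return {
        "name": name,
        "properties": {
            "type": "AzureBlob",
            "linkedServiceName": linked_service_name,
            "typeProperties": {
                "folderPath": folder_path,  # static folder, no time partitions
                "fileName": file_name,
                "format": {"type": "TextFormat", "columnDelimiter": "\t"},
            },
            # The blob is produced outside the data factory.
            "external": True,
            "availability": {"frequency": "Day", "interval": 1},
        },
    }


source_dataset = blob_dataset(
    "RawBlobDemoTable",      # hypothetical dataset name
    "StorageLinkedService",  # hypothetical linked service name
    "rawdatasample/data/",
    "SearchLog.tsv",
)

print(json.dumps(source_dataset, indent=2))
```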

Selecting the Azure Data Lake Store dataset type brings in the corresponding template, where you can fill in any values. For an example, look at the Azure Data Lake Store dataset below. For simplicity, we are not using the partitioned by clause for time-based partitions and are using a static folder instead. The dataset below specifies that the data is being copied to the datalake/input/ folder in the data lake.
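The destination dataset could be sketched in the same spirit. The dataset name DataLakeTable and the datalake/input/ folder appear in this post. The type name AzureDataLakeStore, the linked service name and the surrounding property names are assumptions about the ADF JSON format.

```python
import json


def data_lake_store_dataset(name, linked_service_name, folder_path):
    """Build an assumed Azure Data Lake Store dataset with a static folder."""
    return {
        "name": name,
        "properties": {
            "type": "AzureDataLakeStore",
            "linkedServiceName": linked_service_name,
            "typeProperties": {"folderPath": folder_path},
            # Daily frequency, as described for the datasets in this post.
            "availability": {"frequency": "Day", "interval": 1},
        },
    }


sink_dataset = data_lake_store_dataset(
    "DataLakeTable",
    "AzureDataLakeStoreLinkedService",  # hypothetical linked service name
    "datalake/input/",
)

print(json.dumps(sink_dataset, indent=2))
```

Unlike the blob source sketch, this one sets no external flag, on the assumption that the copy activity itself produces the dataset.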

Monitor ADF Pipelines

The ADF copy pipeline created above will start running because the datasets have a daily frequency and the start and end in the pipeline definition are set to 08/08/2015. The pipeline will therefore run only for that day and perform the copy operation once. Click here to learn more about scheduling ADF pipelines.
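To make the scheduling described above concrete, here is a minimal sketch of what such a copy pipeline definition might look like. The pipeline name and the DataLakeTable output come from this post, and both start and end fall on 08/08/2015 as stated. The activity name, the source dataset name, the times of day, and the BlobSource and AzureDataLakeStoreSink type names are assumptions that should be verified against the connector documentation.

```python
import json
from datetime import datetime, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def copy_pipeline(name, source_dataset, sink_dataset, start, end):
    """Build an assumed pipeline definition with a single copy activity."""
    return {
        "name": name,
        "properties": {
            "activities": [
                {
                    "name": "CopyBlobToDataLake",  # hypothetical activity name
                    "type": "Copy",
                    "inputs": [{"name": source_dataset}],
                    "outputs": [{"name": sink_dataset}],
                    "typeProperties": {
                        "source": {"type": "BlobSource"},
                        "sink": {"type": "AzureDataLakeStoreSink"},
                    },
                }
            ],
            # Active period of the pipeline, written as UTC timestamps.
            "start": start.strftime(TIMESTAMP_FORMAT),
            "end": end.strftime(TIMESTAMP_FORMAT),
        },
    }


pipeline = copy_pipeline(
    "EgressBlobToDataLakePipeline",
    "RawBlobDemoTable",  # hypothetical source dataset name
    "DataLakeTable",
    # Both timestamps fall on 08/08/2015; the times of day are assumed.
    datetime(2015, 8, 8, 0, 0, tzinfo=timezone.utc),
    datetime(2015, 8, 8, 1, 0, tzinfo=timezone.utc),
)

print(json.dumps(pipeline, indent=2))
```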

Navigate to the ADF Diagram view to see the operational lineage of your data factory. You will be able to see the Azure Blob Storage and Azure Data Lake Store datasets along with the pipeline for moving the data from Blob Storage to Azure Data Lake Store.

Click on the DataLakeTable in your Diagram view to see the corresponding activity executions and their status.

You can see that the copy activity in EgressBlobToDataLakePipeline in ADF (see screenshot above) has executed successfully and copied 3.08 KB of data from Azure Blob Storage to Azure Data Lake Store. You can also log in to the Microsoft Azure portal and use the Azure Data Lake Data Explorer to visualize the data copied to Azure Data Lake Store.

Click here to learn more about Azure Data Factory data movement activities. You can find detailed documentation about using AzureDataLakeStore connector in ADF here.

Create E2E big data ADF pipelines that run U-SQL scripts as a processing step on Azure Data Lake Analytics service

A very common use case across multiple industry verticals (retail, finance, gaming) is log processing.

Note: You need to have a valid Azure Data Lake Analytics account before following the steps below. Click here to create a new account if you don’t have one.

In this scenario, you will create an ADF pipeline that consumes the logs copied to the Azure Data Lake Store account in the previous step and processes them by running a U-SQL script on Azure Data Lake Analytics as one of the processing steps. The U-SQL script computes events by region, which can then be consumed by downstream processes.

We will reuse the data factory (AzureDataLakeStoreAnalyticsSample) created in the scenario above to copy data from Azure Blob Storage to Azure Data Lake Store.

Create ADF Linked Services

Create Azure Data Lake Analytics Linked Service. This is the Azure Data Lake Analytics account which will run the U-SQL scripts to do log processing.

Note: Before hitting Deploy, delete any rows marked Optional in the JSON for which you are not specifying values.
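For the Azure Data Lake Analytics linked service, one way to follow the note above is to emit the optional rows only when a value is actually supplied. The sketch below assumes a type name of AzureDataLakeAnalytics with required accountName, authorization and sessionId rows and optional subscriptionId and resourceGroupName rows. These names, the function name and the placeholder values are assumptions, so compare them with the template shown in the editor.

```python
import json


def data_lake_analytics_linked_service(
    name,
    account_name,
    authorization,
    session_id,
    subscription_id=None,
    resource_group_name=None,
):
    """Build an assumed linked service, skipping optional rows left unset."""
    type_properties = {
        "accountName": account_name,
        "authorization": authorization,
        "sessionId": session_id,
    }
    if subscription_id:
        type_properties["subscriptionId"] = subscription_id
    if resource_group_name:
        type_properties["resourceGroupName"] = resource_group_name
    return {
        "name": name,
        "properties": {
            "type": "AzureDataLakeAnalytics",
            "typeProperties": type_properties,
        },
    }


analytics = data_lake_analytics_linked_service(
    "AzureDataLakeAnalyticsLinkedService",  # hypothetical name
    "<analytics account name>",
    "<authorization value>",
    "<session id>",
)

print(json.dumps(analytics, indent=2))
```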

Create ADF DataSets

Create Azure Data Lake Store source dataset:

Note: If you are doing this scenario as a continuation of the copy scenario above, you have already created this dataset.

Click New Dataset -> Azure Data Lake Store.

This will bring in the template for the Azure Data Lake Store dataset. You can fill in any values.

For example, have a look at the Azure Data Lake Store dataset below. For simplicity, we are not using the partitioned by clause for time-based partitions and are using a static folder instead. The dataset below specifies that the data is being copied to the datalake/input/ folder in the data lake.

For example: See the EventsByEnGbRegionTable dataset definition below. The data corresponding to this dataset will be produced after running the AzureDataLakeAnalytics U-SQL script to get all events for ‘en-gb’ locale and date < “2012/02/19”.
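To clarify what the processing step computes, the sketch below reproduces the same filter in plain Python: keep events for the ‘en-gb’ region whose date is before 2012/02/19. It is not the U-SQL script itself. The three-column layout (start time, region, duration) and the ISO timestamps are a simplified, hypothetical stand-in for the real log schema, and the function name is made up for this example.

```python
import csv
import io
from datetime import datetime

CUTOFF = datetime(2012, 2, 19)


def events_for_region(tsv_text, region="en-gb", cutoff=CUTOFF):
    """Return rows for one region whose start time is before the cutoff."""
    selected = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        start = datetime.fromisoformat(row[0])
        if row[1] == region and start < cutoff:
            selected.append(row)
    return selected


# Hypothetical sample rows: start time, region, duration.
SAMPLE = (
    "2012-02-15T10:00:00\ten-gb\t73\n"
    "2012-02-20T09:30:00\ten-gb\t614\n"
    "2012-02-16T12:00:00\ten-us\t30\n"
)

for event in events_for_region(SAMPLE):
    print("\t".join(event))
```

With the sample rows, only the first one is kept: the second is in the right region but too late, and the third is early enough but in a different region.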

The values for @in and @out parameters in the above U-SQL script are passed dynamically by ADF using the Parameters section. See the Parameters section above in the pipeline definition.

You can also specify other properties, such as degreeOfParallelism and priority, in your pipeline definition for the jobs that run on the Azure Data Lake Analytics service.
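A U-SQL activity that passes the in and out parameters and sets the optional job properties might be sketched like this. The parameter names, degreeOfParallelism, priority, the dataset names and the input and output paths come from this post. The activity type name DataLakeAnalyticsU-SQL, the scriptPath and scriptLinkedService rows, the script location, the activity name and the linked service names are assumptions to check against the activity documentation.

```python
import json


def usql_activity(
    name,
    input_dataset,
    output_dataset,
    script_path,
    parameters,
    degree_of_parallelism=None,
    priority=None,
):
    """Build an assumed U-SQL activity definition for a pipeline."""
    type_properties = {
        "scriptPath": script_path,
        "scriptLinkedService": "StorageLinkedService",  # hypothetical name
        # Each key is assumed to map to a variable in the script,
        # for example "in" to @in and "out" to @out.
        "parameters": dict(parameters),
    }
    if degree_of_parallelism is not None:
        type_properties["degreeOfParallelism"] = degree_of_parallelism
    if priority is not None:
        type_properties["priority"] = priority
    return {
        "name": name,
        "type": "DataLakeAnalyticsU-SQL",
        "linkedServiceName": "AzureDataLakeAnalyticsLinkedService",  # hypothetical
        "inputs": [{"name": input_dataset}],
        "outputs": [{"name": output_dataset}],
        "typeProperties": type_properties,
    }


activity = usql_activity(
    "EventsByRegion",  # hypothetical activity name
    "DataLakeTable",
    "EventsByEnGbRegionTable",
    "scripts/SearchLogProcessing.txt",  # hypothetical script location
    {"in": "/datalake/input/SearchLog.tsv", "out": "/datalake/output/Result.tsv"},
    degree_of_parallelism=3,  # illustrative value
    priority=100,             # illustrative value
)

print(json.dumps(activity, indent=2))
```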

Monitor ADF Pipelines

The ADF pipeline above will start running because the datasets have a daily frequency and the start and end in the pipeline definition are set to 08/08/2015. The pipeline will therefore run only for that day and run the U-SQL script once. Click here to learn more about scheduling ADF pipelines.

Navigate to the ADF Diagram view to see the operational lineage of your data factory. You will see two pipelines and their corresponding datasets: EgressBlobToDataLakePipeline (copy data from Azure Blob Storage to Azure Data Lake Store) and ComputeEventsByEnGbRegionPipeline (get all events for ‘en-gb’ locale and date < “2012/02/19”).

Click on the EventsByEnGbRegionTable in your Diagram view to see the corresponding activity executions and their status.

You can see that the U-SQL activity in ComputeEventsByEnGbRegionPipeline in ADF has run successfully and created a Result.tsv file (/datalake/output/Result.tsv) in your AzureDataLakeStore account. Result.tsv contains all events for ‘en-gb’ locale and date < “2012/02/19”. You can log in to the Microsoft Azure portal and use the Azure Data Lake Data Explorer to visualize the Result.tsv file generated by the processing step above in Azure Data Lake Store.

You can find detailed documentation about AzureDataLakeAnalyticsU-SQL activity in Azure Data Factory here.

To summarize, by following the steps above, you were able to build E2E big data pipelines using Azure Data Factory that allowed you to move data to Azure Data Lake Store. In addition, you were able to run a U-SQL script on Azure Data Lake Analytics as one of the processing steps and dynamically scale according to your needs.

We will continue to invest in solutions allowing us to operationalize big data processing and analytics workflows. Click here to learn more about the Microsoft Azure Data Lake from the Microsoft Cloud Platform team. If you want to try out Azure Data Factory, visit us here and get started by building pipelines easily and quickly using data factory. If you have any feature requests or want to provide feedback for data factory, please visit the Azure Data Factory Forum.