Category: Azure Data Factory

I’m pleased to say that I am speaking at IT/Dev Connections 2017. This year the conference will be held in San Francisco October 23-26. I had a great experience speaking at IT/Dev Connections in 2015, so I am excited to return again this year.

This conference is special to me because of its focus on providing great content for developers and IT pros – the conference website describes it as the “anti-keynote” conference with no forced marketing content.

I also enjoy it because it is more than just SQL Server/Data Platform (they have tracks for Cloud & Data Center, Enterprise Collaboration, Development & Dev Ops, and Enterprise Mobility and Security), and it’s nice to get out of my comfort zone a bit.

If you have been wanting to get into Azure Data Factory (ADF) development, join me for this demo-filled overview. In this session, we’ll go over the basic anatomy of an ADF solution. You’ll learn what ADF is and isn’t as we walk through a solution to pull data from an on-premises SQL Server database to a blob storage and then populate and Azure SQL Data Warehouse. You’ll learn tips for creating ADF solutions in Visual Studio, and I’ll show you how to make ADF development less tedious with a free Visual Studio Add-in called Biml Express. You’ll leave with a basic understanding of ADF and a list of tools and skills you’ll want to acquire as you begin your ADF development.

Improve Data Warehouse ETL Delivery with a Patterns-Based Approach

What if I told you that 90% of your data integration development in SQL Server could be automated? In 5 years, you will be “old fashioned” if you are hand coding SSIS packages. Developers with different skill levels and design preferences create databases and SSIS packages however they see fit to get the job done. Documentation is frequently omitted. Maintenance and small enhancements consume too much development time while manual errors and inconsistencies slip through the testing and release process. You can use tools and frameworks to rearrange the development process and alleviate these common problems. The implementation and automation of design patterns leads to improved efficiency and communication. Join me in this session to learn how to use Business Intelligence Markup Language (Biml) and Excel to facilitate metadata-driven SSIS development. I’ll use database schema information plus Excel inputs to implement a small data mart from staging through the dimensional model.

On a recent project I used Azure Data Factory (ADF) to retrieve data from an on premises SQL Server 2014 instance and land them in Azure Data Lake Store (ADLS) as ORC files. This required the use of the Data Management Gateway (DMG). Setup was quick and easy in our development environment. We installed the DMG for development on a separate server in the client’s network, where we also installed SQL Server Management Studio (SSMS) for query development and data validation. We set up resource groups in Azure for development and production, and made sure the settings for development and production were the same. Then we set up a separate server for the production DMG.

Deployment and execution went well in the dev environment. Testing was completed, so we deployed to prod. Deployment went fine, but the pipelines failed execution and returned the following error on the output data sets.

Java Runtime Environment is not found.

The Java Runtime Environment (JRE) is not required for the DMG to run successfully, but it is needed for the creation of ORC files. The only problem with this error message was that we did indeed have the JRE installed on the server.

After reinstalling the JRE and the DMG and getting the same error, I consulted the troubleshooting guide. After finding nothing relevant there, I asked some colleagues for suggestions.

I double-checked that I had the same version of the DMG that worked in dev and that I had the most current version of the JRE.

I double-checked that the DMG and JRE matched bit-wise (32-bit vs. 64-bit). Both were 64-bit in my case.

I checked that JAVA_HOME was set correctly in the environment variables.

When none of those things worked, I logged a support ticket with Microsoft. They had me do the following:

Check the registry key – HKEY_LOCAL_MACHINE\Software\JavaSoft\Java Runtime Environment should have a Current Version entry that shows the current JRE version.

Check that the subkey in the folder labeled with the version has a JAVAHOME entry with the correct path (something like C:\Program Files\Java\jre1.8.0_74).

After review, we realized that we had installed SSMS on the dev DMG server but not on the prod DMG server. SSMS would have required the installation of the C++ redistributable package, which is why we didn’t encounter this error in dev.

I will confess that I don’t understand exactly why missing C++ libraries manifest themselves in an error claiming a missing Java Runtime Environment. If you have a good explanation, please leave it in the comments and I’ll update this and give you credit.

I hope that someone else who runs into this issue will find this blog post and avoid days of troubleshooting and confusion.

I’ve spent the last couple of months working on a project that includes Azure Data Factory and Azure Data Warehouse. ADF has some nice capabilities for file management that never made it into SSIS such as zip/unzip files and copy from/to SFTP. But it also has some gaps I had to work around. My project involved copying data from on-premises SQL Server to an ORC file in a data lake staging area for ingestion into an Azure SQL Data Warehouse through Polybase. Then I had planned to move that file to the a raw area of the data lake for archiving.

In other words, as sung below by a great lemur, I like to move it.

But at this time ADF doesn’t support that. You can copy a file with a copy activity, but you cannot actually move (i.e., copy and delete).

Luckily, we had a workaround for our situation. If you tell ADF to copy data to a file that already exists in the specified location in the data lake, it will overwrite the existing file. We made sure the file name is always the same for each table in the staging area so there is always only one file per table.

I landed the data in Raw first so that we would not have to pull from SQL again if we needed to re-run a slice. Data latency wasn’t a huge issue for this client – we had some pipelines that ran hourly and some that ran daily. The extra seconds it took to land the file in Raw was not a concern.

For now, if you do need to actually move or delete, you can use a custom C# activity to delete files. I chose not to do this because I didn’t want to add another technology for the client to learn/manage while adopting Azure. This may be the way to go for other projects.

If you think moving (copying and deleting) files should be a first class citizen in Azure Data Factory, please vote for the idea and spread the word for others to vote.

I support 'Move Activity' – An activity that copies and then deletes in Azure Data Factory. What do you think? https://t.co/fyBE7tyw8w

I showed in my previous post how we generated the datasets for our Azure Data Factory pipelines. In this post, I’ll show the BimlScript for our pipelines. Pipelines define the activities, identify the input and output datasets for those activities, and set an execution schedule. We were creating several pipelines with copy activities to copy data to Azure Data Lake Store.

We generated one pipeline per schedule and load type:

Hourly – Full

Hourly – Incremental

Daily – Full

Daily – Incremental

We also generated some one-time load pipelines for DR/new environment setup.

The first code file below is the template for the pipeline. You can see code nuggets for the data we receive from the generator file and for conditional logic we implemented. The result is one copy activity per source table within the appropriate pipeline.

In the second code file below, lines 104 to 119 are generating the pipelines. We read in the necessary data from the Excel file:

Schema name

Table name

Columns list

Incremental predicate

Sidenote: We wrote a quick T-SQL statement (not shown) to generate the columns list. This could have been done in our BimlScript, but it was something we changed after the fact to accommodate the limitations of Polybase (Dear Microsoft: Please fix). SQL was quicker and easier for us, but if I were to do this again I would add that into our BimlScript. We needed to replace new lines and double quotes in our data before we could read it in from the data lake. You can get around this issue by using .ORC files rather than text delimited files. But the ORC files aren’t human readable, and we felt that was important for adoption of the data lake with the client on this project. They were already jumping in with several new technologies and we didn’t want to add anything else to the stack. So our select statements list out fields and replace the unwanted characters in the string fields.

Our Excel file looks like this.

Columns B, C, L, and M are populated by Excel formulas. This is the file that is read in by the BimlScript in the code below.

In our generator file (which is the same file that was used to generate the datasets), we use the CallBimlScript function to call the pipeline template file and pass along the required properties (table, schema, frequency, scope, columns list, predicate).

The great thing about Biml is that I can use it as much or as little as I feel is helpful. That T-SQL statement to get column lists could have been Biml, but it didn’t have to be. The client can maintain and enhance these pipelines with or without Biml as they see fit. There is no vendor lock-in here. Just as with Biml-generated SSIS projects, there is no difference between a hand-written ADF solution and a Biml-generated ADF solution, other than the Biml-generated solution is probably more consistent.

And have I mentioned the time savings? There is a reason why Varigence gives out shirts that say “It’s Monday and I’m done for the week.”

Apologies for the overly acronym-laden title as I was trying to keep it concise but descriptive. And we all know that adding technologies to your repertoire means adding more acronyms.

My coworker Levi and I are working on a project where we copy data from an on-premises SQL Server 2014 database and land it in Azure Data Lake Store. Then we use Polybase to get the data into Azure SQL Data Warehouse and build a dimensional model. I’ve done a couple of small projects before with Azure Data Factory, but nothing as large as this one. We had 173 tables that we needed to copy to ADLS. Then we needed to set up incremental loads for 95 of those tables going forward.

My Azure Data Factory is made up of the following components:

Gateway – Allows ADF to retrieve data from an on premises data source

Linked Services – define the connection string and other connection properties for each source and destination

Datasets – Define a pointer to the data you want to process, sometimes defining the schema of the input and output data

Pipelines – combine the data sets and activities and define an execution schedule

Each of these objects is defined in a JSON file. Defining data sets and copy activities in JSON gets very tedious, especially when you need to do this for 100+ tables. Tedium usually indicates a repeatable pattern. If there is a repeatable pattern you can probably automate it. The gateway and linked services are one-time setup activities that weren’t worth automating for this project, but the datasets and pipelines definitely were.

In order to automate the generation of datasets and pipelines, we need a little help with some metadata. We had the client help us fill out an Excel spreadsheet that listed each table in our source database and the following characteristics relevant to the load to Azure:

Frequency (daily or hourly)

Changes Only (incremental or full load)

Changed Time Column (datetime column used for incremental loads)

That list plus the metadata we retrieved from SQL server for each table (column names and data types) were all we needed to automate the creation of the ADF datasets and pipelines with BimlScript.

This post will show how we built the data sets. The following post will show the pipelines with the copy activities.

First we need to generate the input datasets coming from SQL Server. We added some properties at the top and embedded some code nuggets to handle the values that are specific to each table.

Next we need the output datasets for Azure Data Lake Store. We use the same three properties in generating each dataset- schema, table, frequency- and we add one more for scope.

Now we just need another BimlScript file that calls these two files. We broke our pipelines up into daily versus hourly and incremental versus full loads.

We used a helper code file and a separate environments file, which I’m glossing over so we can focus on the Biml for the ADF assets. You can see that we read in the inputs from Excel and write some counts to a log file, just to make sure everything is working as intended. Starting on line 41 is where we generate the datasets. On lines 54 and 55, we use the CallBimlScript function to call the two files above. We end up generating datasets for the tables that are a full load each day and their counterpart datasets for the files we create in ADLS. The datasets for daily incremental loads are generated on lines 69 and 70. Then we do the hourly full loads and hourly incremental loads. I’ll discuss lines 100 – 119 in my next post.

The Results

We were able to write the BimlScript and generate the datasets and pipelines in about 35 hours. A previous ADF project without automation took about 3 hours per source table. If we had gone that route, we could have been looking at 350 – 500 hours to complete this part of the project. Visual Studio with Biml Express took about 5 minutes to generate everything. Deploying to Azure took about an hour. We are now looking into ARM templates for future deployments.

Stay tuned for part 2 where I show how we generated the ADF pipelines.