Introduction

"Azure Automation delivers a cloud-based automation and configuration service that provides consistent management across your Azure and non-Azure environments. It consists of process automation, update management, and configuration features. Azure Automation provides complete control during deployment, operations, and decommissioning of workloads and resources."

Putting this marketing gibberish aside, I will point out some important issues…

Know your Automation

It has something called "a feature" – Fair Share – which basically prevents you from running scripts for longer than 3 hours.

Well, at least it will pause your script after 3 hours. And if you didn’t implement it as a workflow with some checkpoints – it will RESTART your script from the beginning.

And if you do implement checkpoints, it will resume your script from the last known checkpoint. BUT it will do this only 3 times! So you are not able to implement logic that takes more than 9 hours to process…
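If you go the workflow route, the checkpointing itself is simple. Here is a minimal sketch (the runbook name and the fake work inside the loop are mine, just to show the idea):

workflow Invoke-LongRunningBatch
{
    foreach ($i in 1..100)
    {
        # ...one unit of your real work goes here...
        Start-Sleep -Seconds 1
        Write-Output "Processed item $i"

        # Persist the workflow state. After a Fair Share suspension the
        # runbook resumes from the last checkpoint instead of from scratch.
        Checkpoint-Workflow
    }
}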

The workaround is to connect your own machine (a server or a laptop) as a Hybrid Runbook Worker.

We will use quick create, so select "Create a new runbook", then name it and set the type to "PowerShell".

Use the script below in "Edit" mode, then save it and publish.

PowerShell script

Parameters

It has two parameters:

PipelineName – the name of the pipeline to run

CheckLoopTime – the number of seconds to wait between status checks of the triggered pipeline run

Invoke-AzureRmDataFactoryV2Pipeline is the cmdlet I use to trigger a pipeline. Unfortunately, it is an asynchronous operation, so after triggering we have to periodically check the run state and status.

The script does this in a simple loop, with some wait logic before every iteration. You can parametrize the number of seconds. Every iteration also prints out the last known pipeline status and the timestamp of that check.
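A minimal sketch of such a runbook could look like this – the resource group and data factory names are hardcoded placeholders here, and logging in to Azure (e.g. with a Run As account) is left out:

param
(
    [Parameter(Mandatory = $true)]  [string] $PipelineName,
    [Parameter(Mandatory = $false)] [int]    $CheckLoopTime = 30
)

# Placeholders - replace with your own values
$resourceGroupName = 'my-resource-group'
$dataFactoryName   = 'my-data-factory-v2'

# Trigger the pipeline; the cmdlet returns immediately with a run ID
$runId = Invoke-AzureRmDataFactoryV2Pipeline `
    -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName `
    -PipelineName $PipelineName

# Check the run status every $CheckLoopTime seconds until the run finishes
do
{
    Start-Sleep -Seconds $CheckLoopTime

    $run = Get-AzureRmDataFactoryV2PipelineRun `
        -ResourceGroupName $resourceGroupName `
        -DataFactoryName $dataFactoryName `
        -PipelineRunId $runId

    # Print the last known status with a timestamp
    Write-Output ("{0:yyyy-MM-dd HH:mm:ss} - status: {1}" -f (Get-Date), $run.Status)
}
while ($run.Status -in 'Queued', 'InProgress')

Write-Output "Pipeline finished with status: $($run.Status)"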

This post explains things that are difficult to find even in English. That's why I will break my rule and not write it in my native language! For the Polish version, I invite you to use Google Translate :>

Introduction

Loading data using Azure Data Factory v2 is really simple. Just drop a Copy activity into your pipeline, choose a source and a sink table, configure some properties and that's it – done with just a few clicks!

But what if you have dozens or hundreds of tables to copy? Are you gonna do it for every object?

Fortunately, you do not have to! All you need are dynamic parameters and a few simple tricks.

Also, this gives you the option of creating incremental feeds, so that – on the next run – only newly added data is transferred.

Mappings

Before we start diving into details, let’s demystify some basic ADFv2 mapping principles.

- A Copy activity doesn't need to have column mappings defined at all;
- it can map them dynamically using its own mechanism, which retrieves source and destination (sink) metadata;
- if you use PolyBase, it will map by column order (1st column from the source to the 1st column at the destination, etc.);
- if you do not use PolyBase, it will map columns by their names – but watch out, it's case-sensitive matching!

So all you have to do is keep the same structure and data types in the destination (sink) tables as they are in the source database.

Bear in mind that if your columns differ between source and destination, you will have to provide custom mappings. This tutorial doesn't show how to do it, but it is possible: use the "Get Metadata" activity to retrieve the column specification from the source, then parse it and pass it as a JSON structure into the mapping's dynamic input. You can read about mappings in the official documentation: https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping

String interpolation – the key to success

My entire solution is based on one cool feature called string interpolation. It is part of the built-in expression engine and simply allows you to inject any value from a JSON object, or any expression, directly into a string input, without any concatenation functions or operators. It's fast and easy. Just wrap your expression in @{ ... } – it will always be returned as a string.

Below is a screen from the official documentation that clarifies how this feature works:
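To give you a flavour, here is a toy example of my own (not from the docs). An expression like this in a query input:

SELECT * FROM @{pipeline().parameters.ConfigTable} WHERE id = @{item().id}

is evaluated to a plain string before execution, e.g. SELECT * FROM load.cfg WHERE id = 1.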

So what are we going to do? :>

In my example, I will show you how to transfer data incrementally from Oracle and PostgreSQL tables into Azure SQL Database.

All of this using configuration stored in a table which, in short, keeps information about the Copy activity settings needed to achieve our goal.

Adding new definitions to the config will also automatically enable transfer for them, without any need to modify the Azure Data Factory pipelines.

So you can transfer as many tables as you want, in one pipeline, at once – triggered with one click.

Every process needs a diagram :>

Basically, we will:

1. Get the configuration from our config table inside Azure SQL Database using a Lookup activity, then pass it to Filter activities to split the configs for Oracle and PostgreSQL.

2. In a ForEach activity created for every type of database, build simple logic that retrieves the maximum update date from every table.

3. Prepare dynamic expressions for the SOURCE and SINK properties in the Copy activity. The MAX UPDATEDATE retrieved above and the previous WATERMARK DATE retrieved from the config will set our boundaries in the WHERE clause (see the sketch below this list). Every detail, like the table name or the table columns, will be passed into the query using string interpolation, directly from the JSON expression. The sink destination will also be parametrized.

4. Let Azure Data Factory execute the queries evaluated dynamically from the JSON expressions; it will run them in parallel to speed up the data transfer.

5. Mark every successfully transferred portion of incremental data for a given table as done. We can do this by saving MAX UPDATEDATE in the configuration, so that the next incremental load knows what to take and what to skip. We will use a Stored Procedure activity here.
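To make step 3 concrete, the dynamically built source query could look more or less like this. The config column names (SRC_tab, Cols, WatermarkDate) and the UPDATEDATE column are my assumptions – adjust them to your own config:

SELECT @{item().Cols}
FROM @{item().SRC_tab}
WHERE UPDATEDATE > TO_DATE('@{item().WatermarkDate}', 'YYYY-MM-DD HH24:MI:SS')
  AND UPDATEDATE <= TO_DATE('@{activity('GET MAX ORA').output.firstRow.MAXD}', 'YYYY-MM-DD HH24:MI:SS')

Each @{ ... } fragment is replaced with a plain string before the query is sent to the source, so the database sees an ordinary SELECT.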

This example simplifies the process as much as possible. Remember that in your solution you have to implement logic for every unsuccessful operation. You can achieve that using the "On Failure" control flow with some activities (chosen depending on your needs) and with timeout/retry options set individually for every activity in your pipeline.

As Andy rightly noted in a comment below this post, it is also possible to use "Cols" to implement SQL logic, like functions, aliases, etc. The value from this column is written directly into the query (more precisely – concatenated between the SELECT and FROM clauses), so you can use it according to your needs.
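For example (my own illustration, with made-up column names), a Cols value of:

ID, UPPER(NAME) AS NAME, PRICE * 1.23 AS PRICE_GROSS

ends up in the generated query as:

SELECT ID, UPPER(NAME) AS NAME, PRICE * 1.23 AS PRICE_GROSS FROM your_table WHERE ...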

Creating datasets

Normally, you would create a separate dataset for every table. BUT datasets can be parametrized, so you can create just ONE dataset and reuse it, passing different parameters to get data from multiple tables within the same source database.

Source datasets

Source datasets don’t need any parameters. We will later use built-in query parametrization to pass object names.

Go to your ADF, click the PLUS symbol near the search box on the left and choose "Dataset".

Choose your dataset type – for example, Oracle.

Rename it as you like. We will use the name "ORA".

Set the proper Linked service option, like this for an Oracle database:

And that's it! No need to set anything else. Just repeat these steps for every source database that you have.

In my example, I've created two source datasets: ORA and PG.

As you can see, we also need to create a third dataset. It will work as a source too, BUT also as a parametrizable sink (destination). So creating it is a little different from the others.

Sink dataset

The sink dataset needs one extra parameter, which will store the destination table name.

Create the dataset just like in the previous example and choose your destination type. In my case, it will be Azure SQL Database.

Go to "Parameters" and declare one String parameter called "TableName". Set the value to anything you like. It's just a dummy value – ADF just doesn't like empty parameters, so we have to set a default value.

Now, go to "Connection" and set Table as dynamic content. This will be tricky :) Just click "Select…", don't choose any value, and click somewhere in the empty space instead. The magic option "Add dynamic content" now appears! You have to click it or hit Alt+P.
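In the dynamic content editor, reference the parameter we have just created – presumably:

@dataset().TableName

From now on, whatever value the pipeline passes in TableName becomes the table this dataset points to.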

Parametrizable PIPELINE with dynamic data loading

Ok, our connections are defined. Now it’s time to copy data :>

Creating pipeline

Go to your ADF, click the PLUS symbol near the search box on the left and choose "Pipeline":

Rename it. I will use "LOAD DELTA".

Go to Parameters and create a new String parameter called ConfigTable. Set its value to our configuration table name: load.cfg. This will simply parametrize your configuration source, so that in the future it would be possible to load a completely different set of sources by changing only one parameter :>

In case you missed it, SAVE your work by clicking "Save All" if you're using GIT or "Publish All" if not ;]

Creating Lookup – GET CFG

First, we have to get configuration. We will use Lookup activity to retrieve it from the database.

Bear in mind that the Lookup activity has some limits. Currently, it can return a maximum of 5000 rows and up to 2 MB in size, and its maximum duration before timeout is one hour. Go to the documentation for the latest info and updates.

Drag and drop a Lookup activity into your pipeline.

Rename it. This is important – we will use this name later in our solution. I will use the value "GET CFG".

In "Settings", choose our parametrized Azure SQL Database dataset.

Now, don't bother with TableName being set to the dummy value :> Just set "Use Query" to "Query", click "Add dynamic content" and type:
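With the ConfigTable parameter defined earlier, the query is presumably along these lines:

SELECT * FROM @{pipeline().parameters.ConfigTable}

Also make sure the Lookup's "First row only" option is unchecked – we want all config rows, not just one.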

Now go to "Stored Procedure", select our procedure name and click "Import parameter".

Now we have to pass values for the procedure parameters. And we will also parametrize them. Id should be:

@{item().id}

and NewWatermark has to be:

@{activity('GET MAX ORA').output.firstRow.MAXD}
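For reference, such a procedure might look roughly like this – the procedure name and the watermark column are my guesses; only load.cfg and the two parameters come from the post:

CREATE PROCEDURE load.usp_UpdateWatermark
    @Id INT,
    @NewWatermark DATETIME2
AS
BEGIN
    -- Mark the increment as done: the next load starts from this date
    UPDATE load.cfg
    SET WatermarkDate = @NewWatermark
    WHERE id = @Id;
END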

And basically, that’s all! This logic should copy rows from all Oracle tables defined in the configuration.

We can now test it. This can be done with "Debug" or just by triggering a pipeline run.

If everything is working fine, we can just copy/paste all content from "FOR EACH ORA" into "FOR EACH PG".

Just remember to properly rename all activities to reflect the new source/destination names (PG). Also, all parameters and SELECT queries have to be redefined. Luckily, PostgreSQL supports ISO dates out of the box.

Source code

Here are all the components in JSON. You can use them to copy/paste the logic directly inside the ADF v2 code editor or save them as files in a GIT repository.

Below is the source code for the pipeline only. Everything else can be downloaded in a zip file via "Download all" at the bottom of this article.