Years ago in a previous job, I worked at a company that had no DBAs. I am/was a BI developer, so I know my way around a database, but I wasn’t dedicated to keeping all databases in good health. There were several application developers at this company (mostly focused on .NET and Javascript) who built applications with SQL Server databases as the back end. And there was a guy who acted as a system admin among his many other duties. The application developers had built a web app that was to be used by users around the world. The application had been launched and things were fine for several weeks. I wasn’t involved with the project, but I was aware of it.

One day, a manager asked me if I could help on an urgent matter: the application suddenly could no longer execute transactions on the production database and the database connection was intermittently failing. The system admin was busy with other duties, so I was the closest thing they had to a DBA. All they could tell me was the production database had crashed and they got an error message about insufficient disk space.

I logged on to the server that housed the database to see what was going on. The server itself had been set up appropriately and seemed to have sufficient memory and CPU to support the load of this application. I saw 3 volumes on the server: a C volume for application and system files, a large F volume for data, and a large G volume for logs.

I connected to the database with Management Studio to do some more digging. The first thing I noticed is that the dev, test, and prod databases for this application were all on the same SQL Server instance. The dev and test databases weren’t very large, so while that wasn’t what I would have recommended, that didn’t seem to be the main problem. As I looked at the prod database, I noticed that the MDF and LDF files were sitting on the C volume rather than the spacious F volume that was made for them! The person who configured the server hadn’t made the C volume very large since user databases weren’t supposed to be there.

Then I looked at the size of the log file. It was huge! A bit more digging revealed that they had left all the defaults on the database for full recovery and autogrowth of the log file, but they had never done a transaction log backup. (Sidenote: You can check the Log_Reuse_Wait_Desc column in sys.databases to verify the database is waiting on a transaction log backup.) The developers had worked long and hard to get the application up and running and hadn’t quite finished up the maintenance and disaster recovery tasks.

Once I knew what I was dealing with, I was able to fix the problem. A full backup and a log backup later we were back in business. I went ahead and shrunk the log file back to a reasonable size (please remember this is reserved for special occasions). I took the database offline (which was acceptable since the application was currently unusable anyway), moved the MDF and LDF files to their rightful home, and brought it back online. A lesson on recovery models and setting up SQL Agent jobs that scheduled such backups ensured this didn’t happen again anytime soon.

This should be a good reminder to have a healthy respect and understanding for your database settings and to make sure you have (and test) your backups (both full and transaction logs) for your production databases.

I showed in my previous post how we generated the datasets for our Azure Data Factory pipelines. In this post, I’ll show the BimlScript for our pipelines. Pipelines define the activities, identify the input and output datasets for those activities, and set an execution schedule. We were creating several pipelines with copy activities to copy data to Azure Data Lake Store.

We generated one pipeline per schedule and load type:

Hourly – Full

Hourly – Incremental

Daily – Full

Daily – Incremental

We also generated some one-time load pipelines for DR/new environment setup.

The first code file below is the template for the pipeline. You can see code nuggets for the data we receive from the generator file and for conditional logic we implemented. The result is one copy activity per source table within the appropriate pipeline.

In the second code file below, lines 104 to 119 are generating the pipelines. We read in the necessary data from the Excel file:

Schema name

Table name

Columns list

Incremental predicate

Sidenote: We wrote a quick T-SQL statement (not shown) to generate the columns list. This could have been done in our BimlScript, but it was something we changed after the fact to accommodate the limitations of Polybase (Dear Microsoft: Please fix). SQL was quicker and easier for us, but if I were to do this again I would add that into our BimlScript. We needed to replace new lines and double quotes in our data before we could read it in from the data lake. You can get around this issue by using .ORC files rather than text delimited files. But the ORC files aren’t human readable, and we felt that was important for adoption of the data lake with the client on this project. They were already jumping in with several new technologies and we didn’t want to add anything else to the stack. So our select statements list out fields and replace the unwanted characters in the string fields.

Our Excel file looks like this.

Columns B, C, L, and M are populated by Excel formulas. This is the file that is read in by the BimlScript in the code below.

In our generator file (which is the same file that was used to generate the datasets), we use the CallBimlScript function to call the pipeline template file and pass along the required properties (table, schema, frequency, scope, columns list, predicate).

The great thing about Biml is that I can use it as much or as little as I feel is helpful. That T-SQL statement to get column lists could have been Biml, but it didn’t have to be. The client can maintain and enhance these pipelines with or without Biml as they see fit. There is no vendor lock-in here. Just as with Biml-generated SSIS projects, there is no difference between a hand-written ADF solution and a Biml-generated ADF solution, other than the Biml-generated solution is probably more consistent.

And have I mentioned the time savings? There is a reason why Varigence gives out shirts that say “It’s Monday and I’m done for the week.”

Apologies for the overly acronym-laden title as I was trying to keep it concise but descriptive. And we all know that adding technologies to your repertoire means adding more acronyms.

My coworker Levi and I are working on a project where we copy data from an on-premises SQL Server 2014 database and land it in Azure Data Lake Store. Then we use Polybase to get the data into Azure SQL Data Warehouse and build a dimensional model. I’ve done a couple of small projects before with Azure Data Factory, but nothing as large as this one. We had 173 tables that we needed to copy to ADLS. Then we needed to set up incremental loads for 95 of those tables going forward.

My Azure Data Factory is made up of the following components:

Gateway – Allows ADF to retrieve data from an on premises data source

Linked Services – define the connection string and other connection properties for each source and destination

Datasets – Define a pointer to the data you want to process, sometimes defining the schema of the input and output data

Pipelines – combine the data sets and activities and define an execution schedule

Each of these objects is defined in a JSON file. Defining data sets and copy activities in JSON gets very tedious, especially when you need to do this for 100+ tables. Tedium usually indicates a repeatable pattern. If there is a repeatable pattern you can probably automate it. The gateway and linked services are one-time setup activities that weren’t worth automating for this project, but the datasets and pipelines definitely were.

In order to automate the generation of datasets and pipelines, we need a little help with some metadata. We had the client help us fill out an Excel spreadsheet that listed each table in our source database and the following characteristics relevant to the load to Azure:

Frequency (daily or hourly)

Changes Only (incremental or full load)

Changed Time Column (datetime column used for incremental loads)

That list plus the metadata we retrieved from SQL server for each table (column names and data types) were all we needed to automate the creation of the ADF datasets and pipelines with BimlScript.

This post will show how we built the data sets. The following post will show the pipelines with the copy activities.

First we need to generate the input datasets coming from SQL Server. We added some properties at the top and embedded some code nuggets to handle the values that are specific to each table.

Next we need the output datasets for Azure Data Lake Store. We use the same three properties in generating each dataset- schema, table, frequency- and we add one more for scope.

Now we just need another BimlScript file that calls these two files. We broke our pipelines up into daily versus hourly and incremental versus full loads.

We used a helper code file and a separate environments file, which I’m glossing over so we can focus on the Biml for the ADF assets. You can see that we read in the inputs from Excel and write some counts to a log file, just to make sure everything is working as intended. Starting on line 41 is where we generate the datasets. On lines 54 and 55, we use the CallBimlScript function to call the two files above. We end up generating datasets for the tables that are a full load each day and their counterpart datasets for the files we create in ADLS. The datasets for daily incremental loads are generated on lines 69 and 70. Then we do the hourly full loads and hourly incremental loads. I’ll discuss lines 100 – 119 in my next post.

The Results

We were able to write the BimlScript and generate the datasets and pipelines in about 35 hours. A previous ADF project without automation took about 3 hours per source table. If we had gone that route, we could have been looking at 350 – 500 hours to complete this part of the project. Visual Studio with Biml Express took about 5 minutes to generate everything. Deploying to Azure took about an hour. We are now looking into ARM templates for future deployments.

Stay tuned for part 2 where I show how we generated the ADF pipelines.