Years ago in a previous job, I worked at a company that had no DBAs. I am/was a BI developer, so I know my way around a database, but I wasn’t dedicated to keeping all databases in good health. There were several application developers at this company (mostly focused on .NET and Javascript) who built applications with SQL Server databases as the back end. And there was a guy who acted as a system admin among his many other duties. The application developers had built a web app that was to be used by users around the world. The application had been launched and things were fine for several weeks. I wasn’t involved with the project, but I was aware of it.

One day, a manager asked me if I could help on an urgent matter: the application suddenly could no longer execute transactions on the production database and the database connection was intermittently failing. The system admin was busy with other duties, so I was the closest thing they had to a DBA. All they could tell me was the production database had crashed and they got an error message about insufficient disk space.

I logged on to the server that housed the database to see what was going on. The server itself had been set up appropriately and seemed to have sufficient memory and CPU to support the load of this application. I saw 3 volumes on the server: a C volume for application and system files, a large F volume for data, and a large G volume for logs.

I connected to the database with Management Studio to do some more digging. The first thing I noticed is that the dev, test, and prod databases for this application were all on the same SQL Server instance. The dev and test databases weren’t very large, so while that wasn’t what I would have recommended, that didn’t seem to be the main problem. As I looked at the prod database, I noticed that the MDF and LDF files were sitting on the C volume rather than the spacious F volume that was made for them! The person who configured the server hadn’t made the C volume very large since user databases weren’t supposed to be there.

Then I looked at the size of the log file. It was huge! A bit more digging revealed that they had left all the defaults on the database for full recovery and autogrowth of the log file, but they had never done a transaction log backup. (Sidenote: You can check the Log_Reuse_Wait_Desc column in sys.databases to verify the database is waiting on a transaction log backup.) The developers had worked long and hard to get the application up and running and hadn’t quite finished up the maintenance and disaster recovery tasks.

Once I knew what I was dealing with, I was able to fix the problem. A full backup and a log backup later we were back in business. I went ahead and shrunk the log file back to a reasonable size (please remember this is reserved for special occasions). I took the database offline (which was acceptable since the application was currently unusable anyway), moved the MDF and LDF files to their rightful home, and brought it back online. A lesson on recovery models and setting up SQL Agent jobs that scheduled such backups ensured this didn’t happen again anytime soon.

This should be a good reminder to have a healthy respect and understanding for your database settings and to make sure you have (and test) your backups (both full and transaction logs) for your production databases.

Update: As Gerhard points out in the comments, switching to ORC files solves this issue nicely. It’s not human readable, but it is much less error-prone when reading in data.

I’ve spent the last few weeks working on a project that used PolyBase to load data from Azure Blob Storage into Azure SQL Data Warehouse. While it’s been a great experience, I must note that PolyBase is a picky eater.

But just because you have successfully created the external tables does not mean you are finished. That is when the “fun” begins. If you would like more information on why “fun” is in quotes, read Grant Fritchey’s blog post on Loading Data into Azure SQL Data Warehouse.

You should test after populating any table in SQL Server, but I think this is especially true with external tables. More than likely you will find that you must resolve several issues with source file contents and external table definitions.

First let me say that PolyBase is cool. I can query data in text files and join to tables in my database. Next let me say PolyBase is a fairly young technology and has some limitations that I imagine will be improved in later versions.

One of those limitations (as of July 30, 2016) is that while you can declare your field delimiter and a string delimiter in external file formats, the row delimiter is not user configurable and there is no way to escape or ignore the row delimiter characters (\r, \n, or \r\n) inside of a string. So if you have a string that contains the row delimiter, PolyBase will interpret it as the end of the row even if it is placed inside of the string delimiters.

To elaborate further, I had several fields that originally came from a SQL Server table and were of type text. Some of the values in these fields contained newlines (\n) as users had typed paragraphs and addresses into the fields in the source application. The data from the source tables was exported from SQL Server to Azure Blob Storage using Azure Data Factory with a simple copy pipeline with no modifications to the data. The problem is that Hive, PolyBase, and several other tools have issues reading strings with newlines the value. They immediately interpret it as the end of the row. There is no escape character or setting you can use to allow newlines in the values.

If you find yourself in a similar situation, trying to load data from delimited files into Azure SQL DW and realizing you have newlines inside of string fields, there are two things you can do.

Fix the data in the flat files so it doesn’t contain new lines in string fields.

Switch to a different tool to load data to Azure DW. Azure Data Factory can take the data from blob storage and import it into a normal table in Azure DW.

In most circumstances, I would go for option #1. Option #2 only fixes things in Azure DW, leaving other tools in the environment to deal with the issue separately, and it requires storing a copy of the data in the DW.

In my project, I changed the ADF pipelines to replace newlines with an acceptable character/set of characters that doesn’t often appear in my data set and doesn’t obscure the values. We chose to replace them with 4 spaces. It’s important to understand that this means that your data in your blob storage will no longer exactly match its source. This is something you will want to document because it will surely pop up somewhere in the future.

Updating the ADF pipelines is not much effort. If my table definition is

If you are using time slices in ADF and you have your query inside of the Text.Format() function, you will find that ADF doesn’t allow the single quotes around the four spaces in your JSON. You can instead use CHAR(32) instead of a space. If you have a better way of accomplishing this, please leave me a note in the comments.

In addition to updating the ADF pipelines, I also had to replace the newlines in my existing files in blob storage. Since there weren’t many of them, I just opened them up in Notepad++ and did a find & replace. If there had been more files, I would have looked into a more automated solution.

If the ability to allow field/row terminators within string fields is something you would like to see in the PolyBase, please voice your opinion by casting a vote on the feedback site.

In my previous post, I provided the design pattern and Biml for a pure Type 2 Slowly Changing Dimension (SCD). When I say “pure Type 2 SCD”, I mean an ETL process that adds a new row for a change in any field in the dimension and never updates a dimension attribute without creating a new row. In practice, I tend to create more hybrid Type 2 SCDs where updates to some attributes require a new row and others update the value on the existing rows. A similar pattern that I find I implement more often than a pure Type 2 is a Type 6 SCD. A Type 6 SCD builds on the Type 2 technique by adding current attributes alongside the historical attributes so related measures can be grouped by the historical or current dimension attribute values. The only difference between what I call a hybrid Type 2 and a Type 6 is that in the Type 6, there are no Type 1 attributes in the dimension that do not also have a Type 2 version in the dimension to capture the historical values.

Design Pattern

Control Flow

If you are comfortable with my design pattern for a pure Type 2 SCD in which a change of value in any column causes a new row, this pattern is quite similar. And the control flow is exactly the same. This pattern, as with my pure Type 2, assumes that rows are not deleted from the source system. You could easily modify this to check for deleted rows if needed.

Control Flow for a Hybrid Type 2 or Type 6 Dimension

The steps in the control flow are:

Log package start.

Truncate the update table.

Execute the data flow.

Execute the update statements to update columns and insert new rows.

Log package end.

The update statements are different in this pattern, and I’ll explain those in detail below.

Data Flow

The data flow looks like a pure Type 2 SCD, with the exception of an added derived column transformation and minor changes to the lookup and conditional split. Again, I use views to integrate the data, apply business logic, and add hashkeys for change detection. Then I use SSIS to perform the mechanics of loading the data.

The steps in this data flow task are:

Retrieve data from my source view.

Count the rows for package logging purposes.

Perform a lookup to see if the entity already exists in the dimension table.

If the entity doesn’t exist at all in the dimension table, it goes into the left path where I count the number of rows, add a derived column that sets the row start date to “01/01/1900 00:00:00”, and then insert the row into the dimension table.

If the entity does exist in the table, I check it for changes.

If there are changes to the entity, I count the number of rows, us a derived column to flag the type(s) of changes to make, and then insert the row into an update table.

Entities with no changes are simply counted for audit purposes.

The Source View

This SSIS pattern requires 3 hashed values for for change detection:

HistoricalHashKey: the unique identifier of the entity, the natural key that ties the historical rows together

ChangeHashKey: the columns on the dimension that cause a new row to be created and the current row to be expired

UpdateHashKey: the columns on the dimension that should be updated in place

In my example view below, the Route ID and Warehouse identify a unique route. The supervisor, route description and route type are all important attributes of the route. The route area identifies the metro area in which a route is located. If this should change, there is no need to see previous values; we just want to see the current value on every row.

The RowEndDate value in this view will be used for routes that require a current row to be expired since my pattern is the leave the end date of the current row null.

The Change Detection Lookup

The lookup in my DFT compares the HistoricalHashKeyASCII column from the source view with the varchar version of the HistoricalHashKey from the dimension table and adds two lookup columns: lkp_ChangeHashKeyASCII and lkp_UpdateHashKeyASCII to the data set.

Rows that do not match are new rows; i.e., that route has never been in the dimension table before. Rows that do match have a row in the dimension table and will then be evaluated to see if there are any changes in the values for that route.

Derived Column for New Rows

The no match output of the lookup are new rows for routes that are not in the dimension table. Since this is the first row in the table for that route, we want this row to be effective from the beginning of time until the end of time. The beginning of time in this data mart is 01/01/1900. Since the data is loaded multiple times per day, I use a date/time field rather than a date. If you only need the precision of a day, you can cut your row end date/time back to just a date. In my pattern, the current row has a null row end date, but you could easily add a derived column to set the end date to 12/31/9999 if you prefer.

Conditional Split for Change Detection

This time, I have to check to see if both the ChangeHashKeyASCII and the UpdateHashKeyASCII match in my conditional split.

If both hashed values from the source view match the hashed values from the dimension table, then no changes are required and the row is simply counted for audit purposes.

If either hashed value doesn’t match, there is an update to be made.

Derived Column to Identify Change Types

We again compare the UpdateHashKeyASCII value from the source view with that of the dimension. If they don’t match, we set the UpdateInPlace flag to true. If the ChangeHashKeyASCII values don’t match, we set the UpdateNewRow flag to true. If a row has both types of changes, both types of updates will be made.

My update table contains the UpdateInPlace and UpdateNewRow columns, so I can reference these flags in my update statements.

The Update Statements

The update statements in the control flow take the changes from the update table and apply them to the dimension table. Three statements are executed in the Execute SQL Statement labeled SQL Update DimRoute.

The first statement updates the values for the columns designated to be updated in place by joining the update table to the dimension table based on the HistoricalHashKey column. This is the same as performing updates in a Type 1 SCD.

The second statement expires all the rows for which a new row will be added. The third statement inserts the new rows with the RowIsCurrent value set to 1 and the RowEndDate set to null.

The Biml

If you are using Biml, you know that you can create a design pattern for this type of dimension load and reuse it across multiple projects. This speeds up development and ensures that your Type 2 Hybrid or Type 6 dimensions are implemented consistently.

As usual, I have 3 Biml files that are used to create the SSIS package:

ProjectConnections.biml – contains all the project-level connections for the project

Dim2Hybrid.biml – contains the SSIS design pattern with code nuggets that parameterize it to make it reusable

CreateDim2HybridPackages.biml – calls Dim2Hybrid.biml and passes along the values to be used for each package

Once I have my source view, dimension table, and update table in the database, the 3 Biml files added to my project, and BIDSHelper installed, all I have to do is right click on the CreateDim2Hybrid.Biml file and choose Generate SSIS packages to create my package.

Microsoft acquired Datazen back in April 2015, and I explored it and wrote about it a couple months later. To date, Microsoft has mostly left the product as is, although a new version containing bug fixes and a few enhancements was released in September.

While I was at PASS Summit I learned that there is a bright future for Datazen as a significant part of the MSBI reporting offerings. Microsoft has provided a reporting roadmap that looks very promising. They are working to align cloud and on-premises solutions and to harmonize report types. In the MSBI reporting world, there will be 4 report types:

Paginated reports (SSRS)

Interactive reports (Power BI)

Mobile reports (Datazen)

Analytical reports (Excel)

For on-premises solutions, you will have one unified SSRS Report manager that supports mobile reports and interactive reports as well as paginated reports. And it’s much prettier than the old SSRS report manager!

There will also be a unified mobile app so there is no need to switch between apps to get Datazen reports and Power BI reports.

This new report manager means the current administration of Datazen (hubs and dashboard groups, etc.) will likely change to conform to more of the SSRS or Power BI paradigms. As far as I know, there aren’t yet solid details on exactly how this will work. In the Microsoft-led sessions at PASS Summit, they also mentioned that there will be migration tools for those already using Datazen that want to take advantage of this new platform.

So if you are considering Datazen but were worried about its future, there is no need to worry. If you are implementing Datazen now, you can feel better about enhancement and upgrade opportunities with the promised migration tool.

If you’d like to know more about Datazen, my BlueGranite colleagues and I wrote an ebook about it to help you understand:

Datazen’s place in the Microsoft BI stack

Datazen elements and architecture

Tips for planning your Datazen implementation

A workflow for creating a dashboard in Datazen

Answers to frequently asked questions from Datazen users

You can get the ebook here for free! We plan on updating it as more information comes out on Datazen and SQL Server 2016.

I’ve been working on a project that includes geographical data representing stops on a delivery route. I’ve just completed loading this data into a data mart. The source data contains longitude and latitude in millionths of a degree with 9 digits of data. We haven’t decided what tool we will use to visualize this data yet, but we know Power View and Power Map both accept latitude and longitude values. I decided to store my longitude and latitude data in decimal (9,6) fields. There is a good possibility that we may be computing distances between points in the future, so I thought it would be good to store the data as a spatial data type as well. I thought I would share a few things that I learned along the way.

There are two spatial data types in SQL Server: geometry and geography. Geometry represents the flat-earth system where units are all equally spaced apart. Geography represents the round-earth system measured in latitude and longitude. Since I had longitude and latitude in my data, I used the geography data type. The geography spatial data type is implemented as a .NET common language runtime (CLR) data type in SQL Server.

I populated my table using a query of which I’ve included a snippet below. You can see the use of the Point function to create my geography values.

The Point function accepts a a latitude, longitude, and SRID, and returns a geography value. An SRID is a unique identifier associated with a coordinate system, tolerance, and resolution. SRIDs are not specific to SQL Server. They are maintained by the International Association of Oil & Gas Producers (OGP) Surveying & Positioning Committee. Here’s a blog post that I think does a good job explaining many of the terms associated with spatial data in SQL Server.

Tip #1: You can see a list of SRIDs available in SQL Server by running the following query. SQL Server uses the default SRID of 4326, which is the WGS 84 spatial reference system.

SELECT * FROM sys.spatial_reference_systems

My source database has planned delivery stops and times and actual delivery stops and times stored in separate columns in a very wide table. I decided to pivot that data and create a table with a scenario key that refers to either plan or actual data. To do this, I wrote 2 queries and attempted to union them together to produce my final data set. That’s when I learned:

Tip #2: When SQL Server performs a UNION it must compare values to remove duplicate rows. CLR user-defined type columns like geography are not comparable. As long as there is no risk of duplicate data between the two sets, you can use UNION ALL.

As I finalized my table design I considered using a computed column to store my geography data. But I encountered an issue when I went to add a spatial index. Spatial indexes are built on top of B+ trees. They decompose space into 4 levels of grids. I think spatial indexes are interesting, but they have some restrictions of which you should be aware. They require the table to have a clustered primary key. They cannot be specified on indexed views. And…

Tip #3: You can create a computed column to store the geography point based upon the latitude and longitude. But you cannot create a spatial index on a computed column.

If you try to create a spatial index on a computed column you will get SQL Server error message 6342.

You don’t have to use spatial data types just because you have spatial data. Many data viz tools have built-in geocoding that will accept longitude and latitude or an address. But spatial data types can be useful when calculating distances between two points and planning and measuring routes.