Ben Jarvis' Blog

One of the many exciting announcements made at MSBuild recently was the release of the new Cosmos DB Bulk Executor library that offers massive performance improvements when loading large amounts of data to Cosmos DB (see https://docs.microsoft.com/en-us/azure/cosmos-db/bulk-executor-overview for details on how it works). A project I’ve worked on involved copying large amounts of data to Cosmos DB using ADF and we observed that the current Cosmos DB connector doesn’t always make full use of the provisioned RU/s so I am keen to see what the new library can offer and look to see if our clients can take advantage of these improvements.

In this post I will be doing a comparison between the performance of the Cosmos DB connector in ADF V1, ADF V2 and an app written in C# using the Bulk Executor library. As mentioned in the Microsoft announcement, the new library has already been integrated into a new version of the Cosmos DB connector for ADF V2 so the tests using ADF V2 are also using the Bulk Executor library.

All tests involved copying 1 million rows of data from an Azure SQL DB to a single Cosmos DB, the scaling settings used for the resources involved are:

Cosmos DB Collection – partitioned by id, 50000 RU/s

Azure SQL DB – S2 (50 DTUs)

Each test was repeated 3 times to enable an average time taken to be calculated.

The document inserted into each the database looked like the below:

Test 1 – ADF V1

To set up the ADF V1 test I used the Copy Data Wizard to generate a pipeline that would copy data from my Azure SQL DB to Cosmos DB. I then executed the pipeline 3 times, recreating the Cosmos DB collection each time.

Each test behaved in a similar way, the process slowly ramped up to 100,000 requests per minute and sustained that throughput until completion. The tests only consumed around 3,000 RU/s out of the 10,000 RU/s provisioned to each partition in the collection (each collection was created with 5 partitions).

The results of the test were:

Run 1 – 677 seconds

Run 2 – 641 seconds

Run 3 – 513 seconds

Average – 610 seconds

The performance increased after each run with run 3 taking 513 seconds, nearly 3 minutes quicker than the first test. I can’t explain the differences in time taken however, it seems that ADF progressively ramped up the throughput quicker after each run so it may be down to scaling within the ADF service itself.

Test 2 – ADF V2

To set up the ADF V2 test I again used the Copy Data Wizard to generate the pipeline. I then executed the pipeline 3 times, recreating the Cosmos DB collection each time.

ADF V2 represents a massive improvement over V1 with a 75% increase in performance. The connector used more than the provisioned throughput meaning some throttling occurred however, this means that the collection could have been scaled even further to obtain higher performance.

Interestingly the ADF V2 requests didn’t appear on the number of requests metric shown in the Azure Portal so I’m unable to look at how many requests ADF V2 was able to sustain. I’m unsure of the reason for this however, it could be something like ADF using the direct connection mode to Cosmos DB rather than connecting through the gateway meaning the requests aren’t counted.

The results of the test were:

Run 1 – 163 seconds

Run 2 – 158 seconds

Run 3 – 156 seconds

Average – 159 seconds

The performance of ADF V2 was more consistent that V1 and remained reasonably steady across all tests.

Test 3 – C# w/ Bulk Executor Library

To set up the C# test I wrote a quick C# console application that uses the Bulk Executor library, the application was running on my local machine rather than within Azure so there will obviously be a performance hit from the additional network latency. The source code for the application can be found at https://github.com/benjarvis18/CosmosDBBulkExecutorTest.

The results of the test were:

Run 1 – 240 seconds

Run 2 – 356 seconds

Run 3 – 352 seconds

Average – 316 seconds

The performance of the C# application is less consistent however, this is probably due to the network from my local machine to Azure. My application is also not very scalable as it is loading the whole dataset into memory rather than streaming it, as would be required with a larger dataset. The actual code itself is probably also not as optimised as it could be.

Overall however, the performance of my C# application was still 50% better than ADF V1.

Comparison

The final results of the tests are below:

As above, ADF V2 is the clear winner in the test offering a 75% increase in performance when compared to ADF V1. This represents a huge performance gain and could provide some significant costs savings for users that are loading large amounts of data into Cosmos DB. My C# application offered 50% better performance than ADF V1, running outside of Azure without any optimisation so the performance benefits of the new library are significant.

The new library is an exciting development for Cosmos DB and allows us to fully utilise the capabilities it has to offer when dealing with large amounts of data. I look forward to making use of these benefits in projects, especially the significant improvements in ADF V2!

On a current client project we are taking files from an on-prem file server and uploading them to Azure Blob Storage using ADF V2. The files are compressed on-prem using GZip compression and need to be decompressed before they are placed in blob storage where some other processes will pick them up.

In our particular case the file names and extensions of the source files are all uppercase and when ADF uploads them it doesn’t alter the file extension e.g. if I upload 1234_567.CSV.GZ I get 1234_567.CSV.GZ in blob storage rather than 1234_567.CSV.

If I upload 1234_567.csv.gz the functionality works correctly and I get 1234_567.csv in blob storage.This means that the file extension replace is case sensitive when it should be case insensitive.

This bug isn’t a major issue for us as the file is decompressed and we can change the extension when we process the file further however, it’s something that stumped me for a while.

Introduction

Recently we had a client requirement whereby we needed to upload some files from an on-prem file server to Azure Blob Storage so they could be processed further. The file system connector in ADF V2 with a self-hosted integration runtime was the perfect solution for this so this blog post will discuss the process for getting a basic set up working to test the functionality.

Test Data

The first task is to generate some data to upload. I did this by executing the below PowerShell script to create 100 files, with each being 10 MB in size.

ADF Configuration

Self-Hosted Integration Runtime

In ADF V2 the integration runtime is responsible for providing the compute infrastructure that carries out data movement between data stores. A self-hosted integration runtime is an on-premise version of the integration runtime that is able to perform copy activities to and from on-premise data stores.

When we configure a self-hosted integration runtime the data factory service, that sits in Azure, will orchestrate the nodes that make up the integration runtime through the use of Azure Service Bus meaning our nodes that are hosted on-prem are performing all of our data movement and connecting to our on-premises data sources while being triggered by our data factory pipelines that are hosted in the cloud. A self-hosted integration runtime can have multiple nodes associated with it, which not only caters for high availability but also gives an additional performance benefit as ADF will use all of the available nodes to perform processing.

To create our self-hosted integration runtime we need to use the following PowerShell script:

Once we run the downloaded executable the installer will run and install the required application, the application will then open on the following screen asking for our authentication key:

We can enter the authentication key retrieved in the previous step into the box and click register to register our integration runtime with the data factory service. Please note that if your organisation has a proxy you will need to enter the details to ensure the integration runtime has the connectivity required.

After clicking register we receive the following prompt, click finish on this and allow the process to complete.

ADF Pipeline

Now we’ve got our self-hosted integration runtime set up we can begin configuring our linked services, datasets and pipelines in ADF. We’ll do this using the new ADF visual tools.

Linked Services

The first step is to create our linked services. To do this we open up the visual tools, go to the author tab and select connections; we can then create a new linked service to connect to Azure Blob Storage:

Next we need to create a linked service for our on-prem file share.

First create a new linked service and select the file system connector, we can then fill in the relevant details.

There are a couple of things to note with the above configuration, the first is we have selected our self-hosted integration runtime for the “Connect via integration runtime” option. We have also specified an account that has access to the folder we would like to pick files up from, in my case I have configured a local account however, if you were accessing data from a file share within your domain you would supply some domain credentials.

Datasets

For our pipeline we will need to create two datasets, one for our file system that will be our source and another for blob storage that will be our sink.

Firstly, we will create our file system dataset. To do this we will create a new dataset and select the file system connector, we will then configure our connection tab, like below:

As above, we have selected our linked service we created earlier, added a filter to select all files and opted to perform a binary copy of the files in the directory.

Next, we need to create a dataset for blob storage. To do this we will create a new dataset using the Azure Blob Storage connector, we will then configure our connection tab, like below:

This time we have entered the container we’d like to copy our files to as the directory name in our file path, we have also opted for a binary copy again.

Pipeline

The final step is to create our pipeline that will tie everything together and perform our copy.

First, create a new pipeline with a single copy activity. The source settings for the copy activity are as follows:

The settings for the sink are:

In this situation we have set the copy behaviour to “Preserve Hierarchy”, which will preserve the folder structure from the source folder when copying the files to blob storage.

Now we’ve got everything set up we can select “Publish All” to publish our changes to the ADF service.

Testing

We’ve completed all of the necessary configuration so we can now trigger our pipeline to test that it works. In the pipeline editor click the trigger button to trigger off the copy. We can then go to the monitor tab to monitor the pipeline as it runs:

The pipeline will take a couple of minutes to complete and on completion we can see that the files we selected to upload are now in our blob container:

Conclusion

To conclude, ADF V2 presents a viable solution when there is a need to copy files from an on-prem file system to Azure. The solution is relatively easy to set up and gives a good level of functionality and performance out of the box.

When looking to deploy the solution to production there are some more considerations such as high availability of the self-hosted integration runtime nodes however, the documentation given on MSDN helps give a good understanding of how to set this up by adding multiple nodes to the integration runtime.

This year at SQLBits Adatis decided to create two technical demos to fit in with the magic theme, these were the Adatis Sorting Hat and the Adatis Photo Booth.

My blog post will look at the Adatis Photo Booth which was designed to allow us to print a wizarding student ID for participants; I’ll give an overview of the technology stack that was used and then do some analysis of the data we collected using Power BI.

The Requirement

The first place to start when designing a technical solution are the business requirements. In our case the requirements were:

The user should have an app running on a tablet that will ask them for some basic details such as name, email address and twitter account before allowing them to take a selfie to be printed on their card.

The image of the user should be uploaded to the Azure Cognitive Services Face API to capture their emotions.

The final data to be placed on the card should be added to a database, in the required format, for our card printing software to pick up.

The solution should require the smallest amount of coding possible.

The solution should be fully automated and require the least amount of interaction from the Adati on the stand.

The Technology

To satisfy our requirements we chose to use Power Apps for the front-end and Logic Apps for the backend making our solution completely serverless.

A brief overview of the steps in the process and the tech used at each stage is below:

At a high level the technology being used is:

Power Apps – allows us to build a cross-platform, front-end application without writing any code.

Azure SQL Database – allows us to store the data captured from the application. Power Apps natively integrates with Azure SQL DB.

Azure Cognitive Services – a set of APIs that allowed us to very easily add some intelligence to our application. The face API was easy to integrate and we were using machine learning to map the emotions of the face within minutes.

The Technical Challenges

While building the solution we came across a couple of limitations that are useful to raise, these are:

The cognitive services connector for Logic Apps doesn’t currently return the emotion data returned by the API. To get around this we made a web request within the logic app and parsed the JSON returned to extract the attributes we needed.

Apart from the API to upload the images no coding was required to get this solution to function making Power Apps and Logic Apps a very good choice when you need to create a solution that needs to be delivered quickly.

The Data

Now we’ve looked at the technical solution we can use Power BI to look at some of the data we collected at SQL Bits.

To analyse the data I connected my Power BI desktop to the Azure SQL Database we were using, I then created some measures and visuals against the data to give the below result:

The visuals we have are:

Top Left - tree map that look at the count of students by their primary emotion (the emotion that scored highest)

Top Middle – % of happy (where primary emotion is happiness) and % of unhappy (where primary emotion is anger, sadness or disgust) over each 3 hour period.

Top Right – total number of students we produced cards for, overall % of happy and % unhappy students.

Bottom – primary emotion logged over time

As you can see, we did quite well with over 46% of our students having a primary emotion of happy, with only 13% being unhappy. It seems that over Thursday and Friday we had some unhappy students but on Saturday everybody was happy!

Conclusion

To conclude, we created our tech demos for SQLBits so everybody can have some fun on the stands however, as you’ve seen, the demos are backed by the cloud technology that we use all day, everyday. Power Apps and Logic Apps are an excellent set of tools and allow you to build applications quickly and easily while connecting various services together; all while being part of the Azure PaaS and SaaS offering without the headache of provisioning servers. The Cognitive Services APIs also present a great offering and allow you to quickly and easily add some intelligence to your application without needing to roll your own solution.

As always, if you have any questions or comments then please let me know.

Introduction

Recently, while researching Cosmos DB, I came across the in-built capabilities for managing spatial data.

Cosmos DB is Microsoft’s globally distributed, multi-model database. It has the capability to store various types of data, such as document, graph and key-value and can elastically scale to cope with varying needs. The piece of Cosmos DB that this post will be discussing is the spatial capabilities of the document model.

The problem I have chosen to solve using the spatial functionality is working out which airfields are within the range of an aircraft when given its true airspeed and fuel endurance in hours; with the range being calculated by multiplying the airspeed by the endurance.

The Data

The first step was to find a data set containing a list of all of the world’s airfields, this was found on GitHub at https://github.com/mwgg/Airports. The data set contains the details we need, which are:

ICAO code – this is the unique identifier for an airport

Airport Name

Latitude and Longitude of the Airport

The next step was to create a Cosmos DB account in the Azure Portal and write a C# app to do some transformations on the data and load the documents into our Cosmos DB collection.

I first created a C# object that matched the structure of the JSON data:

I then created a C# object that matched the structure of the document I wanted to insert into Cosmos DB. The “Point” data type is used to serialize the latitude and longitude into the GeoJSON structure that Cosmos DB supports:

One thing to note is the above code enables spatial indexing when creating the collection, without this enabled performance is extremely poor when performing spatial queries.

The beauty of Cosmos DB is that it is able to elastically scale to the performance level specified by the user through the number of RUs (request units) that are allocated to the collection. While loading the data into Cosmos DB I wanted to scale up my database to take advantage of the multithreading in my C# app and speed up my processing so I just went in to the Azure Portal and adjusted the number of RUs allocated to the collection, the change was almost instant and my process instantly sped up. Once I had finished importing the data I was able to scale my database back down so I’m no longer paying for any unused capacity.

Querying

Now the data is in Cosmos DB we can begin to play around with some spatial queries.

To query airfields within a certain distance of a specified point I can run the following query which returns all of the airfields within 25km of Blackbushe airport. As you can see, the SQL syntax for querying Cosmos DB is very similar to T-SQL meaning it’s very easy to re-use your SQL Server skills:

The App

The next step is to create an application that uses the functionality to find the airfields within the range of an aircraft. To do this I created a basic ASP.NET MVC application that has a form with the following fields:

The above code connects to Cosmos DB and retrieves the details for the base airfield that was specified, it then calculates the range of the aircraft in meters by multiplying the endurance (in hours) by the true airspeed in knots (nautical miles per hour) and then multiplying that my 1852 (number of meters in a nautical mile). A Linq query is then run against Cosmos DB using the built-in spatial functions to find airfields within the specified distance. The result is then converted into a JSON array that can be understood by the Google Maps API that is being used on the client side.

The client side uses the Google Maps API to plot the airfields on a map, giving us a view like the one below when given a base airfield of Blackbushe (EGLK), a true airspeed of 100kts and an endurance of 4.5 hours:

The current functionality of the app is extremely basic but could easily be expanded to make the range calculation more accurate by looking at wind and other factors that can affect range. This could be done by creating a polygon representing our range and then using the ST_WITHIN function to find all airfields within that polygon. The functionality could also be enhanced to take into account other attributes of the airfield by deciding if an airfield is suitable based on runway length and other facilities.

Conclusion

As you can see, using Cosmos DB it is extremely easy to analyse spatial data and the elastic scaling capability allows you to easily increase capacity of your application to cope with increased usage and large amounts of data. The spatial capabilities of Cosmos DB can make it a viable alternative to other databases such as SQL Server that aren’t always as easy to scale on demand and don’t have the flexibility that a document model can give.