Tag Archives: Azure

Finally got my Azure Databricks preview enabled. This is just a quick overview of how it all hooks together.

Technical Architecture Overview

Basically Databricks is the PaaS and Azure is the IaaS. Which is a smart play by Databricks. Why try and compete with the scale and resilience that Microsoft and AWS hosting IaaS. Leveraging partner IaaS to host their PaaS service allows them to do what they do well which is to focus on pushing the Databricks spark platform above and beyond what other data platforms are capable of.

Creating the Service

When you create the Databricks service it creates what appears to be a PaaS service that sits in a blade in the resource group you select when creating the service. So in my case I created a service called sibytes-databricks in my DataLakePatterns resource group.

At the top:

Managed Resource Group – this is a resource group that houses the managed IaaS – see below

URL – the URL to the Databricks PaaS service. This is also where the “Launch Workspace” button goes to

Resource Group – Just where this pass through blade to the Databricks service sit in your subscription

When you click the Launch Workspace button it takes you through a single sign-on to Databricks which is where the PaaS services operates. The first time you go through you’re asked to delegate authorisation to your azure account from the Databricks service so it can manage the IaaS as part of the service. So you’ll see this

NOTE: this has to be an azure domain account not a Microsoft account so you’ll need to set one up before hand. Also it’s probably wise to create a domain account specifically for this purpose rather than using a specific user account so you can lock down it’s permissions entirely to what it should have – particularly with regards data access.

Once in you’ll see the usual and in my opinion brilliantly usable PaaS service interface.

Remember that it’s a single sign-on delegated permission to Azure so when you manage your account in Databricks it will take straight back into Azure.

Managed IaaS

When you create your Databricks service you’ll also notice that it will also create an additional Resource Group to house the infrastructure services that Databricks creates to run the enhanced spark data platform.

Initially it will look like the following i.e. with no clusters created and running.

It has:

A storage account – so it’s not entirely IaaS. The storage is read-only and you cannot access it. Not 100% what this is for yet but if it’s for data content storage then we might be limited to 1 account and that also means data going in and out can only be through Databricks. Or maybe it’s just for a specific types of Databricks meta data or processing storage type… Or maybe it’s for internal config, logs and security… Need to look into this further when I get into data processing and explore to see what if any other Azure storage accounts can be attached. Will blog on it…

Virtual Network – This is basically a secure vnet for the IaaS clusters.

Network Security Group – This is all the network security inbound and outbound settings that allow Azure and Databaricks to speak to each other to spin up IaaS VM’s and access data.

Creating A Cluster

Things get a little more interesting when we create a cluster. I created a 2 node cluster with a 1 driver using Standard DS3’s. In the drop downs for choosing machines you’re limited to a specific collection and currently they are all listed as beta.

Looking back into the Azure IaaS resource group we can see it’s spun up 3 linux machines on the VM’s we have chosen in Databricks:

When we stop the cluster in Databricks I was expecting the VM’s to be stopped. However that doesn’t happen. Currently what happens is that Databricks retains the cluster as a terminated cluster and Azure delete’s all of the VM’s so all you’re left with is the VNet, SGN and Storage Account with our data on it – which is what we started with plus our data.

What this means is that you’ll pay less in storage when you cluster is stopped however it can take quite a while for clusters to come and down particularly if they’re pretty large. It would be good to have an option to pause the service that keeps the VM’s persisted.

Thoughts & More to Follow

This is just a quick look and intro – more to follow particularly on data processing, storage and hooking into the PaaS service through their data provider.

How the storage accounts hook in is a big part of how good this will be because if interaction and flexibility around storage is limited then it may significantly reduce it’s appeal as an architectural option. If it is significantly locked down then we’ll be forced to use and pay for services that Microsoft and Databricks have agreed to integrate that you would otherwise may not need in order to build a viable and compelling data platform e.g. Azure Data Warehouse…

In this blog I’m detailing out how flatten complex json in Azure Stream Analytics. It’s not massively hard but took a bit of fiddling since the online docs stop a little short. This kind of stuff is mostly a reference for myself for when I need to do it again sometime later and will have probably forgotten.

There is this blog which gives some insight but stops a little short on dealing with arrays. Basically I’m just going to cover how GetArrayElements works in a bit more detail.

Consider the following json event payload we might receive through a stream… No guesses for how I waste most of my time… and… it’s close but that’s not my real gamer tag 🙂

Let us imagine our game sends a payload every minute that contains all the game stats that occurred in that minute. This is completely fabricated. Am sure that’s not how such a thing would be done but its a fun example and provides an array.

So our json above is relatively semi-structured, we have:

Attributes in the root

Child entities with attributes

A child entity with an array

Our end game is that we want to flatten this into a denormalized data sets to insert into a SQL table for example or aggregate our stats which is a more likely use case of stream analytics.

We want a data set that looks like this (click image to see larger pic):

If we were aggregating we might thin out a few columns not needed in the grouping but just for an example it’s fine.

Solution

So here’s the solution straight off the bat. There’s some discussion that follows if you’re interested.

The best way I found to fiddle around with this stuff is to write out to json file on blob storage. Since you get to see the json of how close you’re getting and go from there. There is a Stream Analytics addin for visual studio now that offers local debugging but I’ve had some issues with it, namely the addin breaking my data lake tools addin.

GetArrayElements

The GetArrayElements function pivots out the array items but has to be applied using a cross apply so that it runs across each item in the array. Any where clause should follow the cross apply. I’ve used a sub query since it allows me to:

Have intuitively readable code

Break out the columns from the ArrayValue into their own alias’s above; it might be possible to do that in 1 hit but I like the subquery (see 1)

If you take the sub query alone it creates the following json so it’s not hard to see how to get to the final result by looking at the intermediate result.

This has blown out the array across the higher level json data, from here we can just pick out what we need in our rows using the outer query. When you dump this out to a file it won’t create an array as a single json doc it basically writes a json document per file line, so 4 json documents in this case separated by a line feed.

Incidentally if you are using the Stream Analytics Visual Studio addin with local debug and a sample json file you have to encase your json document events into an array or it errors.

Hope this makes sense / helps. No doubt I’ll have forgotten how to do this in a couple of days.

So folks might have noticed the Feb 1st update to stream analytics. There’s a fair bit of stuff in this release.. What I am going to focus on though is how stream analytics integrates into Power BI now.

Power BI has had the addition of Stream Data sets sometime after Azure Stream Analytics (ASA) integration. Folks who have worked with it might be aware that when you hook up ASA to Power BI it just creates a dataset and doesn’t create a stream dataset. When I first did this it jumped out as something that probably needed to be consolidated… and now it has.

So what does the implementation experience look like now… Well from the stream analytics nothing much appears to have changed. I have a test query that’s not doing much it just fires through some data. I’m actually firing json in through the event hub but will focus on the ASA PowerBI bit. The query is as follows:

When creating the output everything seems as it was before. My assumption was that something would’ve changed here and that I would have to create my streaming dataset end point definition first in Power BI and choose my data set here. But that’s not the case…

The subtle impact here is that it’s not like an API where I can just fire json at the secure end point. I still have to authorize with an account! This means you need to have a PowerBI account for your stream analytics service which in a production environment you probably won’t want that to be a named user account. Also, if the workspace needs to be shared then it will need to be a Pro account.

So far we’ve hooked up ASA to Power BI but we haven’t done anything with Power BI. Well it turns out we don’t have to manually define the end point like we do with API integration. When data starts streaming through the data set appears automatically in our Streaming Datasets within Power BI. So when I turn on my data stream using a C# app I get the following in PowerBI.

When I click the pencil to edit the data set I can see that it’s created all the attributes, typed them and defaulted history to on.

Now it seems we have some control of how these data sets are shaped in Power BI and we can flip the history on or off. We also have an end point that we can use with an api, cURL or PowerShell…

This means we might be able to stream in data to the data set using another api source not just stream analytics. Also it might seem I can do the following:

add an attribute that’s not in my ASA stream

remove an attribute that is in my ASA stream

Lets see…

Add an Attribute

I created a streaming tile just totting up transactions realtime.

When I add a new column the streaming tile still displays without issue. However when I go to create a new report or edit the existing report I used to pin to the dashboard from my dataset I start hitting issues. To see the new column in the dataset on the report editor I have to hit refresh. After doing this my data set and report is broken.

Editing the data set again, removing the column and hitting refresh doesn’t fix the issue. It seems I have to trash and recreate the data set.

How about ASA? Well ASA also had a problem. It didn’t stop the job but failed to send the transactions meaning the seemingly unaffected tile won’t update with any new data. Note the lack of column name in both the Power BI and ASA exception.

Remove an Attribute

What happens when we remove an attribute?

Well… the same thing as we might expect from trying to add a column.

Conclusion

This is a good change and consolidates the features of streaming data sets into Power BI for API’s and Stream Analytics. We gain some further control over datasets but it seems ASA is very tightly bound to the data set definition and in reality we can’t do much without breaking it.

It might mean that we can stream data in from another source not just ASA since we now have an endpoint we can use… I haven’t tried this yet.

All in all I’m still left a bit wanting in the mashing features for realtime data in Power BI. I’m sure it will get there… Microsoft are doing a great job… We are a tough bunch to please these days 🙂

Azure SSAS has hit preview and if you’re familiar with SSAS you’ll know it only works with Active Directory Integrated authentication. If you’re not an Active Directory spod this might throw you a little with it being an Azure PaaS service. Effectively it works in a similar way and requires you to create the service declaring an Azure Active Directory administration account. If you’ve created your own subscription what you use to login is not an Azure AD account and choosing that account won’t be an option.

So… A little on Azure AD. Azure AD wasn’t really intended to be a stand alone Active Directory, well not yet anyway. Essentially for cloud to work there had to be an authentication method that works well over the internet and Microsoft went for OAuth 2 which is basically what Azure AD uses to create secure tokens. Azure AD is effectively a cloud extension to traditional AD that we are familiar with at least on some level and is the organization account we use every day to logon onto our work machines.

Good news is you get basic Azure Active Directory for free. See the screen shot below.

In the screen shot there are 2 accounts:

Microsoft Account

Azure Active Directory Account

The Microsoft Account will be there already. That’s what was possibly used to create and use the subscription up till now. This is the personal account sign in when you go to sign into office 365 or the Azure portal. The personal account username will be whatever email address you used… e.g.

joe.blog@hotmail.com

When you look at the profile of the account in the next blade on the above image you’ll see the source of this account is declared Microsoft Account.

The other account is an Azure Active Directory account. This can be created in the blade above using “Add User”. It will have the following username convention.

username@joebloghotmail.onmicrosoft.com

If you want to assign your own domain to get rid of the one given to you by Azure you can but that isn’t free. I’ll leave that beyond the scope of this blog. There are brilliant courses on Pluralsight for Azure AD… Thoroughly recommend pluralsight… it’s awesome.

When we look at the profile of this account it’s source is declared as Azure ActiveDirectory. So to create a SSAS service we need an Azure Active Directory account to be an administrator so go ahead and create one. This account is what we know and sign in as an organizational account and you can indeed give it further privileges and use it to sign into the portal to provision and use other services.

Now we have an Azure AD account we’re ready to go…

When creating the SSAS service you’ll be asked for this type of account for an administrator. It won’t allow to select a Microsoft Account (personal account). Below we see the Microsoft Account is grayed out I can only select an Azure AD Account (organizational account).

Once the service is provisioned we can deploy to this services using SQL Server Data Tools (SSDT) and connect to it using SQL Server Management Studio (SSMS). These docs go into detail:

When connecting using SSMS normally with SSAS we just go straight in using our on premise AD account with integrated authentication. Because we’re signing in with our Azure AD Account we need to provide the account details that we have created in Azure. In order to connect we need a new option Active Directory Password Authentication that requires a registry entry to enable it (Connect using SSMS). If you’re building up advanced security models in SSAS then we can obviously create a number of users and use different sign-in’s to test the results.

Unless of course your on premise AD and Azure AD are hooked up via federation then you can go straight in with Active Directory Integrated Authentication without having to sign in.

In a working environment we obviously wouldn’t want to assign individual users. It would be better to create a groups e.g. SSAS Administrators, assign the groups and manage user access by assigning users to groups in Azure Active Directory.

I hope this helps. Not all BI developers have a handle on Active Directory especially the in’s and out’s with Azure. I only know what I know from app programming many moons ago and some hours spent on Pluralsight learning enough to get by and design good security models

I blogged recently on loading files using parameterised wildcards. Michael Rys kindly posted some feedback in my comments regarding parameters which was helpful and uncovered some important behavior to consider. Effectively when considering big data we want the optimizer to be as smart as possible when deciding what files to read since we pay for it (literally not just in performance!), at the same I want flexible and elegant code implementation frameworks.

So it turns out we can use dateparts and filter in the same job. We can post dateparts into extraction paths and follow it with a where a clause in separate statement. When this script compiles the execution ideally will only read extraction files that are covered by the where clause in the USQL statement that follows the extract. In fact when using a USQL predicate with a DateTime variable that’s exactly what it does. Essentially it doesn’t execute sequentially, we should expect the compiler to choose the best parallel query plan and only read the files it needs to.

So to provide a pattern that can load everything, year, month or date it might seem reasonable to do something like the following with C# types in order to provide a generic procedure.

Note: below I’ve configured the setup to load February, I have 1 file in Feb and 7 files in Jan:

When the job executes it loads all the files for January and February after which it does the filter which is not what I expected. 8 streams read all the files but the output only contains February. We effectively read 7 files in Jan that we didn’t need to.

Output only contains February:

So in conclusion we’ve read far more data than we intended. The following code however avoids this problem. Note the differences in the following code. I’ve effectively passed the day parts into different columns day, month and year. This means I can use a native datetime predicate in the where clause. The analytics engine doesn’t have to load all the data to determine the value of year, month and day using higher level .net execution; well that’s my guess. I need to read more on the internals underneath if published anywhere. Effectively there is a read price to pay running custom .net in this scenario.

This time we can see that the job only reads my single February file which is exactly what I want. I want the efficiency of selective reads and flexibility to load any part of the temporal hierarchy of my files in my blob storage.

In the code the filter variables day, month and year are still typed as datetime… however consider how 2017-02-18 gets split out into these variables:

Day 18-> defaults month to 01 and year to 0001 -> 0001–01–18T00:00:00.000000

Month 2 -> defaults day to 01 and year to 0001 -> 0001–02–01T00:00:00.000000

Year 2017-> defaults day and month to 01 -> 2017–01–01T00:00:00.000000

Note I’ve left the date part datetimes in the output so you can see how they come through:

The result with this pattern is that we effectively read only what we needed for the output and perhaps the moral of the story – avoid custom .net in the where clause when filtering down files that you want to read. Part of that solution is making sure your file has attributes it needs to perform specific predicate selections.

This blog is about the finer points of ingesting temporally stamped files from directory structures.

Consider the following directory structure. This kind of structure is typically created by Data Factory when ingesting data into blob storage in azure. The temporal schedule of extract is reflected in the document structure in the form yyyy->mm->dd. In this example we stop at day but it is possible to go down to minute:

Each leaf directory contains the corresponding data for that day. This data reflects a daily change data capture (CDC) extraction from some source. This means that the characteristics of the data are as follows:

a file in a leaf directory only contains data that has changed since the last extract

this approach stores historical changes across time therefore data at month or year level may contain duplicate members but some part of the data will be different

data at leaf level will not contain duplicates, duplicates can only occur across different days

the current position of the data will be distributed throughout all the leaf directories since the directory that holds the latest version for a given member will depend on when it was last changed

Each leaf directory contains the following files and data about actors. Note the file names also contain the temporal stamp with the format yyyy-MM-dd, this is deliberate:

With this in mind it would be possible to create a procedure that loads a consolidated table that could either contain history or just the current version. This procedure could work in the following ways:

Load all the data from the directories to provide a historical and current view of the information up to the current day – has to process all the data

Incrementally load the current day merging the delta into an existing table of data – just has to process days not already processed

Load a defined proportion of all the data up-to a provided point in time to create a consolidated view at a specific point in time – has to process a variable subset of data

I have patterns to do all 3 that I will blog about at some point. It’s just worth noting for now that 3 has the least value. It could be of some value in some specific use cases around creating snapshots for comparison; normally a full reload, incremental of a current or full historical consolidated view of data is adequate since the full historical view can be used to view any point in time. Also when thinking about this problem consider that this is file based nosql distributed batch processing system… row level updates and inserts aren’t available to you! Now think about how that will affect the incremental load pattern.

I’ve used the local debug execution environment to build and test my patterns. For this to work with the structure above I’ve changed my local data lake directory path to be my sourcecode directory. To do this in visual studio choose the menu [Data Lake] -> [Options and Setting…] and change the path:

So getting to the topic of discussion. I want to be able to:

Optionally load all of my data from the entire directory structure

Optionally load just 1 daily incremental leaf directory

Pass the incremental snapshot date in the folder and/or file name into the data so I can determine which is the latest version of a row or where in the historical view that version should be inserted

This is obviously the most simple thing you can do. In this snippet we specifically named the year, month and day in the path. This will specifically load the files in the targeted directory. Obviously you would want to parameterise year, month and day so the proc can be re-used for any leaf period.

The following specifies a wildcard to load all the files at the directory path. The filename is then passed into the dataset into a field called filename

{filename:*}

This gives us an output file that looks like the following. It meets our objective of just loading 1 incremental partition. Although we pushed the filename into the output data but we haven’t extracted the date.

The output data set includes all the data and again we see that we’ve pushed the filename into the output data set but haven’t extracted the date. Note the duplicates appearing because this is a complete historical concatenation. Also note how easy it is to concatenate the load of multiple files with a single extract statement, which on Data Lake in the cloud will utilize multiple nodes to load the data in parallel resulting in random interlaced ordering data in the output.

Date Parts

Review the following code. We can pass in date part variables to any part of the path. We can then pull the date into our result set in the same way as the filename. The date gets passed through as a strongly typed DateTime. This meets our objective of loading all the data and passing in the date from the directory path.

The output data set includes all the data. Note that using the date parts has achieved the goal of passing the date into the result data set from the directory path. However we can’t use this method for an incremental load because date parts effectively wildcard the directory path and will always load all of the data.

Filename Regex

Finally we get to my preferred method which is to use a regex to pull the date from the filename. I prefer this because I can pass wildcards or a specific period to do an incremental, partial or full load of files and still pull the date into the result set. It also breaks the dependency of the directory structure on date data lineage. Regardless of directory I always want the date partition in the filename for all sorts of maintenance and administration purposes. In fact having it broken out into a hierarchy of directory structures can be a bit a pain with the traditional directory and data brwosing tools currently available.

This example below shows the output with full load with the date passed through from the filename using a C# regex. We can easily do an incremental by adjusting the commented out block of variable initialisation of @year, @month and @day. This variables would be parameters in a productionised version of a proc allowing the same code to used for either a full or incremental load.

If you’re in the business of processing raw data files then the issue of choosing a file format should be pretty familiar territory. How do you know the file format isn’t a feature of the data itself? At which point the schema read comes crashing down like a ton of bricks. Ultimately you may have limited control over the protection of your chosen file format; how do you know what you have chosen now won’t be compromised in the future?

The Text() extractor and outputter provides an additional parameter to allow control over the file format delimiter. Below is a small experimental snippet showing the text extractor:

Column Delimiter

The escape character gets us out of a hole with column delimiters. Something to know is that if you use the escapeCharacter parameter then it won’t be used to escape the quote character (which is true by default); this is done using another quote e.g. “Some “”data”” with a quotes”

My go-to approach is just to always make use of the Text extractor, provide an escape character and avoid quotes for the following reasons:

Data factory supports the addition of an escape character that will automatically get injected when the file is written out, this provides an easy way to parse an escape character into the data

Data factory may also parse in escapes for quotes (I don’t know haven’t tried), it just seems like an overly verbose way of doing the same thing as escape characters except I can’t choose my character and the one I’m given is very common

I can change the file markup in a central place without editing all my loading procedures

I’d rather have an escape character and not need it than waste tedious amounts of time sorting out these issues later.

The escapeCharacter escapes itself so you don’t have to worry too much about what you choose, though you’ll probably do yourself some favours by choosing something obscure

With large amounts of data such issues aren’t going to be found until you start churning through lots of data at which point you may have a lot of code to change and pressing deadlines because you simply chose the wrong delimiter.

You could argue that choosing something obscure for your column delimiter is just as effective. However can you ever be 100% sure; also someone will have to look at this data to resolve issues. Choosing common supported delimiters ensures the data is easy to look at and troubleshoot.

Row Delimiter

With row delimiter we’re not so lucky. Unfortunately because it’s a big data platform and the processing of data is split across computation nodes it’s not possible to escape row delimiters. As a best practice approach I always now delete row delimiters from my data on the selection queries since it’s a relatively cheap operation.

The only other approach is to clean in data lake. This involves loading ragged files, looking for the ragged ends and cleansing out the delimiters for broken lines. Not impossible but a bit more coding involved albeit re-usable; after all nosql platforms shine with semi/non-structured data. Probably depends on the type of source since this kind of processing will be charged on Azure subscription; if it’s cheap and easy at source outside azure then why pay?

Interestingly the row delimiter can only be 1 character. /r/n is one of 3 defaults the others being /r and /n. So custom formats are limited to one character, since the following causes failure for being 2 characters.

Hierarchical Formats

There are of course hierarchical formats available such as json, xml and paraquet that are supported by data factory and other tools. Nosql platforms after all really shine with semi-structured data. However at the moment they’re not yet fully supported by data lake analytics.

U-SQL is highly extensible and custom extractors are available for hierarchical file formats. If traditional ETL was accused of one thing it would be that it’s too sensitive to schema changes. Hierarchical file formats compound this issue even further because it creates dependency between data entities at an asset (file) level.

If the source data by it’s original nature comes in as a hierarchical format then fair enough. If it’s coming from a RDBMS database then taking it to hierarchical structural format to get it back to flat format is taking 3 steps back and 1 step forward.

Centralising File Format

The code snippet above shows pushing in file format parameters using variables. We obviously have to maintain these values in every script or procedure. You could push file markup values in using parameters and then of course centralisation depends on what you’re using for orchestration. One way to centralise this is to pull the literal file markups values from shared a assembly:

Shared assemblies can be written in C# (or any .Net language) and registered at database level. This assembly could read a json or xml file stored in the data lake itself. Then it’s just a matter of tweaking your global file formats organised by source when you hit a format issue.