Ust Oldfield's Blog

A couple of weeks ago I wrote a blog post detailing how to process a tabular model from a stored procedure. The challenge there was to have a way of processing a tabular model on demand from a web application. The challenge still exists, but how do you achieve the same in Azure using Platform as a Service (PaaS) offerings, which do not have the benefit of a full server and components such as msdb and SQL Agent to aid in the process?

In this post, I’m going to show you how to process a tabular model using only Azure PaaS offerings. Not only am I going to show you how to process a tabular model on-demand, but also how to process a tabular model on a schedule. This post has taken inspiration and part of its code base from a Microsoft blog: Automating Azure Analysis Services processing with Azure Functions.

Azure Functions

Before we begin properly, it’s worth spending some time introducing Azure Functions. According to Microsoft, Azure Functions are:

…a solution for easily running small pieces of code, or "functions," in the cloud. You can write just the code you need for the problem at hand, without worrying about a whole application or the infrastructure to run it. Functions can make development even more productive, and you can use your development language of choice, such as C#, F#, Node.js, Java, or PHP. Pay only for the time your code runs and trust Azure to scale as needed. Azure Functions lets you develop serverless applications on Microsoft Azure.

They are super useful for extending the capabilities of any solution and not just limited to what we’re going to cover here.

On-Demand Refresh

The use of Azure Functions creates a trigger for the on-demand refresh of a tabular model from a web application or webhook. This is to make sure that selected elements of data in a tabular model, for example hot partitions, are always up to date. The following describes the process that Azure Functions will be involved in for this scenario:

The steps that are needed to create an Azure Function for On-Demand Refresh are as follows:

1) Create an Azure Function App

Navigate to the Azure Portal and create a Function App (the name changes quite a bit, so don’t be concerned if it’s not exactly displayed as it is below)

2) Create a new Function

After you’ve created the Function App, we need to add a new Webhook + API function, which we’ll use as the basis for our on-demand refresh. Click on the + button next to Functions, select Webhook + API, choose C# as your language and click Create this function.

3) Configure the Function

Download the latest client libraries for Analysis Services. This needs to be done on your local machine so you can then copy the files to your Azure Function App. After you’ve downloaded the client libraries, the DLLs can be found in C:\Program Files (x86)\Microsoft SQL Server\140\SDK\Assemblies. The two files you need are Microsoft.AnalysisServices.Core.DLL and Microsoft.AnalysisServices.Tabular.DLL.

The documentation references the 130 assemblies, which is not correct and will not work. You need the assemblies in 140.

In order to add these assemblies to your function, you’ll need to add a folder called “bin”. To do this, select your Function App, click Platform features, and under Development Tools, click Advanced tools (Kudu).

In Kudu, click Debug console and select Cmd. Navigate to the site\wwwroot\OnDemandHttpTrigger folder and add the “bin” folder here by clicking the + button.

Once you’ve added the “bin” folder, go back over to the Azure portal and select your function OnDemandHttpTrigger.

On the right under View files, navigate to the bin folder. Then click the Upload button to add the two previously mentioned DLLs to the bin folder.

You should see the two DLLs in your bin folder now.

4) Add Azure Analysis Services Connection String to the Function App

This step is going to add the connection string for the Azure Analysis Services (AAS) server to the entire Function App, not just to individual functions.

Click the name of your Function App, then select Platform features. Select Application settings under General Settings.

Now we need to add our connection string under the Connection strings section. You’ll need your AAS server name and a user ID and password that has access to the AAS database.

You can find your AAS server name by navigating to your AAS database in the Azure portal and copying the value constructed after clicking Show server connection strings:

Back in the screen for the Function App, fill in the Name textbox with a name for your connection string and paste your connection string in the Value text box:
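
As a rough guide, the connection string will look something like the line below; the region, server name, user and password are placeholders, and the Data Source value is the one copied from Show server connection strings.

Data Source=asazure://westeurope.asazure.windows.net/myaasserver;User ID=serviceaccount@mydomain.com;Password=<password>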

Click Save at the top to save these settings for your Functions.

5) Time for Code

Our Function App has been configured; now we need to add code to the function. The function comes with some template code so you can test that it works, but we don’t need everything that is on offer.
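
As a guide, the body of the function can be put together along the lines of the sketch below, which borrows from the Microsoft blog referenced at the start. The connection string name ("AzureAsConnection") and the database name ("AdventureWorks") are placeholders for your own values, and the "status" parameter is simply used to check that the caller actually asked for a refresh.

#r "Microsoft.AnalysisServices.Tabular.DLL"
#r "Microsoft.AnalysisServices.Core.DLL"
#r "System.Configuration"

using System;
using System.Configuration;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.AnalysisServices.Tabular;

public static async Task<HttpResponseMessage> Run(HttpRequestMessage req, TraceWriter log)
{
    log.Info($"C# HTTP trigger function started at: {DateTime.Now}");

    // Read the "status" value from the query string, falling back to the request body
    string status = req.GetQueryNameValuePairs()
        .FirstOrDefault(q => string.Compare(q.Key, "status", true) == 0)
        .Value;
    if (status == null)
    {
        dynamic body = await req.Content.ReadAsAsync<object>();
        status = body?.status;
    }

    if (status != null)
    {
        try
        {
            Server asSrv = new Server();
            // "AzureAsConnection" is a placeholder: use the connection string name you created in step 4
            var connStr = ConfigurationManager.ConnectionStrings["AzureAsConnection"].ConnectionString;
            asSrv.Connect(connStr);

            // "AdventureWorks" is a placeholder: use your own tabular database name
            Database db = asSrv.Databases["AdventureWorks"];
            Model m = db.Model;
            m.RequestRefresh(RefreshType.Full); // mark the whole model for refresh
            //m.Tables["Date"].RequestRefresh(RefreshType.Full); // or mark only one table, e.g. a hot partition
            db.Model.SaveChanges(); // commit, which executes the refresh
            asSrv.Disconnect();
        }
        catch (Exception e)
        {
            log.Info($"C# HTTP trigger function exception: {e.ToString()}");
            return req.CreateResponse(HttpStatusCode.InternalServerError, e.ToString());
        }

        return req.CreateResponse(HttpStatusCode.OK, "Model refresh requested");
    }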

    return req.CreateResponse(HttpStatusCode.BadRequest, "Please pass a status on the query string or in the request body");
}

Click the Save button at the top.

6) Test, Test, Test

Click the Run button at the top to test the function

The function can also be tested in a web browser, and can be called by a Web App using the POST HTTP method.
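
As an illustration, a web app could call the function along the lines of the sketch below; the function URL (including the code key) is a placeholder to be replaced with the value shown under Get function URL in the portal, and the "status" value matches the parameter checked in the function above.

using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class TabularRefreshClient
{
    // Placeholder URL: copy the real one from "Get function URL" in the portal
    private const string FunctionUrl =
        "https://myfunctionapp.azurewebsites.net/api/OnDemandHttpTrigger?code=<function-key>";

    public static async Task RequestRefreshAsync()
    {
        using (var client = new HttpClient())
        {
            // The function looks for a "status" value in the query string or the request body
            var content = new StringContent("{ \"status\": \"refresh\" }", Encoding.UTF8, "application/json");
            var response = await client.PostAsync(FunctionUrl, content);
            response.EnsureSuccessStatusCode();
        }
    }
}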

Now we have a fully functioning method of refreshing an Azure Analysis Services tabular model on-demand.

Scheduled Refresh

The use of Azure Functions creates a trigger for the scheduled refresh of a tabular model. This is to make sure that the entire tabular model has the latest data and is always up to date. The following describes the process that Azure Functions will be involved in for this scenario:

The steps that are needed to create an Azure Function for Scheduled Refresh are as follows:

1) Create a Function

We’ve created our Function App, and now we need to add a new Timer Trigger function, which we’ll use as the basis for our scheduled refresh. Click on the + button next to Functions, select Timer, choose C# as your language and click Create this function.

2) Configure the Timer

What use is a timer without a schedule? To give the timer a schedule, click Integrate, set the schedule and click Save.

The Schedule text box expects a CRON expression to define the days and times that the function should execute. Click the little Documentation button on the screen above to read about CRON expressions. The schedule I’ve set is to run once every day at 09:30.
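
For example, the timer trigger uses a six-field CRON expression (second, minute, hour, day, month, day of week), so a run once a day at 09:30 looks like this:

0 30 9 * * *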

3) Configure the Function

See step 3 of the On-Demand Function for detailed steps. You’ll need to create the bin folder and upload the DLLs to the bin folder.

4) Time for Code

We’ve configured our function, so now it’s time to add the code. The code base is much simpler than the On-Demand code base, mainly because it’s doing fewer tasks. But the AMO section is exactly the same. Paste in the following code:
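
The opening of the listing follows the same AMO pattern as the On-Demand function. As a guide it can look something like the sketch below; the connection string name ("AzureAsConnection") and database name ("AdventureWorks") are again placeholders for your own values.

#r "Microsoft.AnalysisServices.Tabular.DLL"
#r "Microsoft.AnalysisServices.Core.DLL"
#r "System.Configuration"

using System;
using System.Configuration;
using Microsoft.AnalysisServices.Tabular;

public static void Run(TimerInfo myTimer, TraceWriter log)
{
    log.Info($"C# Timer trigger function started at: {DateTime.Now}");
    try
    {
        Server asSrv = new Server();
        // "AzureAsConnection" is a placeholder: use the connection string name created earlier
        var connStr = ConfigurationManager.ConnectionStrings["AzureAsConnection"].ConnectionString;
        asSrv.Connect(connStr);

        // "AdventureWorks" is a placeholder: use your own tabular database name
        Database db = asSrv.Databases["AdventureWorks"];
        Model m = db.Model;
        m.RequestRefresh(RefreshType.Full); // Mark the whole model for refresh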

        //m.Tables["Date"].RequestRefresh(RefreshType.Full); // Mark only one table for refresh
        db.Model.SaveChanges(); // commit, which will execute the refresh
        asSrv.Disconnect();
    }
    catch (Exception e)
    {
        log.Info($"C# Timer trigger function exception: {e.ToString()}");
    }
    log.Info($"C# Timer trigger function finished at: {DateTime.Now}");
}

Click the save button at the top.

5) Test, Test, Test

Click the Run button at the top to test the function

Now we have a fully functioning method of refreshing an Azure Analysis Services tabular model on a schedule.

Conclusion

I have shown you how simple it is to invoke two methods of refreshing a tabular model using Azure Functions: an on-demand refresh and a scheduled refresh. I hope you take inspiration and use these methods in your own work with both Azure Analysis Services and Azure Functions.

I ran into a problem setting up on-premises data gateways at a client recently, whereby they had two domains with a one-way trust. The result was that, when authenticating within the Power BI Service to retrieve data from on-premises data sources in the untrusted domain, it would throw an error.

At this point it is worth spending some time explaining the architecture.

The Architecture

The architecture might be familiar to many who use Power BI and the on-premises data gateway, with a little caveat. Domain 1 is the main domain. Domain 2 is the secondary domain and trusts Domain 1. Domain 1, on the other hand, doesn’t trust Domain 2.

A user in Domain 1 can access data sources in both Domain 1 and Domain 2. They can create their Power BI reports with a live connection or direct query and publish them to the Power BI Service. In order to use the reports in the service, on-premises data gateways need to be established to provide a messaging service between on-premises and the cloud. In this example, each domain has a domain controller, a tabular server and an on-premises data gateway for each tabular server.

The Problem

When a user logged on to the Power BI Service tries to access data from Domain 2, their credentials are passed down to the on-premises data gateway and checked against the domain controller in Domain 2, and an error is returned to the Power BI Service.

What I think happens is that the user (User.One@Domain1.com) will have their credentials passed down through the on-premises data gateway to the domain controller in Domain 2. Either the domain controller will not be able to find the user, because it is in the untrusted domain, and so cannot pass the short name (DOMAIN1\USERONE) to the tabular server; or it tries to check with the domain controller in Domain 1, runs into the dreaded Kerberos double-hop problem, and cannot return the short name. Either way, the result is the same: the short name cannot be passed to the tabular server.

The Solution

As you can imagine, there are a few solutions to the problem.

If it is a Kerberos related issue, then Kerberos will have to be configured separately

Make Domain 2 a trusted domain

User mapping in Power BI Service

This latter approach is the one I opted for because it was guaranteed to work and would not change the current domain and network configuration.

In the gateway settings in the Power BI Service, I went to the Users tab under my data source and clicked on Map user names. In there I mapped users in Domain 1 to users in Domain 2.

If you have a large number of users, individual mapping might not be preferable or feasible, which is why you can replace the Domain names in part of the user string, as in example 3. This, however, does rely upon users in Domain 1 having an equivalent account in Domain 2. This is not always the case, for which the wildcard to service account would work, as shown in example 4.
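
To illustrate the options above (the user names here are placeholders), the mapping rules take roughly this shape:

1. User.One@Domain1.com mapped to User.One@Domain2.com (an individual user mapped to their equivalent account in Domain 2)

2. User.Two@Domain1.com mapped to User.Two@Domain2.com (another individual mapping)

3. @Domain1.com replaced with @Domain2.com (swaps the domain part of the user string, so every Domain 1 user is mapped to their equivalent Domain 2 account)

4. * mapped to ServiceAccount@Domain2.com (a wildcard that maps everyone to a single service account in Domain 2)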

The Azure Data Lake has just gone into general availability, and the management of Azure Data Lake Store in particular can seem daunting, especially when dealing with big data. In this blog, I will take you through the risks and challenges of working with data lakes and big data. Then I will take you through a framework we’ve created to help best manage these risks and challenges.

Risks and Challenges of Big Data and Data Lake

The challenges posed by big data are as follows:

Volume – is the sheer amount of data becoming unmanageable?

Variety – Structured tables? Semi-structured JSON? Completely unstructured text dumps? We can normally manage with systems that contain just one of these, but if we’re dealing with a huge mix, it gets very tricky

Velocity – How fast is the data coming in? And how fast do we need to get it to the people who need it?

Veracity – How do we maintain accuracy when the data is of varying volumes, the sources and structures are different, and the speeds at which they arrive in the lake differ?

Managing all four simultaneously is where the challenges begin.

It is very easy to treat a data lake as a dumping ground for anything and everything. Microsoft’s sales pitch says exactly this – “Storage is cheap, Store everything!!”. We tend to agree – but if the data is completely malformed, inaccurate, out of date or completely unintelligible, then it’s no use at all and will confuse anyone trying to make sense of the data. This will essentially create a data swamp, which no one will want to go into. Bad data & poorly managed files erode trust in the lake as a source of information. Dumping is bad.

There is also data drowning – as the volume of the data tends towards the massive and the velocity only increases over time, we are going to see more and more information available via the lake. When it gets to that point, if the lake is not well managed, then users are going to struggle to find what they’re after. The data may all be entirely relevant and accurate, but if users cannot find what they need then there is no value in the lake itself. Essentially, data drowning is when the amount of data is so vast you lose the ability to find what’s in there.

If you ignore these challenges and treat the lake like a dumping ground, you will have contaminated your lake and it will no longer be fit for purpose.

If no one uses the Data Lake, it’s a pointless endeavour and not worth maintaining.

Everyone needs to be working together to ensure the lake stays clean, managed and good for a data dive!

Those are the risks and challenges we face with Azure Data Lake. But how do we manage it?

The Framework

We’ve carved the lake up into different sections. The key point is that the lake contains all sorts of different data – some that’s sanitised and ready to be consumed by the business user, and some that’s indecipherable raw data that needs careful analysis before it is of use. By ensuring data are carefully managed, you can instantly understand the level of preparation that the data has undergone.

Data flows from left to right – the further left areas represent where data has been input directly from source systems. The horizontal sections describe the level of preparation – Manual, Stream and Batch.

Stream – The data here flows in semi-real time, coming from event hubs and being landed after processing through stream-specific tools such as Streaming Analytics. Once landed, there is no further data processing – the lake is essentially a batch processing tool.

Batch – This is more traditional data processing, the kind of “ETL” seen by many BI developers. We have a landing area for our raw data, a transitional area where data is cleaned, validated, enriched and augmented with additional sources and calculations, before finally being placed in a curated area where it is ready for consumption by the business.

We’re taking the blank-canvas of the Data Lake Store and applying a folder structure, a file management process and a curation process over the top.

The folder structure itself can be as detailed as you like; we follow a specific structure ourselves:

The Raw data area, the landing place for any files entering the lake, has sub-folders for each source of data. This allows for the easy browsing of the data sources within the Lake and ensures we are not receiving the same data twice, even if we use it within different systems.

The Enriched and Curated layers, however, have a specific purpose in mind. We don’t take data and enrich/clean/process it without a business driver; it’s not something we do for fun. We can therefore assign a project or system name to it, as at this point it is organised into these end systems. This means we can view the same structure within Enriched as within Curated.

Essentially Raw data is categorised by Source whilst Enriched and Curated data is categorised by Destination.
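
As a simple illustration of that principle (the source and project names below are placeholders rather than a prescribed structure), the top levels of the store might look like this:

Raw/SourceSystemA
Raw/SourceSystemB
Raw/SensorFeeds
Enriched/SalesReporting
Enriched/CustomerAnalytics
Curated/SalesReporting
Curated/CustomerAnalytics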

There’s nothing complicated about the Framework we’ve created or the processes we’ve ascribed to it, but it’s incredibly important that everyone is educated on the intent of it and the general purpose of the data lake. If one user doesn’t follow process when adding data, or an ETL developer doesn’t clean up test files, the system starts to fall apart and we succumb to the challenges we discussed at the start.

Data Lakes are the new hot topic in the big data and BI communities. Data Lakes have been around for a few years now, but have only gained popular notice within the last year. In this blog I will take you through the concept of a Data Lake, so that you can begin your own voyage on the lakes.

What is a Data Lake?

Before we can answer this question, it's worth reflecting on a concept which most of us know and love - Data Warehouses. A Data Warehouse is a form of data architecture. The core principle of a Data Warehouse isn't the database; it's the data architecture which the database and tools implement. Conceptually, the condensed and isolated features of a Data Warehouse are around:

1. Data acquisition

2. Data management

3. Data delivery / access

A Data Lake is similar to a Data Warehouse in these regards. It is an architecture. The technology which underpins a Data Lake enables the architecture of the lake to flow and develop. Conceptually, the architecture of a Data Lake wants to acquire data; it needs careful, yet agile management; and the results of any exploration of the data should be made accessible. The two architectures can be used together, but conceptually the similarities end here.

Fundamentally, a Data Lake is not just a repository. It is a series of containers which capture, manage and explore any form of raw data at scale, enabled by low-cost technologies, from which multiple downstream applications can access valuable insight which was previously inaccessible.

How Do Data Lakes Work?

Conceptually, a Data Lake is similar to a real lake - water flows in, fills up the reservoir and flows out again. The incoming flow represents multiple raw data formats, ranging from emails, sensor data, spreadsheets, relational data, social media content, etc. The reservoir represents the store of the raw data, where analytics can be run on all or some of the data. The outflow is the analysed data, which is made accessible to users.

To break it down, most Data Lake architectures come as two parts. Firstly, there is a large distributed storage engine with very few rules/limitations. This provides a repository for data of any size and shape. It can hold a mixture of relational data structures, semi-structured flat files and completely unstructured data dumps. The fundamental point is that it can store any type of data you may need to analyse. The data is spread across a distributed array of cheap storage that can be accessed independently.

There is then a scalable compute layer, designed to take a traditional SQL-style query and break it into small parts that can then be run massively in parallel because of the distributed nature of the disks.

In essence – we are overcoming the limitations of traditional querying by:

· Separating compute so it can scale independently

· Parallelizing storage to reduce impact of I/O bottlenecks

There are various technologies and design patterns which form the basis of Data Lakes. In terms of technologies these include:

· Azure Data Lake

· Cassandra

· Hadoop

· S3

· Teradata

With regards to design patterns, these will be explored in due course. However, before we get there, there are some challenges which you must be made aware of. These challenges are:

1. Data dumping - It's very easy to treat a data lake as a dumping ground for anything and everything. This will essentially create a data swamp, which no one will want to go into.

2. Data drowning - the volume of the data could be massive and the velocity very fast. There is a real risk of drowning by not fully knowing what data you have in your lake.

These challenges require good design and governance, which will be covered off in the near future.

Hopefully this has given you a brief, yet comprehensive high-level overview of what data lakes are. We will be focusing on Azure Data Lake, which is a managed implementation of the Hadoop architecture. Further reading on Azure Data Lake can be found below.

Further Reading

In order to know more about Data Lakes, the following resources are invaluable.