Tag: Azure Data Factory

Microsoft
Graph data connect (GDC) is a connector technology that allows an
organization to extract data in bulk from the Microsoft Graph. Using Azure Data
Factory, extraction jobs can be scheduled that can securely extract Graph
data while respecting an organization’s data control policies. On a scheduled
basis, GDC stages the data behind the scenes, and stores it in an Azure storage
account. The storage can either be Azure Blob storage, Azure Data lake Gen 1,
or Azure Data Lake Gen 2. This article describes a procedure to process the
output from GDC and store it in a Power BI dataflow.

Details on how to configure GDC can be found here,
and an excellent video tutorial here.

Azure Data Lake Gen 2 Storage

Azure
Data Lake Gen 2 (ADLG2) brings a hierarchical namespace to Azure Blob
storage. This storage system is designed for big data analytics and is highly
cost effective. It is one of the three data sink (destination) options for GDC,
and it is the required storage system for the “bring your own”, or external storage
option of Power BI Dataflows. Given that n ADLG2 account is required for the
Power BI Dataflows, it is logical to use the same account as the GDC data sink,
but it is not required.

In order to use an ADLG2 account for external storage with
Power BI dataflows, it must be in the same data center as the Power BI tenant.
The data center for a tenant can be determined by navigating to the Power BI
web application, selectin the “?” icon in the upper right, and then selecting
“About Power BI”.

In order to be able to use an external storage account for
Power BI dataflows, it MUST be created in the data center listed in “Your data
is stored in”.

Connecting Power BI to ADLGen2 Storage

When a dataflow is created in Power BI, it is stored in an
ADLG2 storage system managed by Microsoft. If Power BI is the only platform
that will access the data, this is perfectly adequate, but an organization may
wish to use the data with other tools. If this is the case, a Power BI tenant
can be connected to an ADLG2 account that is accessible to other tools. A
workspace administrator can then decide to have all the dataflows in that
workspace store their data in the custom storage account. These are known as
“external dataflows”. Dataflows are all stored in Common Data Model (CDM)
folders which are described in detail here.

Detailed instructions on configuring external dataflow storage for Power BI can be found here . The process consists of several steps. It should be noted that as of this writing, external dataflows are in preview, and these steps could change.

If one does not already exist, create an ADLG2
account in the same tenant as Power BI

In Azure, Grant the Reader role to the Power BI
service identity for the account in #1

Create a file system for Power BI. The file
system MUST be named “powerbi”

Using Azure Storage Explorer, grant file system
access to three Power BI service principals, Power BI Premium, Power BI
Service, and Power Query Online (see the above link for details)

As of this writing, step #5 above is irreversible. Care
should be taken with its name.

Once configured, a workspace administrator can assign their
workspace to their external storage. This setting is a property of the
workspace, and can be accessed via its settings with the “Storage” tab.

Once this setting has been enabled, all dataflows will be
stored in external storage. A folder is created within the file system created
in step #3 above with the name of the workspace. Each dataflow in the workspace
will be added within that folder, and each entity of the dataflow as a folder
of its own. The dataflow folder will contain a file named model.json which
describes the entities, and the entity folders contain multiple csv files which
house the data itself. Within Azure Data Explorer, the structure appears as
below.

Azure Data Lake Gen 2 account (connected to the Power BI tenant)

File system created for Power BI dataflows (always named powerbi)

Workspace folder

Dataflow folder

JSON file describing the dataflow

Entity folder containing entity data

Once configured, Power Query Online (part of the process of
creating a dataflow) can be used to acquire and transform data. The data will
be stored in these folders according to the Common Data Model specification and
can be accessed by other applications. However, the reverse of this is also
true. Any CDM folder that is stored in the Power BI connected file system can
be connected to Power BI as an external dataflow. The process for doing this is
described here.
The order of operations is important. The user that will make the connection
needs to be granted access to the CDM folder before it is populated with data.

An external dataflow is read only with respect to Power BI
(Power BI only sees the data; it does not transform it). The goal is therefore
to transform the data created by Graph data connect into the CDM format. Azure
Databricks provides support for doing so.

Azure Databricks

Azure Databrick is a suite of serverless big data
technologies that encompass Hadoop, Apache Spark, SQL, Python and Scala
technologies. Databricks clusters can be created and used when needed and
discarded or suspended when not as needed. A discussion of how to create and
use Databricks is beyond the scope of this post, but there is a great deal of
documentation on it here. In
addition, Microsoft provides a free 14-day trial of Azure Databricks.

Databricks is particularly useful in this scenario, as it
has libraries that support Azure Data Lake Gen 2, and libraries that support
the Common Data Model. Databricks notebooks can be called from Azure Data
Factory, so that when a GDC extraction job is completed, the resulting files
can be processed with Databricks to populate the CDM folders.

An excellent tutorial on using Databricks with dataflows and
CDM folders can be found on GitHub here.
The scenario in the tutorial involves using dataflows to produce data instead
of consuming it, but it does cover off several important concepts. The tutorial
is part of the project that includes the CDM
library for Databricks which is used to transform GDC data into CDM
folders.

As of this writing, the CDM library requires a Databricks 4.3.x-scala2.11
cluster. This is an older configuration that is not available to the standard user
interface when creating a Databricks cluster. Subsequent versions of the CDM
library will most likely support newer clusters, but at present, it is
necessary to take a few additional steps during cluster creation.

From the cluster creation UI, specify window.prefs.set(“enableCustomSparkVersions”,
true) in the browser debug console, and then navigate to the cluster page, and
specify the image tag below. Refresh the browser and then
4.3.x-scala2.11 will be listed as a custom version.

Once a cluster has been created, and the CDM library loaded
into it, a notebook can be created to process the GDC data. Processing consists
of four main steps. Connecting Databricks to ADLG2, Reading the JSON files from
GDC, extracting the desired data into dataframes, and writing the data out to
CDM folders.

Connecting Databricks to ADLG2

The recommended way to connect Databricks to ADLG2 storage
is through a Service Principal. The same principal that GDC itself uses can be
used, and if the same ADLG2 account is being used, no further configuration is
necessary.

Databricks will need to read from the file system that
houses the GDC data. Several lines of code (Python) in a Databricks notebook
will establish the required connections:

Once connected, files in the GDC folder can be listed using
the built in dbutils library:

dbutils.fs.ls(filesystem + “/GDCFolderName”)

While the above and below examples shows account names and
keys being explicitly defined in the notebook, this is not recommended beyond
any testing or demonstration environments. Instead, it is recommended to store
such secure strings in Azure Key Vault and retrieve them at runtime. For
instructions on how this is done, see the document Secret
Scopes.

Reading JSON Files

Databricks can read all JSON files in a folder (as well as other text-based formats) into a dataframe. A dataframe is an in-memory table that can be hierarchical and queried via standard SQL commands. The schema of the dataframe will be implied through the structure of the JSON files contained within. To load all of the GDC JSON files from a particular folder into a dataframe, the following line of Python can be used:

contactbasedf = spark.read.json(filesystem + “/Contacts Folder”)

The read is recursive, which means that subfolders are
interrogated as well. GDC folders typically contains a metadata folder with
files of differing schemas than the data files themselves. For this reason, it
is a good idea to move the data files to a dedicated folder before reading them
into a dataframe. This can be done with the dbutils.fs.mv command.

Extracting the desired data

Once the files have been read into a dataframe, the dataframe can be saved to a temporary table. This table can be queried through standard SQL commands. For example, the query creates a temporary table from the initial dataframe (contactbasedf) that was created by reading JSON files created by GDC for organizational contacts. The relevant details are then queried and saved into another dataframe, named df1 in this case.

Writing to CDM folders

Once the CDM libraries are loaded into a Databrick cluster, writing data to them is a relatively simple method call from a dataframe. The call itself requires several parameters, and those parameters are:

cdmModelName – The name of the Model (dataflow) that houses all entities

entity – the name of the entity within the dataflow (a dataflow can contain multiple entities or tables)

cdmFolder – The folder in ADLG2 to save the model.

appId – The service principal ID of an application with Blob Contributor access to the ADLG2 account

appKey – The secret key for the appId specified above

tenantId – The tenant ID for the ADLG2 account

Using the dataframe defined above, the contents of the dataframe can be written out to the CDM folder with the following (Python) code:

The above code will output the contents of the dataframe to an entity named “Contacts” in a model named “AllContacts” stored in a folder named “AllContacts” within the workspace folder specified in the “Workspace” variable.

Creating an external Dataflow

Once the GDC data has been written to a CDM folder, it can
be connected to Power BI as an external dataflow. In order to do so, as
mentioned above, the user making the connection must have explicit access to
the model folders.

From a Power BI V2 workspace (V1 workspaces are not
supported), go to the dataflows tab, and select Create – Dataflow from the
toolbar. If Power BI has been connected to the ADLG2 storage, and the workspace
has been configured for external storage, the “Attach a Common Data Model
folder” option should appear.

Selecting “Create and attach” brings up the Attach Common
Data Model folder dialog box, where two items must be entered.

The Name of the dataflow is the name with respect to Power
BI. It can be completely different than the name of the model folder, or the
internal name of the model created above, but it’s likely a good idea to keep
it consistent. The CDM folder path is actually the absolute path to the
model.json file that describes the model, and it’s vital that model.json be
included at the end of the path. Failing to do so will result in an error.

Finishing Up

Once completed, Power
BI Desktop can be used to connect to the external dataflow, just like any other
dataflow. The only difference is that external dataflows are not refreshed in
the Power BI service, but will be updated by Databricks. The same Azure Data
Factory jobs that extract data from Graph data connect can be used to call into
the Databricks notebooks when the data has been extracted.

If you are interested in a product that leverages the data
produced by Graph data connect, I would be remiss if I did not suggest our
tyGraph for Exchange, which is currently in preview. It combines all of the
technology listed above with a rich set of reports in concert with other Office
365 workloads. If you are interested, please contact me directly, or email sales@tygraph.com .

Like this:

I received a notice from my main Power BI tenant last night that a new version of the Data Management Gateway was available. The previous (1.2) version contained some very significant changes so I was understandably eager to have a look. I installed it, observed a relatively attractive setup interface, then opened the release notes to find out what else was new. Only four items were listed (from the release notes).

I had already observed number two, the new setup experience. Bug fixes, while absolutely necessary, aren’t necessarily something to write about, but I think that the other two items are. While they may not have immediate impact, my bet is that they will in very short order.

The key point here is that the gateway now supports the Azure Data Factory. There are many, many things that the data factory enables (Hadoop anyone), but the one that I feel is most relevant to Power BI today is the ability to connect directly to on premises data sources. That’s not quite how it’s been done until now.

Power BI for Office 365

In the context of Power BI as we’ve come to know it today, on-prem data refreshes are handled by the Data Management Gateway. On a periodic basis (daily at most) the service contacts the gateway, which in turn reruns all relevant queries. The resultant data is then uploaded to the service.

The service in turn packages the data and updates the host Excel workbook, and the model is transferred into a back end analysis server. Every transaction goes through the host Excel workbook.

Power BI Dashboards

If you’ve had a chance to see the preview of Power BI Dashboards, you may have noticed that it is not dependent on Office 365 or Excel at all. When you add a data source, you take the date and add it to a cloud based data model directly (presumably backed by SQL Server Analysis Services). All visualization work against these models, with one very important exception. If you connect to a SQL Server Analysis Services Data source you are actually connecting directly to a model hosted on an on-prem SSAS server in real time.

How is this done? The connection is made through the “Analysis Services Connector”, which is a separate bit of software installed on prem to facilitate connection between the Power BI Dashboards service and the On-Prem SSAS server. It’s available directly from the dashboards portal.

After installing it, a process that establishes for dashboard and SSAS credentials, it can be reconfigured by running the “Power BI Analysis Services Connector” tool.

However, installation also adds another piece of software to the host machine. The Microsoft Data Management Gateway. This version of the DMG establishes the connection between the SSAS server and the Power BI service in real time. Up until now, the DMG didn’t work this way, so which version is it?

Until now, the most recent version of the DMG was 1.2, so this Dashboards preview contained a glimpse into the next generation Data Management Gateway that provided some intriguing new capabilities.

Coming Together

Checking into the latest version of the Data Management Gateway from Office 365, we see:

This version is newer that that included in the Dashboards Preview, and presumably includes everything from it. The key phrase in the release notes to me is therefore “Unified Binary”. One gateway to rule them all, if you will. Does this mean that we’ll be able to connect to on-prem data in real time from Office 365 as well as from the Power BI preview? I don’t know how, but I bet that the building blocks are now there.

The latest version may not include support for any new data sources, or any new bells and whistles, but it’s likely worth setting up for new capabilities that will hopefully show up sooner rather than later.