
Microsoft Graph data connect is a new technology for extracting data in bulk from the Microsoft Graph. This article outlines how this data can be transformed with Databricks and loaded into Power BI dataflows.


Microsoft
Graph data connect (GDC) is a connector technology that allows an
organization to extract data in bulk from the Microsoft Graph. Using Azure Data
Factory, extraction jobs can be scheduled that can securely extract Graph
data while respecting an organization’s data control policies. On a scheduled
basis, GDC stages the data behind the scenes, and stores it in an Azure storage
account. The storage can be Azure Blob storage, Azure Data Lake Gen 1,
or Azure Data Lake Gen 2. This article describes a procedure to process the
output from GDC and store it in a Power BI dataflow.

Details on how to configure GDC can be found here,
and an excellent video tutorial here.

Azure Data Lake Gen 2 Storage

Azure
Data Lake Gen 2 (ADLG2) brings a hierarchical namespace to Azure Blob
storage. This storage system is designed for big data analytics and is highly
cost effective. It is one of the three data sink (destination) options for GDC,
and it is the required storage system for the “bring your own”, or external storage
option of Power BI Dataflows. Given that an ADLG2 account is required for the
Power BI Dataflows, it is logical to use the same account as the GDC data sink,
but it is not required.

In order to use an ADLG2 account for external storage with
Power BI dataflows, it must be in the same data center as the Power BI tenant.
The data center for a tenant can be determined by navigating to the Power BI
web application, selecting the “?” icon in the upper right, and then selecting
“About Power BI”.

In order to be able to use an external storage account for
Power BI dataflows, it MUST be created in the data center listed in “Your data
is stored in”.

Connecting Power BI to ADLGen2 Storage

When a dataflow is created in Power BI, it is stored in an
ADLG2 storage system managed by Microsoft. If Power BI is the only platform
that will access the data, this is perfectly adequate, but an organization may
wish to use the data with other tools. If this is the case, a Power BI tenant
can be connected to an ADLG2 account that is accessible to other tools. A
workspace administrator can then decide to have all the dataflows in that
workspace store their data in the custom storage account. These are known as
“external dataflows”. Dataflows are all stored in Common Data Model (CDM)
folders which are described in detail here.

Detailed instructions on configuring external dataflow storage for Power BI can be found here. The process consists of several steps. It should be noted that as of this writing, external dataflows are in preview, and these steps could change.

1. If one does not already exist, create an ADLG2
account in the same tenant as Power BI

2. In Azure, grant the Reader role to the Power BI
service identity for the account in #1

3. Create a file system for Power BI. The file
system MUST be named “powerbi”

4. Using Azure Storage Explorer, grant file system
access to three Power BI service principals: Power BI Premium, Power BI
Service, and Power Query Online (see the above link for details)

As of this writing, step #5 above is irreversible. Care
should be taken with its name.

Once configured, a workspace administrator can assign their
workspace to their external storage. This setting is a property of the
workspace, and can be accessed via its settings with the “Storage” tab.

Once this setting has been enabled, all dataflows will be
stored in external storage. A folder is created within the file system created
in step #3 above with the name of the workspace. Each dataflow in the workspace
will be added within that folder, and each entity of the dataflow as a folder
of its own. The dataflow folder will contain a file named model.json which
describes the entities, and the entity folders contain multiple csv files which
house the data itself. Within Azure Storage Explorer, the structure appears as
below.

Azure Data Lake Gen 2 account (connected to the Power BI tenant)
  File system created for Power BI dataflows (always named powerbi)
    Workspace folder
      Dataflow folder
        JSON file describing the dataflow
        Entity folder containing entity data
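To make the role of model.json a little more concrete, here is a minimal Python sketch that reads a model.json document and lists each entity along with its data files. The sample document is trimmed and purely illustrative (the account, workspace and entity names are invented), and the field names `entities`, `partitions` and `location` reflect my understanding of the CDM folder metadata format, so check them against a model.json from your own workspace.

```python
import json

# Trimmed, illustrative model.json content. The model, entity and account
# names are hypothetical; a real file carries more metadata per entity.
SAMPLE_MODEL_JSON = """
{
  "name": "AllContacts",
  "version": "1.0",
  "entities": [
    {
      "$type": "LocalEntity",
      "name": "Contacts",
      "attributes": [
        {"name": "displayName", "dataType": "string"},
        {"name": "emailAddress", "dataType": "string"}
      ],
      "partitions": [
        {
          "name": "part-00000",
          "location": "https://myaccount.dfs.core.windows.net/powerbi/MyWorkspace/AllContacts/Contacts/part-00000.csv"
        }
      ]
    }
  ]
}
"""


def list_entities(model_json_text):
    """Return {entity name: [partition file locations]} for a model.json document."""
    model = json.loads(model_json_text)
    result = {}
    for entity in model.get("entities", []):
        partitions = entity.get("partitions", [])
        result[entity["name"]] = [p["location"] for p in partitions]
    return result


entities = list_entities(SAMPLE_MODEL_JSON)
for name, locations in entities.items():
    print(name, "->", len(locations), "partition file(s)")
```

The point of the sketch is simply that the dataflow folder is self-describing: any tool that can read model.json can discover the entities and find their csv files.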

Once configured, Power Query Online (part of the process of
creating a dataflow) can be used to acquire and transform data. The data will
be stored in these folders according to the Common Data Model specification and
can be accessed by other applications. However, the reverse of this is also
true. Any CDM folder that is stored in the Power BI connected file system can
be connected to Power BI as an external dataflow. The process for doing this is
described here.
The order of operations is important. The user that will make the connection
needs to be granted access to the CDM folder before it is populated with data.

An external dataflow is read only with respect to Power BI
(Power BI only sees the data; it does not transform it). The goal is therefore
to transform the data created by Graph data connect into the CDM format. Azure
Databricks provides support for doing so.

Azure Databricks

Azure Databricks is a suite of serverless big data
technologies that encompass Hadoop, Apache Spark, SQL, Python and Scala.
Databricks clusters can be created and used when needed and
discarded or suspended when not needed. A discussion of how to create and
use Databricks is beyond the scope of this post, but there is a great deal of
documentation on it here. In
addition, Microsoft provides a free 14-day trial of Azure Databricks.

Databricks is particularly useful in this scenario, as it
has libraries that support Azure Data Lake Gen 2, and libraries that support
the Common Data Model. Databricks notebooks can be called from Azure Data
Factory, so that when a GDC extraction job is completed, the resulting files
can be processed with Databricks to populate the CDM folders.

An excellent tutorial on using Databricks with dataflows and
CDM folders can be found on GitHub here.
The scenario in the tutorial involves using dataflows to produce data instead
of consuming it, but it does cover off several important concepts. The tutorial
is part of the project that includes the CDM
library for Databricks which is used to transform GDC data into CDM
folders.

As of this writing, the CDM library requires a Databricks 4.3.x-scala2.11
cluster. This is an older configuration that is not available to the standard user
interface when creating a Databricks cluster. Subsequent versions of the CDM
library will most likely support newer clusters, but at present, it is
necessary to take a few additional steps during cluster creation.

From the cluster creation UI, specify window.prefs.set("enableCustomSparkVersions",
true) in the browser debug console, and then navigate to the cluster page, and
specify the image tag below. Refresh the browser and then
4.3.x-scala2.11 will be listed as a custom version.

Once a cluster has been created, and the CDM library loaded
into it, a notebook can be created to process the GDC data. Processing consists
of four main steps: connecting Databricks to ADLG2, reading the JSON files from
GDC, extracting the desired data into dataframes, and writing the data out to
CDM folders.

Connecting Databricks to ADLG2

The recommended way to connect Databricks to ADLG2 storage
is through a Service Principal. The same principal that GDC itself uses can be
used, and if the same ADLG2 account is being used, no further configuration is
necessary.

Databricks will need to read from the file system that
houses the GDC data. Several lines of code (Python) in a Databricks notebook
will establish the required connections:
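As a minimal sketch of what those lines might look like, the helper below builds the Spark settings for service principal (OAuth) access to an ADLG2 account, plus the `filesystem` root used in the rest of this post. The account name, file system name and credential values are placeholders, and the setting names are the ones I know from the ABFS driver; which of them a given cluster honours depends on the Databricks runtime version, so treat this as an assumption to verify.

```python
def adls_oauth_config(account_name, client_id, client_secret, tenant_id):
    """Build the Spark settings for service principal (OAuth) access to ADLS Gen2."""
    host = account_name + ".dfs.core.windows.net"
    return {
        "fs.azure.account.auth.type." + host: "OAuth",
        "fs.azure.account.oauth.provider.type." + host:
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id." + host: client_id,
        "fs.azure.account.oauth2.client.secret." + host: client_secret,
        "fs.azure.account.oauth2.client.endpoint." + host:
            "https://login.microsoftonline.com/" + tenant_id + "/oauth2/token",
    }


def filesystem_url(file_system, account_name):
    """Return the abfss:// root for a file system in the storage account."""
    return "abfss://" + file_system + "@" + account_name + ".dfs.core.windows.net"


# Placeholder values for illustration only.
settings = adls_oauth_config("myaccount", "<app-id>", "<app-secret>", "<tenant-id>")
filesystem = filesystem_url("gdcdata", "myaccount")

# In a Databricks notebook, where `spark` is predefined, apply the settings:
# for key, value in settings.items():
#     spark.conf.set(key, value)
```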

Once connected, files in the GDC folder can be listed using
the built in dbutils library:

dbutils.fs.ls(filesystem + "/GDCFolderName")

While the above and below examples show account names and
keys being explicitly defined in the notebook, this is not recommended beyond
any testing or demonstration environments. Instead, it is recommended to store
such secure strings in Azure Key Vault and retrieve them at runtime. For
instructions on how this is done, see the document Secret
Scopes.
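A rough sketch of the runtime lookup is shown here. It assumes a secret scope has already been created and that the secrets were stored under the key names shown, which are hypothetical. The helper takes the secrets utility as a parameter so that it can be exercised outside a notebook; in Databricks, `dbutils.secrets` would be passed in.

```python
def load_service_principal(secrets, scope):
    """Fetch service principal settings from a secret scope.

    `secrets` is expected to offer get(scope=..., key=...), as
    dbutils.secrets does in a Databricks notebook. The key names
    below are hypothetical; use the names the secrets were stored under.
    """
    return {
        "appId": secrets.get(scope=scope, key="gdc-app-id"),
        "appKey": secrets.get(scope=scope, key="gdc-app-key"),
        "tenantId": secrets.get(scope=scope, key="gdc-tenant-id"),
    }


# In a Databricks notebook (the scope name is a placeholder):
# creds = load_service_principal(dbutils.secrets, "my-keyvault-scope")
```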

Reading JSON Files

Databricks can read all JSON files in a folder (as well as other text-based formats) into a dataframe. A dataframe is an in-memory table that can be hierarchical and queried via standard SQL commands. The schema of the dataframe will be implied through the structure of the JSON files contained within. To load all of the GDC JSON files from a particular folder into a dataframe, the following line of Python can be used:

contactbasedf = spark.read.json(filesystem + "/Contacts Folder")

The read is recursive, which means that subfolders are
interrogated as well. GDC folders typically contain a metadata folder with
files of differing schemas than the data files themselves. For this reason, it
is a good idea to move the data files to a dedicated folder before reading them
into a dataframe. This can be done with the dbutils.fs.mv command.
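One way to approach the move is to work out the list of files first, and only then call dbutils.fs.mv for each one. The sketch below does the planning part in plain Python. The folder names and the ".json" suffix are assumptions for illustration (pass an empty suffix to accept every file), and it relies on directory entries in a listing ending with a slash.

```python
def plan_moves(paths, source_folder, target_folder, suffix=".json"):
    """Pair each data file directly under source_folder with its new location."""
    source = source_folder.rstrip("/") + "/"
    target = target_folder.rstrip("/") + "/"
    moves = []
    for path in paths:
        if not path.startswith(source):
            continue
        name = path[len(source):]
        # Skip anything in a subfolder (such as metadata) and non-data files.
        if "/" in name or not name.endswith(suffix):
            continue
        moves.append((path, target + name))
    return moves


# Illustrative listing; in a notebook this would come from
# [f.path for f in dbutils.fs.ls(filesystem + "/GDCFolderName")]
root = "abfss://gdcdata@myaccount.dfs.core.windows.net"
listing = [
    root + "/GDCFolderName/part-0001.json",
    root + "/GDCFolderName/metadata/",
]
moves = plan_moves(listing, root + "/GDCFolderName", root + "/ContactsData")

# In a Databricks notebook, carry out the plan:
# for src, dst in moves:
#     dbutils.fs.mv(src, dst)
```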

Extracting the desired data

Once the files have been read into a dataframe, the dataframe can be saved to a temporary table. This table can be queried through standard SQL commands. For example, the query creates a temporary table from the initial dataframe (contactbasedf) that was created by reading JSON files created by GDC for organizational contacts. The relevant details are then queried and saved into another dataframe, named df1 in this case.
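A minimal sketch of that pattern follows. The column names in the query (displayName, companyName, jobTitle) are assumptions chosen for illustration; the actual columns depend on the schema that Spark infers from the GDC files, which can be inspected with contactbasedf.printSchema().

```python
# Hypothetical column names; adjust to the schema inferred from the GDC files.
QUERY_TEMPLATE = """
SELECT displayName, companyName, jobTitle
FROM {view}
WHERE displayName IS NOT NULL
"""


def extract_contacts(spark, contactbasedf, view_name="contacts"):
    """Register the dataframe as a temporary view and query it with SQL."""
    contactbasedf.createOrReplaceTempView(view_name)
    return spark.sql(QUERY_TEMPLATE.format(view=view_name))


# In a Databricks notebook, where `spark` is predefined:
# df1 = extract_contacts(spark, contactbasedf)
```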

Writing to CDM folders

Once the CDM libraries are loaded into a Databricks cluster, writing data to CDM folders is a relatively simple method call from a dataframe. The call requires several parameters:

cdmModelName – The name of the Model (dataflow) that houses all entities

entity – the name of the entity within the dataflow (a dataflow can contain multiple entities or tables)

cdmFolder – The folder in ADLG2 to save the model.

appId – The service principal ID of an application with Blob Contributor access to the ADLG2 account

appKey – The secret key for the appId specified above

tenantId – The tenant ID for the ADLG2 account

Using the dataframe defined above, the contents of the dataframe can be written out to the CDM folder with the following (Python) code:
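The sketch below shows the general shape of such a call, wrapped in a small helper. It assumes the CDM library registers itself as the Spark data source "com.microsoft.cdm" and accepts the six parameters listed above as writer options; confirm both against the version of the library loaded into the cluster. The storage URL in the usage comment is a placeholder.

```python
def write_cdm_entity(df, model_name, entity, cdm_folder, app_id, app_key, tenant_id):
    """Write a dataframe out as one entity of a CDM model.

    Assumes the CDM library is attached to the cluster and exposes the
    "com.microsoft.cdm" data source with the option names used below.
    """
    (df.write.format("com.microsoft.cdm")
        .option("cdmModelName", model_name)
        .option("entity", entity)
        .option("cdmFolder", cdm_folder)
        .option("appId", app_id)
        .option("appKey", app_key)
        .option("tenantId", tenant_id)
        .save())


# In a Databricks notebook (placeholder account and workspace names):
# Workspace = "https://myaccount.dfs.core.windows.net/powerbi/MyWorkspace"
# write_cdm_entity(df1, "AllContacts", "Contacts",
#                  Workspace + "/AllContacts", appId, appKey, tenantId)
```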

The above code will output the contents of the dataframe to an entity named “Contacts” in a model named “AllContacts” stored in a folder named “AllContacts” within the workspace folder specified in the “Workspace” variable.

Creating an external Dataflow

Once the GDC data has been written to a CDM folder, it can
be connected to Power BI as an external dataflow. In order to do so, as
mentioned above, the user making the connection must have explicit access to
the model folders.

From a Power BI V2 workspace (V1 workspaces are not
supported), go to the dataflows tab, and select Create – Dataflow from the
toolbar. If Power BI has been connected to the ADLG2 storage, and the workspace
has been configured for external storage, the “Attach a Common Data Model
folder” option should appear.

Selecting “Create and attach” brings up the Attach Common
Data Model folder dialog box, where two items must be entered.

The Name of the dataflow is the name with respect to Power
BI. It can be completely different than the name of the model folder, or the
internal name of the model created above, but it’s likely a good idea to keep
it consistent. The CDM folder path is actually the absolute path to the
model.json file that describes the model, and it’s vital that model.json be
included at the end of the path. Failing to do so will result in an error.
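Since the path is easy to get wrong, it can help to build it rather than type it. This small Python sketch assembles the path from its parts, assuming the usual https://<account>.dfs.core.windows.net/<file system>/... URL shape for ADLG2; the account, workspace and folder names are placeholders.

```python
def cdm_folder_path(account_name, workspace, dataflow_folder, file_system="powerbi"):
    """Build the absolute path to model.json for a dataflow folder."""
    base = "https://{0}.dfs.core.windows.net".format(account_name)
    return "/".join([base, file_system, workspace, dataflow_folder, "model.json"])


# Placeholder names for illustration.
path = cdm_folder_path("myaccount", "MyWorkspace", "AllContacts")
print(path)
```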

Finishing Up

Once completed, Power
BI Desktop can be used to connect to the external dataflow, just like any other
dataflow. The only difference is that external dataflows are not refreshed in
the Power BI service, but will be updated by Databricks. The same Azure Data
Factory jobs that extract data from Graph data connect can be used to call into
the Databricks notebooks when the data has been extracted.

If you are interested in a product that leverages the data
produced by Graph data connect, I would be remiss if I did not suggest our
tyGraph for Exchange, which is currently in preview. It combines all of the
technology listed above with a rich set of reports in concert with other Office
365 workloads. If you are interested, please contact me directly, or email sales@tygraph.com.

Power BI has been able to work with Excel files since it was first introduced. Indeed, it was born from the analytic capabilities in Excel. Users can connect directly to Excel files by using the Power BI service and nothing but a browser. However, depending on the content of the Excel file, and the method of connecting, the resulting products can be very different. In this post I will attempt to clarify this behavior. A subsequent post will detail the options available when working with Excel files in Power BI Desktop.

File Structure

Excel is a multi-purpose tool. It contains all the building blocks of Power BI, and as such, it is an excellent Business Intelligence client. Excel files are also often used (much to my chagrin) as a data storage container, or as a data transport medium. Understanding how the file is structured, and what you want to do with it is key to making the right choice when combining it with Power BI.

Originally, Excel files (workbooks) were collections of worksheets. Analysts could import data into those worksheets and then analyze it with the tools that Excel provided. Although Excel was never intended to be a database, its ease of use and familiarity led many people to begin using it that way, and “spreadmarts” (spreadsheet data marts) quickly became a problem. The problems arose because the instant data was extracted from a source it became stale, and the fact that it was being stored in worksheets meant that it could be edited (changing history) and became subject to the data size limitations of a worksheet.

To take advantage of Excel’s analytic capabilities without being subject to the issues involved in worksheet data storage, the data model was introduced, initially through PowerPivot. The data model is a “miniaturized” version of the SQL Server Analysis Services tabular engine that runs in Excel. This data model is read only, refreshable, and highly compressed, which means that its only data limitation is the amount of memory available on the machine running it. Importantly, this engine is the same engine that is used by Power BI – the advantages of which we’ll explore shortly.

Excel of course still needs to be able to use worksheets and be Excel, so we can’t just remove the worksheet capability (which incidentally is effectively what Power BI Desktop is – Excel without worksheets). Therefore, today from a data perspective, Excel files can have data in the data model, worksheets or both. From the Power BI service perspective, the important thing is whether the file contains a data model, as it treats the two cases differently.

Getting Excel Data

From the Power BI service, you click the Get Data button, and then the Get button in the Files tile. You are then presented with one of two dialogs depending on whether you are using a personal workspace, or an app workspace.

Personal workspace

Connecting file-based data to a personal workspace

When importing into a personal workspace, there are 4 possible data sources.

A local file is one that is stored on a file system local to the machine being used. Selecting this option will allow you to work with the Excel file stored in that location, but if the file is being used as a data source (data is in the worksheets), then a Data Gateway will be required for any data refreshes. Power BI will also connect to a file stored in OneDrive, either Personal or Business (through Office 365). Finally, the service can work with files stored in any accessible SharePoint site (not simply Team sites as the name would indicate).

App workspace

Connecting file-based data to an App workspace

When importing into an App workspace, there are 3 possible data sources. The Local File and SharePoint – Team Sites options are precisely the same as when importing into a personal workspace. The difference is that the OneDrive – Workspace name option replaces the two other OneDrive options. Choosing this option allows you to work with files stored in the “Group OneDrive”. Since every App workspace is backed by an Office 365 or “Modern” group, it also has access to the SharePoint site for that group. The “Group OneDrive” is the Documents library within that SharePoint site. Therefore, choosing SharePoint – Team Sites and navigating to the Documents library will render the same results in a few more mouse clicks, but also give access to all other document libraries within that site.

Connect vs Import

Once you navigate to the Excel file that you want to work with, you select it, and click connect. You will then be presented with two options for the file, Import or Connect.

This choice dictates how the file is brought into the Power BI service. The structure of the file determines exactly what is brought in to the service in both cases.

Connect

Clicking the Connect button allows Power BI to connect to and work with the Excel file in place. The workbook is displayed as an Excel workbook in full fidelity in the Power BI interface using Excel Online. The file itself is shown in the Workbooks section in the Power BI interface, and it stands alone from other Power BI elements (except that regions of it can be pinned to a dashboard). Connecting to an Excel report will not create a Power BI Dataset, Report, or Dashboard. All operations, including refresh (see below) are controlled through the workbook.

At no point is the file moved, or “brought in” to the Power BI service. If the file is being stored in SharePoint, or OneDrive, anything done to the file in the Power BI service will be visible to anyone with access to the file itself, whether they are a Power BI user or not. This includes refresh, which will be discussed further below, but the important part to remember here is that if the data in the connected file is refreshed through the Power BI service, and it is being stored in SharePoint (or OneDrive), all users will be able to see updated data the next time that they open the file.

Connecting to an Excel file behaves the same way whether the file contains a data model or not, but the file must contain a data model in order to be refreshed by the Power BI service.

Connected Excel file within Power BI

Import

Importing an Excel file behaves totally differently from connecting to it. When an Excel file is imported, it is treated as a data source to Power BI, and the assets within that file are brought into the Power BI service. Subsequent changes to the source file are not immediately reflected within the Power BI service, but are retrieved through the refresh process.

The way that the assets are brought into the service depends very much on the structure of the file, specifically whether it contains a data model or not. If the file does not contain a data model, then Power BI will use the data contained in the Excel worksheets to construct a new one. This is similar to what happens when a CSV file is imported into the service. If the file does contain a data model, then the worksheet data is imported, and that data model is brought into the service as-is. One important exception to this is if worksheet data uses the same query as an existing model, the worksheet data is ignored, and the data model is brought in as-is. This is important because Excel’s Power Pivot editor can be used to edit the model, creating calculated columns, calculated measures and relationships prior to import. The model that is automatically created when the file does not contain a model has no editing capabilities.

When an Excel file with a data model is imported, the data model (imported or created) is added to datasets, and a link to the dataset is added to the default dashboard for the workspace. If no default dashboard exists, one will be created. A report can then be authored in the service. If the workbook contains any PowerView reports, these will be converted to native Power BI reports and added to the service as well. Any embedded 3D maps are not brought in.

Imported Excel File showing calculated measures

Refresh

Data refresh options, and behavior depend on both the Get Data choice (connect or import) and the structure of the Excel file.

Connected Workbooks

If the workbook is connected to the service, and it does not contain a data model, it cannot be refreshed. This is true even if the worksheets in the workbook contain data from Power Query queries. This is the only scenario that does not support refresh in any way.

If the workbook contains a data model, refresh is supported. The interesting part is that refresh will be triggered not only for the data model itself, but for any worksheets that have Power Queries as a data source. Therefore, a workaround to the lack of refresh support for a worksheet with no data model is to add a blank data model.

For refresh to work, the data source must be available to the Power BI service. This means that the source must be available in the cloud or registered on an available On-Premises Data Gateway.

The important thing to note about connected workbooks is that the refresh options that are performed on them are permanent – refreshed data is stored with the workbook. This means that if the connected workbook is stored in SharePoint, or shared through OneDrive, updated data is available to all users with access regardless of whether they are Power BI users.

Imported Workbooks

Refresh options for imported workbooks are slightly more complicated. As mentioned above, either data is imported from the worksheets, a data model is imported into the service, or both.

If data was imported from worksheets, then the Excel file is the data source from the standpoint of Power BI. If the file is stored in SharePoint or OneDrive, it will automatically be refreshed every hour by default. This means that changes to the underlying Excel file will be reflected back in the Power BI service within an hour. This feature can be disabled, but it is not possible to change the hourly schedule, nor to control precisely when the refresh will occur.

Refresh options for workbooks in OneDrive/SharePoint

If the file is stored on a file system, it can be scheduled more granularly, but you will need to connect to it through an On-Premises Data Gateway.

If the file contained a data model that was imported into the service, then the original source of data for that data model (the query) is what the Power BI service will refresh from (NOT the Excel file itself). In this case, the refresh options are the same as with most other Power BI data sources – Excel is taken out of the picture completely, and any changes to the source Excel file will not be reflected into the service. The exception to this is if the file had both a data model, and worksheet data that was imported.

In the case of an Excel file with both a data model and worksheet data, both cases above will apply. The workbook is used as a data source for the table that was created by Power BI on import, and the original data model’s source is updated independently. This means that changes to the worksheet data are reflected in the Power BI service when refreshed, but any model changes to the original Excel file are not. Both the OneDrive and regular refresh schedules are used for imported files of this type.

Refresh options for a combined data source

The following table summarizes the refresh options available for file structure and connection type.

| File Structure | Get Data option: Connect | Get Data option: Import |
| --- | --- | --- |
| Worksheet data | None | Refresh from worksheet |
| Data model only | Refresh from model source | Refresh from model source |
| Data model plus worksheet data | Refresh from model source and worksheet source | Refresh from model source and worksheet |

Summary

Both Excel and Power BI are powerful tools in their own right, and the decision to use one does not preclude using the other. In fact, there are many good reasons for using them together. Bringing refreshability to Excel files stored in SharePoint is just one of them. It is however important to understand how it all works in order to get the maximum impact.

If you are an information worker of any sort, and have been at it for any more than a couple of years, you’ve experienced it – the X: drive. Or the S: drive, P, R, whatever the letter. It’s the drive letter that is mapped to a network based file share that contains most of the company’s documents.

My first experience with IT was in 1989, setting up and managing a Novell Netware 3 network for a University department. Logging into the network (through a DOS prompt) would automatically add a drive to your machine with all of the resources that you needed, and all the storage you could ever want (measured in MB). It was “magical”.

This basic model exists to this day. We’ve tried to move away from it, we’ve tried very hard. We’ve had large monolithic document management systems imposed from above like FileNet and Documentum. These solutions gained success in specific areas, as they were mandated from above. SharePoint itself came along and democratized document management to a much broader degree, but the pesky X: drive still persists. Why?

One word. Usability.

End users want to be able to open up File Explorer, navigate to their drive, browse their folder structure and work with their documents. The drive mapping metaphor has succeeded so well because it fits this scenario perfectly, and because of its familiarity. Ever since personal computing began, we’ve accessed file storage using drive letters.

Users use formal document management systems reluctantly. This is often due to overzealous metadata requirements (just fill out this 20 field form to store your document), burdensome procedures, or performance issues. Given the choice, they retreat to the familiarity of their file systems, and their X: drives, more often than not.

Consultants and vendors preaching a new way of doing things are in the end shouting against thunder. We can’t expect users to adapt to new systems quickly – what we need, at least transitionally is for the systems to adapt to the users. This is where OneDrive comes in.

One of the most compelling features of SharePoint 2013 in my opinion was OneDrive for Business. The reason that I felt that was that for the first time, SharePoint document management was tightly integrated into File Explorer. There had been previous attempts at synchronizing (SharePoint Workspace), but that required separate client software and required a lot of manual dragging and dropping.

The implementation of OneDrive for Business was initially hobbled by restrictions and limits, and was rather confusing, limiting adoption. However, through the combination of the current OneDrive sync client with the Files On Demand feature available in the Windows Fall Creators Update, OneDrive is truly ready for widespread adoption. Over the past few years, OneDrive has become both reliable and fast, and Files On Demand combined with in-context sharing put it over the top.

OneDrive Files on Demand

Screenshot taken while flying on airplane mode from a laptop containing a single 256 GB drive.

Files on Demand allows you to easily control what files are synchronized to your local device, while still being able to see all your file assets, directly from File Explorer.

In the figure above, the cloud icon indicates that that folder is not currently stored locally. That’s a good thing, because opening up the folder properties reveals that it is over 1 TB in size, and the drive on the laptop (that I’m currently writing this with) is only 256 GB. Moreover, that screen shot was taken while flying, and totally disconnected. I could still see things that were not on my local device.

If connected, cloud files can be interacted with the same way that local files can, by any application. Opening them just requires a little more time as the file is downloaded in the background. If you will be offline, bringing a file local is a simple matter of right clicking on that file, or folder, and selecting “Always keep on this device”.

Files on Demand is currently available to Windows Insiders, and will be generally available with the Windows Fall Creators Update on October 17, 2017.

In Context Sharing

The new OneDrive in context sharing experience

Up until very recently, sharing a drive from Explorer was a rather frustrating experience. You could identify your file, right click on it, click on share, and then a browser window would pop open, and if you were lucky, you would be presented with a view of the OneDrive web user interface. This experience was jarring, and required multiple steps.

Over the past summer, OneDrive rolled out updates that change this behaviour significantly. Now clicking on share brings up the dialog above and sharing is done completely from there. No context switching, and no authentication failures.

Time to move

These two new features, combined with the performance and reliability improvements over the last few years, put OneDrive over the top. Finally, all of the usability issues have been addressed. End users can live completely in File Explorer should they wish to do so, and be oblivious to the workings of OneDrive and/or SharePoint. However, at the same time, they will gain significant benefits compared to the shared file system.

OneDrive provides much more than simply an alternate storage location for your files. Once content is stored in OneDrive, a whole host of options are opened. The organization benefits because all this content is immediately made discoverable through technologies like Delve and Search. File access can be tracked, helping users understand what content others find valuable. There are, however, many immediate benefits that accrue directly to the users themselves. I wanted to call out three of them, but there are many more.

OneDrive File Viewing

OneDrive comes with a long list of file viewers. These viewers allow the contents of a file to be viewed without opening the underlying application, which tends to be significantly faster than opening the host application. In fact, the application itself does not need to be installed. This is valuable in itself, but when combined with Files On Demand a file can be viewed whether or not it is even stored locally. A large Adobe Illustrator file can be viewed locally without it even being present on disk. Files On Demand is also available on Mac, and in the OneDrive client, and therefore, this very same file can be viewed on iPhone, Android, iPad, anywhere the OneDrive application is available. This, to my mind, is a game changer.

Sharing

Sharing with OneDrive is not new, but sharing directly from the explorer window is. That sharing experience is also now being consolidated across devices, and embedded into Microsoft Office applications. In addition, OneDrive files can be shared externally with an anonymous link, or securely with others that have a Microsoft account (personal or organizational), but what will be shortly available is the ability to securely share files externally with people that have any type of account.

Files Restore

Announced at Ignite 2017, FilesRestore provides end users with the ability to easily track all versions of their files for the past 30 days, and to instantly restore them to a point of time anywhere in that window. Administrators have long had this capability through a set of operations, but FilesRestore puts this capability into the hands of the end user with a simple to use user interface. Users can rest assured that their files are safe not only from disaster, but from their own mistakes, malware, ransomware or anything. Files stored in OneDrive are safe.

These are but three compelling benefits that users can enjoy by moving content to OneDrive. There are many more. For a good overview, and to hear all of the OneDrive announcements from Ignite 2017, be sure to check out “OneDrive – Past, Present and Future“.

I was recently asked by Christian Buckley what my top 2016 blog posts were. No problem I thought, I just went back to my output for the past year, and pulled out the posts that I knew have had a lot of discussion or impact, and forwarded them on. At that point he asked how many views each of those pages had received. Being a data guy, I suddenly felt like the shoemaker noticing that his children had been going barefoot.

I monitor my blog traffic with the built-in WordPress JetPack tools, StatCounter, and Google Analytics. They all work slightly differently, with StatCounter and JetPack being the most alike. I tend to rely on StatCounter for immediate stats (how many hits today, what’s popular today) and the Google stats for a longer time frame. StatCounter doesn’t persist my stats beyond a day, as I don’t have the pro version, and the JetPack stats don’t seem very extensible. Google Analytics seemed like the best place to begin, particularly because there is a pre-existing content pack for Power BI.

The Google Analytics Content Pack

I have used the Google Analytics (GA) content pack casually and for demonstration purposes since it was introduced with the Power BI launch in July 2015. It hasn’t changed much. Actually, as far as I can tell, it hasn’t changed at all. To use the content pack, you simply log into the Power BI service, select “Get Data”, select the “Services” tile, and select Google Analytics.

After you enter in your credentials by selecting oAuth2, Power BI will import your GA data into a data model, and populate a pre-configured report. The report consists of several pages, mostly focused on visitors to the site.

There are some interesting visuals out of the box, and there are more metrics available in the data model if you want to customize the out of the box reports. At the moment, any customizations that are made in this way are not portable, and with the content pack, data is only retained for 180 days, which means that year over year comparisons are not possible. The visuals don’t appear to have been updated since initial release, which means that many of the new Power BI UI enhancements are not there, but they too can be added through customization.

Generally, if you’re going to do a lot of customization, the best tool to use is Power BI Desktop. Reports can then be reused easily and are highly portable. Luckily, in addition to the content pack, Google Analytics also exists as a data source for Power BI Desktop.

Using the Google Analytics Data Source in Power BI Desktop

When Power BI Desktop imports data from GA, it imports all the data that GA has. There seems to be no agreement on how long Google will retain this data, but in practice, GA seems to retain all data since it was originally configured. In my case, that’s a little over two years now, which is fine for my analysis. The first step is to connect to and import the correct data. Start Power BI Desktop, select “Get Data”, choose the Online Services tab and choose “Google Analytics”.

Once you authenticate, you’ll be presented with all of the sites that are monitored by Google Analytics. You’ll want to drill down and open “All Web Site Data”. GA captures an awful lot of information, and the trick is to know what to grab. Grabbing everything won’t work as the connector only allows for 8 dimensions and measures in a single import. In my case, I am interested in the PageViews and Unique PageViews measures, and the Page, Page Title and Landing Page dimensions (under the “Page Tracking” section). In addition, I want the Date, Hour, and Minute dimensions from the Time section.

Once selected, we select OK, and edit the query, giving it a good name like “GA Data”. Finally, we can select “Close and Apply” and the data will be added. This procedure can take a little while depending on the quantity of data.

Once loaded, we need to do a little bit of work in the data model. We imported the dates from GA, but we’ll want to do year/month/day drilldowns, as well as use textual values for month names, day names etc. For that, the tried and true method has been to build a Date table. Power BI itself will actually do some of this automatically for you behind the scenes, but a custom table gives us the ultimate in flexibility. DAX (the Power BI modelling language) makes this very easy. We create a new table by first selecting the “Modeling” tab, and then the New Table button. This allows us to create a calculated table in the formula bar. First change the name from “Table” to something meaningful like “ViewDates”, and then add the following formula:
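A minimal sketch of a date table formula of this kind is shown here. The date range is an assumption, and the column names are taken from the ones referred to later in this post, so the exact formula used for the report may differ:

```dax
ViewDates =
-- One row per day between the two dates; adjust the range to suit the data.
ADDCOLUMNS (
    CALENDAR ( DATE ( 2014, 1, 1 ), DATE ( 2017, 12, 31 ) ),
    "Year", YEAR ( [Date] ),
    "Month Number", MONTH ( [Date] ),
    "Month Name Long", FORMAT ( [Date], "MMMM" ),
    "Month Name Short", FORMAT ( [Date], "MMM" ),
    -- Numeric key (for example 201603) used to sort Year Month Short.
    "Year Month Number", YEAR ( [Date] ) * 100 + MONTH ( [Date] ),
    "Year Month Short", FORMAT ( [Date], "MMM yyyy" ),
    "Day", DAY ( [Date] ),
    -- WEEKDAY defaults to Sunday = 1 through Saturday = 7.
    "Day of Week Number", WEEKDAY ( [Date] ),
    "Day of Week", FORMAT ( [Date], "dddd" )
)
```

CALENDAR produces a single column named Date, which is the column used for the relationship to the GA data.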

Adjust the beginning and end dates to suit the data in question, click the check mark, and voila, instant date table. There will be a record for every date between the beginning and end dates. It’s a good idea to adjust the properties of some of the resultant columns for display: we want to sort the Month Name Long and Month Name Short columns by Month Number, and the Day of Week column by the Day of Week Number column. Any additional customizations can be made as necessary.

The next step is to establish the relationship between the Date column in the GA table, and the Date field in the new calculated date table. Simply click on the relationship builder icon, then drag and drop the Date column from one table to the corresponding column on the other.

At this point, we can create a visual that shows traffic over time. We create a column chart, and add Pageviews as the Value, then we add Year Month Short (which should be sorted by Year Month Number) to the axis, and we should see all site traffic over time. Adding Date to the axis and stripping out all the dimensions except Day allows us to drill down on days for a selected month.

Although we can see our site traffic by month, we still can’t answer Christian’s original question, which was “what were the most frequently viewed posts written in 2016“. Google Analytics has no clue when the pages were created. It’s possible to try to infer it from the earliest viewed date for a given page, but the created date is available directly in WordPress. We just need to get the WordPress data into the data model. Thankfully, that is possible through the WordPress REST Add-On.

Using the WordPress REST Add-On

REST support is available for WordPress as an add-on. The “WP REST API” is available in the add-on catalog, and on GitHub here. Once installed, all WordPress content (including posts) is available through a simple HTTP GET request. This is something that’s fully supported by Power BI, and therefore all the relevant post data can be loaded into Power BI through this add-on.

From the Power BI Home tab, select Get Data, then “Web” and then use the URL required to retrieve posts. For the blog that you’re reading, it’s https://whitepages.unlimitedviz.com/wp-json/wp/v2/posts. The query will return a list of records. However, there will only be as many records as WordPress shows by default. We need all of them. The add-on allows you to specify the number of posts per page, by adding the “per_page” parameter. Therefore, in our case, it’s https://whitepages.unlimitedviz.com/wp-json/wp/v2/posts?per_page=50 where 50 is the desired number of items per page.

The per_page parameter is all that you need if the number of posts to analyze is no more than 100, which is the maximum value of this parameter. There is another parameter that can be added to the query, page=, that will specify the page number. With this, and the posts per page parameter, it’s possible to get all the posts. There are a couple of ways to implement this in Power BI.

The ideal way is to use an “M” function. With a function, you build up a query normally, and then you wrap it in another parameterized query using the advanced editor, passing in the page number as a parameter that is then used in the subsequent query. The function can then be called from each record of another table, thereby returning all the posts, which is exactly what we need. This approach works perfectly well in Power BI Desktop. Unfortunately, once the model and report are deployed to the Power BI service, it stops working. The Power BI service currently cannot refresh any query that uses replaceable parameters as part of the query.

The other way that this can be handled is to generate multiple queries that explicitly use the page= parameter. The number of queries necessary will be equal to the number of posts divided by 100, then rounded up to the next whole number. In my case, I have 230 posts, and therefore need 3 queries. Once created, all 3 queries can be merged into a single table. This approach is messy, and will require occasional maintenance, but it’s the only one that works for now. Let’s walk through the process.

We’ll start with the first query. As above, we use Get Data, select the Web source and enter the URL for page 1 and 100 posts per page. For this blog the URL is https://whitepages.unlimitedviz.com/wp-json/wp/v2/posts?page=1&per_page=100. The query should show a list of 100 records. Next, we need to turn the list into a table so that it can be expanded. Click the “To Table” button in the ribbon.

Click OK to accept the defaults, and then click the small expand button in the column header (Column1). Be sure to deselect the “Use original column name as prefix” option before clicking OK.

At this point, all the post metadata from WordPress should be available. You can choose to keep all or only some of the columns, but the ones that we want to be sure to keep are date, slug, and title. Title needs to be expanded, so we should go ahead and do that – the procedure is the same as the step above, but only the title field is returned as “rendered”. It’s a good idea to rename it to Title. Also, it’s a good idea to set the data type of the Date field to Date/Time here.

Once the query is the way we want it, we’ll want to name it something like “Posts1-100”, and then we need to set its data load properties to not load into the report. We don’t want this query to load its data into the report because it will only be one merge source of three, and we don’t want to store the data redundantly. To do that, we right click on the query, select properties, and deselect “Enable load to report”. Then click OK.

We now need to duplicate this query for page 2. The easiest way to do this is by copying all the M script generated by the query builder into a new blank query, and then editing it. From the Home tab, we click on “Advanced Editor”, then select and copy all the text in the dialog box. We then close the dialog box, then select New Source – Blank Query. Once opened, we again select “Advanced Editor”, remove the default content and paste the copied text into the box. Finally, “page=1” in the URL is replaced with “page=2”.

We then save the query, name it and set the properties not to load as with the first query. We then repeat all these steps for page 3. At this point we are ready to merge the queries into our “master” query.

To merge the three queries into one, we select the “Append Queries” dropdown from the ribbon, and select “Append Queries as New”. We then select “Three or more tables”, add the three tables and select OK. Finally, we give this new query the name “Posts”, but we do not prevent the data from loading. This is our master table. At this point, we are ready to Close and Apply, and return to the main design surface.

This Posts table has a Date column, but it’s actually a Date/Time column. To use a date table, we need to create a new calculated column with just the date portion. With the Posts table selected, we select the Modeling tab, and then “New Column”. We then give the column a name (PostDate) and the following formula based on the Date column:
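One way to write such a column is sketched below. It assumes the column is named Date in the Posts table, and it may not match the exact formula used for the report:

```dax
PostDate =
-- Rebuild the value from its year, month and day parts, which drops the time of day.
DATE ( YEAR ( Posts[Date] ), MONTH ( Posts[Date] ), DAY ( Posts[Date] ) )
```

Because the result has no time component, it matches the one-row-per-day values in a date table.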

We also want a calculated measure to indicate the number of posts. The process is like that for a new column. We select “New Measure”, and add the following formula to the formula bar:

Posts = CountA(Posts[id])

We will be relating records in the Posts table to records in the GA table, so we need another date table to keep the relationships clean. We could calculate another table as we did above, but it’s even easier to calculate the new one based on the one already created. We simply select “New Table” and use the following formula:

PostDates = ViewDates

Next, we create the relationship between the Posts table and the PostDates table the same way that we did it for the GA table above. Now that both tables are date sliceable, we need to relate them together. In the Posts table, the Link column uniquely identifies the page, but the GA table uses the relative address of the page in the Landing Page column. In our case the solution is simple: we need to prepend the main part of the site address in question (in our case https://whitepages.unlimitedviz.com) to the Landing Page. We do that by creating a new column, URL, with the following formula:
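A sketch of such a column follows. It assumes the GA query was named “GA Data” as suggested earlier, and that the Landing Page values begin with a slash; the exact formula used for the report may differ:

```dax
URL =
-- Prepend the site root so the value matches the absolute address in Posts[Link].
-- 'GA Data' is the query name suggested earlier; adjust it if yours differs.
"https://whitepages.unlimitedviz.com" & 'GA Data'[Landing Page]
```

If the two columns differ in details such as a trailing slash, the values will not match, so it is worth comparing a few rows from each table before building the relationship.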

Finally, we relate the URL column in the GA table to the Link column in the Posts table.

At this point the model is ready for use in reports.

Building a Report

How to build a report is not the focus of this article, so I’ll just explain the steps taken here. To prepare our data model, we first need to flag the Link column in the Posts table as a URL field. To do that, select it in the UI, then select the Modeling tab. Use the Data Category dropdown control and select “Web URL”.

Next, add a new table to the report, and in the Format section, select Values, and set the “URL icon” setting to “On”.

This has the effect of displaying any column that has been flagged with the Web URL attribute as a link icon with a live hyperlink, instead of the entire, often long URL itself.

Next, we add the Title and Link fields from the Posts table, and the Pageviews field from the GA table, and then sort the table by Pageviews. Next, we add two slicer controls to the report – one bound to the Year column of the PostDates table, and the other bound to the Year column of the ViewDates table. Now by selecting 2016 from the ViewDates slicer, and 2016 from the PostDates slicer, we can see, in order with precise numbers, which posts authored in 2016 were most frequently viewed in 2016. With this, I was now able to give Christian an answer.

An answer today is one thing, but an answer next year is another altogether. This report was worth sharing, so it was worth sprucing up a bit. By taking advantage of some of the new table formatting capabilities in Power BI, and importing the chiclet slicer custom control, we are able to make a more visually appealing report. I will also occasionally use a column chart in a report and use it like a slicer when appropriate. With a little bit of formatting work, we wind up with a report that looks something like the following:

Publishing and Sharing

We’re now ready to publish this report. The easiest approach is to simply select the “Publish” button from Power BI Desktop. Select the destination, most likely your personal workspace. When publishing is complete, we can select “View in Power BI” to see the report in the service.

Having the report is one thing, but we want this report to be kept up to date. To do this, we go to the datasets section and select our dataset. In the data source credentials section, we need to set the credentials for both Google Analytics, and our WordPress connection (which will display as “Web”). Even though the Web source is anonymous, we have to configure it that way in the Power BI service. Once the connections are configured they should appear in the Data source credentials section with no notices.

When we configured the WordPress data import above, we used 3 queries. That’s good for 300 posts, and my blog is currently at 238, which should be fine for a while. However, once I hit 300, I’m going to need another query. What I’m really hoping for is that by that time the Power BI service will support parameterized data sources for refresh, but either way I’ll need to modify the data source. I’m likely to forget this need about a week after I publish this post, so a reminder is a good idea. Luckily, Power BI supports data driven alerts, which is exactly what we need here.

Alerts are set on dashboard tiles that display card data. Our report has a data card showing the number of total posts. Once that card has been pinned to the dashboard, an alert can be set on it for when it reaches a specific threshold. Simply hover over the dashboard card and click on the ellipsis, then the small bell icon.

In our case, we want to be notified when the number of posts is approaching 300, so we set the condition to be above 297. Once blog post 298 is published, I will receive an email and can then act on it.
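The arithmetic behind the query count (posts divided by 100, rounded up) can also be surfaced in the model. A minimal sketch, assuming the Posts measure defined earlier and a hypothetical measure name, Queries Needed:

```dax
Queries Needed =
-- Hypothetical measure: number of 100-post page queries required
-- to cover every post currently loaded into the Posts table.
ROUNDUP ( DIVIDE ( [Posts], 100 ), 0 )
```

With 238 posts this evaluates to 3. It can only count posts that the existing queries actually return, so it serves as a planning aid alongside the alert rather than a replacement for it.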

Finally, I want to share this report with Christian so that the next time he has questions about my blog, he can just look it up for himself. When I tell him this, I’ll say that it’s so he can keep me honest, but really, I just want him to stop bugging me…

We don’t work at the same company and we use different Azure AD tenants. I could share the dashboard externally with him, but it’s even easier to share it anonymously, and anonymous sharing of this data is fine with me. Anonymous sharing of data is relatively straightforward. From the report interface, select File – Publish to web. A dialog will open asking for confirmation, and once confirmed will provide a URL that can be shared publicly. In the case of this blog’s report, you can simply click here to get the full report in a dedicated window. I can just email that link to Christian, and he’ll have the answers that he’s looking for. The beauty of anonymous sharing is that you are also given an embed code that can be added to any web page. As an example, the fully interactive report for this blog can be seen below.

Note – Since originally publishing this post, I have been made aware of some new management tools that will allow the ability to disable group creation by default. As opposed to modifying this post, which contains other observations, I have published a new one dealing with these new tools here.

As I’ve discussed before, Office 365 Groups are a very important feature in Office 365, and one that all organizations using Office 365 should fully understand as soon as possible. Groups are either required or they provide important capabilities for every product in the Office 365 stack. However, every organization has a different tolerance for change, and some have no tolerance for it at all. In addition, there are many aspects of Groups that are still a work in progress (navigation for example). A frequently asked question is “how do we turn off Groups?”. There’s nothing in the Office 365 Administration interface in either the Groups, or the Services & Add-ins sections.

What these approaches do is to adjust the Outlook Web Access policy that controls the creation of Office 365 Groups. At its core, an Office 365 Group is just a type of Azure Active Directory Group, one with multiple services attached to it. When Groups were first introduced, the only ways of creating them were through the Azure Active Directory interface, PowerShell, and Outlook Web Access (OWA). The first two methods require an administrative level of access, so enabling and disabling this feature in OWA effectively disabled it for end users. An end user can still see the Group creation controls, but any attempt to create a new group is met with a dialog informing them that this feature is disabled.

Since Groups were first introduced, there have been several significant changes as more Office 365 services embraced the Groups structure, and others have been introduced that rely on it.

When the “new” Power BI was introduced in mid-2015, its sharing story relied heavily on Office 365 Groups. Each Group receives a Power BI workspace, and conversely each new Power BI workspace is a Group. Given that end users can create and to some extent manage the workspace directly in the Power BI user interface, it represents an alternate Groups management tool focused on the end user.

Creating a new Group in Power BI

Microsoft Planner, launched in mid-2016, is another product that relies on the availability of Groups. For the most part Planner stands on its own, with minimal ties to the rest of the Office 365 stack. Each Plan contains multiple tasks, but under the covers, each Plan is backed by an Office 365 Group, with all the rest of the available services. Creating a new plan in Planner creates a new Group, and everything that goes with it, even though the interface doesn’t make it very clear. You’re getting far more than just a plan.

Creating an Office 365 Group (aka Plan) in Planner

With the release of Modern Team Sites in SharePoint, SharePoint is also very tightly bound to Office 365 Groups. Before this release, creating a new team site through the SharePoint interface or through the SharePoint administration interface created a classic SharePoint site collection. Doing so now also creates a group to go along with it (again, with everything that goes along with that) and all the access to the new Team Site (a site collection) is controlled through membership to that Group. The SharePoint interface for this makes it very clear as to what is happening – “Lets create a new team site and group”.

Creating an Office 365 Group from SharePoint

It is still possible to create a SharePoint site collection that is not bound to a group through the SharePoint administration interface. Modern team sites (the site collections created through the SharePoint user interface) don’t appear in the SharePoint administration interface at all.

The Outlook 2016 rich client also has a comprehensive set of group management features. A group can be created by right clicking on the “Groups” node in the Outlook mailbox, and once created can be fully managed by the “Home” tab in the ribbon.

Creating a new group in Outlook 2016

Managing a group in Outlook 2016

There are now 5 different ways for end users to create, and in some cases manage, their Office 365 groups: the original Outlook Web Access interface, and now Outlook 2016, Planner, SharePoint and Power BI. The processes outlined above for disabling group creation prevent group creation from Outlook Web Access, but what effect do they have on these new interfaces? The answer is, no effect whatsoever. Whether the “GroupCreationEnabled” OWA policy has been set to false or not, these other interfaces will still be able to create and work with Office 365 Groups. This may not be surprising as Power BI, Planner, and now even features of SharePoint are dependent on the Groups infrastructure.

I have not called out Microsoft Teams above. It is true that Teams is also dependent on the Groups infrastructure, and that creating a new team will create a new group. Where Teams differs from the other dependent services is that the creation of a new Group in one of the other interfaces does not automatically create a new Team. In addition, Teams itself must be enabled by an administrator, meaning that for this additional service, Groups creation can be controlled centrally.

In the very near future, Yammer will also become Groups dependent. Creation of a new group in Yammer will spin up a corresponding Office 365 group, which will be used to store the files and notes available in Yammer. These groups will be flagged as “Yammer managed” meaning that they will not appear in the Outlook interfaces, but they will be available to all the other services that utilize groups.

The bottom line of all this is that even if you use Office 365, and you think that you have disabled Groups in your tenant, the chances are that you could be in for a surprise. If any of these dependent services are in use, the chances are that you already have several created.

Groups are the bedrock of all new features in Office 365 moving forward – it is therefore a good idea that your organization understand them as soon as possible. Their inevitability is also another strong argument for paying close attention to them. If you are currently discussing whether or not they should be used, I would strongly encourage you to shift that discussion to how they should best be used.