Dataflows in Power BI: Overview Part 7 – External CDM Folders

One key aspect of Power BI dataflows is that they store their data in CDM folders in Azure Data Lake Storage Gen2.[1] When a dataflow is refreshed, the queries that define the dataflow entities are executed, and their results are stored in the underlying CDM folders in the data lake.

By default, the Power BI service hides the details of the underlying storage. Only the Power BI service can write to the CDM folders, and only the Power BI service can read from them.

NARRATOR:

But Matthew knew that there are other options beyond the default…

Because the CDM folder format is an open standard, any service or application can create CDM folders. A CDM folder can be produced by Azure Data Factory, Azure Databricks, or any other service that can output text and JSON files. Once the CDM folder exists, we just need to let Power BI know that it’s there.
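To make "text and JSON files" a little more concrete, here is a minimal sketch in Python that writes one entity as a CSV data file plus a model.json metadata file. It writes to a local directory rather than to the data lake, and the function name write_cdm_folder, the model name "ExampleModel", and the sample entity are all hypothetical. The field names follow my understanding of the public model.json format, so check them against the published schema before relying on them.

```python
import csv
import json
from pathlib import Path


def write_cdm_folder(root, entity_name, attributes, rows):
    """Write one entity as a CSV data file plus a model.json metadata file.

    attributes is a list of (name, dataType) pairs; rows is a list of lists.
    Hypothetical helper: a sketch of the folder layout, not an official writer.
    """
    root = Path(root)
    data_dir = root / entity_name
    data_dir.mkdir(parents=True, exist_ok=True)

    # Data file: plain CSV. No header row is written here, on the assumption
    # that column names live in model.json rather than in the data file.
    data_file = data_dir / f"{entity_name}.csv"
    with data_file.open("w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

    model = {
        "name": "ExampleModel",  # hypothetical model name
        "version": "1.0",
        "entities": [
            {
                "$type": "LocalEntity",
                "name": entity_name,
                "attributes": [
                    {"name": name, "dataType": data_type}
                    for name, data_type in attributes
                ],
                "partitions": [
                    {
                        "name": entity_name,
                        # Placeholder: a relative path for this local sketch.
                        # In the data lake this would be the file's location there.
                        "location": data_file.relative_to(root).as_posix(),
                    }
                ],
            }
        ],
    }
    model_path = root / "model.json"
    model_path.write_text(json.dumps(model, indent=2), encoding="utf-8")
    return model_path
```

Calling write_cdm_folder("out", "Orders", [("OrderId", "int64"), ("Customer", "string")], [[1, "Contoso"], [2, "Fabrikam"]]) produces out/model.json and out/Orders/Orders.csv. A real producer would write the same two kinds of files into Azure Data Lake Storage Gen2 instead.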

Like this.

When creating a new dataflow, select the “Attach an external CDM folder” option. If you don’t see the “Attach an external CDM folder” and “Link entities from other dataflows” options, the most likely reason is that you’re not using a new “v2” workspace. These capabilities are available only in the new workspaces, which are currently also in preview.

You’ll then be prompted to provide the same metadata you would enter when saving a standard Power BI dataflow (a required name and an optional description), and also to enter the path to the CDM folder in Azure Data Lake Storage Gen2.
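As a rough illustration of what that path might look like, the sketch below assembles a URL from a storage account name, a file system (container) name, and a folder. It assumes the standard Azure Data Lake Storage Gen2 endpoint form (account.dfs.core.windows.net) and assumes that the path should point at the folder's model.json file; verify both against what the dialog actually asks for. The function name and the sample values are hypothetical.

```python
from urllib.parse import quote


def cdm_folder_path(account, filesystem, folder):
    """Build a URL to the model.json file of a CDM folder (hypothetical helper).

    Assumes the usual ADLS Gen2 endpoint pattern and that the CDM folder is
    identified by the location of its model.json file.
    """
    # Drop stray leading/trailing slashes so the URL has no empty segments.
    folder = folder.strip("/")
    # quote() keeps "/" as-is by default and escapes characters such as spaces.
    return (
        f"https://{account}.dfs.core.windows.net/"
        f"{filesystem}/{quote(folder)}/model.json"
    )


print(cdm_folder_path("contosolake", "powerbi", "/sales/orders/"))
# prints: https://contosolake.dfs.core.windows.net/powerbi/sales/orders/model.json
```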

Just as you need permissions to access your data sources when building a dataflow in Power BI, you also need permission on the CDM folder in Azure Data Lake in order to attach the CDM folder as an external dataflow.

And that’s it!

The other steps that would normally be required to build a new dataflow are not required when attaching an external CDM folder. You aren’t building queries to define the entities, because a service other than Power BI will be writing the data in the CDM folder.

Once this is done, users can work with this external CDM folder as if it were a standard Power BI dataflow. An analyst working with this data in Power BI Desktop will likely never know (or care) that the data came from somewhere outside of Power BI. All that they will notice is that the data source is easy to discover and use, because it is a dataflow.

One potential complication[2] is that Power BI Desktop users must be granted permissions both in Power BI and in Azure Data Lake in order to successfully consume the data. In Power BI, the user must be a member of the workspace that contains the dataflow. If this is not the case, the user will not see the workspace in the list of workspaces when connecting to Power BI dataflows in Power BI Desktop. In Azure Data Lake, the user must be granted read permissions on the CDM folder and the files it contains. If this is not the case, the user will receive an error when attempting to connect to the dataflow.
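The two permission checks described above combine into three possible outcomes, which the small sketch below spells out. It is only a model of the behavior as described in this post, not a call into any Power BI or Azure API. The function name and the outcome strings are hypothetical.

```python
def desktop_outcome(is_workspace_member, has_lake_read):
    """Model what a Power BI Desktop user sees for an external CDM folder.

    is_workspace_member: the user is a member of the workspace in Power BI.
    has_lake_read: the user has read permission on the CDM folder and its
    files in Azure Data Lake.
    """
    if not is_workspace_member:
        # Power BI check fails first: the workspace never shows up at all.
        return "workspace not listed"
    if not has_lake_read:
        # The dataflow is visible, but reading from the lake fails.
        return "error on connect"
    return "data loads"


for member in (False, True):
    for reader in (False, True):
        print(member, reader, "->", desktop_outcome(member, reader))
```

Note that lake permissions alone are not enough: without workspace membership, the user never gets as far as the connection attempt.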

One additional consideration to keep in mind is that linked entities are not supported when referencing dataflows created from external CDM folders. This shouldn’t be a surprise given how linked entities work, but it’s important to mention nonetheless.

Now that we’ve seen how to set up external folders, let’s look at why we should care. What scenarios does this feature enable? The biggest scenario for me is the ability to seamlessly bridge the worlds of self-service and centralized data, at the asset level.

Enabling business users to work with IT-created data obviously isn’t a new thing – this is the heart of many “managed self-service” approaches to BI. But typically this involves a major development effort, and it involves the sharing of complete models. IT builds data warehouses and cubes, and then educates business users on how to find the data and connect to it. But with external CDM folders, any data set created by a data professional in Azure can be exposed in Power BI without any additional IT effort. The fact that the data is in CDM folder format is enough. Once the CDM folder is attached in Power BI, any authorized user can easily discover and consume the data from directly within Power BI Desktop. And rather than sharing complete models, this approach enables the sharing of more granular reusable building blocks that can be used in multiple models and scenarios.

There doesn’t even need to be a multi-team or multi-persona data sharing aspect to the scenario. If a data engineer or data scientist is creating CDM folders in Azure, she may need to visualize that data, and Power BI is an obvious choice. Although data science tools typically have their own visualization capabilities, their options for distributing insights based on those visuals tend to fall short of what Power BI delivers. For data that is in CDM folders in Azure Data Lake Storage Gen2, any data producer in Azure has a seamless way to expose and share that data with Power BI.

And of course, there are certainly many possibilities that I haven’t even thought of. I can’t wait to hear what you imagine!

[1] If you click through to read the CDM folders post you’ll see that I used almost exactly the same opening sentence, even though I hadn’t read that post since I wrote it over a month ago. That’s just weird.

[2] At least during the preview. I plan on going into greater depth on dataflows security in a future post, and you should expect to see things get simpler and easier while this feature is in preview.