Why do this payload and the one before it have a standard metadata package, even though the payloads are from different sources? What is the scope of the standard? Under what authority is the standard defined, and enforced?

What are these?

How do you know?

How much have they been processed since they were produced?

Without metadata, how do you evaluate the contents?

Without metadata, would you bother to evaluate the contents, or would you pass them by and instead look for payloads with complete metadata?

What are these?

How much have they been processed since they were produced?

Can you infer important details from the payload container format, even though the primary metadata is missing? Is this enough metadata for you to evaluate the payload for use?

How does the complexity of a payload relate to the complexity of the metadata? How does it relate to your requirements for considering the metadata to be complete enough?

Do you need more metadata to understand a payload that has been highly processed?

Is it easier or harder to use a simpler payload? How does the complexity of the desired application factor into your answer?

Your values and priorities determine what you do, and what you accept being done.

Now apply that to your work as a technical professional.

Think for a moment about the team you work with. What is the team breakdown by gender? By nationality? By education? By skin color? By language? By ethnic background? By religion? By cognitive ability?

Now think about the decisions your team makes when deciding what to build, how to build it, and how to determine when it’s ready to ship.

Microsoft stated that it was able to “significantly reduce accuracy differences across the demographics” by expanding facial recognition training data sets, initiating new data collection around the variables of skin tone, gender and age and improving its gender classification system by “focusing specifically on getting better results for all skin tones.”

“The higher error rates on females with darker skin highlights an industrywide challenge: Artificial intelligence technologies are only as good as the data used to train them. If a facial recognition system is to perform well across all people, the training dataset needs to represent a diversity of skin tones as well as factors such as hairstyle, jewelry and eyewear.”

I’d like to emphasize that I’m not on the Face API team and don’t have any insight into the team beyond this story, but I think it’s probably safe to say that if the team had had more darker-skinned men and women as team members, the decision to ship an API with high failure rates for darker-skinned men and women might not have been made.[1] Imagine a developer saying “but it works on my face” in the same tone you’ve heard one say “but it works on my machine” in the past. If it doesn’t work on your machine, that’s when even the most obstinate developer will admit that the code has a problem.

This example of the impact diversity of background can make is significant, but it’s also pretty common. If you follow tech news sites, you’ve heard this story and others like it before. So let’s look at another one.

In systems that you’ve built, how easy is it to change an email address or username? This might be a transactional system, where the email address or username is a business key. Or it may be an analytics system, where these fields may not be handled as slowly changing dimension attributes. Think about your Microsoft Account[2] as an example – just how easy is it to change the email address you use for your cloud identity across dozens of Microsoft services?

As it turns out, it’s pretty darned easy today, and I have to wonder if there are transgender team members who are responsible for this fact.

For most cisgender people, the only time you’d think about changing your name is when you get married, and then it’s only your last/family name. Changing your first/given name may feel like a weird corner case, but it definitely won’t feel this way if you or someone you love is transgender. In that case, you understand and appreciate the impact of deadnaming, and you may well have experienced the struggle of making name and email changes in system after system after system.

…

Rather than going on with more examples, I’ll get to the point: If you have a more diverse team, you have a better chance of building a product that is better for more customers, and of shipping the right thing sooner rather than later.

To me this is an obvious truth because I have seen it play out again and again, for good and for ill. Not everyone agrees. There are still people who use the term “diversity hire” as a pejorative. This summer the amazing intern working with my team was told by one of her fellow interns that the only reason she got the position was because she was female, and there was a quota[3]. Although some people[4] may be threatened by the recognition of the value of diversity, that doesn’t reduce the value in any way.

Join a diverse team. Form a diverse team. Support a diverse team. And build something that’s what the world needs, even if the world doesn’t look just like you.

[2] The cloud identity that used to be called Live ID, after it was called Passport. It’s the email address you use to sign into everything from Windows to Outlook to OneDrive.

[3] Hint: It wasn’t, and there wasn’t. She was awesome, and that’s why she got the internship. I sure hope she sticks with it, because if I had been half as good at 21 as she is, I would be ruling the world today.

[4] Typically the people who directly benefit from a lack of diversity. Yes, typically white heterosexual cisgender males. Typically people who look like me.

This post presents a variation on the data profiling pattern that doesn’t require Premium capacity. Let’s jump right in.

This is the approach that I took last time: I created a single dataflow to contain the data profiles for entities in other dataflows. As you can see, my workspace is no longer backed by Premium capacity, so this approach isn’t going to work.

Instead of having a dedicated “Data Profiles” dataflow, we’re going to have data profile entities in the same dataflows that contain the entities being profiled. Dataflows like this one.

As you can see, this dataflow contains two entities. We want to profile each of them. The most intuitive approach would be to create new queries that reference the queries for these entities, and to put the profile in the dependent query…

…but if you do this, Power BI thinks you’re trying to create a computed entity, which requires Premium.

Please allow me to rephrase that last sentence. If you reference a query that is loaded into a dataflow entity, you are creating a computed entity, which requires Premium.

So let’s not do that.

Specifically, let’s use the same pattern we used in the “reuse without premium” post to address this specific scenario.

Let’s begin by disabling the data load for the two “starter” entities that reference the external data source.

Once this is done, the Premium warning goes away, because we’re no longer trying to create computed entities.

Let’s rename the queries, and look at the M code behind the new queries we’ve created.

As you can see, the new queries don’t contain any real logic – all of the data acquisition and transformation takes place in the “source” queries. The new ones just reference them, and get loaded into the CDM Folder that’s backing the dataflow.
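In this pattern, each entity query is little more than a reference. Here’s a minimal sketch in M – the query and entity names are hypothetical, not the ones from my actual dataflow:

```
// The "Product" entity is loaded into the CDM Folder, but all of the
// data acquisition and transformation logic lives in the load-disabled
// "Product Source" query that this query references.
let
    Source = #"Product Source"
in
    Source
```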

At this point we’re functionally right back where we started – we just have a more complex set of queries to achieve the same results. But we’re also now positioned to add in queries to profile these entities, without needing Premium.

To do this, we’ll simply add new queries that reference the “source” queries, and add a step that calls Table.Profile().[1]
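A profile entity is just one more step on top of the same reference pattern. A sketch, again assuming a hypothetical load-disabled query named “Product Source”:

```
// Table.Profile returns one row per column of its input table, with
// statistics such as Min, Max, Average, Count, NullCount and
// DistinctCount - a ready-made data profile entity.
let
    Source = #"Product Source",
    Profile = Table.Profile(Source)
in
    Profile
```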

And that’s that.

When I save my dataflow and refresh it, both the data entities and the data profile entities are loaded and saved for reuse. When I connect to this dataflow from Power BI Desktop, all four entities are available.

At this point you may be wondering what the difference is between this approach and the approach that uses computed entities. To help answer this question, let’s look at the refresh details in the CSV file that can be downloaded from the refresh history.[2]

If you look at the start and end time for each of the four entities, you’ll see that each of them took roughly the same time to complete. This is because for each entity, the query extracted data from the data source and transformed it before loading into the CDM Folder. Even though the extract logic was defined in the shared “source” queries, when the dataflow is refreshed each entity is loaded by executing its query against the data source.

By comparison, in the data profiling pattern that relies on computed entities, the data source is not used to generate the profile. The computed entity uses the CDM Folder managed by Power BI as its source, and generates the profile from there. This means that the data source is placed under lighter load[3], and the profile generation itself should take less time.

For meaningfully large data sources, this difference may be significant. For the trivial data sources used in this example, the difference is measured in seconds, not minutes or hours. You’ll probably want to explore these patterns and others – I’m eager to hear what you discover, and what you think…

Q: What are Power BI dataflows?

A: Dataflows are a capability in Power BI for self-service ETL and data preparation that enables analysts and business users to define and share reusable data entities. Each dataflow is created in a Power BI workspace and can contain one or more entities. Each entity is defined by a Power Query “M” query. When the dataflow is refreshed, the queries are executed, and the entities are populated with data.

Q: Where is the data stored?

A: Data is stored in Azure Data Lake Storage gen2 (ADLSg2) in the CDM folder format. Each dataflow is saved in a folder in the data lake. The folder contains one or more files per entity. If an entity does not use incremental refresh, there will be one file for the entity’s data. For entities that do use incremental refresh, there will be multiple files based on the refresh settings. The folder also contains a model.json file that has all of the metadata for the dataflow and the entities.
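As a rough sketch of that model.json metadata (the names and path here are invented, and many properties are omitted), a dataflow with a single entity and a single partition looks something like this:

```
{
  "name": "Sales Dataflow",
  "entities": [
    {
      "$type": "LocalEntity",
      "name": "Product",
      "attributes": [
        { "name": "ProductID", "dataType": "int64" },
        { "name": "ProductName", "dataType": "string" }
      ],
      "partitions": [
        {
          "name": "Product",
          "location": "Product/Product.csv"
        }
      ]
    }
  ]
}
```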

Q: Do I need to pay for Power BI dataflows?

A: Yes, but you don’t need to pay extra for them. Dataflows are available to Power BI Pro and Premium users.

Q: Do I need Power BI Premium to use dataflows?

A: No. Although some specific features (incremental refresh of dataflow entities and linked/computed entities) do require Premium, dataflows are not a Premium-only capability.

Q: How much data storage do I get?

A: Storage for dataflow entities counts against the existing limits for Power BI. Each user with a Power BI Pro license has a limit of 10GB, and each Premium capacity node has a limit of 100TB.

Q: Do dataflows support incremental refresh?

A: Yes. Incremental refresh can be configured on a per-entity basis. Incremental refresh is supported only in Power BI Premium.

Q: Can I use on-premises data sources with dataflows?

A: Yes. Dataflows use the same gateways used by Power BI datasets to access on-premises data sources.

Q: How do I do X with dataflows? I can do it in a query in Power BI Desktop, but I don’t see it in the dataflows query editor UI!

A: Most Power Query functionality is available in dataflows, even if it isn’t exposed through the query editor in the browser. If you have a query that works in Power BI Desktop, copy the “M” script into a “blank” query to create a new dataflow entity. In most cases it will work.
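For example, this hypothetical query (the URL and column name are made up) works the same whether it was authored in Power BI Desktop or pasted into a “blank” dataflow query:

```
// Copied from Power BI Desktop's Advanced Editor into a blank query.
// The steps run in a dataflow even if the browser-based query editor
// UI doesn't expose all of them.
let
    Source = Csv.Document(Web.Contents("https://example.com/products.csv")),
    Promoted = Table.PromoteHeaders(Source),
    Typed = Table.TransformColumnTypes(Promoted, {{"Price", Currency.Type}})
in
    Typed
```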

Q: Do I still need a data warehouse if I use dataflows?

A: If you needed a data warehouse before Power BI dataflows, you probably still need a data warehouse. Although dataflows serve a similar logical function as a data warehouse or data mart, modern data warehouse platforms provide capabilities that dataflows do not.

Q: Do I need dataflows if I already have a data warehouse?

A: Dataflows fill a gap in data warehousing and BI tools by allowing business users and analysts to prepare and share data without needing help from IT. With dataflows, users can build a “self service data mart” in Power BI that can be used in their solutions. Because each dataflow entity is defined by a Power Query “M” query, handing off the definitions to an IT team for operationalization/industrialization is more straightforward.

Q: Do dataflows replace Azure Data Factory?

A: No. Azure Data Factory (ADF) is a hybrid data integration platform designed to support enterprise-scale ETL and data integration needs. ADF is designed for use by professional data engineers. Power BI dataflows are designed for use by analysts and business users – people familiar with the Power Query experience from Power BI Desktop and Excel – to load data into ADLSg2.

Q: Will dataflows still be relevant now that ADF includes Power Query?

A: Probably. Power BI dataflows are a self-service data preparation tool that enables analysts and other business users who may not be comfortable using SSIS or ADF to solve data prep problems without IT involvement. This remains true even now that ADF includes Power Query with Wrangling Data Flows.

Q: Can I use dataflows for realtime / streaming data?

A: No. Dataflows are for batch data, not streaming data.

Q: Do dataflows replace Power BI datasets?

A: No. Power BI datasets are tabular analytic models that contain data from various sources. Power BI dataflows can be some or all of the sources used by a dataset. You cannot build a Power BI report directly against a dataflow – you need to build reports against datasets.

Q: How can I use the data in a dataflow?

A: No. Oh, wait, that doesn’t make sense – this wasn’t even a yes or no question, but I was on a roll… Anyway, you use the data in a dataflow by connecting to it with the Power BI dataflows connector in Power Query. This will give you a list of all workspaces, dataflows, and entities that you have permission to access, and you can use them like any other data source.

Q: Can I connect to dataflows via Direct Query?

A: No. Dataflows are an import-only data source.

Q: Can I use the data in dataflows in one workspace from other workspaces?

A: Yes! You can import entities from any combination of workspaces and dataflows in your PBIX file and publish it to any workspace where you have the necessary permissions.

Q: Can I use the data in a dataflow from tools other than Power BI?

A: Yes. You can configure your Power BI workspace to store dataflow data in an Azure Data Lake Storage gen2 resource that is part of your Azure subscription. Once this is done, refreshing the dataflow will create files in the CDM Folder in the location you specify. The files can then be consumed by other Azure services and applications.

Q: Why do dataflows use CSV files? Why not a cooler file format like Parquet or Avro?

A: The dataflows whitepaper answers this one, but it’s still a frequently asked question. From the whitepaper: “CSV format is the most ubiquitously supported format in Azure Data Lake and data lake tools in general, and CSV is generally the fastest and simplest to write for data producers.” You should probably read the whole thing, and not just this excerpt, because later on it says that Avro and Parquet will also be supported.

Q: Is this an official blog or official FAQ?

A: No, no. Absolutely not. Oh my goodness no. This is my personal blog, and I always suspect the dataflows team cringes when they read it.

For me, it often comes down to things that I’m not good at – useful skills and knowledge that I lack, or where my team members are far more advanced. When you consider the people I work with, this can be pretty intimidating[1]. It’s great to know that I can reach out at any time to some of the best experts in the world, but it sometimes makes me wonder what I have to offer to the team when I see them kicking ass and optionally taking names.

As it turns out, I have a lot to offer. Specifically, I’ve been able to do some pretty significant things[2] that have changed the way my team works, and changed the impact that the team has on Power BI as a product. And if I believe what people have told me[3], I’ve implemented changes that probably could not have been made without me.

This brings me back to this question: What makes a high-performing team?

Define and Create Interdependencies. There is a need to define and structure team members’ roles. Think of sports teams: everyone has a position to play, and success happens when all of the players are playing their roles effectively. In baseball, a double play is a beautiful example of team interdependency.

Based on my experiences, I’ll paraphrase this a little. A high-performing team requires team members with diverse and complementary abilities. Everyone should be good at something – but not the same thing – and everyone should be allowed and encouraged to contribute according to his or her personal strengths and interests[5]. Every team member should have significant abilities related to the priorities of the team – but not the same abilities. And equally importantly, each team member’s contributions need to be valued, appreciated, recognized, and rewarded by the team culture.

I’ve worked on more than a few teams where these criteria were not met. At the extreme, there were teams with a “bro” culture, where only a specific set of brash technical abilities were valued[6], so everyone on the team had the same skills and the same attitudes, and the same blinders about what was and what was not important. And of course the products they built suffered because of this myopic monoculture. Although this is an obvious extreme, I’ve seen plenty of other teams that needed improvement.

One example that stands out in my memory was the first major product team I worked on at Microsoft. There was one senior developer on the team who loved sustained engineering work. He loved fixing bugs in, and making updates to, old and complex code bases. He was good at it – really good – and his efforts were key to the product’s success. But the team culture didn’t value his work. The team leaders only recognized the developers who were building the flashy and exciting new features that customers would see in marketing presentations. The boring but necessary work that went into making the product stable and viable for enterprise customers simply wasn’t recognized. Eventually that team member found another team and another company.

I’m very fortunate today to work on a team with incredible diversity. Although most team members[7] are highly skilled at Power BI, everyone has their own personal areas of expertise, and an eagerness to use that expertise to assist their teammates. And just as importantly, we have a team leader who recognizes and rewards each team member’s strengths, and finds ways to structure team efforts to get the best work out of each contributor. Of course there are challenges and difficulties, but all in all it’s a thing of beauty.

Let’s wrap this up. If you’ve been reading the footnotes[8], you’ve noticed that I’ve mentioned imposter syndrome a few times. I first heard this term around 8 years ago when Scott Hanselman blogged about it. I’d felt it for much longer than that, but until I read his post, I’d never realized that this was a common experience. In the years since then, once I knew what to look for, I’ve seen it all around me. I’ve seen amazing professionals with skills I respect and admire downplay and undervalue their own abilities and contributions. And of course I see it in myself, almost every day.

You may find yourself feeling the same way. I wish I could give advice on how to get over it, but that’s beyond me at the moment. But what I can say is this: you’re better than the people you work with[9]. I don’t know what you’re better at, but I’m highly confident that you’re better at something – something important! – than the rest of your team. But if your team culture doesn’t value that thing, you probably don’t value it either – you may not even recognize it.

If you’re in this situation, consider looking for a different team. Consider seeking out a team that needs the thing that you have to give, and which will appreciate and value and reward that thing you’re awesome at, and which gives you joy. It’s not you – it’s them.

Not everyone is in a position to make this sort of change, but everyone can step back to consider their team’s diversity of ability, and where they contribute. If you’ve never looked at your role in this way before, you may be surprised at what you discover.

[1] Epic understatement alert. I work with these guys, and more like them. Imagine my imposter syndrome every damned day.

[2] I will not describe these things in any meaningful way in this post.

[3] Which is far from certain. See the comment on imposter syndrome, above.

[4] Imposter syndrome, remember? Are you noticing a theme yet?

[5] I explicitly include interests here because ability isn’t enough to deliver excellence. If you’re skilled at something but don’t truly care about it, you may be good, but you’ll probably never be great.

[6] Bro, short for brogrammer, with all the pejorative use of this term implies. If you’ve been on one of these teams, you know what I mean, and I hope you’re in a better place now.

[7] Present company excluded, of course.

[8] Yes, these footnotes.

[9] Did this work? I was a little worried about choosing the opening sentence I did, but I wanted to set up this theme later on. Did I actually pull it off, or is this just a cheap gimmick? I’d love to know what you think…

When Microsoft first announced dataflows[1] were coming to Power BI earlier this year, I started hearing a surprising question[2]:

Are dataflows for Master Data Management in the cloud?

The first few times I heard the question, it felt like an anomaly, a non sequitur. The answer[3] seemed so obvious to me that I wasn’t sure how to respond.[4]

But after I’d heard this more frequently, I started asking questions in return, trying to understand what was motivating the question. A common theme emerged: people seemed to be confusing the Common Data Service for Apps used by PowerApps, Microsoft Flow, and Dynamics 365, with dataflows – which were initially called the Common Data Service for Analytics.

The Common Data Service for Apps (CDS) is a cloud-based data service that provides secure data storage and management capabilities for business data entities. Perhaps most specifically for the context of this article, CDS provides a common storage location, which “enables you to build apps using PowerApps and the Common Data Service for Apps directly against your core business data already used within Dynamics 365 without the need for integration.”[5] CDS provides a common location for storing data that can be used by multiple applications and processes, and it also defines and enforces business logic and rules for any application or user manipulating data stored in CDS entities.[6]

And that is starting to sound more like master data management.

When I think about Master Data Management (MDM) systems, I think of systems that:

Serve as a central repository for critical organizational data, to provide a single source of truth for transactional and analytical purposes.

Provide mechanisms to define and enforce data validation rules to ensure that the master data is consistent, complete, and compliant with the needs of the business.

Provide capabilities for matching and de-duplication, as well as cleansing and standardization for the master data they contain.

Include interfaces and tools to integrate with related systems in multiple ways, to help ensure that the master data is used (and used appropriately) throughout the enterprise.

While CDS has many of these characteristics, dataflows fit in here primarily in the context of integration. Dataflows can consume data from CDS and other data sources and make it available for analysis, but their design does not provide any capabilities for the curation of source data, or for transaction processing in general.

Hopefully it is now obvious that Power BI dataflows are not an MDM tool. Dataflows do provide complementary capabilities for self-service data preparation and reuse, and this can include data that comes from MDM systems. But are dataflows themselves for MDM? No, they are not.

[1] At the time, they weren’t called dataflows. Originally they were called the Common Data Service for Analytics, which may well have been part of the problem.

[2] There were many variations on how the question was phrased – this is perhaps the simplest and most common version.

[6] Please understand that the Common Data Service for Apps is much more than just this. I’m keeping the scope deliberately narrow because this post isn’t actually about CDS.

[7] MDM is a pretty complex topic, and it’s not my intent to go into too much depth. If you’re really interested, you probably want to seek out a more focused source of information. MDM Geek may be a good place to start.