Lego Bricks and the Spectrum of Data Enrichment and Reuse

I love them because they were a favorite toy from my childhood. I love them because my children love them. I love them because they foster and encourage creativity.

I love them because they serve as an excellent metaphor for the enrichment and reuse of data in the enterprise.[1]

Consider this starting point.

A single brick has almost unlimited potential. This brick could become the floor of a building, the chassis of a fire truck, the wing of an airplane or spaceship, or part of something that no Lego engineer had ever imagined. This potential comes with a cost – this single brick must be enriched with many other bricks before it can achieve any of these lofty goals.

Similarly, a single file, entity, or table has massive potential. The data it maintains could become part of many different applications in many different contexts. But as with the lone brick, it would need to first be enriched, to be combined with other data for that potential to be realized.

As each additional brick is combined with the first brick, its complexity increases, but so does its value. These more complex components are closer to the ultimate end state, so less work will be required to achieve that goal.

But at the same time, each additional brick reduces the potential for reuse. After only a few additions we’ve already ruled out creating a floor or an airplane wing. We might still create a car or truck, or some types of spaceships, but the enrichment we’ve performed is already constraining our potential.

Similarly, each time we enrich a data source to move closer to our goal, we also limit the potential scenarios for which the data can be used. If we filter the data horizontally or vertically to eliminate records or fields we don’t currently need, we are eliminating data that may be needed for some other purpose. If we merge our starter data set with additional data, we may also be adding records or fields that aren’t needed for future purposes, while increasing complexity and adversely affecting performance.

As we continue building, we see this pattern continue. We also see the pattern repeated on multiple levels, while contributing to the overall goal. At multiple points we will combine a small number of individual bricks to build a component, and then add that component to the main model to make it more complex, and more specialized. Each time – both for the component and for the whole – the process of enrichment adds value and complexity, and reduces the potential scope of reuse. When the final model is finished we have the final model we needed[2]. The value is very high, but the opportunities for reuse are very small.

The parallel is clear: when we finish building a BI application, the value is very high, but the potential for reuse is very low. The dashboards and reports, the data model with its measures, dimensions, and KPIs, the data warehouse and data lake, and all of the upstream logic and components that make up a BI solution need to be combined in specific ways to achieve specific goals. The application as a whole is specialized for a given purpose…

…but what about the components that make up the application? Where can we find the sweet spot, the perfect balance between achieved value and potential value?

Like these Lego components:

When you’re building your solution using Power BI, this is where dataflows come in.

The answer will of course differ for each context, but when designing an application, it’s important to take larger and longer-term requirements into account.

Consider this diagram[3]:

In this simple architecture, each dataflow (represented by the lighter squares) represents a stage in the overall data processing and enrichment flow. Each one adds value toward the application, and each serves as a building block that can be further reused both in this application and in other applications with overlapping data requirements.

The nature of Power BI dataflows lends itself well to this problem – each dataflow is a collection of reusable data entities managed by the Power BI service, easily discoverable and usable by technical and business users in BI applications. The computed entities feature in particular makes this type of scenario easy to set up and manage.

At the same time, the nature of Power BI dataflows introduces challenges for this Lego-style reuse. Dataflows in Power BI are optimized to enable non-IT users, business users and analysts who are typically focused on solving their immediate data challenges without relying on support from IT. These users are less likely to be focused on bigger-picture requirements like broad reuse of the entities they create.

This is where a little process and collaboration can come in, aided by the fact that dataflows are managed by the Power BI service. Power BI administrators can monitor the service to understand what dataflows are being used most frequently and most widely, and in what contexts. With this as a starting point, they can then operationalize[4] dataflows and entities created by business users, so that they are managed and maintained by IT. Since each dataflow entity is defined by the Power Query “M” code in the entity definition, this operationalization process is likely to be simpler and easier than similar processes with other technologies.

This approach also fits in well with how many larger enterprises implement Power BI. It is common[5] for larger organizations to use both shared capacity and dedicated Premium capacity for different purposes, and often those applications deployed to Premium capacity are those that are managed by a central IT/BI team. Since computed entities are only available when using Power BI Premium[6], this approach could lend itself well to the hand-off from business users to IT.

In any event, the next time you’re building dataflow entities, pause for a moment to think about Lego bricks, and what types of bricks or components your entities and dataflows represent. And then maybe take a break to go play with your kids.

[1] All images and examples in this post are taken from the building instructions for the Lego City Starter Set. I used to have a similar Lego set in my office that I would use to tell this story in person, but I gave it away during an office move shortly before I started this blog. The moral of the story: never get rid of your Lego sets.

[2] A fire truck!!

[3] This diagram is my reformatted version of a diagram included in the Power BI dataflows whitepaper. If you haven’t read this document, you really should.

[4] Or industrialize – which term do you use?

[5] Please note that this is not a blanket recommendation. I have the advantage of talking to many top Power BI customers around the world, so I can see this approach emerging as a common pattern, but the “right” approach in a given context will always depend on the context and the goals of the organization. I personally believe it is too early to start talking about best practices for Power BI dataflows (as I write this, dataflows have been in public preview for barely three weeks) but this is one of the areas where I am most excited to see best practices start to emerge.

Share this:

Like this:

Related

Post navigation

18 thoughts on “Lego Bricks and the Spectrum of Data Enrichment and Reuse”

Darryll Petrancuri

Hello Matthew:

Perhaps you can help me wrap my head around something.

I would like to take advantage of features in Power BI Premium such as Data Flows but may have a need to connect Tableau, etc. to datasets. As I understand it, Tableau needs to connect to an Analysis Services (AS) instance.

Now, if I’m not mistaken Power BI Premium actually includes a dedicated Azure AS instance under the covers. Do I have the ability to interact with / connect to that Azure AS instance externally, through tools such as Tableau?

If not, do I have the ability to bring an existing Azure AS instance to Power BI Premium and configure Power BI Premium to use the existing Azure AS instance in lieu of the dedicated Azure AS instance created by / managed by Power BI Premium.

Also, as it relates to Data Flows, I know from presentations that Data Flows can be consumed by Azure AS.

I hope you’re able to follow the bouncing ball.

I would appreciate any guidance, thoughts, clarification that can be offered.

I’d recommend watching Christian Wade’s session from Microsoft Ignite. The Power BI team is planning to open the XMLA endpoint on datasets (which are tabular models managed by the Power BI service) so that client tools such as SSMS or Tableau can connect to them. This capability is not yet available, and I believe the Ignite presentation is the best public source of information today.

The dataflows connection feels a little tenuous to me, but I hope this is helpful nonetheless. 😉

Great metaphor for the role of dataflows in the new self service world! Thank you for your detailed explanation!

It is a fact that Dataflows have immense potential to collabrate busines with IT but Traditional BI teams are increasingly becoming uncomfortable in managing self service ETL solutions.

What would the role of Traditional BI Teams in the context of these new capabilities. Should they adopt these new capabilities for the ETL workloads? Will the dataflows bex exposed to other client tools as well in short future. It would be really good if Microsoft can publish some scenario based use cases for dataflows.

There is *not* a mechanism in place today to let Power BI dataflows be used as a data source in AAS. The diagram (which as mentioned elsewhere is one I stole from Amir Netz’s dataflows whitepaper and reformatted for PowerPoint) describes a future scenario.

With dataflows in preview (one month as of today!) there is still a lot of functionality that is in development but not yet available for general use…