Like this:

While I was playing around with the new release (August 2018) of Power BI Desktop I noticed there was an undocumented change: similar to the OData improvements I blogged about here, there is a new option in the AnalysisServices.Database() and AnalysisServices.Databases() M functions that turns on a newer version of the MDX generation layer used by the Power Query engine. Like the OData improvements it is an option called Implementation=”2.0”, used like this:

…and also, as with the OData improvements, you will need to manually edit any existing M queries to take advantage of this.

In fact first heard about this option in a comment on this blog post back in January, but as I was told by the dev team that it hadn’t been tested properly I didn’t blog about it. However as it is now set by default in the M code generated by the Power Query Editor, I guess it’s ready for production use. I’m told it should improve the performance of M queries that import data from Analysis Services – and I would be very interested to hear from anyone who tests this about how much of an improvement they see.

I’ve done a little bit of testing myself and I can see there are indeed some differences in the MDX generated. For example, I created an M query that connected to the Adventure Works DW database and returned all combinations of customer and day name where the Internet Sales Amount measure is greater than 100. In the old version of the MDX generation layer (ie without Implementation=”2.0” set) the following MDX query is generated:

The difference between the two is that the first query uses a subselect to do the filtering whereas the second does not; subselects in MDX are not necessarily bad, but I generally try to avoid using them unless I need to. There may be other differences in the way the MDX is generated in the new version but I haven’t had a chance to do any detailed testing.

Like this:

My posts from two weeks ago (see here and here) on using Process Monitor to troubleshoot the performance of Power Query queries made me wonder about another question: how does the performance of reading data from CSV files compare to the performance of reading data from Excel files? I think most experienced Power Query users in either Power BI or Excel know that Excel data sources perform pretty badly but I had never done any proper tests. I’m not going to pretend that this post is a definitive answer to this question (and once again, I would be interested to hear your experiences) but hopefully it will be thought-provoking.

To start off, I took the 153.6MB CSV file used in my last few posts and built a simple query that applied a filter on one text column, then removed all but three columns. The query ran in 9 seconds and using the technique from my last post I was able to draw the following graph from a Process Monitor log file showing how Power Query reads data from the file:

Nothing very surprising there. Then I opened the same CSV file in Excel and saved the data as an xlsx file with one worksheet and no tables or named ranges; the resulting file was only 80.6MB. Finally I created a duplicate of the first query and pointed it to the Excel file instead. The resulting query ran in 59 seconds – around 6 times slower! Here’s a comparison with the performance of this query with the first query:

The black line in the graph above is the amount of data read (actually the offset values showing where in the file the data is read from, which is the same thing as a running total when Power Query is reading all the data) from the Excel file; the green line is the amount of data read from the CSV file (the same data shown in the first graph above). A few things to mention:

Running Process Monitor while this second query was refreshing had a noticeable impact on its performance – in fact it was almost 20 seconds slower

The initial values of 80 million bytes seem to be where data is read from the end of the Excel file. Maybe this is Power Query reading some file metadata? Anyway, it seems as though it takes 5 seconds before it starts to read the data needed by the query.

There’s a plateau between the 10 and 20 second mark where not much is happening; this didn’t happen consistently and may have been connected to the fact that Process Monitor was running

In any case you don’t need to study this too hard to understand that the performance of reading data from an xlsx format Excel file is terrible compared to the performance of reading data from a CSV. So, if you have a choice between these two formats, for example when downloading data, it seems fair to say that you should always choose CSV over xlsx.

Share this:

Like this:

It’s my pleasure to announce that version 2 of the white paper that I co-wrote with Melissa Coates, “Planning A Power BI Enterprise Deployment”, is now live and can be downloaded from here.

The title should give you a pretty good idea of the subject matter: the aim is to help anyone who intends to roll out Power BI within their organisation understand the different components of Power BI, the ways it can be deployed, how the licensing works, how it can connect to different types of data source, how it should be monitored and administered, how security works and so on. In fact, at 188 pages, it’s more of a book than a white paper and it is substantially expanded, updated and improved compared to the previous version. One of the aims that Melissa and I had when we started on this new version was to clarify certain topics that we know everyone struggles with, such as what the differences between the various Premium tiers are, what embedding is and what licenses are necessary to do it, and how sharing and collaboration should be managed. There’s also a certain amount of new information about how Power BI will be changing in the next few months (you should also read the very new October 18 release notes for Power BI and data integration if you haven’t done so already). I certainly learned a lot myself while writing it.

I would like to say a particular thank you to my co-author Melissa who was, once again, a pleasure to work with. She is extremely knowledgeable (especially about things like the new Azure data-related services that I know very little about), a great writer, very good at designing tables and charts (which I’m awful at), very reliable and above all very patient. I am also incredibly grateful to Adam Wilson and Lukasz Pawlowski from Microsoft who spent many long hours on calls with us explaining how things work in Power BI and what’s coming in the pipeline, and also to Meagan Longoria and the large number of other Microsoft employees who reviewed it.

I hope you enjoy reading it!

Share this:

Like this:

This post is really just a quick follow-on from my post earlier this week on using Process Monitor to troubleshoot Power Query performance issues with file-based data sources, which I suggest you read before carrying on. I realised, after playing around with Process Monitor some more, that the ReadFile operation actually tells you how much data is being read from a file when a Power Query query is running. For example, here’s a sample of some of the ReadFile operations captured while running the unoptimised version of the query I talked about in my last post:

Since Process Monitor can export captured events to a CSV file, it’s pretty easy to load the events into Power BI, filter the events down to only the ReadFile operations, parse the Detail column to extract the Offset values (which I’m sure you can work out how to do if you’re reading a post like this), and then draw a graph showing how much data gets read from a file when a query is run. Here’s what the graph looks like for the unoptimised version of the query from my previous blog post, with relative time on the x axis and the amount of data read in bytes on the y axis:

In that post I noted that there were six reads of the file – and while that’s clear from the graph above, it’s also possible to see that the first read does not read the whole contents of the file while the next five do (the file is 149MB). So maybe I was right that there is one complete read of the file for each row in the output query? What is that first, partial read for, I wonder?

Like this:

Troubleshooting Power Query performance issues in Power BI and Excel can be difficult because it’s a bit of a black box: there’s nothing in the UI to tell you what’s going on inside the Power Query engine and the diagnostic logs are very difficult to interpret. With relational data source like SQL Server you can use tools like SQL Server Profiler to see the queries that are being run by Power Query, and I blogged recently about using Fiddler to troubleshoot OData performance issues; but what about file-based data sources, which often present the most challenges regarding performance?

Process Monitor, a free tool from Microsoft, allows you to monitor file system activity in real-time and even having spent a limited amount of time using it I can already tell that it can provide a lot of information to help identify performance issues with file-based data sources. Take, for example, the scenario I described in my recent post on improving the performance of merge operations. In that post (which I suggest you read before you carry on) I mentioned that it looked as though the Power Query engine was reading data from one of the source files multiple times and Process Monitor confirms that this indeed the case.

Disclaimer: I am not going to pretend to be an expert on Process Monitor, the Windows file system or the internals of Power Query. The rest of this post contains the results of my experiments and the conclusions I have drawn from them, which may or may not be correct!

There are a lot of resources online describing Process Monitor (I also took a look at this book, which is very helpful, as is the help file) and it’s also fairly intuitive, so I’m not going to attempt to explain the basics of how to use it, but with a bit of trial-and-error I set up a simple filter consisting of:

For the Power Query engine it seems as though the Process Name is Microsoft.MashupContainer.NetFX40.exe.

There are obvious pairs of CreateFile/CloseFile operations

Some of these pairs, where the detail for the CreateFile operation is “Desired Access: Read Attributes”, seem to be for reading file metadata and are very fast

The pairs where the CreateFile operation has a detail of “Desired Access: Generic Read”, seem to be where the contents of the file are read (the first of these pairs is highlighted in the screenshot). The Relative Time column shows the time elapsed since Process Monitor started capturing events, and the difference between the Relative Time values for the CreateFile and CloseFile operations in these pairs is around 5-10 seconds.

It looks like the Power Query engine reads some or all of the data from the csv file six times in total. This tallies roughly with the values shown in the Power BI UI for the amount of data read from disk, although it does not tally with my guess that it was reading the data from the file once for each of the five rows in the output query.

The difference in the Relative Time values of the first and last events is around 45 seconds; the query itself takes around 55 seconds to run.

With the optimisation described in the blog post (ie doing a “Remove Duplicates” on one of the columns) in place, Process Monitor shows the following instead:

As you can see, it looks like Power Query now only reads from the file twice and the difference in time between the first and last event is now only around 8 seconds.

This is clearly useful: we can now see when Power Query opens a file, how often it opens a file, and how long it holds the file open for each time. It also raises a number of other questions which I can’t answer properly yet:

Why does Power Query need to open a file multiple times? Actually this isn’t a surprise to anyone who has spent a lot of time with Power Query and has a vague understanding of how it works, but it seems fair to say that the unoptimised query in this example was slow because Power Query had to open and read from the file multiple times.

What is Power Query doing exactly when it has the file open? Is it reading some or all of the data in the file? What is it doing for the rest of the time?

Are some file formats inherently slower than others as far as Power Query is concerned? It seems so: for example Excel files seem to be a lot slower than csv files.

So yet more research is required… but still very interesting, don’t you think? If I’ve made any mistakes here, or if you use Process Monitor to see what happens when you run your queries and you find something else useful, please leave a comment below.

Share this:

Like this:

It’s very common that you need to combine data from multiple worksheets in the same Excel workbook when you’re using Power BI or Power Query/Get&Transform in Excel. Indeed a lot of people have blogged about how to solve this problem, but none of the solutions I’ve found on the internet work in more complex scenarios when the data on each sheet needs some kind of transformation before it can be combined. I was asked to explain how to do this recently while teaching a Power BI class, so in this blog post I’m going to walk through a worked example and point out a few issues that might trip up even experienced Power BI users.

First of all, the source data. Let’s say you have an Excel workbook with four worksheets: Q1, Q2, Q3 and Q4. On each worksheet is some sales data for the three months in each quarter; for example the Q1 worksheet looks like this:

…the Q2 worksheet looks like this:

…and so on. The required output for Power BI should be a table that looks like this:

Now most of the blog posts that describe this problem, such as Ken Puls’s post here, assume each worksheet has a table with the same column names on it. If each sheet has the same columns, this means you can just connect to the Excel workbook and get a table containing the contents (Miguel Escobar has a great post describing how to do this here) and then click the Expand/Aggregate button:

However in this particular case it doesn’t solve the problem, because we get this:

Aha, you may say, we have to transform the data before we can combine it and so we need to create a function and call it for every worksheet – the technique I’ve already blogged about here. And yes, that is basically what needs to happen, but the devil’s in the detail.

Here’s the solution, step-by-step:

Step 1: Get a table with all the worksheets listed

In Power BI connect to your Excel file as normal, then in the Navigator pane right-click on the name of the Excel workbook and select Edit rather than selecting any of the individual worksheets:

The result will be a table that looks something like this:

If you need to, filter out any rows that do not contain “Sheet” in the Kind column and also filter out any worksheets that you don’t want to combine data from.

Step 2: Create your template query

Duplicate the query above and call the new query Template.

Now, in the Template query, select one of the worksheets to use to build the query whose logic will be applied to all the other worksheets, and filter the table above so it only contains the row for that worksheet. In this case I’m using the worksheet called Q1:

Then – and this is important – remove all the other columns in the table except the Data column:

Doing this changes the M code generated for the next thing you’ll do; removing all these columns changes the way the row is referenced (see the section on “The effect of primary keys” in this post) and makes sure the name of the worksheet won’t be hard-coded anywhere.

After that click the Table link inside the cell, and you’ll see the contents of the worksheet:

There will probably be a Changed Type step in the query that sets the data types for each of the columns, and you will need to delete it:

You can now perform any other transformations you need on this query, but you will need to avoid any transformations that generate M code referring to any columns on the original worksheet that aren’t present on other worksheets. Remember, these transformations will need to be applied to the other worksheets and they will fail if they refer to columns that aren’t present – this is why you had to delete the Changed Type step earlier, because it sets the types on the January, February and March columns, and you’ll probably need to delete any other Changed Type steps that are created elsewhere in the query. Open up the Advanced Editor and check the M code for the whole query just to be sure.

In this case all I need to do is unpivot the month columns by selecting the Product column and using the Unpivot Other Columns button on the Transform tab, and then renaming the columns appropriately:

Step 3: Create a function

Next you need to create a new parameter by clicking the Manage Parameters/New Parameter button, call the parameter Worksheet, set the data type to text and have it return the name of the worksheet you chose in the previous step:

Now, go back to the Template query, find the step called Filtered Rows towards the beginning where you filtered down to a single worksheet, and click the gear icon next to the step to edit it:

Then, edit the step so it uses the value returned by the parameter to filter by instead of the hard-coded value you entered earlier. To do this, click on the button shown below, select Parameter and then select the Worksheet parameter in the next dropdown box along:

Finally, go to the Queries pane on the left-hand side of the screen and right-click on the Template query and select Create Function..

You’ll be prompted to give the new function a name; call it GetData:

Step 4: Invoke the function and combine the data

Finally, go back to the duplicate copy of the original query created at the beginning of step 2. Then go to the Add Column tab on the ribbon and click the Invoke Custom Function button and invoke the GetData function, passing in the contents of the Name column to the function’s only parameter:

Last of all, click the Expand/Aggregate button on the new column and expand the nested tables:

After removing any unnecessary columns, you’ll see the data from all the worksheets combined into a single table as desired:

Don’t forget to set the data types on each column.

You can download the Excel workbook used in this post here and the sample Power BI Desktop file here.

Post navigation

Follow Blog via Email

Social

Need some help?

As well as being a blogger, I'm an independent consultant specialising in Analysis Services, MDX, DAX, Power BI, Power Query and Power Pivot. I work with customers from all round the world solving design problems, performance tuning queries and delivering training courses, and I am happy to work on short-term engagements. For more details see http://www.crossjoin.co.uk