Category Archives: M

This post is really just a quick follow-on from my post earlier this week on using Process Monitor to troubleshoot Power Query performance issues with file-based data sources, which I suggest you read before carrying on. I realised, after playing around with Process Monitor some more, that the ReadFile operation actually tells you how much data is being read from a file when a Power Query query is running. For example, here’s a sample of some of the ReadFile operations captured while running the unoptimised version of the query I talked about in my last post:

Since Process Monitor can export captured events to a CSV file, it’s pretty easy to load the events into Power BI, filter the events down to only the ReadFile operations, parse the Detail column to extract the Offset values (which I’m sure you can work out how to do if you’re reading a post like this), and then draw a graph showing how much data gets read from a file when a query is run. Here’s what the graph looks like for the unoptimised version of the query from my previous blog post, with relative time on the x axis and the amount of data read in bytes on the y axis:

In that post I noted that there were six reads of the file – and while that’s clear from the graph above, it’s also possible to see that the first read does not read the whole contents of the file while the next five do (the file is 149MB). So maybe I was right that there is one complete read of the file for each row in the output query? What is that first, partial read for, I wonder?

Share this:

Like this:

About two years ago I wrote a blog post describing how the #table M function can be used to generate tables, but in that post I only covered the functionality I used regularly – namely using #table with a list of column names or a table type in the first parameter. However there two other variations on #table that I have used recently that I thought were worth pointing out.

For example, if you need to generate a table with a set number of columns but you don’t care what the columns are called, you can use an integer in the first parameter to get a table with that number of columns. The following expression returns a table with four columns of data type Any called Column1, Column2, Column3 and Column3, and no rows:

#table(4,{})

Also, if you have a list of lists with an unknown number of items in and you want to use each nested list for the row values in a table, you can use a null value in the first parameter of #table. The following expression returns a table with four columns like the one above, but with two rows of integer values:

Like this:

A long time ago someone – probably from the Power Query dev team – told me that adding primary keys to M tables could improve the performance of certain transformations in Power BI and Excel Power Query/Get&Transform. Indeed, I mentioned this in chapter 5 of my book “Power Query for Power BI and Excel”, but I it wasn’t until this week that I found a scenario where this was actually the case. In this post I’ll describe the scenario and try to draw some conclusions about when adding primary keys to tables might make a difference to performance.

Imagine you have two queries that return data from two different files. First of all there’s a query called #”Price Paid”, which reads data from a 140MB csv file containing some of my favourite demo data: all 836933 property transactions in the year 2017 from the UK Land Registry Price Paid dataset. The query does nothing special, just reads the all the data, renames some columns, removes a few others, and sets data types on columns. Here’s what the output looks like:

Second, you have a query called #”Property Types” that reads data from a very small Excel file and returns the following table:

As you can see, the Property Type column from the #”Price Paid” query contains single letter codes describing the type of property sold in each transaction; the Property Type column from #“Property Types” contains a distinct list of the same codes and acts as a dimension table. Again there’s nothing interesting going on in this query.

The problems start when you try to join data from these two queries using a Merge and then, for each row in #”Property Types”, show the sum of the Price Paid column from #”Price Paid”. The steps to do this are:

1. Click on the Merge Queries/Merge Queries as New button on the Home tab in the Power Query Editor:

2. In the Merge dialog, do a left outer join between #”Property Types” and #”Price Paid” on the Property Type column on each table

3. Back in the Power Query Editor window, click on the Expand/Aggregate button in the top right-hand corner of the column created by the previous step.

It’s a very common thing to do, and in this case the query is extremely slow to run: on my laptop, when you refresh the preview in the Power Query Editor window in Power BI Desktop it takes 55 seconds. What’s more, in the bottom left-hand corner of the screen where it displays how much data is being read from the source file, it shows a value that is a lot more than the size of the source file. In fact it seems like the source file for #”Price Paid” is read once for each row in the #”Property Types” query: 140MB * 5 rows = 700MB.

Not good. But what if you specify that the Property Type column from the #”Property Types” query is the primary key of the table? After all it does contain unique values. Although it isn’t widely known, and it’s not shown in the UI, tables in M can have primary and foreign keys defined on them whatever data source you use (remember that in this case the data sources are csv and Excel). One way to do this is using Table.AddKey as follows:

And guess what? After making this change, the query only takes 12 seconds to run instead of 55 seconds! What’s more the amount of data read from disk shown in the UI suggests that the source file for #”Price Paid” is only read once. But making this change involves writing M code, and not everyone is comfortable making changes in the Advanced Editor. The good news is that it’s possible to get this performance benefit in another way without writing any M code.

As I mentioned in this blog post, using the Remove Duplicates transformation on a column has the side-effect of marking that column as a primary key. Therefore if you right-click on the Property Type column and select Remove Duplicates in between steps 2 and 3 above, before you click on the Expand/Aggregate button:

…then you also get the performance benefit. Here’s the M code generated by the UI for this query:

It’s #”Removed Duplicates” that is the new step here, and it uses the Table.Distinct function.

All this suggests that if you are doing merge operations on queries that get data from large csv or Excel files and then aggregating data, if you can set a primary key on one of the columns used in the join it can make a massive difference to performance. However this is just one example, and I would be really interested to hear if any of you can reproduce these findings with your own data and queries – if you can, please leave a comment. I suspect this could be a very important discovery for M query performance tuning!

Share this:

Like this:

CSV files are one of the most commonly used data sources in Power BI and Power Query/Get&Transform, and yet the documentation for the Csv.Document M function is very limited and in some cases incorrect. In this rather long post I’ll show you as many of the capabilities of this useful function as I’ve been able to discover.

The Source parameter

The Csv.Document function returns a table, and the first (and only non-optional) parameter of this function is the source data in CSV format. Normally this is a binary value returned by the File.Contents function. For example take the this simple CSV file with no column headers and one row of data:

The following M code uses File.Contents to read the contents of the file, and then passes the contents to Csv.Document to be interpreted as a CSV file:

In both of these examples I’m relying on the default behaviour of the Csv.Document function with regard to delimiters and other properties, which I’ll explain in more detail below.

Using a record in the second parameter

The second parameter of Csv.Document can be used in several different ways. In code generated by the Query Editor UI it usually takes the form of a record, and the different fields in the record specify how the function behaves in different scenarios. For example, if you connect to the CSV file shown above by selecting the Text/CSV source in the Query Editor UI, you’ll see the following window appear showing a preview of the data and three options:

The query above shows the Csv.Document function with a record in its second parameter containing four fields: Delimiter, Columns, Encoding and QuoteStyle. There is also a fifth field that can be added to the record, CsvStyle, but this cannot be set anywhere in the UI.

The Data Type Detection option shown in the screenshot gives you three options for detecting the data types in each column of your file: by default it looks at the first 200 rows in the dataset, but you can also ask it to look at the entire dataset (which may be slower) or not to detect data types at all, in which case it will treat all columns as text. In this case data types are not set in the Csv.Document function but in the #”Changed Type” step with the Table.TransformColumnTypes function, but as we will see later it is possible to set column names and data types in a single step with Csv.Document instead.

The Encoding field

The File Origin dropdown menu shown above corresponds to the Encoding field in the Csv.Document function. This integer value specifies the code page used to encode the contents of the file:

In the M query in the previous section the 1252 code page is set explicitly. The following M query sets the (incorrect) 1200 code page for the CSV file shown above:

The Delimiter dropdown allows you to specify the delimiter used to separate the columns in each row of data. There are a number of options available through the UI, including commas and tabs, and the Custom option allows you to enter your own delimiter:

If you specify a single character delimiter at this point then the Delimiter field of the record in the second parameter of Csv.Document is set; the Custom and Fixed Width options shown here use a different form of the Csv.Document function described below. If the Delimiter record is not set then a comma is used as the delimiter. If you want to use a special character like a tab then you need to use an M escape sequence; for example to use a tab character as delimiter you need to use the text “#(tab)” which returns a text value containing just a single tab character.

The Columns field is not explicitly set by the user when you first connect to a CSV file via the UI, but the UI infers its value from the number of columns it finds in the CSV file. This can be a problem if the number of columns in the source data changes in the future because, as shown above, when the Columns field is set the table returned always has a fixed number of columns. As a result if the number of columns in the data source increases in the future you will find columns on the right-hand side of the table are not returned; similarly if the number of columns decreases you’ll see unwanted empty columns. Several people such as Prathy Kamasani have blogged about this problem and it may be better to delete the Columns field from the record, or not set the Columns field in the first place, in order to avoid it. If you do not set the Columns field then Csv.Document returns a table with the number of columns that are present in the first row of your source data.

The QuoteStyle field

The QuoteStyle field can take two possible values of type QuoteStyle: QuoteStyle.None and QuoteStyle.Csv. Here’s what the built-in documentation has to say about the QuoteStyle type:

While the value for QuoteStyle is set automatically when you connect to a file, if you edit a step in the Query Editor that uses Csv.Document you can change this value in the UI in the Line Breaks dropdown shown here:

As the screenshot above suggests this field controls whether line breaks inside text values are respected. For both QuoteStyle.None and QuoteStyle.Csv, if you wrap a text value inside double quotes those double quotes are used to indicate the start and the end of the text value and are not shown in the output; if you want a double quote to appear, you have to double it up. However if QuoteStyle.None is set then line breaks are always respected, even if they appear inside double quotes; if QuoteStyle.Csv is set, then line breaks inside double quotes are ignored. Take the following CSV file for example:

Returns the text “four” in double quotes on the second line, because the space before the comma on the second line changes how the double quotes are treated:

Using a list or a table type in the second parameter

If the first line of your CSV file contains column headers and you connect to the file using the Query Editor user interface, in most cases this will be detected and an extra step will be added to your query that uses Table.PromoteHeaders to use these values as the column headers. However if you don’t have column headers inside your CSV file, instead of a record it is also possible to supply a list of column names or even better a table type to define the columns present in your CSV file. When you do this Csv.Document has three other parameters that can be used to do some of the same things that are possible if you use a record in the second parameter – Delimiter, ExtraValues and Encoding – and they are described below.

For example, in the following CSV file there are three columns: Month, Product and Sales.

Using a list of text values containing these column names in the second parameter of Csv.Document, as in the following M query:

This has set the column names correctly but the data types of the three columns are set to text. What if I know that only the Month and Product columns contain text and the Sales column should be a number? Instead of a list of column names, using a table type allows you to set names and data types for each column:

If you have used a list of column names or a table type in the second parameter of Csv.Document, you can use the third parameter to control how each row of data is split up into columns. There are two ways you can do this.

First of all, you can pass any piece of text to the third parameter to specify a delimiter. Unlike the delimiter field of the second parameter described above, this can be a single character or multiple characters. For example, the following M query:

Instead of text, the Delimiter parameter can also take a list of integer values to allow you to handle fixed-width files. This functionality is available from the UI when you choose the Fixed Width option from the Delimiter dropdown box when you connect to a CSV file for the first time:

Each integer in the list represents the number of characters from the start of the row that marks the start of each column; as a result, each integer in the list has to be a larger than the preceding integer. The values are 0-based so 0 marks the start of a row. For example, the M query:

The ExtraValues parameter allows you to handle scenarios where there are extra columns on the end of lines. This isn’t quite as useful as it sounds though: most of the time when the number of columns varies in a CSV file it’s because there are unquoted line breaks in text columns, in which case you should make sure your source data always wraps text in double quotes and use the QuoteStyle option described above, or if you can’t fix your data source, see this post.

The ExtraValues parameter can take one of three values of type ExtraValues: ExtraValues.List, ExtraValues.Ignore and ExtraValues.Error.

Consider the following CSV file with two extra columns on the second row:

As you can see from the screenshot below, because we have specified that there are three columns in the table, the error “There were more columns in the result than expected” is returned for each cell on the second line:

The same thing happens when ExtraValues.Error is explicitly specified in the fourth parameter, like so:

Setting ExtraValues.List allows you to capture any extra column values in a list; however, if you want to do this you will need to add an extra column to your table to hold these values. For example, notice in this query that four columns rather than three have been defined:

On the first and third rows the Extra Columns column contains an empty list. On the second row, however, the Extra Columns column contains a list containing two values – the two values from the two extra columns on that line.

The Encoding parameter

The Encoding parameter corresponds directly to the Encoding field used when you pass a record to the second parameter, as described above. The one difference is that it can take an integer or a value of type TextEncoding, although the TextEncoding data type only contains values for some of the more common code pages so the only reason to use it would be for readability:

If you specify a list of column names or a table type in the second parameter of Csv.Document there’s no way to set CsvStyle or QuoteStyle – these options are only available when you use a record in the second parameter. The behaviour you get is the same as CsvStyle.QuoteAlways and QuoteStyle.Csv, so with the following source data:

Share this:

Like this:

When you’re working with M code in Power BI or Power Query/Get&Transform you may want to include special characters such as carriage returns, line feeds or tabs inside a text value. To do this you’ll need to use character escape sequences inside your text. Information on this topic is buried in the Power Query Formula Language Specification document that can be downloaded here, but since that’s not an easy read I thought I’d write a quick post repeating the information in there with a few examples.

The escape sequence for a tab character is #(tab), the escape sequence for a carriage return is #(cr) and the escape sequence for a line feed is #(lf). You can combine multiple special characters inside the same escape sequence too, so #(cr,lf) is equivalent to #(cr)#(lf). So, for example, following M code:

"A tab:#(tab), a carriage return and a line feed #(cr,lf) and stop"

…used as a query and entered as follows in the Advanced Editor:

…returns this output:

[Interestingly, in the formula bar, the value #(cr,lf) is displayed as #(cr)#(lf) although that’s not what I wrote in the Advanced Editor]

To use the two character sequence #( in your text you’ll need to escape those values; if you don’t, you’ll get an error. For example the following M code:

"This causes an error #("

Gives an “Invalid literal” error:

Whereas this works:

"This does not error #(#)("

Finally, this can be used to return any Unicode character in your text by putting either the four-digit or eight-digit hex code of the Unicode character in between the brackets. For example this expression returns the Unicode character with the four-digit hex code 2605, which is a star:

Like this:

Writing M in the Advanced Editor in Excel or Power BI can be a frustrating experience unless you’re the kind of masochist who loves writing code in Notepad. There are some options for writing M code outside Excel and Power BI, for example Lars Schreiber’s M extension for Notepad++ (see here for details) or the M extension for Visual Studio Code (available from the Visual Studio Marketplace here; more details on Brett Powell’s blog here), but the trouble with them is that you have to copy the code back into Excel or Power BI to run it. What many people don’t realise, however, is that it is possible to write M code and have IntelliSense, formatting, keyword highlighting and also the ability to execute your own M queries, using the Power Query SDK in Visual Studio.

The Power Query SDK (which you can download here) supports Visual Studio 2015 and 2017 and is intended for people who are writing custom Data Connectors for Power BI. To let you test your Data Connector you can create a .pq file containing M code, and this in fact allows you to run any M query you want whether you’re building a Data Connector or not.

Here’s how. First, install the Power Query SDK and then open Visual Studio and create a new project. Find the Power Query template, select the PQ file option and give your file a name:

Then, in the .pq file that is created, you can enter an M query and then either press the Start button on the toolbar or hit F5 to run the query. The output of the query is shown on the Output tab in the M Query Output window:

Right-clicking on your project in the Solution Explorer pane and selecting Properties brings up a Properties dialog with various properties that control how your queries behave:

Many of the properties are self-explanatory, at least for anyone used to writing M in Power BI or Excel. FastCombine turns off data privacy checks. Allow Native Query lets you use M queries that contain ‘native’ queries (for example your own SQL queries if you’re using a SQL Server data source), as Cédric Charlier shows here. A few of them, such as Legacy Redirects, I have no idea about yet (I should really ask someone…). Turning on Show Engine Traces displays engine trace information in the Log tab of the M Query Output pane; turning on Show User Traces displays trace information generated by the use of the Diagnostics.Trace() function in your code in the Log tab. You can save the contents of the Log tab to a text file.

Error messages are displayed on the Errors tab of the M Query Output pane:

When you have a query that connects to an external data source, the first time you try to run your query you will be prompted to set the credentials used to connect to that data source (as you would in Power BI Desktop), and the data privacy level for the data source, on the Errors pane:

The query won’t actually run this first time though; you’ll need to hit Start/F5 again to see the results. If you close the project and then reopen it you will need to enter credentials again; alternatively, on the Credentials tab you can save the credentials used for a data source to a .crd file which can then be reloaded when you reopen your project. You can also edit and delete credentials on the Credentials tab.

If I’m honest it’s all very basic but it does the job. The main thing that I miss from writing M code in Power BI is the Query Editor UI – when I write M code there I only write about 50% of it manually, the rest I generate by clicking buttons in the UI because it’s faster. Give me the Query Editor (or the ribbonless version of it that comes with SSDT, because Visual Studio doesn’t support ribbons apparently) and I’ll be happy. Even better, give me the improved code editing functionality in the Advanced Editor in Power BI Desktop and Excel that we’ve been promised!

It may not be immediately obvious, but you cannot set your own connection string properties when connecting to SQL Server using the built-in SQL Server connector from either Power BI or a modern data source in Azure SSAS/SSAS Tabular 2017:

All you can do is configure the options that are available in the UI, which in the current version of SSDT looks like this:

It turns out that the restriction on using your own connection string properties in the built-in SQL Server connector is a deliberate design decision on the part of the Power Query team because, behind the scenes, they use different providers in different circumstances to optimise performance, and because allowing arbitrary connection string properties might make maintaining backwards compatibility difficult in the future.

While your average Power BI user is unlikely to even notice this, for SSAS Tabular developers it could be a big problem: complete control over the connection string is often necessary in enterprise BI scenarios. What are the alternatives then? Well you can use the OLE DB and ODBC connectors instead:

Both of these connectors do allow you to set your own connection string properties. For example here’s the UI for a new ODBC connection in SSDT:

However, apart from possible performance differences between the two (which you should test yourself – Henk van der Valk wrote a good post on this for SSAS MD and most of what he said is relevant for Tabular) there’s one less-than-obvious difference between these two options: the OLE DB connector does not appear to support query folding right now whereas the ODBC connector does. Of course this isn’t an issue if you’re writing your own SQL queries to import data, but if you do want to use M functions for partitioning (as I show here) you’re likely to get very poor performance with the OLE DB connector.

Post navigation

Follow Blog via Email

Social

Need some help?

As well as being a blogger, I'm an independent consultant specialising in Analysis Services, MDX, DAX, Power BI, Power Query and Power Pivot. I work with customers from all round the world solving design problems, performance tuning queries and delivering training courses, and I am happy to work on short-term engagements. For more details see http://www.crossjoin.co.uk