Using Power Query to combine files from Hadoop into a single table (part 2)

As in my previous post I’d like to import a bunch of csv-files from a Hadoop cluster into a single Power Pivot table using Power Query. In the last post I used the basic built-in functionality for this task. In this post, I will create a custom function to perform the load. Using a custom function allows you to have more control over the import process. For example, you could easily import different file formats using this way.

The steps below involve a little bit of work on the code generated by Power Query. The Power Query coding is done using the Power Query language (formerly known as ‘M’). So this example is more targeting the power user who is getting deeper into the functions of Power Query.

Also, in order to be able to work with the Power Query functions directly, you need to turn on the “Enable Advanced Query Editing” option in the options dialog of Power Query (click “Options” on the Power Query ribbon).

The basic idea follows these steps:

Use Power Query to load a single file from hdfs

Wrap the code from step 1 into a function

Get a list of all files in the hdfs directory and invoke the function for each of the files

The basic idea for joining the tables and the inspiration comes from a great presentation by Christoph Seck which I had the chance to see recently.

1. Use Power Query to load a single file from hdfs

We start exactly as in the post before. First we need to connect to the Hadoop file system and get a list of all the csv files:

In the last post we took the Binary column to retrieve the file contents directly. However, for this post I will use a different approach now, that allows me to control the loading of each file in detail.

For my example, let’s start with a single file load. We could use any file here, so I took the first one (20130916.csv). Loading the file contents can be done by clicking the link “Binary” in the desired row. This results in Power Query fetching only the contents of this single file:

Note that Power Pivot did automatically detect the column types. This can be changed if needed, but for my purpose, I leave it that way. The only thing I do is to rename the columns by right clicking the column header and choose rename. The result is shown in the next step.

2 Wrap the code from step 1 into a function

In order to read several files, we need to write a function that basically takes the filepath and the filename and returns a table with the file content. The code generated in step 1 is a good starting point.

What we will do here is to view the generated M code by clicking on the code icon (marked red in the screenshot):

I’m not getting into too much details about the Power Query language here. Basically, you see each transformation step as a named assignment in the section that is preceded by “let”. The output (final result) is defined by the in clause at the end. I marked those parts of the query in red that need to be replaced by variables. Don’t be confused by the generated variable names (like #"http://…"). You may replace them directly in the code editor by a comprehensive name.

In order to turn this code into a function we wrap it in a let/in construct as shown below:

Click on Ok to submit the formula. The outer let/in pair is used to send the variables to the inner let/in pair which does the work. By clicking ok, the code is parsed and the result should look like this:

At this point, it’s a good idea the test the new function by clicking “Invoke”. For my example I’m entering exactly the values I had in the formula before replacing them by a variable:

The result shows exactly the values from above but this time being pulled by a function:

Next thing is the rename the query (right click on “Query1”) to “Hdfs_FetchSingleFile” (this is the name we will use to call the function later on) and to delete the invoke-step of my test (otherwise a table is returned, not a function). Do not forget this step because otherwise the function will not work:

3. Get a list of all files in the hdfs directory and invoke the function for each of the files

In order to get a list of all files in my csv-folder, I follow exactly the procedure from step 1. We start with the list of all the csv files in the given directory. I’m only interested in the columns Name and Folder Name, so I delete all the other columns here (right click on column header->remove, multiple columns may be selected). This is how the result looks like:

Please note that we still could apply filters on this file list (for example by create date, file name, file type or other attributes of the file).

I now right click on one of the column headers and choose Insert Column/Custom.

This brings up a query editor for my new column. I’m going to use the function created in step 2 here. Since I named the function “Hdfs_FetchSingleFile” it may be invoked as:

Hdfs_FetchSingleFile([Folder Path],[Name])

The columns may be inserted by double clicking on the column name in the “Available Columns” box. You may wonder how you can show the results of a full file (table) for each row. Here’s how Power Query handles it:

As you can see, we created a new column (“Custom”) which just shows “Table” as entries. In order to retrieve the table data, click on the arrows on the columns header:

This allows us to select the columns we like to retrieve from each table. Note that the correct column headers are shown in the selection.

I use all columns here and now, by clicking Ok, all files/tables are expanded properly:

How cool is that!? Being able to work with table structures within a table and expanding them to details makes processing very easy for us here.

The remaining steps are to delete the first two columns (we’re not interested in them), rename the columns accordingly and set the columns data type correctly (date for Date, numeric for product and sales):

Power Query generates the script for us while we do all those changes. We may now rename the script and return the result to Excel by clicking Ok.

And here is the final result in Excel.

Please note that you can refresh the table at any time by simply clicking Refresh in the context menu. This causes the full script from above to be executed again, so if new files are stored on the hdfs file system, the query brings them all back here.

Last step shown here is to load the data in my Power Pivot data model by clicking the “Load to data model” button/link in the query settings pane:

This takes the data in the Power Pivot data model allowing us to create calculations or to blend this with other data sources:

Querying big data sources like Hadoop might not be a common use case. As I mentioned before in general it would be better to combine/filter/transform the files on the Hadoop cluster before loading to a Power Pivot model. But there may still some cases where the approach above might be useful. And of course we could filter down the file list (for example to read only files starting with a given date) or the file content (with the reader function from above).

So, Power Query and the embedded formula language offer very powerful tools for retrieving and transforming (reshaping) data. For most cases none or only little coding is required and the step by step engine allows to get back to a specific point in the transformation at any time. The use cases for Power Query are literally endless and this tool is definitely a big step ahead for bringing Business Intelligence Power to the end users.