Menu

Unpivot Excel data with Python

This post is is based on another demo from my SQLSaturday session on Python integration in Power BI. Its a handy real world scenario that could easily be re-applied with minor changes to the script. The primary pattern is unpivot functionality which is achieved through the melt method in pandas. A few additional useful dataframe techniques are illustrated, including pivot_table.

I am using the Tourist Accommodation data from the Statistics South Africa website. The file (tourism.xlsx) is also available on the SQLSaturday website. Once you open the file observe the following:

the first 8 columns (starting with the letter H) are the header columns

the remaining columns are a series of months i.e. MO012007 is month 1 (Jan), 2007

Here we have the best case scenario because all month columns start with “MO”. If no pattern exists and the columns are sequenced, you could split on position i.e. headers = columns[:8], months = columns[8:] will produce the same result. Worst scenario is when the order of the columns are random. In this case you would manually specify the list elements.

Step 3: Melt (unpivot) the month columns

Unpivoting will transpose all month columns to rows. For each month column a new row is created using the same header columns.

All month columns will be dropped and a new column ‘Date’ will contain the original month column header. The values will bed added to a new column called ‘Val’.

Now reconstruct and convert the Date column to an actual date. We get the year and month parts from the column and just assume the day is always 1. This is explicitly converted to a date using the to_datetime method.

We noted at the start that the values were a mixture of numeric scales which is indicated in column H17. To determine the possible value scales we use df2.H17.unique() which returns the following list of unique values: [‘Thousand’, ‘Percentage’, ‘Rands’, ‘R million’]. We create a function to apply the scale to the Val column.

Note how rows are filtered. We set the dataframe equal to the existing dataframe where the Category column is not equal to ‘Total Industry’

At this point the script could be used as is, but in Power BI we would need to understand how the measure(s) would be created. For example, we could have a generic measure [Value] = SUM(‘Tourism'[FullValue]). We would then use the Metric column as a filter/slicer. Most likely you would rather want a measure such as [Units Available] = CALCULATE ( SUM ( ‘Tourism'[FullValue] ), ‘Tourism'[Metric]=”Stay units available” ). In this case it is cumbersome to create the measures, so we rather reshape the table to make things easier.

Step 6: Pivot the Metrics (optional)

If the metrics provided are fixed (consistent), then pivoting the metrics is a good idea. This creates a column for each metric. This is possible with the Pandas pivot_table method.

The side effect of the pivot_table method is that the dataframe is re-indexed based on the index parameter passed to the method. This is not really a problem, but for Power BI it is. You may have noted from earlier posts that the Python Script in Power BI does not provide access to the dataframe index. So we would in effect lose the first three columns from our result above.

The fix is as simple as re-indexing the dataframe. We can also drop the ‘Total income’ column as we can let Power BI handle the aggregations.