
This is part 4 and the final post of this JSON series. We are going to play with Smartsheet data because it adds to our examples of varying JSON structure. Note that Power BI already has a connector for Smartsheet. There is also a Python SDK for Smartsheet, but we are going to use the Python requests library so that we can start from scratch and learn how to navigate the JSON structure.

We will follow the same approach as the previous post, i.e. focus on understanding the structure, draw from our existing patterns, and add one or two new techniques to our arsenal.

Below is the screenshot of the actual Smartsheet we will be working with.

Accessing the Smartsheet API

API access is available during the free trial period. If you want to play, have a look at how to generate the token so that you can authenticate your API access. If you have been following my posts, you would know by now that I store my secure information (such as API tokens) in the user environment variables on my development computer.

To access the API we need to create the endpoint from the Smartsheet API URL and the sheet ID for the sheet you want to access. We also need the token which we pass to the request header. The code to do this is as follows:
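As a sketch, assuming the token is stored in an environment variable named SMARTSHEET_TOKEN (the variable name and sheet ID are placeholders, so adjust them to your setup):

```python
import os
import json
import requests

# Assumptions: the token lives in an environment variable named
# SMARTSHEET_TOKEN, and the sheet ID below is a placeholder
token = os.environ.get('SMARTSHEET_TOKEN', '')
sheet_id = 'YOUR_SHEET_ID'

url = 'https://api.smartsheet.com/2.0/sheets/' + sheet_id
headers = {'Authorization': 'Bearer ' + token}

if token:  # only call the API when a token is available
    response = requests.get(url, headers=headers)
    data = json.loads(response.text)  # convert the response text to a dictionary
```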

In the above script, the last line converts the request text into a Python dictionary structure.

Understanding the data structures

This is the critical step. Understand the problem … and in this case that means reading the Smartsheet documentation and comparing it to the physical structure. This time round we are going to use the pretty print (pprint) library for the latter.

import pprint
pprint.pprint(data)

The documentation tells us that there is a sheet which is the main object. The sheet contains a columns (metadata) object, a rows object and a cells object. It also tells us that a collection of cells forms a row. When we explore the physical structure we see that this is indeed the case.

So the column metadata and rows are referenced directly from the sheet object, and the cells are referenced from the row object. We could thus access the columns and rows objects directly.

pprint.pprint(data['columns'])
pprint.pprint(data['rows'])

Building the dataframe

The process we are going to follow is to create an empty dataframe from the column metadata (using the ‘title’ key: value pair). We are then going to iterate over the rows and cells and append the values to the new dataframe.

We are firstly going to create a list of columns. This has a dual purpose. We will use it to create the empty dataframe, then we will re-use it to create row dictionaries that we will append to the dataframe.

cols = []
for col in data['columns']:
    cols.append(col['title'])

Creating the empty dataframe from this is straightforward.

import pandas
df = pandas.DataFrame(columns=cols)

To append the rows, we iterate over the rows with a loop and we use a nested loop to iterate over the cells. The cells in each row are added to a list, but first we trap empty values in the VAT field.

for row in data['rows']:
    values = []  # re-initialise the values list for each row
    for cell in row['cells']:
        if cell.get('value'):  # handle the empty values
            values.append(cell['value'])
        else:
            values.append('')
    print(values)

Now we need a new trick to add these lists to the dataframe. We are going to use a function called zip. Zip must be used in conjunction with iterables such as lists or dictionaries.

We are going to use this to combine our columns and rows into a dictionary that we append to the dataframe. The complete script is below.
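A sketch of that complete script, using a small inline sample in place of the API response (note that pandas 2.0 removed the DataFrame.append method, so this version collects the row dictionaries in a list first and builds the dataframe at the end):

```python
import pandas

# Inline sample mirroring the Smartsheet structure (illustrative values)
data = {
    'columns': [{'title': 'Item'}, {'title': 'Price'}, {'title': 'VAT'}],
    'rows': [
        {'cells': [{'value': 'Widget'}, {'value': 10.0}, {'value': 1.5}]},
        {'cells': [{'value': 'Gadget'}, {'value': 20.0}, {}]},  # empty VAT cell
    ],
}

cols = [col['title'] for col in data['columns']]

row_dicts = []
for row in data['rows']:
    # trap empty values, then zip the column titles with the row values
    values = [cell.get('value', '') for cell in row['cells']]
    row_dicts.append(dict(zip(cols, values)))

df = pandas.DataFrame(row_dicts, columns=cols)
```

Here zip pairs each column title with the corresponding cell value, and dict() turns those pairs into a {column: value} row dictionary.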

While JSON provides a great data interchange standard, implementation of the file format will vary. This means we will always need to make sure we understand the structure and select the required techniques to create an analysis ready format.

This post concludes the JSON series, but I will revisit the topic when new scenarios pop up.

This is part 3 of the introduction to handling JSON with Python. This post covers a real world example that accesses data from a public API on New York City Open Data. For fun we are going to look at the most popular baby names by sex and ethnicity.

The key idea is that JSON structures vary, and the first step in dealing with JSON is understanding the structure. Once we understand the structure, we select the appropriate techniques to handle the JSON data. Most of these techniques were covered in part 1 and part 2 of this series.

Browsing the data

You can browse the data from the API URL endpoint. The displayed format is not really going to help us understand the data. Copy the text over to an online JSON formatter such as jsonformatter. Here we see the pretty JSON format on the left:

Click on the JSON Viewer button to see a tree view of the JSON data on the right:

The very first thing you notice right at the top of the tree view is that there are 2 objects at the top level and the JSON consists of nested dictionaries. By collapsing the tree structure to only the top level, we see the following:

Here we see the meta dictionary has 1 sub-dictionary and the data dictionary is a list of 11345 items. Further exploration shows that the data dictionary items are also lists, i.e. the data dictionary is a list of 11345 lists (rows). Each of these lists contains 14 items, and each item represents a value (column).

The data we have only shows a list index and a list value. We don’t know what those values represent … yet.

Loading the data

It is reasonably easy to get to the data by referencing the data dictionary directly as shown in the script below:
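A sketch of that script, with a tiny inline sample standing in for the API response:

```python
import pandas

# Tiny inline sample mirroring the meta/data structure of the response
data = {
    'meta': {'view': {'columns': []}},
    'data': [
        [1, 2011, 'FEMALE', 'HISPANIC', 'SOPHIA', 119, 1],
        [2, 2011, 'FEMALE', 'HISPANIC', 'ISABELLA', 110, 2],
    ],
}

# reference the data dictionary directly to build the dataframe
df = pandas.DataFrame(data['data'])
```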

The results show that the only problem is that the column names are just the index values.

Loading the column metadata

We revisit our JSON tree view and find that the meta dictionary contains a view sub-dictionary, which in turn contains a list of dictionaries called columns. Each item in the columns list is a dictionary with 8 column properties.

To access the column names is not that tricky. We will use the same techniques from the previous posts. Perhaps the only “trick” is to reference the nested columns dictionary. The rest of the code accesses the key: value pairs nested inside the list, and adds the column name value to our columns list. The final step replaces the column headers with the newly acquired column headers.
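A sketch of those steps, again using a small inline sample in place of the real response (the Socrata-style 'name' key is an assumption based on the structure described above):

```python
import pandas

# Inline sample mirroring the nested meta -> view -> columns structure
data = {
    'meta': {'view': {'columns': [
        {'name': 'Year'}, {'name': 'Name'}, {'name': 'Count'},
    ]}},
    'data': [[2011, 'SOPHIA', 119], [2011, 'ISABELLA', 110]],
}

df = pandas.DataFrame(data['data'])

# reference the nested columns dictionary and collect each column name
cols = []
for col in data['meta']['view']['columns']:
    cols.append(col['name'])

# replace the numeric column headers with the newly acquired names
df.columns = cols
```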

You can clean up by removing the columns you don’t need (see previous blog posts using the drop method for dataframes).

When the problem is understood …

… the solution is (hopefully) obvious. While this post barely introduced anything new from a Python perspective (except referencing sub-dictionaries), the focus was on understanding the problem and cementing techniques for handling JSON.

In the next blog post I want to conclude this JSON series with another real world example.

The previous post might have left you feeling that it's a bit too complicated and a bit too abstract. Don't worry, the concepts will be clearer after a bit of a visual recap, then exploring how we would navigate the JSON/dictionary structures, and finally how we reconstruct a “flattened” dictionary from the original dictionary.

Visual recap

We discussed that the JSON format is the same as a Python dictionary, i.e. it consists of key: value pairs separated by a comma and enclosed inside braces. The keys are just names and the values may be of type string, number, boolean, list or dictionary (object).

The image below is the JSON file we created in the previous post. It is a list of 3 dictionaries. Each dictionary has the same keys, but the values may differ. We access the values by referencing the keys.

Navigating the JSON file

We can iterate through the JSON file using the for loop structure. Let's start by loading the JSON file and then looping through the outer list of dictionary items. Our image above shows that there are 3 dictionary items. We will observe the same from the code below.

import json
with open('d:\\data\\json\\data.json') as json_file:
    mydict = json.load(json_file)

# looping through each object in the list
for obj in mydict:
    print(obj)

In order to traverse the key: value pairs inside each dictionary, the for loop structure is slightly different. We will nest the dictionary for loop inside the list for loop so that we can access the key: value pairs for each of the 3 dictionaries.

for obj in mydict:
    for key, value in obj.items():
        print(mydict.index(obj), key, value)

For good measure, we have added in the index of the list item so that we know which dictionary the key: value pairs are associated with.

We know from the file structure that we still have list and dictionary values to deal with. We will choose our for loop structure according to the value type (list or dictionary). In order to get the values out of the “versions” list, we will use the list for loop structure nested inside our existing nested loop structure.

for obj in mydict:
    for key, value in obj.items():
        if type(value) == list:
            for item in value:
                print(mydict.index(obj), obj['product'], key, item)

Note how we test if the value type is a list. We are excluding all other key values.

To keep the data meaningful we have retained the dictionary index and product value that are associated with the list items.

We can use the same technique with the nested dictionary, by using the dictionary for loop structure.

for obj in mydict:
    for key, value in obj.items():
        if type(value) == dict:
            for key2, value2 in value.items():
                print(mydict.index(obj), obj['product'], key2, value2)

At this stage we have everything we need in terms of navigating the dictionary. We know the difference in for loop structure for lists vs dictionaries. We also know how to test the data type of the dictionary value and we know how to reference the keys, values and indexes.

Reconstructing the dictionary

To create a flattened dictionary (no list or dictionary values), we will use a combination of the navigation techniques just shown. The only additional techniques required are a bit of list and dictionary handling, i.e. how to create and update a dictionary, and how to append items to a list.

We create empty lists and dictionaries in the same way, but we append to these data structures in a different way.
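As a quick sketch of those operations (the key and value are illustrative):

```python
my_list = []  # create an empty list
my_dict = {}  # create an empty dictionary

my_dict.update({'product': 'widget'})  # add a key: value pair to the dictionary
my_list.append(my_dict)                # append an item to the list
```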

We will now use the above in conjunction with our nested looping structures to create a flattened dictionary.

The principle of the technique is to isolate every key: value pair, including unpacking list or dictionary values. Each unpacked key: value pair is then added to a new dictionary that no longer has list or dictionary values. Finally we append each dictionary to create a new list of dictionaries, and the resulting dictionary list is added to a dataframe.

import json, pandas

with open('d:\\data\\json\\data.json') as json_file:
    mydict = json.load(json_file)

dict_lst = []  # initialise an empty list to hold the final result

# outer loop through each dictionary
for obj in mydict:
    flat_dict = {}  # initialise a target dictionary
    # loop through each key: value pair in the dictionary
    for key, value in obj.items():
        # test the data type of each value and handle
        # list and dictionary types in their respective ways
        if type(value) == dict:
            for key2, value2 in value.items():
                # use both the original key and value as values in the new dict
                flat_dict.update({'cheapest': key2, 'price': value2})
        elif type(value) == list:
            for item in value:
                # create keys version0, version1, version2, etc
                flat_dict.update({'version' + str(value.index(item)): item})
        else:
            flat_dict.update({key: value})
    # append the flattened dictionary to our new dictionary list
    dict_lst.append(flat_dict)

# add the dictionary list to a dataframe
df = pandas.DataFrame.from_dict(dict_lst)

This technique has greater flexibility in terms of allowing other ways of manipulating the dictionary. Ironically it also appears easier than the technique shown in the previous post where we used dataframes. Best of all, this technique forces a level of understanding of dictionary data structures.

In part 3 of the JSON series I want to explore a real world example that includes JSON with a metadata section and retrieving data from an API.

In modern application environments we will encounter JSON (JavaScript Object Notation) as the preferred file format for data-interchange. Many APIs and files are delivered in JSON format. Working with JSON format could potentially be challenging, especially if you are new to it. It is easy to get thrown by values that are variable length lists or values that are dictionaries (nested objects). So the three key things I want to cover in this post are:

The basics of the JSON and how it compares to a Python dictionary

Dealing with values that are lists (including variable length lists)

Dealing with nested objects (including where keys differ)

There will potentially be many opportunities to improve the code. Please let me know or comment so that I can learn in the process.

The JSON format is best understood by comparing it to a Python dictionary. Be aware though that the JSON format (rules for data representation in text format) only appears similar to the Python dictionary (a data type that supports add, delete, sort, etc).

Python dictionaries

Dictionaries consist of key: value pairs separated by a comma and enclosed inside braces. The value may be of type string, number, boolean, list or object (dictionary).

We can reference a specific array item by its index, then the dictionary key as follows:
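As a sketch (the product values are illustrative):

```python
# a list (array) of dictionaries
dict_array = [
    {'product': 'widget', 'price': 9.99},
    {'product': 'gadget', 'price': 19.99},
]

# list index first, then the dictionary key
print(dict_array[1]['product'])  # prints gadget
```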

Here the list index precedes the dictionary key. Remember that the list index offset is 0.

Creating a JSON file from a dictionary

First we convert the dictionary to a JSON string using the built-in json package. The json.dumps function converts a dictionary into JSON encoded format. Add the following to the code above:

import json
print(dict_array)
print(json.dumps(dict_array))

Compare the output of the dictionary and the JSON encoded string. The appearance is almost identical. The dictionary displays the key/value pairs in single quotes, while the JSON string displays the key/value pairs in double quotes and the entire string in single quotes.

Now we can write the JSON string to a file. Here we use json.dump (without the s) which allows us to write the file rather than a unicode string.
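A minimal sketch, writing to a data.json file in the working directory (the file name and sample data are illustrative):

```python
import json

dict_array = [
    {'product': 'widget', 'price': 9.99},
    {'product': 'gadget', 'price': 19.99},
]

# json.dump (without the s) writes the JSON straight to the file object
with open('data.json', 'w') as json_file:
    json.dump(dict_array, json_file)
```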

In our example, our keys don't match across dictionaries, so the result is not great.

from pandas.io.json import json_normalize
json_normalize(dict_lst)

Because the keys differ they are split across different columns. Also the list column is not handled. Let's use a different technique to handle the list column. Pandas apply allows us to apply operations across dataframes (rows/columns). Here we use Pandas Series to create a column for each list item.
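A sketch of the apply technique, using an illustrative list of dictionaries with a list-valued 'versions' key:

```python
import pandas

# Illustrative data: each dictionary carries a variable-length list
dict_lst = [
    {'product': 'widget', 'versions': ['v1', 'v2']},
    {'product': 'gadget', 'versions': ['v1']},
]
df = pandas.DataFrame(dict_lst)

# apply(pandas.Series) expands each list into its own set of columns;
# shorter lists are padded with NaN
df2 = df['versions'].apply(pandas.Series)
```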

The resulting dataframe can be concatenated with the existing one as follows:

df3 = pandas.concat([df[:], df2[:]], axis=1)

So we still have to deal with the dictionary column. We will use dataframe methods (melt, merge, drop) that were covered in previous posts, along with some new methods. I’ve taken a different approach here by providing all the steps with comments and then a visual that shows the output at each step.

So the result is not the simplest of code, however it is not that complex either. We solved the entire problem using Pandas. What I have not tested is how we would solve this without Python in Power BI (visual Query Editor or M). It would also be interesting to compare solution complexity. In my case, I already had these Python patterns, so it was a matter of script re-use.

In the next post I want to explore methods of traversing and reshaping the data without Pandas and dataframes. It will provide better understanding of the JSON format and expand on the possibilities with Python script.

This is another post based on a demo from my SQLSaturday session about Python integration in Power BI. The data file is also available from this link or download your own file from the Reserve Bank link below.

It is a pattern I use fairly often because flat files lend themselves to the artistic license of the data provider, i.e. they can do just about anything they want with the flat file.

Let's take a look at a real world example. The South African Reserve Bank provides daily exchange rate data (ZAR per USD) through an embedded Reporting Services environment. When you download the data as Comma Separated Value (CSV), this is what you get:

The data represents 3 days of data. We only require the section indicated as “useful data”. If you look at the file structure carefully, you will notice that it could still be treated as a CSV file. When you open the file with the Query Editor, you will see that Power BI already treats the file like a CSV (or an Excel file that does not have all cells populated).

I am going to illustrate two methods of dealing with this in Python. You may encounter scenarios where either one could be useful.

Method 1: Treat the file as CSV

This method will essentially mimic what the Power BI Query Editor does with the file. We will use the Pandas read_csv method, but we will use some parameters to handle our requirements. Without indicating that there are three columns, read_csv will throw an exception when it encounters a third field further down the file. To indicate three columns we simply specify the columns.
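A sketch of the idea, using an inline string stand-in for the downloaded file (the column names are placeholders):

```python
import io
import pandas

# Illustrative stand-in for the file: early lines have fewer fields than
# the data lines further down
sample = 'Header\nZAR per USD\n2018-09-03,14.70,note\n2018-09-04,14.65,note\n'

# passing names= tells read_csv to expect three columns, so the third
# field further down no longer throws an exception
df = pandas.read_csv(io.StringIO(sample), names=['Col1', 'Col2', 'Col3'])
```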

Method 2: Read the file line by line

In this example we will read the data line by line. Each line will be added to a list, i.e. we will end up with a list of lines. We will then use some list comprehension techniques to filter the data before adding it to a dataframe.

with open('D:\\data\\dirty\\ExchangeRateDetail.csv') as f:
    lines = [line for line in f]

The new line characters form part of the line as shown below in our list where each item is a line in the file.

There are also special characters at the start of the file, but we don't need to worry about this as we will not use this item in the list.

To strip off the “\n” we will use the rstrip method which removes characters from the right hand side of a string. By not specifying a character, whitespace characters are removed. This includes “\n”.

with open('D:\\data\\dirty\\ExchangeRateDetail.csv') as f:
    lines = [line.rstrip() for line in f]

With the newline removed, we can now start filtering the list of items (lines). We start by filtering out the empty items (blank lines).

lines = list(filter(None, lines))

Next we remove the first 3 items, but keep the rest of the list.

lines = lines[3:]

Our final filter condition is to remove all items that start with “The”.

lines = [t for t in lines if not t.startswith('The')]

Now our list is ready to convert to a dataframe.

import pandas
df = pandas.DataFrame(lines)

Note that we end up with a single column (0) dataframe. Our final steps to get the dataframe into a ready to use state are:

split the column on comma separator

drop the original column

Explicitly cast the data types

Let's put a complete final script together that includes the final steps.
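A sketch of those final steps, picking up from the filtered list of lines (the 'Date' and 'Rate' column names are assumptions):

```python
import pandas

# Illustrative stand-in for the filtered list of lines built above
lines = ['2018-09-03,14.70', '2018-09-04,14.65']

df = pandas.DataFrame(lines)

# split the single column (0) on the comma separator
df[['Date', 'Rate']] = df[0].str.split(',', expand=True)

# drop the original column (axis=1 indicates the column axis)
df = df.drop(0, axis=1)

# explicitly cast the data types
df['Date'] = pandas.to_datetime(df['Date'])
df['Rate'] = df['Rate'].astype(float)
```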

Hopefully that provides some ideas on how to deal with flat files in general. As mentioned at the start of this post, it is often challenging to deal with flat files as it is up to the data provider to produce a well structured file.

This post is based on another demo from my SQLSaturday session on Python integration in Power BI. It's a handy real world scenario that could easily be re-applied with minor changes to the script. The primary pattern is unpivot functionality, which is achieved through the melt method in pandas. A few additional useful dataframe techniques are illustrated, including pivot_table.

I am using the Tourist Accommodation data from the Statistics South Africa website. The file (tourism.xlsx) is also available on the SQLSaturday website. Once you open the file observe the following:

the first 8 columns (starting with the letter H) are the header columns

the remaining columns are a series of months i.e. MO012007 is month 1 (Jan), 2007

Here we have the best case scenario because all month columns start with “MO”. If no such pattern exists but the columns are sequenced, you could split on position, i.e. headers = columns[:8], months = columns[8:] will produce the same result. The worst scenario is when the order of the columns is random. In that case you would manually specify the list elements.

Step 3: Melt (unpivot) the month columns

Unpivoting will transpose all month columns to rows. For each month column a new row is created using the same header columns.

All month columns will be dropped and a new column ‘Date’ will contain the original month column header. The values will be added to a new column called ‘Val’.
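The melt step can be sketched as follows, with a tiny illustrative frame in place of the tourism data:

```python
import pandas

# Illustrative frame: two header columns and two month columns
df = pandas.DataFrame({
    'H01': ['A', 'B'],
    'H02': ['x', 'y'],
    'MO012007': [1, 2],
    'MO022007': [3, 4],
})

headers = [c for c in df.columns if c.startswith('H')]
months = [c for c in df.columns if c.startswith('MO')]

# melt keeps the header columns and transposes each month column to a row
df2 = pandas.melt(df, id_vars=headers, value_vars=months,
                  var_name='Date', value_name='Val')
```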

Now reconstruct and convert the Date column to an actual date. We get the year and month parts from the column and just assume the day is always 1. This is explicitly converted to a date using the to_datetime method.
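A sketch of that conversion, using a small illustrative melted frame:

```python
import pandas

# Illustrative melted frame with the original month headers in 'Date'
df2 = pandas.DataFrame({'Date': ['MO012007', 'MO022007'], 'Val': [1, 2]})

# the year is the last four characters, the month the two after 'MO';
# assume the day is always 1, then convert explicitly with to_datetime
df2['Date'] = pandas.to_datetime(
    df2['Date'].str[4:8] + '-' + df2['Date'].str[2:4] + '-01'
)
```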

We noted at the start that the values were a mixture of numeric scales which is indicated in column H17. To determine the possible value scales we use df2.H17.unique() which returns the following list of unique values: [‘Thousand’, ‘Percentage’, ‘Rands’, ‘R million’]. We create a function to apply the scale to the Val column.
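A sketch of such a function (the function name and the FullValue column name are assumptions; the scale factors follow the unique values listed above):

```python
import pandas

# Illustrative frame with the scale indicator in column H17
df2 = pandas.DataFrame({
    'H17': ['Thousand', 'R million', 'Rands'],
    'Val': [2.0, 3.0, 4.0],
})

def apply_scale(row):
    # multiply the value by the scale indicated in column H17
    if row['H17'] == 'Thousand':
        return row['Val'] * 1000
    elif row['H17'] == 'R million':
        return row['Val'] * 1000000
    else:  # 'Rands' and 'Percentage' are left as-is
        return row['Val']

df2['FullValue'] = df2.apply(apply_scale, axis=1)
```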

Note how rows are filtered: we set the dataframe equal to the existing dataframe where the Category column is not equal to ‘Total Industry’.
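As a sketch of that filter (the category values are illustrative):

```python
import pandas

df2 = pandas.DataFrame({
    'Category': ['Hotels', 'Total Industry'],
    'Val': [1, 2],
})

# keep only the rows where Category is not equal to 'Total Industry'
df2 = df2[df2['Category'] != 'Total Industry']
```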

At this point the script could be used as is, but in Power BI we would need to understand how the measure(s) would be created. For example, we could have a generic measure [Value] = SUM(‘Tourism'[FullValue]). We would then use the Metric column as a filter/slicer. Most likely you would rather want a measure such as [Units Available] = CALCULATE ( SUM ( ‘Tourism'[FullValue] ), ‘Tourism'[Metric]=”Stay units available” ). In this case it is cumbersome to create the measures, so we rather reshape the table to make things easier.

Step 6: Pivot the Metrics (optional)

If the metrics provided are fixed (consistent), then pivoting the metrics is a good idea. This creates a column for each metric. This is possible with the Pandas pivot_table method.
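A sketch of the pivot, with illustrative metric names standing in for the real ones:

```python
import pandas

# Illustrative melted frame (Metric values are assumptions)
df2 = pandas.DataFrame({
    'Date': ['2007-01-01', '2007-01-01', '2007-02-01', '2007-02-01'],
    'Metric': ['Income', 'Units', 'Income', 'Units'],
    'FullValue': [100, 5, 200, 6],
})

# pivot_table creates a column for each metric; note that the result is
# re-indexed on the index parameter
df3 = pandas.pivot_table(df2, values='FullValue',
                         index=['Date'], columns='Metric')
```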

The side effect of the pivot_table method is that the dataframe is re-indexed based on the index parameter passed to the method. This is not really a problem, but for Power BI it is. You may have noted from earlier posts that the Python Script in Power BI does not provide access to the dataframe index. So we would in effect lose the first three columns from our result above.

The fix is as simple as re-indexing the dataframe. We can also drop the ‘Total income’ column as we can let Power BI handle the aggregations.
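A sketch of that fix, using an illustrative pivoted frame:

```python
import pandas

# Illustrative pivoted frame with 'Date' pushed into the index
df3 = pandas.DataFrame(
    {'Total income': [100, 200], 'Units': [5, 6]},
    index=pandas.Index(['2007-01-01', '2007-02-01'], name='Date'),
)

# reset_index moves the index back into an ordinary column (so the Power BI
# Python script can see it), then drop 'Total income' (axis=1 = columns)
df3 = df3.reset_index()
df3 = df3.drop('Total income', axis=1)
```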

I spoke about Python integration in Power BI at SQLSaturday #793 in Cape Town this past weekend. You can download the deck (which includes the notes) and the Python scripts (which include sample data) from the SQLSaturday site. One of the demos included the creation of a date table to support time intelligence DAX functions in Power BI. I thought the example would be interesting for a number of reasons:

Time intelligence is one of my favorite features of Power BI (Power Pivot & SSAS)

A few examples of M script to create date tables already exist (it would be interesting to compare)

Before we start, let's quickly recap the basic criteria for a date table in Power BI:

A separate date table is required (and joined to the table to calculate against)

The date table must be at a day level (1 row equals 1 day)

The days must be contiguous (no gaps even if your data is not daily)

The days must be available for full years (1 Jan to 31 Dec)

A column of data type ‘Date’ must exist on the date table

Step 1: Create a list of sequential dates

For this we use the Pandas date_range method.
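A sketch, with an illustrative range of full years:

```python
import pandas

# a contiguous daily range covering full years (the years are illustrative)
dates = pandas.date_range('2017-01-01', '2019-12-31', freq='D')
print(dates)
```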

The output shows the first 10 and last 10 elements of the list, the data type and the number of elements (length).

Step 2: Create a dataframe from the list

A list can easily be converted into a dataframe. The result is a dataframe with one column.

The column is assigned a default name of 0. The second line renames the column.
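As a sketch of both steps:

```python
import pandas

dates = pandas.date_range('2019-01-01', '2019-12-31', freq='D')

df = pandas.DataFrame(dates)          # one column, assigned the default name 0
df = df.rename(columns={0: 'Date'})   # the second line renames the column
```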

Step 3: Derive additional columns

For this step we make use of the string format time (strftime) method to derive required columns from the Date column. Where a numeric data type column is required, we explicitly convert the column using the dataframe apply function.
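A sketch of a few derived columns (the column names are assumptions):

```python
import pandas

df = pandas.DataFrame({'Date': pandas.date_range('2019-01-01', '2019-01-03')})

# derive text columns with strftime; where a numeric data type is required,
# explicitly convert with apply
df['Year'] = df['Date'].dt.strftime('%Y').apply(int)
df['MonthNumber'] = df['Date'].dt.strftime('%m').apply(int)
df['MonthName'] = df['Date'].dt.strftime('%B')
df['DayName'] = df['Date'].dt.strftime('%A')
```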

At this step you already have a basic date table. In many cases financial calendars do not end in December. In this example the financial year ends in March.

Step 4: Create a financial calendar lookup dataframe

I took a different approach with the financial calendar columns because I could easily do this inside the script with Python. Instead of using conditional evaluation (if, case, switch, etc), I decided to create a lookup “table” that I could join to. The lookup table would either provide the column I need, or help me create the column I need.
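A sketch of such a lookup dataframe for a year ending in March (the FinMonthNumber column name is an assumption, and FinYearDif assumes the financial year is labelled by its ending calendar year):

```python
import pandas

# each line adds a column as a list; the financial periods are remapped to
# start in April (calendar month 4)
fy_lookup = pandas.DataFrame({
    'MonthNumber':    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'FinMonthNumber': [10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    'FinYearDif':     [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
})
```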

In the creation of this fy lookup dataframe, each column is provided as a list input i.e. each row in the code adds a column to the dataframe. All the lookup dataframe does is remap the financial periods to start in April (calendar month 4). The FinYearDif column will help us calculate the financial year in the next step.

Step 5: Add the financial year periods

We do this by doing a left outer join to the lookup dataframe. This is achieved through the merge method. We specify MonthNumber as the join column and outer as the join type. Finally we derive the financial year and then drop the helper column (axis=1 indicates the column axis).
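A sketch of the join and the derived financial year, with minimal illustrative inputs:

```python
import pandas

# minimal stand-ins for the date table and the fy lookup
df = pandas.DataFrame({'Year': [2019, 2019], 'MonthNumber': [3, 4]})
fy_lookup = pandas.DataFrame({'MonthNumber': [3, 4], 'FinYearDif': [0, 1]})

# outer join to the lookup via merge, joining on MonthNumber
df = df.merge(fy_lookup, on='MonthNumber', how='outer')

# derive the financial year, then drop the helper column (axis=1 = columns)
df['FinYear'] = df['Year'] + df['FinYearDif']
df = df.drop('FinYearDif', axis=1)
```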

Add to Power BI with the Python Script …

Bonus step: Persist result to a SQL Server database

I hinted previously that it's easy to persist the data to a SQL Server database using the dataframe.to_sql method. Add this after your code:
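A sketch using SQLAlchemy, illustrated here against an in-memory SQLite engine so it runs anywhere; for SQL Server you would swap in a connection string along the lines of 'mssql+pyodbc://...' (placeholder):

```python
import pandas
from sqlalchemy import create_engine

df = pandas.DataFrame({'Date': pandas.date_range('2019-01-01', '2019-01-03')})

# in-memory SQLite stands in for the SQL Server connection (placeholder)
engine = create_engine('sqlite://')
df.to_sql('DateTable', engine, if_exists='replace', index=False)
```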

I added the to_sql method and used this in my Python script for Power BI. If this will be supported in the Power BI Service, I may consider writing to SQL Server in the Python script, then reading the result from SQL Server. I could even run stored procedures that only load deltas to SQL Server. Should something like this be allowed in the Power BI Service, it could mean that a Power BI refresh also runs my (ETL) process to persist data to SQL Server.

Download the script

As mentioned at the start of this post, this was part of a SQL Saturday demo. The complete script (and more) is available from the SQL Saturday website for download.