Menu

Reshaping ugly CSV with Python

This is another post based on a demo from my SQLSaturday session about Python integration in Power BI. The data file is also available from this link or download your own file from the Reserve Bank link below.

It is a pattern I use fairly often because flat files lend itself to the artistic license of the data provider i.e. they can do just about anything they want with the flat file.

Lets take a look at a real world example. The South African Reserve Bank provides daily exchange rate data (ZAR per USD) through an embedded Reporting Services environment. When you download the data as Comma Separated Value (CSV), this is what you get:

The data represents 3 days of data. We only require the section indicated as “useful data”. If you look at the file structure carefully, you will notice that it could still be treated as a CSV file. When you open the file with the Query Editor, you will see that Power BI already treats the file like a CSV (or an Excel file that does not have all cells populated).

I am going to illustrate two methods of dealing with this in Python. You may encounter scenarios where either one could be useful.

Method 1: Treat the file as CSV

This method will essentially mimic what the Power BI Query Editor does with the file. We will use the Pandas read_csv method, but we will use some parameters to handle our requirements. Without indicating that there are three columns, read_csv will throw an exception when it encounters a third field further down the file. To indicate three columns we simply specify the columns.

In this example we will read the data line by line. Each line will get add to a list i.e. we will end up with a list of lines. We will then use some list comprehension techniques filter the data before adding it to a dataframe.

with open('D:\\data\\dirty\\ExchangeRateDetail.csv') as f:
lines = [line for line in f]

The new line characters form part of the line as shown below in our list where each item is a line in the file.

There are also special characters at the start of the file, but we dont need to worry about this as we will not use this item in the list.

To strip off the “\n” we will use the rstrip method which removes characters from the right hand side of a string. By not specifying a character, whitespace characters are removed. This includes “\n”.

with open('D:\\data\\dirty\\ExchangeRateDetail.csv') as f:
lines = [line.rstrip() for line in f]

With the newline removed, we can now start filtering the list of items(lines). We start by filtering out the empty items (blank lines).

lines = list(filter(None, lines))

Next we remove the first 3 items, but keep the rest of the list.

lines = lines[3:]

Our final filter condition is to remove all items that start with “The”

lines = [t for t in lines if not t.startswith('The')]

Now our list is ready to convert to a dataframe.

import pandas
df = pandas.DataFrame(lines)

Note that we end up with a single column (0) dataframe. Our final steps to get the dataframe into a ready to use state are:

split the column on comma separator

drop the original column

Explicitly cast the data types

Lets put a complete final script together that include the final steps.

Hopefully that provides some ideas on how to deal with flat files in general. As mentioned at the start of this post, it is often challenging to deal with flat files as it is up to the data provider to produce a well structured file.