Most Data Science / Machine Learning projects follow the Pareto principle: we spend roughly 80% of the time on data preparation and the remaining 20% on choosing and training an appropriate ML model. The datasets we receive for building Machine Learning models are usually messy and cannot be fed into a model directly. We need to perform data cleaning steps to obtain a dataset that can then be fitted to a model, and we must make sure the data we input into the model is tidy data. Indeed, this is the first step in a Machine Learning / Data Science project, and we may need to repeat the data cleaning process many times as we face new challenges and problems while cleaning the data. Data cleaning is one of the most important and time-consuming processes a Data Scientist performs before doing any EDA (Exploratory Data Analysis) or statistical analysis on the data.

What is Tidy Data?

Tidy data is the data obtained as the outcome of the cleaning processes applied so that the data can be visualized and used in a model easily. Tidy data norms help us in the data analysis process by defining guidelines we need to follow while performing data cleaning operations. They provide a standard way to organize the data values within a dataset and formalize the way we define the shape of the dataset. Tidy data also makes the data manipulation process easier. As with the database normalization forms (1NF, 2NF, 3NF, and BCNF), tidy data principles have some predefined rules. Let’s have a look at the tidy data principles.

Tidy Data Principles

The tidy data principles are:

Each column represents a separate variable in the dataset

Each row represents an individual observation in the dataset

Each type of observational unit forms a table

These rules are equivalent to the database third normal form (3NF).

Let’s have a look at some real-life dataset examples to understand the tidy data principles in more detail.

1. Each column represents a separate variable in the dataset

This data looks good for reporting purposes; however, this format is not suitable for analysis. In this data, each day (Mon to Sun) appears as an individual column, as if it were a separate variable. However, as per the first tidy data principle (each column represents a separate variable in the dataset), we need to transpose the day name columns into a single column, because they all represent a single variable – the week day name.

We can use the pandas melt method to convert the day name columns into values in a single column, which helps us convert this messy dataset into a tidy dataset. We can use this Python script.
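The original script is not reproduced here, so below is a minimal sketch of it. The sample data and the WeekDay/HoursWorked column names are assumptions for illustration; only EmployeeName, id_vars, value_vars, var_name, and value_name come from the walkthrough that follows.

```python
import pandas as pd

# Lines 1-2: a small sample dataset (assumed -- the article's data is not shown);
# each week day is its own column, which violates the first tidy data principle
df = pd.DataFrame({
    'EmployeeName': ['Alice', 'Bob'],
    'Mon': [8, 7], 'Tue': [8, 8], 'Wed': [7, 8],
    'Thu': [8, 6], 'Fri': [6, 8], 'Sat': [0, 4], 'Sun': [0, 0],
})

# Line 3: unpivot the day name columns into rows under a single column
dfMelted = pd.melt(
    df,
    id_vars=['EmployeeName'],
    value_vars=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
    var_name='WeekDay',       # assumed name for the unpivoted column
    value_name='HoursWorked', # assumed name for the value column
)

# Line 4: print the melted (tidy) data frame
print(dfMelted)
```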

#Line 3

In this line, we are using the pandas melt method to unpivot the day name columns into rows under a single column. We pass the EmployeeName column in the id_vars param to make it the identifier, and all the column names that need to be unpivoted into row values in the value_vars param. Using the var_name and value_name params, we supply meaningful names for the unpivoted column and the value column.

#Line 4

print(dfMelted)

In this line, we are printing the melted data frame, which is now a tidy dataset.

2. Each row represents an individual observation in the dataset

The date values are placed in columns and need to be moved into rows, as they represent a single variable – the date.

The min and max temperature values, which are placed in rows, need to be transposed into individual columns, as they represent two different variables – the minimum and maximum temperatures.

To get this done, we can use the pandas melt method followed by the pivot or pivot_table method. We first use melt to move the date values into rows, and then apply pivot or pivot_table to move the min and max row values into individual columns in the dataset.

This is the Python script we are using to convert the weather dataset into a tidy dataset.
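The original script is not reproduced here, so below is a minimal sketch of it. The sample data, the date values, and the Date/Temperature names are assumptions for illustration; CityName, VarType, the melt/pivot_table steps, reset_index, and clearing the columns name all come from the walkthrough that follows.

```python
import pandas as pd

# Lines 1-2: a small sample weather dataset (assumed -- the article's data is not
# shown); each date is a column, and VarType marks whether a row holds Min or Max
df = pd.DataFrame({
    'CityName': ['Delhi', 'Delhi', 'Mumbai', 'Mumbai'],
    'VarType': ['Min', 'Max', 'Min', 'Max'],
    '2021-01-01': [8, 21, 17, 31],
    '2021-01-02': [7, 22, 18, 30],
})

# Line 3: unpivot the date columns into rows under a single Date column
dfMelted = pd.melt(
    df,
    id_vars=['CityName', 'VarType'],
    value_vars=['2021-01-01', '2021-01-02'],
    var_name='Date',          # assumed name for the unpivoted column
    value_name='Temperature', # assumed name for the value column
)

# Line 4: pivot the Min/Max row values into individual columns
dfPivoted = dfMelted.pivot_table(
    index=['CityName', 'Date'], columns='VarType', values='Temperature'
)

# Line 5: move the CityName and Date index levels back to regular columns
dfPivoted.reset_index(inplace=True)

# Line 6: clear the columns-axis name ('VarType') left behind by the pivot
dfPivoted.columns.name = ''

# Line 7: print the transformed (tidy) data frame
print(dfPivoted)
```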

#Line 3

In this line, we are using the pandas melt method to unpivot the date columns into rows under a single column. We pass the CityName and VarType columns in the id_vars param to make them identifiers, and all the column names that need to be unpivoted into row values in the value_vars param. Using the var_name and value_name params, we supply meaningful names for the unpivoted column and the value column.

#Line 4

In this line, we are pivoting the temperature values to move the min and max row values into individual columns in the data frame. We could use Python’s method chaining to perform the melt and the pivot in a single line of code; however, here we use a new data frame to hold the pivoted output for easier understanding.

As we don’t need to transpose the CityName and Date columns, we pass them as a list in the index param. The VarType column, which needs to be pivoted, is passed to the columns param. Finally, the Temperature column is used as the value column via the values param.

#Line 5

dfPivoted.reset_index(inplace = True)

As we have used the CityName and Date columns in the index param, both columns become part of the data frame’s index, forming a multilevel (hierarchical) index. To move these index columns back to regular columns, we can use the pandas reset_index method (without passing any level values), which resets the index to the default range index. We set the inplace param to True to transform the data frame in place rather than creating a new one.

#Line 6

dfPivoted.columns.name = ''

Here, we are clearing the columns-axis name (set to ‘VarType’ by the pivot) by resetting it to an empty string.

#Line 7

print(dfPivoted)

In this line, we are printing the transformed dataframe which satisfies the tidy data principles.

3. Each type of observational unit forms a table

Sometimes, a single type of observational unit is spread across multiple files or tables. However, as per the third tidy data principle – each type of observational unit forms a table – we need to merge these files into a single table.

Here, we can see that both of these files contain details about the same observational unit. We can use the pandas concat, merge, or join methods to combine these two datasets into a single tidy dataset.
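The original script and input files are not reproduced here, so below is a minimal sketch. The two sample frames and the Temperature/Humidity columns are assumptions standing in for the two files; joining on the Date column with pd.merge comes from the walkthrough that follows.

```python
import pandas as pd

# Lines 1-3: two sample data frames standing in for the two files (assumed --
# the article's files are not shown); both describe the same observational
# unit, keyed by the Date column
dfTemperature = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-02'],
    'Temperature': [21, 22],
})
dfHumidity = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-02'],
    'Humidity': [65, 70],
})

# Line 4: join the two data frames on the Date column (inner join by default)
dfMain = pd.merge(left=dfTemperature, right=dfHumidity, on='Date')

# Line 5: print the merged data frame
print(dfMain)
```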

#Line 4

Here, we are using the pandas merge method to join these two data frames into a single data frame. We have defined the left and right data frames using the left and right params respectively, and we use the on param to join the datasets on the Date column. By default, merge performs an inner join; however, we can use a left, right, or outer (full) join if needed.
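The join type is controlled by merge’s how parameter. A quick sketch with hypothetical frames whose Date values only partially overlap:

```python
import pandas as pd

# Hypothetical frames with partially overlapping Date values (assumed data)
dfA = pd.DataFrame({'Date': ['2021-01-01', '2021-01-02'], 'Temperature': [21, 22]})
dfB = pd.DataFrame({'Date': ['2021-01-02', '2021-01-03'], 'Humidity': [70, 72]})

inner = pd.merge(dfA, dfB, on='Date')               # default: only the matching Date
left = pd.merge(dfA, dfB, on='Date', how='left')    # keeps every row of dfA
outer = pd.merge(dfA, dfB, on='Date', how='outer')  # keeps all Dates; gaps become NaN

print(len(inner), len(left), len(outer))  # 1 2 3
```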

#Line 5

print(dfMain)

Here, we are printing the dataframe using print method.

Thanks for reading. Please share your input in the comment section.

About Gopal Krishna Ranjan

Gopal has 8 years of industry experience in software development. He has hands-on experience in Data Science, databases, data warehousing, Big Data, and cloud technologies, and has implemented end-to-end solutions.
He has extensively worked on SQL Server, Python, Hadoop, Hive, Spark, Azure, Machine Learning, and MSBI (SSAS, SSIS, and SSRS). He also has good experience in Windows and web application development using ASP.Net and C#.