Simulating data and file-based ETL

Posted on 24 Sep 2018 | 19 minutes (3854 words)

Introduction

Data Scientists spend a lot of time importing, cleaning, tidying and transforming data before any decent analysis can start. Like many industries, the one I work in typically emails files around to communicate data and reports. I follow a consistent approach to ETL and subsequent data consolidation to better manage the accumulation of multiple, disparate files from a variety of sources and in different formats.

This tutorial demonstrates a simplified version of this process. It is split into four parts, starting with manufacturing data and culminating in file-based IO.

Part 2 extends the output from Part 1 with simulated data in the context of a gym training scenario.

Part 3 exports the grouped simulated data to a variety of file types.

Part 4, the final tutorial, imports, tidies and transforms the exported files from Part 3 in an automated and standardised way.

The Scenario

As an honorary English person, I decided to simulate a gym training log for the current English football squad. The result is a daily record of selected exercise repetitions for each player, logged during a gym training session. It covers a two-month period, with the first month in 2017 and the second in 2018, a year apart.

The base list of players is copied from Wikipedia, extended further down this tutorial with simulated data, and used to demonstrate procedures and functions related to IO.

Part 1: Import, Tidy Data and Create Features

Library

I always declare libraries at the top of the document in a typical workflow. It makes it a lot easier to know upfront which libraries are used when sharing the document with others.
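For reference, the libraries this tutorial relies on can be declared in one block. This is a minimal sketch based on the functions used below (tidyverse verbs, lubridate dates, openxlsx for Excel and knitr for tables):

library(tidyverse)  # dplyr, tidyr, purrr, readr and stringr verbs
library(lubridate)  # date parsing and interval arithmetic
library(openxlsx)   # creating and manipulating Excel files
library(knitr)      # kable() for printing tables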

Functions

Keeping general functions towards the top of a document makes it easier to reuse them throughout. In larger projects, I tend to create dedicated function files that other documents source during set-up. It promotes consistency, continuity, standardisation and simplification.

General Parameters

Grouping generic parameters in one section allows new users to quickly adapt the document to their environment and its intended use. Here I am setting up the relative path to the source data.

directory_path <- "../../resources/england_football_simulation/"

Importing

The data type of each column is guessed when the base source data is imported. As a rule, I always increase the guess_max parameter unless I know the data structure in advance.

I have previously imported data where the first few hundred values in a column are empty, with populated values further down the table. The import function guesses the data type as logical when it isn't, and the populated values further down are incorrectly coerced to NA.
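A minimal sketch of such an import, assuming the squad list is stored as a CSV in directory_path (the file name is illustrative):

df <-
  read_csv(
    paste0(directory_path, "england_squad.csv"),
    # Scan many more rows than the default before guessing column types, so
    # sparsely populated columns are not mistaken for logical
    guess_max = 10000
  )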

As a hangover from working with databases, I tend to tidy column names on import using the func_rename_tidy function, as it simplifies coding after import.
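The body of func_rename_tidy is not reproduced here; a hypothetical version might look like this, with the replacement rules as assumptions:

func_rename_tidy <- function(param_df) {
  param_df %>%
    # Replace each run of non-alphanumeric characters in a column name with a
    # single underscore, then trim any underscores left at either end
    rename_all(function(param_name) {
      param_name %>%
        str_replace_all("[^[:alnum:]]+", "_") %>%
        str_replace_all("^_|_$", "")
    })
}

Applied to the source data, this turns a column heading such as Date of birth (age) into Date_of_birth_age.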

Let’s take a look at the first few rows of data and simultaneously create an initial table for later comparison.

kable(
  df_initial <-
    df %>%
    head()
)

| No | Player                 | Date_of_birth_age         |
|----|------------------------|---------------------------|
| 1  | Jordan Pickford        | 7 March 1994 (age 24)     |
| 2  | Trent Alexander-Arnold | 7 October 1998 (age 19)   |
| 3  | Danny Rose             | 2 July 1990 (age 28)      |
| 4  | Kyle Walker            | 28 May 1990 (age 28)      |
| 5  | James Tarkowski        | 19 November 1992 (age 25) |
| 6  | Harry Maguire          | 5 March 1993 (age 25)     |

Tidying

The Date_of_birth_age field contains multiple variables: the date of birth, and the age calculated at the time the article was published. This is messy data, which I have previously written about.

The DOB is extracted from the field and converted to a date data type, with the age part discarded. A new feature, Age, is created as the interval between the DOB field and lubridate::today().
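A minimal sketch of one way to do this, assuming the Date_of_birth_age format shown in the table above:

df <-
  df %>%
  mutate(
    # Extract the leading "day month year" text and parse it as a date,
    # discarding the bracketed age
    DOB = dmy(str_extract(Date_of_birth_age, "^\\d{1,2} \\w+ \\d{4}")),
    # Age in whole years between the DOB and the day the notebook is run
    Age = floor(time_length(interval(DOB, today()), "years"))
  )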

This confirms that the DOB field has been converted into a date field, and used to calculate the Age based on today(), the current date when the notebook was run.

Part 2: Simulation

Simulation Scenario

The number of repetitions for each of the standard exercises is logged during a training session. Most players are unable to train every day, so there are gaps in the timeline for each player during the recorded period, with some players missing more sessions than others.

The exercises included:

Horizontal Seated Leg Press

Lat Pull-Down

Cable Biceps Bar

Cable Triceps Bar

Chest Press

Hanging Leg Raise

Simulation Process

The graph shows the simulation data process flow cycle.

Simulation Process Cycle

Between 15 and 22 players train daily, represented by Sampled Players in the above graph. Days during the two months where no exercises are simulated are coded as Day Off.

Some exercise repetitions, represented by Repetitions in the graph, have a zero count: the case where a player skips an exercise during a training session.

Let’s start this section by creating some parameters and reference values.
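The exact values are not critical; a hypothetical set-up is shown below, where the two chosen months are an assumption but the names match those used by the simulation function:

param_group_size <- 22

param_exercises <- c(
  "Horizontal Seated Leg Press",
  "Lat Pull-Down",
  "Cable Biceps Bar",
  "Cable Triceps Bar",
  "Chest Press",
  "Hanging Leg Raise"
)

# One row per day across the two simulated months, a year apart
param_days <-
  tibble(
    Date = c(
      seq(ymd("2017-08-01"), ymd("2017-08-31"), by = "day"),
      seq(ymd("2018-08-01"), ymd("2018-08-31"), by = "day")
    )
  )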

Simulation Function

The following code section is the simulation function. It combines the parameters with random draws and sequences through the simulation process, as represented by the Simulation Process Cycle graph shown above.

func_create_data <-
  # As with the process graph, each day from the param_days tibble is used as
  # an input to start a new simulated cycle
  function(param_day) {
    # Use the integer value of the day to set a seed value, which means that
    # this part of the simulation can be recreated as is
    set.seed(param_day %>% as.integer())
    # Create the sample size from the 22 players, sampling a value between 15
    # and 22, which represents the total number of players that exercise in a
    # single day
    training_group_size <- sample(15:param_group_size, 1)
    # Draw training_group_size samples from the population of players 1 to 22
    # without replacement
    sample(x = 1:param_group_size,
           replace = FALSE,
           size = training_group_size) %>%
      # Map each sampled player in the distribution to all 6 exercises
      map(function(param_player_number) {
        param_exercises[1:6] %>%
          # For each one of the exercises, create a random sample between 0
          # and 20 repetitions
          map(function(param_exercise) {
            # Populate a tibble with all parameters used within the daily
            # cycle simulation. This part is truly random and will be
            # different between all iterations
            tibble(
              No = param_player_number,
              Exercise = param_exercise,
              Volume = sample(x = 0:20,
                              replace = FALSE,
                              size = 1)
            ) %>%
              return()
          })
      }) %>%
      # Return and nest the result in the daily tibble, corresponding to its
      # calling date
      return()
  }

Executing the Simulation

The code section below creates tmp_exercise_data by feeding the previously created parameters to the func_create_data function.

tmp_exercise_data <-
  param_days %>%
  # Map the list of days to the func_create_data function to simulate the data
  mutate_at("Date", funs(data = map), func_create_data) %>%
  # The unnest() function returns all iteratively nested cycled data to the
  # top param_days level
  unnest() %>%
  unnest() %>%
  unnest() %>%
  # Complete the data for all combinations of player no, date and exercise,
  # creating explicit entries for the missing combinations
  complete(No, Date, Exercise) %>%
  # Create a new variable by extracting the year value from the date
  mutate_at("Date", funs(Year = year)) %>%
  # Group by player number and nest the simulated data in the training_data
  # column
  group_by(No) %>%
  nest(.key = training_data)

Explore the Simulated Data

Let’s take a look at the first few rows of player data joined with the simulated data.
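A minimal sketch, assuming the player No column is the common key:

df_simulation <-
  df %>%
  left_join(tmp_exercise_data, by = "No")

df_simulation %>%
  head() %>%
  kable()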

Part 3: Export Simulated Data

Export Process

The aim of this part is to output data into a variety of files to be consumed in Part 4, Import Simulated Data.

The following image shows the process by which data is transformed, grouped and outputted into a variety of file types.

Data Export Process

The training log data is grouped by year, nested, and exported to two file types: Excel and CSV.

The 2017 data is exported to Excel, mapping each player and their associated data to a tab. The 2018 data is exported to CSV, with each exercise written to its own file; sketches of both follow below.

To Excel

I use the openxlsx library to create and manipulate Excel files. The aim is to create one Excel file for the 2017 data, with a tab for each of the 22 players, populating each sheet with the relevant data.
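A minimal sketch with openxlsx, assuming the 2017 rows have been filtered into a df_2017 tibble holding one nested training_data row per player:

wb <- createWorkbook()

df_2017 %>%
  select(Player, training_data) %>%
  # Create one tab per player and populate it with that player's log
  pwalk(function(Player, training_data) {
    addWorksheet(wb, sheetName = Player)
    writeData(wb, sheet = Player, x = training_data)
  })

saveWorkbook(wb,
             paste0(directory_path, "training_log_2017.xlsx"),
             overwrite = TRUE)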

Confirm that the files are created by checking the directory_path folder, as specified in the General Parameters section.
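The 2018 CSV side follows the same pattern; a minimal sketch, assuming the 2018 rows have been filtered into a df_2018 tibble with an unnested Exercise column:

df_2018 %>%
  # One data frame per exercise
  split(.$Exercise) %>%
  # Write each to its own CSV file, named after the exercise
  iwalk(function(param_data, param_exercise) {
    write_csv(param_data,
              paste0(directory_path, param_exercise, "_2018.csv"))
  })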

Part 4: Import Simulated Data

Import Process

The aim is to dynamically import data into a common dataframe. For Excel files, the function lists all the tabs, imports each table, and unnests them back into the calling file's tibble, as is also the case with CSV files.
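A hypothetical sketch of that pattern, where the file matching and column names are assumptions:

df_imported <-
  tibble(file = list.files(directory_path, full.names = TRUE)) %>%
  mutate(data = map(file, function(param_file) {
    if (str_detect(param_file, "\\.xlsx$")) {
      # List every tab in the workbook, import each one and bind the
      # resulting tables together, keeping the tab name as the player
      getSheetNames(param_file) %>%
        map_dfr(function(param_sheet) {
          read.xlsx(param_file, sheet = param_sheet) %>%
            as_tibble() %>%
            mutate(Player = param_sheet)
        })
    } else {
      # CSV files import directly into a single table
      read_csv(param_file)
    }
  }))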

Summary

This tutorial is simplified, without real-world nuances and overhead such as error handling. However, it adequately demonstrates the power and simplicity of the approach, which can be extended for use in similar situations.