Descriptive Analytics - Part 4: Data Manipulation

Descriptive Analytics is the examination of data or content, usually manually performed, to answer the question “What happened?”.

To solve this set of exercises you should have solved part 0, part 1, part 2, and part 3 of this series, and you should also run this script, which contains some more data cleaning. In case you haven’t, run this script on your machine; it contains the lines of code we used to modify our data set. This is the fifth set of exercises in a series that aims to provide a descriptive analytics solution to the ‘2008’ data set from here. This data set, which contains the arrival and departure information for all domestic flights in the US in 2008, has become the “iris” data set for Big Data.

Descriptive analytics is all about answering questions, and the goal of this set of exercises is to ‘answer’ questions with very few lines of code using the dplyr package. dplyr is a great package for data manipulation (if you are familiar with SQL, it will be a piece of cake for you). Before proceeding, it might be helpful to look over the help pages for select, contains, filter, summarise, mutate, group_by, and arrange.
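As a rough illustration of how these verbs chain together, and how they map onto SQL clauses, here is a minimal sketch. The toy_flights data frame and its column names (carrier, origin, dep_delay, arr_delay) are made up for this example and are not the columns of the data set used in the series.

```r
library(dplyr)

# Toy stand-in for the flights data; names and values are invented for illustration.
toy_flights <- data.frame(
  carrier   = c("AA", "AA", "UA", "UA", "DL"),
  origin    = c("JFK", "LAX", "JFK", "ORD", "ATL"),
  dep_delay = c(10, -2, 35, 0, 5),
  arr_delay = c(12, -5, 40, 3, 1),
  stringsAsFactors = FALSE
)

delay_summary <- toy_flights %>%
  select(carrier, contains("delay")) %>%      # like SELECT: keep carrier and the *delay* columns
  filter(dep_delay >= 0) %>%                  # like WHERE: drop flights that left early
  mutate(gained = dep_delay - arr_delay) %>%  # add a computed column
  group_by(carrier) %>%                       # like GROUP BY
  summarise(
    n_flights      = n(),                     # rows per carrier
    mean_arr_delay = mean(arr_delay)          # average arrival delay per carrier
  ) %>%
  arrange(desc(mean_arr_delay))               # like ORDER BY ... DESC

print(delay_summary)
```

Each verb takes a data frame and returns a data frame, which is why the steps can be piped one into the next.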

For this set of exercises you will need to install and load the packages rapportools and outliers.

Since we use the dplyr package, we will also make our data frame, flights, a local data frame.
We do that because a local data frame has some useful properties. First of all, if we (accidentally) type ‘flights’, a local data frame prints only the first 10 rows and as many columns as fit on the screen, whereas a regular data frame tries to print every row (up to R’s max.print limit), which floods the console and can be slow for a data set of this size. Another reason is that when we type the name of a local data frame, it also shows the number of rows and columns and the type of each column.
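The conversion itself can be sketched as follows. This assumes a recent version of dplyr, where as_tibble() is the usual way to do it (older versions of dplyr used tbl_df() for the same purpose). The flights_df object is a small made-up stand-in, not the real data set.

```r
library(dplyr)

# Toy stand-in with 25 rows, enough to see that printing stops after 10.
flights_df <- data.frame(
  flight_id = 1:25,
  dep_delay = seq(-5, 19)
)

# Convert the regular data frame into a local data frame (tibble).
flights_tbl <- as_tibble(flights_df)

# Prints the dimensions, the column types, and only the first 10 rows.
print(flights_tbl)

# glimpse() gives a compact, transposed overview of every column.
glimpse(flights_tbl)
```

The result is still a data frame, so functions that expect a regular data frame generally keep working on it.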