Exercise

Tidy Data and Messy Data

What exactly marks the difference between tidy data and messy data? It is not only how organized and intuitive a dataset looks to human eyes, but also how easily and efficiently a computer can process it. In his seminal paper "Tidy Data", Hadley Wickham proposed three rules for tidy data:

1. Each variable forms a column.

2. Each observation forms a row.

3. Each type of observational unit forms a table.

In this course, we'll focus on the first two rules and show how to use the Python package pandas to handle datasets that violate them. To get started, execute messy in the IPython shell. This dataset, which appears in Wickham's paper, shows the number of people who chose each of two treatments in a hospital. Compare its structure with Wickham's rules. The dataset is messy because it violates rule #2: it combines Treatment A and Treatment B, two distinct observations, in a single row.
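As a minimal sketch of how pandas can repair this kind of violation, the snippet below builds a small stand-in table and uses pd.melt to move the two treatment columns into rows, so that each row holds a single observation. The names, column labels, and values are hypothetical and are not the contents of the preloaded messy dataset.

```python
import pandas as pd

# Hypothetical stand-in for a wide table; names and values are illustrative only
# and do not reproduce the preloaded `messy` dataset.
wide = pd.DataFrame({
    "name": ["Ann", "Ben", "Cal"],
    "treatment_a": [5, 16, 3],
    "treatment_b": [2, 11, 1],
})

# Melt the two treatment columns into rows: one row per (name, treatment) pair.
tidy = pd.melt(
    wide,
    id_vars="name",
    value_vars=["treatment_a", "treatment_b"],
    var_name="treatment",
    value_name="count",
)

print(tidy)
```

After melting, the three-row wide table becomes six rows with the columns name, treatment, and count, which satisfies both of the first two rules.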

Now let's look at two more preloaded datasets, both featured in DataCamp's Cleaning Data in R course. Execute df1 and df2 in your IPython shell to inspect them. The former shows the type and number of pets owned by three co-workers, and the latter shows the average BMI in three countries over several years. Which of these datasets is messy, and why?
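One way to inspect a dataset before answering is to look at its shape, column labels, and column types, and then ask whether each column header names a variable or is itself a value. The helper below is a hypothetical convenience function (summarize_structure is not part of pandas or the course), shown on a made-up example frame rather than on df1 or df2.

```python
import pandas as pd

def summarize_structure(df: pd.DataFrame) -> dict:
    """Collect basic structural facts about a DataFrame for a quick tidy-data check."""
    return {
        "n_rows": df.shape[0],
        "n_columns": df.shape[1],
        "columns": list(df.columns),
        "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
    }

# Hypothetical example frame; in the course shell, pass df1 or df2 instead.
example = pd.DataFrame({"city": ["Oslo", "Lima"], "temp_c": [4.5, 19.0]})
summary = summarize_structure(example)

print(summary["columns"])
print(summary["dtypes"])
```

In the IPython shell, calling summarize_structure(df1) and summarize_structure(df2) would give the same kind of summary for each preloaded dataset.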