Setup

Overview of the Dataset

dplyr provides a function glimpse to view the columns in a data frame:

dplyr::glimpse(Cars93)

It is very important to check that the data type of each column matches what you expect. Problems can arise if, for example, you have accidentally used stringsAsFactors=TRUE when calling read.table or read.csv, which converts every text column into a factor.
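As a minimal sketch of such a check, assuming Cars93 is loaded from the MASS package, the column types can be tabulated with base R. The two-row CSV string below is a made-up example, used only to show what stringsAsFactors changes.

```r
# Cars93 ships with MASS, a recommended package installed alongside R.
data(Cars93, package = "MASS")

# One class per column; table() counts how many columns have each type.
col_classes <- sapply(Cars93, class)
table(col_classes)

# Hypothetical CSV text (not part of Cars93) to show the effect of the argument.
csv_text <- "make,price\nAcura,15.9\nAudi,29.1"
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_char   <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$make)  # "factor"
class(as_char$make)    # "character"
```

If a column that should be text or numeric turns up as a factor, the import call is the first place to look.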

We can visualise this information using the R package visdat:

vis_dat(Cars93)

Note that there are several NA values in the columns Luggage.room and Rear.seat.room:

sum(is.na(Cars93$Luggage.room))

## [1] 11

sum(is.na(Cars93$Rear.seat.room))

## [1] 2

In this case, rear seat room is missing for sports cars, which do not have a back seat:
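One way to list the affected rows is sketched below, assuming Cars93 comes from the MASS package. The columns shown are chosen for illustration and are not necessarily those in the original output.

```r
data(Cars93, package = "MASS")

# Keep only the rows where rear seat room was not recorded,
# together with a few columns that help identify the cars.
missing_rear <- Cars93[is.na(Cars93$Rear.seat.room),
                       c("Make", "Type", "Passengers", "Rear.seat.room")]
missing_rear
```

Comparing the Type and Passengers columns for these rows against the rest of the data is a quick way to judge whether the missingness has a structural cause.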

There is an important distinction between data that is missing at random and data that is unavailable due to some underlying reason. The column Rear.seat.room is truly not applicable for these makes of car, so these rows can be safely excluded from any statistical model.

Exercise: can you figure out if there is a reason why Luggage.room is missing, or whether it is completely at random?

Pairwise Correlation

We can use a pairs plot¹ to explore relationships between pairs of columns in our data frame. For example:
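A minimal sketch using base graphics, assuming Cars93 is loaded from MASS. The four columns are an arbitrary choice for illustration and may differ from those in the original figure.

```r
data(Cars93, package = "MASS")

# An illustrative subset of numeric columns (an assumption, not the author's choice).
cols <- c("Price", "MPG.city", "Horsepower", "Weight")

# Scatterplot matrix: one panel for every pair of the selected columns.
pairs(Cars93[, cols], main = "Cars93: selected numeric columns")

# The matching Pearson correlations, rounded for readability.
cor_mat <- round(cor(Cars93[, cols]), 2)
cor_mat

# Number of distinct column pairs in the full data frame.
n_pairs <- choose(ncol(Cars93), 2)
n_pairs  # 351
```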

A pairs plot is an example of small multiples²: we look at selected subgroups of columns, rather than plotting all 351 possible pairs of columns at once. With that many panels, it would be too difficult to glean any useful information from this style of visualisation.

Principal Components Analysis

Principal components analysis (PCA) is a method for dimension reduction. We can use it to explore covariance relationships between all of the columns simultaneously. PCA can be computed using the functions stats::prcomp or stats::princomp, but instead we will be using the R package FactoMineR³. This is mainly because its results can be plotted with ggplot2 instead of base graphics, via the R package factoextra.

For now, we will exclude any rows with missing values. We will also exclude any columns containing categorical data (factors) from the calculation. It is possible to handle these types of data using generalised PCA, but that is beyond the scope of this tutorial.
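The code that builds the PCA object is not shown here, so the following is only a sketch of one way it might be done. It assumes the object is called cars_pca (the name used in the plotting calls) and that the factor columns are passed to FactoMineR::PCA as supplementary variables. Supplementary variables do not influence the components, but they stay available for colouring plots by column index.

```r
library(FactoMineR)
data(Cars93, package = "MASS")

# Keep complete rows only, then drop factor levels that no longer occur.
cars_complete <- droplevels(Cars93[complete.cases(Cars93), ])

# Positions of the categorical columns (an assumption about how they are handled).
factor_cols <- which(sapply(cars_complete, is.factor))

# scale.unit = TRUE (the default) standardises each numeric column first.
cars_pca <- PCA(cars_complete, quali.sup = factor_cols,
                scale.unit = TRUE, ncp = 18, graph = FALSE)

# Eigenvalue, percentage of variance and cumulative percentage per component.
head(cars_pca$eig)
```

Dropping the factor columns entirely before calling PCA would give the same components. They are kept as supplementary variables here so that plots can still be grouped by them.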

There are 18 principal components, since we have 18 continuous variables in our dataset. A scree plot shows how much of the variance in the data is explained by each component:

fviz_screeplot(cars_pca, ncp=18)
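The percentages in a scree plot are each component's eigenvalue divided by the sum of all eigenvalues. As a cross-check, the same quantity can be sketched with stats::prcomp. This assumes the variables are standardised, which is the default behaviour of FactoMineR::PCA, and the object names below are illustrative.

```r
data(Cars93, package = "MASS")

# Complete rows and numeric columns only.
cars_complete <- Cars93[complete.cases(Cars93), ]
cars_num <- cars_complete[, sapply(cars_complete, is.numeric)]

# scale. = TRUE standardises each column, i.e. PCA on the correlation matrix.
pca_base <- prcomp(cars_num, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components.
eigenvalues <- pca_base$sdev^2
pct_var <- 100 * eigenvalues / sum(eigenvalues)
round(pct_var, 1)
```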

We can see that 63.7% of the variance is explained by the first principal component, with an additional 13.1% explained by the second component. We can plot all of our observations according to their 2D coordinates on these two components:

fviz_pca_ind(cars_pca, axes = c(1, 2), habillage=3)

Instead of plotting the rows (observations) according to their principal components, we can also plot the columns (variables).