Getting to know multivariate data

Core Ideas:

multivariate modeling is challenging

pair plots make it easy to get a quick understanding of each variable and the relationships between them

Multivariate analysis and modeling can be really challenging. Getting the job done well requires you to know your data really well. People often use the metaphor that you know something well if you “know it like the back of your hand”. However, we look at our hands every day but probably cannot recall exactly where each freckle or wrinkle is. You want to know your data in much more detail than that.

One very valuable first step when working with a new multivariate data set is to look at the relationships between each pair of variables. There are a number of ways to do this in R, and I usually use two different scatter plot matrix methods to get a feel for those relationships.

Here is an example using the mtcars dataset in R.

# keep the first seven columns: mpg, cyl, disp, hp, drat, wt, qsec
df <- mtcars[, 1:7]

Scenario(s):

getting to know your numerical data

predictive modeling (feature selection, technique choice,…)

psych::pairs.panels

why use it?

you can see points with an ellipse superimposed in the lower region

you can see the data distribution on the diagonal for each variable

you can see the correlation values in the upper region

works with categorical data

library(psych)
pairs.panels(df)

corrgram::corrgram

why use it?

pie chart in the lower region gives a quick visual view of correlations
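A minimal sketch of a corrgram call that produces the pie-chart view described above. The panel choices here are an assumption rather than a confirmed copy of the original call: `panel.pie`, `panel.conf` and `panel.minmax` are panel functions shipped with the corrgram package, but `panel.conf` may be missing from older releases, in which case `panel.shade` or `panel.ellipse` can be swapped in for the upper region.

```r
library(corrgram)

# same seven numeric columns used throughout this post
df <- mtcars[, 1:7]

# lower region: pie charts whose filled fraction reflects the correlation
# upper region: correlation values with confidence intervals (assumed choice)
# diagonal:     min and max of each variable (assumed choice)
corrgram(df,
         lower.panel = panel.pie,
         upper.panel = panel.conf,
         diag.panel  = panel.minmax)
```

Each panel argument is just a function, so you can mix and match them to suit your own preferences.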

Based on these plots it is easy to see some important high-level relationships between the variables.

mpg is strongly negatively correlated with:

cyl: number of cylinders

disp: engine displacement

hp: horsepower

wt: vehicle weight

mpg is positively correlated with:

drat: rear axle ratio

qsec: time to drive 1/4 mile

rear axle ratio and weight do not have a strong relationship with the 1/4-mile time. This means that if you want to predict 1/4-mile time, you would not want to use them as unconditional predictors. In fact, it might prompt you to look for interactions between the variables so you can do conditional modeling.

rear axle ratio is negatively correlated with wt, hp, disp and cyl. I know nothing about cars, but now I know that heavier, more powerful cars tend to have a smaller rear axle ratio.
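The signs and strengths read off the plots can also be checked numerically. This is a small sketch using base R's `cor()`; the name `cor_mat` is only illustrative.

```r
# same seven columns as before, selected by name for readability
df <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec")]

# Pearson correlation matrix, rounded for easier reading
cor_mat <- round(cor(df), 2)

# correlations of mpg with every other variable
print(cor_mat["mpg", ])

# correlations of 1/4-mile time with rear axle ratio and weight
print(cor_mat["qsec", c("drat", "wt")])
```

A negative value means the two variables tend to move in opposite directions, and a value near zero means there is little linear relationship.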

There is also a lot of great basic summary info here:

A distribution plot for each variable

The min and max of each variable
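If you want the same min and max values as numbers rather than reading them off a plot, here is a quick sketch in base R. The name `rng` is only illustrative.

```r
df <- mtcars[, 1:7]

# one row per variable, with its smallest and largest value
rng <- t(sapply(df, range))
colnames(rng) <- c("min", "max")
print(rng)
```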

This still provides only a superficial understanding of the data, but it is a good start. Both packages offer lots of options, so you can adapt how you use these functions to your own style and preferences.

4 comments on “Getting to know multivariate data”

Hi there,
I have found your post very informative. However, I’m experiencing difficulties reproducing your second example under R 2.13.1. It gives me the following error message:
object ‘panel.conf’ not found.
I checked the help and I couldn’t find that option.
May I know what I’m doing wrong?
Many thanks in advance,
Ruben