Pair Plots

Why do we need Pair Plots?

A simple 2D scatter plot shows the relationship or pattern between two variables (features) in a dataset, and a 3D plot does the same for three. But what do we do when the dataset has more than three features, given that humans cannot visualize more than three dimensions? One solution is the pair plot, one of the most effective starting tools in our arsenal for exploratory data analysis (EDA).

Pair plots are used to plot features when we have more than three dimensions. As the name suggests, we take the features in pairs and plot every pair.

For example, let’s say we have four features Name, Place, Animal and Thing in our dataset. In that case, we will have 4C2 plots i.e. 6 unique plots. The pairs, in this case, will be i. (Name, Place); ii. (Name, Animal); iii. (Name, Thing); iv. (Place, Animal); v. (Place, Thing) and vi. (Animal, Thing).
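The six pairs listed above can also be enumerated with Python's standard library. This is only a small sketch; the feature names are the illustrative ones from the example, not columns of a real dataset.

```python
from itertools import combinations
from math import comb

# Example feature names taken from the text above.
features = ["Name", "Place", "Animal", "Thing"]

# Every unordered pair of distinct features corresponds to one unique plot.
pairs = list(combinations(features, 2))

for first, second in pairs:
    print(f"({first}, {second})")

# The number of pairs matches 4C2.
print(len(pairs), comb(len(features), 2))
```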

So, instead of trying to visualize four dimensions, which is not possible, we look at six 2D plots arranged as a matrix and try to understand the 4-dimensional data from them.

Some basic points to remember whenever we are looking at pair plots:

We get a matrix of plots with an equal number of plots on each side of the diagonal, and the two sides are mirror images of each other (the same pair of features with the axes swapped). So we only need to look at the plots either below or above the diagonal, unless we use a PairGrid to draw different kinds of plots on each side.

We should always try to understand the legends.

It is simply amazing that just one simple line of code gives us the entire plot.

Instead of writing code to plot each 2D scatter plot individually, we can write one line of code and get all our pair plots.

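# Creating a simple pair plot (`knowledge` is the DataFrame holding our dataset)
import matplotlib.pyplot as plt
import seaborn as sns
sns.pairplot(knowledge)
plt.show()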

A Simple Pair Plot

We can make these charts more interesting by coloring the points according to their class labels.

With the help of these customizations, we get an elementary visualization of the data that helps us find patterns and relationships in it. This gives us a great start in understanding our project. If you would like to look at some more customizations, you can check my GitHub profile for this dataset.

The advantage of a pair plot is that it helps us identify which set of features best explains a relationship between two variables or forms the most separated clusters. It also helps us build simple classification models by drawing a few lines that linearly separate the classes in our dataset.

However, pair plots have a major disadvantage:

Let’s say we have 10 or 100 features in our dataset instead of 4. In such a case we would have 10C2 = 45 or 100C2 = 4950 plots, which would be very difficult to go through, as the sheer number of plots is too high to make sense of.
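To see how quickly the number of plots grows, here is a short sketch that counts the unique feature pairs for a few feature counts. The specific counts in the loop are chosen only for illustration.

```python
from math import comb

# The number of unique feature pairs (and hence unique plots)
# for d features is dC2 = d * (d - 1) / 2.
for d in (4, 6, 10, 100, 1000):
    print(f"{d} features -> {comb(d, 2)} unique plots")
```

The count grows roughly with the square of the number of features, which is why the matrix becomes unreadable long before we reach hundreds of features.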

So, pair plots are easy to understand when the number of features or dimensions is low, say 4, 5 or even 6, as we can quickly go through all the plots and spot any trends. But when the number of dimensions is very high, say 10, 100 or 1000, pair plots are not of much help unless we are very sure which 5-6 features we want to visualize. In such cases, we use dimensionality reduction techniques like Principal Component Analysis (PCA) or t-SNE, which help us visualize the data. We will try to understand those concepts too over time.
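As a first taste of the dimensionality-reduction idea, here is a minimal PCA sketch that uses only NumPy's SVD. The data is random stand-in data, and the variable names and the choice of two components are assumptions made for illustration; in practice a library implementation such as scikit-learn's PCA would normally be used.

```python
import numpy as np

# Hypothetical stand-in data: 200 samples with 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# PCA via SVD: center the data, then project onto the top principal directions.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

n_components = 2
X_2d = X_centered @ Vt[:n_components].T  # shape (200, 2), ready for a 2D scatter plot

# Fraction of the total variance captured by each component.
explained = (S ** 2) / np.sum(S ** 2)
print(X_2d.shape, explained[:n_components])
```

The two columns of `X_2d` can then be drawn as a single 2D scatter plot instead of a matrix of 45 pair plots.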