Follow me on my journey to becoming a Data Scientist

Exploratory Data Analysis: Covariation of a Categorical and a Continuous Variable

Which Wine Varieties are the Most Affordable?

In my recent post about Variation, I used Kaggle’s Wine Reviews data set to explore the variation within wine variety and found that the most common variety is Chardonnay. I then looked at the variation of Chardonnay prices and ultimately found that the outliers in the data set may have been entered in error. This is a great example of why exploratory data analysis is needed for any project.

Now, instead of looking at the variation in one variable, in this post I want to use the data to see the covariation between wine variety and price. Specifically, which of the common wine varieties are the most affordable? Since these are two different types of variables, one categorical and one continuous, I will need to visualize the data in a different way.

A good way to visualize this kind of data is a box and whisker plot. This kind of plot is ideal for comparing distributions across categories because the median, spread and range of each one are immediately obvious.

What does a box and whisker plot tell you?

The ends of the box are the lower and upper quartiles (the 25th and 75th percentiles). The distance between them is called the interquartile range, and the box contains the middle 50% of the data.

The median of the distribution is marked by a line inside the box.

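The whiskers, the two lines outside of the box, extend to the highest and lowest observations that are not treated as outliers. Most plotting tools draw them to the most extreme points within 1.5 times the interquartile range of the box edges.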

The data points beyond the end of the whiskers are outliers.
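To make these pieces concrete, here is a minimal sketch that computes the numbers a box and whisker plot draws. It assumes the common convention that the whiskers stop at the most extreme points within 1.5 times the interquartile range of the box, and it uses one of several ways to compute quartiles, so your plotting tool may give slightly different values. The function name `box_plot_summary` and the prices are made up for illustration; they are not from the Wine Reviews data.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Return the quantities a box and whisker plot draws (1.5 * IQR convention)."""
    data = sorted(values)
    # Quartiles by linear interpolation between data points (one common convention)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations still inside the fences
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        # Anything beyond the fences is drawn as an individual outlier point
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }

# Hypothetical prices in dollars, for illustration only
prices = [10, 12, 14, 15, 18, 20, 22, 25, 150]
summary = box_plot_summary(prices)
print(summary)
```

With these made-up prices, the box runs from 14 to 22, the upper whisker stops at 25, and the single 150 shows up as an outlier instead of stretching the whisker.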

Now, let’s take a look at the box and whisker plot comparing the prices across the most common wine varieties. Click here for reference code so you can replicate this plot yourself.

From this plot, we can see that Sauvignon Blanc is the most affordable, as 75% of the prices recorded for this variety are below $25. How do I know this? 75% of a distribution’s data lies between the minimum data point and the 3rd quartile, which is indicated by the top of the box. Looking at the plot, we can clearly see that the top of the box for Sauvignon Blanc is below $25.
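The same comparison can be done numerically by computing the 3rd quartile for each category and sorting on it. The sketch below is not the reference code for the plot; it is a small stand-alone illustration. The helper name `third_quartile_by_category`, the variety labels, and the prices are all hypothetical, and rows with a missing price are simply skipped.

```python
from collections import defaultdict
from statistics import quantiles

def third_quartile_by_category(rows):
    """Group (category, value) pairs and return {category: third quartile}."""
    groups = defaultdict(list)
    for category, value in rows:
        if value is not None:  # skip rows with no recorded price
            groups[category].append(value)
    return {
        # quantiles(..., n=4) returns [Q1, median, Q3]; index 2 is the 3rd quartile
        category: quantiles(values, n=4, method="inclusive")[2]
        for category, values in groups.items()
    }

# Made-up (variety, price) rows, for illustration only; not the Kaggle data
rows = [
    ("Variety A", 10), ("Variety A", 14), ("Variety A", 18),
    ("Variety A", 22), ("Variety A", 90),
    ("Variety B", 20), ("Variety B", 30), ("Variety B", 40),
    ("Variety B", 50), ("Variety B", 60),
    ("Variety B", None),
]

q3 = third_quartile_by_category(rows)
# Lowest 3rd quartile first: 75% of that category's prices fall at or below it
ranked = sorted(q3, key=q3.get)
print(ranked, q3)
```

Ranking by the 3rd quartile rather than the mean keeps a few very expensive bottles, like the 90 in Variety A here, from changing which category looks most affordable.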

Other affordable wines include Merlot, Riesling and Chardonnay. Of course, each of these wines has outliers at much higher prices, but based on their price distributions you are more likely to find them at lower prices.

Now you know how to plot a categorical and a continuous variable to see their covariation, and how to read a box and whisker plot. And if you’re trying to save some money, you now know which wines to buy too!