Boxplots- A boxplot consists of a rectangular box bounded above and below by “hinges” that represent the quartiles Q3 and Q1 respectively, with a horizontal “median” line through it. You can also see the upper and lower “whiskers”, and a point marking a potential “outlier”.

IQR (interquartile range) = Q3 - Q1 (the height of the box in the plot)

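whiskers = lines from each hinge to the most extreme datapoint within 1.5 ∗ IQR of the box (R's default); datapoints beyond that are drawn individually as potential outliers. The related quantity ±1.58 ∗ IQR/√n, where n is the number of samples (datapoints), is the half-width of the optional notch around the median, not the whisker length.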

boxplot(airquality$Wind ~ airquality$Month, col = "purple")

Wind Speed by Month
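
The numbers behind a boxplot can be checked directly. Here is a minimal sketch, assuming R's default settings and pooling all months together: boxplot.stats() returns the five values the box and whiskers are drawn from, plus any points flagged as potential outliers. The names wind, s, box_height and fences are only illustrative.

```r
# Summary values R uses to draw a boxplot of wind speed (all months pooled).
wind <- airquality$Wind

s <- boxplot.stats(wind)   # default coef = 1.5
# s$stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker
lower_hinge <- s$stats[2]
upper_hinge <- s$stats[4]
box_height  <- upper_hinge - lower_hinge   # hinge spread, approximately the IQR

# Whiskers stop at the most extreme datapoints that lie inside these fences.
fences <- c(lower_hinge - 1.5 * box_height,
            upper_hinge + 1.5 * box_height)

print(s$stats)
print(fences)
print(s$out)   # datapoints beyond the fences, plotted individually
```

Note that the hinges can differ slightly from the quartiles returned by quantile(), which is why the sketch reads them from boxplot.stats() instead of recomputing them.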

Histograms- The most basic graph is the histogram, which is a bar plot in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values. Typically the bars run vertically with the count (or proportion) axis running vertically. To manually construct a histogram, define the range of data for each bar (called a bin), count how many cases fall in each bin, and draw the bars high enough to indicate the count.

hist(airquality$Wind, col = "gold")
rug(airquality$Wind)  # (Optional) marks each datapoint as a tick below the histogram
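
The manual construction described above (define bins, count the cases in each bin) can be sketched in a few lines. This is only an illustration: the bin edges in breaks are my assumption (width 2, wide enough to cover the data), and the names bins, counts and props are hypothetical.

```r
# Manual histogram: define bins, count cases per bin, then compare with hist().
wind <- airquality$Wind

# Bin edges are an assumption for this sketch: width 2, covering the data range.
breaks <- seq(0, 22, by = 2)

# Assign each observation to a bin (right-closed intervals, as hist() uses by default).
bins   <- cut(wind, breaks = breaks, right = TRUE, include.lowest = TRUE)
counts <- table(bins)
props  <- counts / length(wind)   # proportion = count / total count

print(counts)

# hist() with the same edges should report the same counts.
h <- hist(wind, breaks = breaks, plot = FALSE)
print(all(as.vector(counts) == h$counts))
```

Drawing the bars is then just barplot(counts), or hist() can do the counting and drawing in one step.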

Barplot- A bar chart is made up of columns or rows plotted on a graph. Here is how to read a bar chart made up of columns.

The columns are positioned over a label that represents a categorical variable.

The height of the column indicates the size of the group defined by the column label.

A bar chart is used when you have categories of data: types of movies, music genres, or dog breeds. Hence, a bar chart (and not a histogram) is the right choice when we are dealing with categorical variables.
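
As a small illustration with the same airquality data, the month can be treated as a categorical variable and the number of observations in each month drawn as a bar chart. This is a sketch, not the only way to do it; month_counts is a hypothetical name and the colour is arbitrary.

```r
# Bar chart of a categorical variable: number of observations per month.
month_counts <- table(airquality$Month)   # months are coded 5 (May) to 9 (September)

barplot(month_counts,
        names.arg = month.abb[as.integer(names(month_counts))],  # "May", "Jun", ...
        col  = "steelblue",
        xlab = "Month",
        ylab = "Number of observations")
```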

Scatterplots- For two quantitative variables, the basic graphical EDA technique is the scatterplot, which has one variable on the x-axis, one on the y-axis and a point for each case in your dataset. If one variable is explanatory and the other is the outcome, it is a very, very strong convention to put the outcome on the y (vertical) axis.

One or two additional categorical variables can be accommodated on the scatterplot by encoding the additional information in the symbol type and/or color.
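
A minimal sketch of this idea with the built-in airquality data: wind speed and temperature are the two quantitative variables, and the month is encoded in the colour of each point. Treating Temp as the outcome is only an assumption made for the illustration, and the colour palette in cols is arbitrary.

```r
# Scatterplot of two quantitative variables, with a categorical one encoded as colour.
month <- factor(airquality$Month)          # categorical variable (5 levels)
cols  <- c("purple", "gold", "steelblue", "tomato", "darkgreen")

plot(airquality$Wind, airquality$Temp,
     col  = cols[as.integer(month)],       # one colour per month
     pch  = 19,
     xlab = "Wind",
     ylab = "Temp")                        # assumed outcome goes on the y-axis
legend("topright", legend = levels(month), col = cols, pch = 19, title = "Month")
```

The symbol type could carry a second categorical variable in the same way, by passing a vector to pch.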

We will use the Males.csv dataset (present in the project on Datazar) to check whether being part of a union impacts the salaries of young American males.

Scatter plot of age vs experience (the color represents whether the employee is part of a union)

We can also use multiple scatter plots to better understand whether being part of a union impacts an employee's salary.

We can see that most employees are not part of a union, and they tend to earn more than employees who are part of a union. Correlation doesn’t always mean causation, though: it might be the case that the high-paying industries do not allow their employees to form unions.

In a nutshell: You should always perform appropriate EDA before further analysis of your data.

Lastly, I wish you all a merry Christmas and a very happy new year. I will come back with the next edition of EDA in the New Year. Till then, happy modeling!