In the first blog of this series, we discussed generating summary data using Spark. This summary data included the mean, standard deviation and quantiles. Quantiles give a good idea of the spread of the data and are a more robust measurement than the mean.

In this third blog of the series, we will discuss how to use quantiles to identify outliers in our data. You can find all other blogs in the series here.

Outlier

For a given variable in the data, an outlier is a value that is distant from the other values. Outliers are normally introduced by measurement issues or some other error. Outliers affect our inference from the data, as they may skew the results.

So in statistics it's important to identify the outliers in the data before we use it for analysis.

Outlier detection using Box-and-Whisker Plot

There are many methods to identify outliers in statistics. In this blog, we are going to discuss one of the methods, which uses quantiles. The logic of the algorithm is as follows.

Let’s say we have Q1 as the first quartile (25%) and Q3 as the third quartile (75%). The interquartile range, or IQR, is then given as

IQR = Q3 - Q1

IQR gives the width of the middle half of the distribution, that is, the spread between the 25% and 75% quantiles of the data. Using IQR we can identify the outliers. This method is known as the Box and Whisker method.

In this method, any value smaller than Q1 - 1.5 * IQR or greater than Q3 + 1.5 * IQR is categorised as an outlier.
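To make the rule concrete, here is a minimal sketch in plain Python rather than Spark, so that it runs on its own. The function names `iqr_bounds` and `find_outliers` and the sample values are assumptions made for illustration; the values are only chosen to match the kind of data described in this post. The quartiles come from the standard library's `statistics.quantiles` with the "inclusive" method, and other quantile conventions (including approximate quantiles in Spark) may give slightly different Q1 and Q3.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    # n=4 yields the three quartile cut points: Q1, median, Q3.
    # method="inclusive" is one convention among several (an assumption here).
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the values that fall outside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample values, for illustration only.
sample = [10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6,
          14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4]

lower, upper = iqr_bounds(sample)
print(f"lower fence = {lower:.2f}, upper fence = {upper:.2f}")
print("outliers:", find_outliers(sample))
```

With these values and this quantile convention, Q1 is 14.4 and Q3 is 14.8, so the fences land at roughly 13.8 and 15.4. The sketch therefore flags 10.2, 15.9 and 16.4, while 15.1 stays inside the fences.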

In the above example, we have taken a list of values as sample data. If you observe the data, most of the values are around 14.1-14.7. From that we can assume that the values 10.2 and 16.4 are most likely outliers. There is a chance that 15.1 and 15.9 are also outliers, but we are not fully sure.