For univariate data sets, we can use some straightforward methods, such as a box plot or the [5%, 95%] quantile range, to identify outliers. For multivariate data sets, are there any statistics that can be used to identify outliers?

1 Answer

Multivariate outlier detection can be quite tricky, and even 2D data can be difficult to decipher visually at times. You are spot-on in looking for robust statistical treatments analogous to 95% quantiles.

For normally distributed data, squared Mahalanobis distances approximately follow a chi-square distribution with n degrees of freedom, so the gold standard for robust statistics in n dimensions would be to use Mahalanobis distances and then eliminate data beyond the 95% or 99% quantile in Mahalanobis space.
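As a rough illustration of the Mahalanobis approach, here is a minimal sketch in Python. It assumes NumPy and SciPy are available and that the data are roughly multivariate normal; the function name `mahalanobis_outliers` and its parameters are made up for this example. It uses the plain sample mean and covariance, which outliers can themselves distort, so a robust covariance estimate may be preferable in practice.

```python
import numpy as np
from scipy.stats import chi2


def mahalanobis_outliers(X, quantile=0.99):
    """Return squared Mahalanobis distances and a boolean outlier mask.

    Hypothetical helper for illustration: a row is flagged when its squared
    distance exceeds the chi-square quantile with n_features degrees of freedom.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)  # sample covariance, columns are variables
    centered = X - mean
    # Solve cov @ z = centered.T instead of forming the inverse explicitly.
    solved = np.linalg.solve(cov, centered.T).T
    d2 = np.einsum("ij,ij->i", centered, solved)  # row-wise quadratic form
    cutoff = chi2.ppf(quantile, df=X.shape[1])
    return d2, d2 > cutoff


# Example: 500 standard-normal points in 3D plus one planted outlier.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(size=(500, 3)), [[10.0, 10.0, 10.0]]])
distances, is_outlier = mahalanobis_outliers(data, quantile=0.99)
print("flagged rows:", np.flatnonzero(is_outlier))
```

With the 99% cutoff, roughly 1% of genuinely normal points would be expected to be flagged as well, so the threshold is a trade-off rather than a hard rule.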

Additionally, there is a very sophisticated treatment called PCOUT for outlier detection that instead relies on principal component decomposition. There is a corresponding R package, but the theoretical treatment is behind a paywall: