What makes you think that there are any outliers (in the sense of erroneous extreme values)?
– Henry, Sep 6 '17 at 14:57


My study deals with the selling price of products. For example, I found cakes for 8 people under 13 cents, or baby underwear at 2,977.68 euros.
– Clarisse Dussauge, Sep 6 '17 at 15:08

Such information should really be included in the original post! It seems maybe you need data cleaning based on domain knowledge, not outlier rejection per se.
– kjetil b halvorsen, Feb 9 at 13:53

@Henry Being an outlier doesn't actually require that the data are erroneous. The term isn't well defined, but I think most take it to mean high-leverage, high-influence observations.
– AdamO, Jun 17 at 15:16

2 Answers

A statistical test concerns replications of an experiment and a null hypothesis that is not "discovered" through the incidental finding of an outlying data point. For that reason, it makes little sense to apply a formal hypothesis test to individual data points. You can, however, use critical values or other criteria to flag observations as possible outliers, and then proceed to verify the data's accuracy.

By Chebyshev's inequality, you can always probabilistically bound the distance of an observation from the mean in terms of its Z-score, whatever the distribution. Tukey's famous fence rule flags as outliers any values below Q1 − 1.5 IQR or above Q3 + 1.5 IQR. To give you a sense of scale: for a standard normal distribution the upper fence works out to about 2.70, so in a sample of 6,000 you would expect roughly 21 observations above it purely by chance, irrespective of whether they are genuine outliers.
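The fence value and the expected count above can be verified directly. A minimal sketch using Python's standard library (`statistics.NormalDist` gives the standard normal quantiles and CDF):

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal

# Theoretical quartiles and Tukey's upper fence: Q3 + 1.5 * IQR
q1, q3 = nd.inv_cdf(0.25), nd.inv_cdf(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Expected number of observations above the fence in n = 6000 normal draws
n = 6000
expected_above = n * (1 - nd.cdf(upper_fence))

print(round(upper_fence, 2))    # 2.7
print(round(expected_above))    # 21
```

The symmetric lower fence flags about the same number in the other tail, so both tails together would flag roughly 42 points by chance alone.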

Along those lines, it is fair to use any rule that suits the problem to rank and classify outlying observations. Some ad hoc examples:

Use Tukey's fences to flag outliers. With 6,000 observations you may want to set a false discovery rate by simulation, or scale the IQR multiplier to a larger value as needed.

Log-transform the data if they are concentrations or counts (where relative, rather than absolute, differences are of interest).

Use a Box–Cox transform to find the power change-of-variable that makes the data closest to normal, then apply normal-theory tests.

Use Z-scores to rank observations, and flag as outliers those exceeding a stringent critical value (small alpha).

Fit a QQ-plot of the data against a known distribution suspected to be the data-generating process, and rank outliers by their squared deviation from the calibration line.

Use single-observation deletion: refit the model by maximum likelihood with each observation left out in turn, and find which observation's deletion leads to the greatest improvement in likelihood.
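The original answer included a short filtering snippet at this point that did not survive; a minimal sketch of what it likely did, in Python, using the variable names the closing paragraph refers to (the example values and bounds are illustrative only):

```python
# Hypothetical example prices in euros, including the two suspicious values
# mentioned in the question's comments (a 0.13-euro cake, 2977.68-euro underwear).
price = [0.13, 12.50, 14.99, 2977.68, 9.90]

# Bounds chosen from domain knowledge for this product group (illustrative).
lowerprice, upperprice = 1.00, 100.00

# Keep observations inside the bounds; flag the rest for manual checking.
kept = [p for p in price if lowerprice <= p <= upperprice]
flagged = [p for p in price if not (lowerprice <= p <= upperprice)]

print(kept)     # [12.5, 14.99, 9.9]
print(flagged)  # [0.13, 2977.68]
```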

Here price would be a vector of prices, lowerprice the lower bound, and upperprice the upper bound. NOTE that I would apply this separately to each group of like products: what is an outlier for clothes would not be an outlier for cars!
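To illustrate the single-observation-deletion idea from the list above: under a normal model, drop each observation in turn, refit by maximum likelihood, and see whose removal most improves the fit. A minimal sketch (stdlib only; the data are hypothetical):

```python
import math
from statistics import fmean, pstdev

def normal_loglik(xs):
    """Log-likelihood of xs at the normal MLE (sample mean, population sd)."""
    mu, sd = fmean(xs), pstdev(xs)
    return sum(-0.5 * math.log(2 * math.pi * sd ** 2)
               - (x - mu) ** 2 / (2 * sd ** 2) for x in xs)

def rank_by_deletion(xs):
    """Rank observations by the log-likelihood of the sample without them.

    Each leave-one-out sample has the same size, so the log-likelihoods
    are directly comparable; the highest score marks the observation
    whose deletion helps the fit most.
    """
    scores = [(normal_loglik(xs[:i] + xs[i + 1:]), xs[i])
              for i in range(len(xs))]
    return sorted(scores, reverse=True)

data = [9.9, 10.1, 10.0, 9.8, 10.2, 2977.68]  # one wild price among normal ones
best_score, culprit = rank_by_deletion(data)[0]
print(culprit)  # 2977.68 -- removing the extreme price improves the fit most
```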