Outliers

Real data are dirty. Errors creep in at almost every step of data collection, recording, and organization. Even correct data describe a world that contains individuals so extraordinary that it may be impossible to understand the ordinary course of events without treating them specially.

Any case that is far from the body of the data is an outlier. Some outliers are clearly extreme. Others are more subtle, being extraordinary in some combination of variables but not extreme in any one variable. The medical patient who is 6’3” and weighs 110 pounds is an example. Neither his height nor his weight alone is extraordinary, but the combination of the two shows him to be extraordinarily thin, possibly thin enough to warrant excluding him from a medical study.
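One way to see how a case can be unremarkable in each variable yet far from the body of the data is to compare its per-variable z-scores with its Mahalanobis distance, which takes the correlation between the variables into account. The sketch below is only an illustration, not a method this text prescribes. The height and weight values are made up, and the names cases, z_scores, and mahalanobis_2d are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Made-up (height in inches, weight in pounds) pairs, for illustration only.
# The last case is the 6'3", 110-pound patient described in the text.
cases = [
    (60, 110), (62, 120), (64, 135), (66, 145), (68, 160),
    (70, 170), (72, 185), (74, 195), (76, 210), (75, 110),
]
patient = len(cases) - 1


def z_scores(values):
    """Standardize values using the sample mean and standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def mahalanobis_2d(zx, zy):
    """Mahalanobis distance of each case from the centre, for two variables.

    For standardized variables with sample correlation r, the distance is
    sqrt((zx^2 - 2*r*zx*zy + zy^2) / (1 - r^2)).
    """
    n = len(zx)
    r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)
    return [sqrt((a * a - 2 * r * a * b + b * b) / (1 - r * r))
            for a, b in zip(zx, zy)]


heights = [h for h, _ in cases]
weights = [w for _, w in cases]
z_height = z_scores(heights)
z_weight = z_scores(weights)
distance = mahalanobis_2d(z_height, z_weight)

for (h, w), zh, zw, d in zip(cases, z_height, z_weight, distance):
    print(f"{h:3d} in {w:4d} lb  z_height={zh:+.2f}  z_weight={zw:+.2f}  distance={d:.2f}")
```

In this made-up sample the patient's z-scores are both well under 2 in absolute value, yet his distance from the centre of the data is the largest of any case. With so few cases the outlier itself inflates the estimated spread, so the comparison across cases is more telling than any fixed cutoff.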

Extraordinary cases are sometimes divided into blunders and rogues. Blunders are clerical or measurement errors in the data that are not informative about the real world. They should be identified and either corrected or omitted.

Rogues are correctly measured and recorded, but are inherently unusual. Even though a rogue case is correct, you may do well to omit it from your main analysis and treat it specially. A data analysis that describes most of the data well and deals specially with a few exceptions is almost always more useful than an analysis that provides a mediocre fit to all the data but makes no exceptions. In no event should rogues be summarily discarded. A rogue can be worth more to your understanding of the data than all the ordinary cases because it makes a particular aspect of the data clear.
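The effect of setting a rogue aside can be sketched with a small least-squares fit. This is an illustration under stated assumptions rather than a recommended procedure: the data are made up, the rogue is assumed to have been identified already, and the names fit_line and rogue are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Made-up data: seven ordinary cases on a straight line and one rogue.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 8, 10, 12, 14, 40]
rogue = 7  # index of the case assumed to be a correctly recorded rogue

# A single fit to all the data, making no exceptions.
slope_all, intercept_all = fit_line(xs, ys)

# The main analysis: fit the ordinary cases only.
main_xs = [x for i, x in enumerate(xs) if i != rogue]
main_ys = [y for i, y in enumerate(ys) if i != rogue]
slope_main, intercept_main = fit_line(main_xs, main_ys)

# The rogue is kept, not discarded: report how far it sits from the main fit.
rogue_residual = ys[rogue] - (intercept_main + slope_main * xs[rogue])

print(f"all cases:     slope={slope_all:.2f}  intercept={intercept_all:.2f}")
print(f"without rogue: slope={slope_main:.2f}  intercept={intercept_main:.2f}")
print(f"rogue residual from main fit: {rogue_residual:.2f}")
```

The fit to all eight cases describes none of them well, while the fit without the rogue describes the ordinary cases and leaves the rogue as a separately reported exception.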

Often the first thing to do when you discover an outlier is to learn more about it. The Web Search Query command is an excellent place to start.