Outlier Values – An Enemy Of The Index December 13, 2007

Outlier values are basically values that sit way, way outside a column’s normal value range.

Data can be a funny thing and sometimes there are values that are naturally “exceptional”. However, very commonly, outlier values are used by applications as bizarre default values, chosen to avoid confusion with legitimate values. For example, I look after an application that uses the American Date Of Independence as its “default” date.

Usually, these weird outlier values are used to avoid null values, as nulls can be problematic and cannot be indexed (well, actually you can index a null column, but we’ll leave that for another blog entry).
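The pattern typically looks something like this (a hypothetical sketch; the table, column and chosen “default” are all made up for illustration):

```sql
-- Hypothetical example of the practice: a NOT NULL date column with
-- an outlier "default" so the value is never null and therefore
-- always appears in a B-tree index on the column.
CREATE TABLE orders (
  order_id  NUMBER NOT NULL,
  ship_date DATE   DEFAULT DATE '1776-07-04' NOT NULL
);

CREATE INDEX orders_ship_date_i ON orders (ship_date);
```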

However, outlier values, while (maybe) solving one problem, can introduce some very significant problems in return.

Firstly, the CBO “hates” outlier values as they can potentially screw up the CBO’s selectivity calculations completely. The selectivity of a range scan is basically calculated by the CBO as the size of the range of interest divided by the full range of possible values (i.e. the max value minus the min value). Therefore, if this calculation is invalidated by a massive and disproportionate “hole” in the full range of possible values, the CBO can get things horribly wrong.
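To make the arithmetic concrete, here’s a minimal sketch of the effect (table and column names are hypothetical, and the exact estimates will vary by version and stats):

```sql
-- Hypothetical demo: 10,000 rows of 2007 dates plus a single outlier
-- row dated 9999. Range-scan selectivity is roughly
-- (predicate range) / (max value - min value), so the one outlier
-- stretches the denominator from ~1 year to ~8,000 years.
CREATE TABLE date_demo (id NUMBER, created DATE);

INSERT INTO date_demo
SELECT rownum, DATE '2007-01-01' + MOD(rownum, 365)
FROM   dual CONNECT BY LEVEL <= 10000;

-- the outlier "default" value
INSERT INTO date_demo VALUES (0, DATE '9999-01-01');
COMMIT;

EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'DATE_DEMO')

-- One month is roughly 1/12 of the data (800+ rows), but with the
-- outlier in place the CBO works with a ~2.9 million day range and
-- will likely estimate a cardinality of just 1.
EXPLAIN PLAN FOR
SELECT * FROM date_demo
WHERE  created BETWEEN DATE '2007-06-01' AND DATE '2007-06-30';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```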

Additionally, indexes “hate” outlier values as they prevent Oracle from using 90-10 block splits to keep indexes nice and compact, forcing it to use 50-50 block splits instead. Basically, a 90-10 block split is considered if and only if the index entry to be inserted is equal to or greater than the current maximum value. An outlier value that is also the maximum value usually means subsequent monotonically increasing values (such as sequences, dates, etc.) never actually insert a new maximum. Therefore, not only do such indexes perform 50-50 splits, but the resulting 50% of free space is never reused, as all new values are almost, but not quite, maximum values.
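A quick way to see this block split behaviour is to pre-insert the outlier so it becomes the index maximum, then insert monotonically increasing values behind it (again, a hypothetical sketch with made-up names):

```sql
-- Hypothetical demo: the outlier becomes the index maximum up front,
-- so the subsequent ascending inserts are never the new maximum and
-- the rightmost leaf block keeps splitting 50-50.
CREATE TABLE split_demo (id NUMBER);
CREATE INDEX split_demo_i ON split_demo (id);

INSERT INTO split_demo VALUES (999999999);  -- the outlier "default"

-- monotonically increasing values that never reach the maximum
INSERT INTO split_demo
SELECT rownum FROM dual CONNECT BY LEVEL <= 100000;
COMMIT;

-- Check how full the leaf blocks are; with the outlier in place,
-- PCT_USED should sit at around 50% rather than the 90%+ you'd
-- expect from a purely ascending insert pattern.
ANALYZE INDEX split_demo_i VALIDATE STRUCTURE;
SELECT lf_blks, pct_used FROM index_stats;
```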

Jonathan Lewis also touches on this subject in Chapter 6 of his CBO book, in the section “Discrete Dangers”. It’s very good to highlight this for people developing applications who may be tempted to use these extreme values.

Brian, the data is sometimes what we set it to be. As I mentioned, sometimes we use really bad values as defaults. For example, in the second demo, a default of 0 instead of the massive number would possibly have been a far better choice.

Minor observation but maybe worth mentioning – the “hole” problem can be thought of as a manifestation of skewness, which has two faces: skewness in the multiplicity of values (some values occur much more often than others) and skewness in the relative “distance” between values (the “hole” problem). The former can be a problem (especially) for equality predicates and equijoins, the latter for range predicates. Histograms can help with both – but of course, collecting a histogram is an option that has to be carefully weighed against its gotchas, so it’s much better to avoid “holes” whenever possible (which is of course what you’re saying here).

Hi Alberto, absolutely worth mentioning. Outlier values effectively cause a skew effect in the data distribution, which could be addressed with histograms. However, histograms come with a cost, and prevention (if possible) is the best cure. Also, histograms don’t of course address the index space problem.
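For completeness, a minimal sketch of collecting such a histogram with DBMS_STATS (reusing the hypothetical DATE_DEMO table from the earlier sketch):

```sql
-- Hypothetical sketch: gather a histogram on the skewed column so
-- the CBO can "see" the hole in the value range. The usual caveats
-- about histogram overhead and stability apply.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,
    tabname    => 'DATE_DEMO',
    method_opt => 'FOR COLUMNS created SIZE 254'
  );
END;
/
```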

PS. I also changed the wording in the Blog entry as per your (very good) suggestion. Thanks :)