Index Only Values Of Interest: (Little Wonder) January 28, 2008

Thought I might expand a little on the discussion and comments on how NULLs can be indexed and address point #6 on my list of those things you may not have known about indexes

“It’s possible and potentially very useful to just index some column values and not all column values within a table”.

as well as touching on point #4 that “B-Tree Indexes can be extremely useful and beneficial even if the column contains very few distinct values (as low as 1)”.

As previously discussed, index entries which are fully NULL are not indexed by Oracle. We can however use this fact to our advantage.

There are many scenarios whereby we may only search for a rowset based on a subset of the possible values in a column or group of columns. The classic scenario is where we may have a flag or status field denoting “current”, “live”, “not yet processed”, etc. rows and our main transactional queries are only interested in these relatively few rows.

Most rows are “historical”, “processed”, etc. rows and are not generally of interest and when they are of interest represent such a large proportion of the overall table that an index would be inappropriate for these batch jobs or long running reports to access them anyways. Often, (but not always) we might need a histogram to let the CBO know that those column values of interest actually represents a small, non-uniform proportion of the overall rowset.

Because we need to efficiently access those few rows of interest, we generally index the column but in the process also index all the other column values that aren’t of interest as well. It’s all or nothing, right ?

Not necessarily. A possible solution is to use an appropriate function-based index in combination with our understanding that fully null index entries are not actually indexed. For example, let assume we have a very large table that has a STATUS code column. The only column value of interest are those with a status value of ‘BOWIE’, all other values are simply not of direct interest with our OLTP queries. By creating an index such as:

the decode function only returns a non-null value for the specific status of “BOWIE”. All other values are converted to nulls and so are not indexed.

We now have an index that consists of nothing but “BOWIE” values. As a result, the index is tiny because the vast majority of column values are simply not indexed. But because the percentage of rows that actually have a status of “BOWIE” is very small, the CBO looks at this index very favourably. By now writing our queries in a manner such as this:

Like this:

Related

Are there any known advantages of using DECODE over CASE or vice versa other than the number of characters to type? Somehow I end up using CASE most of the time, probably because I am sure that it is SQL standard compliant where I am not so sure about DECODE.

quote:
As previously mentioned, if the “BOWIE” value is not representative of the general distribution of other STATUS values, we may need a histogram on the STATUS column for the CBO to cost the index access favourably.

Yes I did mean the computed virtual column. I’ve decided to actually remove the second reference altogether as it’s not applicable in the example where I only have one status of interest and it would be a pretty a rare scenario where the “selected” status values were unevenly distributed to make a difference else you wouldn’t have included them in the first place.

It kinda defeats the purpose of the example and so only confuses the issue, so the reference is gone.

We’ve used this function-based index trick to great effect in our automated email application. Whenever the system needs to send an email it inserts a row into a table, which is then queried by a regular job. Each email gets a DATE_TO_SEND, allowing emails to be “scheduled” for a future time, and a DATE_SENT, which is updated to sysdate when the email is sent.

The job needs to run very quickly, we don’t want it doing full table scans to find the emails that haven’t been sent yet and where the DATE_TO_SEND is now or in the past, and we want to keep a fairly long history in the table; so we have used an index based on (CASE WHEN DATE_SENT IS NULL THEN DATE_TO_SEND ELSE NULL END).

The query is something like:
SELECT * FROM emails
WHERE (CASE WHEN DATE_SENT IS NULL THEN DATE_TO_SEND ELSE NULL END) <= SYSDATE
ORDER BY (CASE WHEN DATE_SENT IS NULL THEN DATE_TO_SEND ELSE NULL END);

The nice thing about this index is that not only are the unsent emails returned very quickly, they are returned in priority order without the database performing an additional sort operation.

what would be the advantages/disadvantages between to index only values of interest (only ‘S’ value) versus to index the whole column + histogram?

I Can see:

index only values of interest:
————————
– index is smaller
– the change would only affect queries I include the “where decode(col,’S’,’S’,null) = ‘S'”. So I know the total impact of my changes.

index the whole column + histogram:
——————————-
– you can still use the index in querys like select count(*) where col = ‘N’
– the change should increase performance of all querys with predicates on this column (when none of the querys use bind on this column)

I need your help if you can see more advantages for each strategy, in order to decide which to use.

You have it pretty well sorted out. The only other issue I would add is that the function-based index can only be used by those queries that explicitly use the decode function whereas the second index could be used by any query.