I last discussed how high cardinality columns shouldn’t necessarily be in the leading column of a concatenated index as they don’t provide the performance benefit as sometimes claimed.

If all things are equal and the columns in the concatenated index are all likely to be referenced, a simple consideration that is often forgotten when deciding which column to have as the leading index column is the Clustering Factor of the corresponding columns.

As previously discussed, the Clustering Factor determines how well aligned or ordered the index entries are in relation to the rows in the parent table. So if the rows are ordered within the table on a particular column or columns (such as a sequential ID column, a monotonically increasing date or time-stamp, etc), then an index on these columns is likely to have a very good Clustering Factor. Consequently less IOs will be required to retrieve all the required rows via the index as all the required rows will be housed in relatively few, well clustered data blocks.

It therefore makes sense to at least consider the Clustering Factor of the various columns in a concatenated index. Why ? Because if the leading column has a very good Clustering Factor, the concatenated index by definition must also have a very good Clustering Factor as all indexes are primarily ordered based on the leading indexed column. A concatenated index with a good Clustering Factor is going to be more efficient in retrieving rows from the table and more importantly, will be considered more desirably by the CBO when costing access path options.

Of course, the opposite is also true. By having a leading column with a poor Clustering Factor will mean the concatenated index will have a poor Clustering Factor, making it less efficient and less likely to be considered by the CBO.

As such, the Clustering Factor of each corresponding column in a concatenated index is at least worthy of some consideration when making the decision on how best to order the indexed columns.

UPDATE: However as Tom Kyte has stated in the comments, in virtually all cases, the Clustering Factor is not really a factor (yes, pun fully intended) as subsequently columns are generally going to impact the CF anyways or the selectivity of the index is such that the improved CF is not relevant anyways.

More relevant considerations regarding the ordering of columns in an index coming I promise :)

It therefore makes sense to at least consider the Clustering Factor of the various columns in a concatenated index. Why ? Because if the leading column has a very good Clustering Factor, the concatenated index by definition must also have a very good Clustering Factor as all indexes are primarily ordered based on the leading indexed column.

But that wouldn’t be so “in general”. If the first column is “well sorted” (that would mean C1,ROWID is well sorted in the index with respect to the table), that does not imply that C1,C2,ROWID will be well sorted – C2 is sorted within C1 and would mess up the clustering factor just as easily as if C2 were first.

I always wondered who the other person was that visits the Blog regularly ;)

If the CF is not populated, it usually means the index hasn’t had statistics collected on it. Note my specifc demo was run on 10g and as such Oracle automatically collects the stats on any new or rebuilt index. However in 8i, this would need to be performed “manually”.

Seriously, thank you for your comments, they make perfect sense. You’re right my example isn’t particularly general and that if the leading column wasn’t so selectively, subsequent columns are likely to screw the CF.

Actually, I do have a case where CF in a Concatenated Index
did make a difference. The leading column was a Date column
and rows were being inserted into the table by Date.
So, this index actually helped reduce my physical read calls
(huge table, it can’t be cached) because the rows were co-located).
See http://hemantoracledba.blogspot.com/2008/01/impact-of-clustering-factor.html

well, that would be the case where the leading edge of the index was nearly unique – and as pointed out – that would lead to a good clustering factor as a result but….

so what?

The clustering factor is something that will lead the optimizer to choose an index or not based on how many table blocks it believes it will access.

Your blog entry gives no details. But I can say this with some confidence – when you wrote:

“The difference lay in the Clustering Factor.”

you jumped to the WRONG conclusion. The clustering factor determines whether an index will be used. The clustering factor has no influence over the runtime performance of a query (other than it will help the optimizer decide to use it or not).

Going by your description, it sounds more likely that by reordering the columns in your concatenated index – you had to inspect less rows, that is all.

I think that I will have to rewrite that posting to make it clearer.
Since the SQL statement and table and column names are from a
real case, I tried to mask the details. However, l’ll try to explain here :

0. Yes it is a range scan.

1. There are two indexes on the table. Both are multicolumn indexes. “index_a” is the bad index which the optimizer was choosing to use initially.

Thus, Oracle using this index in the Range Scan was doing 1.35million consistent gets.

3. I needed to disable usage of this index. I knew that there was a better index — index_b. I had 4 choices :
a) Drop index_a
b) Apply a function to the predicate
c) Manipulate column and index statistics
d) Put an explicit NO_HINT index
Actually, I could have tried the 5th option
e) Put an explicit HINT (table_1 index_b).
But I didn’t do that. I just wanted to see what alternate execution plan (not just index_b but the whole plan, nested loops whatever) Oracle chose withouth index_a.

4. The NO_INDEX Hint therefore, wasn’t really intended as the solution. My blog posting doesn’t assert it that way. I just said that I used a NO_INDEX Hint — the reason only was I wanted to disable usage of that index.

6. The question, then, was :
“In the first SQL RowSourceOperations there 3.354million rowids fetched from index_a and the total count of Buffer Gets for the whole query was 1.353million.
In the second SQL’s RowSourceOperations there 3.315million rowids fetched from index_b and yet the total count of Buffer Gets for the whole query was only 159,300 !
Why the difference ?”

7. The assertion “The difference lay in the Clustering Factor” was actually an answer to that exact question. I didn’t assert outright that the changed execution plan or better performance was a result of the Clustering Factor.

9. What I tried to put forward was that the better Clustering Factor with index_b was that although Oracle did read just as many rowids in the index range scan, the number of block gets for the *table* was much lower — the total consistent gets from 1.35million to only 159,300 !

10. Obviously, the number of rows is still similar. The number of *blocks* is much lower. That is my assertion about the Clustering Factor.

11. index_b’s leading column is a date column and rows are inserted into the table by date — ascending order. Much better clustering of rows into the same block and contiguous blocks when I fetch them ordered by the date.

12. So, the Clustering Factor didn’t result in inspection of fewer index entries. It did result in a significant reduction in consistent gets. Unfortunately, the row source operation statistics don’t really show how many index blocks and how many table blocks were read but I’m pretty sure that, because it was a Range Scan, the reduction was in table blocks.

you will want to put up supporting information regarding point #9 – A before and after tkprof of both queries would be infinitely better than a one or two like grep from a trace file, it would really show something.

#10 – that was the result of using a different index, not the result of an index having a better clustering factor. I demonstrated that an index with a horrific clustering fact can do the work in 2 IO’s what took the other 246 IO’s. It wasn’t the clustering factor – it was an index that was more efficient.

#12 – show us the tkprof and row source operations, that’ll go a long way here.

and the table is clustered by A, not by B – then it would be true that a LARGE index range scan that used t(a,b) would be more efficient than a range scan of equal proportion that used t(b,a) [in general, i’ll caveat that in a second].

But my point above was

a) putting the most clustered element first will not normally reduce the clustering factor unless that first element is almost unique (as the above example was)

b) and then it is usually a a big ho-hum that it does – because you would presumably be using it in a highly selective fashion.

It seems to me that NO INDEX for you might have been even better than an index. It would have to be a table of huge size for me to want to retrieve more than a million rows via the index (unless I could use the index INSTEAD of the table of course)

And for the caveat, there are opportunities to get plans that look like:

the rows are gotten from the table AFTER the nested loop operation and internally we sort them to make the access more efficient – there it would not matter so much what the clustering factor was – we fix it by sorting the rowids after the fact.

I’d still be interested in full up tkprofs that show what was happening – the bits of trace you have do not tell us much.

I guess that’s the point. The (a, b) would have a better CF, but only if the “a” column was extremely selective. And if it’s very selective and used mainly in a highly selective manner, then the better CF is somewhat pointless.

But if it’s used in a large range scan, then you would want the selectivity on the “b” column to be likewise very unselective (if that’s the right word ?), else you would be better off with column “a” as the leading column anyways.

So the scenario where the CF would actually be better is somewhat small and the scenario when it would be useful would generally be likewise small as well.

Still, it’s all good points of discussion as I sit here with a cold beer on a balmy Canberra evening !!

Tom,
1. You ask about more supporting information item 9. The facts already presented twice are :
a. Index index_a : Clustering Factor 21.441million, Index index_b : Clustering Factor 3.490million.
I guess you don’t dispute that index_b has a better clustering factor.
b. The first SQL execution with index_a did 1.3million consistent gets. The second SQL execution with index_b did 159,300 consistent gets. (The STAT entry value for “cr” is consistent read which I am merely repositioning as consistent gets).
I guess you don’t disupted that 159,300 blocks is less than 1.3million blocks.
I, personally, think that going down from 1.3million blocks to 159,300 blocks is a result of a better clustering factor.

And why I did not want a Full Table Scan ?
The table has more than 24million rows and 2.4million blocks.
Extent sizes are 128KB so the largest multiblock reads that
I will get are not very large.
The index range scan, I find was satisfactory.
And why I did I not change the statement to change the execution
plan any further ?
Having brought a query down from 40minutes execution time
to 4minutes execution time, I decided not to indulge in
compulsive tuning disorder.
It was also easier in terms of change management to put a Hint in
then to have an SQL rewrite put in.