Another good point – it is possible to find edge cases (particularly with a small number of index entries) where the order of data arrival can have a surprising impact. In your case the effect is also largely dependent on the great length of the table rows.

Hello Valentin,
sorry i misinterpret it. You are right, that the clustering_factor depends on how the data of the table is sorted in relation to the index and this does not change in case of index key compression.

Stefan,
As Valentin has pointed out, compression won’t affect the clustering_factor, but it will affect the size (number of leaf blocks) in the index – and that does have some impact on the cost of using the index; so if you have different compression on the indexes you can get counter-intuitive results.

Well, if you have a lot of ties for the second column and the third column in context of the key path is well correlated with the physical order of rows in blocks, then I suppose you could get a better cluster factor. (I haven’t tested this – yet.) If Oracle is already ordering by file, block, and row within ties across possibly many leaf blocks spanning the tie, then this wouldn’t help [and points to a possible enhancement if they aren’t already doing it.] If it does help, then possibly the reduction in rowid reference key size for partially repeated values could, I suppose, win a trade-off and the extra bit of key length required for the third column. I guess I know what I’ll be doing the next little while…

I suppose too that the third column’s key could consistently fit in the leaf block space that won’t fit another full row, then the index would be the same size, not bigger.

I’m not sure why you claim the 3rd key makes the cluster factor worse. Even if Oracle already sorts ties within the second key, the third key could represent an equally good cluster factor, right? Is this just a quibble on the <= versus < boundary? Probably.

This test will delay the start of my reading your new book, which has arrived.

If we have index on column A then within the group defined by any fixed value a1 all index entries will be ordered by rowid, thus when we are going through the leaf rows we get minimal possible number of changes of relevant table blocks.

If we add additional column B into the index, it may change the order of rowid when we are going through the leaf rows, thus the number of changes of relevant table blocks may also change (increase from the minimum).

So far so good, but when we jump to the next group defined by a2 (…<a1<a2<…), we can get one additional change of table block for 1-col index as compared with 2-col index in the following situation:
the table has rows with A = a1 and A = a2 in the same block (B1) and also rows with A = a2 in the previous (order by rowid) block (B2),
but column B might help to order the index entries within group defined by a2 so that they will go first from the block B1 and then from the block B2 for A = a2.
Thus we will get one less change for the 2-col index as compared with 1-col. Same is for 2- and 3- columns indexes.
The simplest situation (for 2-col and 3-col indexes) is presented in the previous comment.

Valentin,
I think your first couple of statements summarise the informal argument about why we generally assume a two-column index will have a lower clustering_factor than a three-column index – the inclusion of the rowid in the index entry is very important, so if we have six rows with the same (two-column) key, three rows in each of two blocks (call them B1, B2) the block component of the index entries would be for (B1, B1, B1, B2, B2, B2). Adding an extra column to the index could then very easily change the order of visiting the rows to (B1, B2, B1, B2, B1, B2).

On the other hand, as you point out, there are patterns of data where an odd synchronsiation of values and block locations could contradict the general intuition.

for a bitmap index the number of columns should have no impact on the clustering factor – as it is equal to num_rows. In a b*tree index the clustering factor should remain unchanged if the third column is a constant (or null). Maybe the value of these indexes would be limited…

Martin,
I wasn’t thinking about bitmap indexes when I posed the question, but my first thought is that if (‘x’,’y’) is a key value in the two-column index then it may have one index entry; if you add a third column you may have N index entries like (‘x’,’y’,’a’), (‘x’,’y’,’b’) – with a corresponding increase in the clustering_factor. This won’t change the cost of using the index, of course, since the clustering_factor of a bitmap index isn’t used in the cost.

Following stronger statement is true. For many “not very unique” indexes we can create a function (falsely deterministic) such that if you create the FBI with the columns from the original index and this function will be after them, the clustering factor of the resulting index will be smaller than the original.
See simple example below.

For a bulk loaded table the CF for the 2-column table should be at least as good as the CF of the 3-column table.
If the data is not bulk loaded and the third column increases for each insert (for example insertion date) I would expect a better CF for the 3-column table.

Guess it’s all a question of math. 2 columns index, combination of two values. 3 Columns index, combination of 3 values, taken by three. There goes your clustering factor…the last column increases the number of combinations possible. Of course compression will save something,but compression is used (correct me if i’m wrong) for the prefix, not the sufix, meaning:
in a two column index, you can compress the first column
in a three column index (depending on the density of the second column) you can compress the first two columns.

Still, that will not be the deciding factor.
Fact is: third column value has to be stored in the index. Clustering factor increases due to the number of combinations.
Maybe i’m wrong, but that’s how i see it.
Alex

Your blog highlights an interesting side-effect of the flaw in the optimizer’s model when it comes to using the clustering_factor. (It also fit quite nicely with a little posting I did recently about enhancing index statistics to include figures about every prefix combination.)

I could think of a test case, where a 2 column global index on a partitioned table could have a higher clustering factor than a super set 3 column local index. (I dont have access to an oracle database right now to test it out).
srivenu

Srivenu,
The mixture of local and global indexes is certainly a case where the side effects on clustering_factor can be very counter-intuitive.

On a completely different tack, creating a hash-partitioned index on a non-partitioned table could also result in less contention for popular blocks on insertion, which can avoid a bug that causes indexes to become much bigger than they need to. There’s lots of room for discusion when looking at partitioned indexes.