8. An index can potentially be the most efficient and effective may to retrieve anything between 0% and 100% of the data from a table.

A few recent posts on OTN reminded me that perhaps it’s about time I wrote something on this topic.

Generally, the question that’s commonly asked is at what point or at what percentage of data does Oracle no longer consider the use of an index and judges the Full Table Scan (FTS) as the most efficient method to retrieve the data from a table.

The answer unfortunately is that there is no such magic number or percentage, it all entirely depends. The way I often answer this question is by simply stating I can very easily come up with a scenario where a FTS is the most cost effective method to retrieve 1% of the data. Equally, I can very easily come up with a scenario where an index is the most cost effective method to retrieve 99% of the data.

Like I said, there is no magic number, it entirely depends on a whole list of different factors and variables.

To start, I thought I might go through the example of how a 1% cardinality result is best achieved via a FTS, highlighting why and how the Cost Based Optimizer comes to such a decision.

I’ll use a simple little scenario with nice simple numbers to make the mathematics nice and easy to follow :)

OK, let’s assume we have a table that has 10,000,000 rows. The table uses 100,000 table blocks to store these rows and so we have on average 100 rows per block. With an 8K block size, we’re basically looking at a table with an average row size of about 80 bytes.

Let’s say this table has an associated index with approximately 20,000 leaf blocks required to store the index entries for a particular column and the index has a blevel of 2 (or a height of 3). This basically means we can store approximately 500 index entries per block and the average index entry is about 16 bytes or so in length.

The indexed column has 100 distinct values which are evenly distributed such that each distinct value has approximately 100,000 occurrences each. The column has no NULL values.

Let’s say we write a query based on the indexed column and we’re interested in just one of the possible 100 values or approximately 1% of the data in total. For example:

SELECT * FROM bowie_table WHERE code = ‘ABCDE';

Does the CBO choose the index or does it chose the FTS ?

Well, let’s first cost the index access path.

We begin by reading the root block and the intermediate branch block for a cost of 2.

We also need to read approximately 1% of all the index leaf blocks in order to access all the index entries of interest. So that’s 20,000 (leaf blocks) x 0.01 = 200 leaf blocks in total.

So the total cost of reading just the index is 202.

Next comes the interesting bit. How many of the 100,000 table blocks do we need to access in order to read just 1% of the data (i.e. 100,000 rows) ?

Well, the answer depends entirely on the Clustering Factor of the index or to put it another way, in how well ordered the rows in the table are in relation to the index. If the index column values of interest are all very well clustered together in the table, then we can access the required rows by visiting fewer blocks than if the index column values are evenly and randomly distributed throughout the table.

In fact, in the worst possible cases scenario, if the Clustering Factor is appalling and has a value close to the number of rows in the table (10,000,000), we may actually need to visit each and every block in the table as each block has an average of 100 rows per block and we want on average 1% or one of these rows from each and every table block.

In the best possible case scenario, with the column values perfectly clustered together and with a Clustering Factor approaching the number of blocks in the table (100,000), we may get away with only having to visit 1% of all the table blocks or just 1,000 of them.

So the Clustering Factor is a crucial variable in how costly it would be to read the table via the index. The actual table access costs therefore are simply calculated as being the selectivity of the query (0.01 in our case) multiplied by the Clustering Factor of the associated index.

In this example, the Clustering Factor is indeed appalling with a value of 10,000,000 and the table access costs are therefore calculated as 0.01 x 10,000,000 = 100,000.

So the total costs of using the index is 202 (for the index related costs) + 100,000 (to access the rows from the table) = 100,202 in total.

So what are the costs associated with the FTS ?

Well, the FTS has a number of advantages over the index scan. Firstly, as Oracle needs to process all the blocks, it can retrieve all the necessary rows by reading a specific table block just the once. However, with the index scan, Oracle may possibly need to access a specific table block multiple times in some scenarios.

Secondly, as Oracle knows it has to read each and every block, Oracle can do so with a larger “bite of the pie” each time via multiblock reads, knowing it’s not wasting resources as all blocks need to be processed anyways. Index access reads perform single block I/Os whereas a FTS can perform muiltblock I/Os at a time. In this specific example, let’s assume the effective multiple read value is 10, remember, we want to keep the arthmetic nice and simple …

Finally, a FTS can be performed in parallel, even if the table itself isn’t partitioned, which means the overall response times can be further improved and the CBO can reduce its “costs” accordingly. In this example, we won’t worry about parallel query.

So that’s roughly an overall cost of 100,202 for the index vs. 10,001 for the FTS.

The results are not even close with the FTS winning hands down and that’s for just 1% of the data …

A couple of final little points for now.

Firstly, the cost of just reading 1 block (for the single block index reads) vs. 10 blocks (for the multiblock FTS reads) may actually differ somewhat as multiblock reads are doing more “work” with it’s associated I/O. By default, with no parameters set and with no system statistics, the CBO will cost each I/O as being the same. More about how to possibly adjust this another time.

Also, by default the CBO will assume all associated I/Os are physical I/Os and will cost them accordingly, even if the BCHR is nice and high and the index access path in question might be accessed within (say) a nested loop join where the likelihood of many of the index related I/Os in particular being cached is very high. More on this at another time as well.

But for now, just note how in this relatively trivial example, the following factors came into play when determining the potential costs of this query:

Selectivity of the query

Data distribution with regard to the actual occurrences of the required data

Number of table blocks (below the high water mark)

Number of leaf blocks

Index Height

Average number of rows per table block

Average number of leaf entries per leaf block

Clustering Factor

Caching characteristics of index and table

Effective multiblock read count

Relative cost of single vs. multiblock I/Os

Parallelism

All of which contribute to make any single “magic number” by which Oracle will no longer consider using an index but another fairy tale in the Oracle book of myths and folklore …

The first point was simply to do with how evenly distributed are the occurrences or frequencies of the required column values. Basically, how accurate is Oracle in determining the associated selectively and cardinality of the data, does the code value ‘ABCDE’ really occur 100,000 times or is the data skewed and the value only occurs 10,000 times. Do we need histograms to give the CBO more information or do we not.

I was just trying to make the point that even the assumption of 1% being selected is a variable that the CBO needs to get right else all its subsequent calculations on whether to use/not use an index are going to be inaccurate.

“Let’s say this table has an associated index with approximately 20,000 leaf blocks required to store the index entries for a particular column and the index has a blevel of 2 (or a height of 3). This basically means we can store approximately 500 index entries per block and the average index entry”

The documentation (ALL_INDEXES is where i looked) states about clustering factor:

Indicates the amount of order of the rows in the table based on the values of the index.

* If the value is near the number of blocks, then the table is very well ordered. In this case, the index entries in a single leaf block tend to point to rows in the same data blocks.
* If the value is near the number of rows, then the table is very randomly ordered. In this case, it is unlikely that index entries in the same leaf block point to rows in the same data blocks.

(The rest i am writing for me, so I will be long. It helps me understand things.)

INDEXes are ordered. For example, an INDEX on a single COLUMN will lists all similar entries together. If they can all fit in one block (based on space left from the previous entry) they will. Similar data is not distributed throughout the entire INDEX (that would be against the entire point of the INDEX!).

TABLEs, however, store data randomly (thus the usual help of an INDEX to quickly find specific similar data). Which means, look at an ordered INDEX blocks, to know where to go to the random TABLE blocks.

Now, the records do not need to be distributed. Indeed, due to sheer randomness they may actually be next to each other. Which makes two extreme cases (and reality in the middle).

1) The data is all in the same bock.
2) The data is distributed throughout all the TABLE’s blocks.

Or, put another way:

1) The distribution of the data in the TABLE blocks is exactly like the distribution of data in the INDEX.

2) The distribution of the data in the TABLE blocks is nothing like the distribution of data in the INDEX.

Clustering Factor (CF) is an INDEX property, not a TABLE property, so each INDEX has its own CF when compared to the TABLE it indexes.

Case 1 has a low CF, because the data is “clustered” together when compared to the INDEX entries.

Case 2 has a high CF, because the data is not “clustered” together when compared to the INDEX entries.

How is it calculated?
Brian, if you have Jonathan Lewis’s book Cost-Based Oracle Fundamentals, he shows the sql used to calculate the clustering factor. The idea is something like; start with CF=0, start from the first leaf block, get a rowid, look which table block that row is in, take the second rowid from the leaf block, look which table block that row is in, if the second row is in a different block increment CF to 1. If not do not increment it. Goes like this.

First, why the difference in cost with the FTS ? Well the big unknown here is what version of Oracle are you running and do you have system statistics.

If you do have system stats (which I suspect), then check out the MBRC value in sys.aux_stats$ as Oracle will use this value in it’s calculations. I would guess it’s value is around 4.5 or some such …

Note that without system stats, Oracle doesn’t actually use the 16 value but an “adjusted” value to compensate for the fact that Oracle may split up many of the 16 block multiblock reads anyways if it finds a cached block as part of the blocks it’s about to read. For 16, it gets adjusted down to something around the 10 mark.

Question 2, why the index and not the FTS ?

You’re only selecting the OBJECT_ID column which it can find directly from the index. Therefore it’s cheaper to treat the index (which only has 22384 blocks) as a “skinny table” rather than the table itself which has 137658 blocks.

Not sure what you mean by how is a unique and non-unique index handled ?

I believe in quality not quantity although I admit both is often nice ;)

CF, just like Yas says. Imagine a full scan of the index. Read the rowid from the first index entry and increase CF to 1. Read the rowid from the second index entry and check to see if it points to the same physical data block as the previous rowid. If it does, count remains the same, if it differs, count moves up to 2. Check the next rowid, if the same as previous rowid, count remains the same, if it differs (even if it only points back to the block from the first rowid), then the count goes up.

Repeat until you read all the rowids in the index (or the estimate thereof) and the final score is the CF.

Your suspection is correct :-) i dont have system stats. I will work on it.

and about this “Not sure what you mean by how is a unique and non-unique index handled ?”

I have indexed two column NO (Unique), OBJECT_ID (Non Unique).

I took explain plan for this two statements

1) SELECT no FROM hx_big_table

2) SELECT object_id FROM hx_big_table

The first one went for a full table scan and the second one for a Index fast full scan. Both have been indexed then why is this difference. Thats why i wantd to ask is Unique and Non Unique index cost are computed in different way.

The most obvious example of where this might be true is when the select clause only selects columns that can be entirely retrieved via an index. If Oracle can avoid visiting the table, this will obviously impact the costings as the table selectively component is no longer costed.

So yes, adding a column (or removing a column) from the select list can make all the difference.

One of my favourite posts (and so difficult to digest that I have to read it agin and again….of course, all credit to myself for finding it “difficult”). However, it would have been great if you could have included a Demo or link to a demo, as you generally do in other posts. Or is the demo already exists somewhere and I have missed it ?
I guess I am kind of a person who understands SQL better than english…

I have a query in which I am referring a BIG table once only, however when I see the v$session_longops, this table has been scanned (FTS) multiple times due to which the query is running very slow. Can you explain what might be causing this? Can the temp tablespace size and sort area available might be putting hinderance to the optimum performance of this query?

Not enough to go on I’m sorry. How is the big table being accessed, how is it being joined, what are the predicates, what are the indexes, what are there clustering factors and other stats, what’s the SQL, what’s the execution plan, what’s the selectivity of the query …

This is being caused by the CBO thinking it’s accessing the required data in the most efficient manner and it being perhaps wrong in it’s thinking :)

Clustering factor is in between Num_blocks of table and Num_rows of tables which i feel quiet OK

I also feel looking at the stats, full table scan is better.

In any kind of access, this should not do that many high buffer gets.
What do you think the reason for this very high buffer gets.Do you think full table scan is better here ?
If so why optimizer is not picking up full table scan on it’s own ?

OK, the key piece of information that’s missing here is exactly how many rows are actually being updated.

The 3 possible issues that instantly spring to mind here are:

1) Although the CBO estimates that 2488 rows (approx. 0.000786% of the data) is to be updated, in actual fact a substantial higher number of rows are being updated

2) The table has triggers that performs a substantial amount of additional work

3) The table undergoes significant amount of concurrent changes, requiring much undo to the accessed during the update

Whether a FTS might be better, the easiest way is to find out and test via hints. The reason why the CBO is not choosing the FTS is simply because it comes at a cost > 416. With 5690660 table blocks, this not a surprise. However, whether 416 is an accurate cost for the index comes down primarily to the selectivity of the update and the accuracy of the various stats (including system stats).

I have observed that 10053 trace output for one of my queries shows the cost of an unique index scan as 0. I am not sure why, to me the cost of index access would be at least equal to the blevel of the index . can you please help me understand this.