_SMALL_TABLE_THRESHOLD Parameter and Buffer Cache – What is Wrong with this Quote?

June 17, 2010

Warning: this article is certain to bring tears of pain to your eyes while you attempt to determine what is right and what is wrong.

While reading the June 2010 printing of the “Oracle Performance Firefighting” book I found a couple of interesting lines about the buffer cache and the effects, or lack thereof, of the _SMALL_TABLE_THRESHOLD parameter. This quote is from page 208 (this is a long quote, required to not lose the context of the discussion topic – please excuse any typos):

“The single most radical departure from the modified LRU algorithm [introduced in Oracle Database 6, per the book] is known as midpoint insertion [introduced in Oracle Database 8.1.5, per the book]. Each LRU chain is divided into a hot and cold region. When a buffer is read from disk and a free buffer has been found, the buffer and buffer header replace the previous buffer and buffer header contents, and then the buffer header is moved to the LRU chain midpoint. Single-block read, multiblock read, fast-full scan, or full-table scan—it makes no difference. The buffer header is not inserted at the MRU end of the LRU chain, but rather at the midpoint…

Because the window scheme used in the modified LRU algorithm is no longer used, the hidden instance parameter _small_table_threshold became deprecated. However, in Oracle Database 11g, it is being used again, but for a different purpose. Starting with this version, the _small_table_threshold parameter is the threshold for a server process to start issuing direct reads.”
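The midpoint-insertion scheme the book describes can be sketched as a toy model. This is a deliberately simplified illustration, not Oracle's actual algorithm: the class name, the single chain, and the touch-based promotion are all assumptions made for the sketch.

```python
class MidpointLRU:
    """Toy model of midpoint insertion (NOT the real Oracle algorithm).

    Index 0 is the MRU (hot) end; the last index is the LRU (cold) end.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.chain = []  # buffer identifiers (think: data block addresses)

    def read_from_disk(self, dba):
        # Evict from the cold (LRU) end if no free buffer is available.
        if len(self.chain) >= self.capacity:
            self.chain.pop()
        # Insert at the midpoint, regardless of the type of read -
        # single-block, multiblock, fast-full scan, or full-table scan.
        self.chain.insert(len(self.chain) // 2, dba)

    def touch(self, dba):
        # Simplified promotion: a touched buffer moves to the hot end
        # (the real promotion is governed by touch counts, not shown here).
        self.chain.remove(dba)
        self.chain.insert(0, dba)


cache = MidpointLRU(capacity=8)
for block in range(8):
    cache.read_from_disk(block)
print(cache.chain)
```

The key point the quote makes is visible in `read_from_disk`: nothing goes to the MRU end on first read; buffers only reach the hot region through subsequent touches.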

Before deciding what about the above is correct or incorrect, take a look at Metalink Doc ID 787373.1. A couple of words from that document to encourage you to take a look at the original material:

“Applies to Version: 9.0.1.0 to 11.2.0.1.0: When loading objects and data into the buffer cache… best way to load for full scans. This functionality is called smallness logic and is driven by a hidden parameter called: _small_table_threshold. … Oracle will load the object via the buffer cache as this is more efficient than doing a direct read operation.”

Before deciding what about the above Metalink document is correct or incorrect, take a look at this AskTom article. A couple of words from that article:

“no, you are not correct on your guess about the [_SMALL_TABLE_THRESHOLD] parameter, it controls the caching of the blocks (whether they are cached like a single block IO or cached as we cache full scan blocks) – not the method of IO.”

Before deciding what about the above AskTom article is still correct, take a look at this blog article. A couple of words from that article:

“When direct path reads starts to happen? It is known that somewhat reliable figure is your _small_table_threshold multiplied by 5 (mentioned by Tanel Poder on oracle-l recently).”

“A table is deemed ‘small’ by Oracle if it’s 2% or less of the buffer cache in size. A small table is ‘cached’ when read via a FTS and is not immediately loaded into the least recently used end of the LRU list as becomes a ‘large’ table read via a FTS…”

“When a block is read from a ‘large’ table via a FTS, the blocks are basically loaded into the LRU end of the LRU list, even though they’ve only just been loaded into the buffer cache. As such, the chances of them remaining in the buffer cache is minimal and will likely be immediately overwritten once processed.

However, when a block is read via an index, the blocks are loaded somewhere near the ‘middle’ of the LRU list, not at the LRU end of the list…”
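Taking the figures in these quotes at face value (2% of the buffer cache for “small”, and roughly 5 × _small_table_threshold as the direct-read trigger), a quick back-of-envelope calculation shows the block counts involved. The cache size and block size here are purely illustrative assumptions; the real thresholds are version-dependent and, as the rest of this article shows, disputed.

```python
# Illustrative figures only - the real thresholds are version-dependent.
buffer_cache_bytes = 2 * 1024**3   # assume a 2 GB buffer cache
block_size = 8192                  # assume an 8 KB block size
cache_blocks = buffer_cache_bytes // block_size  # 262,144 buffers

# "Small" per the 2%-of-cache rule of thumb quoted above.
small_table_threshold = cache_blocks * 2 // 100  # 5,242 blocks

# Rough direct-read trigger per the "5 x _small_table_threshold" figure
# attributed to Tanel Poder above.
direct_read_threshold = small_table_threshold * 5  # 26,210 blocks

print(cache_blocks, small_table_threshold, direct_read_threshold)
```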

What, if anything, is wrong with the above quote from the book? Bonus question, what, if anything, is wrong with the Metalink document? Maybe there is a problem with the other quotes?

The point of blog articles like this one is not to insult authors who have spent thousands of hours carefully constructing an accurate and helpful book, but instead to suggest that readers investigate when something stated does not exactly match what one believes to be true. It could be that the author “took a long walk down a short pier”, or that the author is revealing accurate information which simply cannot be found through other resources (and may in the process be directly contradicting information sources you have used in the past). If you do not investigate in such cases, you may lose an important opportunity to learn something that could prove to be extremely valuable.

I’m mainly a hacker, which isn’t very scientific, but well … that’s the way the ball bounces. As far as hacking goes, it seemed pretty likely that Oracle was using something like the small table threshold in 8.1.6, after midpoint insertion started, to read FTS blocks into the cold end of the LRU.
Back in 8.1.6 we could have 50 blocks in the buffer cache (hard to imagine now) and with a program to attach to the SGA and read the memory directly with C code and a screen with 50 lines, I could watch the evolution of the buffer cache. It was pretty easy to see mid point insertion, block promotion to the hot end and to see that FTS above a certain size went into the cold end, not the mid point, of the LRU.
I have to dig out this program and try it again. Alas, with segmented buffer cache, the program needs some modifications now on 9i and above.
I also would have to segment the display to be just parts of the hot, midpoint and cold end as the minimum size of the buffer cache is well beyond 50 blocks these days.
Best
Kyle Hailey
http://db-optimizer.blogspot.com/

First, sorry for the delay in your comment appearing. Thank you for the information regarding 8.1.6. The direct attachment to the SGA would likely provide a different degree of insight into exactly what happens during a full table scan of a large table compared to looking at what information is presented through Oracle’s externalized view of what happens.

The “_small_table_threshold”, small tables, and buffering limits for tables are all on my “todo” list; especially after reading a post from Dion Cho ( http://dioncho.wordpress.com/tag/full-table-scan/ ) where he points out that the _small_table_threshold has probably not been the limit at which the LRU algorithm switches between end-point (for “long” tablescans) and mid-point insertion (for “short” tablescans).

I note that the metalink note seems to be claiming that the serial direct read (of 11.2) appeared in 9.0 – and it also references a test case that doesn’t exist, so perhaps a lot of the note has gone missing through the group editing effort. Certainly there’s a lot of change to worry about – covering:

a) _small_table_threshold used to dictate when a tablescan was considered so large that it was recycled to the end of the cache. (But as Dion Cho shows, this ceased to be true a long time ago).
b) _small_table_threshold is still 2% (or something to be decided) and is still used to dictate when a tablescan is reported in the instance and session stats as a “short” tablescan.
c) Even a “long” tablescan will be read in to the mid-point in lots of newer versions of Oracle (even before we get to 11.2 and the serial direct) – but it is not easy to determine the rules about how Oracle decides when a table is too long to go to the mid-point. The problem of testing is made harder because the results are affected by the current actual location of the “mid-point” and the number of free (i.e. status = 0) buffers.
d) Alex seems to have done some sound work on investigating the relationship in 11g between the _small_table_threshold and the use of serial direct reads – but again it’s hard to test since the current content of the cache seems to affect the decision.

There is a lovely PowerPoint presentation on Julian Dyke’s website that has related information with some wonderful graphics to help understand how buffer headers are used and moved around in the less recent versions of Oracle. Unfortunately he doesn’t happen to have any comments about the purpose of the _small_table_threshold.

I appreciate the comments, and the enhanced sense of direction provided by those comments.

Oracle Database 10.2 lists a description of “threshold level of table size for direct reads” for the _SMALL_TABLE_THRESHOLD parameter. I wonder if this only applies to parallel execution for this release? I cannot say that I have witnessed a full tablescan in 10.2 performing direct reads without parallel execution enabled (but maybe someone else has).

“In 9i and above this parameter is used as a threshold (number of blocks) eligible for direct reads. Any tables below this threshold will go thru the buffer cache during direct read.”

I read somewhere last night that the caching behavior involving long full table scans (and possibly the insertion point into the LRU) would be dependent on whether or not blocks from other objects already existed in the buffer cache. I saw Dion Cho’s article last night and noticed that he was flushing the buffer cache prior to his test execution – so I wonder if that might throw off the test results?

I’d forgotten the thing about parallel query and _small_table_threshold – again, it’s one of the details to check on my todo list. I believe (as Gopalakrishnan indicates) that a parallel-enabled table will not be read in parallel if its size is less than the parameter setting in 9i and 10g – which means the bit about direct path is implicitly about parallel query, although you can make 10g do direct serial reads through a hidden parameter.

I thought I had made the comment about flushing the cache and free blocks on Dion’s blog at the time – as a suggestion, not as a definitive statement – but it must have been somewhere else.

Alex’s investigation – that was a reference to Alex Fatkulin (Quaddro), whom you had linked to.

Thank you for the additional information. My comment about the _SMALL_TABLE_THRESHOLD parameter and parallel execution was 50% guess (maybe process of elimination is a better description) and 50% faded memory of an article I read a while ago.

Late 2009, I made an SR on this issue. The analysts were quick to say, “this is new functionality and we are unfamiliar with it and I am in contact with the author”. Then came a patch.
In 10.2.0.4, I am unfamiliar with the existence of such behavior (skipping the buffer cache in favor of direct path reads).
I think the option of automatically skipping the buffer cache was not in 9.x.
My guess is that the current incarnation of _SMALL_TABLE_THRESHOLD started in 11.0
I also guess that the actual threshold is based on an algorithm that is affected by many factors, and not just _SMALL_TABLE_THRESHOLD. The support analysts used the word “heuristics” to describe the decision process.
My original problem:
I have an old (bad?) habit, where I wanted to pin a (relatively) large table in the KEEP pool. It was a great idea in 9.2. In 11.1 upgrade tests, I suddenly started getting “direct path reads” when I really wanted 0 disk i/o – hence the SR & patch.

I was unsuccessful in locating that bug number in Metalink (My Oracle Support) – a search for 6195768 returned 0 results, and a search for kcbcmt also returned 0 results. I also tried various other combinations including direct path read buffer cache.

One of the articles that might be interesting is Metalink Doc ID 793845.1 “High ‘direct path read’ waits in 11g” that suggests with Automatic Shared Memory Management enabled, a serial full table scan of a table could swing between performing direct path reads and db file scattered reads (using the buffer cache) due to the automatic resizing of the buffer cache. It might be interesting to see if the _SMALL_TABLE_THRESHOLD parameter automatically adjusts when the buffer cache size automatically adjusts in size.
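If _small_table_threshold really is derived as a fixed percentage of the current buffer cache size, then an ASMM-driven cache resize would move the direct-read decision point along with it, which would explain the swing described in Doc ID 793845.1. A trivial sketch of that idea follows; the 2% figure and the derivation itself are assumptions carried over from the quotes above, not documented behavior.

```python
def small_table_threshold(cache_blocks, pct=2):
    # Assumed derivation: a fixed percentage of the current cache size.
    return cache_blocks * pct // 100


# Suppose ASMM shrinks the cache from 262,144 buffers to 131,072 buffers.
stt_before = small_table_threshold(262144)  # 5,242 blocks
stt_after = small_table_threshold(131072)   # 2,621 blocks

# A 4,000-block table would be "small" before the shrink but not after,
# which could flip a serial scan between buffered and direct path reads.
table_blocks = 4000
print(table_blocks <= stt_before, table_blocks <= stt_after)
```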

The table is considered small enough to be read into the buffer cache when the table’s size is less than or equal to the number of blocks specified by the parameter.
Also, a table is considered to be of “medium” size when the number of blocks exceeds this parameter setting but is less than 10% (maybe _small_table_threshold multiplied by 5) of the buffer cache. A “medium” table may be optionally cached based upon the frequency of full table scans executed against it.
Bug 6195768 is about “MEDIUM SIZED TABLES ALWAYS READ THRU CACHE (FTS).”
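The sizing bands described in this comment can be restated as a small classification function. The 2% and 10% figures, and the band names, are taken from the discussion above and are not official Oracle definitions; treat this as a sketch of the commenter's claim, not a documented algorithm.

```python
def classify_table(table_blocks, cache_blocks):
    """Classify a table per the bands discussed above (assumed figures)."""
    stt = cache_blocks * 2 // 100  # _small_table_threshold ~ 2% of cache
    medium_limit = stt * 5         # ~10% of cache, i.e. 5 x stt
    if table_blocks <= stt:
        return "small"   # read into the buffer cache
    if table_blocks < medium_limit:
        return "medium"  # optionally cached, based on scan frequency
    return "large"       # candidate for fetch-and-discard / direct read


print(classify_table(1000, 262144))   # small
print(classify_table(10000, 262144))  # medium
print(classify_table(50000, 262144))  # large
```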

I spent a little more than an hour trying to locate the bug that you described. I simply cannot find supporting evidence, even through many Metalink advanced searches, searches of Google Books, and searches of the Internet.

An advanced Metalink search, which included the bug database, found bug 6195768, but this bug has nothing to do with database tables:
“Bug 6195768: processes catting hid device can not be terminated”

—

A search on the exact phrase: medium sized table
Bug 205722: CBO FAILS TO USE INDEX IN MEDIUM SIZED TABLE
Fixed in Product Version 7.1.4
The above was the only match

—

A search on the exact phrase: frequency of full table scans
Latch Contention On ‘multiblock read objects’ Latch [ID 337940.1]
Applies to Oracle Server – Enterprise Edition – Version: 9.2.0.5.0
Database experiences latch contention on “multiblock read objects” latch.
Interesting, I did not know that there was a “multiblock read objects” latch, but it does not apply.

—

A search on the exact phrase: _small_table_threshold
Found the following bugs:
Bug 2384100: PAVDVC – PRC: DISTRIBUTE SUPPLIER INVOICE ADJUSTMENT COSTS PERFORMANCE
“_small_table_threshold = 256 and should be removed, and allowed to default. Setting this changes the default threshold which determines what is a ‘small’ table and what is a ‘large’ table.”

Bug 6357188: ORA-8103 EVEN WITH THE W/A FROM 3569503, ON RAC NOT SINGLE INST
“2. Run with _small_table_threshold=1. This forces everything to be a direct read.”

Neither of those bugs appear to be related.
—-

I wonder if Metalink Doc ID 787373.1 is partially to blame for the confusion in the bug report that you located (maybe it confused the Oracle support analyst)? If you read my discussion with Jonathan Lewis in the comments of this blog article, it appears that Metalink Doc ID 787373.1 is describing the behavior of tables that are set to a parallel degree, where the _SMALL_TABLE_THRESHOLD parameter helps Oracle decide whether to perform a full table scan using db file scattered reads (into the buffer cache) or direct path reads (into the PGA).

The warning at the beginning of this article appears to be justified: “Warning: this article is certain to bring tears of pain to your eyes while you attempt to determine what is right and what is wrong.”

Linux 10.2.0.4 (with compatibility set at 10.2.0.3)
running select count(*) 3 times from a table set to a % of db_cache_size
Oracle seems to successfully cache the whole table up to 80+% of buffer cache and caches in an effective manner with tables larger than the buffer cache size:

I noticed in the test case script that table statistics were not collected for the CACHER table. Do you think that skipping the statistics collection step might cause the runtime engine problems where it believes that the table is very small, resulting in too many of the blocks being cached, or do you think that the automatic dynamic sampling would have prevented problems like that? I am more than a little surprised by the results you posted for 10.2.0.4.

So, in summary, under the default optimiser behaviour for ‘direct path read’, the expectation is that any table below 5% of the buffer cache is promoted to the hot area. When the table size is larger than this figure, blocks read by full table scans go to the LRU end and are fully discarded.

An important point to consider in this test case is that every row in your table is chained (the row length is more than 8,000, the blocksize is 8KB, and pctfree seems to be at the default of 10, which means you have about 7,200 bytes per block – and then you can see that your table of 700,000 rows has used 1.4M blocks).
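The arithmetic in that observation checks out: with rows longer than one block's usable space, every row must span two blocks, so 700,000 rows land on about 1.4 million blocks. The 7,200-byte usable figure is the approximation quoted above (8 KB block minus pctfree 10 and block overhead).

```python
import math

usable_per_block = 7200  # approx: 8 KB block minus pctfree 10 and overhead
row_length = 8000        # every row exceeds one block's usable space

blocks_per_row = math.ceil(row_length / usable_per_block)  # 2: each row chains
rows = 700_000
total_blocks = rows * blocks_per_row  # 1,400,000 blocks

print(blocks_per_row, total_blocks)
```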

You may have discovered yet another type of boundary case for full tablescans, direct path reads, and caching, but you’re not looking at the general case. Your query is simply select count(*), so Oracle shouldn’t have to follow the chain for each row, but it’s possible (and please take this as a flight of fancy at this point) that when Oracle reads a block which holds nothing but the continuation of a chained row, it doesn’t count it as a “proper” read, and therefore manages to exclude it from the more generic arithmetic.

With that in mind, in the case above we have a table with the following properties. I also added the simple ratios of buffer cache size to block size to give an indication of how many blocks fit into this cache.

I believe now that your assertion about the 10% boundary, below which repeating a scan WILL increment the touch count, holds with respect to the optimizer’s use of direct path reads. For conventional path reads, if the table fits into the cache then Oracle allows that.

In a way it perhaps explains why Oracle prefers direct path reads to conventional reads for FTS. I think it is because Oracle has more control over the use of the PGA than the SGA. It also shows that the threshold limit of 2% does not mean much.

The base table tdash had 1.7 million rows and I tried using direct path read (the default) and db file scattered read. With conventional path it was 20% faster.
At that time the argument was that the run time engine preferred direct path read to save CPU usage in a multi-user environment. Fair enough. Each test was a single run after a reboot. However, we are looping over the table 100 times fetching rows matching object_id. tkprof output showed:

So we seem to have 194142811 – 131458857 = 62683954, or around 32% fewer physical reads. Table tdash had 1,729,204 rows in 2,709,153 blocks, and the table size as a percent of the buffer cache size was 123%. So with more caching and fewer waits, that 20% performance gain may be explained now?
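The reduction in physical reads quoted here works out as stated:

```python
# Figures quoted from the tkprof output above.
direct_path_reads = 194_142_811   # physical reads, direct path run
buffered_reads = 131_458_857      # physical reads, conventional (buffered) run

saved = direct_path_reads - buffered_reads          # 62,683,954
pct_saved = round(100 * saved / direct_path_reads)  # ~32% fewer physical reads

print(saved, pct_saved)
```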

In short, you saw the significant 20% performance improvement with db file scattered reads because the buffer cache was empty at the start of the test, which is unlikely in a production environment. If you want to test this theory:

Finally, execute your script with and without serial direct path reads enabled. You should now see that the buffer cache had less of an opportunity to cache frequently accessed blocks read during the full table scan of your tdash table, and you should find that the serial direct path read now not only uses less CPU, but also requires less elapsed time for its execution than the conventional buffered reads.

Jonathan, as usual, has provided excellent information in the comment section of this blog article, and on the OTN thread, and has mentioned items that I had not considered. For example, I did not consider that your cacher table example would result in chained rows (out of curiosity, I might test this just to see if I am able to identify the chained rows – I am assuming that Jonathan noticed that counting 67,500 rows required 136,006 consistent gets during the COUNT(*) FROM CACHER;).

No doubt when the buffer cache is busy, direct path reads (DPR) will be faster than conventional path reads (CPR). As suggested, I tried that as well. Exactly the same code on identical tables in different schemas. I started running the query in the second schema and then I kicked off the query in the main schema. With DPR disabled, the CPR results were slower compared to doing exactly the same concurrent run tests with DPR.

My current argument is that the optimiser costing will always favour DPR over CPR when a table scan of > 10% of the buffer cache is involved. In other words, it does not matter how many free buffers there are (even in the case of a single session running nothing else); the run time engine will perform DPR and will always put the table in the LRU end of the chain to be discarded immediately.

So in summary, with DPR, any table > 10% of the buffer cache will be fetch-and-discard and will not be promoted to the hot area.

Hints for Posting Code Sections in Comments

********************
When the spacing of text in a comment section is important for readability (execution plans, PL/SQL blocks, SQL, SQL*Plus output, etc.) please use a <pre> tag before the code section and a </pre> tag after the code section:

<pre>

SQL> SELECT
2 SYSDATE TODAY
3 FROM
4 DUAL;
TODAY
---------
01-MAR-12

</pre>
********************
When posting test case samples, it is much easier for people to reproduce the test case when the SQL*Plus line prefixes are not included - if possible, please remove those line prefixes. This:

SELECT
SYSDATE TODAY
FROM
DUAL;

Is easier to execute in a test case script than this:

SQL> SELECT
2 SYSDATE TODAY
3 FROM
4 DUAL;

********************
Greater than and less than signs in code sections are often interpreted as HTML formatting commands. Please replace these characters in the code sections with the HTML equivalents for these characters: &lt; for the less than sign (<) and &gt; for the greater than sign (>).