It is where Oracle works through a SQL array. I’ve usually seen this within PL/SQL where an array variable is used within a procedure or package, being passed in containing a set of accounts or customers that are of interest and is used with a cursor. But you might also see it as the step processing the output of a pipelined function (one that returns a set of “rows”) being cast into a table. See this example by Tom Kyte. Tom also suggests that it is called a “Pickler” fetch as the data is pickled – packed and formatted. I never knew that, I just thought someone in Oracle development was having a giggle and it was “Pickled” as it was preserved from the PL/SQL side of the SQL engine. It seems that I was a little off-target with that.

{My thanks to Timur (see comments) who corrected me when I said it was a PL/SQL array. It is not, it is a SQL object}.

The CBO is not very good at looking “inside” such arrays to determine the cardinality of that step. This can cause performance issues.

Because using them involves a few steps and potentially involves grants, it is possible for them to be “temporarily removed” during testing and so their impact is not seen.

They can leak memory. I think.

I’m now going to cover each of those points in turn. If you just wanted to know what a pickler fetch is and I’ve answered that for you, I suggest you go back to whatever you were doing before you started reading this 🙂

By “not very good at looking inside pickler fetches” I mean that the CBO seems to make a poor “base it on the first seen array” (V11 up) or utterly fails to identify how many records are inside the SQL array (V10 down), depending on the Oracle version. From an Oracle 10.2.0.4 system I’ve got the two following examples:

Note the cost and expected rows for the Pickler Fetch step. Or rather, the lack of them.

This would manifest itself in the following way in OEM screens:

Note the cardinality of the Pickler Fetch step is 0 {sorry, I cut off the column headings}. This resulted in this step having a cost of 0 and all the subsequent steps of having an expected cardinality of one and having very low costs – between 1 and 3 {Again, not shown, sorry}.

The end result of this was that the CBO struggled to accuratley cost any given plan and came up with several, usually quite terrible, plans that it swapped between as other table stats varied. The CBO was picking between very complex plans with total costs of 100 or 200 or so! Any difference was “significant”.

Please note, OPTIMIZER_DYNAMIC_SAMPLING was set to 4 on this system and I tried hints and session settings to higher levels and they did not prompt the CBO to look into the array, on 10.2.0.4 at least.

In 11.1 things seem to be better, as is shown in the explain plan at the top of this post. The step has a cost. I have to confess, I have not tested this very much on 11 {and if anyone has, feel free to correct me/add enlightenment via comments or pointers to other sources}, but it seems to set the cardinality to the number of elements the Pickler Fetch finds in the first itteration. Unless it uses the same sort of trick Oracle 11 now uses for bind variables (detecting when the value supplied is out of range and generates a new plan) this is going to lead to the old and much ‘loved’ issue of the plan being fixed by the first execution, irrespective of how suitable that plan is.

How do you fix this issue? Well, I resort to the cardinality hint. Usually the number of records being passed into the array is not too variable and any half-decent value is better than nothing in Oracle 10 and before. As for in 11, I like stating the value rather than risking a variable ‘first seen at parsing time’ setting. It is a judgement call. The below is from 11.1 but I’ve used it extensively in 10.2, where the impact is much more significant:

Note the change of ROWS to 11 in step 5. In V10 this is a change from blank to 11 and in real situations, do not be at all suprised if the plan changes dramatically – away from nested loop access and more to hash joins. {I should note, the cardinality hint is not documented in Oracle 10 or 11 and any use you make of it in live code is your responsibility. Sorry about that}.

What about my second point, about testing them? Well, as an example of testing Pickler processing of SQL arrays, which are defined SQL types, this is what I had to do to run my basic test:

I had to create a table type, which is the SQL array, and this was based on an object type which I had to create first {you can have table types based on standard SQL types but very often they are based on a “row” object}. After creating the stored procedure, I had to define and populate the array with a set of records which I then passed in to my procedure call . {If you want to repeat this yourself, check out my postings on IOTs to get the table creation statement for table CHILD_HEAP}.
Now, I created those types so I had access to them. If those types do not belong to you you have to be granted execute on the types to reference them. Not select, execute. Some sites have a pretty strict attuitude to granting execute on anything and types seem to get forgotten when the execute priviledges against packages and procedures are set up. In a recent situation I had, I was forced to do some testing work on Live and it had taken people with big sticks to get me select access on data. Execute privileges were totally refused. Calmly explaining why it was needed and how it was acceptable fell on not so much deaf as bricked-up ears.

So, for testing, the reference to an array passed in is often replaced by a little sub-select. After all, quite often what is being passed in for a pickler fetch is actually a set of records {but a subset of the rows} from a table that has been collected by a previous processing step.
As an example of such a change:

All that has changed is that we now have a little sub-select rather than the casting of the SQL array into a table and, heck, as the developer might say, those were the records that would have been passed in, the code still works as expected and the same data comes back. No identifying which object types you need, no getting the execute permissions, no populating it yourself in the test harness, we can just swap back in the array later.

{I apologise to good developers, who even now are throwing imaginary darts at me. I know Dawn, you would certainly not do this. But I’ve seen it a couple of times. Developers have enough on their plate to go worrying about esoteric aspects of the CBO}

But the thing is, Oracle can look at that select and evaluate it’s cost and get an expected cardinality. The pickler fetch version has the issues I’ve just covered. I had to deal with a situation just like this last year, it does happen. In dev it was fine, in pre-live testing it was not.

What about memory leaks? Well, I had lots of issues with SQL arrays and memory leaks with oracle 10.1. and 10.2 at one client site and there is a documented bug in oracle 8 with pickler fetch and memory leaks but I have to confess, a quick metalink search did not find any hits for Oracle 10 and 11. So maybe you should not trust me on that one. In the situation I saw the arrays were massive, several MB at times, and so if you are using SQL arrays to pass in a modest list of eg accounts or customers, it is not going to be an issue anyway.

You know, this was just going to be a quick post on something I’ve been meaning to mention for months, not a small essay :-).

So far I have covered the basics of Index Organized Tables, created some examples and shown how IOTs can help reduce the cost of single row selects and then how they can greatly aid range scans. Follow the above links for details.

Now I’d like to show that the use of IOTs has the potential to make your block buffer cache (BBC) far more efficient. Going to disc is very,very slow compared to going to memory {NB solid state storage improves this situation but does not remove it}. The block buffer cache has always been critical to oracle SQL Select performance as it allows you to access data in memory rather than disc and in general the more block buffer cache you have the faster your system will be.
{I am of the opinion that the BBC is even more important now than ever. As hard discs get larger we are seeing fewer and fewer spindles per GB of storage and, in essence, disc storage is effectively getting slower – because more data is hosted on the same number of spindles and those spindles are not themselves getting faster – I digress, for more details see posts Big Discs are Bad and IOPs and Form Factors}

In the scenario I’ve covered in my previous posts on IOTs we have a system where child data is coming in for many parents every day for 100 days. With a heap table the data pours into the growing end of the table, usually a record or two per parent each day and no guarantee that if two records come in, they will be put into the same block.

So, when you select a child record for a parent you get the situation shown below:

For many systems, the Block Buffer Cache is holding a lot of data no queries asked for -collateral data

When oracle needs to collect a record from the table, it has to read the whole block. Oracle only reads in tablespace data in whole blocks. That record comes with many other records in it that you did not ask for or want. I refer to this as Collateral Data – innocent bystander data that has got pulled into the BBC just because it was in the same block as required data. The larger the block size, the more collateral data there is.

To get all the child records for the parent, you need to read all those individual table blocks holding one or two records of interest. For our 100 child records you will probably need to read in close to 100 table blocks. Your Block Buffer Cache is filling up with of blocks where only one row out of each block is “of interest”. If that is one row out of 80 in a block, you are effectively wasting 98.75% of the space that table takes up in the block buffer cache.

With an IOT the situation is very different. We have already seen in my previous post on reducing IO that for a range scan on the IOT, oracle does not need to go and collect records from blocks scattered throughout the table. It simply collects the IOT leaf blocks holding the relevant data. Not only does this require less IO, it also results in the fetched blocks mostly holding the required data. The percentage of collateral data is greatly reduced:

IOTs are a powerful tool in reducing collateral data and using the BBC more efficiently

Thus instead of 100 table blocks that mostly hold collateral data, you have 2 or 3 blocks holding mostly the data you are interested in. Your wastage, the collateral data, is about 33-50%. With my example tables from post IOT2, it is indeed an average of 2-3 IOT blocks holding all the data for a single parent and 100 heap table blocks holding the same data.

You can think of it another way.

With my IOT I use only 3% of the memory to cache a single set of records for a parent compared to that needed with a normal HEAP table.

Let’s extend that idea a little. Let’s say I have 100,000 customer and 5% of the customers are active.
Each customer has on average 500 * 200-byte activity records for 100K of data
Each 8K Heap Table block holds 40 records, a very inefficient IOT holds 20 records per block {I’ll go into the details of this in a later post, but I am being a little unfair on the IOT here}.
There is some grouping of records for the customer so that those 500 records are over 400 blocks.
The whole working set of 400 blocks * (100,000*0.05) customers *8K = 2,000,000K

Now replace the table with an IOT keyed on customer ID.
Those 400 records would be spread over (400/20)+2 blocks. The +2 is the start and end of the range.
That is 22 *(100,000*0.05) *8K = 110,000K

Both would also need the overhead of an index structure to be cached also, for the Heap table it is the Primary Key index, for the IOT it is the rest of the IOT structure.

So you would need 2GB or so of Block Buffer Cache dedicated to caching the working set held as a heap table and 110MB of Block Buffer Cache dedicated to caching the IOT equivalent.

How many of you spotted that the space needed to “cache the heap table working set” is actually twice the size of the table? Well, that is because with a 5% working set and 40 records per block, there is a fair chance that some of those Collateral Data records in each block are for the 5% working set. Over all, almost every block will hold data for two or three active users. This is one of the complications of working out how efficient you block buffer cache is likely to be and I’ll have to leave that to another post. In reality you would need to hold 99% of the heap table in memory to cache the whole working set, so still 1GB. The IOT is still far more efficient.

I’ll just finish by saying that on one project I worked on we would have needed several hundred GB of Block Buffer Cache to hold the working set of the main tables if held as heap tables. That volume of memory was simply not available. With IOTs this reduced to about 40GB. This was available. The majority of this working set was able to stay in their SGA and it meant that so long as the instance stayed up, that working set of data mostly stayed in cache. I was able to see on the live system that processing of the data for these active customers was mostly being supported by consistent gets and less than 1% disk reads.

So, by using IOTS I reduced consistent gets dramatically, made more efficient use of the block buffer cache and, as a result of that, reduced the number of physical reads needed to support the consistent gets.

For as long as I have been working with Oracle technology {which is now getting towards 2 decades and isn’t that pause for thought} there has been a constant search for Performance Silver Bullets – some trick or change or special init.ora parameter {alter system set go_faster_flag=’Y’} you can set to give you a guaranteed boost in performance. For all that time there has been only one.

There are a few performance Bronze Bullets…maybe Copper Bullets. The problem is, though, that the Oracle database is a complex piece of software and what is good for one situation is terrible for another. Often this is not even a case of “good 90% of the time, indifferent 9% of the time and tragic 1% of the time”. Usually it is more like 50%:30%:20%.

Cartridge with copper bullet &spent round

I’ve just been unfair to Oracle software actually, a lot of the problem is not with the complexity of Oracle, it is with the complexity of what you are doing with Oracle. There are the two extremes of OnLine Transaction Processing (lots of short running, concurrent, simple transactions you want to run very quickly by many users) and Data Warehouse where you want to process a vast amount of data by only a small number of users. You may well want to set certain initialisation parameters to favour quick response time (OLTP) or fastest processing time to completion (DW). Favouring one usually means a negative impact on the other. Many systems have both requirements in one… In between that there are the dozens and dozens of special cases and extremes that I have seen and I am just one guy. People get their database applications to do some weird stuff.

Partitioning is a bronze bullet. For many systems, partitioning the biggest tables makes them easier to manage, allows some queries to run faster and aids parallel activity. But sometimes (more often than you might think) Partitioning can drop rather than increase query or DML performance. In earlier versions of Oracle setting optimizer_index_caching and optimizer_index_cost_adj was often beneficial and in Oracle 9/8/7 setting db_file_multiblock_read_count “higher” was good for DWs….Go back to Oracle 7 and doing stuff to increase the buffer cache hit ratio towards 98% was generally good {and I will not respond to any comments citing Connors magnificent “choose your BCHR and I’ll achieve it” script}.
You know what? There was an old trick in Oracle 7 you could maybe still look at as a bronze bullet. Put your online redo logs and key index tablespaces on the fastest storage you have and split your indexes/tables/partitions across the faster/slower storage as is fit. Is all your storage the same speed? Go buy some SSD and now it isn’t….

Cartridge with Wooden Bullet

Then there are bronze bullets that you can use that very often improve performance but the impact can be catastrophic {Let’s call them wooden bullets 🙂 }. Like running your database in noarchivelog mode. That can speed up a lot of things, but if you find yourself in the situation of needing to do a recovery and you last cold backup is not recent enough – catastrophe. A less serious but more common version of this is doing things nologging. “oh, we can just re-do that after a recovery”. Have you done a test recovery that involved that “oh, we can just do it” step? And will you remember it when you have a real recovery situation and the pressure is on? Once you have one of these steps, you often end up with many of them. Will you remember them all?

How many of you have looked at ALTER SYSTEM SET COMMIT_WRITE=’BATCH,NOWAIT’? It could speed up response times and general performance on your busy OLTP system. And go lose you data on crash recovery. Don’t even think about using this one unless you have read up on the feature, tested it, tested it again and then sat and worried about could possibly go wrong for a good while.

That last point is maybe at the core of all these Performance Bronze Bullets. Each of these things may or may not work but you have to understand why and you have to understand what the payback is. What could now take longer or what functionality have I now lost? {hint, it is often recovery or scalability}.

So, what was that one Silver Bullet I tantalizingly left hanging out for all you people to wait for? You are not going to like this…

Look at what your application is doing and look at the very best that your hardware can do. Do you want 10,000 IOPS a second and your storage consists of less than 56 spindles? Forget it, your hardware cannot do it. No matter what you tune or tweak or fiddle with. The one and only Performance Silver Bullet is to look at your system and your hardware configuration and work out what is being asked and what can possibly be delivered. Now you can look at:

What is being asked of it. Do you need to do all of that (and that might involve turning some functionality off, if it is a massive drain and does very little to support your business).

Are you doing stuff that really is not needed, like management reports that no one has looked at in the last 12 months?

Is your system doing a heck of a lot to achieve a remarkably small amount? Like several hundred buffer gets for a single indexed row? That could be a failure to do partition exclusion.

Could you do something with physical data positioning to speed things up, like my current blogging obsession with IOTs?

You can also look at what part of your hardware is slowing things down. Usually it is spindle count/RAID level, ie something dropping your IOPS. Ignore all sales blurb from vendors and do some real-world tests that match what you app is or wants to do.

It’s hard work but it is possibly the only Silver Bullet out there. Time to roll up our sleeves and get cracking…

{Many Thanks to Kevin Closson for providing all the pictures – except the Silver Bullet, which he only went and identified in his comment!}

In the previous two posts I covered the basics of Index Organized Tables (IOTs) and then created some test tables to demonstrate the benefit of IOTs that is most often covered – reducing the IO needed to get a single record by one IO, say from 5 to 4. {Whether this is a buffer get from memory or a disc IO depends on if the block is cached, of course}.

In this post I am going to show how IOTs can far more significantly reduce the IO when several related rows are required.

Below is one of my test tables, the one that is a normal heap table and has a primary key, CHHE_PK on PARE_ID and CRE_DATE:

As you can see, the table has 1 million records and 9,999 values for PARE_ID, there are approx 100 records per parent. The data was created to match a common situation – that of a bit of data coming in for each parent every day. See post 2 for details.

The result of this is that the data for any given parent is scattered through the table. As the data comes in for a given day, the data for the first parent is added to the end of the table, followed by all the data for all the other parents who have data that day. The next day this is repeated, so the child records for a given parent are interspersed with the child records for many other parents.

The below diagram demonstrate what will now happen if you issue a statement like
select *
from CHILD_HEAP
where PARE_ID=12

Oracle quickly works down the index to the leaf block containing the first key that matches the range. This takes, in my example, 4 block reads. Oracle now works through the index entries and, via the rowid, identifies the exact block to visit in the table for each key. For each key it has to visit a new block – because the data is scattered through the table. This is what the clustering_factor in the index statistics is measuring, how often contiguous rows in the index are for the same block. In our case, almost never.
In my diagram I do not number those table reads but in my simplistic diagram it would be 10 further reads.
If Oracle reaches the end of the leaf block before it reaches the end of the range of key values, oracle follows the pointer in the leaf block (not shown) to the next leaf block (whcih is another block read) and continues working through the keys until the range scan is completed.

In my simplified diagram I only have 6 entries per leaf block. In reality, and in my example tables, this is more like a few hundred. 247 in the case of CHHE_PK.

Now let’s consider my Index Organized Table, CHILD_IOT. It has exactly the same columns as CHILD_HEAP and the data was created in the same way. However, because it is an IOT, as the data came in it was inserted into the primary key index and is thus in an ordered state.

The below diagram demonstrate what will now happen if you issue a statement like
select *
from CHILD_IOT
where PARE_ID=12

Oracle works down the index to the leaf block where the range scan begins and now simply works along the leaf blocks. There is no need to go and visit the table as there is no table.

In my IOT diagram the leaf entries are longer and there are fewer in each leaf block, ie 5. So my scan has to visit 3 leaf blocks rather than 2. In reality the difference is more pronounced, in my example table there are actually 56 rows per leaf block, compared to the 247 in the index on the heap table. As such, my scan on the IOT will cover more leaf blocks but this is insignificant compared to the reduction in block visits caused by not having to go hunt down records scattered over the table. Even in the unlikely event of my IOT being deeper by 1 level (an extra layer of branch blocks) due to the reduces entries per leaf block, I would still be winning for range scans.

That is all nice theory and pictures. As ever, we need to back this up with some real tests. Firstly, I am using SQL*Plus and I need to set my arraysize large enough so that I do not introduce extra consistent gets through selecting small sets of rows between client and server. You will need to do the same to see similar results to me.
{I keep meaning to do a dedicated post on arraysize but H.Tonguç YIlmaz has a nice post already on it.}

set arraysize 200
set autotrace on

Now I will select all the records for PARE_ID=10, including a column not in the Primary Key, so that the table needs to be visited. I did this twice to remove the parsing overhead:

We actually gathered more data, 108 rows compared to 82, all for 6 consistent gets compared to 86 consistent gets. That is a reduction to less than 10% of the original effort.

Now for a more extreme test. I am going to select a single row summary of data for 10 parents, flushing the cache between each run to show the impact when you have to do real IO to support those consistent gets. This is on a fairly old {4 years} laptop with a rather tired hard disc

In my first post on IOTs I ran through the basics of what they are. Here I am going to create some test tables and show you a few things.

I am going to create a simple PARENT table with 9,999 records and then two CHILD tables. CHILD_HEAP, a normal table, and CHILD_IOT, an Index Organized Table. They have the same columns and will hold very similar data.

All of this is on Oracle 11.1 but is exactly the same on 10.2. 8K block size, tablespaces are auto segment space managed.

There are only two differences between the statements creating the CHILD_HEAP and the CHILD_IOT tables.

The main one is the inclusion of the line ORGANIZATION INDEX and is what instructs Oracle to create the table as an IOT. Note that it does not state the index and you cannot state the index. The IOT is created based on the Primary Key.
The other change is that you now cannot state the tablespace for the Primary Key index. I’ve not played with this at all but I don’t think you can state anything with the “using index” as the table storage clauses are used for the Primary Key index. I personally find this a little illogical as it is the index segment that is created, but I guess others would find it more natural that you still state this at the table level.

When I create IOTs on a real system, I put the IOT in a table tablespace {I still maintain table and index tablespaces, for reasons I won’t go into here}. I put it there as it holds the actual data. If I lose that Primary Key index I am losing real data, not duplicated data.

I then populated the two CHILD tables with data. The method of creating this test data is very important.

I am simulating a very common situation, where data is coming in for a set of Parents (think customers, accounts, scientific instruments, financial entities) and the data is coming in as a record or set of records each day. ie not where the parent and all of it’s child records are created at one time, like an order and it’s order lines. I am simulating where the child data is created a few records at a time, not all in one go.

The code is simple. it loops for one hundred days and for each day it creates 10,000 records for random parents. On each day any given parent will have none, one or several records. On average, each parent will end up with 100 records, but some will have more and some less. The key thing is that the data for any given parent is created a record at a time, with lots of records created for other parents before the next record for that given parent.

The two tables will have the same pattern of data but not identical data. {I could have seeded the random number generator to make the two data sets the same but this will do}. Below is the statement for one table, you just change the table name to populate each table. {BTW I like using the from dual connect by level <=x method of getting the number of rows desired – it is fast and is neat, once you have seen it once}.

ONE row? Where is the other table, where is CHILD_IOT? It does not exists.

Remember from my first post that I made the comment I would have prefered it if Index Organized Tables had been called something like ‘Table Containing Indexes’? The table data has been placed in the Primary Key index and the table segment does not even exist. If you start using IOTs this will catch you out periodically – it does me anyway and I’ve been using them on and off for years :-).

Note that the Primary Key index for CHILD_HEAP, CHHE_PK, is there and is 4,224 blocks in size, and the CHILD_IOT Primary Key, CHIO_PK, is a lot larger at 19,456 blocks. In fact, not only is the CHIO_PK index larger than the CHILD_HEAP table, it is larger than the combined size of the CHILD_HEAP table and CHHE_PK index combines. So much for me saying last post that IOTs can save disk space? I’ll come back to that in a later post…

Note the lack of BLOCKS for the CHILD_IOT table and the CLUSTERING_FACTOR of 0 for the CHIO_PK.

The clustering factor is the number of times Oracle, when scanning the whole index in order, would have to swap to a different Table block to look up the table record for each index entry. If it is close to the number of blocks in the table, then the clustering factor is low and the order of records in the table matches the order of entries in the index. This would make index range scans that need to visit the table reasonably efficient.

If the clustering factor is close to the number of records in the table then it means there is no correlation between index order and table row order and such index ranges scans that have to visit the table would be inefficient. Again, this is significant and will be the major topic of the next post.

The depth of the index does not change, being 3 in each case (BL or blevel 2)

So, can we see evidence of the theoretical efficiency of looking up single records via the IOT that I mentioned in the fist post? Here we go {oh, usual disclaimer, I run the code twice and show the second run, to remove the parsing overhead}:

{I had to look up the exact values of CRE_DATE of a couple of records to do the above queries}

To look up a single row with the heap table you can see that the explain plan was to carry out a unique scan on the primary key and then look up the row via the rowid and took 4 consistent gets. 3 to walk down the index and get the rowid, one to look up the row block.

For the IOT table the explain plan reveals that there was simply an index unique scan of the Primary Key, nothing more. All data for the row was there in the index entry rather than the rowid. Thus only 3 consistent gets were required.

For single row lookups on the Primary Key, IOTS are more efficient than traditional Heap tables with a Primary Key index. {Please, no one point out that if all the columns you need are in the index you also do not need to go to the table, that is a different topic}.

Quite a few people have shown this efficiency before but the next step is far, far more interesting and shows a much more significant impact of IOTs. That is the topic of the next post :-).

For now, I am going to finish off with what happens with range scans as I suggested they could slow down with an IOT.
Below, I select count(*) for just one of the parent values.

Both statements carry out a range scan on the Primary Key of the table. For the normal HEAP table this takes 3 consistent gets, which is no suprise as we have an 8k block size and only 100 rows for a given parent, they happen to fit into one block of the index. So Oracle works down the depth of the index and looks at one block.

For the IOT the scan works down the index but has to scan three blocks. Even though there are fewer entries, 93 compared to 98, they span three blocks and thus the total number of consistent gets is 5.

Admittedly I was a little lucky in my example above. Sometimes the entries for one parent will scan 2 blocks for the heap table’s Primary Key and occasionally the entries for the IOT will fit into 2 blocks. But if you look at the number of leaf blocks in the earlier stats (4,034 for the normal and 17,855 for the IOT, both for 10,000 entries) usually the 100 or so entries for single parent in the normal index will all fall into one block and the entries for the IOT will fall into between 2 and 3 blocks.

A select count(*) will full scan the smallest segment that can satisfy the query. Let’s try it:

The number of consistent gets (and physical reads) are close to the number of leaf blocks in the two segments, though higher. This is because Oracle is scanning the whole index, leaf blocks and branch blocks. The scan is far more expensive for the IOT, simply as the index is so much larger. I’ve not shown timings but on my little laptop, the count(*) takes about 3 seconds on CHILD_HEAP and about 5 seconds on the CHILD_IOT.

I think Index Organized Tables(IOTs) are a much under-used and yet very useful feature of Oracle. Over the next few postings I’m going to cover some aspect of Index Organised Tables, both good and not-so-good. I am going to cover some benefits of IOTs that I think many people are unaware of. In this first post I am just going to run through the basics of IOTs.

The idea behind an IOT is simple. You hold all the data for the table in the ordered structure of an index. Why would you want to do that? Let us consider a very common requirement, accessing a row in a “large” table via a known, unique key.

Traditionally you have a heap table holding the data you want to access and a standard index to support access to that table. See the first diagram below. The 4-layer triangle represents the index, with a root block, two levels of branch blocks and then the leaf blocks at the “bottom”. The blue rectangle represents the table with the squares being individual rows. Of course, in a large table there would be thousands or millions of “squares”, this is just a simple diagram to show the idea.

When you issue a SQL statement to select the row via the indexed column(s) then oracle will read the root block (1), find the relevent block in the first level of branch blocks (2), then the relevant block in the second level of branch blocks (3) and finally (as far as the index is concerned) the relevant Leaf Block for the unique key. The leaf block holds the indexed column(s) and also the rowid. The rowid is the fastest way to look up a record, it states the file, block and row offset for the row. This allows oracle to go straight to the block and get the row. That is read number (5).
The number of branch blocks {and thus the number of blocks that need to be read to find a row} will vary depending on how much data is indexed, the number and size of the columns in the index, how efficiently the space has been used in the blocks and one or two other factors. In my experience most indexes for tables with thousands or millions of rows have one, two or three levels of branch blocks.

The second diagram shows a representation of the Index Organized Table. The table has in effect disappeared as a distinct object and the information has been moved into the leaf blocks of the index {part of me feels Index Organized Tables should really be called Table Organized Indexes or Table Containing Indexes as that would better indicate what is physically done}:

So with the IOT oracle reads the root block (1), the two branch level blocks (2 and 3) and finally the leaf block (4). The leaf block does not hold the rowid but rather the rest of the columns for the table {this can be changed, a more advanced feature allows you to store some or all the extra columns in an overflow segment}. Thus to access the same data, Oracle has to read only 4 blocks, not 5. Using an IOT saves one block read per unique lookup.

This saving of block reads is probably the main feature that IOTs are known for, but there are others which I will cover in later posts. Two things I will mention now is that, firstly, the use of IOTs is potentially saving disc space. An index is in effect duplication of data held in the table. When you create an index no new information is created but space is used up holding some of the table information in a structure suitable for fast lookup. Secondly, the index and table have to be maintained whenever a change is made to the columns that are indexed. IOTs reduce this maintenance overhead as there is only one thing to maintain.

Now for some drawbacks.

The IOT has to be indexed on the primary key. There is no option to create an IOT based on other indexes. As such you have to either be accessing the table via the primary key to get the benefit – or you have to be a little cunning.

The index is going to be larger than it was and very often larger than the original table. This can slow down range scans or full scans of the index and a “full table scan” will now be a full index scan on this large object, so that can also negatively impact performance. However, if a range scan would then have resulted in access to the table to get extra columns, the IOT gives a similar benefit in reducing IO to that for single row lookups.

I just want to highlight that you now have no rowid for the rows.

Secondary indexes are supported but will potentially be less efficient due to this lack of rowid.

So, a brief summary is that Index Organised Tables effectively move the table data into the Primary Key index, reduce the number of block lookups needed to select one row, can save some disc space. But you can only organize the table via the Primary Key and it can make full or partial table scans and lookups via other indexes slower.

There are several more important benefits to IOTs {in my opinion} which I will come to over the next week or two.

One of my most popular postings is about why your SYSTEM tablespace could be rather large. Recently I’ve had issues with a SYSAUX tablespace being considerably larger than I expected, so I thought I would do a sister posting on the reason.

The client I was working with at the time was about to go live with a new application and database. For various reasons I was a little anxious about how the Java application (the User Interface) would actually call the stored PL/SQL code I had helped develop. Initial workloads would be low and so I asked that the STATISTICS_LEVEL be set to ALL, so that bind variables (amongst other things) would be gathered. This is on version 10.2.0.4, btw, enterprise edition and 4-node RAC.

We went live, issues were encountered and resolved, the usual way these things work. Then, a few weeks in and when everything was still very “exciting” from a problem resolution perspective, I got an odd email from the DBA team. Would they like us to add another datafile to the SYSAUX tablespace. Huh? I checked. I’d been watching the size of our application’s tablespaces but not the others {well, I was not supposed to be a DBA and I was watching an awful lot of other things}. Our SYSAUX tablespace was around 160GB in size, having pretty much filled it’s 5th datafile. Why? I checked to see what was taking up the space in the tablespace:

As you can see, almost all the space is being taken up by WRH$_LATCH_CHILDREN and WRH$_LATCH_CHILDREN_PK partitions. They are massive compared to other objects. A quick goggle did not come up with much other than many hits just listing what is in SYSAUX and the odd person also seeing SYSAUX being filled up with these objects and suggested ways to clear down space, nothing about the cause.

I had a chat with the DBAs and we quickly decided that this was going to be something to do with AWR given the name of objects – “WRH$_” objects are the things underlying AWR. The DBA suggested my settings of 15 minute intervals and 35 day retention was too aggressive. I knew this was not the case, I’ve had more aggressive snapshot intervals and longer retention periods on far busier systems than this. I did not have access to Metalink at that point so I asked the DBAs to raise a ticket, which they duly did.

Oracle support cogitated for a couple of days and came back with the advice to reduce the retention period. Hmmmm. Via the DBA I asked Oracle support to explain why those objects were so large when I had not seen this issue on several other systems. Was it a bug? I had by now corroborated with a friend from a previous site with 5 minute snapshot intervals and two months retention period and their SYSAUX tablespace was about 10GB all in. I did not want to go changing things if we did not know it would fix the issue as we really wanted to stop the growth of SYSAUX as soon as possible, not just try a hunch.

As you probably realise from the title of this blog, the issue was not the snapshot interval or retention period but the STATISTICS_LEVEL=ALL. The one and only hit you get in metalink if you search on WRH$_LATCH_CHILDREN is note 874518.1. From V10.1.0.2 to V11.1.0.7 setting this parameter to ALL is known to create a lot of data about Latch children and not clear it down when the AWR data is purged (Bug 8289729). The advice was to change STATISTICS_LEVEL and make the snapshot interval larger. I’d suggest you just need to alter the STATISTICS_LEVEL, unless you really, really need that extra information gathered. It seemed to take Oracle Support an extra day or two to find that note for us. {I’ve since checked out Metalink directly to confirm all this}.

So with a known issue we felt confident that altering the initialisation parameter would solve the issue. It took a while for us to change the STATISTICS_LEVEL on the production system – Change Control for that site is rather robust. This allowed us to see some other impacts of this issue.

The mmon process which looks after AWR data was becoming a top session in our OEM performance screens. In particular, a statement with SQL id 2prbzh4qfms7u that inserted into the WRH$_LATCH_CHILDREN table was taking several seconds to run each time and was running quite often {I include the SQL ID as it may be the same on many oracle V10 systems as it is internal code}:

The internal SQL inserting into wrh$_latch_children was becoming demanding

This was doing a lot of IO, by far the majority of the IO on our system at the time – it was a new system and we had been able to tune out a lot of the physical IO.

The physical IO requirements and 15-20 second elapsed time made this out most demanding statement on the system

We also now started to have issues with mmon running out of undo space when it ran at the same time as our daily load. This was particularly unfortunate as it coincided in a period of “intense management interest” in the daily load…

Yes, that is close to half a terabyte of SYSAUX and it is all used, more partitions have appeared and the total size of the largest segments in SYSAUX show how WRH$_LATCH_CHILDREN and WRH$_LATCH_CHILDREN_PK make up the vast majority of the space used.

Shortly after, we finally got permission to change the live system. The impact was immediate, mmon dropped from being the most demanding session, that SQL code dropped down the rankings and the issues with running out of undo ceased.

I was anxious to see if the old data got purged, as the Metalink note had suggested the data would not be purged. Thankfully, that was not the case. The space was slowly released as normal purging of data outside the retention period took place and after just over a month, the SYSAUX tablespace contained a lot less information and was mostly free space:

Now, how do we get that space back? I left that with the DBA team to resolve.

Oh, one last thing. I mentioned the above to a couple of the Oaktable lot in the pub a few weeks back. Their instant response was to say “You set STATISTICS_LEVEL to ALL on a live system?!? You are mad!”
🙂

{Update, I’ve just spotted this posting by Colbran which is related. Last time I googled this I just got a stub with no information}

I know I have mentioned it before, but I am a big fan of the OEM performance screens that are derived from the ASH/AWR information. One of the things I really like about it is the immediate information it gives you, in one glance, that things are “not normal”. Once you notice that things are not normal you can then, within a few seconds, get a feel for what is still probably OK and where you have something that has changed.

As an example of the immediate information, I recently came back to my desk and glanced at my OEM performance screen. It was showing the below:

This will not be an interesting picture to you, but to me it tells me a lot about my system

“data load has just ran” I said to my comrade-in-arms. “which one?” he asked. “The Delta – It ran the quick plan. But it started a bit late, 12:15. Oh, and looks like the transaction view code has swapped back to the full table scan plan and the summary code is not playing up at the moment.”

“you’re turning into Neo you are – can you see a lady in a red dress???” he asked.

That was of course a reference to the “Matrix” films where at times you see the virtual world displayed on a screen as a stream of characters running down the screen – but once you get used to it you can apparently “see” what is going.

The screen shot above is not even actually a very good example of what the performance screens can show you. One of my minor complaints about the performance screens is that it scales to show the greatest of the largest peak or a number of sessions to match the number of CPUs (real or fake) that are available to you. So if you have more CPU available than you need, you can’t see much detail in the graph. And if you have had a nasty peak of activity, again, all detail is squeezed out. In my case, the box is sized to cope in 12 months and the system is new, so activity is scuttling along the bottom of the graph.

However, “poor” though the example is, it told me what was going on across my system at a glance, something about the major tasks we are running, that one problem is currently occurring and that several of the other issues I need to keep an eye out for are not occurring.

That is why I love these screens – I recognise “my” activity patterns from the graph, I now recognise the SQL IDs for my key statements. If I see a pattern in the graph I don’t recognise, I need to check things out immediately. Three or four times over the last 2 weeks I have spotted an issues, started investigating and found out the cause before the Operations desk has even noticed an issue.

Oh, and what is SQL type 189? It is a merge statement. Our implementation of OEM is a little old, it does not correctly interpret that SQL command type. It might be a little old, it is still a lot useful.

Sometimes it can be very hard to solve what looks like a simple problem. Here I am going to cover a method that I almost guarantee will help you in such situations.

I recently had a performance issue with an Oracle database that had just gone live. This database is designed to scale to a few billion rows in two key tables, plus some “small” lookup tables of a few dozen to a couple of million rows. Designing a system of this scale with theory only is very dangerous, you need to test at something like the expected volumes. I was lucky, I was on a project where they were willing to put the effort and resource in and we did indeed create a test system with a few billion rows. Data structure and patterns were created to match the expected system, code was tested and we found issues. Root causes were identified, the code was altered and tested, fine work was done. Pleasingly soon the test system worked to SLAs and confidence was high. We had done this all the right way.

We went live. We ramped up the system to a million records. Performance was awful. Eyes swung my way… This was going to be easy, it would be the statistics, the database was 2 days old and I’d warned the client we would need to manage the object statistics. Stats were gathered.
The problem remained. Ohhh dear, that was not expected. Eyes stayed fixed upon me.

I looked at the plan and I quickly spotted what I knew was the problem. The below code is from the test system and line 15 is the key one, it is an index range scan on the primary key, within a nested loop:

On the live system we had an index fast full scan (To be clear, the below is from when I had tried a few things already to fix the problem, but that index_fast_full_scan was the thing I was trying to avoid. Oh and, yes, the index has a different name).

Now I knew that Oracle would maybe pick that plan if it could get the data it wanted from the index and it felt that the cost was lower than doing multiple range scans. Many reasons could lead to that and I could fix them. This would not take long.

But I could not force the step I wanted. I could not get a set of hints that would force it. I could not get the stats gathered in a way that forced the nested loop range scan. I managed to alter the plan in many ways, fix the order of tables, the types of joins, but kept getting to the point where the access was via the index fast full scan but not by range scan. I thought I had it cracked when I came across a hint I had not known about before, namely the INDEX_RS_ASC {and INDEX_RS_DESC} hint to state do an ascending range scan. Nope, no joy.

By now, 8 hours had passed trying several things and we had a few other people looking at the problem, including Oracle Corp. Oracle Corp came up with a good idea – if the code on test runs fine, copy the stats over. Not as simple as it should be as the test system was not quite as-live but we did that. You guessed, it did not work.

So what now? I knew it was a simple problem but I could not fix it. So I tried a technique I knew had worked before. I’d long passed the point where I was concerned about my pride – I emailed friends and contacts and begged help.

Now, that is not the method of solving problems I am writing about – but it is a damned fine method and I have used it several times. I highly recommend it but only after you have put a lot of effort into doing your own work, if you are willing to give proper details of what you are trying to do – and, utterly crucially, if you are willing to put yourself out and help those you just asked for help on another day.

So, what is the silver bullet method? Well, it is what the person who mailed me back did and which I try to do myself – but struggle with.

Ask yourself, what are the most basic and fundamental things that could be wrong. What is so obvious you completely missed it? You’ve tried complex, you’ve been at this for ages, you are missing something. Sometimes it is that you are on the wrong system or you are changing code that is not the code being executed {I’ve done that a few times over the last 20 years}.

Ignore the difference in name, that was an artifact of the test environment creation, the key thing is the primary key has a different column order. The DBAs had implemented the table wrong {I’m not blaming them, sometimes stuff just happens OK?}.
Now, it did not alter logical functionality as the Primary Key is on the same columns, but as the access to the table is on only the “leading” three columns of the primary key, if the columns are indexed in the wrong order then Oracle cannot access the index via range scans on those values! Unit testing on the obligatory 6 records had worked fine, but any volume of data revealed the issue.

I could not force my access plan as it was not possible – I had missed the screaming obvious.

So, next time you just “know” you should be able to get your database (or code, or whatever) to do something and it won’t do it, go have a cup of tea, think about your last holiday for 5 minutes and then go back to the desk and ask yourself – did I check that the most fundamental and obvious things are correct.

That is what I think is the key to solving what look like simple problems where you just can’t work it out. Try and think even simpler.

What seems like a couple of months ago I suggested the idea of The Fastest Oracle Server for a Grand. It turns out this was actually over 1/3 of a year ago! {such is the rapid passing of normal time}. Anyway, I’ve decided to give this a go.

The intention is that I am going to build a server based on PC technology which costs less than £1,000 and see how fast I can make it go. Of course “how fast” is a “piece of string” question – it depends on what you put into the Oracle database, how you want to use or manipulate the data and how business-ready the system is. I’m intending to build something that looks very, very un-business ready. That is, not a lot of redundancy. Before anyone wants to shoot me down for that (a) I am not running a bank or anything to do with finance (b) why are banks systems that only deal with cash so much more regulated and goverend than medical systems that are only relied on to keep you alive? (c) some of the biggest systems I know of are actually running on something close to PC kit.

I’m quietly confident I can build something that out-performs systems consisting 100 times as much. Now, that is a massive claim and I won’t be too sad if I fall short, but you can do a lot with modest kit. I worked for a charity for 6 years and boy did I see some clever stuff done on the sort of budget many organisation spend on office stationary.

So, what have I got so far? I confess I held off until I saw some new technology appear in a price band I could squeeze in. Namely USB3 and SATA3. There is always something just around the corner but I wanted those as I want to maximise the impact of solid state storage. So, my base server is:

Asus P7P55D-E motherboard supporting DDR3, USB3 and SATA3

Intel i5 760 2.8HHz chip

8GB memory

1TB samsung 7200rpm SATAII disk

AZCool Infinity 800W PSU

Coolmaster Elite RC-335 case

I chose the motherboard as it was getting good reviews and had the SATA3 and USB3 ports. I chose the case as it was large enough to take many hard drives, small enough to lug about and was a nice case. I stuck to 8GB RAM as RAM is expensive at the moment, but as it is in 2GB chunks I might regret that choice as all my slots are full. Many people forget the PSU but it’s like the tyers on your car. Those tyers keep you stuck to the road, a PSU keeps you powered. It might be utilitarian but they are vital and often overlooked. The hard disc is pretty good, but very likely to be swapped out (I don’t mind sticking it in another system). The CPU is a proper quad core CPU. I had plenty of scope to go bigger and better on the CPU but for grunt for cash, it seems presently to be the sweet spot.

The basic unit is not overclocked. I will increase the cooling and overclocking will be an option. It comes with 64 bit windows but linux is almost certainly going to be the faster option. No monitor is included but hey, it’s a database server, you don’t need fancy graphics. That old CRT in the corner will do! The server does have a rather nice nVidia GeForce GTX 460 in it but I am cutting out the cost of that. The server is currently the best gaming machine I have but that will end when I get time to start working on the Oracle side.

Total cost, £615 or so. That is like $615 seeing as we get so ripped off in the UK for IT kit. I can now go spend money on more fast hard discs, SSDs, even fast USB memory sticks. Any suggestions, I am happy to listen.

The biggest question is – When am I going to get time to work on this damn thing?