January 2011

The Hotsos Symposium 2011 schedule has been published. On the one hand it's one of the easiest conferences to plan for, because there are only two streams – which takes away the ridiculous range of choice you get at something like OpenWorld; on the other hand it can be extremely difficult to pick, because the quality is uniformly high and the presentations all tend to be about performance. There are always some clashes I simply can't resolve. Anyway, here's my best guess at the moment.

The indexes par_i1 and chi_i1 are on the id column and, as you can see for chi_i1, have been created as unique indexes. Unlike the table, though, the code defining them isn’t necessarily identical.

Please state, with justification, what you think the blevel is for index chi_i1?

(Note – to avoid any risk of confusion, remember that height = blevel + 1; and just in case anyone thinks it makes any difference, the tables and indexes were all created in the same tablespace which is my usual setup of 1MB uniform extents, locally managed, 8K block size and freelist management.)

The quiz isn’t intended to be cunning, devious and tricksy, by the way. It’s all about observing and understanding a fairly ordinary situation. On the other hand I’m sure there are interesting variations with strange side effects that could be introduced if you really want to get clever.

Answer 23rd Jan:
Rather than writing my answer into a new post I’ve added it below – but to give new visitors a chance to work out the answer before seeing mine I’ve left a big gap before the answer.

Reading the comments I think all the interesting points have been covered. We have comments covering the special treatment of root blocks in nested loop joins, the “buffer is pinned count” statistic, possible changes in 11g, the hypothesis that I had set a large pctfree for the child index, and the link between “A-rows” in one line of the plan and “Starts” in another line.

Then two hundred calls to:
Line 5 – index unique scan of chi_i1, which always finds its single row – hence returning 200 rowids in total
Line 4 – check the matching row in the table – which always fails – hence returning zero rows in total

The critical counts that allow you to answer my question are the 402 and 602 buffer gets in lines 5 and 4. Ignoring, temporarily, the odd 2, these numbers are clearly 200 times something – which is rather nice given that we know we are doing something 200 times. The simplest explanation, of course, is that the (602 – 402 =) 200 extra gets due to line 4 represent the visits to the 200 table blocks, leaving us with 400 (plus the odd bit) to account for.

Now consider the possibilities:

Blevel = 0 – we would visit the only index block (which would be the root, but look like a leaf), then visit the table.
Blevel = 1 – we would visit the root, then the leaf at the next layer down, then the table.
Blevel = 2 – we would visit the root, a branch block, then the leaf, then the table.

So, at first sight, we might decide that the only way to get 400 buffer visits from the index is to have blevel = 1 – visiting the root and the leaf on each of the 200 probes. But that’s not the way it works (apart from a couple of versions which had a bug).

When running a nested loop join, Oracle “pins” the root block of the inner (second) table’s index, which means it attaches a structure to the “users’ list” in the buffer header, linking the session’s state object directly to the buffer header. So, in my nested loop join, Oracle gets the root block once and keeps it pinned, then gets a branch and a leaf block 200 times. The blevel of the child index is 2. (The blevel of the parent index is only 1 – I had set pctfree to 90 on the child index to make it much larger than it needed to be.)

This “root block pin” isn’t the only pinning optimisation in the query, though. In detail, the steps are as follows – and this is something you would be able to see in the trace file if you enabled event 10200 (one of the ‘consistent reads’ traces).

1) Get the root block of parent index and pin it
2) Get the first relevant leaf block of the parent index and pin it
3) Get the first relevant table block of the parent table and pin it.
4) Get the root block of the child index and pin it
5) Get the relevant branch block of the child index (no pin)
6) Get the relevant leaf block of the child index (no pin)
7) Get the table block of the child table (no pin) — row does not match final predicate and is rejected
8) Revisit the parent index leaf block through the pin (add one to “buffer is pinned count”)
9) Revisit the parent table block through the pin (add one to “buffer is pinned count”)
10) Revisit the child index root block through the pin (add one to “buffer is pinned count”)
11) Get a new child branch
12) Get a new child leaf
13) Get a new table block
14) Repeat for a total of 200 cycles from step 8
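The buffer-get arithmetic implied by this cycle can be checked against the figures quoted earlier (it anticipates the root-block oddity described next, which accounts for the second root get):

```
child index gets (plan line 5):  2 (root) + 200 (branch) + 200 (leaf) = 402
child table gets:                200 (one block per probe)            = 200
plan line 4 cumulative total:    402 + 200                            = 602
parent index gets:               1 (root) + 1 (leaf)                  =   2
parent table gets:               4 (one per table block visited)      =   4
```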

There are a couple of deviations from this cycle, of course. The rows I needed from the parent were spread across 4 consecutive blocks in the table so, roughly every 50 rows from the parent, line (9) above would become “release the currently pinned parent block, get the new parent block and pin it”. The other little oddity, which I can’t explain, is that Oracle does “get” the child root block on the second visit to the index as well as the first – and only pins it from that moment onwards. So the counts are:

Gets on the parent index 2 (root and leaf)
Gets on the parent table 4 (one for each table block visited)
Gets on the child index 402 (two on the root block, 200 on branch blocks, 200 on leaf blocks)
Gets on the child table 200 (one for each row/block accessed)

Inevitably, things change – Oracle keeps getting smarter about things like ‘consistent gets – examination’, ‘buffer is pinned count’ and, in 11g, “fastpath” access to buffered blocks. This is a clue to the difference in gets that Charles Hooper recorded in 11.2.0.2 – and explains why I chose to use 10.2.0.3 with a unique index for my example. If you want to investigate other variations all it takes is snapshots of v$mystat, calls to “alter system flush buffer_cache”, and event 10200; the treedump can also be very helpful for identifying block addresses.
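For anyone who wants to repeat the experiment, the scaffolding looks roughly like this – a sketch rather than a finished script, so check the statistic names and event syntax on your own version before relying on it:

```sql
rem  Start from a known state
alter system flush buffer_cache;

rem  Enable the consistent-read trace (event 10200)
alter session set events '10200 trace name context forever, level 1';

rem  Snapshot the interesting session statistics before and after
rem  running the test query, and compare the two sets of values
select  sn.name, ms.value
from    v$mystat ms, v$statname sn
where   ms.statistic# = sn.statistic#
and     sn.name in (
            'consistent gets',
            'consistent gets - examination',
            'buffer is pinned count'
        );
```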

Luis Daziano from Argentina asked me about the location of the series I wrote for Oracle Technology Network on advanced Linux commands. Fortunately, OTN still has all of them, although they have become a little difficult to find. Here are the five articles, if you are interested.

Recently, I received into my Spam folder an ad claiming that a product could “...improve performance 1000%.” Claims in that format have bugged me for a long time, at least as far back as the 1990s, when some of the most popular Oracle “tips & techniques” books of the era used that format a lot to state claims.

Beware of claims worded like that.

Whenever I see “...improve performance 1000%,” I have to do extra work to decode what the author has encoded in his tidy numerical package with a percent-sign bow. The two performance improvement formulas that make sense to me are these:
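With b as the measured task duration before the change and a as the duration after, the two candidates are (labels mine, shown here for illustration):

```
improvement = (b - a) / b    -- can never exceed 100% for positive durations
improvement = (b - a) / a    -- unbounded; "1000%" would mean b = 11a
```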

A few weeks ago, while delivering a session at the New York Oracle Users Group, I had the idea of starting a blog series with the title 100 Things You Probably Didn't Know About Oracle Database. From the myriad emails I have received so far, I know it has been quite successful in helping many readers. I was pleasantly surprised to hear from even some seasoned DBAs how it helped them. I would like to take this moment to talk about a few things.

First, in the initial installment I set myself the goal of producing an installment a week. After the first installment itself, I quickly realized that was a lofty, way too lofty, goal. I have a day job as a Database Architect where I research options, prepare project plans, estimate budgets, get approvals for the projects, jump into DBA operations, develop strategies of all sorts, get them approved, and chase everyone from the Sys Admin to the mailman to make sure the train chugs along. I also write articles, give training seminars, present technical sessions, review manuscripts of books (and occasionally write one), critique published books sent to me by publishers, do some mentoring, and yada yada yada. And, you see, I am just an average person with a family that needs me as well. This little "hobby", however satisfying, faces a lot of competition. What is the alternative? I could write shorter pieces or simply drop the code examples, figures, etc., which take up most of the time. On second thought, most of you actually like the articles because of these very elements - code examples and figures - and I don't want to drop them. Therefore, as much as it hurts me, I have to renege on the commitment of one installment a week. I would rather produce some quality content than just a bunch.

Second, I don't have a plan, a set schedule or a "roadmap" for this series. I choose the next content based on what I get in the feedback. If you would like to see a specific topic covered, please drop me an email, a tweet, or put it in the comment section here. Please be as specific as possible.

Third, some of the comments have been about additional questions which were probably not covered very well in the blog entry. I consider them a success, rather than a failure, of the writing. You see, a "complete" piece of writing is an oxymoron; it does not exist. If an article, blog or book does not generate ten other questions, then it fails. The objective of my series is never to be 100% comprehensive; instead it aims to build a foundation and generate additional interest. Some of the questions are planned to be answered in subsequent installments; the rest are left on the wishlist for future articles.

Finally, an author is nothing without readers. From the bottom of my heart, I thank you, dear reader, for reading this. Life makes demands, which is perhaps even more true for technologists like yourself; your decision to devote a slice of your life to this series is a conscious investment you made. I am honored and grateful for that.

Arup Nanda
Danbury CT, USA
January 20th, 2011
Twitter: arupnanda

As you probably know, my first co-authored book, Expert Oracle Practices, was released in 2009. I have since co-authored one more book, Pro Oracle SQL, with my esteemed colleagues. This book covers many aspects of better SQL development. Have fun reading!

I will be talking about advanced RAC troubleshooting at the Hotsos Symposium '11. The Hotsos Symposium, held every March in Dallas, TX, is an intensive seminar series probing the deep waters of Oracle Database performance.

There are many great speakers at this conference. It gives me great pleasure to meet many folks I have known for years, exchange ideas, and learn from industry leaders.

Yesterday I ran a benchmark on a two-node RAC cluster (ProLiant BL685c G6 servers, each with four six-core AMD Opteron 8431 processors and 32G of RAM). It's running Oracle Grid Infrastructure 11.2.0.2 and an Oracle 11.2.0.2 database on Oracle Enterprise Linux 5.5 64bit with device-mapper-multipath.

I was testing how the system would react under load but also wanted to see if the Runtime Load Balancing was working. The easiest way to check this is to view the AQ events that are generated for a service if AQ HA notifications is set to true. They can either be dequeued from the database as described in chapter 11 of Pro Oracle Database 11g RAC on Linux or alternatively queried from the database. The latter is the quicker method and this article will focus on it.

Before you can make use of Runtime Load Balancing you need to set at least 2 properties in your service:

Connection Load Balancing Goal (either SHORT or LONG)

Runtime Load Balancing Goal (SERVICE_TIME or THROUGHPUT)

As mentioned in the introduction, .Net applications require AQ HA notifications to be set to true because they can't directly make use of Fast Application Notification (FAN) events. My JDBC application is fully capable of using FAN events; however, as you will see later, I am using the AQ notifications anyway to view the events.

Connected as the owner of the Oracle binaries, I created a new service to make use of both instances:

The service TESTSRV for database TEST has TEST1 and TEST2 as preferred instances, and the service should be started (manually) when the database is in the primary role. AQ notifications are enabled, and I chose the connection load balancing goal to be "short" (usually ok with web applications and connection pooling) and a runtime load balancing goal of service time (which should also be appropriate for the many short transactions typical of a web environment). The remaining parameters define Transparent Application Failover. Please refer to the output of "srvctl add service -h" for more information about the command line parameters.
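The command was along these lines – a sketch from memory, so verify the flags against "srvctl add service -h" on your own release; the TAF retry and delay values here are purely illustrative:

```shell
# Preferred instances TEST1,TEST2; start manually, in the PRIMARY role only;
# AQ HA notifications on; CLB goal SHORT; RLB goal SERVICE_TIME;
# plus basic TAF settings (type SELECT, method BASIC, illustrative retry/delay).
srvctl add service -d TEST -s TESTSRV -r TEST1,TEST2 \
    -l PRIMARY -y MANUAL -q TRUE -j SHORT -B SERVICE_TIME \
    -P BASIC -e SELECT -m BASIC -z 5 -w 10
```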

So to begin with I created the order entry schema (SOE) in preparation for a swingbench run. (I know that Swingbench's Order Entry is probably not the best benchmark out there, but my client knows and likes it.) Once about 10G of data had been generated I started a swingbench run with 300 users and a reasonably low think time (minimum transaction time 20ms, maximum 60ms). The connect string was //scan1.example.com:1800/TESTSRV

A query against gv$session showed an even balance of sessions, which was good:

select count(inst_id), inst_id
from gv$session
where username = 'SOE'
group by inst_id

However, whatever I did, I couldn't get the Runtime Load Balancing information in sys.sys$service_metrics_tab to change. The entries always looked like this (column user_data):
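The query to inspect those events is roughly the following – a sketch, and note that the queue table lives in the SYS schema, so you need appropriate privileges:

```sql
rem  The load-balancing advisory events are enqueued on the
rem  SYS$SERVICE_METRICS queue; the payload is in USER_DATA.
select  user_data
from    sys.sys$service_metrics_tab
order   by enq_time;
```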

That sort of made sense, as none of the nodes broke into a sweat – the system was > 50% idle with a load average of about 12. So that wouldn't cut it. Instead of experimenting with the Swingbench parameters, I decided to revert to the silly CPU burner: a while loop generating random numbers. I wasn't interested in I/O at this stage, and created this minimal script:

$ for i in $(seq 30); do
> sqlplus soe/soe@scan1.example.com:1800/TESTSRV @dothis &
> done

This created an even load on both nodes. I then started another 20 sessions on node 1 directly against TEST1 to trigger the change in behaviour. And sure enough, the top few lines of "top" revealed the difference. The output for node 1 was as follows:

Where it was initially even at 50-50, it soon became imbalanced, and TEST2 would be preferred after a few minutes into the test. So everything was working as expected; I just didn't manage to put enough load on the system initially.