June 24, 2011

Plan is same, Wait profile is different?

Here comes a nice one. What would you check when you are on a 3-node RAC, where 4 different databases are balanced with services, and the developer of one of the DBs comes and asks you why their newly deployed batch is 3 times slower on PRD compared to the QA box, which was snap copied (storage snap) from the production box? The RAC specs are exactly the same, and all parameters are the same.

(I hope you like the posh SQL monitoring output, which I really enjoy using. I have to admit, I wasn’t a big fan of SQL monitoring on the GUI, just because of the speed of OEM, until I discovered the sqlplus options like the ones above. Now I can’t stop using it, and it is a very good alternative to tracing, which is expensive by all means on PRD.)

From the output above, you can see the PRD database is doing direct path reads while the QA database is using db file scattered reads, and PRD is 3 times slower for each run. Somehow Oracle favours the 11g direct path implementation for full table scans on PRD (links on direct path reads: Doug Burns, Alex Fatkulin, Dion Cho). If you read the links, you can see that this has something to do with memory, the buffer cache, and the small table threshold (for more information about this threshold, see Charles Hooper — there is a great discussion with Jonathan Lewis in the comments — and Tanel Poder).
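If you don't have SQL monitoring output handy, one quick way to compare the wait profile of the same statement on both systems is to group ASH samples by event. This is only a sketch: it assumes Diagnostics Pack licensing, and the `&sql_id` substitution variable is illustrative.

```sql
-- Wait profile of one sql_id from ASH samples (Diagnostics Pack required)
select nvl(event, 'ON CPU') as event,
       count(*)             as samples
from   v$active_session_history
where  sql_id = '&sql_id'
group  by event
order  by samples desc;
```

Run on PRD it would be dominated by "direct path read", and on QA by "db file scattered read".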

Initially, when I checked the output, I saw something weird which is against our policy for this DB. We configured the user services to run on node 2 only, yet on QA the query runs on node 3 (see the instance id). I double checked this with the developer, and we found out that on the test app server they connect not with the service name but with the instance name, so he is able to run the query on node 3 on QA. I thought this might be the cause: since the Oracle instance on node 3 is not highly active in terms of memory usage, Oracle might favour expensive buffer cache reads over direct path reads. This was a goodish initial shot but ended in disappointment, because when I tested the same query on node 2 the results were again the same. Wasted 5 important minutes :)

After losing 5 minutes on that initial shot, it was time to be more scientific. Let's check what the "documented" memory parameters are.
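A minimal version of that check, run on both PRD and QA, could look like the following sketch (the parameter list is limited to the ones relevant here):

```sql
-- Documented memory parameters relevant to buffer cache sizing
select name, display_value
from   v$parameter
where  name in ('sga_max_size', 'sga_target',
                'db_cache_size', 'shared_pool_size')
order  by name;
```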

Let's check if small_table_threshold*5 (check Tanel's answer on the oracle-l question) is bigger than the number of blocks in BATCH_TBS, which is probably the decision maker for favouring direct path reads.
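For reference, a sketch of that comparison. _small_table_threshold is an undocumented parameter, so the x$ query must be run as SYS, and I am assuming BATCH_TBS is the tablespace holding the batch tables:

```sql
-- Undocumented parameter: run as SYS (value is in blocks)
select ksppinm  as name,
       ksppstvl as value
from   x$ksppi p, x$ksppcv v
where  p.indx = v.indx
and    p.ksppinm = '_small_table_threshold';

-- Total blocks of the segments in the batch tablespace
select sum(blocks) as blocks
from   dba_segments
where  tablespace_name = 'BATCH_TBS';
```

If the segment block count exceeds roughly 5 * _small_table_threshold, the table no longer qualifies as "small" and 11g may switch a serial full scan to direct path reads.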

As you can see, we found the root cause, but I don’t think the developer really cares about the root cause. All he needs is an answer: why the hell is this code working on QA, and how can we avoid this problem?

To answer his questions we need to look at the formula for small_table_threshold, which is 2% of the buffer cache size. This means our buffer cache must be bigger on QA. Let's check.
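Under automatic SGA management the actual component sizes can differ from the parameter settings, so a sketch of that check on both systems would be:

```sql
-- Actual buffer cache and shared pool sizes under ASMM
select component,
       round(current_size / 1024 / 1024) as current_mb
from   v$sga_dynamic_components
where  component in ('DEFAULT buffer cache', 'shared pool');
```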

As you can see, the 2% of buffer cache formula holds, and the PRD buffer cache is 3 times smaller than the QA buffer cache.
Why can this be? Here is the answer:

QA> select count(*) from v$sql;

  COUNT(*)
----------
     26135

PRD> select count(*) from v$sql;

  COUNT(*)
----------
     68711

We are running 3 times fewer SQL statements on QA, which means a 1.8GB shared pool is enough on QA but not enough on PRD.
This causes the buffer cache to shrink, and makes us think we have a visually "SAME" environment while we actually end up with realistically "DIFFERENT" environments.
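You can also watch this happening over time: v$sga_resize_ops records every grow/shrink operation ASMM has performed, so a sketch like the one below shows the shared pool growing at the buffer cache's expense:

```sql
-- History of ASMM resize operations for the two components of interest
select component,
       oper_type,
       round(initial_size / 1024 / 1024) as initial_mb,
       round(final_size   / 1024 / 1024) as final_mb,
       to_char(end_time, 'DD-MON-YYYY HH24:MI') as finished
from   v$sga_resize_ops
where  component in ('DEFAULT buffer cache', 'shared pool')
order  by end_time;
```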

My solution for the problem was creating a better index on the table, which needed to be there anyway (I cannot touch a system-wide PRD setting) :)
The real, long-term solution to avoid this kind of QA-PRD difference problem is, in my opinion, setting bare minimums via db_cache_size and shared_pool_size,
and not letting Oracle decide to shrink them below real production levels.
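Under ASMM, setting these parameters while sga_target is in use acts as a floor rather than a fixed size, so a sketch of that long-term fix would be (the values are illustrative, not recommendations):

```sql
-- Floors under sga_target: Oracle can still grow these components,
-- but can never shrink them below the values set here
alter system set db_cache_size    = 4G scope=both sid='*';
alter system set shared_pool_size = 2G scope=both sid='*';
```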

Hope you enjoyed it as much as I enjoyed diagnosing it.

!!Make sure you read all the links in this post; they are all informative.

Note: I needed to pause my upgrade series but will continue as soon as possible :( Really sorry if anyone is waiting for input from me.

The mon_sqlid2 script is as follows:

set long 10000000
set longchunksize 10000000
select dbms_sqltune.report_sql_monitor (sql_id=>'&1') from dual;


Coskan, I think this is an excellent blog entry as it takes the reader through a good diagnostic approach and is well written and displayed.

I noted that in your seemingly similar environments, shared_pool_reserved_size is 278M on prod and only 67M on QA.

What do you see as the benefit of setting sga_max_size to 10GB but sga_target to only 8GB? The additional 2GB is still grabbed by Oracle and is therefore unusable to anybody else. I appreciate that you can grow the database parameters later, without a bounce, but to my mind you might as well use that extra 2GB in the buffer cache for now.
regards
John

It normally should have been auto-adjusted to its default, which is 5% of the value of SHARED_POOL_SIZE, but it is a little higher in our case, and I am not sure why.
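For completeness, a quick sketch to compare the reserved size against that 5% default:

```sql
-- shared_pool_reserved_size vs. 5% of shared_pool_size
select p1.value / 1024 / 1024        as reserved_mb,
       p2.value / 1024 / 1024 * 0.05 as five_pct_of_shared_pool_mb
from   v$parameter p1, v$parameter p2
where  p1.name = 'shared_pool_reserved_size'
and    p2.name = 'shared_pool_size';
```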

When it comes to 10GB vs 8GB, it is just for the flexibility to grow "online" in case of a problem, and to secure the additional 2GB for this database. I always find it more practical to give resources to the end users gradually, so we can track down the real usage and not let them waste the resources. When you have a resource, somebody will use it, but the real question is: should they? :)

You are right, it normally asks for other parameters as well if you want to see the runtime of a SQL for a specific session. On this occasion I am sure this is the one I ran, so I only pass the sql_id, but if you want to be session specific then you had better pass the sql_id and sid together, like below:
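A session-specific version of the script might look like this (passing the sid as the second argument; report_sql_monitor also accepts a session serial number if the sid has been reused):

```sql
set long 10000000
set longchunksize 10000000
select dbms_sqltune.report_sql_monitor(sql_id     => '&1',
                                       session_id => '&2')
from   dual;
```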