January 13, 2012

Session Snapper on Duty 2 – (how to support your answers, with proof)

Warning: This long blog is for educational and quality purposes, especially targeting Junior DBAs and DBAs who do not know how to use the Session Snapper
(the greatest Oracle tuning tool yet) by Tanel Poder. If you feel comfortable with snapper and you think your relationship with your
clients just rocks, then there is no need to read the post, but you may still give it a go :)

The story started with a long email thread forwarded by a developer, asking the question below about a system I am responsible for but do not feel perfectly comfortable with in terms of how things really work.

Hi Josh (note to reader: Coskan is pronounced as Joshkahn so I use Josh at office)
Any ideas on how we can work with dbas to find out why our database is only committing at a range of
2-30 sec for 100 commits instead of the ideal 100 commits a sec for a third party process?

I answered

From my point of view, I will be requesting:
1- The time interval when this process runs
2- The SID information, if you have it, for when this process runs
3- Whether it gives the same results on UAT as well; if so, we can run the operation with tracing on (which needs a Change Request - a hassle on production)
Then we can continue to understand what the problem might be

Developer replied

It runs 24/7 and UAT testing is not possible since we can't point the feeds to UAT, but the username is USER_XXX
if it helps.

Long story short, all I understood was, in the simplest words, that they want to be able to process more data in the same amount of time, so I replied with the long mail below.

BEGINNING OF THE EMAIL

I did some sampling on your sessions to understand what is happening, and I think I can come up with rough conclusions and improvement suggestions.
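Note to blog reader: the sampling itself was nothing fancy; a snapper call along these lines does the job (the snapshot interval and count here are illustrative, not necessarily the exact ones I used):

```sql
-- Find the sessions of the application user first
select sid, serial#, sql_id, event
from   v$session
where  username = 'USER_XXX';

-- Then sample them with Tanel Poder's snapper:
-- 5-second snapshots, 100 times (~500 seconds), statistics + time model + wait events
@snapper stats,gather=stw 5 100 "select sid from v$session where username = 'USER_XXX'"
```

Snapper reads V$SESSTAT and the session wait interface in a loop, so it needs no licenses and changes nothing on the instance, which is what makes it safe to run on production.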

When we check the stats, you have done 326 commits in 500 seconds, which makes 0.65 commits per second; but if we take the 180 seconds of idle events (SQL*Net message to client + SQL*Net message from client) out, it is 326 commits in 320 (500-180) seconds of active time, which is roughly 1 commit per second.

When we look at the wait profile, we can see 276 seconds of “db file sequential read” (physical read) over the 320-second non-idle period, and 36 seconds of “log file sync”, which we will come to later.

If we focus on the “db file sequential read”, we can see it is 86% (276/320) of the active time, and there are 3 main contributing SQLs for this case.

There is nothing much we can do about the index unique scans other than changing the table structure, but I think there is room for improvement on the index range scan for sql_id ggjkmtv8cxxh3, a delete statement which is waiting on db file sequential read for 31% (500 x 20% / 320) of the active time.

The reason the delete is doing an index range scan is that we are trying to access the 3-column composite index PK_TABLE_T by using only 1 column, COL1. If the business logic lets you supply the other 2 columns as filters as well, we will get much more throughput with less LIO and PIO.
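To illustrate the difference (the table name and the COL2/COL3 column names are made up for the example; only COL1 and PK_TABLE_T come from the real case):

```sql
-- Current shape: only the leading PK column is supplied,
-- so the optimizer can only do an INDEX RANGE SCAN on PK_TABLE_T
delete from table_t where col1 = :b1;

-- If the business logic can supply the whole key, the same delete
-- becomes an INDEX UNIQUE SCAN: one index entry touched per execution
delete from table_t
where  col1 = :b1
and    col2 = :b2
and    col3 = :b3;
```

The unique scan version visits exactly one index entry per execution instead of walking a range of leaf blocks, which is where the LIO and PIO savings come from.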

I then did a second round of sampling. This time we have 94 seconds idle with 406 (500-94) seconds of processing time, which gives us 0.84 (343/406) commits per second, and 392 seconds of wait
(364 “db file sequential read” + 28 “log file sync”).

When we look at the SQLs, this time we can see that 160 (500 x 32%) of the 364 seconds (43% of active time) is being spent again on an index range scan (sql_id 29zz9n1b9sppr) on a delete statement, again due to the lack of composite index column filtering.

I again believe we can improve the access if we are able to supply all columns as filters and get an index unique scan.

I think I managed to make the index range scan issue clear (for more information see the bonus material below). Now let's see what the log file sync is all about and why we are seeing it.

The log file sync event has the definition below in the Oracle Documentation.

log file sync

When a user session commits, the session’s redo information needs to be flushed to the redo logfile. The user session will post the LGWR to write the log buffer to the redo log file. When the LGWR has finished writing, it will post the user session.

Wait Time: The wait time includes the writing of the log buffer and the post.

Parameter: buffer#
Description: The number of the physical buffer in the redo log buffer that needs to be synchronized

Potential causes

Wait Event: log file sync
General Area: I/O, over-committing
Possible Causes: Slow disks that store the online logs; un-batched commits
Look for / Examine: Check the disks that house the online redo logs for resource contention. Check the number of transactions (commits + rollbacks) each second, from V$SYSSTAT.
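The V$SYSSTAT check mentioned in the documentation is simply a matter of sampling the commit/rollback counters twice and dividing the delta by the interval; a sketch (the statistic names are the real ones, the two-snapshot arithmetic is up to you):

```sql
-- Cumulative counters since instance startup; take two snapshots
-- N seconds apart and divide the delta by N to get transactions per second
select name, value
from   v$sysstat
where  name in ('user commits', 'user rollbacks');
```

Snapper does exactly this kind of delta computation for you per session, which is why I reach for it first.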

The first rule of tuning log file sync is doing fewer commits, which means that instead of committing every DML or every single operation like

begin
  for i in 300001..600000 loop
    delete from t1 where id1 = to_char(i);
    commit;
  end loop;
end;
/

we should do commits in batches, like

begin
  for i in 300001..600000 loop
    delete from t1 where id1 = to_char(i);
  end loop;
  commit;
end;
/

To prove how you are doing commits, I took the SQL statistics before, sampled the session, took the SQL statistics again afterwards, and checked whether the number of user commits during the sampling interval matches the increase in SQL executions.

—-note to blog reader: this is how I did it. No rocket science (once you arm yourself with the right tools), just run the block as a whole

As you can see, both SQLs increased by 12 executions (after minus before), and both sessions show a user commits statistics delta of 12, which means there is a commit per execution, and that is what leads to the log file sync event.
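For reference, the before/after numbers came from queries along these lines (the SIDs below are placeholders; the sql_ids are the ones mentioned above):

```sql
-- Execution counts per statement, taken before and after the sampling interval
select sql_id, executions
from   v$sql
where  sql_id in ('ggjkmtv8cxxh3', '29zz9n1b9sppr');

-- "user commits" per session, also taken before and after
select s.sid, sn.name, s.value
from   v$sesstat s
join   v$statname sn on sn.statistic# = s.statistic#
where  sn.name = 'user commits'
and    s.sid in (1234, 5678);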

After proving that the app is not doing batch commits, I wondered whether this is avoidable or not in terms of business logic. Doing batch commits will definitely help to increase your overall throughput, plus overall system health.

So, in my humble opinion, we have two options (without me seeing any of the code logic):

1- Providing unique index access for the delete operations by supplying all columns of the index
2- Decreasing the number of commits by trying to do batch commits

Timings for the test, deleting 100K-row sets, are as follows:

Test case                 Elapsed (s)   Improvement
nonbatch-commit-range         84             0%
nonbatch-commit-unique        47            44%
batch-commit-range            55            34%
batch-commit-unique           35            58%

END OF THE EMAIL

Note to blog reader: (in case you are still reading) I hope you found something interesting about how I used the right tool at the right time,
and how I answered the development questions not by talking bollocks but with numbers and proof. One thing I need to admit: there were only 2 sessions running this operation, which made my life easy.

If you wonder about the duration of the whole diagnosis: it took 15 minutes in total to find out what was going on and 1 hour to write the mail, but that 1 hour saved at least 3-4 rounds of unnecessary email ping-pong, just because nobody else could back up their claims with numbers.

Quick Tip: If you are sending any output to anybody, use a fixed-width font like Courier New and make everybody's eyes happy.

I found your methodical explanation of this commit issue a perfect example of what Cary Millsap said: “Why guess when you can know?”. I like this blog article very much, even though committing once, outside the loop, is the thing I would have suggested immediately if I had been asked to help solve this issue.