Automated statistics gathering silently fails July 20, 2009

I posted a few days ago about what I feel is the best performance feature in Oracle and cited the Automated Stats Gathering job. A few people did not agree… 🙂 So, in the best traditions of politicians and lawyers the world over, I would now like to claim I was misunderstood and that is not what I said. I said the opposite…(* see later).

Of the many problems that can occur with the automated statistics job, the one I am going to pick up on today is the fact that it can choke on large table or table partition segments {and does so silently, at that}. What do I mean by this?

The automatic stats job runs every weekday night, by default between 22:00 and 06:00. It appears to have some logic whereby it looks at the number of table segments (full tables or partitions within a table) that need to be analysed, then calculates the amount of time it can spend on each segment. If it has 32 objects to analyse in the 8 hours available, it seems to decide it can afford to spend not 8/32 (a quarter) of an hour on any given segment but something more; it looks to be a couple of hours.

The job also tries to prioritise the most “in need” segments so they are considered first. I am not sure what it bases this on, but I think it is those with the largest percentage change; it is not quite as simple as that, though.

As a result of the above, this is what seems to happen. The stats job starts up and identifies a massive segment to analyse, say a 200GB table. It starts analysing it, allowing say 2 hours maximum, but after 2 hours it has not finished analysing the massive table, so it gives up.

The next night, it identifies the same massive segment to analyse (now 201GB) and maybe gives it an even higher priority. The job then tries to analyse it IN EXACTLY THE SAME WAY as before. Oddly enough {this is sarcasm}, it usually fails again. The only way it WILL succeed is if the automated job managed to analyse other segments the night before, so fewer segments need considering tonight and it can spend more time on each. In my experience this is unlikely.

As a result, the object will probably fail to be analysed every night until the weekend, when the automated job gets to spend 2 whole days analysing segments and so can give each one more time. That is, unless you spot the problem and run a manual dbms_stats.gather_xxx call yourself, with a low sample size that is likely to succeed.
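A manual gather with a deliberately small, fixed sample size will usually complete where the automated job times out. A minimal sketch (the owner, table name and 1% sample are my own illustrative choices, not from the post):

```sql
-- Gather stats on one large table with a deliberately low sample size.
-- ESTIMATE_PERCENT => 1 trades some accuracy for a run that actually finishes;
-- CASCADE => TRUE picks up the table's indexes as well.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => 'MY_SCHEMA',     -- hypothetical owner
    tabname          => 'MY_BIG_TABLE',  -- hypothetical 200GB table
    estimate_percent => 1,
    cascade          => TRUE,
    degree           => 4);
END;
/
```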

How can you tell this is happening, or has happened? Look at the history of schema- or database-wide statistics gathering operations in the view DBA_OPTSTAT_OPERATIONS. If a job ran all night, it probably choked {in the below, I look at the table SYS.WRI$_OPTSTAT_OPR, the underlying object}:
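A quick way to list recent gather operations and how long each took. A sketch assuming the 10g/11g dictionary, where START_TIME and END_TIME on DBA_OPTSTAT_OPERATIONS are timestamps, so their difference is an interval:

```sql
-- Recent stats gathering operations and their elapsed times; a run lasting
-- the whole maintenance window (~8 hours) suggests the job choked on a segment.
SELECT operation,
       target,
       start_time,
       end_time,
       EXTRACT(HOUR   FROM (end_time - start_time)) AS hrs,
       EXTRACT(MINUTE FROM (end_time - start_time)) AS mins
FROM   dba_optstat_operations
WHERE  start_time > SYSDATE - 14
ORDER  BY start_time DESC;
```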

The highlighted record is an example of where the automated stats job ran all night. It did this because it choked on a large segment. There are 3 other examples of this in this screen shot. Once the job starts choking, it becomes a constant problem!

{You may ask why this is not failing the next night, like I claimed. It was because we were sorting out the failure the next morning, but had not yet properly fixed the problem}

So, for very large segments, the automated stats job can fail to analyse them.

THIS FAILURE IS SILENT.

No error appears in the alert log that I have seen, and neither OEM nor Grid Control seems to flag it up anywhere on the system summary or performance monitoring pages.

In OEM you can look at the latest stats run and see how it did. It lies. {Well, the last time I checked it lied; I do not currently have access to OEM on a site suffering from the problem and, as you can imagine, clients do not want me to create massive tables just to fake this up for my blog 🙂 }. OEM has a screen to show the activity of the latest automated stats gathering run but, even though it has the option to look for objects for which there was a timeout, it does not show them.

Even though there appears to be the option to show objects for which the stats gather timed out, at least in this version it does not.

This situation can get completely out of hand. In the below, pulled out with this script from one site I have seen, the stats job was choking every night. And not only at night: it was by then choking every weekend too {click on “view plain” to see a better layout}.

The automated stats job was virtually at a standstill, as every night and every weekend it tried to do the same things in the same way and failed. So, unlike a human who finds they choke if they try to swallow a pork pie in one go and so stops doing it, Oracle keeps on trying to swallow it whole.

Tsch.

Worse still, it was burning CPU and IO doing a pointless set of partial table analyses all night and all weekend.

Double Tsch!

(*) This claim that I never said what I said about the stats job is of course a big weaselly lie. I did say it was the best feature, and I stick to this, though I think I am clinging on by my fingertips. I think it is the best step forward for smaller, average, not-too-demanding databases, those which tend to have less DBA support, but not for the big, complex, highly active ones. I did say right at the beginning that it has problems with them, honest! 🙂

But the situation is complicated in our case because we have both the auto stats gather job enabled (which people were unaware of) and the ETL batch gathering stats too, so the ETL batch took care of it. We set AUTOSTATS_TARGET to ORACLE before the next night.
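The change the comment describes can be done with DBMS_STATS. A sketch using the 10g SET_PARAM call (on 11g and later, SET_GLOBAL_PREFS is the documented route):

```sql
-- Restrict the automated job to dictionary/Oracle-owned objects only,
-- leaving application schemas to the ETL batch's own stats gathering.
-- Valid values are ALL, AUTO and ORACLE.
EXEC DBMS_STATS.SET_PARAM('AUTOSTATS_TARGET', 'ORACLE');

-- Check the value currently in force:
SELECT DBMS_STATS.GET_PARAM('AUTOSTATS_TARGET') FROM dual;
```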

Nice little query Doug.
In a bizarre coincidence, the most recent case I looked at was also complicated in exactly the same way: the DBA team was collecting stats with their own jobs, unaware that the automated stats job was running. They were also gathering stats on external tables like crazy, to support data loads, so the entries from that in DBA_TAB_STATS_HISTORY confused the picture somewhat.

There is also DBA_SCHEDULER_JOB_RUN_DETAILS, which offers an additional column, ERROR#, whose possible values I am sure are listed somewhere yet to be found. I don't have a failed execution to hand, so I am curious whether you have anything in ERROR# and ADDITIONAL_INFO.

Also, you could add a second query to identify user-defined gather stats runs, to complete the picture.

Nice idea. However, the ERROR# is set to 0 and the reason is set to “stop job called because associated window was closed”.
My original script on DBA_OPTSTAT_OPERATIONS does already pull out any user-executed schema or database stats gathering runs, because I simply don't exclude them. There is one shown in the example above 🙂

Hi Mani. Unfortunately that will not cure the issue, but it would give you breathing space. By doing a manual gather you will stop the automated job gathering stats on any table segment and its indexes until the segment or whole table has changed by 10% (NB this is under V10; under 11 you can alter this percentage). However, as soon as a segment or table that is large enough to make the job choke has changed by 10% or more, the job WILL choke again, as the method_opt and sample size are determined by the automated job in the same way each time.
Of course, it could take a long time to manually gather stats on all objects at a level that is suitable for the CBO to make good decisions…
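The 11g knob mentioned above is the STALE_PERCENT preference, settable per table or database-wide. A sketch (owner and table names are illustrative):

```sql
-- 11g+: raise the staleness threshold for one huge table so the automated
-- job does not re-select it after every 10% of change.
EXEC DBMS_STATS.SET_TABLE_PREFS('MY_SCHEMA', 'MY_BIG_TABLE', 'STALE_PERCENT', '50');

-- Or change the database-wide default:
EXEC DBMS_STATS.SET_GLOBAL_PREFS('STALE_PERCENT', '20');

-- Check what is in force for a given table:
SELECT DBMS_STATS.GET_PREFS('STALE_PERCENT', 'MY_SCHEMA', 'MY_BIG_TABLE') FROM dual;
```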

Joining this very late, but I'm having problems with this and cannot identify the table causing the problem. I see above you say “There IS a way to spot what the stats job choked on, I'll post today or tomorrow, when I get a chance from the day job” but I don't think you ever did! I'd be grateful if you could.

Martin, I know the blog is quite old now, so this is probably a feature that has appeared since, but the following script indicates when the auto gather stats job has not completed its work by the end of the maintenance window.
There still doesn’t seem to be an easy way of finding where it has got to. That is something I am looking at currently.

Thanks for that John.
It’s a while now since I had to track down the object the stats job was choking on, but the next posts in the series show how you can list what was analysed during the run via LAST_ANALYZED on DBA_TABLES, and also look at DBA_TAB_MODIFICATIONS to find the segments that are large and have over 10% change (the default percentage to be counted as STALE, though Oracle now allows you to set that at an object and system level, so you can be caught out). Anything large that is staying stale is probably your problem.
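The check described above can be scripted along these lines. A sketch assuming the standard DBA_TAB_MODIFICATIONS columns, comparing the change volume against NUM_ROWS from DBA_TABLES (you may want to run DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO first so the monitoring view is current):

```sql
-- Segments with more than 10% change since their last analyse. Large ones
-- that stay on this list night after night are the likely chokers.
SELECT m.table_owner,
       m.table_name,
       m.partition_name,
       m.inserts + m.updates + m.deletes AS changes,
       t.num_rows,
       t.last_analyzed
FROM   dba_tab_modifications m
JOIN   dba_tables t
       ON  t.owner      = m.table_owner
       AND t.table_name = m.table_name
WHERE  t.num_rows > 0
AND    (m.inserts + m.updates + m.deletes) > t.num_rows * 0.1
ORDER  BY changes DESC;
```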

I have also just logged in during the stats gather to see what is being analyzed and caught it that way, but that involves being awake at the time!

A final thing you can do is call DBMS_STATS.GATHER_DATABASE_STATS with OPTIONS set to LIST AUTO and OBJLIST set to an array of the right type (ObjectTab). This will give you a list of the segments that would be analysed, in the order they will be processed (I THINK; I might be wrong). The first big one that has been stale for a while is probably the one to fix.
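A sketch of that call, assuming the 10g signature where OBJLIST is an OUT parameter of type DBMS_STATS.OBJECTTAB:

```sql
-- List (without gathering anything) the objects the automated job would pick up.
-- Run with SET SERVEROUTPUT ON to see the list.
DECLARE
  l_objects DBMS_STATS.OBJECTTAB;
BEGIN
  DBMS_STATS.GATHER_DATABASE_STATS(
    options => 'LIST AUTO',
    objlist => l_objects);

  FOR i IN 1 .. l_objects.COUNT LOOP
    DBMS_OUTPUT.PUT_LINE(l_objects(i).ownname || '.' ||
                         l_objects(i).objname || ' (' ||
                         l_objects(i).objtype || ')');
  END LOOP;
END;
/
```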

I am a little late here, but the content is still good. This is a third way of finding out where the stats job spent its time (if you have the tuning/diagnostics pack):

--1. see whether we are hit by this problem
select client_name, window_name, job_start_time, job_duration, job_name, job_status
from DBA_AUTOTASK_JOB_HISTORY
where client_name = 'auto optimizer stats collection'
and job_duration > INTERVAL '3:50' HOUR TO MINUTE
and job_status = 'STOPPED'
order by window_start_time desc;

--2. get the session and serial id from the run details
select log_date, job_name, actual_start_date, run_duration, session_id, cpu_used
from SYS.DBA_SCHEDULER_JOB_RUN_DETAILS
where job_name in (
  select job_name from DBA_AUTOTASK_JOB_HISTORY
  where client_name = 'auto optimizer stats collection'
  and job_status = 'STOPPED'
  and job_duration > INTERVAL '3:50' HOUR TO MINUTE
) order by log_date desc;

--3. get the sql statements this session ran from ASH, needs diagnostics/tuning pack
select sql_id, count(*)
from v$active_session_history /*DBA_HIST_ACTIVE_SESS_HISTORY*/
where sample_time between trunc(sysdate) - 1 + 22/24 and trunc(sysdate) + 2/24 --last night 22:00 to 02:00
--and session_id = 466 and session_serial# = 64717  --hardcoded alternative to the subquery below
and session_id || ',' || session_serial# = (
  --SESSION_ID in DBA_SCHEDULER_JOB_RUN_DETAILS holds "sid,serial#"
  select session_id from (
    select actual_start_date, session_id,
           row_number() over (order by actual_start_date desc) rn
    from SYS.DBA_SCHEDULER_JOB_RUN_DETAILS
    where job_name in (
      select job_name from DBA_AUTOTASK_JOB_HISTORY
      where client_name = 'auto optimizer stats collection'
      and job_status = 'STOPPED'
      and job_duration > INTERVAL '3:50' HOUR TO MINUTE
    ) and actual_start_date > sysdate - 2 --assumes the job runs each day; looks at the last 2 days
  ) where rn = 1
)
group by sql_id
order by 2 desc;

--4. check the sql text the session ran, it probably contains the table name
select * from dba_hist_sqltext where sql_id = '98p2az1f8k9u4';