Why is my SYSAUX Tablespace so Big? Statistics_level=ALL June 2, 2011

One of my most popular postings is about why your SYSTEM tablespace could be rather large. Recently I’ve had issues with a SYSAUX tablespace being considerably larger than I expected, so I thought I would do a sister posting on the reason.

The client I was working with at the time was about to go live with a new application and database. For various reasons I was a little anxious about how the Java application (the User Interface) would actually call the stored PL/SQL code I had helped develop. Initial workloads would be low and so I asked that the STATISTICS_LEVEL be set to ALL, so that bind variables (amongst other things) would be gathered. This is on version 10.2.0.4, btw, enterprise edition and 4-node RAC.

We went live, issues were encountered and resolved, the usual way these things work. Then, a few weeks in and when everything was still very “exciting” from a problem resolution perspective, I got an odd email from the DBA team. Would they like us to add another datafile to the SYSAUX tablespace. Huh? I checked. I’d been watching the size of our application’s tablespaces but not the others {well, I was not supposed to be a DBA and I was watching an awful lot of other things}. Our SYSAUX tablespace was around 160GB in size, having pretty much filled it’s 5th datafile. Why? I checked to see what was taking up the space in the tablespace:

As you can see, almost all the space is being taken up by WRH$_LATCH_CHILDREN and WRH$_LATCH_CHILDREN_PK partitions. They are massive compared to other objects. A quick goggle did not come up with much other than many hits just listing what is in SYSAUX and the odd person also seeing SYSAUX being filled up with these objects and suggested ways to clear down space, nothing about the cause.

I had a chat with the DBAs and we quickly decided that this was going to be something to do with AWR given the name of objects – “WRH$_” objects are the things underlying AWR. The DBA suggested my settings of 15 minute intervals and 35 day retention was too aggressive. I knew this was not the case, I’ve had more aggressive snapshot intervals and longer retention periods on far busier systems than this. I did not have access to Metalink at that point so I asked the DBAs to raise a ticket, which they duly did.

Oracle support cogitated for a couple of days and came back with the advice to reduce the retention period. Hmmmm. Via the DBA I asked Oracle support to explain why those objects were so large when I had not seen this issue on several other systems. Was it a bug? I had by now corroborated with a friend from a previous site with 5 minute snapshot intervals and two months retention period and their SYSAUX tablespace was about 10GB all in. I did not want to go changing things if we did not know it would fix the issue as we really wanted to stop the growth of SYSAUX as soon as possible, not just try a hunch.

As you probably realise from the title of this blog, the issue was not the snapshot interval or retention period but the STATISTICS_LEVEL=ALL. The one and only hit you get in metalink if you search on WRH$_LATCH_CHILDREN is note 874518.1. From V10.1.0.2 to V11.1.0.7 setting this parameter to ALL is known to create a lot of data about Latch children and not clear it down when the AWR data is purged (Bug 8289729). The advice was to change STATISTICS_LEVEL and make the snapshot interval larger. I’d suggest you just need to alter the STATISTICS_LEVEL, unless you really, really need that extra information gathered. It seemed to take Oracle Support an extra day or two to find that note for us. {I’ve since checked out Metalink directly to confirm all this}.

So with a known issue we felt confident that altering the initialisation parameter would solve the issue. It took a while for us to change the STATISTICS_LEVEL on the production system – Change Control for that site is rather robust. This allowed us to see some other impacts of this issue.

The mmon process which looks after AWR data was becoming a top session in our OEM performance screens. In particular, a statement with SQL id 2prbzh4qfms7u that inserted into the WRH$_LATCH_CHILDREN table was taking several seconds to run each time and was running quite often {I include the SQL ID as it may be the same on many oracle V10 systems as it is internal code}:

The internal SQL inserting into wrh$_latch_children was becoming demanding

This was doing a lot of IO, by far the majority of the IO on our system at the time – it was a new system and we had been able to tune out a lot of the physical IO.

The physical IO requirements and 15-20 second elapsed time made this out most demanding statement on the system

We also now started to have issues with mmon running out of undo space when it ran at the same time as our daily load. This was particularly unfortunate as it coincided in a period of “intense management interest” in the daily load…

Yes, that is close to half a terabyte of SYSAUX and it is all used, more partitions have appeared and the total size of the largest segments in SYSAUX show how WRH$_LATCH_CHILDREN and WRH$_LATCH_CHILDREN_PK make up the vast majority of the space used.

Shortly after, we finally got permission to change the live system. The impact was immediate, mmon dropped from being the most demanding session, that SQL code dropped down the rankings and the issues with running out of undo ceased.

I was anxious to see if the old data got purged, as the Metalink note had suggested the data would not be purged. Thankfully, that was not the case. The space was slowly released as normal purging of data outside the retention period took place and after just over a month, the SYSAUX tablespace contained a lot less information and was mostly free space:

Now, how do we get that space back? I left that with the DBA team to resolve.

Oh, one last thing. I mentioned the above to a couple of the Oaktable lot in the pub a few weeks back. Their instant response was to say “You set STATISTICS_LEVEL to ALL on a live system?!? You are mad!”:-)

{Update, I’ve just spotted this posting by Colbran which is related. Last time I googled this I just got a stub with no information}

Like this:

Related

Well you didn’t mention it to me – since I wasn’t there :( – but I’d absolutely concur. STATISTICS_LEVEL=TYPICAL is entirely adequate for first pass diagnostics (like *is* there latch contention for example ) STATISTICS_LEVEL=ALL is for drilling down into problems that you already have some idea about – in my opinion anyway.

You were invited Niall, it isn’t my fault if i’m not scintilating enough to attract you to the pub {hmmm, that sounds kind-of wrong}.
As I said at the start of the posting, I knew the system would start quiet and I wanted information on binds, I thought the hit on performance (usually down to the high number of of calls to the OS timer) would be fine… And that side of it WAS! *sigh*.

At a previous site we wanted to be able to collect the binds and I proposed setting the STATISTICS_LEVEL=ALL and TIMED_OS_STATISTICS=0, to turn that off again as we did not need the OS stats and metalink suggested that was the most significant impact on performance (it is turned on by setting STATISTICS_LEVEL to all). Have you ever tried that out?

No, I am not suggesting you do not use the ALTER SESSION SET STATISTICS_LEVEL=ALL. This will only alter what you can see from your session, I believe.
Setting the initialisation parameter STATISTICS_LEVEL=ALL will have a system-wide effect; all database activity will generate full stats and full information will be stored away into the AWR tables.

Using the ALTER SESSION and the hint as you suggest are both really good techniques if you have the access to do so – the ability to alter sessions or the ability to run SQL on the production system.

I agree with nlitchfield, TYPICAL is fine, but oracle should be able to take care of an explosion in the SYSMAN tablespace and accordingly take down the data in the offending tables (LATCH_CHILDREN), and allow further analysis using alter session rather than system.
All I know is that it hurt :) Interesting that Metalink indicated the data would not be purged, but in fact it was. Basic I found was that the plan execution statistics were added at the level ALL which killed the system.

can you please summerize the effects of STATISTICS_LEVEL=ALL, sql plan will have actual stats (on top of optimizer estimated stats) is one effect I know of. What are the other places where STATISTICS_LEVEL=ALL is required to show more information? Thanks in advance for your help.
Kay

I had it set to ALL on a test database Oracle 11.2.0.3 and had the same issue with a rapidly increasing SYSAUX tablespace. I just changed the parameter. Thanks for the information, it made resolving my issue very quick and simple. Still waiting to see if space will clear.