June 7, 2009

PGA leaks

Here’s a simple script that I created a short time ago while investigating a memory problem on a client site. The purpose of writing the script was, as always, to strip the client’s code back to a bare minimum in an attempt to work out the root cause of a problem. (Warning: if you want to run this script, your Oracle shadow process will grab about 1GB of PGA RAM.)

The procedure simply fills a pl/sql array with data, then uses the forall insert syntax to write the whole lot into a table. The first version of the code used a select with bulk collect to load the array (emulating the job run by the client), but I wanted to make the code even simpler.
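A minimal sketch of the kind of test case described above – the table and procedure names (t1, pga_test) are my invention rather than the original script, but the sizes match the text (1,000,000 strings of length 200):

```sql
create table t1 (v1 varchar2(200));

create or replace procedure pga_test as
    type tab_t is table of varchar2(200) index by binary_integer;
    m_tab  tab_t;
begin
    -- Fill the pl/sql array: one million 200-byte strings.
    for i in 1 .. 1000000 loop
        m_tab(i) := rpad('x', 200, 'x');
    end loop;

    -- dbms_lock.sleep(60);     -- pause here to check PGA from another session

    -- Write the whole array to the table in a single forall insert.
    forall i in 1 .. 1000000
        insert into t1 values (m_tab(i));

    commit;
end;
/
```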

After creating the table and procedure I simply called the procedure and, in a slightly more complex version of the code, introduced some long wait times (using dbms_lock.sleep) between steps. Then I used another session (logged on as SYS) to monitor the demands this procedure made on PGA memory as it ran.

This is the code I ran from the monitoring session (supplying the SID from the first session as the input parameter):
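The original listing hasn’t survived here, but a sketch of the sort of monitoring script described – querying the three memory sources for a given SID (the &1 substitution variable) – would look something like this:

```sql
-- Session-level memory statistics from v$sesstat.
select  sn.name, st.value
from    v$sesstat st, v$statname sn
where   st.sid = &1
and     sn.statistic# = st.statistic#
and     sn.name like '%ga memory%'
;

-- Process-level summary from v$process.
select  pr.pga_used_mem, pr.pga_alloc_mem, pr.pga_freeable_mem, pr.pga_max_mem
from    v$process pr, v$session se
where   se.sid = &1
and     pr.addr = se.paddr
;

-- Category breakdown from v$process_memory.
select  pm.category, pm.allocated, pm.used, pm.max_allocated
from    v$session se, v$process pr, v$process_memory pm
where   se.sid = &1
and     pr.addr = se.paddr
and     pm.pid  = pr.pid
;
```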

The figures I’m after are just the session and process memory from v$sesstat, and the two different ways of reporting process memory from v$process and v$process_memory. These are the results I got on a database running 10.2.0.3. The first set reports the state of PGA memory just after the creation of the array, the second set reports it just after the call to the procedure has ended.

First (obviously) v$sesstat tells us that we’ve used a lot of memory – which isn’t too surprising since we generated 1,000,000 character strings of length 200, so we know we’re going to see at least a couple of hundred megabytes of RAM being used.

But there’s much more to see in these three result sets. The report from v$process_memory says we’ve allocated more than 600MB in the “Other” category – which is a little surprising since (a) we know that we’ve used the memory for a pl/sql array – so why is it in “Other” – and (b) it’s twice as much memory as reported in v$sesstat and (c) we don’t see any of it in the “Used” column.

In fact, a dump of the pga and uga heaps (oradebug dump heapdump 5) shows that we have 313MB of RAM in the pga heap, of which 77MB is marked as free (but not yet released from the heap). From this we can probably infer that the code maintaining v$process_memory is broken. I suspect that we really ought to see that 313MB in both the “Allocated” and “Used” columns with, quite possibly, the 77MB appearing in a separate row (not visible in this report) for the category “Freeable”.
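For reference, this is how such a heap dump can be taken from a SYS session in SQL*Plus – the OS pid shown is a placeholder, not a value from the original test:

```sql
-- Attach to the target shadow process (12345 is a placeholder OS pid),
-- then dump the pga and uga heaps: level 5 covers both.
oradebug setospid 12345
oradebug dump heapdump 5
oradebug tracefile_name     -- reports where the trace file was written
```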

Note that the report from v$process is closer to the actual state given by the heap dump. It shows pga_used_mem at 236M – which means that the 77M free from the heap dump is correctly subtracted from the 313MB total, but possibly that 77MB should be reported as “PGA_Freeable_Mem”, rather than disappearing completely. This view still has a problem, though: the pga_alloc_mem and pga_max_mem are both displaying the same doubling effect as v$process_memory.

Bottom line: v$sesstat seems to show you most of the truth, with v$process helping you to separate the freeable memory out from the currently allocated memory, but both v$process and v$process_memory are not to be trusted. (Some of these anomalies are still present in 11.1.0.6 by the way).

The next bit of output shows you the report results after the procedure call had completed:

This is extraordinary. The session has allocated – and is still holding – more than 488MB in the PL/SQL category just because of that one million row insert. (My client was processing 3.8 million rows, and they “lost” 7.7GB of memory to this operation – the memory loss gives the appearance of growing geometrically with the number of rows).

In fact the penalty is more than that 488MB; the memory that had been allocated for the bulk collect in the “Other” category has been released, and the 110MB still remaining in that category is also a side effect of the forall insert.

Note, by the way, that the pga_max_mem from v$process_memory is still 313M higher than it should be, as is the max_allocated in the “Other” category in v$process_memory.

When I first saw this behaviour I decided it was obviously a bug and spent a good 20 minutes searching Metalink for possible matches without success (I was looking for “forall insert”, if I’d tried “bulk insert” I would have had more luck) – so I forwarded my test case to the rest of the Oak Table to see if they had any thoughts about it and got a fairly prompt reply from Tanel Poder that this was bug 5866410. (He’d done a pga heapdump, and searched Metalink for the label of the memory type that was using most of the space – like all good strategies it was so obvious after it had been explained !)

The bug is a pl/sql memory leak in “forall insert…” It’s fixed in 11.1.0.6 and 10.2.0.5, with backports available to 10.2.0.3 and 10.2.0.4 on a couple of platforms so far (Solaris and IBM ZLinux as I write).

So if your code does very large “forall insert” calls and you see Oracle error ORA-04030 from time to time, or workarea operations dumping to disc unexpectedly, you may temporarily be losing large amounts of memory to this bug – in which case check Metalink for the bug number and see if there’s a patch that’s good for your platform. (And if you’re on 10.2.0.3 or 10.2.0.4 and there isn’t a patch for your platform yet, there’s a reasonable chance you can get one created since it’s already been done on a couple of other platforms.)

Thanks a lot for your post – I’ve spent the last week working on an ORA-4030 problem. I don’t think it is related to this bug, but your information was very useful in helping me understand the problem. I think my problem was related to the PGA+SGA configuration versus the 2GB limit on 32-bit Windows.

It’s becoming more common for people running 32-bit Windows to run into ORA-04030 errors as their databases get bigger and (more importantly) busier – the single Oracle process grows beyond the basic 2GB limit imposed by the 32-bit Windows implementation.

You’ve probably read up on things like the 3GB switch, AWE for memory windowing, and PAE for the extensions – but none of the options for getting past the 32 bit limit are really ideal for a busy system.

Even though the bug may not be relevant, some code to take regular snapshots of the v$sessstat figures may help you track down memory hogs before they cause other sessions to fail.
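One way to take the regular snapshots suggested above – this is a sketch, and the snapshot table name is my own, not from the original post:

```sql
-- Create a snapshot table holding timestamped per-session PGA figures.
create table pga_snap as
select  sysdate snap_time, st.sid, sn.name, st.value
from    v$sesstat st, v$statname sn
where   sn.statistic# = st.statistic#
and     sn.name in ('session pga memory', 'session pga memory max')
;

-- Then repeat this insert every few minutes (dbms_scheduler, cron, etc.)
-- and diff successive snapshots to find the sessions that keep growing.
insert into pga_snap
select  sysdate, st.sid, sn.name, st.value
from    v$sesstat st, v$statname sn
where   sn.statistic# = st.statistic#
and     sn.name in ('session pga memory', 'session pga memory max')
;
commit;
```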

That’s a nice example – but on the plus side, the memory does get released when the call ends.

I quite understand the point of creating a script that demonstrates the threat in the minimum number of lines even if the script itself looks unreal – and totally approve – you should send this one to Oracle to see if this is necessary, or whether it’s a leak.

One of the things I like about the blog mechanism is how it can pull related bits of information together – here’s another example of potentially nasty PGA leakage that I can link to because Tom Kyte pointed one of his threads to this blog.

A colleague of mine pointed out your article to me re. PGA leaks. Since it’s something that I’ve done quite a bit of work on – and was quite interesting work, I thought I would add some comments.

We are running Oracle 10g on AIX and have had quite a few issues with memory leaks involving processes selecting into PL/SQL memory objects with bulk collect (we are running a high-speed trading system).

Under load test conditions we could only run for a few hours before consuming all the server memory (in the region of 18GB!).

What I discovered was that the problem was exacerbated hugely by having hash joins enabled and workarea_size_policy set to AUTO. We had to go live with hash joins disabled and manually setting our memory work areas.

I performed some in-depth memory analysis on this problem. AIX is terrific in this case, since it provides the ‘svmon -P’ command that allows you to get a detailed process memory heap dump.

I ended up writing some code that performed a time series analysis of all the processes we had running on our server – it took ‘svmon -P’ reports for each process, and compared the memory usage for each type of memory heap for that process over time. The results were very interesting…

What it showed was that Oracle was allocating large shared memory areas and then ‘sub-allocating’ blocks of memory to processes as required. Over time (and because of a bug in the memory heap management code) this shared block of memory would fill, and Oracle would then need to allocate a new block.

This had an interesting implication – which we noticed on our systems. That is, killing the sessions (in our case, a Java process which created a number of sessions onto the database) did not always return all of the RAM – and, in fact, we could never be quite sure how much RAM we would recover from a Java process restart.

This makes sense: since Oracle is ‘sub-allocating’ from a large shared memory segment, there would be other processes, which had not been killed, still ‘registering’ an ‘interest’ in that memory block.

We found that the best way to retrieve all our RAM in a clean way was to bounce the instance.

As you discussed, the v$ views did not report on the situation correctly, but I guess that comes hand-in-hand with the bug in the memory heap management code.

In the end, Oracle provided a patch for us (on 10.2.0.3). They included the fix as part of 10.2.0.4 release (well, for AIX anyway).

As an aside, we have also noticed slow but gradual memory leaks and performance degradation with long-lived dedicated connections executing large numbers of ‘select *’ style SQL statements, even when the resulting datasets are fairly small. We have consequently banned the use of ‘select *’ from our codebase. ;)

Two things struck me in particular. It sounds as if AIX has some very specific ways of using memory that don’t appear with other Unix flavours. (Or possibly what you’re describing is a consequence of the type of connection pooling that you used.)

The other thing is that you’ve turned Hans-Peter’s “artificial” demonstration into a live problem – his leak was on the “describe”, and that’s what Oracle has to do if you give it a ‘select *’.

I am wondering about the “kill the java process” having variable success – if your code populated the program, application, or module columns of v$session (with the Java equivalent of dbms_application_info) you may have been able to identify all the sessions associated with the Java and done an ‘alter system kill session’ on them to release the memory.
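A sketch of that suggestion – the module name ‘TRADE_APP’ and the sid/serial# values are hypothetical, purely for illustration:

```sql
-- On the Java side, each connection would first tag itself, e.g. via
-- the JDBC equivalent of:
--   dbms_application_info.set_module('TRADE_APP', null);

-- The DBA can then pick out just those sessions...
select  sid, serial#
from    v$session
where   module = 'TRADE_APP'
;

-- ...and kill each one individually (placeholder sid/serial# shown):
alter system kill session '99,1234';
```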

Regarding Dion’s comment about having free chunks only after the db call ends:

It may not always be the case. There are cases of memory leaks where not all memory is freed at the end of the db call. But yes, I took the PGA heapdump with oradebug while the problem was ongoing – the db call was still running. And that should be obvious anyway – you can’t systematically diagnose a problem if you don’t gather evidence DURING its occurrence.

Regarding the AIX memory allocation: it should work like other standard Unixes. I haven’t used svmon for such troubleshooting, but procmap instead; it also shows you the breakdown of memory segments in a process address space. The PGA memory allocated shouldn’t really be shared – rather, it’s mmap()ed from /dev/zero (with private flags) – but svmon may think it’s shared.

AIX should take care of releasing all process memory once the process exits – are you using shared servers perhaps? That would explain why killing a session or Java client wouldn’t necessarily release all memory, as the server processes (with their leaked PGAs) would still remain running.

From the ‘svmon’ information, and the behaviour we saw when stop/starting Java processes (i.e. not retrieving all the RAM), my impression was that memory was being taken from ‘shared memory’ segments by the processes.

Tanel – in answer to your question – no, we are not using shared servers. You are quite right – I would not expect process private PGA memory to be shared either. In this case, perhaps this ‘operation’ is not intended to be a ‘private’ one.

If that were the case, then perhaps this had something to do with an error in the management of cursor memory in a region where Oracle needs that information to be ‘viewable’ by all processes (shared data structures etc.). That would explain some of the behaviour, but whether that’s the answer is another question!

Hans-Peter,
There are two or three different mechanisms that impdp uses for the bulk of the data – but I wouldn’t expect them to use pl/sql. On the other hand it might use pl/sql for some of the metadata processing.

It’s probably worth checking on a small system with system level tracing enabled. Try using expdp/impdp on a complicated schema with only a little data.

If you’ve got what seems to be a memory leak on an import, though, perhaps it’s simply some OCI leak on the data load.

I faced a memory leak from hash joins a year ago. Oracle Support didn’t want to accept this as a bug, and told me the process had allocated extra space for different operations (for example my PL/SQL area etc.). We are using a 1GB “ulimit” for private memory for each process at the OS level.
The dump of PGA+UGA (alter session set events ‘4030 trace name heapdump level 5’) showed us only that there was something like 900MB of memory allocated to the kxs-heap-w subheap. So, to see more, I had to start dumping the subheaps (with the help of Steve Adams’ book), and that showed that 890MB was allocated in the “QERHJ hash-joi” subheap – and Oracle Support accepted the bug.

The “PLS non-lib hp” subheap can, of course, be dumped in your example. The address of the subheap can be found in the PGA+UGA heapdump trace.

So the address, in hex, is “11047e998” (this is, of course, the handle for my session – it will be different on different systems). It can also be queried from an x$ view from the kernel services layer, x$ksmup – the column KSMCHPAR shows the result.

You can dump this subheap after running your procedure – either by setting the 4030 trace (alter session set events ‘4030 trace name heapdump_addr level 1, addr 0x11047e998’) or directly from oradebug if the ulimit is not set and you don’t hit ORA-04030 directly. The dump of the subheap will then appear in the trace (extra lines):

And there is a bug: Oracle does not free this subheap. It is marked as free, but not returned to the OS. Tanel’s heap analyzer won’t work satisfactorily with subheaps dumped, so you can suggest he write version 1.10 :)
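Putting the steps from this comment together – treat this as a sketch, since x$ views are version-dependent and the subheap address is specific to the commenter’s session:

```sql
-- Find the handle of the "PLS non-lib hp" subheap via the x$ view
-- mentioned above (KSMCHPAR holds the parent heap address).
select  ksmchpar
from    x$ksmup
where   ksmchcom like 'PLS non-lib hp%'
;

-- Then arrange a dump of that subheap when ORA-04030 fires
-- (the address shown is the example value from the comment above):
alter session set events '4030 trace name heapdump_addr level 1, addr 0x11047e998';
```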

We have two databases on a Solaris box. The RAM is 16GB and the sum of the SGA + PGA for both databases totals approximately 11GB. Now vmstat is showing 2GB of free memory, which is a concern here. I wanted to know what the areas of concern are, and where I can look for them.

By “pga + sga” are you indicating that you are running 11g and using the memory target, or are you possibly using 10g and setting the sga target and pga target separately?

Is this memory report from the moments immediately after the database starts up when nothing more than the background processes are running, or is it when both systems are running with their full quota of processes?

You haven’t mentioned your operating system, but you might want to read Christo Kutrovski’s presentation on memory management – here’s a link from my blog, which includes a comment where Chen Shapira makes a point about the tcp buffer settings.

Just as a starting point, though: remember that each process that connects to the SGA will need some memory, including memory to build a memory map of the SGA. The more processes you attach and the larger the SGA, the more memory “disappears” even if none of the sessions do anything.

This looks a little like something we experience, but the symptoms are slightly different. Have you any advice as to what we could do to try to debug it?

Symptoms: over time the virtual memory used by the Oracle process (10.1.0.5 on Windows 2003 R2) increases until it hits the 3GB limit. The SGA and PGA figures show massively less memory in use than the virtual memory shown by the Sysinternals tools.

Since Oracle operates on Windows as a single process with multiple threads it is perhaps a little more likely that a memory leak from an Oracle “process” might not be released.

Since you’ve got a fairly elderly copy of Oracle there, you may have a bug that has been fixed in a later release so it’s worth checking metalink and patch release “bugs fixed” listing to see if you can find a match.

The only little oddity that stands out is the gap in process IDs (particularly the jump from 60 to 88) – it’s not terribly important, but might be a hint that something odd is happening with process (thread) re-use.

The only thing I can think of is to run up a little task that takes a snapshot of the differing view points every minute or two (don’t bother capturing the SGA, do check the different sources for PGA memory) to see if you spot any oddity (e.g. a timed pattern of growth, or sudden jumps).

I am facing the same situation and the same problem.
On one windows 2003 server (32bits), an oracle database (10.2.0.4 32bits) used to crash with ORA-4030 error. First, I have increased the sga_max and sga_target, giving the database more memory after I noticed that the virtual memory was continuously increasing. So I used the /3GD parameter on the server and allowed the database to use about 2,8 Gb of memory. It was not enough… I have a small program to monitor the virtual memory (using pslist, using windows registry …) and I saw that from time to time in a month, the virtual memory was increasing suddenly (about 300 M° in two or three minutes) but it was never released. Lately, I have upgraded the database because this was the only thing I could do to solve the problem (the 11g upgrade is not my customer’s priority …). So I moved the database to a 64bits Windows 2003 server, EE R2 with about 45 Gb of RAM . I upgraded the database to a Oracle 10.2.0.4 64bits and continued monitoring the virtual memory. It is still increasing … Now the memory is about 3,8 Gb and it’s growing .. slowly but surely…
I also tried to add the “secret parameter” to the init.ora file, but it did nothing. I tried to use dead connection detection but it was useless either …
I would be pleased to paste a small snapshot of the monitoring but I don’t know how to do it here … If you want it, mail me and I’ll send it back…

Sometimes people ask questions like this and happen to get lucky that I recognise the problem, or have some very specific suggestion that may help – but I’m not trying to be a replacement for Oracle Support. In your case, all I can say is that it sounds like a slow memory leak (on x64, the 32-bit “big jump” bit sounded like a runaway pl/sql process or a violent explosion of logons); the only suggestion I have to make to track it down is to run some code in a fairly tight loop that keeps calculating and reporting changes in v$sysstat in case that shows you any sudden strange changes. Apart from that, there’s the VMMAP utility that MOS references – could that give you some clue that (for example) connects the constant growth with one or two specific processes such as m000 or DBWR ?

Since you were kind enough to offer some suggestions on resolving this, I thought I’d follow up with the solution I eventually tracked down.

It turns out there is a known bug in Oracle 10gR2 (and we have now confirmed it in 10gR1 as well) on x64 chips whereby threads are not cleaned up correctly in the Oracle JVM after the call completes, leading to “zombie” threads which take up memory but never get cleaned out. This means that on an x64 Windows platform the virtual memory of the Oracle.exe process increases continuously if you use the JVM.

What do you need to hit this bug
——————————–
1. 64 bit chip (note: not 64 bit OS – we hit this on 32 bit)
2. 10gR1 or 10gR2 (note: supposed to be fixed in 11gr1)
3. Windows OS (2003 in our case)
4. Run Java in the database

Symptoms
——–
1. Virtual memory of Oracle.exe increases continuously
2. If you run VMMap (Sysinternals) you will see a massive allocation for thread stack space (we had 600MB in thread stack with 20 active sessions!)
3. The thread stack breakdown will show lots of threads without an associated thread id

Solution
——–
1. Go onto metalink and read 1062406.1

The solution is an undocumented parameter and I am therefore not about to publish it in case somebody decides to set it without being advised to by Oracle Support.

I’m guessing a simple test case would be easy to do – x64 box, set up some java stored procedures, run them and keep watch on the virtual memory!

The session uga memory statistic seems to be very high. I have the following questions:

1. What could be the reason for such high session uga memory – what does it really mean?
2. session uga memory max is smaller than session uga memory – does this indicate anything we should fix or analyse?

Sorry I didn’t notice this one when it arrived.
You don’t say which version of Oracle – not that that would actually have made any difference in this case, but it’s generally a useful detail to include.
Given that the MAX is less than the current value I would be inclined to assume it’s wrong. As a general guideline, when the SYSTEM figures don’t seem to make sense I look at the session figures to see if there are any clues there.
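The sort of session-level cross-check described might look like this (a sketch):

```sql
-- Compare the per-session UGA figures with the suspect system-level
-- numbers, largest first, to see which sessions account for the total.
select  st.sid, sn.name, st.value
from    v$sesstat st, v$statname sn
where   sn.statistic# = st.statistic#
and     sn.name in ('session uga memory', 'session uga memory max')
order by
        st.value desc
;
```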
