Tanel Poder

2007/09/05

Anyway, a stack trace gives you the ultimate truth about what your program is doing, exactly right now. There are a couple of "buts", like stack corruption and missing symbol information, which may make the traces less useful for us, but for detailed hang & performance troubleshooting, stack traces are a goldmine.

So, I present another case study – how to diagnose a complete database hang when you can’t even log on to the database.

I was doing some performance diagnosis work and wanted to find out where some pointers in some stack traces were pointing in the shared pool. X$KSMSP is the place to find out which type of chunk owns a given address and which heap is that chunk's parent.
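
For example, to find out which chunk a given address falls into (the address below is made up for illustration), a query along these lines does the job. Just a sketch: KSMCHPTR is a RAW column, so it needs converting to a number first:

    select ksmchcom, ksmchcls, ksmchsiz
    from   x$ksmsp
    where  to_number('380001ABC0', 'XXXXXXXXXXXXXXXX')
           between to_number(rawtohex(ksmchptr), 'XXXXXXXXXXXXXXXX')
           and     to_number(rawtohex(ksmchptr), 'XXXXXXXXXXXXXXXX') + ksmchsiz - 1;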

However, I remember Jonathan Lewis mentioning at the Hotsos Symposium a year or two ago to be very careful with selecting from X$KSMSP on busy systems with large shared pools.

The reason is that X$KSMSP will take the shared pool latches and start scanning through the shared pool heaps, reporting every chunk in them.

This means that the shared pool latches are held for a very long time (the larger the shared pool, the more chunks it contains and the longer the scan takes).

This in turn means that you will not be able to allocate or deallocate any chunk from these pools while the latches are held.

This in turn means that you will not be able to log on to the database (as logon allocates shared pool space for session parameter values), you will not be able to hard parse a statement (as that needs to allocate shared pool space for your cursor and execution plan), and you may not even be able to do some soft parses (as those may also take shared pool latches in some cases).
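
By the way, if you happen to have another session already connected, with its cursors parsed, you can actually watch this happening: V$LATCHHOLDER shows who is camping on the latch. A quick sketch (an already-parsed query against this view shouldn't need the shared pool latch itself):

    select pid, sid, laddr, name
    from   v$latchholder
    where  name = 'shared pool';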

The shared pool size was set to 5GB, due to some memory leak issues in the past.

I decided not to run a plain select * from x$ksmsp against the full X$ table; instead I ran select * from x$ksmsp where rownum <= 10, just to see whether this approach of fetching a few rows at a time would work better. Obviously I did this in a test environment first… and it got completely hung. Oh man.

So I started up another sqlplus session to see what was going on – and it hung too. By then it was pretty clear to me that the culprit was this killer X$KSMSP query, which didn't care about my "where rownum <= 10" trick and probably kept sweeping through the whole shared pool – holding the shared pool latch all the time.

As I wasn’t able to log on to Oracle, the next thing was to check what my process was doing from OS level. The problem with that was that I didn’t have an idea what was my sessions process ID, I didn’t know when exactly I had logged on with my session. I didn’t have access to /dev/kmem to use lsof tool either which would have pointed me out which SPIDs correspond to TCP connections from my workstation’s IP.

Note that even though this was a test environment, it was a busy system, used heavily by various other processes and batch jobs. So I needed to be really sure which one was my process before killing it at the OS level.

Even though the CPU usage and the process owner names gave me some hints about which one could have been the troublemaker, I wanted proof before taking any action. And pstack was very helpful for that (note that due to this hang I wasn't even able to log on with another session!).
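
The shortlisting part is easy on Solaris: prstat sorted by CPU usage shows the top consumers (these are just the flags I normally use):

    prstat -s cpu -n 10     # top 10 processes, sorted by CPU usage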

So I ran pstack on the main suspect process, 26136. I picked this one because it was the only remote Oracle process using 100% of a CPU (on a 24-CPU server). The other two top processes were spawned by a local user over bequeath, thus their username was different.
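
The command itself is as simple as it gets:

    pstack 26136

The interesting part of the stack looked something like this (abridged and reconstructed from memory, innermost function first, keeping only the frames that matter here):

    ...
    kghscn
    ksmshp
    qerfxFetch
    qercoFetch
    ...

Two of these function names stood out: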

ksmshp: looks like this is the function which is called under the covers when X$KSMSP is accessed

kghscn: sounds a lot like Kernel Generic Heap SCaN :)

So, this stack trace proves that the process was definitely executing code doing a fetch from an X$ table (the qerfxFetch) and that this had eventually resulted in an operation which sounded like a heap scanner (kghscn). This is exactly what X$KSMSP does: it scans through the entire shared pool heap and returns a row for every memory chunk in there.

Why was there a qercoFetch function involved even though my query did not have a count(*) in it? The explanation is below:

The qercoFetch rowsource function was the one which should have enforced the "where rownum <= 10" part of my query. It should have stopped calling the qerfxFetch function after it had received its 10 rows, but for some reason it continued on. Or the reason could be that the qerfxFetch function is not sophisticated enough to return control back to the calling function with only part of its resultset, leaving qercoFetch no chance to decide whether to stop fetching half-way through.

Many row sources do have the flexibility of returning data in a cascading fashion up the stack. For example, nested loops are cascading (qerjoFetch, qerjotFetch), and hash joins are cascading from the moment the hash join build partition has been built in memory (qerhjFetch). Sort merge joins are not cascading from a high-level perspective, as both of the joined row sources need to be sorted before any rows can be returned. However, once both row sources have been sorted (either in memory or in a temp tablespace), the matched rows can be returned in a cascading fashion again, meaning that the calling function can fetch only X rows, do its processing, fetch the next X rows, and so on. The X is normally dependent on the arraysize – the number of rows you fetch from Oracle at a time. Anyway, I'll leave this interesting topic for a next post ;)
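
Getting back to my query: to tie those function names to the execution plan, the rownum predicate shows up as a COUNT STOPKEY row source, which (as the qerco prefix suggests) is the one qercoFetch executes, sitting right on top of the FIXED TABLE FULL row source executed by qerfxFetch. The plan for my query looks roughly like this:

    ---------------------------------------
    | Id | Operation          | Name      |
    ---------------------------------------
    |  0 | SELECT STATEMENT   |           |
    |* 1 |  COUNT STOPKEY     |           |
    |  2 |   FIXED TABLE FULL | X$KSMSP   |
    ---------------------------------------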

Continuing with my hang diagnosis: after I had identified the troublemaking X$KSMSP scanner process using prstat and pstack, I remembered that I could also have compared my sqlplus.exe process's start time against the server processes' start times (accounting for the time zone difference and clock drift, of course). This can be helpful when all your processes are either trying to be on CPU or sleeping – unless you like to pstack through all of them.
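
Something as simple as this shows the server process start times to compare against (the STIME column):

    ps -ef | grep oracle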

I identified one such process with a logon time close to my sqlplus.exe's start time and took a pstack of it:
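
The top of that stack looked something like the following (again abridged and partly reconstructed from memory; I don't recall the exact library cache layer function, hence the kgl placeholder):

    ...
    sskgpwwait
    skgpwwait
    kslges
    kghalo
    kgl...
    ...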

So this stack indicates (I might not be entirely correct) that this process had got stuck while trying to create a library cache object in the shared pool for an internal trigger during logon (this database used session auditing, by the way).

Anyway, when I killed my original session (the one scanning through X$KSMSP), the database worked OK again.
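
Since logging on to run an ALTER SYSTEM KILL SESSION was out of the question, the kill had to happen at the OS level, plain and simple:

    kill 26136      # followed by kill -9 if it refuses to die

PMON then cleans up after the killed process and releases any latches it held.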

The advantage of stack tracing was once again shown in a case where Oracle's instrumentation was not usable (due to my stupid mistake ;). This illustrates once more that you should test everything thoroughly in test environments, even if you think it should work OK.