The first approach relies on previous experience (both in the particular subject area/technology and with the context/environment in which the problem occurs). For example, if a patient comes to a doctor complaining about chest pain, then it would definitely be beneficial for the doctor (and also for the patient) to know more relevant info about the patient – the context. If the patient has just fallen off a 10-foot ladder, then it’d be more suitable to look for broken ribs. On the other hand, if the patient has been a long-time smoker and was watching TV on a couch when the pain started, then perhaps it’d be more suitable to start with an EKG (note that I’m not an expert on how the human body works, so should anybody complain about any pain in their chest to you, send them to a real doctor immediately!)

Anyway, if you’ve been administering a database full time for the last 5 years, you will probably know where to look immediately when a specific problem occurs. Note that I wrote “where to look immediately” here, not “what to change immediately”. Using previous experience to identify root causes of problems is obviously a perfectly valid approach which may get you to the solution very fast (again and again) – but it can work well only if you do have lots of previous experience in solving problems for that technology and that particular environment (in other words, you know the context). The big risk here is that if a “new” problem expresses itself with symptoms similar to the “old-and-well-known” problem, we could easily end up looking for and fixing the wrong issue. And if that doesn’t work, we continue by trying out a solution which helped with another problem last year. And if that doesn’t help either, the cycle continues: we dig up even more unlikely fixes which have been useful once in the past and apply them. And then we resort to googling and trying out whatever solutions anyone has suggested for problems others have experienced.

We have ended up in the Desperate Switch Flipping state. Usually this leads to flipping even larger switches, starting from Oracle session/instance parameters and random SQL hints, up to restarting servers and upgrading databases, operating systems and hardware – usually without any luck. This wastes time, doesn’t solve our problem and may cause even more trouble. This is not good. Checking out the usual suspects may help solve common recurring problems (hey, shouldn’t good specialists avoid recurring problems rather than fix them again and again?). However, it is very important to draw the line between checking for the usual suspects and falling into the DSF state, as appears to have happened to someone at Oracle-L.

Quoting:

“I ran Statspack reports at the highest level of detail until I was blue in the face. I ran traces. I set events. But I also am by nature intuitive and tend often to use intuition to solve a problem with facts to back up my intuitive conclusion. So after providing all of this stuff to Oracle Support, they were at a loss, well, they were very eager to look at corruption as a cause, because they didn’t have another solution.”

(Btw, I think the DSF state would be a suitable addition to the pathological DBA problems list which Gaja once started with CTD)

The other way to diagnose problems is to follow a systematic approach – an appropriate methodology based on knowledge of how computers work.

Following a systematic approach

I personally allow myself to check for the usual suspects based on my (sic!) experience in diagnosing problems on that system (or similar systems) and often do a quick Metalink search, but if this doesn’t reveal the root cause immediately, I will get systematic and start measuring the dynamics of the task I’m interested in. I will not start randomly shooting in the dark or doing DSF based on random internet posts showing up in search engines.

It’s easy to be systematic in the Oracle world, mainly for the following reasons:

Computers and computer systems are deterministic – they always generate exactly the same set of output conditions for the same set of input conditions. One could argue that human interaction, distributed systems and possible lower-level bugs/corruptions make large computer systems non-deterministic. However, when troubleshooting a specific problematic task it is enough to identify only the immediate points of system interaction which our task is experiencing. This is where instrumentation comes into play.

Oracle is well-instrumented – it is easy to narrow down problems with your task, as with the various tracing facilities, the Oracle wait interface and statistic counters you get a good understanding of what Oracle is doing when a problem occurs.

Computer software is not a black box – and Oracle is no exception. Any piece of software is just a bunch of computer instructions, physically clustered into functions, which the CPU then executes in a deterministic way (I avoided saying very deterministic here because you can’t have a more or less deterministic system – it’s either deterministic or not, and computer CPUs are definitely deterministic in their machine code execution). So it is fairly easy to follow where in the kernel code path Oracle is, even when the session statistics and wait interface are not enough (as I started blogging about in my previous post in this series).

I normally start measuring the problematic task I’m diagnosing from a high level. Based on the results I will decide where to drill down and measure the next set of data, more focused on a specific area. I will look at the data, measure again where appropriate, and usually I get closer to the root cause step by step.

So, how does all this look in practice? I think the best way is to provide some examples from real life. There are various types of problems you could experience with Oracle, but one of the most complex (and interesting) types is a performance problem. And this is where I start. I will start from a simple performance problem though – a single identifiable Oracle session experiencing performance issues.

Case study 1: Session hung – no wait no CPU

I often have to deal with issues where a session has apparently hung and hasn’t responded for a while. Usually these cases are something simple, such as waiting for a lock or a query plan gone wild, doing nested loop (cartesian) joins over millions of rows. Such cases are easy to solve; however, the following case was a pretty interesting one (Oracle 10.1 on Solaris 8 SPARC 64-bit):

1) Check V$SESSION_WAIT for the session

I queried v$session_wait for that session using my little script sw.sql (sw = Session Wait). As my script formats some data for readability, I added the output of a plain “raw” query against v$session_wait as well:

From the output of both queries we see that Oracle thinks the session is NOT waiting on anything (as the state is WAITED, not WAITING), thus being on CPU, apparently for the last 1500 seconds. In such cases I always check what the OS has to say, as OS tools will have a more accurate picture of the process’s state (the reason is that OS tools query the OS process tables directly, while Oracle’s V$SESSION_WAIT is populated by Oracle itself).
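The decoding logic I apply to v$session_wait can be sketched in a few lines. This is only an illustration – the column names match the real view, but the row below is made up, and the helper function is my own, not anything Oracle ships:

```python
# Sketch: interpreting a V$SESSION_WAIT row. The row dict below is
# made up for illustration; the keys mirror the real view's columns.
def interpret_session_wait(row):
    """Return a human-readable summary of what the session is doing."""
    if row["state"] == "WAITING":
        # The session is currently inside a wait event.
        return f"waiting on '{row['event']}' for {row['seconds_in_wait']}s"
    # Any other state (WAITED KNOWN TIME, WAITED SHORT TIME, ...) means
    # the wait has already ended -- as far as Oracle knows, the session
    # is now on CPU, and the wait columns describe the *last* wait.
    return (f"on CPU (last wait was '{row['event']}'; "
            f"{row['seconds_in_wait']}s since that wait started)")

row = {"state": "WAITED KNOWN TIME",
       "event": "SQL*Net message from client",
       "seconds_in_wait": 1500}
print(interpret_session_wait(row))
```

Mnemonic: WAITING means waiting; anything else means NOT waiting as far as Oracle’s instrumentation is concerned – which is exactly why the OS-level check in the next step matters.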

2) Check the process state at OS level

Both the prstat and ps output show that this process is sleeping and hasn’t used any noticeable CPU time at all, even though it has been active for over 1500 seconds.
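On Linux (the case above is Solaris, where prstat and ps serve this purpose) the same state letter that ps reports can be read straight out of the /proc filesystem. A minimal sketch, using a made-up stat line:

```python
# Sketch: extracting the process state the way ps does on Linux.
# The third field of /proc/<pid>/stat is the state letter:
# R = running, S = interruptible sleep, D = uninterruptible sleep, ...
def proc_state(stat_line):
    # The comm field (2nd field) is in parentheses and may contain
    # spaces, so split after the last closing parenthesis.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    return after_comm[0]

# A made-up /proc/<pid>/stat line for a sleeping "oracle" process:
sample = "9597 (oracle) S 1 9597 9597 0 -1 4202496 1500 0 2 0 12 7 0 0"
print(proc_state(sample))  # S -> sleeping
```

A state of S here would match what prstat showed: the process is asleep, not burning CPU, despite Oracle claiming it is not waiting.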

3) Check where in the Oracle kernel code path the process is sleeping

The two-step diagnosis above has already proven that this session has been waiting for something for the vast majority of its lifetime. And as Oracle’s wait interface has failed to record this wait, there’s no point in looking further into v$session_event and such. We go directly to the source and see what’s going on.

By going to the source I don’t mean looking into source code, as I don’t have it (and never had it). I rather mean that instead of making hypotheses and wild guesses about where the process could be stuck, we will just go and see where it is stuck.

Any debugger can provide us this information, and on Solaris and Linux we also have a handy pstack utility to use:

This is the full stack trace of the current state of server process 9597.

Even though it may look unreadable at first, it’s actually easy! Just start reading from the bottom up:

From the lowest line we see that the _start() function has been invoked (normally via the fork() + exec() syscalls in Unix) and the execution in that function has got to 0x17c bytes from the function start address, where a call to the next function up in the list has been made. Note that the function start address is specified in the 1st column of the output, immediately before the function name.
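The line layout just described (address, function name, arguments in brackets, then “+ offset” from the function start) can be split apart mechanically. Here is a small sketch; the sample frame is made up, but follows the Solaris pstack layout:

```python
import re

# Sketch: parsing one Solaris pstack output line of the form
#   <address> <function> (<args>) + <offset from function start>
PSTACK_LINE = re.compile(
    r"^\s*([0-9a-f]+)\s+(\w+)\s*\(([^)]*)\)\s*\+\s*(0x)?([0-9a-f]+)",
    re.IGNORECASE)

def parse_frame(line):
    m = PSTACK_LINE.match(line)
    if not m:
        return None
    addr, func, args, _, off = m.groups()
    return {"addr": int(addr, 16), "func": func,
            "args": args.strip(), "offset": int(off, 16)}

# Made-up frame resembling the trace discussed here:
frame = parse_frame("000000010009d1f4 main (2, ffffffff7ffff4d8) + 94")
print(frame["func"], hex(frame["offset"]))  # main 0x94
```

Walking such parsed frames from the bottom of the trace upwards gives exactly the call-chain narrative that follows below.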

In our case main() was the next function called (sounds familiar? ;) and the program execution had got to 0x94 bytes from its start when it called sou2o().

sou2o() called opidrv() after reaching the 0x48th byte from its start.

opidrv() (Oracle Program Interface DRiVer) called opiodr() after reaching the 0x354th byte from its start. Note that the numbers in brackets are the arguments passed to the called function. These may not always be complete and entirely accurate though, as they are read from thread stack storage, but some parameters may be passed in CPU registers and thus never written to the stack at all.

opiodr() called opiino() after reaching the 0x598th byte from its start. Note that this doesn’t mean that only 0x598 bytes’ worth of instructions have been executed – a lot of other function calls may have been made which returned after doing their work, and there may also have been lots of jumping back and forth inside the function itself (looping, for example).

opiino() called opitsk(). This is the place where the main low-level task dispatching is done. When there is no work to do, opitsk() normally waits in opikndf2(), which in turn waits in the n* functions which hold Oracle’s network-handling code. Once the network-handling functions return, they usually have some data sent from the client, for which opitsk() then calls an appropriate handling function. This is usually ttcpip(), as also seen in the stack trace above.

Despite its name, ttcpip() doesn’t directly have anything to do with TCP/IP – it actually means Two-Task Common PIPe read/write (TTC pipe). As the name suggests, this is the separating layer and gatekeeper between the “inner” side of the server code and the “outer” side exposed to the world. I think of it like a system call interface into the Oracle kernel. Many sanity checks and also conversions are done in the TTC layer.

Skipping a few rows, we see opiexe(), which is the OPI function for execute requests (they translate to EXEC lines in SQL trace; there are also corresponding opipar() and opifch() functions for PARSE and FETCH).

Continuing on, we see peicnt(), plsql_run() and pfrrun(), which are PL/SQL related functions, so obviously this process must be running some PL/SQL.

Continuing on, we see the rpiswu2() function – RPI stands for Recursive Program Interface and is used for setting up recursive calls such as recursive SQL, autonomous transactions, executing SQL from PL/SQL and vice versa, etc. Basically RPI calls set up a separate context, in the form of variables, state objects and so on, for calls which need to be executed before the current call can complete.

Continuing, we see a peftrusted() call. This one is interesting – it allows calling external libraries in trusted mode, meaning that they already are in (or are loaded into) the Oracle server process’s address space (as opposed to calling the library function through extproc). Oracle’s XMLType and the oracle_loader & oracle_datapump ODCI cartridges use this feature.

Continuing, we see a bunch of qm* calls; these handle XML datatypes and XML schemas.

Continuing towards the top of the stack, we see some UGA initialization related functions and a function whose name indicates that some code is going to be executed as ksys (I don’t know what ksys means, but as that qm_run_as_ksys() function calls rpiswu2() again, this is a good example of how Oracle sets up a separate context to execute code in elevated privilege mode).

Now, 0x6c8 bytes into the qm_init_uga_helper() function we see something different happening – this function calls the next function, _libc_sleep(), which in turn has issued the sigsuspend() system call. But from the first column we see that these functions reside in a quite different location in that process’s address space (at 0xffffffff7c956d2c, as opposed to 0x00000001xxxxxxxx…). So what’s going on?
Such a difference in function start addresses can be explained by examining the process’s address space:

From the above output we see that the Oracle binary itself has been mapped at address 0x100000000 in the process address space.

libc.so.1 has been mapped at 0xffffffff7c900000 and its total size is 728 kB, which means that the _libc_sleep() function is somewhere in the middle of the library (at address 0xffffffff7c956d2c – 0x56d2c is roughly 355k). The function starting address relative to the starting address of the library image could be verified using the nm utility:
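The offset arithmetic is worth making explicit – these are the addresses from this case, and the check simply confirms that the _libc_sleep() address falls inside the libc mapping reported by pmap:

```python
# Sketch: verifying that the _libc_sleep address lies inside the
# libc.so.1 mapping. Addresses are the ones from the case above.
libc_base  = 0xffffffff7c900000   # where libc.so.1 was mapped
libc_size  = 728 * 1024           # 728 kB mapping size
sleep_addr = 0xffffffff7c956d2c   # _libc_sleep address from pstack

offset = sleep_addr - libc_base
print(hex(offset), "=", offset, "bytes into libc")  # 0x56d2c = 355628
assert 0 <= offset < libc_size    # indeed inside the libc mapping
```

0x56d2c is 355628 bytes, i.e. roughly 355k into the library – matching the rough figure above, and well within the 728 kB mapping.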

So, after doing this stack reading we have conclusive evidence of where our Oracle process is in its execution. It’s stuck – in a sleep system call made by some XML code – and this is not instrumented by the Oracle wait interface. This is definitely abnormal and not just some performance or user code issue. Reading a stack trace came in very handy.

Armed with the knowledge that this wasn’t just a user process waiting uninstrumented on IO or spinning on the CPU, and after seeing the rpiswu2 / init_uga / _libc_sleep calls in the stack, I started wondering whether this could be something related to memory allocation or system library calls (yes, this is where previous experience kicked in again).

I looked into the Unix environment variables used for starting the Oracle instance and there it was – the $ORACLE_HOME/lib32 directory coming before $OH/lib in LD_LIBRARY_PATH. This caused some XMLDB functions to fail (maybe trying to load a 32-bit library into a 64-bit address space and failing had messed something up, perhaps in that process’s PLT section, where the addresses of shared-object library functions are loaded).
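A quick sanity check for this kind of misconfiguration can be sketched as follows. The function and path values are made up for illustration – on a real system you’d check the environment the instance was actually started with:

```python
import os

# Sketch: detect a 64-bit ORACLE_HOME whose 32-bit lib directory is
# resolved before the 64-bit one in LD_LIBRARY_PATH. Paths made up.
def lib32_shadows_lib(oracle_home, ld_library_path):
    dirs = ld_library_path.split(os.pathsep)
    lib32 = os.path.join(oracle_home, "lib32")
    lib64 = os.path.join(oracle_home, "lib")
    idx32 = dirs.index(lib32) if lib32 in dirs else None
    idx64 = dirs.index(lib64) if lib64 in dirs else None
    # Problematic only if both are present and lib32 comes first
    return idx32 is not None and idx64 is not None and idx32 < idx64

oh = "/u01/app/oracle/product/10.1"
bad = f"{oh}/lib32:{oh}/lib:/usr/lib"
print(lib32_shadows_lib(oh, bad))  # True -> 32-bit libs shadow 64-bit
```

This is exactly the situation found here: the dynamic linker resolved library names against $ORACLE_HOME/lib32 first, before the 64-bit $ORACLE_HOME/lib.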

While checking the stack trace in this example did not point the environment variable problem out to me directly, it helped to immediately eliminate many other causes for such hangs, like uninstrumented IO, or missing an IPC post from another process and thus never being woken up, etc. Also, the Oracle kernel functions reported at the top of the stack gave some indication of the nature of the work done (RPI setting up another call environment, UGA initialization – which can mean memory allocation – etc).

So, stack tracing can be very helpful where Oracle instrumentation doesn’t deliver. The key thing about stack traces is that from them you can see the ultimate computer reality – the most accurate place to check what an Oracle server process is doing. No statistic or wait event can beat the accuracy and reliability of stack traces, as this info comes directly from “the source” – the thread stack memory structure organized per the hardware ABI (Application Binary Interface) standard.

Note that there are a few issues (as always) with getting and interpreting stack traces, and there may be cases where even stack traces don’t give you an accurate picture (stack corruptions, for example). However, I’ve just realized, to my surprise, that it’s 1am already, so I’ll blog about it some other time ;-)

Conclusion

So, what am I getting at?

Well, I’ve just got started – I have a couple of examples from my experience of diagnosing ASSM performance bugs using stack tracing. When I have time to blog about these, it should be evident how crucial a tool stack tracing is for advanced and accurate troubleshooting. Nevertheless, stack tracing may not be required at all if Oracle/OS instrumentation can show you the accurate picture. Or even such tools may not always be required – if you have some serious previous experience on the system you’re troubleshooting (anyone firefought bad query plans due to bind variable peeking with comment on table T is ‘blah’ every Monday morning? ;-)

So, what I’m getting at is stated below – and I intend to write more posts about the second point:

Checking for the usual suspects may get you there, very fast.

Following a systematic approach will get you there.

References

Metalink Note 175982.1 ORA-600 Lookup Error Categories ( many Oracle kernel function prefixes are documented there ). Update: This note has been taken down from MOS (as of 2014), so you can search for “ORA-600 Lookup Error Categories” in Google to find some “cached” copies.

20 Responses to Advanced Oracle Troubleshooting Guide, Part 2: No magic is needed, systematic approach will do

Thanks Howard, I plan to blog about this more in my post about getting stack dumps. On Windows you could use oradebug dump errorstack; however, one needs to be careful with that, as oradebug is very intrusive compared to Solaris pstack. Linux’s pstack is somewhere in between, as it really just uses gdb’s backtrace command (Solaris pstack just reads the stack from /proc/SPID/mem in a very non-intrusive fashion)

Once anyone is able to extract Oracle symbol data from the %ORACLE_HOME%/rdbms/admin/*.SYM files on Windows, we could use any Windows debugger or Sysinternals Process Explorer for getting stack traces in a non-intrusive way.

I haven’t been able to extract the symbols from there yet (although the Oracle’s built-in stack unwind code seems to access them), so if anyone manages to do it, I’d be very interested to hear about it!

I think I pushed submit a bit too soon, as I meant to add that the wait_time column shows zero for a background process (DBWR) waiting on a semaphore, yet the OS indicates it is actually sleeping (prstat). So it looks like this is another example of where Oracle’s description might not totally match reality.

Thanks to all for the comments.
John, yep I’m already using v$session_wait.state which in turn is also decoded from x$ksusecst’s time_waited equivalent column (and from x$kslwt in 11g).
I do prefer to use the STATE column though, it’s easier to remember: waiting = waiting, anything else = NOT waiting :)

Jeroen, the seconds_in_wait column is increased for long waits in v$session_wait. I don’t use the time_waited column directly at all when diagnosing performance issues, as seconds_in_wait shows the time spent in the current wait, and if I’m interested in wait history then I either query v$session_wait_history for the last 10 events or use v$session_event or Session Snapper…

In 5.2+ it should be pretty safe, as it uses the /proc filesystem for reading the process stack (earlier versions may use /proc too)

The procstack command briefly suspends the probed process (just like scheduler does when timeslice has expired) and reads stack through /proc filesystem.

This is fast and non-intrusive.

On Linux, however, pstack is just a wrapper around the gdb command. So gdb starts up, attaches to the process, suspends it, loads a bunch of libraries and reads the process stack using slow ptrace() system calls. Also, I’ve seen pstack crash the probed process in some circumstances (like when running strace on the pstack process itself while attaching to the target)

Hi Tanel,
Thank you so much for this article.
I have followed many books and authors but I am not getting a proper direction. To my observation, it is like everybody is saying how the different organs of Oracle work, and I am not able to find any text which really teaches how you should start examining as a doctor, and once you find out the pulse rate is irregular (many diseases will have the same primary symptoms), what should be your next question (whether about the symptoms of diseases known to a particular geographical location, or anything else, like checking the temperature etc.)? I am sure there must be a guided way of checking human health, and then the doctor is supposed to use his experience or demographic/cartographic knowledge.

Please let me know if you happen to stumble upon any kind of such literature.
Or if you have written any such books.

About this post of yours, it is excellent, I got a direction today.
So as my Guru, kindly accept my salute.