Troubleshooting Node Reboot

Sometimes the simplest-looking issues turn out to be the hardest to debug, caused by something you would never have thought of. It had been a quiet week so far when I got paged for the one thing no DBA wants to hear about, and yes, you guessed right: node eviction. The first thing to do is make sure the DB came back up correctly before going off to find the cause. By the way, this is a 2-node RAC cluster running an 11.2.0.3 DB, and the first node is the one that went down. This happened roughly in the afternoon, and while I was looking for the cause, after making sure everything was up and running, the second node suddenly went down too. Now, that is really not a good sign; something was clearly going on. So I started looking at some logs, and this is what the system messages told me.

But at least it pointed to something. As you can see from the output above, processes are getting timed out, which suggests CPU starvation, I/O waits, or a memory issue where the OS is unable to fork new processes. The next step was to check the OSWatcher logs, and yes, we had OSWatcher set up; lucky us. So, here is the output from vmstat and top.
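As a side note, even when OSWatcher isn't around, the database keeps some OS-level history of its own: on platforms where they are populated, V$OSSTAT exposes paging counters, and DBA_HIST_OSSTAT retains them across AWR snapshots. A minimal sketch, assuming the VM_IN_BYTES/VM_OUT_BYTES stats exist on your platform:

    -- Paging activity per AWR snapshot; the counters are cumulative,
    -- so look at the deltas between consecutive snapshots
    SELECT s.begin_interval_time,
           o.stat_name,
           o.value
    FROM   dba_hist_osstat o
           JOIN dba_hist_snapshot s
             ON  s.snap_id         = o.snap_id
             AND s.dbid            = o.dbid
             AND s.instance_number = o.instance_number
    WHERE  o.stat_name IN ('VM_IN_BYTES', 'VM_OUT_BYTES')
    ORDER  BY s.begin_interval_time, o.stat_name;

It's coarser than vmstat samples, of course, but it survives a reboot.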

OK, now this confirms that we are swapping heavily with very little swap space left, and that explains everything. It also surfaces something that needs attention: see process 8517, an Oracle process, which is consuming 120G of virtual memory. So, we have identified the cause; now, how do we find out what that session was doing during that time?
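Had the instance stayed up, or on the next occurrence, tying an OS PID back to a session is straightforward with the live views: V$PROCESS carries the OS PID in SPID and joins to V$SESSION via the process address. A quick sketch, with 8517 standing in for the offending PID:

    -- Map an OS PID to its database session and current SQL
    SELECT s.sid, s.serial#, s.username, s.sql_id,
           p.spid, p.pga_used_mem, p.pga_alloc_mem
    FROM   v$process p
           JOIN v$session s
             ON s.paddr = p.addr
    WHERE  p.spid = '8517';

Note that the PGA_* columns track PGA only; the virtual size you see in top also includes the mapped SGA, so the two numbers won't match exactly.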

Well, AWR/ASH to the rescue, and believe me, I didn't find anything related to heavy memory usage by any SQL, and at the time I didn't know how to tie a PID to a SID using the historic views. Meanwhile the client was also working on it, and one of the users sent me an email with two things: first, a SELECT query he was running when the issue happened, and second, a small PL/SQL block, maybe ten lines of code. But the SELECT statement was throwing ORA-04030: out of process memory when trying to allocate 55600 bytes (kxs-heap-w,qmemNextBuf:Large Alloc), which fits the picture: ORA-04030 means the process could not get more memory from the OS for its PGA.
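For the record, the historic views can answer the PID question indirectly: from 11.2 onwards, ASH samples record PGA_ALLOCATED, so DBA_HIST_ACTIVE_SESS_HISTORY can surface the memory-hungry session and its SQL even after the node is gone. A rough sketch, with a hypothetical time window:

    -- Top sampled PGA consumers around the incident window
    SELECT *
    FROM  (SELECT sample_time, session_id, session_serial#, sql_id,
                  ROUND(pga_allocated / 1024 / 1024 / 1024, 1) AS pga_gb
           FROM   dba_hist_active_sess_history
           WHERE  sample_time BETWEEN TIMESTAMP '2013-01-01 13:00:00'  -- placeholder
                                  AND TIMESTAMP '2013-01-01 15:00:00'  -- placeholder
           ORDER  BY pga_allocated DESC)
    WHERE  rownum <= 20;

One caveat: ASH only samples active sessions, so a session that leaks memory while sitting idle may never show up here.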

So, this was a wow moment: we had found the culprit, and it was also reproducible. I ran that SQL while closely monitoring memory; it consumed barely 52 GB of virtual memory and died, never getting close to what we had been observing (120 GB), whereas I would have expected it to grow consistently. Then I ran that PL/SQL block to see if it might be causing the issue, and yes, that was the real culprit. Here is that code.