Your Oracle database – production DB, of course – is hanging. All users are blocked. You quickly check the obvious suspects (archivelog destination full, system swapping, etc.) but it’s something else. Even you, the Oracle DBA, cannot do anything: any select hangs. And maybe you cannot even connect with a simple ‘sqlplus / as sysdba’.

What do you do? There are several ways to investigate further (strace or truss, for example), but they take time. And your boss is clear: the only thing that matters is getting production running again as soon as possible. No time to investigate. SHUTDOWN ABORT and restart.

OK, but now that everything is back to normal, your boss’s rules have changed: the system was down for 15 minutes, so we have to provide an explanation. Root Cause Analysis.

But how will you investigate now? You have restarted everything, so all V$ information is gone. You have Diagnostic Pack? But the system was hung: no ASH information went to disk. You can open an SR, but what information will you give?

Hang Analyze

The next time it happens, you need a way to get some information that can be analyzed post mortem. But you need to be able to do that very quickly, just before your boss shouts ‘shutdown abort now’. And this is why I’ve put it at the beginning of the post, so that you can find it quickly if you need it…
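A minimal sketch of such a quick capture, assuming you can still connect as sysdba. The oradebug commands are standard, but the dump levels (3 for hanganalyze, 266 for systemstate) are commonly used values chosen here as an assumption; adjust them to what Oracle Support asks for.

```sql
-- Run from a SQL*Plus session started with: sqlplus / as sysdba
-- Attach oradebug to our own server process
oradebug setmypid
-- Remove the trace file size limit so the dumps are not truncated
oradebug unlimit
-- Hang analysis (level 3 is an assumed, commonly used level)
oradebug hanganalyze 3
-- System state dump (level 266 is an assumed, commonly used level)
oradebug dump systemstate 266
-- Show where the trace file was written
oradebug tracefile_name
```

The trace file name printed by the last command is what you keep for the post-mortem or attach to the SR.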

Running a hanganalyze and a systemstate dump takes only a few seconds and generates all the information needed for a post-mortem. If you can take one more minute, you will even be able to read the first lines of the hanganalyze output, identify a true hanging situation, and maybe just kill the root of the blocking sessions instead of doing a merciless restart.

In order to show you the kind of output you get, I’ve run a few jobs locking the same resources (TM locks) – which is not a true hanging situation because the blocking session can resolve the situation.

Analysing the System State takes much longer than the hanganalyze, but it has more information.

V$WAIT_CHAINS

When the blocking situation is not so desperate and you just want to see what is blocking, the hanganalyze information is also available online in V$WAIT_CHAINS. The advantage over ASH is that you see all processes (not only foreground, not only active ones).
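As a rough sketch, a query of this shape lists the current chains. The choice of columns is an assumption on my side; CHAIN_IS_CYCLE and BLOCKER_IS_VALID are the full names of view columns that SQL*Plus may display truncated.

```sql
-- Sketch: one row per session taking part in a wait chain.
-- All columns exist in V$WAIT_CHAINS (11g and later).
select chain_id,
       chain_is_cycle,
       chain_signature,
       instance,
       osid,
       pid,
       sid,
       blocker_is_valid
from   v$wait_chains
order  by chain_id;
```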

Here is an example:

CHAIN_ID  CHAIN  CHAIN_SIGNATURE                                                        INSTANCE  OSID  PID  SID  BLOCK
--------  -----  ---------------------------------------------------------------------  --------  ----  ---  ---  -----
       1  FALSE  'PL/SQL lock timer' <='enq: TM - contention' <='enq: TM - contention'         1  7929   42   23  TRUE
       1  FALSE  'PL/SQL lock timer' <='enq: TM - contention' <='enq: TM - contention'         1  7927   41  254  TRUE
       1  FALSE  'PL/SQL lock timer' <='enq: TM - contention' <='enq: TM - contention'         1  7925   39  256  FALSE
       2  FALSE  'PL/SQL lock timer' <='enq: TM - contention' <='enq: TM - contention'         1  7933   46   25  TRUE
       3  FALSE  'PL/SQL lock timer' <='enq: TM - contention' <='enq: TM - contention'         1  7931   45  260  TRUE
       4  FALSE  'PL/SQL lock timer' <='enq: TM - contention' <='enq: TM - contention'         1  7935   47  262  TRUE

ASH Dump

There is something else that you can get if you have Diagnostic Pack: the ASH information can be dumped to a trace file even if it cannot be collected in the database.

oradebug dump ashdumpseconds 30

This gathers the ASH samples from the last 30 seconds, and the trace file even includes the sqlldr control file needed to load them into an ASH-like table.

sqlplus -prelim

But what can you do if you can’t even connect / as sysdba?
There is the ‘preliminary connection’, which does not create a session:

sqlplus -prelim / as sysdba

With that you will be able to get a systemstate and an ashdump.
But unfortunately, since 11.2.0.2 you cannot get a hanganalyze:

ERROR: Can not perform hang analysis dump without a process state object and a session state object.

But there is a workaround for that (from Tanel Poder’s blog): try to use a session that is already connected.
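A sketch of how this can look in practice, under a few assumptions: the dump levels are illustrative, and the last block is my recollection of the workaround, which attaches oradebug to an existing background process (DIAG here) and asks that process to perform the dump. Check it against the blog and your Oracle version before relying on it.

```sql
-- Run from a SQL*Plus session started with: sqlplus -prelim / as sysdba

-- Dumps that work from a preliminary connection
oradebug setmypid
oradebug unlimit
-- Level 266 is an assumed, commonly used level
oradebug dump systemstate 266
oradebug dump ashdumpseconds 30
oradebug tracefile_name

-- Assumed workaround for hanganalyze: attach to a process that is
-- already connected and let it run the dump
oradebug setorapname diag
oradebug dump hanganalyze 1
-- The dump is now written to the trace file of the attached process
oradebug tracefile_name
```

The point of the second part is that the hang analysis runs in a process that already has a process state object and a session state object, which is what the error message complains about.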