Not all Deadlocks are created the same

Doug's Oracle Blog

I've blogged about deadlocks in Oracle at least once before. I said then that although the following message in deadlock trace files is usually true, it isn't always.

The following deadlock is not an Oracle error. Deadlocks of
this type can be expected if certain SQL statements are
issued. The following information may aid in determining the
cause of the deadlock.

So when I came across another example recently, it seemed worth a quick blog post. Not least for the benefit of other souls who hit the same issue (and probably hit Google moments later).

But while it's easy to say - "Hey! Look! I found an exception! Aren't I clever?" - it occurred to me that actually Oracle's capabilities in this area might be underrated by raising the occasional anomaly. Because the truth is

1) In most cases, deadlock errors are down to the way you've written your application or some documented restriction in Oracle. The type of restrictions that you're more likely to hit if you're handling high degrees of concurrency with lots of DDL, parallel query, partition management and the like. Such activities often have unusually restrictive locking requirements and most locking issues can be turned into deadlock issues quite easily if you have a few sessions running concurrently.

2) It's still the case that Oracle will handle the deadlock situation, at least to the extent of rolling back one of the statements causing the issue. (Although, whilst writing this post, I noticed that Jonathan Lewis raised the question of what exactly people mean when they suggest that Oracle resolves deadlock issues.

3) Deadlock trace files are typically very useful and not the most difficult to read. Yes, they tend to use Oracle kernel terminology (not surprising) but I'd wager that most people could have a rough idea of the root cause with some initial analysis and could have a very detailed idea, given more time. Even if you can't decipher the things yourself, it gives Oracle Support detailed information to help root cause analysis.

So, to the particular issue we hit. Towards the end of a data loading process that loads around a billion rows in a short period of time (30/60 minutes that also includes a bunch of surrounding activities), we would hit the occasional deadlock error. Fortunately, the site I'm working at just now has a very enlightened policy towards developer access to the alert log and trace files, so I can do my own initial investigation. On digging out the relevant deadlock trace file, it looked like this (some details changed)

Which happened to fit in with what we were seeing. We have two concurrent runs which perform similar actions using different input files that load into different partitions.

2) The first session is using Parallel Query (note the enq: PS - contention and different PX Deq wait events)

3) The deadlock is a little unusual because it's not between two transactions trying to lock database objects or rows being locked by the other session but between in-memory structures. One session is waiting on 'row cache lock' and the other is waiting on
'library cache lock', as opposed to waiting for specific row or
table-level locks. This is also visible from the chain signature at the start of the trace file.

Chain Signature: 'library cache lock'<='row cache lock' (cycle)

Armed with 2) and 3) in particular, my next step was to go to My Oracle Support, as usual. I find that Google isn't too great with issues like this because some of them are quite specific and might not be affecting too many others. Sure enough, a search turned up :-

Which is confirmed as affecting versions 11.2.0.2 and 11.2.0.3. The fix is in Bundle Patch 12 for Exadata, in Oracle 12.1 and is also available as one-off patch that we're in the process of applying to different environments.

The issue is that "When a parallel query is hard parsed, first QC hard parses the query and then all the slaves. When a partition maintenance operation (DDL) comes in between the hard parses of QC and Slaves.", then you can hit the deadlock. There's more detail in the bug notes, but it's worth noting this phrase "This is basically a timing issue, in high concurrency environments.", which means it only affects us very intermittently and is a nightmare to prove we've eliminated without a lot of testing.

What I find a little disconcerting is that there seem to be quite a few of these library cache deadlock issues kicking around in recent versions that I haven't been used to seeing in prior versions. Given some of the library cache madness I've seen in my few years with 11g, I do wonder what on earth they've done to it!

Disclaimer

For the avoidance of any doubt, all views expressed here are my own and not those of past or current employers, clients, friends, Oracle Corporation, my Mum or, indeed, Flatcat. If you want to sue someone, I suggest you pick on Tigger, but I hope you have a good lawyer. Frankly, I doubt any of the former agree with my views or would want to be associated with them in any way.