February 21, 2012

Not In – 2

My note on “NOT IN” subqueries is one of the most popular on my blog, staying in the top 5 hits for the last five years – but it’s getting a bit old, so it’s about time I said something new about “NOT IN” – especially since the Null Aware Anti Join has been around such a long time. The example I want to talk about is, as so often, something that came up as a problem on a customer site. Here’s a bit of SQL to model the situation, which is currently running under Oracle 11.1.0.7:

There are a couple of oddities in the model. One is that the second index starts with the same columns as the primary key index – this is to emulate the functionality of the client system. The other is that I’ve made the primary key index invisible: the nature of the client data and code was such that the primary key index was not used in the query I’m about to emulate, and making the index invisible means I don’t have to mess around with statistics or complicated data routines to make the appropriate behaviour appear in the model.
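For readers who want to build something similar, the model would have looked roughly like this. This is a hedged reconstruction: the table name, key columns, and the collection type names (jpl_table, jpl_table3) come from the query below, but the object type names, datatypes, and the padding column are my assumptions.

```sql
-- Sketch of the model: exact datatypes and the data-loading code are
-- assumptions; the structural details (composite PK, overlapping second
-- index, invisible PK index, collection types) come from the text.

create type jpl_row as object (x1 number);
/
create type jpl_table as table of jpl_row;
/
create type jpl_row3 as object (x1 number, x2 number, x3 number);
/
create type jpl_table3 as table of jpl_row3;
/

create table t1 (
        id1     number          not null,
        id2     number          not null,
        n1      number,
        padding varchar2(100)
);

alter table t1 add constraint t1_pk primary key (id1, id2);

-- the second index starts with the same columns as the primary key
create index t1_i1 on t1(id1, id2, n1);

-- hide the PK index so the optimizer ignores it, as on the client site
alter index t1_pk invisible;
```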

Now that I’ve got the model in place, let’s take a look at the query:

delete
from    t1
where
        (id1) in (
                select x1 from table(cast(:b1 as jpl_table)) v1
        )
and     (id1, id2, n1) not in (
                select x1, x2, x3 from table(cast(:b2 as jpl_table3)) v2
        )
;

This query is actually trying to delete a load of data from the table. The surrounding PL/SQL package populates two collections with up to 1,000 items each. The first collection identifies rows that may need to be deleted, and the second collection then identifies rows from the first set that should not be deleted. You’ll notice that the first collection identifies rows only by the first column of the primary key, and the second collection uses both parts of the key and a non-key column to countermand the deletion. Typically both collections would hold close to the 1,000 item limit set by the developer, and typically only one or two rows would end up being deleted each time the statement ran (the id1 column tends to be “nearly unique” across the first collection).
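The client’s package is proprietary, but the general shape of the driving code, as described, might look like this. It’s a hypothetical sketch: the anonymous block structure and variable names are mine; only the delete statement itself comes from the post.

```sql
declare
        m_candidates    jpl_table  := jpl_table();      -- up to 1,000 id1 values
        m_keep          jpl_table3 := jpl_table3();     -- up to 1,000 (id1, id2, n1) triples
begin
        -- ... populate m_candidates and m_keep from the application logic ...

        delete
        from    t1
        where
                (id1) in (
                        select x1 from table(cast(m_candidates as jpl_table)) v1
                )
        and     (id1, id2, n1) not in (
                        select x1, x2, x3 from table(cast(m_keep as jpl_table3)) v2
                );
end;
/
```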

With that overview in mind, thinking particularly of the number of rows initially identified and the number of rows that survive, look at the execution plan:

The optimizer has unnested the first collection (IN list), sorted the set for uniqueness, and used it to drive a nested loop through the index we need (avoiding the table) to pick up the rowid and all the columns we need to check against the second collection. However, as it does the index range scan for each unique item in the first collection it runs the NOT IN subquery, checking whether the row it has acquired has a match in the second collection. This means that around 1,000 times we fetch a row and scan the second collection for a match. We almost always find a match so, on average, we will have to scan 500 items from the second collection to find that match. The statement was CPU intensive; on the production system it did about 3,000 buffer gets but took about 1 CPU second to find and delete (on average) one row.

The problem is the repeated scanning of the second collection – and it’s a problem that shouldn’t exist. I want Oracle to unnest the second query and do a hash anti join with it. If we did that then we would scan the second collection only once, scatter it into memory, and do just one probe and comparison for each row brought back by the first collection. This is (a mockup of) the plan I want to see:

With the SQL supplied, I couldn’t make this plan appear in 11.1. I had hoped to force the path I wanted and then create an SQL Baseline for it, but I actually had to rewrite the query, converting the “NOT IN” to “NOT EXISTS” – now this isn’t always legal, of course, but in my case I knew that all the relevant columns would be non-null (even though the n1 column in the table had not been declared as such) and the data accumulated in the collections would also be non-null so the transformation was safe. So here’s the rewrite and the new plan:
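For reference, the NOT EXISTS version of the statement would take roughly this shape. This is my reconstruction – the correlation predicates follow directly from the NOT IN column list, and (as noted) the rewrite is safe only because all three columns are known to be non-null on both sides:

```sql
delete
from    t1
where
        (id1) in (
                select x1 from table(cast(:b1 as jpl_table)) v1
        )
and     not exists (
                select  null
                from    table(cast(:b2 as jpl_table3)) v2
                where   v2.x1 = t1.id1
                and     v2.x2 = t1.id2
                and     v2.x3 = t1.n1
        )
;
```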

It’s exactly what I wanted – but the code has to be modified, which means the full testing cycle and delay, and because of the complexities of the collection objects and the need to use realistic data it’s something that I can’t actually do on the sand-pit that the client lets me play with. So, as it stands, I think it ought to be quite a bit more efficient – but I’ll have to wait a couple of weeks to find out.

Here’s the irritating bit. The client will be upgrading to 11.2.0.3 in the not too distant future, and here’s the plan you get from the original query on that version of Oracle:

With my data set it uses a merge join rather than a hash join, but it manages to unnest both collections – note, also, the appearance of the null-aware anti join; the appearance of the non-mergeable view (vw_nso_1) is also an interesting detail – 11.1 didn’t have this operator in its plan.

In principle 11.1 ought to be able to produce the same plan – all the units of functionality seem to be there, including the null-aware anti-join – but I just can’t make the plan appear (although I did manage to get an ORA-00600 error with one of my more bizarre attempts at hinting).

Despite the automatic appearance of what seems to be a suitable (though slightly sub-optimal) path with the upgrade, I think we’ll still be doing the rewrite – interestingly 11.2 does produce a slightly different plan when you go for existence subqueries – it’s a (differently named) non-mergeable view again:

One last thought – it looks as if the optimizer has some new ways (including dynamic sampling) of handling collections, and subquery manipulation of collections, in 11.2; and this client loves doing cunning things with collections, so we’re probably going to get a number of better execution plans from the upgrade. But they’re going to have to check every single example they’ve got of code using collections, because you can bet that somewhere they’ll hit an edge case where the “new improved” mechanisms manage to be the wrong choice.

Footnote: I put a couple of /*+ cardinality (XXX 10) */ hints into my code while I was creating the examples above, but took them out to present the code and results. My data set was small compared to the client’s data set, so I needed the hints but didn’t want to give the impression that they were a necessary part of the solution.
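For anyone reproducing the model at small scale, a hint like that is applied to the collection’s alias inside the subquery, along these lines (an illustrative usage only – the alias v1 here stands in for the XXX placeholder above):

```sql
select  /*+ cardinality(v1 10) */
        x1
from    table(cast(:b1 as jpl_table)) v1;
```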