September 13, 2013

Quiz Night

Here’s a little quiz about Bloom filtering. There seem to be at least three different classes of query where Bloom filters can come into play – all involving hash joins: partition elimination, aggregate reduction on non-mergeable aggregate views, and parallelism.

This quiz is about parallel queries – and all you have to do is work out how many Bloom filters were used in the following two execution plans (produced by 11.2.0.2), and where they were used.

I’ve got 4 tables, 3 very small dimensions and one large fact. I’ve joined the three dimensions to the fact on their primary keys, and filtered on each dimension (SQL for the test supplied here). Stripping out the eighteen hints that I inserted to get the plans I wanted, the queries both looked like this:
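The original SQL isn’t reproduced in this extract, but from the description above (three dimensions joined to the fact on their primary keys, with a filter on each dimension) the stripped-down query would take roughly this shape – the column names other than id1/id2/id3 are my assumptions, not the original test script:

```sql
select
        count(t4.padding)           -- hypothetical select list
from
        t1, t2, t3, t4
where
        t4.id1 = t1.id              -- join fact to each dimension PK
and     t4.id2 = t2.id
and     t4.id3 = t3.id
and     t1.small_vc = '1'           -- a filter on each dimension
and     t2.small_vc = '1'           -- (filter columns are illustrative)
and     t3.small_vc = '1'
;
```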

The t4 table is the fact table and t1, t2 and t3 are the dimensions. Here are two execution plans; in both cases I’ve hinted a path that produces three consecutive hash joins, building in-memory hash tables from the dimension tables, before scanning the fact table and probing the three dimensions in turn. I captured the plans from memory with calls to dbms_xplan.display_cursor() after running the queries.

Plan 1 – which uses the “broadcast” distribution for the dimension tables.

If you can answer the Bloom filter question then here’s a bonus question – what is the significance of the access(:Z>=:Z AND :Z<=:Z) predicates that appear all over the place ?

It’s quite important to be able to answer the Bloom filter question correctly if you’re running parallel queries – especially if you’re on an Exadata machine – since it can make a big difference to the performance. The significance of the Exadata comment is that Bloom filters can be pushed down to the cell servers, which could reduce the traffic back to the database servers quite dramatically (and reduce the memory and CPU utilisation of the database servers at the same time).

Update (Feb 2014)

As often happens, each new version of Oracle fills some of the gaps left in older versions – here’s the execution plan for the “broadcast” query from 12.1.0.1 (from v$sql_monitor); you might like to compare it with Sayan’s “prediction” in the comments:

The improvement, of course, is the introduction of the “JOIN FILTER CREATE” and “JOIN FILTER USE” lines. (There is still the odd little detail that the predicate section for line 18 manages to cite Bloom filter :BF0000 three times, rather than synchronising the labels with the body of the plan – in more complex plans this could be important.)

For completeness, here’s the 12c version of the hash distribution plan:

There’s basically no change in the presentation quality of this plan, although the “px send hybrid hash” operators, along with the associated “statistics collector” operators, show that Oracle has considered adaptive execution and could switch from a hash distribution to a broadcast distribution at three different points in the plan. Again there’s a little oddity with the names of the Bloom filters (:BFnnnn) – they’re numbered in the opposite order to the order in which they are created and used.

I think:
1. in the first plan there is one Bloom filter operation, but it checks against a list containing three Bloom filters (SYS_OP_BLOOM_FILTER_LIST with 3 Bloom filters). It seems to act as a set of AND-conditions.
2. in the second – 3 different Bloom filters in different places

“access(:Z>=:Z AND :Z<=:Z)” – as far as I can remember, I have never seen such predicates related to Bloom filters. Could it be a datatype (size) check?

in Christian Antognini’s paper on Bloom filters (http://antognini.ch/papers/BloomFilters20080620.pdf) there is the statement: “[the operation creates] the bloom filter named :BF0000 (if there are several bloom filters, the numeric value is increased, i.e. the second one would be named :BF0001)”. According to this I would say that there are three Bloom filters in query 2 and just one Bloom filter in query 1. But that’s not a technical analysis but merely text exegesis…

The “access(:Z>=:Z AND :Z<=:Z)” appears because parallel tablescans operate through rowid ranges, so each PX slave receives a rowid range as a “between bind1 and bind2” clause, and this check ensures that bind1 is not greater than bind2. There’s a related statistic counting rowid range access: “table scans (rowid ranges)” – if you want to play with it you may find that a single tablescan results in 13 x DOP tablescans by rowid range for any reasonably sized table.
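As a sketch of the mechanism described above, each PX slave effectively runs its slice of the tablescan as a rowid-range query of this form (the bind names are illustrative – in the displayed plan both binds appear as the placeholder :Z):

```sql
select  /* one rowid-range slice of a parallel tablescan */
        t4.*
from    t4
where   rowid between :start_rowid and :stop_rowid
;
```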

There are 3 Bloom filters created and used in the first plan, though you can’t see them being created; they are all used at the same time at line 21 of the plan, and you can only see the usage because of the filter predicate. Sayan’s modified plan shows you where they are created.

There are 3 Bloom filters being created and one filter being used according to the structure of the second plan. Unfortunately the predicate at line 33 says that only one filter is being used, and it’s not the filter that the body of the plan says is being used. So we still need to know – how many filters really are used, and where are they used in the hash/hash plan?

I suggest the first thing to notice is to read the TQ column properly. Although the first plan is also parallel, thanks to the broadcast distribution it can be read in the same way as a serial plan. Execution starts at line 9: Parallel Process Set 1 reads table T3 in parallel and sends it to Set 2 (a BF is also created, of course). Then Set 1 reads table T2 (silently creating the second BF) and sends the data to Set 2, and then the same story again with table T1. Finally, Set 2 reads table T4 (the huge fact table) in parallel, with the help of all 3 Bloom filters, and performs all the necessary hash joins, since each parallel process in the second set has all the broadcast data from T1, T2 and T3 in virtual tables Q0, Q1 and Q2.

This is not the case for the second plan. Set 1 reads table T1 (line 28), redistributing the data by HASH to Set 2, which will be joining the data; at the same time the first BF is created by the first set. Since we have only two PQ sets and there are more execution layers, data has to be written to TEMP, which can also be observed at line 23 (HASH JOIN BUFFERED, a blocking operation). To be honest, I’m not able to say exactly how the work is distributed between the PQ sets at this moment (HASH JOIN BUFFERED is still shrouded in mystery to me). However, it seems clear that only one BF can be used while reading T4, immediately discarding the majority of rows which won’t match later.
Table T2 is scanned thereafter, creating the second BF, which will help only with the result set of the join of T1 and T4. Similarly, the last BF will be created while reading T3 and used on the result set of T1, T4 and T2.

Both plans produced three Bloom filters – but plan 1 (with the broadcast distribution) doesn’t show them being created, and doesn’t have an operation showing them being used; you have to infer the details from the predicate section at line 21 which shows three Bloom filters (for columns id1, id2 and id3) being used in a “Bloom filter list” as table T4 is scanned.
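From memory, a Bloom-filter-list predicate of the kind described above takes approximately this form in the predicate section – the pairing of bind names to columns here is illustrative, not copied from the quoted plan:

```sql
filter(SYS_OP_BLOOM_FILTER_LIST(
           SYS_OP_BLOOM_FILTER(:BF0002, "T4"."ID1"),
           SYS_OP_BLOOM_FILTER(:BF0001, "T4"."ID2"),
           SYS_OP_BLOOM_FILTER(:BF0000, "T4"."ID3")
       ))
```

Each SYS_OP_BLOOM_FILTER() call probes one filter with one fact-table column, and the enclosing SYS_OP_BLOOM_FILTER_LIST() requires every probe to pass – which is why a single predicate line can represent all three filters at once.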

Plan 2, on the other hand, shows operations for the Bloom filter creation and usage that goes on (JOIN FILTER CREATE, JOIN FILTER USE). Unfortunately the predicate section makes it look as if only one of the Bloom filters is used – although all three do actually get used – and the predicate at line 33 even lists the wrong join filter (:BF0000 when it’s really applying join filter :BF0002).

The easiest way to check this interpretation is to use the SQL Monitoring feature – if you’re licensed for it – it’s very easy to see how many rows flow through each section of the plan, summed across all the PX slaves, and correlate the numbers with what is (or isn’t) visible in the execution plans. If you’re not licensed to use SQL Monitor then v$pq_tqstat is a free, and very helpful, view that gives you enough of the same information to show you what’s happening.
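A minimal query against v$pq_tqstat – which has to be run from the same session, immediately after the parallel query completes – might look something like this:

```sql
select
        dfo_number, tq_id, server_type,
        instance, process, num_rows, bytes
from
        v$pq_tqstat
order by
        dfo_number, tq_id, server_type desc, process
;
```

Comparing the num_rows sent by the producers on each table queue with the rows you’d expect from the plan is enough to show whether a Bloom filter eliminated data early or late.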

If you’re running Exadata it’s very important to track the Bloom filters – in the first (broadcast) case all three filters will be pushed down to the storage servers, so the minimum volume of data travels back up the network to the database servers. In the second (hash/hash) case only one filter will be pushed to the storage servers, so a lot more data will come back and will be passed between parallel execution slaves as the other filters are applied late. This isn’t a bug, by the way, it’s a necessary side effect of the way that the hash/hash distribution works.

It helps to understand what’s going on if you can read parallel execution plans properly – and the method for doing that isn’t the good old “start at the first child, recursive descent” strategy. But that’s a topic for another day (and for a few minutes in my OpenWorld presentation).

Oh, I was writing my answer slowly (there were, meanwhile, other important things to do), so I didn’t notice you’d provided the answer. I still have many questions in my head regarding the HASH JOIN BUFFERED operation, although Randolf Geist made some very interesting observations a few months ago.

Please write the next article – at the very least you are able to explain the whole thing in much better English :) Hardly anyone reads my comment, while your article would be read by plenty of readers and could be useful. People always forget that PARALLEL HASH JOINS can be processed in a very different order.

I’m still quite confused by the HASH JOIN BUFFERED operator. You had also written an article about it, but later found out you were wrong about some details. Randolf Geist showed, in spectacular fashion, that the second row source is buffered before the probe phase starts and non-matching rows are discarded. However, I do not understand how Parallel Slave Set 1 is able to work out which rows can be discarded (when no Bloom filter is present) before sending them to Set 2 (only Set 2 has the build table).
Did you have time to dig deeper into HASH JOIN BUFFERED?