8 Bulk Update Methods Compared

What I love about writing SQL Tuning articles is that I very rarely end up publishing the findings I set out to achieve. With this one, I set out to demonstrate the advantages of PARALLEL DML, didn't find what I thought I would, and ended up testing 8 different techniques to find out how they differed. And guess what? I still didn't get the results I expected. Hey, at least I learned something.

As an ETL designer, I hate updates. They are just plain nasty. I spend an inordinate proportion of the design time of an ETL system worrying about the relative proportion of rows inserted vs updated. I worry about how ETL tools apply updates (did you know DataStage applies updates singly, but batches inserts in arrays?), how I might cluster together rows that are subject to updates, and what I might do if I just get too many updates to handle.

It would be fair to say I obsess about them. A little bit.

The two most common forms of Bulk Updates are:

Update (almost) every row in the table. This is common when applying data patches and adding new columns.

Updating a small proportion of rows in a very large table.

Case 1 is uninteresting. The fastest way to update every row in the table is to rebuild the table from scratch. All of these methods below will perform worse.

Case 2 is common in Data Warehouses and overnight batch jobs. We have a table containing years worth of data, most of which is static; we are updating selected rows that were recently inserted and are still volatile. This case is the subject of our test. For the purposes of the test, we will assume that the target table of the update is arbitrarily large, and we want to avoid things like full-scans and index rebuilds.

The methods covered include both PL/SQL and SQL approaches. I want to test on a level playing field and remove special factors that unfairly favour one method, so there are some rules:

Accumulating data for the update can be arbitrarily complex. SQL updates can have joins with grouping and sub-queries and what-not; PL/SQL can have cursor loops with nested calls to other procedures. I'm not testing the relative merits of how to accumulate the data, so each test will use pre-prepared update data residing in a Global Temporary Table.

Some methods - such as MERGE - allow the data source to be joined to the update target using SQL. Other methods don't have this capability and must use Primary Key lookups on the update target. To make these methods comparable, the "joinable" techniques will use a Nested Loops join to most closely mimic the Primary Key lookup of the other methods. Even though a Hash join may be faster than Nested Loops for some distributions of data, that is not always the case and - once again - we're assuming an arbitrarily large target table, so a full scan is not necessarily feasible.

Having said that we're not comparing factors outside of the UPDATE itself, some of the methods do have differences unrelated to the UPDATE. I have included these deliberately because they are reasonably common and have different performance profiles; I wouldn't want anyone to think that because their statements were *similar* to those shown here that they have the same performance profile.

The 8 methods I am benchmarking here are as follows (in rough order of complexity):

Not many people code this way, but there are some Pro*C programmers out there who are used to Explicit Cursor Loops (OPEN, FETCH and CLOSE commands) and translate these techniques directly to PL/SQL. The UPDATE portion of the code works in an identical fashion to the Implicit Cursor Loop, so this is not really a separate "UPDATE" method as such. The interesting thing about this method is that it performs a context-switch between PL/SQL and SQL for every FETCH, which is less efficient. I include it here because it allows us to compare the cost of context-switches to the cost of updates.
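The shape of this method is worth seeing. Here is a minimal sketch, assuming a target table TEST (PK primary key, FK, FILL) and a Global Temporary Table TEST1 holding the update data - the column names are my assumptions, not necessarily those used in the benchmark:

```sql
DECLARE
    -- Update data is pre-prepared in the GTT, per the test rules
    CURSOR c1 IS
        SELECT pk, fk, fill FROM test1;
    rec c1%ROWTYPE;
BEGIN
    OPEN c1;
    LOOP
        FETCH c1 INTO rec;        -- one PL/SQL-to-SQL context switch per row
        EXIT WHEN c1%NOTFOUND;
        UPDATE test
        SET    fk   = rec.fk,
               fill = rec.fill
        WHERE  pk   = rec.pk;     -- Primary Key lookup on the target
    END LOOP;
    CLOSE c1;
    COMMIT;
END;
/
```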

This is the simplest PL/SQL method and very common in hand-coded PL/SQL applications. Update-wise, it looks as though it should perform the same as the Explicit Cursor Loop. The difference is that the Implicit Cursor internally performs bulk fetches, which should be faster than the Explicit Cursor because of the reduced context switches.
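A sketch of the implicit version, under the same assumed schema (TEST target, TEST2 as the GTT). Note there is no OPEN/FETCH/CLOSE; from 10g onwards the implicit cursor FOR loop array-fetches behind the scenes:

```sql
BEGIN
    FOR rec IN (SELECT pk, fk, fill FROM test2) LOOP
        UPDATE test
        SET    fk   = rec.fk,
               fill = rec.fill
        WHERE  pk   = rec.pk;
    END LOOP;
    COMMIT;
END;
/
```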

This method is pretty common. I generally recommend against it for high-volume updates because the SET sub-query is nested, meaning it is performed once for each row updated. To support this method, I needed to create an index on the Primary Key of the GTT (TEST8.PK).
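For illustration, a sketch of this style of update against the assumed TEST/TEST3 schema. The correlated SET sub-query runs once per updated row, and the WHERE EXISTS clause is needed to stop it nulling out rows with no matching update:

```sql
UPDATE test t
SET    (fk, fill) = (SELECT s.fk, s.fill
                     FROM   test3 s
                     WHERE  s.pk = t.pk)    -- executed once per updated row
WHERE  EXISTS (SELECT NULL
               FROM   test3 s
               WHERE  s.pk = t.pk);
```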

This one is gaining in popularity. Using BULK COLLECT and FORALL statements is the new de-facto standard for PL/SQL programmers concerned about performance because it reduces context switching overheads between the PL/SQL and SQL engines.

The biggest drawback to this method is readability. Since Oracle does not yet provide support for record collections in FORALL, we need to use scalar collections, making for long declarations, INTO clauses, and SET clauses.
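Here is what those long declarations look like in practice - a sketch against the assumed TEST/TEST4 schema, fetching in batches of 1000:

```sql
DECLARE
    -- One scalar collection per column; FORALL cannot yet use record collections
    TYPE num_tab IS TABLE OF NUMBER       INDEX BY PLS_INTEGER;
    TYPE chr_tab IS TABLE OF VARCHAR2(40) INDEX BY PLS_INTEGER;
    l_pk   num_tab;
    l_fk   num_tab;
    l_fill chr_tab;
    CURSOR c1 IS SELECT pk, fk, fill FROM test4;
BEGIN
    OPEN c1;
    LOOP
        FETCH c1 BULK COLLECT INTO l_pk, l_fk, l_fill LIMIT 1000;
        FORALL i IN 1 .. l_pk.COUNT     -- one context switch per batch, not per row
            UPDATE test
            SET    fk   = l_fk(i),
                   fill = l_fill(i)
            WHERE  pk   = l_pk(i);
        EXIT WHEN c1%NOTFOUND;          -- after processing the final partial batch
    END LOOP;
    CLOSE c1;
    COMMIT;
END;
/
```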

The modern equivalent of the Updateable Join View. Gaining in popularity due to its combination of brevity and performance, it is primarily used to INSERT and UPDATE in a single statement. We are using the update-only version here. Note that I have included a FIRST_ROWS hint to force an indexed nested loops plan. This is to keep the playing field level when comparing to the other methods, which also perform primary key lookups on the target table. A Hash join may or may not be faster; that's not the point - I could increase the size of the target TEST table to 500M rows and the Hash join would certainly be slower.
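The update-only MERGE looks like this - again a sketch against the assumed TEST/TEST6 schema, with the hint forcing the nested loops plan described above:

```sql
MERGE /*+ FIRST_ROWS */ INTO test t
USING test6 s
ON    (t.pk = s.pk)                -- indexed nested loops lookup on TEST
WHEN MATCHED THEN UPDATE
SET   t.fk   = s.fk,
      t.fill = s.fill;
```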

This is much easier to do with DataStage than with native PL/SQL. The goal is to have several separate sessions applying UPDATE statements at once, rather than using the sometimes restrictive PARALLEL DML alternative. It's a bit of a kludge, but we can do this in PL/SQL using a Parallel Enable Table Function. Here's the function:
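The original function listing has not survived in this copy, so here is a reconstruction of the idea: a pipelined, parallel-enabled function that performs the updates and pipes back a row count per slave. The collection type and column names are my assumptions; TEST_NUM_ARR would need to be created as a SQL-level type first:

```sql
-- Assumed SQL-level collection type for the pipelined return value
CREATE TYPE test_num_arr AS TABLE OF NUMBER;
/

CREATE OR REPLACE FUNCTION test_parallel_update (
    p_cursor IN SYS_REFCURSOR          -- weakly typed; fine with PARTITION BY ANY
) RETURN test_num_arr
  PIPELINED
  PARALLEL_ENABLE (PARTITION p_cursor BY ANY)
IS
    PRAGMA AUTONOMOUS_TRANSACTION;     -- permits DML inside a function called from SELECT
    l_pk   NUMBER;
    l_fk   NUMBER;
    l_fill VARCHAR2(40);
    l_cnt  NUMBER := 0;
BEGIN
    LOOP
        FETCH p_cursor INTO l_pk, l_fk, l_fill;
        EXIT WHEN p_cursor%NOTFOUND;
        UPDATE test
        SET    fk   = l_fk,
               fill = l_fill
        WHERE  pk   = l_pk;
        l_cnt := l_cnt + 1;
    END LOOP;
    COMMIT;
    PIPE ROW (l_cnt);                  -- each slave reports how many rows it updated
    RETURN;
END;
/
```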

Note that it receives its data via a Ref Cursor parameter. This is a feature of Oracle's parallel-enabled functions; they will apportion the rows of a single Ref Cursor amongst many parallel slaves, with each slave running over a different subset of the input data set.

Note that we are using a SELECT statement to call a function that performs an UPDATE. Yeah, I know, it's nasty. You need to make the function an AUTONOMOUS TRANSACTION to stop it from throwing an error. But just bear with me, it is the closest PL/SQL equivalent I can make to a third-party ETL Tool such as DataStage with native parallelism.
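The invocation, then, looks something like this - the PARALLEL hint on the Ref Cursor query is what spawns the slaves (degree 8 here is arbitrary):

```sql
SELECT SUM(COLUMN_VALUE) AS rows_updated
FROM   TABLE(test_parallel_update(
           CURSOR(SELECT /*+ PARALLEL(s, 8) */ pk, fk, fill
                  FROM   test8 s)));
```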

In this test, we apply the 100K updated rows in Global Temporary Table TEST{n} to permanent table TEST. There are 3 runs:

Run 1: The buffer cache is flushed and about 1 hour of unrelated statistics gathering has been used to age out the disk cache.

Run 2: The buffer cache is flushed and the disk cache has been aged out with about 5-10mins of indexed reads. Timings indicate that the disk cache is still partially populated with blocks used by the query.

Run 3: The buffer cache is pre-salted with the table and index blocks it will need. It should perform very little disk IO.

Amongst the non-parallel methods (1-6), context switches only make a significant and noticeable difference with cached data. With uncached data, the cost of disk reads so far outweighs the context switches that they are barely noticeable. Context switching - whilst important - is not really a game-changer. This tells me that you should avoid methods 1 and 2 as a best practice, but it is probably not cost-effective to re-engineer existing method 1/2 code unless your buffer cache hit ratio is 99+% (ie. like RUN 3).

There were no significant differences between the 6 non-parallel methods, but that is not to say the choice doesn't matter. All of these benchmarks perform Primary Key lookups of the updated table; however, it is possible to run methods 5 and 6 as hash joins with full table scans. If the proportion of blocks updated is high enough, the hash join can make an enormous difference in the run time. See Appendix 1 for an example.

Parallel updates are a game changer. The reason for this is disk latency. Almost all of the time for RUN 1 and RUN 2 of the non-parallel methods is spent waiting for reads and writes on disk. The IO system of most computers is designed to serve many requests at a time, but no ONE request can utilise ALL of the resources. So when an operation runs serially, it only uses a small proportion of the available resources. If there are no other jobs running then we get poor utilisation. The parallel methods 7 and 8 allow us to tap into these under-utilised resources. Instead of issuing 100,000 disk IO requests one after the other, these methods allow (say) 100 parallel threads to perform just 1000 sequential IO requests.

Method 8, which is the equivalent of running many concurrent versions of Method 4 with different data, is consistently faster than Oracle's Parallel DML. This is worth exploring. Since the non-parallel equivalents (Methods 4 and 6) show no significant performance difference, it is reasonable to expect that parallelising these two methods will yield similar results. I ran some traces (see Appendix 2) and found that the Parallel Merge was creating too many parallel threads and suffering from latch contention. Manually reducing the number of parallel threads made it perform similarly to the Parallel PL/SQL method. The lesson here is that too much parallelism is a bad thing.

It looks as though there is a small premium associated with checking the foreign key, although it does not appear to be significant. It's worth noting that the parent table in this case is very small and quickly cached. A very large parent table would result in considerably greater number of cache misses and resultant disk IO. Foreign keys are often blamed for bad performance; whilst they can be limiting in some circumstances (e.g. direct path loads), updates are not greatly affected when the parent tables are small.

I was expecting the Parallel DML MERGE to be slower. According to the Oracle® Database Data Warehousing Guide - 10g Release 2, INSERT and MERGE are "Not Parallelized" when issued against the child of a Foreign Key constraint, whereas parallel UPDATE is "supported". As a test, I issued a similar MERGE statement and redundantly included the WHEN NOT MATCHED THEN INSERT clause: it was not parallelized and ran slower. The lesson here: there may be merit in applying an upsert (insert else update) as an update-only MERGE followed by an INSERT instead of using a single MERGE.
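That split-upsert workaround could be sketched like this, against the assumed TEST/TEST7 schema. The update-only MERGE keeps its parallelism despite the Foreign Key; the follow-up INSERT picks up the unmatched rows:

```sql
-- Step 1: update-only MERGE - still parallelizable with the FK in place
MERGE /*+ PARALLEL(t) */ INTO test t
USING test7 s
ON    (t.pk = s.pk)
WHEN MATCHED THEN UPDATE
SET   t.fk   = s.fk,
      t.fill = s.fill;

-- Step 2: insert only the rows that found no match above
INSERT INTO test (pk, fk, fill)
SELECT s.pk, s.fk, s.fill
FROM   test7 s
WHERE  NOT EXISTS (SELECT NULL
                   FROM   test t
                   WHERE  t.pk = s.pk);
```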

Well, if further proof was needed that Bitmap indexes are inappropriate for tables that are maintained by multiple concurrent sessions, surely this is it. The Deadlock error raised by Method 8 occurred because bitmap indexes are locked at the block level, not the row level. With hundreds of rows represented by each block in the index, the chances of two sessions attempting to lock the same block are quite high. The very clear lesson here: don't update bitmap indexed tables in parallel sessions; the only safe parallel method is PARALLEL DML.

The other interesting outcome is the differing impact of the bitmap index on set-based updates vs transactional updates (SQL solutions vs PL/SQL solutions). PL/SQL solutions seem to incur a penalty when updating bitmap indexed tables. A single bitmap index has added around 10% to the overall runtime of PL/SQL solutions, whereas the set-based (SQL-based) solutions run faster than in the B-Tree index case (above). Although not shown here, this effect is magnified with each additional bitmap index. Given that most bitmap-indexed tables would have several such indexes (as bitmap indexes are designed to be of most use in combination), this shows that PL/SQL is virtually non-viable as a means of updating a large number of rows.

Context Switches in cursor loops have greatest impact when data is well cached. For updates with buffer cache hit-ratio >99%, convert to BULK COLLECT or MERGE.

Use MERGE with a Hash Join when updating a significant proportion of blocks (not rows!) in a segment.

Parallelize large updates for a massive performance improvement.

Tune the number of parallel query servers used by looking for latch contention and thread startup waits.

Don't rashly drop Foreign Keys without benchmarking; they may not be costing very much to maintain.

MERGE statements that UPDATE and INSERT cannot be parallelised when a Foreign Key is present. If you want to keep the Foreign Key, you will need to use multiple concurrent sessions (insert/update variant of Method 8) to achieve parallelism.

Don't use PL/SQL to maintain bitmap indexed tables; not even with BULK COLLECT / FORALL. Instead, INSERT transactions into a Global Temporary Table and apply a MERGE.
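The recommended pattern could be set up like this - the GTT definition and column names are my assumptions:

```sql
-- Staging area for the transactions; rows vanish at session end
CREATE GLOBAL TEMPORARY TABLE test_updates (
    pk   NUMBER,
    fk   NUMBER,
    fill VARCHAR2(40)
) ON COMMIT PRESERVE ROWS;

-- The session INSERTs its transactions into TEST_UPDATES, then applies
-- them to the bitmap-indexed table in a single set-based statement:
MERGE INTO test t
USING test_updates s
ON    (t.pk = s.pk)
WHEN MATCHED THEN UPDATE
SET   t.fk   = s.fk,
      t.fill = s.fill;
```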

Although we are updating only 1% of the rows in the table, those rows are almost perfectly distributed throughout the table. As a result, we end up updating almost 100% of the blocks. This makes it a good candidate for hash joins and full scans to out-perform indexed nested loops. Of course, as you decrease the percentage of blocks updated, the balance will swing in favour of Nested Loops; but this trace demonstrates that MERGE definitely has its place in high-volume updates.

That's a pretty significant difference: the same method (MERGE) is 6-7 times faster when performed as a Hash Join. Although the number of physical disk blocks and Current Mode Gets are about the same in each test, the Hash Join method performs multi-block reads, resulting in fewer visits to the disk.

All 8 methods above were benchmarked on the assumption that the target table is arbitrarily large and the subset of rows/blocks to be updated are relatively small. If the proportion of updated blocks increases, then the average cost of finding those rows decreases; the exercise becomes one of tuning the data access rather than tuning the update.

Why is the Parallel PL/SQL (Method 8) approach much faster than the Parallel DML MERGE (Method 7)? To shed some light, here are some traces. Below we see the trace from the Parallel Coordinator session of Method 7:

From this, we can see that of the 30.3 seconds the Co-ordinator spent waiting for the parallel threads, this one spent 7.52 seconds waiting for shared resources (latches) held by other parallel threads, and just 12 seconds reading blocks from disk.

For comparison, here is the trace of the Co-ordinator session of a Parallel PL/SQL run:

The Parallel PL/SQL spent just 11.85 seconds starting parallel threads, compared to 23.61 seconds for PARALLEL DML. I noticed from the trace that PARALLEL DML used 256 parallel threads, whereas the PL/SQL method used just 128. Looking more closely at the trace files I suspect that the PARALLEL DML used 128 readers and 128 writers, although it is hard to be sure. Whatever Oracle is doing here, it seems there is certainly a significant cost to opening parallel threads.

Also, looking at the wait events for the Parallel PL/SQL slave thread, we see no evidence of resource contention as we did in the PARALLEL DML example.

In theory, we should be able to reduce the cost of thread startup and also reduce contention by reducing the number of parallel threads. Knowing from above that the parallel methods were 10-20 times faster than the non-parallel methods, I suspect that the benefits of parallelism diminish after no more than 32 parallel threads. In support of that theory, here is a trace of a PARALLEL DML test case with 32 parallel threads:

Comments

I find this a very good article and I would like to explore some of the results of the tests. Can you provide the code used for the tests? For example, when I try to replicate method 8, the function won't compile because the return type (test_num_arr) is not defined. I'm a little unclear what this is and how I should define it.