What is the purpose of segment level checkpoint before DROP/TRUNCATE of a table?

In other words, why do we have to wait for the enq: RO – fast object reuse wait event (and in 11.2 the enq: CR – block range reuse ckpt wait) when dropping and truncating segments?

I’m not fully confident that I know all the real reasons behind this, but it could be related to the need to get rid of the segment’s dirty buffers in the buffer cache before dropping the object.

Imagine this:

You have a large buffer cache and you drop table A without checkpointing the dirty buffers.

Immediately after the drop succeeds (some buffers are still dirty in cache) some other segment (table B) reuses that space for itself and writes stuff into it.

A few seconds later, DBWR wakes up to find & write some dirty buffers to disk (anything it finds from its lists). As there are some old & dirty blocks of table A still in the cache, they get written to disk too, overwriting some of the new table B blocks!
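
To make the failure mode concrete, here’s a toy Python simulation of the scenario above. All names and structures here are invented for illustration – the real DBWR deals with physical file#/block# addresses and knows nothing about which segment owns a block, which is exactly the point:

```python
# Toy simulation: DBWR flushing stale dirty buffers of a dropped table.

# On-disk blocks keyed by physical address (file#, block#).
disk = {}

# Buffer cache: physical address -> (owner segment, contents, dirty flag).
# A dirty block of table A is still cached after the (hypothetical) drop.
cache = {
    (1, 100): ("TABLE_A", "old row data", True),
}

def drop_table_without_checkpoint(segment):
    """A DROP that merely forgets the segment in the dictionary,
    leaving its dirty buffers behind in the cache."""
    pass  # no segment-level checkpoint, cache untouched

def reuse_block(addr, segment, contents):
    """Table B reuses A's freed block and writes it to disk."""
    disk[addr] = (segment, contents)

def dbwr_flush():
    """DBWR writes every dirty buffer it finds on its lists;
    it only sees physical addresses, not segment ownership."""
    for addr, (seg, contents, dirty) in cache.items():
        if dirty:
            disk[addr] = (seg, contents)  # clobbers whatever is there now!

drop_table_without_checkpoint("TABLE_A")
reuse_block((1, 100), "TABLE_B", "new row data")
dbwr_flush()
print(disk[(1, 100)])  # stale TABLE_A data has overwritten TABLE_B's block
```

The segment-level checkpoint before the drop closes exactly this window: the dirty buffers are written (and the buffers invalidated) before the space can be reused.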

So, this is one reason why you should checkpoint the blocks to disk before dropping (or truncating) a segment. Of course you might ask why DBWR doesn’t just check whether a dirty buffer belongs to an existing object or a dropped one when it walks through its dirty list? It could just discard the dirty buffers of dropped objects it finds. It would be doable – but I also think it would get quite complex. DBWR is a low-level background process, understanding the cache layer and dealing with physical datablocks in terms of file# = X, block offset = Y. It doesn’t really know anything about the segments/objects which use these blocks. If it had to start checking for the logical existence of an object, it would have to run code to access the (much higher level) data dictionary cache – and possibly query data dictionary tables via recursive calls, etc. – making it much more complicated.

So, this logic may just be a matter of implementation – it’d be too complex to implement such selective discarding of dirty buffers based on the higher-level concept of the existence of a segment or object. Dropping and truncating tables so frequently that these waits become a serious problem (consuming a significant % of response time) indicates a design problem anyway. For example, former SQL Server developers creating multiple temporary tables in Oracle for breaking a complex query down into smaller parts, just like they had been doing in SQL Server.

Anyway, here’s what I think about this – I’d love to hear other opinions, if you think otherwise!

30 Responses to What is the purpose of segment level checkpoint before DROP/TRUNCATE of a table?

>Dropping and truncating tables so frequently, that these waits become a serious problem
You haven’t seen some Peoplesoft jobs. Choose to delete thousands to tens of thousands of rows a few dozen times or choose to execute a truncate a few dozen times in a batch. Multiply that by the number of countries running concurrently in the batch and you begin to hate TRUNCATE and wonder if DELETE makes more sense.
(And some anecdotal evidence is that DELETE is faster in some situations).

Truncate was introduced in response to a VLDB plea to alleviate redo generation at a time when sequential write speeds were much slower, disks were much smaller, and Oracle Parallel Server operated via flushing through shared disk because network bandwidth between nodes was much more expensive. Tablespaces were dictionary managed, and dropping a table that contained many extents blocked more than you’d imagine, potentially for a long time. (And the need to keep the number of extents low for dictionary-managed objects – many of which were renamed, copied into new tables with additional columns, after which the renamed tables were dropped – is the root cause some mythologists misinterpreted into a claim that having many extents is slow.)

Certainly the fastest way in calendar time to reliably implement truncate was to accept using the checkpoint to obviate special new code to invalidate dirty buffers. And we really did need it right away. I agree it is likely that a more optimal approach should have been found and implemented by now. It is difficult to accept a claim that writing to disk is ever more efficient than marking a block free. Even if putting the blocks directly on the free buffers list would cause some complication (without the code in hand I suppose it is possible, though I always thought that could only be a side effect of a different problem that should also be fixed), certainly (since we know DBWR has block batching) a bit could be flipped so that the block would be omitted from block batching for the actual writes but still be subjected to whatever else occurs to the blocks to get them available as free buffers again.

I’m hard pressed to imagine the situation where delete all rows is faster, even with forcing all cached blocks through DBWR. You still have to produce and write all the extraneous undo and redo.

I asked the original question on oracle-l. I had thought in multiple directions and contacted many experts privately over a few months before asking the question on a public forum. What Mark mentioned above makes sense to me. Why write the dirty buffer to disk when you could just mark it as “invalid”/“clean”, because truncate is going to “discard it on the disk” anyway? This seems to be one of the ingredients of Oracle’s secret sauce, and the most definitive hint at the answer (not the answer itself) I have got is from Harald Van Breederode: it is required for recoverability, specifically PITR.

@Hemant K Chitale
Yes I have seen very bad jobs like that ;-)
But crappy application code isn’t always a good enough reason for Oracle to optimize their RDBMS for it. Although some “optimizations”, such as CURSOR_SHARING = FORCE and SESSION_CACHED_CURSORS, are Oracle RDBMS improvements to work around crappy apps without requiring application code changes…

@Mark W. Farnham
Thanks Mark for the insight! Yeah, sometimes the way Oracle works is not because of a fundamental “physical” conflict – it’s just about an implementation detail (write less code, have less complexity, no time to implement it as it’s Friday evening, etc ;-)

@Tanel Poder
But two points to consider –
a) the buffers are marked “free” after the drop/truncate anyway, so why not just make a code change that recognises the case where you don’t need to write them before making them free. There must have been a code change to mark them free anyway.

b) the drop/truncate code has changed enormously since the first implementation because it used to scan the buffer cache rather than being able to follow the object queue – so why not fix the write/no-write bit of code at the same time.

In passing, I’ve never been able to work out a PITR scenario that could require the table to be written before freeing – the log file contains everything you need to get it right.

I understand that DBWR is not using x$bh.obj since it would have to check dc_segments in the row cache all the time (it would be more complex and less efficient). Tanel’s prediction seems to be right, again.
However, I think Oracle could add some extra column to x$bh to indicate dirty buffers which will never be written to disk. DBWR would just mark them free during the checkpoint.
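
A minimal sketch of that suggestion in Python – the field names (e.g. discard_on_write) are invented and this is just an illustration of the proposed mechanism, not actual Oracle behavior:

```python
# Hypothetical "never write, just free" flag on buffer headers.
# Buffers flagged discard_on_write would be marked free during the
# checkpoint instead of being physically written by DBWR.

buffers = [
    {"block": (1, 100), "dirty": True, "discard_on_write": True,  "state": "dirty"},
    {"block": (1, 101), "dirty": True, "discard_on_write": False, "state": "dirty"},
]

def dbwr_checkpoint(buffers, write):
    for buf in buffers:
        if not buf["dirty"]:
            continue
        if buf["discard_on_write"]:
            buf["state"] = "free"       # no disk I/O, buffer simply freed
        else:
            write(buf["block"])         # normal checkpoint write
            buf["state"] = "clean"
        buf["dirty"] = False

writes = []
dbwr_checkpoint(buffers, writes.append)
print(writes)               # only the non-flagged buffer was written
print(buffers[0]["state"])  # the flagged buffer was just freed
```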

@Tanel Poder
I use TRUNCATEs.
Anecdotal evidence about DELETEs over TRUNCATEs was indirectly presented in MetaLink Note#334822.1, where there is a reference to a discussion in unpublished Bug 4147840.
TRUNCATEs were a serious issue for us in 9i. Using a separate block size only for the “temporary” tables (see Note#334822.1) was actually a reasonable workaround.

@Mark W. Farnham
>I’m hard pressed to imagine the situation where delete all rows is faster, even with forcing all cached blocks through DBWR
See MetaLink Note#334822.1 where there is a reference to discussion in unpublished Bug 4147840.

TRUNCATE implementation has changed from 9i to 10g (see my previous update about performance issues in 9i). However, it still has to be improved. See Oracle Support Article “Bug 6844739 – Truncate very slow [ID 6844739.8]” which lists a number of bugs as Oracle has kept trying to improve the TRUNCATE operation.

I think everybody is using TRUNCATE. It is very reasonable to have a table with temporary data with 20 million rows and truncate it once a day. But it is not a great idea to insert data and truncate 20 times per hour.
I also suggest DELETE could sometimes be faster than TRUNCATE, however it would be a quite rare situation. TRUNCATE code obviously should be improved even in 11g.

Based on what Hemant wrote, I’ll revise that:
“I’m hard pressed to imagine the situation where delete all rows is faster, even with forcing all cached blocks through DBWR, that does not involve a bug.”

Hemant found my blind spot: I just have trouble imagining a bug in the write part of redo and dbwr, because in my mind’s eye Bill Bridge and Jonathan Kline are still writing and testing all that stuff. This admitted delusion allows me to sleep at night. “Everything Bill starts ends in checkmate” – anon. administrative manager of Bill.

I had a similar thought, as in the previous comment. However, I never used and do not know any of the workings of flashback database, so this might be completely off, but I was thinking about this:
In the case of truncates, how does Oracle deal with uncommitted changes already applied to disk? Does there not need to be some kind of immediate cleanup process before the truncate is “committed”?

@Jonathan Lewis
With reference to Jonathan’s comment
and the original post problem, when we are talking about “the buffers”, are we talking about buffer cache entries from a prior DML operation on the table about to be dropped/truncated ?
I understood DROP/TRUNCATE was a DDL command and didn’t suck rows into the buffer cache.
If I understand correctly, then I agree that the prior “COMMIT” to DROP/TRUNCATE is a worthwhile exercise to comply with the Atomicity and Durability of ACID.
As Tanel states: “DBWR wakes up to find & write some dirty buffers to disk (anything it finds from its lists). As there are some old & dirty blocks of table A still in the cache, they get written to disk too, overwriting some of the new table B blocks!”, we can’t be having this for our Atomic or Durable compliance.
But my other thought is: if no DML was performed on the table *before* the TRUNCATE, would it not be quicker?

@Paresh
>I have got is from Harald Van Breederode, it is required for recoverability, specifically PITR.

One can generalize: every memory (in Oracle’s terms, buffer cache) modification goes to redo and [usually – later] to the datafile. Otherwise it is easy to jeopardize recovery, whether crash recovery, PITR, or whatever.

As for the original question, Oracle already has some optimizations that need to be taken into account. Direct writes, for example. Those violate the usual path and need to be handled. It looks like the buffer cache can really overwrite the direct write, as Tanel mentioned.

Generally, given the flood of bugs databases – specifically Oracle – introduce with every new release, I am not sure additional optimizations are the way to go. A database is a base, a foundation on which one builds one’s data “house”. More and more, that foundation cannot be trusted.

There are special object queue linked list pointers in buffer headers (the OQ_NXT and OQ_PRV pointers in X$BH). These linked lists link together buffers belonging to individual segments. So during a segment level checkpoint, all that the CKPT/DBWR process needs to do is walk through the linked list of the segment’s data object id and checkpoint all the buffers there. So, there’s no need to scan through the entire buffer cache like (apparently) had to be done back in the old days.

The related X$ tables are X$KCBOQH (object queue header) and X$KCBOBH (object queue buffers). The starting point of any segment’s linked list is stored in the X$KCBOQH header table. There is also a counter in that X$ structure, which shows how many blocks of that segment are currently cached in the buffer cache. This information is also used by the full table scan (and fast full index scan) codepath to decide whether to do a direct path read or a buffered read instead (a buffered read is preferred if enough blocks of the segment are already in cache).
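
As an illustration of why such per-object queues make segment checkpoints cheap, here is a hypothetical Python sketch. It uses plain lists instead of the real OQ_NXT/OQ_PRV doubly linked lists, and all names and object ids are made up:

```python
from collections import defaultdict

# Sketch of the object-queue idea: buffers hang off a per-object header,
# so a segment checkpoint only touches that segment's buffers instead of
# scanning the whole buffer cache.

class BufferCache:
    def __init__(self):
        # data_object_id -> list of buffers; loosely analogous to X$KCBOQH
        # headers pointing into per-object buffer chains.
        self.object_queues = defaultdict(list)

    def add_buffer(self, data_object_id, block, dirty=False):
        self.object_queues[data_object_id].append(
            {"block": block, "dirty": dirty})

    def cached_block_count(self, data_object_id):
        # the per-object counter the direct-path-read decision could consult
        return len(self.object_queues[data_object_id])

    def segment_checkpoint(self, data_object_id, write):
        """Walk only this segment's queue; cost is proportional to the
        segment's cached blocks, not the size of the whole cache."""
        for buf in self.object_queues[data_object_id]:
            if buf["dirty"]:
                write(buf["block"])
                buf["dirty"] = False

cache = BufferCache()
cache.add_buffer(51234, block=(1, 100), dirty=True)
cache.add_buffer(51234, block=(1, 101), dirty=False)
cache.add_buffer(99999, block=(2, 500), dirty=True)  # other segment: untouched

written = []
cache.segment_checkpoint(51234, written.append)
print(written)                          # only segment 51234's dirty block
print(cache.cached_block_count(51234))  # blocks of that segment in cache
```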

We are trying to implement a batch processing application in 11g R1. After going through all the comments here, I am baffled that I am seeing enq: RO – fast object reuse, slowing 20 batches (run with a parallelism of 3). I have quite a large buffer cache: 15G in default and another 15G in recycle. Nowhere in our application do we truncate any of the objects, but we see a huge wait event. When we trace it, it points to one object_id which is supposed to be a domain table. Just a question to the group, if someone can respond, so that I can troubleshoot… why are we seeing the fast object reuse wait event without even a truncate?

Actually, consider what happens if the instance craps out during the checkpoint. Since the ‘truncate’ is not complete at point of failure, recovery will find modified blocks in the redo stream, but perhaps be unable to find them in that state on disk. So, checkpointing to disk by dbwr is essential to avoid a hosed up and stuck recovery.

Note, this happens for a permanent object, but would not be an issue for temp tables, since we don’t recover them anyway. So, the Peoplesoft behaviour noted earlier happens because that product typically avoided temporary objects for sorts, preferring to create permanent objects and then truncate them later. So, Peoplesoft batch processes revealed this issue rather frequently.

After your comment I just realized that the “local write wait” in truncate could also be complicating things. Normally the redo will always have to land in the redo logs before the related blocks get written to disk by DBWR (thus it should be impossible to have a missing “before-image” block or a too-old block on disk when processing redo for recovery). But the “local write wait” sort of breaks the LGWR → DBWR RBA tracking logic, so this has to be ensured elsewhere in the code…

If you have a more specific example of a situation where things would break recovery, I’d love to read about it!