Redo Latch Tuning

The high redo copy latch miss rates seen under Oracle 7.3 and 8.0 have generated a lot of interest in redo latch tuning.
However, redo latching is seldom a problem, and tuning is not normally required.
All that is required is an understanding of how the redo latches are used.

How are the redo latches used?

The redo latches are used primarily for redo generation, but also in connection with log file sync waits.

Redo Generation

A logically atomic database change normally consists of two or more physical block changes.
For example, inserting a row into a table may involve changes to several index blocks, as well as a change to one of the blocks of the table itself.
For most block changes, there must also be a corresponding change to at least one rollback segment block.
There may also be changes required to a rollback segment header block, block cleanouts, freelist changes, and so on.

Before making a database change,
a process must take buffer locks on each of the buffers holding the current mode image of the database blocks affected,
and prepare a set of change vectors representing the intended changes.
Before the set of change vectors can be applied to the database blocks,
they must all be copied into the redo stream as a single redo entry.

The redo allocation latch must be taken to allocate space in the log buffer.
This latch protects the SGA variables that are used to track which log buffer blocks are used and free.
See our tip Tuning the Log Buffer Size for an explanation of these variables.
The amount of space allocated is that required for all of the change vectors comprising that logical database change,
plus an allowance for a 16-byte block header at the beginning of each redo log block, if the redo entry spans the beginning of one or more log blocks.
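
The allocation arithmetic described above can be sketched as follows.
This is only an illustrative model, not Oracle's code: the function name is hypothetical,
and the 512-byte log block size is an assumption (it is platform dependent).
Only the 16-byte block header comes from the text.

```python
LOG_BLOCK_SIZE = 512    # assumed log block size in bytes; platform dependent
BLOCK_HEADER_SIZE = 16  # per-block header size stated in the text


def space_to_allocate(entry_bytes, offset_in_block):
    """Return the bytes to reserve for a redo entry (hypothetical helper).

    offset_in_block is the position within the current log block at which
    the entry starts; it must lie after that block's own header.
    """
    if entry_bytes <= 0:
        raise ValueError("entry_bytes must be positive")
    if not BLOCK_HEADER_SIZE <= offset_in_block < LOG_BLOCK_SIZE:
        raise ValueError("offset must lie in the payload area of a block")

    remaining = LOG_BLOCK_SIZE - offset_in_block  # payload left in this block
    if entry_bytes <= remaining:
        return entry_bytes  # the entry does not span a block boundary

    payload_per_block = LOG_BLOCK_SIZE - BLOCK_HEADER_SIZE
    overflow = entry_bytes - remaining
    # Every further block that receives payload begins with its own header.
    extra_blocks = -(-overflow // payload_per_block)  # ceiling division
    return entry_bytes + extra_blocks * BLOCK_HEADER_SIZE
```

Under these assumptions, a 600-byte entry starting immediately after a block header
spills into one further block, and so needs one extra header allowance.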

However, before taking the redo allocation latch,
the process first takes the redo copy latch that will be needed for the copy into the log buffer.
The redo copy latches are used to indicate that a process is copying redo into the log buffer,
and that LGWR should wait until the copy has finished, before writing the target log buffer blocks to disk.

No-wait mode is used for most gets against the redo copy latches,
because the process can use any one of them to protect its copy into the log buffer.
It first attempts to get the copy latch that it last held.
Failing that, the process attempts to get each other copy latch in turn, in no-wait mode.
Willing-to-wait mode is only used to get the last copy latch if no-wait gets against all the other copy latches have failed.
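
The acquisition order just described can be modelled with ordinary locks.
This is a minimal sketch, not Oracle's implementation: threading.Lock objects stand in for the copy latches,
the function name is hypothetical, and the round-robin order in which the other latches are tried is an assumption,
because the text only says that they are tried in turn.

```python
import threading


def get_copy_latch(latches, last_held):
    """Acquire one copy latch and return its index (hypothetical model).

    latches   -- list of threading.Lock objects standing in for copy latches
    last_held -- index of the latch this process held last time
    """
    n = len(latches)
    # Visit the last-held latch first, then each other latch in turn
    # (round-robin order is an assumption).
    order = [(last_held + i) % n for i in range(n)]

    # No-wait attempts against every latch except the final candidate.
    for idx in order[:-1]:
        if latches[idx].acquire(blocking=False):
            return idx

    # All the no-wait gets failed: willing-to-wait get on the last latch.
    latches[order[-1]].acquire()
    return order[-1]
```

Because the no-wait gets normally succeed against one latch or another,
the blocking acquire at the end is reached only when every other latch is busy.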

Prior to release 8.1, the use of a redo copy latch was conditional on the size of the redo entry.
If the redo entry was smaller than the number of bytes specified by the _log_small_entry_max_size parameter,
then a redo copy latch would not be taken, and the copy would be performed while retaining the redo allocation latch.
This parameter defaulted to 800 bytes up to the early releases of Oracle 7.3, and to 80 bytes thereafter.
Copies performed under the redo allocation latch were reported in the redo small copies statistic,
and it was not uncommon for this to represent almost all redo entries.
However, it was a common tuning practice to set the parameter to 0 bytes
to prevent small copies, and thus minimize the load on the critical redo allocation latch.
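
The effect of the threshold can be illustrated with a small calculation.
The sketch below is illustrative only: the function names and the sample entry sizes are invented,
and the only rule taken from the text is that an entry smaller than the parameter value is copied under the redo allocation latch.

```python
def is_small_copy(entry_bytes, small_entry_max_size):
    """True if the copy would be done under the redo allocation latch."""
    return entry_bytes < small_entry_max_size


def small_copy_fraction(entry_sizes, small_entry_max_size):
    """Fraction of redo entries that would be counted as small copies."""
    if not entry_sizes:
        return 0.0
    small = sum(1 for size in entry_sizes
                if is_small_copy(size, small_entry_max_size))
    return small / len(entry_sizes)


# Invented sample of redo entry sizes in bytes, for illustration only.
sample_sizes = [60, 120, 300, 900]

# Compare the two defaults mentioned in the text with a setting of 0.
fractions = {limit: small_copy_fraction(sample_sizes, limit)
             for limit in (800, 80, 0)}
```

With a setting of 0 no entry can be smaller than the threshold,
so every copy is protected by a redo copy latch instead.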

Once a redo copy latch and the redo allocation latch have been acquired, and space has been allocated in the redo log buffer,
the redo allocation latch is then released
and the change vectors are copied into the log buffer from temporary buffers in the PGA of the process.

Prior to release 8.1, the change vectors for redo entries larger than the value of the _log_entry_prebuild_threshold parameter
would first be copied into a single buffer in the PGA of the process, before acquiring any latches,
so that they could be copied into the log buffer in a single operation.
This was reflected in the redo entries linearized statistic.

Once the copy is complete, the change vectors are applied to the affected database blocks, the redo entry is marked as valid,
and the process then releases its redo copy latch.
At this point the process may need to post LGWR to signal that it should begin to flush the log buffer.
This applies if the allocation raised the number of used blocks in the log buffer above the threshold set by the _log_io_size parameter,
or if a commit marker has been copied into the redo stream as part of a commit.
However, to ensure that LGWR is not posted needlessly,
the process takes the redo writing latch to check whether LGWR is already active or has already been posted.
The redo writing latch is then released, and if appropriate the LGWR process is posted.
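
The posting decision can be sketched as follows.
This is a hedged model rather than Oracle's code: the class, its field names and the function name are hypothetical,
a threading.Lock stands in for the redo writing latch,
and recording the post in a flag while the latch is held is an assumption made so that the example is self-contained.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class LogWriterState:
    """Hypothetical stand-in for the SGA state describing LGWR."""
    active: bool = False
    posted: bool = False
    redo_writing_latch: threading.Lock = field(default_factory=threading.Lock)


def should_post_lgwr(state, used_blocks, log_io_size, wrote_commit_marker):
    """Return True if the caller should go on to post LGWR."""
    # A post is only considered if the threshold was exceeded
    # or a commit marker was copied into the redo stream.
    if used_blocks <= log_io_size and not wrote_commit_marker:
        return False

    # Check under the redo writing latch, then release it before posting.
    with state.redo_writing_latch:
        if state.active or state.posted:
            return False      # LGWR is already busy or already posted
        state.posted = True   # assumption: the post is recorded here
    return True
```

The latch is held only for the check itself,
so the post is issued after the latch has been released, as in the description above.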

Log File Syncs

If a process is waiting for LGWR to write a particular log buffer block to disk, it waits in a log file sync wait.
The normal cause of log file sync waits is transaction termination;
however, DBWn also suffers these waits when writing recently modified blocks.

When a process wakes up from a log file sync wait,
it must check whether the log buffer block containing the redo of interest has yet been written to disk.
If not, it must continue to wait.
The SGA variable that shows whether a particular log buffer block has yet been written to disk
is the index into the log file representing the base disk block for the log buffer.
This variable is of course protected by the redo allocation latch, and so the redo allocation latch must be taken to check it.
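
The check-and-wait loop can be sketched as below.
This is an illustrative model only: the class and function names are hypothetical,
a threading.Lock stands in for the redo allocation latch,
and the convention that a block has been written when its index is below the base disk block is an assumption.

```python
import threading


class LogBufferState:
    """Hypothetical stand-in for the SGA variables discussed in the text."""

    def __init__(self):
        self.redo_allocation_latch = threading.Lock()
        # Assumed meaning: index of the first log file block not yet written.
        self.base_disk_block = 0

    def advance_base(self, new_base):
        """Called on behalf of LGWR after a log write has completed."""
        with self.redo_allocation_latch:
            self.base_disk_block = max(self.base_disk_block, new_base)

    def is_written(self, block):
        """Check, under the latch, whether the block has reached disk."""
        with self.redo_allocation_latch:
            return block < self.base_disk_block


def log_file_sync(state, block, wait_for_post):
    """Wait until the block is on disk and return the number of wakeups.

    wait_for_post is a caller-supplied function that blocks until posted.
    """
    wakeups = 0
    while not state.is_written(block):  # the latch is taken on every check
        wait_for_post()
        wakeups += 1
    return wakeups
```

Each wakeup costs one get against the redo allocation latch,
which is why log file sync waits contribute to the load on that latch.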

How does LGWR use the redo latches?

The redo latches are not taken only in connection with redo generation and log file sync waits.
LGWR also takes them when writing redo from the log buffer to the log files.

When LGWR wakes up, it first takes the redo writing latch to update the SGA variable that shows whether it is active.
This prevents other processes from posting LGWR needlessly.
Then, if it was not posted, LGWR takes the redo allocation latch to determine whether there is any redo to write.
If not, it takes the redo writing latch again to record that it is no longer active, before starting another rdbms ipc message wait.

If there is any redo to write, LGWR then inspects the latch recovery areas for the redo copy latches (without taking the latches)
to determine whether there are any incomplete copies into the redo buffers that it intends to write.
If so, LGWR sleeps on a LGWR wait for redo copy wait event, and is posted when the required copy latches have been released.
The time that LGWR spends taking the redo writing latch and the redo allocation latch,
and waiting for the redo copy latches, is accumulated in the redo writer latching time statistic.

Under Oracle 7.3 and 8.0, foreground processes held the redo copy latches more briefly
because they did not retain them for the application of the change vectors.
Therefore, LGWR would instead attempt to assure itself that there were no ongoing copies into the log buffer
by taking all the redo copy latches.

After each log write has completed, LGWR must take the redo allocation latch again
in order to update the SGA variable containing the base disk block for the log buffer.
This effectively frees the log buffer blocks that have just been written, and they may then be reused.

How should the redo latches be tuned?

In general, redo latching only needs tuning if the willing-to-wait miss rate on the redo allocation latch is high,
or if the no-wait miss rate on the redo copy latches is high.
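
The two miss rates can be computed from latch statistics as in the sketch below.
The function names are hypothetical, and the figures are assumed to come from the GETS, MISSES, IMMEDIATE_GETS and IMMEDIATE_MISSES columns of V$LATCH.
The no-wait formula also assumes that IMMEDIATE_GETS counts only the successful no-wait gets.
No threshold for a "high" rate is implied.

```python
def willing_to_wait_miss_rate(gets, misses):
    """Misses as a fraction of willing-to-wait gets."""
    return misses / gets if gets else 0.0


def no_wait_miss_rate(immediate_gets, immediate_misses):
    """Misses as a fraction of all no-wait attempts.

    Assumes immediate_gets counts only the successful no-wait gets,
    so the number of attempts is the sum of the two counters.
    """
    attempts = immediate_gets + immediate_misses
    return immediate_misses / attempts if attempts else 0.0
```

It is best to compute the rates from the difference between two snapshots of the statistics,
so that they reflect the current workload rather than the whole life of the instance.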

Redo Allocation

There is, and can only be, one redo allocation latch in each instance, so there is no scope for parallelism on this latch.
There is also no way to reduce the duration for which the latch is held.
All you can do is to attempt to reduce load on the latch.
Most importantly, ensure that LGWR is not overactive because of too low a _log_io_size setting,
and that DBWn is not checkpointing recently changed blocks too intensively.
At the application level, beware of spurious COMMITs, avoid SELECT FOR UPDATE statements and BEFORE ROW triggers,
and code DML operations to minimize redo generation.

Redo Copy

Under Oracle 8.1, because LGWR does not itself sleep on the redo copy latches,
sleeps against these latches do indicate a higher degree of concurrency than supported by the current number of latches.
In such cases, the number of redo copy latches may be raised, using the _log_simultaneous_copies parameter.
Note, however, that there is a fixed limit of 32 redo copy latches,
and that some platforms support shareable redo copy latches.

Under Oracle 7.3 and 8.0, a high willing-to-wait redo copy latch miss rate is routine.
Willing-to-wait gets for a redo copy latch by foregrounds are rare,
and only occur when all the other redo copy latches are unavailable.
When all the redo copy latches are unavailable, the usual reason is not a shortage of latches,
but that LGWR is holding all of them.
Increasing the number of latches only increases the redo writer latching time,
and does not reduce the risk of redo copy latch unavailability.
If any process goes to sleep while holding a redo copy latch,
LGWR will spin waiting for it, while holding all the other redo copy latches,
and all other gets will quickly fall through to a willing-to-wait miss.
Therefore, a high willing-to-wait miss rate is to be regarded as a symptom of a small problem elsewhere,
rather than a problem in its own right.

In general, redo copy latch issues should not be addressed until redo generation has been minimized
and the redo allocation latch has been tuned.
You should also check that latch holders are not suffering from CPU starvation.
Thereafter, modest increases in the number of redo copy latches may be appropriate in some circumstances.