
This post is about log writer (lgwr) IO.
It’s good to point out the environment on which I do my testing:
Linux X64 OL6u3, Oracle 11.2.0.3 (no BP), Clusterware 11.2.0.3, ASM, all database files in ASM.

In order to look at what the logwriter is doing, a 10046 trace of the lgwr at level 8 gives an overview.
A way of doing so is using oradebug. Be very careful about using oradebug on production environments: it can cause the instance to crash.
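
A sketch of how that could be done from a sysdba SQL*Plus session (the logwriter O/S process id, 1234 here, is just a placeholder; look it up with ps or in v$process):

SQL> oradebug setospid 1234
SQL> oradebug unlimit
SQL> oradebug event 10046 trace name context forever, level 8
SQL> oradebug tracefile_name

When you are done, switch the trace off again with ‘oradebug event 10046 trace name context off’.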

What we see here is the logwriter spending 3 seconds (ela= 3000812) on the event ‘rdbms ipc message’, simply because the timeout (the timeout column, obviously) was set so (timeout=300; the timeout is set in centiseconds, Oracle’s old “fine-grained” unit of time). To understand what this means from the perspective of the operating system (Linux), there’s a neat trick: use strace with a verbose dump of the write calls. That way the extended SQL trace will “document” the system calls. This is what that looks like:
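
A sketch of such an strace invocation (again, 1234 is a placeholder for the logwriter process id); the -e write=all option makes strace dump the data of every write() call, so the wait event lines Oracle writes to its trace file show up in between the system calls:

$ strace -e write=all -p 1234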

First thing to notice (which is not relevant to this text about logwriter IOs, but nice to know): any time there’s a timestamp in the Oracle trace file (like *** 2013-08-29 21:38:58.721), Oracle issued a times() syscall prior to that. If you want to know what that does, just issue “man 2 times” at the Linux prompt. As a summary: this syscall fetches the CPU time spent in user mode and system/kernel mode, for the current process and for its children.

Back to what this post is about: we see ‘rdbms ipc message’ waits. This is the instrumentation of the time the process is (actually) sleeping on a semaphore, with the timeout mostly set to 3 seconds:

Use the manpage of semtimedop to understand the full call; the last argument ({3, 0}) is a struct timespec. The first number is seconds, the second number is nanoseconds. In other words: if you see the logwriter spending its time on ‘rdbms ipc message’, it’s waiting for work. The semaphore construction is made so the process can be asked (by another process) to do something, and can act on that immediately.

I guess most people reading this blog, and coming this far, are aware of what the function of the logwriter is (writing the log buffer to disk), and the logwriter does that when it finds something to write, or when some other process asks it to do so. The process which asks the logwriter to write waits for a response from the logwriter on the event ‘log file sync’.

What we see is an asynchronously submitted IO request (one IO request; io_submit), which is reaped after that (io_getevents), after which the logwriter writes a wait event to the tracefile indicating this has happened. The logwriter’s ‘log file parallel write’ event statistics tell us it wrote to 1 file (equalling one IO, as can be seen from the IO system calls), with a length of 2 blocks. As we know, the log block size is and has been 512 bytes on Linux, which can be seen in the io_getevents call (1024 equals 2*512). The elapsed time (ela) is 107 microseconds.

The log blocksize is and has been 512 bytes on Linux, but can also be 4kB with more recent versions of Oracle, for newer SSD devices.
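
If you want to verify the redo log block size on your own system, one way (as sysdba) is to query the x$kccle fixed table; the lebsz column holds the log block size in bytes:

SQL> select max(lebsz) from x$kccle;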

It would be logical to think that in this example the IO time is 107 microseconds. But hold that thought, I’ll come back to that. Let’s add a logfile member to every logfile (a sketch of the statements is shown below), and switch to a new logfile to make writes to both logfile members happen. Let’s look at the logwriter writing something now:
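
A sketch of the statements to do that (the ASM diskgroup ‘+REDO’ and the group numbers are placeholders for your own configuration; repeat the first statement for every log group):

SQL> alter database add logfile member '+REDO' to group 1;
SQL> alter database add logfile member '+REDO' to group 2;
SQL> alter system switch logfile;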

We see here that Oracle submits two IOs at the same time, using one call, and reaps both of them (read the manpage of io_getevents; the minimal number of IOs to wait for, min_nr, is set to 2). So indeed ‘log file parallel write’ writes in parallel. The number of blocks is a bit puzzling: in total there are two (log) blocks written (blocks=2). In the earlier example, the logwriter IO to a single online redologfile had blocks set to 2, which meant it wrote 1024 bytes (2 blocks of 512 bytes) to that one file; here blocks=2 seems to mean the logwriter wrote 2 blocks in total.

How would Oracle write when asynchronous IO is not enabled? Well, let’s set disk_asynch_io to false, re-enable the extended SQL trace, and strace it:
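
A sketch of how to switch off asynchronous IO (disk_asynch_io is a static parameter, so this needs an instance restart):

SQL> alter system set disk_asynch_io=false scope=spfile;
SQL> shutdown immediate
SQL> startup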

Ah! So when there’s no asynchronous IO, Oracle issues two pwrite() calls sequentially. Not really parallel!

Now think about what the wait means in both situations. In the first situation, with asynchronous IO (which should be your setting if you have a modern system), the timing covers two IOs, and it seems to indicate the time both IOs took. Since they have been issued in parallel, if you have enough storage bandwidth left, this seems to indicate the time of the slowest IO. In the second situation, the IOs are issued serially, so the wait time seems to indicate the total latency of the two IOs. This means you have to divide the wait time by the number of requests to get the average IO time. Again: that is only when the IOs are done synchronously instead of asynchronously.

But is this true?

Let’s dig a little deeper using gdb! First synchronous IO. Attach to the logwriter using gdb (gdb -p PID), and issue the following debugger commands:

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 1 at 0x8f9a652
<function, no debug info> kslwtbctx;
Breakpoint 2 at 0x8fa1334
<function, no debug info> kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 1-2, one per line.
End with a line saying just "end".
>c
>end
(gdb) break pwrite
Breakpoint 3 at 0x3f38a0ee90: file ../sysdeps/unix/syscall-template.S, line 82. (2 locations)
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>c
>end
(gdb)

Hey! Do you see what I see? First I see the logwriter nicely sleeping like a little baby. With that I mean: I see kslwtbctx() indicating the start of the timing of a wait event, then semtimedop() being issued for the sake of sleeping, and after semtimedop() has timed out, kslwtectx() ending the event. But then we start writing (which is the function of the log writer; the pwrite64 calls), and there seems to be a kslwtbctx() missing??? Actually, kslwtbctx() is there, but AFTER the pwrite64() calls!

Am I missing something? Is Oracle doing something incredibly smart? Well, let’s start an extended SQL trace and throttle the IO. By throttling the write IO for the logwriter I can (severely) increase the IO time, so the wait times in the trace should increase drastically. (See the throttling blogpost on this blog; for the logwriter you must limit write IOPS instead of read IOPS; a sketch of the setup is shown below.) I’ve done that, and this is what the extended SQL trace shows:
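
As a quick reminder of the throttling setup referred to above, a sketch using the cgroups v1 blkio controller (the mount point /cgroup/blkio, the device numbers 8:16 and the logwriter pid 1234 are placeholders for your own system):

# mkdir /cgroup/blkio/lgwr_throttle
# echo "8:16 1" > /cgroup/blkio/lgwr_throttle/blkio.throttle.write_iops_device
# echo 1234 > /cgroup/blkio/lgwr_throttle/tasks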

This is what I call being caught with the fingers in the cookie jar! The wait event shows an (extremely) low timing for the IOs (the ela=), and after that we get a warning that the log writes were terribly slow! Oracle doesn’t time its log write events correctly if we do synchronous IO in this situation!!!

Now let’s have a look at asynchronous IO. I do the same: attach to lgwr with gdb, and now break and continue on the asynchronous IO system calls: io_submit and io_getevents (please mind on Linux this requires breaking on io_getevents_0_4):

(gdb) rbreak ^kslwt[be]ctx
Breakpoint 1 at 0x8f9a652
<function, no debug info> kslwtbctx;
Breakpoint 2 at 0x8fa1334
<function, no debug info> kslwtectx;
(gdb) commands
Type commands for breakpoint(s) 1-2, one per line.
End with a line saying just "end".
>c
>end
(gdb) break io_submit
Breakpoint 3 at 0x3f38200660: file io_submit.c, line 23.
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>c
>end
(gdb) break io_getevents_0_4
Breakpoint 4 at 0x3f38200620: file io_getevents.c, line 46.
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>c
>end
(gdb) c
Continuing.

You should now see a stream of kslwtbctx and kslwtectx coming, which is the timing of the semtimedop() system calls. If you insert something in a table you’ll see:
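
To generate some redo, a minimal sketch (assuming a scratch table t with a single numeric column exists):

SQL> insert into t values ( 1 );
SQL> commit;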

Here’s something worth seeing! The first line shows the ending of a wait (kslwtectx), then the logwriter submitting an IO request (io_submit). Notice there isn’t any wait event timing started at this point, because kslwtbctx is called AFTER the IOs are submitted. This means the wait event times the reaping of the submitted IOs, instead of measuring the full IO latency. I know this is and should be a tiny fragment of time, but nevertheless the wait only shows the waiting for the reaping of the IO, not the timing of the entire IO.

Now look at the io_submit/io_getevents strace output again shown earlier in the post:

Look at the struct timespec in the io_getevents call: {0, 0}. This is a non-blocking reap call, just to peek at the O/S completion queue. In other words, if the submitted IOs take a long time or are not ready yet, this call simply returns empty-handed and execution continues. This looks like the asynchronous read implementation of direct path reads (see this blogpost about non-blocking io_getevents calls).

Let’s throttle the IO of the lgwr, and print out the timespec struct at the break on io_getevents:

(gdb) break io_getevents_0_4
Breakpoint 4 at 0x3f38200620: file io_getevents.c, line 46.
(gdb) commands
Type commands for breakpoint(s) 4, one per line.
End with a line saying just "end".
>print *timeout
>c
>end

Now redo the insertion of data into a table, and commit to put the logwriter to work. Please mind IO is throttled!

Ah! We see the io_submit call issuing two IOs (nr=2), then the start of the timing of the ‘log file parallel write’ event (kslwtbctx), and then two io_getevents calls: the non-blocking one first, which almost certainly does not succeed (and is NOT shown by strace!), after which a blocking io_getevents call is issued (because seconds in the timespec struct is set to 600).

I haven’t got an idea why a non-blocking io_getevents call is issued first, followed by a blocking io_getevents call.

Summary
This is the description taken from Oracle 11.2 reference guide of log file parallel write:

‘Wait Time: Time it takes for the I/Os to complete. Even though redo records are written in parallel, the parallel write is not complete until the last I/O is on disk.’

This is actually a very good description, surprisingly accurate for asynchronous IO.
As we’ve seen, this is not true for synchronous IO: there Oracle just issues the I/Os it needs to do sequentially, and these are not timed at all, yet a wait time is displayed, which is wrong.

It turns out that for asynchronous log writes, the ‘log file parallel write’ event times the waiting for the writes to be flushed to disk. This means that waiting for the slowest write is what is made visible by the event timing; all the fast(er) writes can have any time below the timing of the slowest write. In other words, the wait is timed from the perspective of the process, not as IO latency. It gives a fair hint of the IO time of the slowest IO, so for the way most people use it (measuring the impact of the time the logwriter spends on IO) it’s fairly accurate, but for actual IO latency measurement it’s weak. If you want to go beyond measuring the time the logwriter spends on IO, you should be very well aware of the implementation. For example, you cannot divide the ‘log file parallel write’ time by the number of IOs; the resulting time does not reflect anything.

Correction: Chris Buckel pointed out I had ‘db file parallel write’ instead of ‘log file parallel write’ in the summary. This is corrected.
Correction: Kevin Closson pointed out that the log block size can be 4kB too with newer versions of Oracle.