A Brief Analysis of Logfile Sync Waiting for Super High Case

Monitoring tool DPA found a DB Commit Time indicator alarm on an Oracle database server overseas, which exceeded the red alarm line (about 40 milliseconds, yellow alarm 10 milliseconds, red alarm line 20 milliseconds). As shown in the following screenshot, the AWR report for the corresponding period was generated, and log file sync waited inside Top 5 Timed Events The average wait time for an event is 37 milliseconds, and the average wait time for a log file parallel write event is 40 milliseconds

If you are familiar with Tanel Poder:: Understanding LGWR, Log File Sync Waits and Commit Performance s (classic picture below), then by waiting for the event log file sync and log file parallel write Avg Wait(ms) indicator: 37ms & 40ms, you can basically tell where the redo log is locatedDisk I/O on

The detailed statistics are then analyzed using the lfsdiag.sql script as follows:

As you can see from the above data, the maximum number of log file sync waiting events is in the range of 32ms, and the maximum number of log file parallel writes waiting events is in the range of 32ms. In fact, this value is far more than 7ms, which is extremely abnormal.The log file parallel write event is a wait event exclusive to the LGWR process, which occurs when LGWR writes the redo log information from the log_buffer to a member file of the online redo log file group, on which LGWR waits for the completion of the write process.The event took too long to wait, indicating that the disk on which the log file resides is slow or contentious.Log file sync and log file parallel write are related to each other.In other words, assuming that the log file parallel write takes a very long time, it must cause the log file sync to wait longer.If log file parallel write waits very high, it may generally be a physical disk I/O problem

In addition, we also checked the frequency of redo log switching. As shown below, redo log switching is not frequent and the amount of archive logs generated is not large.

Then we found a machine (server with normal index above) to simply test the speed of IO. This method is very simple, that is, to see how long it takes to generate a large file, to simply test the I/O performance (without considering cache, etc., test sampling is not exhaustive), but the comparison data is also basically validated.Certification, feedback disk IO problems).

Problem Server:

# time dd if=/dev/zero of=./test bs=512k count=2048 oflag=direct

2048+0 records in

2048+0 records out

1073741824 bytes (1.1 GB) copied, 88.271 seconds, 12.2 MB/s

real 1m28.273s

user 0m0.010s

sys 0m0.655s

Compare servers (normal servers):

# time dd if=/dev/zero of=./test bs=512k count=2048 oflag=direct

2048+0 records in

2048+0 records out

1073741824 bytes (1.1 GB) copied, 2.48344 seconds, 432 MB/s

real 0m2.485s

user 0m0.004s

sys 0m0.386s

As shown in the above comparison, the two servers generate the same size file, which takes a lot of time and has very different I/O performance. This fully verifies that the alerted server is located in a storage I/O performance problem, but the company division of labor is very clear, and the DBA does not know what is wrong with the underlying storage, so it can only feedback this problemOut and wait for replies from colleagues overseas responsible for maintaining systems and storage.