RAID controller got frozen

Why it's important to buy a good RAID controller

I’m writing this post to explain a problem I ran into with a RAID controller. One morning I was notified that PostgreSQL on one development server had stopped. I connected to the server and confirmed that the service was down. It was a physical server running Windows Server 2008 R2 and PostgreSQL 9.0, with its disks in RAID 1. It’s important to note that Windows Server Backup starts at 22:00 and usually takes 2 hours to complete, so at the time of the failure it was running and generating a lot of I/O.

Analyzing the logs

PostgreSQL logs

First I checked the most recent log file in pg_log:

2012-05-17 23:46:12 PANIC: could not fsync log file 120, segment 174: Bad file descriptor
2012-05-17 23:46:12 STATEMENT: COMMIT
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application’s support team for more information.
2012-05-17 23:46:12 LOG: server process (PID 4488) exited with exit code 3
2012-05-17 23:46:12 LOG: terminating any other active server processes

These 3 lines are repeated several times:

2012-05-17 23:46:12 WARNING: terminating connection because of crash of another server process
2012-05-17 23:46:12 DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2012-05-17 23:46:12 HINT: In a moment you should be able to reconnect to the database and repeat your command.

2012-05-17 23:46:12 LOG: all server processes terminated; reinitializing
2012-05-17 23:46:22 FATAL: pre-existing shared memory block is still in use
2012-05-17 23:46:22 HINT: Check if there are any old server processes still running, and terminate them.
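
When a log like this runs to many pages, the lines that matter are the PANIC and FATAL entries. The following is a minimal Python sketch for pulling them out. It assumes each line starts with a timestamp in the format shown in the excerpt above (which depends on how log_line_prefix is configured), and the function name find_severe_entries is only illustrative.

```python
import re
from datetime import datetime

# Assumed line layout, taken from the excerpt above:
#   "2012-05-17 23:46:12 PANIC: could not fsync log file ..."
# A different log_line_prefix setting would need a different pattern.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (PANIC|FATAL): (.*)$"
)

def find_severe_entries(lines):
    """Return (timestamp, severity, message) for every PANIC or FATAL line."""
    entries = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            timestamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            entries.append((timestamp, match.group(2), match.group(3)))
    return entries

# A few lines copied from the log in this post.
sample = [
    "2012-05-17 23:46:12 PANIC: could not fsync log file 120, segment 174: Bad file descriptor",
    "2012-05-17 23:46:12 STATEMENT: COMMIT",
    "2012-05-17 23:46:12 LOG: server process (PID 4488) exited with exit code 3",
    "2012-05-17 23:46:22 FATAL: pre-existing shared memory block is still in use",
    "2012-05-17 23:46:22 HINT: Check if there are any old server processes still running, and terminate them.",
]

for timestamp, severity, message in find_severe_entries(sample):
    print(timestamp, severity, message)
```

In practice the list of lines would come from reading the log file itself rather than from a hard-coded sample.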

I had no idea why this crash had happened. I started the server successfully, and the database was in a consistent state.

Event Viewer error:

17/05/2012 23:46:12 ntfs (eventID 57): The system failed to flush data to the transaction log. Corruption may occur.

This gives us the explanation: the system couldn’t flush PostgreSQL’s data to the log file, so PostgreSQL stopped immediately. It was the only way to maintain the integrity of the database. I suppose the I/O from Windows Backup also contributed to the problem. For 1 minute and 30 seconds the RAID controller was frozen and couldn’t flush to disk. After that, everything ran fine because the RAID 1 still had one disk to work with. PostgreSQL and Windows Backup were affected by the problem; all the other software kept working without any issue. But we have to keep in mind that databases are very sensitive to this.
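
The link between the two logs rests on their timestamps, which are written in different formats. As a rough sketch, the comparison can be done in Python like this. The function names, the 5-second tolerance and the day-first reading of the Event Viewer date are assumptions made for illustration; the 22:00 start and 2-hour duration of the backup come from the description above.

```python
from datetime import datetime, time, timedelta

def parse_pg(ts):
    """Parse a PostgreSQL log timestamp such as '2012-05-17 23:46:12'."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")

def parse_event(ts):
    """Parse an Event Viewer timestamp such as '17/05/2012 23:46:12' (day first)."""
    return datetime.strptime(ts, "%d/%m/%Y %H:%M:%S")

def same_incident(pg_ts, event_ts, tolerance_seconds=5):
    """True if both entries were logged within tolerance_seconds of each other."""
    delta = abs((parse_pg(pg_ts) - parse_event(event_ts)).total_seconds())
    return delta <= tolerance_seconds

def in_backup_window(moment, start=time(22, 0), duration_hours=2):
    """True if moment falls inside the nightly backup window."""
    window_start = datetime.combine(moment.date(), start)
    if moment < window_start:
        # The window may have started on the previous evening.
        window_start -= timedelta(days=1)
    return window_start <= moment < window_start + timedelta(hours=duration_hours)

pg_panic = "2012-05-17 23:46:12"      # PANIC line in pg_log
ntfs_error = "17/05/2012 23:46:12"    # ntfs event 57 in Event Viewer

print(same_incident(pg_panic, ntfs_error))
print(in_backup_window(parse_pg(pg_panic)))
```

Both checks come out true for this incident: the two errors were logged in the same second, and that second falls inside the backup window.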

I googled for information about the RAID controller and found a post where someone was complaining about it.

Maybe I had the wrong idea of what RAID is, so I checked the Wikipedia article on RAID:

RAID (redundant array of independent disks, originally redundant array of inexpensive disks) is a storage technology that combines multiple disk drive components into a logical unit. Data is distributed across the drives in one of several ways called “RAID levels”, depending on what level of redundancy and performance (via parallel communication) is required.
[…]
In RAID 1 (mirroring without parity or striping), data is written identically to two drives, thereby producing a “mirrored set”; at least two drives are required to constitute such an array. While more constituent drives may be employed, many implementations deal with a maximum of only two; of course, it might be possible to use such a limited level 1 RAID itself as a constituent of a level 1 RAID, effectively masking the limitation. The array continues to operate as long as at least one drive is functioning. With appropriate operating system support, there can be increased read performance, and only a minimal write performance reduction; implementing RAID 1 with a separate controller for each drive in order to perform simultaneous reads (and writes) is sometimes called multiplexing (or duplexing when there are only two drives).

Conclusion

Reading the Wikipedia quote “The array continues to operate as long as at least one drive is functioning” a second time, I think it is a little ambiguous given what happened to me. I thought the idea of RAID was that the system keeps working seamlessly when a disk fails. But it looks like some RAID implementations don’t behave exactly like that, or at least there can be an interruption before the array continues to operate…

As a DBA I’m in charge of the databases, but I don’t control the hardware they run on. And things can always get worse: there are plenty of stories where RAIDs weren’t monitored and disk failures weren’t detected until all the information was lost. That’s why a good backup policy and a good recovery plan are the most important things.