mfi1: TIMEOUT AFTER 5669 SECONDS + Kernel Panic

This is my first post here on the forums, and I'm a relatively new FreeBSD sysadmin. We ordered a Dell R710 with two RAID controllers: the PERC H700 for 6 internal drives (mfi0) and the PERC H800 for a 24-drive direct-attached storage device (mfi1). I have a zpool on 4 of the internal drives on mfi0 and a zpool across all 24 drives in the mfi1 DAS.

From time to time I have noticed that reads and writes to the mfi1 DAS hang for a long time. While it is hung, timeout error messages show up in /var/log/messages. After some seemingly arbitrary amount of time we're back to normal operation. It may be days before the problem shows up in the logs again, and the reported timeout may be anywhere on the order of 30 to 15000 seconds. I have not yet discerned a pattern. Here is an example:

I have a hunch that this occurs when copying large amounts of data (e.g. 0.5 TB) between the drive arrays. I've also seen it occur during a hefty amount of database activity on mfi1. This database is not heavily used as an online database and is mostly for data warehousing, so I've waited a bit to attack the problem while other fires demanded my attention.
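To look for a pattern in these timeouts, one option is to tally them per controller straight from the log. This is only a minimal sketch: it assumes the kernel messages contain text shaped like the thread title ("mfi1: ... TIMEOUT AFTER 5669 SECONDS"), and the function names summarize_timeouts and summarize_log are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed message shape, taken from the thread title:
#   "mfi1: ... TIMEOUT AFTER 5669 SECONDS"
# Adjust the pattern if your log lines look different.
TIMEOUT_RE = re.compile(r"\b(mfi\d+): .*?TIMEOUT AFTER (\d+) SECONDS")


def summarize_timeouts(lines):
    """Return {controller: (count, longest_seconds)} for matching lines."""
    seen = defaultdict(list)
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            seen[match.group(1)].append(int(match.group(2)))
    return {ctrl: (len(secs), max(secs)) for ctrl, secs in seen.items()}


def summarize_log(path="/var/log/messages"):
    """Read a log file and summarize the mfi timeouts found in it."""
    with open(path, errors="replace") as log:
        return summarize_timeouts(log)


if __name__ == "__main__":
    # The only sample line used here is the thread title itself.
    print(summarize_timeouts(["mfi1: TIMEOUT AFTER 5669 SECONDS"]))
    # prints {'mfi1': (1, 5669)}
```

Running summarize_log() on the affected host and comparing the result with the times of large copies or heavy database activity would be one way to test the hunch above.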

This morning the server was unresponsive and the following information was in the logs and on screen. A hard shutdown and startup seems to have solved the problem with no data loss (thanks, ZFS checksumming).

Recovered from /var/log/messages. Also shown on screen before panic message below:

Remember also that when there is a battery discharge in progress, the write policy will change from Write-Back to Write-Through if the capacity is under a certain level and you'll get that in your /var/log/messages:

Thanks for the information about the battery. That was a likely candidate. The problem was actually solved by a firmware update: the Dell PERC H700 and H800 RAID controllers had a pretty serious bug in their firmware.

In the end it wasn't FreeBSD's fault, but I'll go ahead and enumerate the solution for wayward Googlers with the same problem.

From the release notes:
1. Corrects a potential scenario where a drive may repeatedly fail during I/O.
2. Fixes an issue in which IEEE scatter gather lists may be misinterpreted by firmware leading to possible PCIe fatal error.

Here's what I had to do to get the H800 firmware updated. Dell only supplies executables for Windows and Red Hat Linux. In order to avoid adapting the installation script to FreeBSD I had to:

A quick note for those who find this solution: my PERC H800 seems to be cured of all timeout problems, but the H700 with the same firmware update showed some timeouts in /var/log/messages just now. So while this may be a valid solution for the H800, the jury is still out on the H700.

I had the same problem here with an LSI 9260-4i. The controller is plugged into an Intel S1200BTL inside a 24-bay Supermicro chassis running FreeBSD 9 and ZFS.
After trying lots of things, including other releases of FreeBSD, I solved this problem by changing the read policy from adaptive to normal in the controller setup.