PERC H700 predictive failure continues with new drive

System is a PowerEdge R710 with a PERC H700 integrated controller. The virtual drive in question is RAID 5 with 3 drives. I have been monitoring the controller logs weekly and running a consistency check for the past 3 months with 0 issues. Suddenly yesterday, OMSA showed 1 drive as predictive failure. I went through the logs and saw a bunch of unexpected sense logs, many of them stating "corrected medium error", but I did not see any unrecoverable errors. I went onsite, offlined the drive, then inserted a new Dell-branded drive. The rebuild completed, but the new drive is also reporting as predictive failure. I went through the logs again and noticed many more unexpected sense logs during the rebuild process. I then ran a consistency check, which again produced many unexpected sense logs. The drive was still in predictive failure state, so I replaced it again with another new drive. This time, after the rebuild the drive was not showing predictive failure. Just to be sure, I ran another consistency check, which put the same drive in predictive failure state again. There are again a bunch of unexpected sense logs, many of them stating "corrected medium error". I doubt that both drives are bad, and am unsure how to proceed.

The firmware for the controller, drives, and BIOS is all up to date, but iDRAC6 and Lifecycle are out of date. iDRAC6 is at 1.92.00 (build 5) and Lifecycle is at 1.4.0.445.

Please let me know your thoughts. If you recommend updating iDRAC6 and Lifecycle, please let me know where to find the updates and the steps for updating. I am a little confused about where to locate these updates on the support page. Also, can I just update to the latest version, or does it have to be done in steps?

Re: PERC H700 predictive failure continues with new drive

I suspect that the virtual disk is punctured. Predictive failure status is a SMART issue that is reported by the firmware on the disk. When the virtual disk has a puncture, the rebuild will copy the bad virtual disk information onto the replacement disk. This bad information can cause false bad blocks, which can lead to predictive failure.

You will need to review the controller log to look for a puncture. If there is a puncture then you should see the same bad block listed on two or more disks. I think the H700 is the first controller that will report a puncture. You can search the log for the word puncture. The log only stores the last 10k lines, so if you have been replacing and rebuilding disks the information may be gone.
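The search described above can be scripted if you have several saved log files. This is only a rough sketch: it assumes the logs were exported as plain text, and the function names are made up for illustration. It does not parse the log format at all; it only counts lines containing each phrase, case-insensitively.

```python
from pathlib import Path


def count_phrases(lines, phrases):
    """Count how many lines contain each phrase (case-insensitive)."""
    counts = {phrase: 0 for phrase in phrases}
    for line in lines:
        lowered = line.lower()
        for phrase in phrases:
            if phrase.lower() in lowered:
                counts[phrase] += 1
    return counts


def count_phrases_in_dir(log_dir, phrases):
    """Apply count_phrases to every regular file directly inside log_dir.

    Returns a dict mapping file name to that file's phrase counts.
    """
    results = {}
    for path in sorted(Path(log_dir).iterdir()):
        if path.is_file():
            # errors="replace" keeps the scan going if a log has odd bytes.
            with open(path, errors="replace") as handle:
                results[path.name] = count_phrases(handle, phrases)
    return results
```

For example, calling count_phrases_in_dir on the folder holding your saved logs with the phrases "puncture" and "corrected medium error" gives a per-file count of each. Zero hits for "puncture" only means the word is not in the lines that are still in the log.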

If the virtual disk is punctured, you will need to delete the virtual disk, create and initialize an unlike virtual disk, and then create the desired virtual disk. This process is data destructive, so you will need to back up any important information first.

Re: PERC H700 predictive failure continues with new drive

I saved copies of all the logs before exporting, and just did a search for "puncture" on all the log files, which turned up 0 results. All the unexpected sense errors are on PD 03, which is the drive I replaced twice.

I do agree that there is something wrong with the virtual disk. Do you have any other suggestions before I re-create the virtual disk?

Re: PERC H700 predictive failure continues with new drive

If you would like to upload the controller logs to a text sharing site and provide links, then I or someone else in the community could review them. I think it would just be verifying whether there is a puncture, though.

Re: PERC H700 predictive failure continues with new drive

I don't see a puncture in those logs. It looks like they go back to 11/30/2018. During the rebuild the controller is encountering a lot of bad blocks on the virtual disk. Here is a snippet of what is occurring several times during the rebuild.

I'm decent at reading logs, but I don't know what everything in the log means. We do not have detailed documentation on the logs.

I think 247eb8 is the LBA address on the physical disk. In the ErrLBAOffset, the first LBA is the good or expected LBA. The second LBA is the returned value that is in error. The (1) is the offset value or difference between the two.

What I think is happening is that information is being copied to the drive during the rebuild. During the verification process it is reading physical disk logical block address 247eb8. It is expecting virtual disk LBA 123f38 to be written in that block, but 123f39 is there instead. After the read retries are exhausted it moves on to the next logical block. It tries to copy another virtual disk LBA to the same physical disk LBA and encounters the same error.
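To make the offset arithmetic concrete, here is a small sketch using the values quoted above. It assumes, as I guessed, that the log prints LBAs in hexadecimal and that the offset is simply the returned LBA minus the expected one. The function name is made up for illustration.

```python
def lba_offset(expected_hex, returned_hex):
    """Return the difference between two LBAs given as hex strings."""
    expected = int(expected_hex, 16)
    returned = int(returned_hex, 16)
    return returned - expected


# Expected virtual disk LBA 123f38, but 123f39 was read back.
print(lba_offset("123f38", "123f39"))  # prints 1
```

An offset of 1 would match the (1) shown in ErrLBAOffset, if my reading of that field is right.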

I can't say with certainty whether the virtual disk or the physical disks are at fault, since I'm not seeing the same bad blocks on more than one disk. I didn't check all of the blocks though. I would run diagnostics on the disks. If the disks are shelf spares that have been sitting for a long time, or were received in the same shipment, then they may be faulty.

Re: PERC H700 predictive failure continues with new drive

Daniel, thank you very much for taking the time to examine the logs. The drives were purchased together in the same shipment 4 months ago. I ran the Dell Online Diagnostics disk self test (DST) on the drives. PD 03 failed the DST, but passed the quick test. The other 2 drives in the array passed the DST.

I ordered more Dell drives. I'll try replacing the drive again in a few days and let you know the results.

Re: PERC H700 predictive failure continues with new drive

Thanks for your help, Daniel. I replaced PD 03 for a third time, which resolved the issue. Although I am happy to see the controller not logging errors and PD 03 in a normal state, I'm still not confident the virtual disk is healthy.