I have two of these. Both were running for about 50K hours, and both are dead:

HDD #1: SMART status was failed (high read error rate), there were about 2000 reallocated sectors in the GLIST, and the RAID controller sometimes reported that there were no spare sectors left to reallocate bad blocks. After a low-level (?) format with the "sg_format" utility, there were only 10 defects in the GLIST and SMART status changed to "OK". But the drive still occasionally reports read errors and the GLIST is growing (this drive has the ARRE bit set, so it automatically reallocates sectors that are almost dead but can still be read; there are now 378 of them). The number of 'read errors corrected with delay' is also growing. It looks completely dead and beyond repair. Please correct me if I'm wrong and there are magic procedures in the firmware that might bring it back to life (selfscan? but I have no idea how to start it).
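For reference, the ARRE bit mentioned above sits in the SCSI Read-Write Error Recovery mode page (page code 01h), next to the read retry count. The following is a minimal sketch of decoding that page in Python, assuming the raw page bytes have already been obtained (for example from a mode sense hex dump) with the mode parameter header and block descriptors stripped. The sample bytes and the function name are made up for illustration and are not taken from either drive.

```python
# Sketch: decode the start of the Read-Write Error Recovery mode page (01h).
# Bit layout of byte 2 as defined for this page in the SCSI block command set.
FLAG_BITS = {
    "AWRE": 7,  # automatic write reallocation enabled
    "ARRE": 6,  # automatic read reallocation enabled
    "TB": 5,
    "RC": 4,
    "EER": 3,
    "PER": 2,
    "DTE": 1,
    "DCR": 0,
}


def parse_rw_error_recovery(page: bytes) -> dict:
    """Return the flag bits and read retry count from a raw 01h mode page."""
    # Low 6 bits of byte 0 hold the page code; the top bit is the PS flag.
    if len(page) < 4 or (page[0] & 0x3F) != 0x01:
        raise ValueError("not a Read-Write Error Recovery mode page")
    info = {name: bool((page[2] >> bit) & 1) for name, bit in FLAG_BITS.items()}
    info["read_retry_count"] = page[3]
    return info


# Hypothetical page: PS set, length 0Ah, AWRE+ARRE set, read retry count 1.
sample = bytes([0x81, 0x0A, 0xC0, 0x01, 0, 0, 0, 0, 0x05, 0, 0xFF, 0xFF])

if __name__ == "__main__":
    print(parse_rw_error_recovery(sample))
```

With ARRE set, a sector that needed heavy recovery on read is remapped by the drive itself, which is why the GLIST can keep growing without any host-side reassign commands.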

HDD #2: this one is much more interesting. It had about 1000 reallocated sectors and a failed SMART status: "DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH [asc=5d, ascq=32]".

After formatting it with sg_format, there are no defects in the GLIST, and the number of errors corrected with delays doesn't change. The SMART error was not cleared. But after a few hours of reads and writes, 8 sectors (two groups of 4 consecutive sectors; it seems that physical sectors in this drive are 4K) could not be read. After overwriting these sectors, the read errors were gone and the GLIST is still empty, so these were 'soft' bad-blocks.

Currently this drive works perfectly well, and even when I decreased its 'Read Retry Count' to 1 (which limits the sector read time to 60 ms, per the Savvio 15K.2 manual), there are no read errors. It also passes short and long self-tests without any issues. But SMART status has not changed: "DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH [asc=5d, ascq=32]"
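For reference, the text of this status follows the naming pattern of the ASC 5Dh ("failure prediction threshold exceeded") entries in the standard SCSI ASC/ASCQ list: the high nibble of the ASCQ names the subsystem and the low nibble names the reason. Below is a minimal Python sketch of that decoding. The tables are reproduced from memory of the public list and should be checked against it, and the function name is hypothetical. It only decodes the name; it says nothing about what the "data channel" physically is.

```python
# Sketch: decode the ASCQ that accompanies ASC 5Dh.
SUBSYSTEMS = {
    0x1: "HARDWARE",
    0x2: "CONTROLLER",
    0x3: "DATA CHANNEL",
    0x4: "SERVO",
    0x5: "SPINDLE",
    0x6: "FIRMWARE",
}

REASONS = {
    0x0: "GENERAL HARD DRIVE FAILURE",
    0x1: "DRIVE ERROR RATE TOO HIGH",
    0x2: "DATA ERROR RATE TOO HIGH",
    0x3: "SEEK ERROR RATE TOO HIGH",
    0x4: "TOO MANY BLOCK REASSIGNS",
    0x5: "ACCESS TIMES TOO HIGH",
    0x6: "START UNIT TIMES TOO HIGH",
    0x7: "CHANNEL PARAMETRICS",
    0x8: "CONTROLLER DETECTED",
    0x9: "THROUGHPUT PERFORMANCE",
    0xA: "SEEK TIME PERFORMANCE",
    0xB: "SPIN-UP RETRY COUNT",
    0xC: "DRIVE CALIBRATION RETRY COUNT",
}


def decode_asc_5d(ascq: int) -> str:
    """Return the descriptive text for ASC 5Dh with the given ASCQ."""
    if ascq == 0x00:
        return "FAILURE PREDICTION THRESHOLD EXCEEDED"
    subsystem = SUBSYSTEMS.get(ascq >> 4)   # high nibble
    reason = REASONS.get(ascq & 0x0F)       # low nibble
    if subsystem is None or reason is None:
        return "unknown ASCQ 0x%02x for ASC 0x5d" % ascq
    return "%s IMPENDING FAILURE %s" % (subsystem, reason)


if __name__ == "__main__":
    print(decode_asc_5d(0x32))
```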

I have no idea which "data channel" it refers to. If this is the SAS channel, I would assume a controller failure (the drive is now running on another controller and I have no information about the controller it was running on before). Or is it an internal HDD interface? Maybe those 'soft' bad-blocks were caused by this mysterious error rate? Does it make sense to replace the PCB on this drive (both drives have the same boards, so I know where I can get one working PCB)?

Based on a photo I found on the Internet, there is a Marvell 88i8062-BHC2 controller and a 4 Mbit SPI flash on the PCB. Is it enough to move only the SPI flash chip from the old PCB, or does this Marvell controller have its own flash? Both drives are running the same firmware, but I'm concerned about adaptives and other unique stuff.

How could I reset that SMART error after replacing the PCB?

PS: I know that it looks unreasonable to spend time repairing this drive, but this is rather a hobby. I want to bring this drive back to life even if it takes me more time than I would spend earning the money to buy a bunch of similar working drives.

Your platter surfaces are developing bad sectors. Think of this like old chipping paint on a house. It'll only continue to get worse until you remove the loose paint and re-paint the house. Your data is literally stored in paint coated on the platters. That paint is degrading. Unless you have multi-million dollar equipment for coating hard drive platters at your disposal (to repaint them, as it were), they will just continue to degrade.

What you are asking is like asking for the cure to aging. Best you're going to get is some suggestions on how to mask symptoms for a bit. The underlying problem isn't going away.

I agree that the first one can't be repaired due to faulty platters and/or heads (perhaps only 1 surface/head out of 4?).

But the platters of the second HDD seem to be OK after I did 'sg_format':
- no defects in G-list
- no read delays anymore
- no read errors with RRC=1 (60 ms per sector limit)
- the drive passes long self test and multiple 'badblocks' scans without errors and remaps

But it still has the "DATA CHANNEL IMPENDING FAILURE DATA ERROR RATE TOO HIGH" SMART error, and 2 groups of 4 soft bad-blocks appeared once. I can't reproduce the soft bad-blocks anymore, but it looks like this is a problem with the main chip (or write heads?), so I want to replace the PCB.

But I'm not sure whether it is enough to move only the 8-pin 4 Mbit SPI flash from the original PCB. Perhaps the Marvell chip also has its own flash?

Data channel in SAS/SCSI sense codes, or End-to-end error in SATA, means that at some point between the platter and the host, a checksum did not match. It may be an in-drive cache issue or a cabling issue. The simplest thing to try from common troubleshooting is to swap the suspect cable with a different drive and see if the error follows the cable.

This drive was running with another SAS controller, backplane, and cable when the SMART status changed to failed. I swapped the PCBs and formatted both drives with sg_format, but nothing changed. The error was not cleared automatically.

Well... I did some research with HDD #1, which I disassembled recently:
- Heads and platters were probably good (based on statistics of the LBA numbers associated with read errors fetched from SCSI logs).
- The spindle motor and its bearing were OK (windings had good L and Q, and the hydrodynamic bearing had minimal wear for the 53,000,000,000 revolutions it did during its life).
- But the actuator bearing was worn out, and I believe this was the root cause of failure.

This explains why these disks were having phantom read errors and soft bad-blocks.
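As a quick sanity check on the revolution count quoted above, here is a small Python sketch. It assumes the spindle ran at the nominal 15,000 rpm of a 15K-class drive for its whole powered-on life; the function names are just for illustration.

```python
# Sketch: relate powered-on hours to spindle revolutions at nominal speed.
RPM = 15_000  # nominal spindle speed of a 15K drive (assumed constant)


def revolutions(hours: float) -> float:
    """Total spindle revolutions after the given number of hours."""
    return hours * 60 * RPM


def hours_for(revs: float) -> float:
    """Hours of rotation needed to accumulate the given revolutions."""
    return revs / (60 * RPM)


if __name__ == "__main__":
    print(revolutions(50_000))         # revolutions in 50K hours
    print(hours_for(53_000_000_000))   # hours behind 53e9 revolutions
```

At this speed, 50K hours corresponds to 45,000,000,000 revolutions, and 53,000,000,000 revolutions corresponds to a little under 59,000 hours, so the figure is in line with "about 50K hours" of service.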

BTW: I cleared the HDD #2 SCSI logs, but the 'data rate' error still persists. It seems to be latched and will never be cleared by the drive itself.

Anyway, there is not much sense in resetting this error on a drive with a bad actuator bearing assembly. Despite its perfect platters and heads, there is very little chance that I can do something about the actuator.

I did several iterations of 'sg_format' followed by a 'badblocks' check and reproduced the soft bad-blocks several times (neither random nor sequential zero-fill caused bad-blocks, but sg_format did).

Assuming an actuator bearing failure (I disassembled the first HDD, which had similar symptoms, and its actuator bearing was definitely bad), I ran an end-to-end seek test for more than 10,000,000 seeks. To be honest, I expected that this drive wouldn't survive such a high load, so I then ran sg_format followed by badblocks in a loop to check how it works, and... I didn't see any G-list defects, unrecoverable errors, or even read errors recovered with delays anymore. The read curve is smooth and the seek time diagram looks healthy. It also passed a 2-hour SeaChest random read test.
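The "sg_format followed by badblocks in a loop" procedure above could be scripted roughly as follows. This is only a sketch under stated assumptions: the device paths /dev/sg2 and /dev/sdb are hypothetical, the exact options used in the original runs are not known (here sg_format gets --format and badblocks gets -sv for a read-only scan with progress), and the script defaults to a dry run because formatting destroys all data on the drive.

```python
# Sketch: repeat "format, then surface scan" a number of times.
import subprocess


def build_cycle(sg_dev: str, blk_dev: str) -> list:
    """Commands for one cycle: SCSI FORMAT UNIT, then a read-only scan."""
    return [
        ["sg_format", "--format", sg_dev],  # assumed options, destructive
        ["badblocks", "-sv", blk_dev],      # assumed options, read-only test
    ]


def run_cycles(sg_dev: str, blk_dev: str, cycles: int, dry_run: bool = True) -> list:
    """Run (or, in dry-run mode, just print) the commands; return them all."""
    issued = []
    for _ in range(cycles):
        for cmd in build_cycle(sg_dev, blk_dev):
            if dry_run:
                print(" ".join(cmd))
            else:
                # Raises CalledProcessError if a tool exits with an error.
                subprocess.run(cmd, check=True)
            issued.append(cmd)
    return issued


if __name__ == "__main__":
    # Hypothetical device names; dry run only prints the commands.
    run_cycles("/dev/sg2", "/dev/sdb", cycles=2)
```

Checking the grown defect list and the error counter log between cycles (for example with smartctl) would show whether a given format pass produced new remaps.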

The [asc=5d, ascq=32] error is still there, but everything else works surprisingly well. It looks like the Savvio doesn't support a UART terminal, because there is no activity on the pin which is supposed to be the drive's TX output.

I would appreciate it if somebody could help me with enabling the UART terminal or clearing the SMART error by any other means.

Perhaps the UART port needs to be enabled in some way? Maybe with jumpers? It stands to reason that the PCB would be manufactured with blank flash, in which case there would need to be some way to program it.

As for adaptives, the firmware update file (B62C_4F0.LOD) contains the text string, "SER NUM MISMATCH", which would imply that the flash does indeed contain adaptive data.

All pins are routed on the PCB and the supposed pinout is the following (I started counting pins from the SAS data connector):
1 - routed to the main (probably) chip, has a 10K pull-up to +2.5V. Only a low-side clamp diode is present, so this pin is supposed to be a 3.3 or 5-volt tolerant RX. It is also connected to pin 4 of the 10-pin board edge connector (the one with a notch in the middle, near the heads connector).
2 - routed directly to the main chip, no pull-up and only a low-side clamp diode is present. Supposed to be 2.5V UART TX. Also connected to pin 3 of the board edge connector.
3 - GND
4 - routed to the main chip through a 200 Ohm series resistor, has a 10K pull-up to +2.5V. As with the other pins, there is no high-side clamp diode for this input (?) inside the main chip. Maybe I should short this pin to GND and see what changes, but I haven't tried this yet.

I monitored pin #2 (as well as the other 2 pins) and it is always high. No activity at all. It looks like the UART is physically present, but terminal commands are either disabled or not implemented in the firmware. I'll try adding a jumper between pin 4 and GND, but probably no earlier than this weekend.
