Mean Time Between Failures: Can it help predict hard drive failure?

Here at Kroll Ontrack, we’re well aware that data loss can affect anyone. For many of us, it comes in the form of hard disk drive (HDD) failure – an umbrella term for mechanical, electronic and logical defects that render the information stored therein unreadable. There are dozens of possible causes for this type of malfunction, ranging from logical software errors to physical damage and overheating – and, of course, the fact that all storage devices have a finite lifespan.

You might be acquainted with some of the tell-tale signs that a hard drive is on its last legs. Strange noises, for example – if your HDD shifts from whirring and clicking to grinding and thrashing, it’s a safe bet that it’s about to give up the ghost. In addition, slow access times, frequent crashes and abnormal behaviour – such as corrupted data and vanishing files – are reliable indicators of hard drive failure.

Unfortunately, these aren’t what you’d call scientific metrics for detecting an HDD malfunction. And while it’s one thing to listen out for odd sounds emanating from your laptop or tower, it’s another to apply the same methodology to a redundant array of independent disks (RAID) environment in a remote data centre.

So how can consumer and business users alike predict when their hard drives are about to fail? Well, their first port of call might be to check the manufacturers’ estimates of their storage device lifespans, usually provided in the form of a mean time between failures (MTBF) rating.

However, common as this benchmark is, it’s important to bear in mind that the given reading might not be as transparent and reassuring as it first appears.

What is mean time between failures (MTBF)?

In theory, an MTBF rating is pretty much what it sounds like – the average period of time between one inherent failure and the next in a single component’s lifespan. So, if a machine or part malfunctions and is afterwards repaired, its MTBF figure is the number of hours it can be expected to run as normal before it breaks down again.

With consumer hard drives, it’s not uncommon to see MTBFs of around 300,000 hours. That’s 12,500 days, or a little over 34 years. Meanwhile, enterprise-grade HDDs advertise MTBFs of up to 1.5 million hours, which is 62,500 days, or roughly 171 years. Impressive stuff!
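To check the arithmetic for yourself, here is a minimal Python sketch of the conversion. The function name is our own invention for illustration, and we assume a 365-day year of non-stop operation, so treat the output as a rough guide rather than a precise figure.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # assumption: leap years ignored for simplicity


def mtbf_to_days_and_years(mtbf_hours):
    """Convert an MTBF rating in hours to days and years of continuous running."""
    days = mtbf_hours / HOURS_PER_DAY
    years = days / DAYS_PER_YEAR
    return days, years


# The two ratings mentioned in the article
for label, hours in [("consumer", 300_000), ("enterprise", 1_500_000)]:
    days, years = mtbf_to_days_and_years(hours)
    print(f"{label}: {hours:,} h = {days:,.0f} days = {years:.1f} years")

# consumer: 300,000 h = 12,500 days = 34.2 years
# enterprise: 1,500,000 h = 62,500 days = 171.2 years
```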

It should be plain to see that these figures are misleading, and that they’re a far cry from our real-world expectations of hard drive longevity and reliability. That’s not because there’s a problem with the MTBF metric per se – far from being a marketing buzzword, it has a long and distinguished lineage in military and aerospace engineering. But realistically, no hard drive manufacturer has been testing its enterprise HDDs since the mid-19th century. Instead, the figures are derived from failure rates in statistically significant numbers of drives running for weeks or months at a time, not from the devices’ average lifespan in the field.
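To see how a short test on many drives can yield such a large number, consider the textbook estimate of MTBF: total accumulated drive-hours divided by the number of failures observed. The sketch below uses that simple estimator with made-up test figures. It is not a description of any particular manufacturer’s procedure, and the function name and inputs are hypothetical.

```python
def population_mtbf(num_drives, test_hours, failures):
    """Estimate MTBF as accumulated drive-hours divided by observed failures.

    Simplifying assumption: every drive is counted as running for the
    full test period, including any that failed along the way.
    """
    if failures <= 0:
        raise ValueError("at least one failure is needed for this estimate")
    return num_drives * test_hours / failures


# Hypothetical test: 1,000 drives run for 1,000 hours (about six weeks)
# with a single failure observed.
estimate = population_mtbf(num_drives=1_000, test_hours=1_000, failures=1)
print(f"Estimated MTBF: {estimate:,.0f} hours")
# Estimated MTBF: 1,000,000 hours
```

In this invented example, no individual drive ran for more than six weeks, yet the resulting rating is a million hours. The figure describes the failure rate across the population during the test window, not how long any one drive will last.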

Correspondingly, studies have demonstrated that MTBFs typically promise much lower failure rates than actually occur in real-world performance. In 2007, researchers at Carnegie Mellon University investigated a sample of 100,000 HDDs with manufacturer-provided MTBF ranges of one million to 1.5 million hours. This translates to a nominal annual failure rate (AFR) of at most 0.88 per cent, but their study found that AFRs in the field “typically exceed one per cent, with two to four per cent common and up to 13 per cent observed in some systems”.
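The 0.88 per cent figure can be reproduced by dividing the number of hours in a year (8,760) by the MTBF. The Python sketch below shows that simple ratio alongside a common alternative that assumes a constant failure rate (an exponential model). Both function names are ours, and the exponential model is an assumption added for comparison rather than something taken from the study.

```python
import math

HOURS_PER_YEAR = 8760  # 24 hours x 365 days, assuming round-the-clock use


def afr_simple(mtbf_hours):
    """Approximate AFR as hours per year divided by MTBF."""
    return HOURS_PER_YEAR / mtbf_hours


def afr_exponential(mtbf_hours):
    """AFR under an assumed constant failure rate (exponential model)."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


for mtbf in (1_000_000, 1_500_000):
    print(
        f"MTBF {mtbf:,} h: simple {afr_simple(mtbf):.2%}, "
        f"exponential {afr_exponential(mtbf):.2%}"
    )

# MTBF 1,000,000 h: simple 0.88%, exponential 0.87%
# MTBF 1,500,000 h: simple 0.58%, exponential 0.58%
```

Either way, the datasheet ratings imply that fewer than one drive in a hundred should fail in a given year, which is well below the replacement rates the researchers reported from the field.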