We recently shone a light on the somewhat concerning Intel SSD endurance limitations in the M.2 600p SSD review. The 600p switches into a locked read-only mode when the SSD surpasses the endurance threshold. Intel clarified the process for recovering data after its SSDs enter into the locked state, but we also want to clear up some common misconceptions about endurance ratings, which SSD vendors spec with the somewhat misleading "TBW" (Terabytes Written) measurement.

Let's start with the immediate issue of the Intel 600p's hard endurance limitation. From the review:

How Intel's consumer SSDs expire once you surpass the endurance threshold is troubling. In an almost over-zealous move to protect user data, Intel instituted a feature on many of its existing SSDs that automatically switches it to a read-only mode once you surpass the endurance threshold (measured via the MWI SMART attribute). Surprisingly, the read-only state only lasts for a single boot cycle. After reboot, the SSD "locks" itself (which means you cannot access the data) to protect the user from any data loss due to the weakened flash. The operating system typically generates error notifications when an SSD switches into a read-only mode, so most users will restart without being aware that the SSD will be inaccessible upon the next reboot. The process to recover the data is unclear.

Intel designed its IMFT 3D TLC NAND-powered 600p SSDs specifically for the low end of the market, and it has only 72TB of endurance, which pales in comparison to value offerings from other SSD vendors. For instance, the recently announced 960 EVO (which also uses 3D TLC NAND) offers 400TB of endurance with its 1TB model. Most users will not reach the Intel-imposed 72TB limit, but we do know that as a general rule, the flash will outlive its endurance specifications. There have been many reports of SSDs outlasting the endurance specification to the tune of hundreds of TBs (or several PBs) before the user actually experiences data loss, but Intel is the only SSD vendor that institutes a hard endurance limit.

Intel's hard endurance limit has been a staple in both its client and enterprise products for several years. We explained in the review that although we knew the read-only locking procedure applied to past products, we had not received confirmation from Intel that the new low-endurance models also employ the same technique. The locking feature really hasn't been an issue in the past, but due to the 600p's low endurance rating, a casual user certainly has a much higher chance of encountering the odd situation.

We finally received an official response from Intel that confirms that the feature is active on the 600p series, and the MWI SMART value (more on that shortly) triggers it. The company also outlined the data recovery process after the endurance expires:

Under typical client usage, a user will not wear out the endurance of the drive before reaching end of warranty period. For NVMe SSDs, the Percentage Used SMART info is the end user indicator for when drive is reaching its write endurance EOL. If SSD reaches Percentage Used value of 100, then the drive has reached the planned life of the media, and the user should replace drive. Another quality metric of the drive is available spare. If available spare area drops below threshold, which is very untypical during warranty period and write endurance of drive, then the user will also be warned via the SMART information that drive is in critical state. If user continues to use the drive, it will reach a point that it will be forced into read only mode. The user can then place drive in a system that only requires reading from the drive and recover data before replacement. (emphasis added)

Most users will know that the SSD has entered into the read-only state because the operating system will begin generating error messages. The OS generates the error messages because it has to be able to write data to the drive in order to function (there are always myriad processes, such as logging, that occur in the background).

Intel's process for copying the data from a read-only SSD involves simply installing the drive as a secondary volume (non-OS) in a computer. The operating system will not lock up if the secondary drive does not accept incoming write data, so the user is free to copy the data to another drive.

The process to copy the data is simple, but Intel designed the 600p series for casual users, and most non-technical users will never know that the drive has entered a read-only state. Successive reboot attempts will not resolve the issue, and many users may conclude that the drive has died, taking the data with it.

We feel that Intel should be more forthcoming with the end of life process and educate users in standard documentation. As it stands, we cannot find any direct Intel support or reference materials that outline the end of life process. The hard endurance limit is present on all Intel SSDs, and users should be aware that the issue exists so that they can remedy the situation if they reach the limitation.

Why The TBW Rating Is Misleading

Another interesting facet of the endurance conversation revolves around the widely used, but somewhat misleading, TBW measurement. Simply put, SSD vendors provide a TBW rating to indicate how many terabytes of data a user can write to the SSD before it expires.

SSD endurance is a tricky subject. Unfortunately, most users rely upon the "host writes" measurement (which calculates how much data the host has written to the SSD) as an indicator of how much endurance they have used, and how much remains. Most of our readers comment that they have "only" written XX amount of data to their SSD in X amount of years, but that is not an accurate indicator of the used, or remaining, endurance.

Copying a 1GB file does not always mean that the SSD actually writes only 1GB of data. In fact, unless the SSD uses compression technology (which is very rare after the slow and silent death of SandForce controllers), the SSD will normally write more data than the host computer sends to the storage device. This “write amplification” is due to internal SSD processes. Write amplification is widely documented but often misunderstood. The amount of write amplification varies between SSD vendors, controllers, and firmware implementations, but it usually falls into the 2X to 3X range. This means that a 1GB file transfer can result in as much as 3GB of data written to the NAND (and possibly even more).
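As a back-of-the-envelope illustration (a sketch with hypothetical numbers, not a measurement of any particular drive), the relationship is a simple multiplication:

```python
# Sketch: estimating total NAND writes from host writes and a write
# amplification factor (WAF). The 2X-3X range is the typical span cited
# above; real values depend on the controller, firmware, and workload.

def nand_writes_gb(host_writes_gb: float, waf: float) -> float:
    """Data actually written to the flash for a given amount of host writes."""
    return host_writes_gb * waf

# A 1GB file transfer at a WAF of 3 lands as 3GB of wear on the NAND.
print(nand_writes_gb(1.0, 3.0))  # 3.0
```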

The SSD also constantly juggles data internally due to static data rotation and garbage collection routines, so there is a constant stream of wear inside the SSD even if the user is not actively writing data to the drive. This wear is beyond the user's control. Some SSDs have aggressive garbage collection routines, which increase the amount of internal wear compared to other SSDs with more conservative algorithms.

Intel, like all other SSD vendors, uses the MWI (Media Wearout Indicator) SMART value to determine how much life the SSD has left. The wearout indicator is not based on the amount of data that the user writes to the drive. Rather, the MWI measures what percentage of the finite program/erase cycles the SSD has consumed. The MWI indicator takes into account all of the "unseen" writes that constantly sap endurance in the background, including the non-"host write" variety. Intel's SSDs enter the read-only state based solely upon the MWI indicator, which it referred to as the "Percentage Used" value in its official statement.

The media wearout indicator is not a one-to-one measurement of the endurance in relation to the amount of data written to the drive. This uncomfortable fact means that you cannot judge how much endurance you need by the amount of data that you write to the drive, or even by the TBW (Terabytes Written) value that many common utilities provide.

SSD vendors spec endurance with the TBW metric, but for most purposes it is merely a general guideline; older SSDs in particular tend to suffer much more of the "unseen" wear that only the MWI counter measures.

The MWI counter is the only true indicator of remaining endurance with all SSDs.

We advise readers to refer to the MWI counter to accurately gauge their current data usage patterns so they can make an informed decision before they purchase their next SSD.
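As a rough sketch of that advice (the numbers are hypothetical, and real wear is not perfectly linear, so treat this as an estimate rather than a guarantee), the MWI can be projected forward like this:

```python
# Sketch: projecting total host writes at end of life from the MWI
# ("Percentage Used" on NVMe drives) instead of the TBW spec.
# Hypothetical numbers; wear is not perfectly linear in practice.

def projected_total_host_writes_tb(host_writes_tb: float,
                                   percent_used: float) -> float:
    """Linear projection of total host writes when media wear hits 100%."""
    if percent_used <= 0:
        raise ValueError("no wear recorded yet; cannot project")
    return host_writes_tb * 100.0 / percent_used

# Example: 20TB of host writes has already consumed 40% of the media,
# so this drive projects to roughly 50TB of total host writes.
print(projected_total_host_writes_tb(20.0, 40.0))  # 50.0
```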

What do people typically use to read back MWI values, and is it a scaled 0 to 100 type reading?

kinney

Very informative article.

Intel didn't really address anything about the low TBW, especially for the 1TB drive. I think the real answer is that there was little overprovisioning done on the 600P drives and this MWI information makes the 72TB endurance limit even worse than at first glance. If Intel provided that info for the article, they did their 72TBW drives a disservice.

I can't see, as someone interested in a 1TB NVMe M.2 drive, how I wouldn't spend the extra $120 for the 960 EVO over the 600P. Even with the 3-year vs 5-year warranty, the 72TBW vs 400TBW difference is huge and is the difference between "I'm concerned about doing extra writes" and "within reason, I'm going to use this pretty much however I please," since the warranty ends at the TBW limit or the time period, whichever comes first.

Both the 600P and 960 EVO drives are great everyman's products though, and I'm not stepping up further unless I was doing video/3D work all day every day, and then I'd step up to something with a hefty heatsink on it to prevent throttling. For a professional use case, I'd look at the Intel 750 series before I moved to the 960 Pro class.

tigerwild

And I have officially banned purchasing any Intel SSDs from here on out. I will not recommend them to anyone, either.

Kewlx25

My HD stopped working and crashed Windows. Time to reboot. All my data is gone and the drive is read-only. Looks like a duck, quacks like a duck. If I'm going to lose all of my data, it shouldn't be because of some arbitrary artificial limitation. I understand this feature can be good for RAID, but not for non-RAID, and especially not for OS drives.

Nintendork

Would be neat if it stayed as read only, but locking itself?

mavikt

Why is Intel talking about the warranty period in the endurance context? Do they use the warranty period as a reference to calm people that it will last at least that long, or do they replace it if you reach the endurance max within that time?

It would be nice if Windows would run without any errors on locked media. The drive would effectively be a time capsule. I think I tried Linux on a CD many years ago, no problems. Also, you'd have the ability yourself to write lock/unlock the drive at your own discretion. Wouldn't that provide a nice security feature, perhaps in a Windows UAC fashion.

So many thoughts...

This feature should be optional. Most consumers don't do checksums or anything, so maybe it's better to have the drive turn read-only rather than start having silent data corruption.

PaulAlcorn

512712 said:

Why is Intel talking about the warranty period in the endurance context? Do they use the warranty period as a reference to calm people that it will last at least that long, or do they replace it if you reach the endurance max within that time?
...
So many thoughts...

The warranty is based solely upon the MWI counter. Once it has expired, so has your warranty.

PaulAlcorn

1841500 said:

What do people typically use to read back MWI values, and is it a scaled 0 to 100 type reading?

It is a 100 to 0 measurement, with 100 being a brand new SSD. You can use a SMART value reader (such as Crystal Disk Info) to read the MWI. However, it isn't always the same attribute for each SSD. The Intel MWI counter is attribute E9.
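For reference, here is a minimal sketch of pulling that attribute out of smartctl's text output (the sample line is hypothetical, trimmed to smartctl's usual column layout, and other tools may label the attribute differently):

```python
# Sketch: parsing the Intel Media_Wearout_Indicator (attribute 233,
# hex E9) from smartctl -A text output. The sample line is hypothetical.

def parse_mwi(smartctl_output: str) -> int:
    """Return the normalized MWI value (100 = brand new, 0 = worn out)."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "233":
            return int(fields[3])  # the normalized "VALUE" column
    raise ValueError("attribute 233 not found")

sample = ("233 Media_Wearout_Indicator 0x0032   097   097   000    "
          "Old_age   Always       -       0")
print(parse_mwi(sample))  # 97
```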

computer-guy

No more INTEL SSDs for me. You know, Intel, that 160GB drive firmware bug that bit me a couple of times was bad, and now this BS. Goodbye, Intel, I am sad to see you go...

mark0718

It isn't reasonable to base warranty on hard-to-determine internal functions of a device. The lifetime of the warranty should be based on host writes.

If Intel wants to use the MWI values to help them with quality control, that is fine. If Intel wants the user to use MWI as an indicator of when the user should start thinking about replacing the device, that is fine, but the warranty should still cover any failing devices within the stated lifetime.

Should the user be denied a warranty replacement for the device if the firmware doesn't handle some usage case and gets an MWI of 300? What about if the firmware goes batty and just writes for no reason?

Finally, of course you should be able to read the data by setting a mode someplace that says: do the best you can, and, for a particular read, say this is good or this is the best guess, and do operation X for the raw data including error correction code.

kinney

Hm, lot of Intel SSD haters. I hope people realize that Samsung has a worse reliability record than Intel here. Intel>Crucial/Micron>Samsung is the pecking order on quality SSDs. They've all had problems, but in order of quality/reliability superiority that's how it comes out for those who have been paying attention. It is true that they don't always offer the best value, but that's another discussion.

rhysiam

7272 said:

Hm, lot of Intel SSD haters. I hope people realize that Samsung has a worse reliability record than Intel here. Intel>Crucial/Micron>Samsung is the pecking order on quality SSDs. They've all had problems, but in order of quality/reliability superiority that's how it comes out for those who have been paying attention. It is true that they don't always offer the best value, but that's another discussion.

Speaking for myself, and I suspect many others who've commented above, the issue is not so much about reliability as it is about Intel effectively placing a hard limit on the life of this drive.

Let's say I buy a new car that has a 5 year, 150,000km warranty. Once I cross the 150,000km mark, I understand there's an ever increasing risk of a major component of that car failing. That's fine, I understand the manufacturer can only accept wear & tear liability to a certain point. I also understand that if I decide to quit my job and become a full time Uber driver, I'm going to rack up those kilometres and exceed my warranty really quickly. Again, that's fine, I'm an adult, I understand those risks and can make an informed decision about whether I choose to keep running the car and deal with the costs of failure myself, or replace it.

What Intel has done is effectively set a hard limit on the life of their drive such that the moment the endurance point of the warranty period is reached, it will no longer operate at all and must be replaced... the consumer who paid for and owns the drive has absolutely no say in the matter. As this article spells out it's actually slightly worse than setting a hard kilometre (or miles) driven limit, because SSDs themselves effectively clock up writes internally with their own internal maintenance tasks. At least with a car and km (/m) limit, the owner is in complete control of how far they choose to drive.

Now Intel would likely argue that my analogy is flawed because where a car might break down, the risk for the SSD is that exceeding the endurance rating could result in in users losing critical data. So, the argument goes, their "feature" is implemented to protect users. That's a feeble argument on a number of levels:

Even if you somehow cling to the argument that this "feature" really is in place to protect non-technical users, there is absolutely no excuse for not providing more technical users with the ability to turn it off and use the drive they bought until it actually fails... rather than just reaches an arbitrary cap that Intel has put in place.

So for me, this is nothing to do with reliability. It's about Intel hard-coding the death of my drive at a point when they've determined that I've used it enough and should get a new one, and then trying to justify it with weak arguments which do not hold up under scrutiny.

panathas

Intel believes that this way it's going to increase its future SSD sales. That's why it has such a low price on the 600p. It's like Intel is saying: your SSD is dead, get a new one, they are dirt cheap. But how many of those locked-drive customers are going to buy an Intel SSD again? I think Intel is losing future customers with that bad strategy. What's next for Intel? Maybe they should put a counter on their CPUs, telling us we can use our CPUs for an x amount of time (hours/days) and then after that amount expires, boom, they are locked. Get a new one, it's time for an upgrade. Are they serious?

tsnor

The author makes a good point about educating consumers so that they know to (1) buy a new drive and (2) copy the data off the old drive.

The Intel SSD Toolbox is where you read the wearout indication. Intel could also start giving messages every 10 mins when they are within a few days of locking.

The MWI is not a linear measure on many drives. We use 8 SSDs in a RAID configuration where each of the 8 drives gets exactly the same IO load. We often see SSDs fail. One of the common failure modes is that the MWI on one of the drives drops from 99 left to 40 left or 28 left or some other really low number while the other 7 drives are still at 99. At that point you need to replace the drive ASAP -- it's going soon. I've always assumed the MWI was looking at the number of bad flash cells and that the large jumps in MWI were the result of chunks of SSD failing, but that's pure guess.

Aside: the comment about "no compression now that sandforce is gone" is not true. Some of the enterprise all flash arrays use compression. IBM has "IBM® Real-time Compression™" Hitachi has "VSP F series flash-only storage arrays feature custom flash modules of up to 6.4 TB raw capacity with on-board data compression."

PaulAlcorn

200136 said:

Aside: the comment about "no compression now that sandforce is gone" is not true. Some of the enterprise all flash arrays use compression. IBM has "IBM® Real-time Compression™" Hitachi has "VSP F series flash-only storage arrays feature custom flash modules of up to 6.4 TB raw capacity with on-board data compression."

That is correct, I was referring more to SSD controller-level compression, as opposed to software/system level deduplication and compression, which are, of course, wonderful tech.

tripleX

This is more about Intel protecting its enterprise margins than it is about protecting user data. Intel has had problems with datacenters using consumer gear, and not paying the enterprise tax. This same SSD will come to market as an enterprise product with a 3X endurance rating and 3X the cost.

PaulAlcorn

200136 said:

The MWI is not a linear measure on many drives. We use 8 SSDs in a RAID configuration where each of the 8 drives gets exactly the same IO load. We often see SSDs fail. One of the common failure modes is that the MWI on one of the drives drops from 99 left to 40 left or 28 left or some other really low number while the other 7 drives are still at 99. At that point you need to replace the drive ASAP -- it's going soon. I've always assumed the MWI was looking at the number of bad flash cells and that the large jumps in MWI were the result of chunks of SSD failing, but that's pure guess.

This can be from a number of things. Some software-based RAID implementations allow users to select a parity drive, which is then hammered more than the other drives in the array.

It sounds like you are using more of a traditional hard RAID, though, but LSI and Adaptec controllers are also known to selectively hammer one drive, or set of drives, more than others, even within the same contiguous array. Some of this is due to the RAID coding, but it can also be due to application hotspots and other phenomena. I occasionally review HBAs and RAID controllers (since the first 6Gb/s adaptors came to market), and have measured wear on a 24-drive RAID 10 set with Micron P400m drives (outstanding SSDs, btw). I confirmed that some drives received more wear than others, and LSI reps have confirmed that this is an issue in some circumstances. They have tried to address it through firmware, etc., as the SSD RAID age came to fruition, but most users are still using older gear due to maintenance contracts. I've always found it best to leave an unaddressed portion of the array to serve as extra OP for the SSDs; while it doesn't address the issue directly, it can help to improve general performance and endurance. Of course, the economics aren't great, but since I'm not running a production environment that isn't as much of a concern. There are some rather scattered articles around the net, mostly on Linux user forums, that explore the uneven wear issue in some detail, but I'm not aware of a one-size-fits-all solution. The wear issue is also an issue with HDDs, as some will go through way more load/unload cycles due to the concentrated wear on certain LBA spans.

tsnor

1920539 said:

200136 said:

Aside: the comment about "no compression now that sandforce is gone" is not true. Some of the enterprise all flash arrays use compression. IBM has "IBM® Real-time Compression™" Hitachi has "VSP F series flash-only storage arrays feature custom flash modules of up to 6.4 TB raw capacity with on-board data compression."

That is correct, I was referring more to SSD controller-level compression, as opposed to software/system level deduplication and compression, which are, of course, wonderful tech.

Double check where Hitachi and IBM are doing the compression.

PaulAlcorn

200136 said:

1920539 said:

200136 said:

Aside: the comment about "no compression now that sandforce is gone" is not true. Some of the enterprise all flash arrays use compression. IBM has "IBM® Real-time Compression™" Hitachi has "VSP F series flash-only storage arrays feature custom flash modules of up to 6.4 TB raw capacity with on-board data compression."

That is correct, I was referring more to SSD controller-level compression, as opposed to software/system level deduplication and compression, which are, of course, wonderful tech.

Double check where Hitachi and IBM are doing the compression.

I was under the impression that Hitachi does compression with secondary FPGA, making it a system level approach. Here is an article I wrote on it.

The part covering FPGA offload is buried a bit in the VSP G Series yada yada, but here it is for reference:

Quote:

As with many of the HDS core technologies, the company provides non-disruptive and transparent tiering services by employing powerful FPGAs to offload the associated processing overhead. Many of the storage vendors leverage the x86 platform to perform compute-intensive processing tasks, such as inline deduplication, in a gambit to reduce cost. In contrast, FPGAs can process more instructions per cycle, which boosts efficiency and allows the company to perform compute-intensive tasks in a non-disruptive fashion.
HDS leverages the intrinsic benefits of FPGAs for many of its features and employs sophisticated QoS mechanisms to eliminate front-end I/O and latency overhead. The industry is beginning to migrate back to FPGAs for some compute-intensive tasks, and Intel is even working to bring FPGAs on-die with the CPU as it grapples with the expiration of Moore's law.
HDS has extensive experience with FPGA-based designs and views the recent resurgence as a validation of its long-running commitment to the architecture, which it infused into its product and software stacks. It will be interesting to see how HDS, and others, adapt to the tighter integration as Intel moves to fuse FPGAs onto the CPU.

If memory serves correctly, IBM uses the same offloaded approach. These are in the 'controller' as defined by the node, but I am unsure if it is direct compression on the controller inside the final storage device (i.e., HDD/SSD).

bit_user

200136 said:

The MWI is not a linear measure on many drives. We use 8 SSDs in a RAID configuration where each of the 8 drives gets exactly the same IO load. We often see SSDs fail. One of the common failure modes is that the MWI on one of the drives drops from 99 left to 40 left or 28 left or some other really low number while the other 7 drives are still at 99. At that point you need to replace the drive ASAP -- it's going soon. I've always assumed the MWI was looking at the number of bad flash cells and that the large jumps in MWI were the result of chunks of SSD failing, but that's pure guess.

If MWI is a measure of spare capacity (which is essentially the same thing as counting bad blocks), then it makes sense that it would drop at a non-linear rate. How abruptly it falls is primarily a function of the variance of block endurance.

If endurance is extremely consistent, across all blocks, and the SSD does a superb job of leveling wear, then it would basically be a cliff. On the other hand, if the SSD does a relatively poor job of wear-leveling and there's a relatively wide distribution of block endurance levels, then MWI should drop somewhat linearly (although you'd get less life out of the drive, assuming mean endurance and overprovisioning were the same, in both cases).
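That intuition is easy to sketch with a toy simulation (hypothetical block counts and endurance distributions, assuming perfectly even wear leveling):

```python
# Toy model: fraction of blocks still alive (a stand-in for spare
# capacity / MWI) after spreading P/E cycles evenly across all blocks.
# Distributions are hypothetical; real NAND behavior is messier.
import random

def spare_fraction(cycles: float, endurances: list[float]) -> float:
    """Fraction of blocks whose endurance exceeds the applied cycles."""
    return sum(1 for e in endurances if e > cycles) / len(endurances)

random.seed(0)
tight = [random.gauss(3000, 50) for _ in range(1000)]   # consistent blocks
wide = [random.gauss(3000, 800) for _ in range(1000)]   # wide variance

# Just short of the mean endurance, the tight population is nearly
# intact (the cliff is still ahead), while the wide one has been
# shedding blocks gradually all along.
print(spare_fraction(2900, tight), spare_fraction(2900, wide))
```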

Ricardotheanonymous

I read the additional and fixed text. So, as a conclusion, 72TB wasn't a threshold for the read-only feature, right? I feel that the endurance part of the 600p review is completely misleading. There is no more evidence shown to judge the 600p as a low-endurance SSD, especially for the high-capacity models. I hope Toms will ask Intel, "what does TBW show in the 600p's spec?" I predict the answer is "it's for reference about endurance, but does not show actual endurance for each model".

computer-guy

Here is another problem: it stands to reason that there is some kind of software code that controls all this, and by its nature, code has bugs. So who is to say that the drive will not die because of a code-induced bug? So Intel is wasting their time on this. People will be less likely to buy the drives, there is a danger that the code will brick the drive, and they had to spend money on the code to begin with and now will have to maintain it. Intel used to be a great company, but slowly they are eroding their user base. Intel, get your house in order.

bit_user

1461603 said:

Here is another problem: it stands to reason that there is some kind of software code that controls all this, and by its nature, code has bugs. So who is to say that the drive will not die because of a code-induced bug?

All drives, in recent history - whether mechanical or solid state - have a large amount of embedded firmware. In the past, Intel has used 3rd party controllers and differentiated only with custom firmware (and their own NAND).

Just look at the firmware updates released for any given SSD, and check the release notes for a sense of the kind of bugs that occur.

It goes without saying that the only way to ensure your data is safe is to back up anything you care about. Or you could just put it on the cloud, in the first place, but that has its own disadvantages & I'm old school.

P.S. I think Paul recently wrote an article on a move to dis-intermediate the storage from the host, and basically let the host OS do the work of wear leveling, garbage collection, etc. I don't know whether that'll come to the consumer space, but it'll be a while.