NAND flash gets baked, lives longer

An article published in the December 2012 edition of IEEE Spectrum highlights an interesting and potentially useful discovery by ROM manufacturer Macronix. Researchers there have discovered that applying heat to NAND flash cells can drastically extend their life, thus overcoming one of the biggest problems with the solid-state storage technology.

NAND flash is used everywhere, from smartphones to SSDs to thumb drives, and we've written extensively before on how it works. The technology's biggest failing is that NAND flash only lives so long. Every time the flash cells are erased, they retain some residual charge; eventually, so much charge has accumulated that changing a cell's charge level takes too long for the cells to remain useful as a storage medium.

As NAND flash grows denser, it gets more delicate; in our look at the future of flash, we discuss the decreasing lifetimes of NAND flash and the potential alternatives. SSDs rely on complex mathematical gymnastics at the controller level to reduce writes and hence lengthen the life of their flash cells, but the need for those kinds of workarounds could be substantially lessened by the Macronix discovery.
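
The simplest of those controller tricks is wear leveling: spreading erases evenly across the chip so no one block dies early. Here is a minimal sketch of that idea in Python; the class and names are purely illustrative, not any vendor's actual firmware logic:

```python
# Minimal wear-leveling sketch: always erase/program the block with the
# fewest program/erase (P/E) cycles, so wear spreads evenly across the chip.
# Illustrative only -- real controllers also track hot/cold data, bad blocks, etc.

class FlashChip:
    def __init__(self, num_blocks, max_pe_cycles):
        self.pe_counts = [0] * num_blocks
        self.max_pe_cycles = max_pe_cycles

    def pick_block(self):
        # Choose the least-worn block that still has life left.
        candidates = [i for i, c in enumerate(self.pe_counts)
                      if c < self.max_pe_cycles]
        if not candidates:
            raise RuntimeError("all blocks worn out")
        return min(candidates, key=lambda i: self.pe_counts[i])

    def program(self):
        block = self.pick_block()
        self.pe_counts[block] += 1
        return block

chip = FlashChip(num_blocks=4, max_pe_cycles=1000)
for _ in range(8):
    chip.program()
print(chip.pe_counts)  # wear is spread evenly: [2, 2, 2, 2]
```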

The smaller the flash cells get, the less residual charge it takes before they need to be marked as bad.

Aurich Lawson

It's long been known that annealing NAND flash—that is, subjecting it to high heat—can force the long-trapped electrons out of the NAND floating gate, reducing its retained charge and returning it to usefulness. But it's been thought all along that such annealing was too energy-intensive and too difficult to do precisely—essentially, an entire NAND chip had to be baked for hours.

However, using techniques borrowed from phase-change RAM, where heat is applied to a material to change its state from conductive to insulating, the Macronix boffins constructed a redesigned NAND flash package with its existing electrical pathways modified to carry heat to the floating gate, the portion of the NAND transistor that is filled and drained to denote a 0 or a 1.

The modification is a complex one and required substantial engineering, but the results are impressive—a brief and restricted jolt at 800C appears to "heal" the flash cell, removing its retained charge. Macronix estimates that this can be done repeatedly as needed, leading to a flash cell that could potentially last for 100,000,000 cycles, instead of the roughly 1,000 cycles that current 21nm TLC flash cells are rated to last.
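
To put that cycle-count difference in perspective, here is a rough lifetime calculation. The 1,000 and 100,000,000 cycle figures come from the article; the drive capacity and daily write volume are illustrative assumptions:

```python
# Back-of-the-envelope endurance comparison using the cycle counts from the
# article: ~1,000 P/E cycles for 21nm TLC vs ~100,000,000 with annealing.
# Capacity and write rate below are assumptions, and the model is naive
# (it ignores write amplification, overprovisioning, and wear-leveling overhead).

capacity_gb = 256        # assumed drive size
writes_per_day_gb = 50   # assumed daily host writes

def lifetime_years(pe_cycles):
    total_writable_gb = capacity_gb * pe_cycles
    return total_writable_gb / writes_per_day_gb / 365

print(round(lifetime_years(1_000), 1))     # ~14 years at the TLC rating
print(round(lifetime_years(100_000_000)))  # over a million years with annealing
```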

Since flash cell life cycle decreases as process size shrinks, this method of heating cells back to life is good news for the future of SSDs. Moore's law charges on; the International Technology Roadmap for Semiconductors projects an eventual arrival at 8nm features, and the useful life of NAND flash at that size is very, very short. If Macronix's method can be commercialized it will have profound implications on the future of the medium.

So far, there is no word on when the discovery will actually be transformed into a usable product, and there's also no word on what impact it might have on the packaging and design of the rest of the device in which it's used. The IEEE article notes that the 800C hot spot is "restricted to the area near the [floating] gate," and that the heating cycle doesn't have to be run very often, so overall phone or laptop battery life won't be affected by the technology's addition. We'll have to wait and see what the tech actually looks like and if it comes with added heat sinks or other SSD changes.

800 degrees C is quite a bit. Even if you only do it periodically, it's an additional risk. Beyond the mechanics just to achieve this in a commercial drive, would this turn your SSD into a fire hazard? Imagine if some bit controlling the temperature of the drive fails and it just stays at 800 degrees until the whole thing fails catastrophically. It's probably going to be a while until we see practical applications.

A sensible approach might be to heat each cell individually. I don't think you have to heat the whole drive at once.

Heat can be concentrated, just like it can be spread out. 100C in an 8-square-inch area does increase to 800C when reduced to one square inch.

[pedantry]Wouldn't that only be true if you were measuring in Kelvin, since Celsius is a relative measure of temperature? (i.e. 200K is twice as much heat as 100K, but 200C is not twice as much as 100C)[/pedantry]

OrangeCream wrote:

Couple the heat sink from the CPU to the NAND, plus a resistive element to generate additional heat?

You'd need a mechanism to decouple them, since the heating only needs to be applied occasionally and I'd guess that doing so constantly could have some bad consequences for the NAND, which would necessitate quite a bit of engineering. Also, as others have pointed out, regardless of where the heat is coming from, applying 800C to anything inside a laptop seems like a really bad idea, unless there's some serious insulation around the SSD to prevent the heat from frying anything else. In a desktop, this seems possible, given how much space is available, but in an ultrabook or tablet…

I don't see this helping flash in the long run - we are going to see a release of MRAM in 2013 and with its near DRAM speed and being non-volatile I think the days of SSDs being made of flash are just about over. Hopefully Intel is designing Broadwell to take advantage of the much faster storage medium as that could really boost sales.

Considering your oven gets to maybe 800F at best if you hack the self-cleaning cycle and you need 800C, ummm yeah no. Not to mention the REST of the SSD would melt.

Depends on what they mean by "area near the floating gate." 800 C is quite hot, but if it's a tiny enough area being heated, it's not all that much energy. The GCMS in the lab my dad used to manage had a standing argon plasma at ~18,000 F. It was roughly a two-inch sphere of dim purple, and it was open to the air on two sides of the device. You could hold your hand quite close to it, and barely feel warm.

If they raised the entire chip to 800 C, it would absolutely be a problem. But I can certainly envision a scheme where individual cells are targeted for heating.

Sort of like these guys creating steam from ice by only heating the water directly adjacent to a nanoparticle - without melting the bulk of the ice.

800C sounds like a lot, but the volume of material being heated to that temperature is incredibly small, and as a result the total energy involved is really small. Think of what happens if you boil 1 mL of water, then pour it into a 2L jug of cool water. The temperature of the jug will barely increase at all, because there's just so little energy in the 1 mL that's actually boiling.

They only need to heat the floating gate to drive trapped electrons off of it, and bear in mind that the gate length on modern NAND chips is around 20nm. The gate isn't perfectly square, of course, but it doesn't take a lot of energy to heat a few hundred cubic nanometers to 800C -- you will not have to worry about your SSD bursting into flames.

As to how the gates will be heated, it will be straight up resistive heating, just like the electric coils on your stovetop. Essentially the same thing is done in phase change memory (though that only needs to be heated to around 600C) -- drive a lot of current through a small area to heat it up, then turn the current off and it will cool down extremely quickly as all that energy spreads out throughout the device.

The real concern here is going to be the increase in die size required to accommodate these heating structures. NAND is basically a commodity, so losing usable active area on a die is always painful for a flash manufacturer (most customers don't really care what fancy features you have in your NAND -- they care how much it costs per bit). If the cell lifetime really can be increased by several orders of magnitude then this will probably make its way into most NAND flash chips -- if not, then we may see some compute/enterprise parts redesigned to work this way, but they'll likely be much more expensive than your garden variety NAND flash (partly because of the loss of die area, partly because the chips which end up falling short of enterprise-grade will have a much lower profit margin when they're resold into applications with lower requirements).
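
The "total energy is tiny" argument above is easy to check with a back-of-the-envelope estimate. The silicon constants below are standard textbook values; the assumption that roughly a 20 nm cube gets heated is mine, based on the gate length mentioned:

```python
# Rough estimate of the energy needed to heat one ~20 nm gate region to 800 C.
# Material constants are standard values for crystalline silicon; the 20 nm
# cube is an assumption, so treat the result as an order of magnitude only.

rho_si = 2330.0   # silicon density, kg/m^3
c_si = 700.0      # silicon specific heat, J/(kg*K)

side = 20e-9                  # 20 nm cube edge, m
volume = side ** 3            # m^3
mass = rho_si * volume        # kg
delta_t = 800 - 25            # K, room temperature up to 800 C

energy_j = mass * c_si * delta_t
print(f"{energy_j:.2e} J")    # on the order of 1e-14 J -- tens of femtojoules

# For scale: heating 1 mL of water from 25 C to 100 C takes roughly 314 J,
# about sixteen orders of magnitude more energy.
print(f"{0.001 * 4186 * 75:.0f} J")
```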

If this tech turns out viable, then things like 4 bit cells, which would not have otherwise been possible because of the drastically lower life compared to 3 bit cells, could become a reality.

Apparently this was the dumbest question in the history of mankind and intelligent life in general, so now it's gone and you have an excellent opportunity to down-vote this post for an even better reason: being irrelevant.

Why doesn't that cause thermal expansion issues? Suddenly heating a tiny spot by that much sounds like trouble to me.

Certainly thermal stress will be introduced. However, at the length scales of these features (tens of nm), silicon crystal is VERY strong. You'll locally stress the bonds of the crystal, which will be doing their damndest to pull the heat away from that tiny spot. As soon as you turn off the heat source, the crystal will pull the heat away and equilibrate quite quickly. Assuming you're only heating a small volume at a time (~20 nm on a side) and you aren't doing every cell at once, there won't be an appreciable size increase of the bulk, and locally the crystal can handle it.
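
"Equilibrate quite quickly" can be ballparked with the standard thermal-diffusion timescale t ~ L²/α. The diffusivity is a textbook value for silicon; the 20 nm feature size is an assumption from the thread:

```python
# Characteristic thermal-diffusion time over a feature of size L is roughly
# t ~ L^2 / alpha. This is an order-of-magnitude estimate, nothing more.

alpha_si = 8.8e-5   # thermal diffusivity of silicon at room temp, m^2/s
length = 20e-9      # assumed ~20 nm feature size, m

t_diffusion = length ** 2 / alpha_si
print(f"{t_diffusion:.1e} s")  # a few picoseconds -- the hot spot dies almost instantly
```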

This isn't exactly new... Thermal annealing tends to fix damage in crystalline semiconductors. While it can improve the lifetime of the cell, it will also most likely erase any data stored in the cells by thermally exciting carriers out of the floating gates as bit errors do follow an Arrhenius relationship, while induced defects create tunneling pathways from the floating gate to the substrate. Quantum mechanical tunneling and thermal escape are competing failure modes.

Since heating releases the electrons, wouldn't it cause all cells to read 0 and effectively erase all your data?

What I think is needed is some way to leave a spare flash chip, move 1 chip worth of data to that chip, heat the previous chip, and move the data back. This could be the new manual TRIM operation that you kick off periodically. (or would it be BAKE?)

This would probably be used after erasing an entire block, to return those cells to something closer to fresh performance. A clever flash controller would schedule a heat cycle to run on aging blocks when the chip in question has some downtime, to minimize the impact to the user.
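
That scheduling idea can be sketched in a few lines. Everything here is hypothetical: the Block model, the threshold value, and especially the assumption that an anneal fully resets wear (real firmware would be far more complicated):

```python
# Sketch of folding an anneal step into block management: after erasing a
# block whose P/E count crosses a threshold, run a heat cycle during idle
# time before returning it to the free pool. Entirely hypothetical model.

ANNEAL_THRESHOLD = 900   # assumed P/E count at which a block counts as "aging"

class Block:
    def __init__(self):
        self.pe_count = 0

    def erase(self):
        self.pe_count += 1

    def anneal(self):
        # Model the heat cycle as fully restoring the cell (an assumption).
        self.pe_count = 0

def after_erase(block, idle):
    """Called by the controller after each erase."""
    block.erase()
    if block.pe_count >= ANNEAL_THRESHOLD and idle:
        block.anneal()
        return "annealed"
    return "ok"

blk = Block()
blk.pe_count = 899
print(after_erase(blk, idle=True))  # "annealed": wear counter back to zero
```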

Some of these drives are good for something like 1TB/day for 5 years. You only need to apply this when the NAND starts to degrade.
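
Quick arithmetic on what a "1 TB/day for 5 years" rating implies per cell; the drive capacity is an assumption, and the calculation ignores write amplification:

```python
# How many full-drive rewrites (roughly, P/E cycles per cell under perfect
# wear leveling) does 1 TB/day for 5 years imply? Capacity is an assumption.

capacity_tb = 0.512   # assumed 512 GB drive
tb_per_day = 1.0
years = 5

total_writes_tb = tb_per_day * 365 * years
implied_pe_cycles = total_writes_tb / capacity_tb
print(round(implied_pe_cycles))   # roughly 3,500 full-drive rewrites
```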

A thermal cutoff switch near the outside of the case would solve the runaway-heating problem pretty easily and reliably.

One would assume that the erasing is on a bit-level or worst-case on a block level. You'd probably want to move data out of adjacent cells too. However, that's all stuff that can easily be incorporated into a controller that's smart about annealing cells or blocks.

If things can be erased as often as the paper's author seems to think, such an operation may be as common as a TRIM command.

++

Wouldn't be a "go bake the drive" operation. Would happen without user knowledge or intervention at a tiny level spread across the life of the drive when the controller finds a bit that is failing to write. The whole drive does not need to be heated to 800C. Just the bit that's failing.

Some basic failsafes to keep it from catching fire if you end up with runaway input and done deal.

Not true. The Celsius scale is neither relative nor logarithmic. The zero point of the Celsius scale is always at 273.15 K. You may be confused by the conversion from Fahrenheit to Celsius, where 1° F = 5/9° C, effectively making the Fahrenheit scale need "more degrees" to represent the same temperature span (i.e. +50° C to -50° C roughly equals +120° F to -60° F).

Is it possible that the thermal stress around the gate could degrade the gate oxide layers to the point where they become leaky after repeated heating/cooling cycles? In the worst case, I imagine the opposite problem could arise: the cell could have trouble actually holding onto charge, which would degrade how long the flash can hold data in the powered-down state. This is probably a much larger issue (if it is one) as feature sizes shrink, too.

I just read the article + posts. Why is there an endless number of "put the whole thing in the oven" + "solutions" to the oven problem when the actual article already states the heat is specifically applied to the gate area only because heating the whole chip would be stupid?

I want one of 3 things: a reading comprehension test before you can post; a -50 down vote button that can be clicked repeatedly; a pony.

I don't think Voynix was confused, that 273.15K offset is important. As Voynix originally pointed out, 200C is not twice the heat of 100C, but 200K is twice the heat of 100K (assuming unit thermal mass in all cases, yada yada yada). Take the extreme case, 2C is not twice the heat of 1C, but that is more obvious when I say 275K is not twice the heat of 274K.
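
The point about absolute scales can be made concrete in a couple of lines:

```python
# Temperature ratios are only meaningful on an absolute scale like Kelvin,
# because the Celsius zero point is offset by 273.15 K.

def c_to_k(celsius):
    return celsius + 273.15

# 200 K really is twice the absolute temperature of 100 K...
print(200 / 100)                   # 2.0
# ...but "200 C is twice 100 C" fails once converted to absolute terms:
print(c_to_k(200) / c_to_k(100))   # about 1.27, not 2
```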

I doubt failsafes are even an issue, because the circuit would most likely fail open-circuit before it ever drew enough power to smoke. The heating relies on tiny conductors, so they would burn out if they were transferring too much power; you'd just have bad memory addresses at best and a nonfunctioning memory chip at worst.

One could reasonably assume that these cells are being marked as "bad" and therefore are not being used, or that the data in them is already considered corrupt and not of value.

The way I see it, things like thumb drives, SD cards, and the flash used in devices like phones and other consumer devices will probably remain the same.

However, things like SSD drives will probably see the application of this kind of technology, to drive more SSD usage in the Enterprise space. For example, you'd be insane to put a busy database on an SSD right now, because you'd wear it out too fast. Extending the lifespan by five orders of magnitude allows one to seriously consider using SSDs in more applications.

I could see how the increased reliability and lifetime would also push more SSD adoption into the desktop and laptop space.

I don't think reliability itself is the driver. What I see is that increasing the reliability allows manufacturers to create more marginal designs to make SSDs cheaper, which is what would push SSDs into laptops and desktops.

Lee Hutchinson / Lee is the Senior Reviews Editor at Ars and is responsible for the product news and reviews section. He also knows stuff about enterprise storage, security, and manned space flight. Lee is based in Houston, TX.