New “semi-floating” gate makes for fast, low-power circuitry

As well as a very compact image sensor.


After a long period during which the emphasis had been on building ever-faster computer circuits, things have shifted toward making them more energy-efficient. Some efficiency gains have come through small tweaks to the layout of the individual components, but most have come from changes to the manufacturing process: new materials and ever-smaller features. Unfortunately, we're getting close to the point where shrinking the features of circuits any further will cause the inherent noise of quantum mechanics to start interfering with the chip's operations.

But that doesn't mean an end to potential improvements. A team of Chinese researchers has now described a new structure for the individual gates that control the flow of electrons within chips. Their design, which they're calling a semi-floating gate, switches states in as little as a nanosecond, and it requires very little power to operate.

The gates in electronics share a common design. They have a source of electrons and a drain for them connected by something that can be switched between two states: one that allows the current to flow between the source and drain and one where the current is blocked. Typically, the switch material has been a semiconductor that directly connects the source and the drain. A neighboring bit of material can switch the semiconductor between insulating and conducting, controlling the flow of electrons through the gate.

Flash memory uses a distinctive variant on this called a floating gate. In these structures, the material that bridges the source and sink is electrically isolated from them by a thin layer of insulator—in other words, it floats. This forces the electrons to transit through the gate by tunneling, with the rate of tunneling set by the control wire. A floating gate can stably trap charges, letting it be set in a semi-permanent on or off state, which is why flash can work as a long-term storage solution.

The semi-floating gate is like a hybrid of the two. On the source side, the gate is electrically isolated, forcing electrons to tunnel into the semiconductor that can transfer them to the drain. On the drain side, the semi-floating gate directly contacts the drain, allowing electrons to flow through. The control wiring that sets the state of the semiconductor is also slightly different. In addition to sitting above the semiconductor, it curves around to flank the junction between the semiconductor and the drain, forming a structure called a tunneling field-effect transistor. This provides finer control of the flow of electrons through the gate.

The end result is a device that can store its state, much like flash, but switches much more quickly. Changing between the on and off states took only 1.3 nanoseconds. All the switching also took place within a 3V range and required very little current: less than one microamp. The device was also very stable. Even after 10^12 cycles of writing and erasing, it retained about 90 percent of its original performance. The researchers estimate it would still work out to 10^15 cycles, which handily beats floating-gate performance.
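Taken together, those figures imply a remarkably small energy cost per switch. A quick back-of-the-envelope sketch (the per-switch energy estimate is ours, derived from the reported numbers, not a figure from the paper):

```python
# Rough upper bound on the energy per switching event implied by the
# reported figures (3 V swing, <1 microamp, 1.3 ns switch time).
# Back-of-the-envelope only; real switching current is a transient.
volts = 3.0
amps = 1e-6        # "less than one microamp" taken as the worst case
seconds = 1.3e-9   # measured switching time
energy_joules = volts * amps * seconds
print(energy_joules)  # ~3.9e-15 J, i.e. a few femtojoules per switch
```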

In effect, the authors say that the device has the speed of SRAM, but it only requires a single transistor to provide the equivalent behavior.

It's also remarkably flexible. The authors replaced the control gate with a photosensitive material that built up charge in response to light. The amount of current that flowed through the gate ended up being proportional to the amount of light the device was exposed to, meaning that each one of these gates could act as an incredibly compact photosensor.

The device doesn't hold its state without power, so it's not currently a replacement for flash. But the authors suggest it could eventually stand in for current forms of RAM (both SRAM and DRAM)—and there are obvious applications in digital imaging. The authors also note that by lowering the speed of operations, it's possible to use even less power, which could make the technology useful for mobile applications.

For perspective, what's the switching time of a standard gate? 1ns seems a long time when the clock is 4GHz.

You're right, it is much faster. All the gates in a pipeline stage need to cascade to and stabilize at their next state within a single clock cycle.

However, since this was an initial research prototype, the fact that it's operating much slower than current high-speed ICs isn't necessarily a problem; it's the first time anyone has managed to make it work at all. Also, the 3V operating voltage implies that, like many R&D prototypes, it was made on a much larger process than the current state of the art (obsolete used gear is much cheaper to buy). 3V is the approximate voltage used by the initial Pentium-class processors and SDRAM (what came before DDR1), both of which were 3.3V designs.


Thanks. So, given the older process, any feel for what the switching time would have been then? Figuring the original P5 architecture was clocked ~100 times slower, would this be the equivalent of 10 ps switching today? That seems quick - if it scales.

Current Samsung SRAM requires a minimum of 4ns between presentation of valid data to the part and the disabling of write_enable. Write times are not a direct measurement of actual gate switching times, but are a good first-order indicator. Based on this, 1.5 to 2.5 ns would be my guess.

One trillion operations per gate sounds like a lot, but a modern 4GHz processor could go through that in about 4 minutes. Even assuming some slowdowns for bus access and whatnot, it doesn't seem sturdy enough to operate in the place of DRAM.
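The 4-minute figure is easy to check, assuming the worst case of one write per clock cycle on a single gate:

```python
# Worst case: one gate written on every cycle of a 4 GHz clock,
# against the measured ~10^12-cycle endurance.
cycles = 1e12
clock_hz = 4e9
minutes = cycles / clock_hz / 60
print(minutes)  # ~4.2 minutes
```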

<quote>The gates in electronics share a common design. They have a source of electrons and a drain for them connected by something that can be switched between two states: one that allows the current to flow between the source and drain and one where the current is blocked. Typically, the switch material has been a semiconductor that directly connects the source and the drain. A neighboring bit of material can switch the semiconductor between insulating and conducting, controlling the flow of electrons through the gate.</quote>

Transistor gates control the switching functionality, opening or closing the channel connecting the source and drain. Gates DO NOT have "a source of electrons and a drain for them". The gate IS that "neighboring bit of material [that] can switch the [channel] semiconductor between insulating and conducting."

<quote>In these structures, the material that bridges the source and sink is electrically isolated from them by a thin layer of insulator—in other words, it floats.</quote>

The gate is always electrically insulated from the source, drain, and the channel (otherwise the operating current would flow into the controlling gate). It is called a floating gate because the gate is also electrically insulated from its control line (wire) by an insulator. It is insulated on both the bottom AND on the top. Therefore it has no conductive electrical connection to anything - hence the term floating.

What is typical read cycle time / write cycle time / erase cycle time for standard flash memory, if there is even such a thing? Like for example, what are the numbers for TLC NAND used in Samsung's 840 series?

<quote>One trillion operations per gate sounds like a lot, but a modern 4GHz processor could go through that in about 4 minutes. Even assuming some slowdowns for bus access and whatnot, it doesn't seem sturdy enough to operate in the place of DRAM.</quote>

You don't write through your DRAM constantly though.

I honestly don't have even a fuzzy idea just how much data you'd write to DRAM in a typical day for a normal desktop or even a server, but I'd assume you are not constantly overwriting what is in DRAM.

What... maybe 40-100GB of data per day for a normal workstation? Even a terabyte of data per day is pretty long endurance? Complete guess though. However, I assume the larger the DRAM, the fewer writes you have in the end, as you don't need to flush the DRAM cache to the drive and then load a new program; you have the spare space to just load what you need into memory (and then you don't need to retrieve the stuff you flushed from memory later if you reuse it). A trillion write cycles for 90% performance is pretty darned good. 4GB of memory means roughly 4 zettabytes of writing, or about 4 billion 1-terabyte write cycles for a 4GB DRAM composed of this stuff at 90% performance, or roughly a quadrillion 4GB writes based on their estimated actual endurance.
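A quick sanity check of the arithmetic above (the module size and cycle counts are this thread's guesses, not measured figures):

```python
capacity_bytes = 4 * 2**30   # a 4 GiB module
measured_cycles = 1e12       # ~90% performance retained after this many writes
total_bytes = capacity_bytes * measured_cycles
print(total_bytes / 1e21)    # ~4.3 zettabytes of total writes
print(total_bytes / 1e12)    # ~4.3 billion full 1 TB rewrite passes
```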

I don't believe this stuff was proposed to replace the transistors within the processor itself. It seems way too slow for something like that (though I don't know how transistor switching speed scales with size, so if this was done on a large process, it might switch a lot faster at a smaller size). 4GHz means 0.25ns... which these are about 1/5th the speed of. I think this is more proposed to replace something like DRAM. It would probably have all of the benefits of being able to be manufactured on a traditional CMOS process, but possibly be lower-powered than NAND flash, maybe DRAM (maybe), be super fast (sounds like very ballpark to existing DRAM or faster), and have storage permanence like NAND flash.

Could possibly beat the pants off something like ReRAM and some of the other proposed DRAM/NAND merger techs. At least that is what it is sounding like to me.

<quote>One trillion operations per gate sounds like a lot, but a modern 4GHz processor could go through that in about 4 minutes. Even assuming some slowdowns for bus access and whatnot, it doesn't seem sturdy enough to operate in the place of DRAM.</quote>

Wrong number. 10^15 = 1,000,000,000,000,000

That's one quadrillion, not 1 trillion.

Also, a 4GHz processor has nothing to do with memory access. The CPU is performing 4GHz worth of processing, which often involves parallel internal commands on the various units in the core. However, that is not 4GHz worth of memory accesses. If it were, your SDRAM sticks would explode after about a minute with that many accesses. Your fastest memory is running at about 1000MHz, or 1GHz, at best. And it's still very hot. Again though, that's not quite the way to think of it. If I remember correctly, most IO clocks are between 100 and 200MHz; all the extra MHz for memory has to do with channels. Each individual bit in memory is probably being hit at about 150MHz on average, if my memory is correct.

However, even at 1333MHz, if my memory is wrong...

1,000,000,000,000,000 / 1,333,000,000 hertz = 750,187 seconds of life span
750,187 / 60 = 12,503 minutes of life span
12,503 / 60 = 208.385 hours of life span
Or a little over 8 days of life span.

If my memory is correct, each individual gate would get:
1,000,000,000,000,000 / 100,000,000 hertz = 10,000,000 seconds of life span
10,000,000 / 60 = 166,666.667 minutes of life span
166,666.667 / 60 = 2,777.778 hours of life span
2,777.778 / 24 = 115.75 days of life span

Granted, a bit low, but that's not bad for a prototype. And that all assumes I've remembered my college computer architecture course from 22 years ago correctly - not at all a safe assumption. Either way, the point is, it's a quadrillion uses, not a trillion uses.

EDIT: And note, that's continuous maximum-rate access to that one gate at maximum speed (i.e., not spread over the entire chip). Not sure how that 115.75 days of continuous usage would translate to normal computer operation - probably less than SDRAM, but again, prototype...
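The lifespan arithmetic above, condensed (the clock rates are this post's guesses, not measured figures):

```python
endurance = 1e15            # estimated writes per gate
seconds_per_day = 86400
# One gate hammered on every cycle of a 1333 MHz bus:
worst_days = endurance / 1.333e9 / seconds_per_day
# Per-bit access at the guessed ~100 MHz effective rate:
typical_days = endurance / 1e8 / seconds_per_day
print(round(worst_days, 2), round(typical_days, 2))  # ~8.68 and ~115.74
```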

To be clear, there are parts of memory that are very often written to: things like counters, DMA data coming in from peripherals, etc. Imagine a busy, transaction-heavy database that is serving a million write transactions per second (yes yes, on a super turbo server with a billion cores). Even then, the writes are spread out. The CPU cache does absorb some of that, which is good. On average, each bit of RAM is probably not written that often, but there are very extreme outliers that are constantly being modified.

It is highly unlikely that a single memory location is written to on every memory bus cycle, unless maybe the CPU gets stuck in an infinite loop. So, if we guess that the most commonly written-to areas are accessed at best once every 100 cycles (again, don't forget that CPU cache absorbs a lot of the read/write to super frequently accessed locations), the longevity of a single bit gets better.

As a replacement for NAND, this could be amazing. As DRAM, it can still be amazing, but with further engineering refinement which I am sure will come.

For the most intensive applications in commodity server hardware, the relevant metric for write rates would be the ratio of memory bandwidth to memory capacity. I.e. if you have 32 GB of RAM and can write 10 GB/s, then you can potentially get a full rewrite every 3.2 seconds. At that rate, with 10^15 rewrites you have about 100 million years of lifetime. Even at the measured durability, that's still 100,000 years.

The reason we can divide memory capacity by bandwidth is because if you're using less than full capacity, you're also likely reusing more in cache as well, and the above arguments apply.
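Sketching that calculation (32 GB and 10 GB/s are the illustrative numbers above):

```python
capacity = 32e9            # bytes of RAM
write_bandwidth = 10e9     # bytes/second, sustained
seconds_per_rewrite = capacity / write_bandwidth   # 3.2 s per full pass
seconds_per_year = 365 * 24 * 3600
print(1e15 * seconds_per_rewrite / seconds_per_year)  # ~1e8 years, estimated endurance
print(1e12 * seconds_per_rewrite / seconds_per_year)  # ~1e5 years, measured endurance
```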

Looking over the paper, it looks like neat technology, but I'd bet on some form of MRAM, most likely STT-MRAM, as the next DRAM replacement. MRAM has its own disadvantages, high write latency and write energy in particular, but newer forms of MRAM (OST-MRAM from NYU, for example) look very promising.

EDIT: Since most of the discussion is focused on write endurance, a few points on it.

1) Remember that DRAM isn't accessed strictly at the address level but rather at the page level. This means that without any form of endurance protection, a single dead cell doesn't just kill that cell but a 4K (typical page size) area. Additionally, neither accesses nor individual cell write endurances are uniform. Given a technology with a write endurance of 10^8 writes, over a large multi-gigabyte array there will probably be cells with an order-of-magnitude smaller lifetime.

2) Write endurance in DRAM-replacement technologies is a well-studied problem. One of the current prime candidates for DRAM replacement is Phase Change Memory (PCM), and there is a fair amount of literature on increasing endurance in PCM. Techniques like SAFER, ECP, RDIS, and dynamic page formation, among others, can grant lifetime improvements of several orders of magnitude.

In the old days, 10^14-cycle endurance was considered effectively unlimited endurance (for main memory). Then processors got faster -- 1GHz+ -- and the line moved up to 10^15.

This memory sounds like it is more than just a simple transistor -- more like a compound transistor that would have more than 3 wires -- maybe 4 or 5? And thus it would not be nearly as cheap as flash or DRAM. Just guessing. Maybe a candidate for level-3 CPU cache, speed-wise, but endurance would have to be REALLY good. My guess is it may never find a huge niche as memory, and that's why they are touting it mainly as a photosensor.


OK, this tech looks interesting if the power requirement can be driven down to picowatts per cell while retaining nanosecond access times under 'ordinary' uses. Gigabits of RAM consuming low single-digit milliwatts would be great for untethered/self-powered devices. Using existing fabrication techniques is a huge plus.

"Flash memory uses a distinctive variant on this called a floating gate. In these structures, the material that bridges the source and sink is electrically isolated from them by a thin layer of insulator—in other words, it floats. This forces the electrons to transit through the gate by tunneling, with the rate of tunneling set by the control wire."

i) The material that bridges the source and drain in a floating gate transistor is still semiconducting (usually a p-well or p-doped substrate).

ii) The thing which "floats" is the floating gate itself.

ii-a) In a regular MOSFET, there is a single gate. It is electrically insulated from the transistor's body, but you will still have direct control over the voltage which appears on the gate itself (you can directly tie a wire to the gate). By manipulating this voltage, you can vary the conductivity between the source and drain (via an induced electric field).

ii-b) In a floating gate transistor, there are two gates: the control gate and the floating gate. The floating gate is completely encapsulated by insulator, so it electrically "floats." However, like the singular gate in a regular MOSFET, a charge ("voltage") on the floating gate will alter the electrical conductivity between the source and drain. In NAND flash, charge can be injected into and released from the floating gate by inducing sufficiently high electric fields, which lead to currents that "tunnel" through the insulator (a quantum electrical phenomenon known as Fowler-Nordheim tunneling).

I wonder what benefit this would provide for photo detectors beyond "making an extremely small photo sensor". We're already making sensors so small (~1 micron pixels) that there is really no benefit in shrinking them further. If this technology significantly improves the pixel fill-factor (ratio of active sensing area vs. passive area) then it could be useful, though back-thinning does address this issue already.

That doesn't seem very impressive. Computer DRAM modules already operate at rates upward of 1600MHz, which means there are transistors switching every 0.625ns. Also, a microamp per state change is quite high; that puts it at a minimum of 32.768 milliamps to write a standard 4KB page (something that happens often, at least once a second).

The only real advantage that I see in this technology is that it sounds like the semi-floating gate could offer a smaller transistor size, which can boost density, and it might be able to get away without the constant refresh cycle that DRAM requires.
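For what it's worth, the 32.768 mA figure above is a worst-case bound that assumes every bit in the page draws the full microamp simultaneously:

```python
page_bits = 4 * 1024 * 8     # bits in a 4 KB page
microamp_in_ma = 1e-3        # 1 microamp expressed in milliamps
milliamps = page_bits * microamp_in_ma
print(milliamps)  # 32.768 mA if every bit switches at once
```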

That's the amount of time for the entire DRAM chip to respond to an input, not for a single transistor within it.

That's right... and to answer the first query, leading-edge technologies usually claim sequential logic delays on the order of sub-50 picoseconds. Here, delay = the time taken for a 0 or 1 to propagate through a logic gate (such as an inverter).

But as others have stated already, as a memory element, it's the time required to store data that matters and a nanosecond is not bad at all for something experimental such as this.

<quote>For the most intensive applications in commodity server hardware, the relevant metric for write rates would be the ratio of memory bandwidth to memory capacity.</quote>

Memory accesses have very high degree of both temporal and spatial locality. The active working set and stack of busy processes would get the majority of the write traffic (though caches should absorb a lot of it), while code and data of inactive tasks remain mostly static. An OS could try performing some kind of wear leveling when allocating physical memory pages, but long-running processes would defeat it, especially on servers.

<quote>That doesn't seem very impressive. Computer DRAM modules already operate at rates upward of 1600MHz, which means there are transistors switching every 0.625ns. Also, a microamp per state change is quite high; that puts it at a minimum of 32.768 milliamps to write a standard 4KB page (something that happens often, at least once a second).

The only real advantage that I see in this technology is that it sounds like the semi-floating gate could offer a smaller transistor size, which can boost density, and it might be able to get away without the constant refresh cycle that DRAM requires.</quote>

Modern DRAM actually runs at about 200MHz; it's only the external interface that runs at 1600MHz.

The reason for this is that bandwidth is frequency times bit-width. The internal data paths are much wider than the external traces that connect the DRAM to the CPU. To compensate, they increase the external frequency.
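One way to see it (the bus widths here are illustrative, based on DDR3's 8n prefetch: eight 64-bit words fetched per internal cycle):

```python
# Bandwidth = frequency x bit-width: a wide, slow internal array can
# exactly feed a narrow, fast external interface.
internal_bw = 200e6 * 512     # 200 MHz core, 512-bit internal fetch (8 x 64)
external_bw = 1600e6 * 64     # 1600 MHz interface on a 64-bit bus
print(internal_bw == external_bw)  # True: same bits per second
```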