Dealing with Flipped Bits: From the Weak Write Test Mode Saga

In Circuits and Systems I, Prof. Emad informed the class that sign matters. A student requesting partial credit on an exam question because she only got the sign wrong would receive no mercy. “What’s the difference between +5 Volts and −5 Volts?” Prof. Emad pointedly asked. “10 Volts.”

Ever since Ben Franklin mistakenly set the direction of electron travel, there has been some confusion in the electronics classroom. Yet while switching signs on an exam only costs you points, in the practice of electrical engineering there are consequences. Some can be devastating, like injury to a worker. Some cost money, like damage to a piece of measurement equipment. This is a story about switched signs, finding them out, and coping with the consequences.

The P54CS microprocessor consisted of 3,300,000 transistors, and its embedded memories included five sizeable memories that traditionally received the Data Retention Test (DRT). In the DRT, a 500 millisecond pause occurred between writing a value into an SRAM cell and reading it back; if the read failed, a defect was found. With two bit-cell values to check, “1” and “0,” that came to 1 second. Reducing that by an order of magnitude was our goal with the new test methodology we dubbed Weak Write Test Mode (WWTM). The DRT could be run in parallel on all five memories, and we designed the WWTM implementation so it too could test all five memories in parallel. The difference: we would weakly write a “0” onto a “1” state and then check whether the bit cell still read a “1,” followed by weakly writing a “1” onto a “0” state.

In preparing for the silicon bring-up I investigated how to validate that the new method would work. As naturally occurring defects wouldn’t be frequent enough to validate the new test method, I pursued paths to deliberately create defects. While FIBing a defect (editing the circuit with a Focused Ion Beam) could check out the methodology, it would only be cost effective on a couple of parts and in only a couple of SRAM cells. My “it’s alive” moment proved that the method worked. Still, it could not guarantee that the implementation in all five memory designs worked. A different design engineer was responsible for each memory design. Five embedded memories, five circuit designers, five layout designers, five opportunities to implement it wrong. To provide full validation, we decided to deliberately create defects that would exercise both failure modes in all five memories. This would be done with a defective mask set: a non-trivial cost to manufacture the masks (1-2), and it required well-controlled wafer manufacture, as these wafers would be for engineering purposes only.

The test program development could have had five different product engineers as well, yet it didn’t. Initially we had one product engineer, Bao Nguyen, responsible for implementing the WWTM code on all five memories. However, for one memory, the Trace Cache, Rama Pedarla asked to write the WWTM test code himself. He already had ownership of all the other tests associated with the Trace Cache.

I vaguely recall how the tester code was organized; I do know that the WWTM control was a separate piece of code that would perform the test on all five memories at the same time. So I’m going to guess that the code looked like this:

Write a 1 to every cell in each SRAM
    SRAM A: Data Cache
    SRAM B: Instruction Cache
    SRAM C:
    SRAM D:
    SRAM E: Trace Cache
Weak write a 0
Read every cell in each SRAM and check for a 1
    SRAM A: Data Cache
    SRAM B: Instruction Cache
    SRAM C:
    SRAM D:
    SRAM E: Trace Cache
Write a 0 to every cell in each SRAM
    SRAM A: Data Cache
    SRAM B: Instruction Cache
    SRAM C:
    SRAM D:
    SRAM E: Trace Cache
Weak write a 1
Read every cell in each SRAM and check for a 0
    SRAM A: Data Cache
    SRAM B: Instruction Cache
    SRAM C:
    SRAM D:
    SRAM E: Trace Cache

While each memory was read and written separately, the Weak Write was done to all memories at once, just like the pause in the Data Retention Test (DRT), with a key difference. In the DRT, nothing was done during that step: it was a pause. In the WWTM test an action was taken, weakly writing the opposite state.
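The flow above can be sketched as a small simulation. This is a hypothetical model, not the actual tester code: it assumes a defective ("weak") cell loses its state under a weak write of the opposite value, while a healthy cell resists it, which is the premise of WWTM.

```python
# Minimal sketch of one WWTM pass, assuming weak (defective) cells
# flip under a weak write and healthy cells do not.

def weak_write(cells, weak_set, value):
    """Weakly write `value`; only the defective (weak) cells flip."""
    for i in weak_set:
        cells[i] = value

def wwtm_pass(num_cells, weak_set, stored, weak_value):
    cells = [stored] * num_cells             # strong write `stored` everywhere
    weak_write(cells, weak_set, weak_value)  # weakly write the opposite state
    # Read back: any cell no longer holding `stored` is flagged as weak.
    return [i for i in range(num_cells) if cells[i] != stored]

# One memory with deliberately created defects at addresses 3 and 7:
fails_after_weak_0 = wwtm_pass(16, {3, 7}, stored=1, weak_value=0)
fails_after_weak_1 = wwtm_pass(16, {3, 7}, stored=0, weak_value=1)
print(fails_after_weak_0)  # [3, 7]
print(fails_after_weak_1)  # [3, 7]
```

Because the defect locations are known in advance, the read-back list can be checked directly against them, which is exactly what the defect mask set made possible.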

Eventually the engineering wafers created with the defect masks arrived and we could validate the implementation on all five memories. A curious thing occurred, though. Four of the five memories behaved as expected, but the Trace Cache did not: it never failed. What made this odd was that all the SRAM cells were equivalent design-wise. While the defective cells were placed in different parts of each memory, i.e. different locations and addresses, they all should have behaved the same in each embedded memory. That is, a weak write of 1 should result in some failed cells, and because we knew the locations we could run an SRAM diagnostic test to confirm them. The same with a weak write of 0.

Rama owned the test program for the Trace Cache, and the hunt was on for why it didn’t work. He knew the Trace Cache design very well, as he had implemented all the tests for it. He dug in, probed, and experimented. Eventually he determined that if he did a weak write of 0 after writing a 0 he got fails, and if he did a weak write of 1 after writing a 1 he got fails. Aha! The bit columns had been flipped. His persistence paid off. The design owner, Joseph Y., had incorrectly connected the WWTM circuit to his bit columns: Bit to Bit-bar and Bit-bar to Bit. So while all the other SRAMs were weakly writing a 0, the Trace Cache was weakly writing a 1, i.e. reinforcing the current SRAM cell state instead of opposing it. Sign mattered and Joseph had reversed it; there were consequences, and now it was a matter of how it would be fixed.
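The effect of the crossed Bit/Bit-bar wires can be shown with a tiny sketch (again a hypothetical model, not the real circuit): with the lines swapped, requesting a weak 0 applies a weak 1, so the weak write reinforces the stored state and no cell ever fails, until you request the same value as the stored state.

```python
# Sketch of the swapped Bit/Bit-bar connection: the weak write applies
# the COMPLEMENT of the requested value to the weak (defective) cells.

def weak_write_swapped(cells, weak_set, requested_value):
    actual = 1 - requested_value   # Bit and Bit-bar are crossed
    for i in weak_set:
        cells[i] = actual

# As the test intended: store 1, request a weak 0 -> actually a weak 1,
# which reinforces the stored state, so no fails are ever seen.
cells = [1] * 8
weak_write_swapped(cells, {2, 5}, requested_value=0)
print([i for i, c in enumerate(cells) if c != 1])  # []

# Rama's finding: request the SAME value as the stored state, so the
# swap turns it into a weak write of the opposite state, exposing defects.
cells = [1] * 8
weak_write_swapped(cells, {2, 5}, requested_value=1)
print([i for i, c in enumerate(cells) if c != 1])  # [2, 5]
```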

An outcome of first-silicon debug and validation is a list of design fixes, so I asked if the WWTM circuit could be correctly connected. Joseph declined, as we could cope with the design error by modifying the tester code. As unhappy as I was with this response, since it meant I wouldn’t get the full order-of-magnitude reduction, we had a path to properly implement the test. The test program had been constructed in such a way that we couldn’t change just the Trace Cache code. We had to run the whole test sequence twice: in the first run, mask the Trace Cache results; in the second run, mask the results of the other four SRAMs. The cost: the test took a little over 200 ms to test all five SRAMs instead of a little over 100 ms. Twice the estimated test cost, yet still a 5X reduction compared to the Data Retention Test.
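The trade-off works out in a few lines of arithmetic. The 500 ms pause and the roughly 100 ms and 200 ms figures come from the text; treating them as exact per-pass durations is an assumption for the sake of the sketch.

```python
# Back-of-the-envelope test-time comparison (durations from the text).
DRT_TIME_MS = 2 * 500     # 500 ms pause per data value, two values = 1 s
WWTM_PASS_MS = 100        # ~100 ms to weak-write-test all five SRAMs at once

planned_ms = WWTM_PASS_MS      # one pass, all five memories in parallel
actual_ms = 2 * WWTM_PASS_MS   # run twice: pass 1 masks Trace Cache,
                               # pass 2 masks the other four SRAMs

print(DRT_TIME_MS / planned_ms)  # 10.0 -- the hoped-for 10X reduction
print(DRT_TIME_MS / actual_ms)   # 5.0  -- the 5X actually achieved
```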

Nobody died or was injured by mixing up the “sign,” and no test equipment was damaged. We just needed to run the test twice. It’s a DARN good thing we invested in that defect mask set, as it would have been very challenging to determine we had a test exposure until much, much later. As the Trace Cache was the smallest memory, it would have taken significant volume testing to find this exposure. The volume engineering data collection would be the next chapter of the Weak Write Test Mode Saga. Come back soon to learn more.

Have a productive day,

Anne Meixner

Dear Reader, what memory or question does this piece spark in you? Have you ever dealt with a problem in which things were turned upside down or topsy-turvy? Please share your comments or stories below. You too can write for the Engineers’ Daughter; see Contribute for more information.

Additional Reading

You can learn more about the Pentium micro-architecture here. However, it doesn’t describe all the embedded memories, just the data and instruction caches.

Bao Nguyen has had a long career at Intel in various aspects of product engineering.

Rama Pedarla and I continued to cross paths on implementing new Design For Test methods. He worked on pathfinding for no-touch leakage testing and on AC I/O loopback. He is currently a principal engineer at Intel based in Austin, Texas, and when we last crossed paths he had moved on to the challenges of system-level test. Ah, the stories about system-level test are quite interesting.