Beware of the Borrowed Design

It was at this exact address, where the processor made the final jump from upper memory to lower memory (just after switching processing modes), that the target board with the emulator attached lost its mind and jumped to a totally random address to begin executing garbage instructions. Because the problem occurred so consistently on that one jump instruction (and didn’t occur with an actual processor chip installed), the customer was convinced the emulator was at fault.

We had debugged plenty of code that made the real-to-protected mode transition with no trouble, so we doubted the emulator was at fault. Yet the customer’s early setup code for the transition looked correct, so we were hard pressed to blame his code -- particularly since it did run on the actual processor chip.

Finally, in desperation, we asked the customer to send us his target system so we could replicate the problem -- which we did quite easily. But we noticed one thing the customer didn’t: The switching power supply on his board (which drove the whole processor and memory subsystem, as well as several peripheral devices) seemed awfully small for the number of chips on the board. It was so small, in fact, that we went back and redid a rough calculation of his power budget and found the switching supply to be nominally about 40 percent undersized.
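The rough power-budget check described above can be sketched in a few lines of Python. This is only an illustration: the per-device loads and the supply rating are made-up numbers (the story does not give the real ones), chosen so the result matches the roughly 40 percent shortfall mentioned, and "undersized" is read here as the shortfall relative to the total load.

```python
def supply_shortfall(load_watts, supply_watts):
    """Return the fraction by which the supply falls short of the total load."""
    total = sum(load_watts.values())
    return (total - supply_watts) / total


# Hypothetical loads in watts; the real board's figures are not given.
loads = {"processor": 4.0, "memory": 3.0, "peripherals": 3.0}
supply_rating = 6.0  # hypothetical supply rating in watts

shortfall = supply_shortfall(loads, supply_rating)
print(f"Total load {sum(loads.values()):.1f} W, shortfall {shortfall:.0%}")
```

A negative result would mean the supply has headroom; anything positive means the board can demand more than the supply is rated to deliver.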

We then took a hard look at the power rails on the board and found that, with the emulator installed, they sagged momentarily just enough to cause the processor to run below its minimum spec for Vcc. Putting the processor in where the emulator was, we saw a similar sag, only not quite as pronounced. Ground also tended to “bounce” noticeably in both cases, but less so with the actual processor than it did with the emulator.

Working back, we cross-triggered a logic analyzer and an oscilloscope and discovered the point of “power sag” was on the second major jump instruction -- exactly where the code went wild when running under emulation.

The conclusion? On that second jump instruction, the processor changed the value on just about every single one of its address lines from “high” to “low” all at once, and the switching supply, which was already huffing and puffing, had to cope with the result. That’s a lot of signals simultaneously going from Vcc to ground.
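To get a feel for how many lines switch on a jump like that, one can count the address bits that go from 1 to 0 between the source and destination addresses. The sketch below is illustrative only: the 24-bit bus width and both addresses are assumptions, since the story names neither the processor nor the actual addresses involved.

```python
def falling_lines(from_addr, to_addr, width=24):
    """Count address lines that go from high (1) to low (0) on a jump.

    width is the assumed number of address lines on the bus.
    """
    mask = (1 << width) - 1
    # A line falls if it was 1 before the jump and is 0 after it.
    falling = from_addr & ~to_addr & mask
    return bin(falling).count("1")


upper = 0xFFFFF0  # hypothetical address near the top of memory
lower = 0x000100  # hypothetical address in low memory

print(f"{falling_lines(upper, lower)} address lines fall at once")
```

A jump between two nearby addresses changes only a handful of lines, which is why an ordinary jump would not stress the supply in the same way.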

It has been my experience as an end-user that the primary failure mode of most embedded microprocessors has been the power supply.

Back in the early 1990s it was fashionable to include calculations for MTBF with such gear. The numbers we were given were clearly ridiculous. I would routinely point out that these numbers were based upon heat, not component aging. Sure enough, about 12 years later we began seeing a very high failure rate of our field devices. The cause was traced to... the power supplies. The electrolytic capacitors were failing.

Naturally, they were no longer in warranty, though the MTBF numbers would have suggested that they should have continued to be useful for many more years.

It's not just the issue of ripping designs from someone else's homework. Power supplies are one of the weakest links in keeping a system working. Remember that electrolytic capacitor scandal from around 2003? The lowly power supply deserves a great deal more attention than most engineers are willing to admit.

Gsmith120, I would think that the practice of getting it done now and fixing it later is an expensive path. How long can a manager get away with that behavior before it comes to the attention of upper management and bean counters?

Nancy, I agree some people will never see the obvious. As a senior engineer once told me, "Always remember: pay me now or pay me later, and if you pay me later it will cost you more." That was one of the things I hated about some decision makers: they just wanted to get it done and worry about fixing it later.

I feel your pain, Eric. Sometimes it just doesn't make sense that people can't see the obvious...but it happens all the time. Effective cost reduction does not equal inferior quality and sometimes you have to pay more up front to save in the long run, but sometimes you just can't convince people of that. I am with you - I would prefer that they are someone else's customers!

You hit it right on the head, Rob - but you would be surprised at how much one can "scavenge" in a company that has been operating for years and has had a lot of test equipment designed and built. I remember on more than one occasion swapping power supplies around for that very reason...or IEEE cards, or memory...or video cards...sometimes we played "ring around the test set" to get the hardware specs we needed.

Yours was a problem I saw all the time when I was in the development tools business. The only companies I've encountered where this issue isn't endemic are those where somebody takes the time, on a module by module basis, to write up a spec sheet for what the module in question will and won't do. Then there's a chance that somebody can produce the document that says, "This won't do that" when his boss says, "Just reuse this old design." That, I think, is more a comment on human psychology. But it seems critical that the entire conversation be reframed in the context of, "What changes do we have to make to the existing design in order to make it appropriate for the new design?" Asking that question assures that the matter is at least addressed, and if the answer is, "So many changes it'd be easier to start over with a clean sheet of paper," that message can be taken to management with some chance of success.

Nowadays I'm out of the development tools business, but it's not as though I don't regularly see our potential client companies doing exactly what these heavy equipment people did -- largely because management gets it in its head that everything is infinitely reusable unless there's some internal procedure for triggering a review of "reusability" and some kind of document that says, "Ah... Come to think of it... No, not at all."

Note that above I say, "potential client companies," and not, "actual clients." Because where this kind of cavalier behavior is standard practice, it's usually because the company in question is penny wise and pound foolish. Running, as I do, a high-end design shop where we have a total *value* proposition instead of a message that "we compete strictly on price," I like for the self-selection algorithms for our customers to run in such a way that the guy who regularly reuses too much to "save money" is somebody else's client -- not ours.

It wasn't *that* long ago that we turned away a client who refused to spend the money to buy an extra week of engineering time in order to figure out how to make his system run better with about half as many components in it. Had he only been building a handful of units a year, not spending the NRE dollars would have made sense. But he was building 15,000 units a year, and 40 hours of engineering time would have saved $8 of BOM cost on every one of them. When you do the math, in one year, he'd have saved $120,000 (and had a more reliable end product, since it'd have fewer things in it to break). Or dividing it through another way, for NRE costs to have outweighed one year of BOM cost savings, I'd have had to have been charging him more than $3,000 an hour for engineering time. And as much as I think Focus Embedded is a really good shop with the best product in the business, that rate is a tad high even for us... ;-)
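For anyone who wants to rerun that arithmetic with their own volumes, here is a minimal Python sketch. The figures are the ones quoted above; the function names are just illustrative, and the comparison covers only the first year of production.

```python
def annual_bom_savings(units_per_year, savings_per_unit):
    """Dollars of BOM cost saved in one year of production."""
    return units_per_year * savings_per_unit


def break_even_hourly_rate(units_per_year, savings_per_unit, engineering_hours):
    """Hourly engineering rate at which NRE cost equals one year of BOM savings."""
    return annual_bom_savings(units_per_year, savings_per_unit) / engineering_hours


savings = annual_bom_savings(15_000, 8)
rate = break_even_hourly_rate(15_000, 8, 40)
print(f"First-year savings: ${savings:,}")
print(f"Break-even rate: ${rate:,.0f} per hour")
```

At a handful of units a year the break-even rate drops to a few dollars an hour, which is the case where skipping the extra engineering time makes sense.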

Interesting story, Nancy. I would imagine that in certain critical areas -- like power supply -- you still had to test to make sure the existing design would work with the new features. That's the odd part about this story.

From my experience, Rob - I think you are correct in it happening all of the time. I look back at some of the test equipment that I worked on that was 15 years old and still expected to meet current test needs. The engineer was expected to add to the design to meet new needs rather than redesign the test set - all in an effort to save money. Every once in a while we could shake our heads and go "no way, this just won't work" and get to do a total redesign - but that didn't happen very often. If it actually compromised the integrity of the product then we would insist on a redesign, but of course with the economy and layoffs being a fact of life back then - we didn't complain too loudly if we could make it work and it was "good enough." We did the best we could "within the parameters we were given."

Yes, I think you're right, Chuck. The design engineering staff turns over and the new person simply adjusts the existing design to accommodate (though not really accommodating) the new features. Then time-to-market comes in as a pressure on design. The features keep getting added on to the old design until the design breaks down.

I worked with aerospace software design engineers who expected their software to execute without power. In fact they specifically designed software for this condition. When there is a loss of electrical power in an airplane, which can happen at any time, for a number of reasons, the electronic box needs to sense this and the software needs to execute shutdown code for the few seconds left on the caps. Probably all data should be in nonvolatile memory all the time, not registers (though it usually is not done that way).

When the electronic box starts up it needs to determine stable status as soon as possible. Sometimes knowing the last state is at least somewhat helpful. But again, the box will not know the state that it is being powered up in. There are some boxes on the airplane that get enough of the right kind of information to determine the mode of the airplane: ground power, engine power, taxi, climb, cruise, descent and land.

I had a running debate about the importance of immediately determining the present non-intermittent state of the box inputs so that a valid condition may be reported. The project manager insisted on following a series of flight mode "states". They are still having trouble because of this false series progression.
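To illustrate the difference, here is a rough Python sketch of deriving the mode from a snapshot of present inputs instead of stepping through an assumed sequence. Everything in it is hypothetical: the input names, the thresholds, and the simplified mode list (landing is left out) are placeholders for illustration, not taken from any real avionics box.

```python
from dataclasses import dataclass


@dataclass
class Inputs:
    """Hypothetical snapshot of debounced box inputs."""
    weight_on_wheels: bool
    engines_running: bool
    ground_speed_kt: float
    vertical_speed_fpm: float


def mode_from_inputs(s):
    """Classify the mode from present inputs only, with no memory of prior modes."""
    if s.weight_on_wheels:
        if not s.engines_running:
            return "ground power"
        # Placeholder threshold separating stationary from taxiing.
        return "taxi" if s.ground_speed_kt > 1.0 else "engine power"
    # Airborne: placeholder vertical-speed thresholds.
    if s.vertical_speed_fpm > 300:
        return "climb"
    if s.vertical_speed_fpm < -300:
        return "descent"
    return "cruise"


# A box that regains power in mid-descent reports a valid mode right away,
# without first walking through ground, taxi, climb and cruise.
print(mode_from_inputs(Inputs(False, True, 450.0, -800.0)))
```

A strictly sequential state machine powered up in the same situation would start from its first state and report the wrong mode until it had stepped through every transition.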
