The original posting finally brought to mind what I learned from my very best technician, who had been a service person for jet fighter avionics and systems on an aircraft carrier. Those repairs were taken a lot more seriously, since both lives and expensive aircraft depended on repairs being correct the first time. When one outcome could be a court-martial there is a very serious attitude about verifying the fault, finding the failure, and then producing the repair. It was a good method and certainly produced good results, although there were times when a plane might be out of service until the failure was found. NOT what the CO wanted to hear, but a lot better than losing one.

Oh yeah, our technical computing people are the best and a huge contribution to our productivity. Part of what makes their job easier is that all design work is done on Linux systems, which seem to be easier to keep in lock-step with respect to the OS and applications that are installed. We also have a very large farm of compute systems, that are kept absolutely identical, where most of the heavy lifting is done. Things are good now, but I remember the bad old days from 15 or 20 years ago when things weren't so good! I don't want to go back!

In spite of all this "sameness", each PC boots differently, displays differently & reacts differently. It has been far easier over the years to live with these anomalies than to waste a lot of time delving into the complexities of WINDOWS!!!!

The freeze-spray posts remind me of a problem I had to troubleshoot in the mid-1990s. We were working with an embedded modem assembled from a pair of vendor parts and our surrounding analog components. The modems would start and run properly sometimes, but most of the time they would hang. I puzzled over this for two weeks, not making much progress. After six hours of no success in the lab one day I retired to my office to try to get some distance from the problem.

My manager found me there and demanded to know why I wasn't heads-down in the lab. And by the way, he asked, "What is this problem, anyway?" So I walked him to the lab and started the product. It started correctly. I stated that this was the rare, successful case as I quickly shut the device off and restarted it--and it started into a hang state I had never before seen. Suspecting something, I quickly turned it off and on again--and it hung in the all-too-familiar hanging state.

I told my boss to stay right there--not to move--and walked to the chemical storage cabinet in the next room and returned with a can of freeze-spray. "Watch," I said. "I will now make it boot successfully several times." I laid down what was probably 1/16" of frost on the two modem parts and sure enough it started successfully several times in a row. As the frost began to melt and the water droplets warmed, eventually it hung again.

I had changed the problem from "It hangs most of the time." to "It hangs except when cold." This was a real breakthrough, since it indicated a race condition that changed winners as temperature increased, and was now isolated to the modem chips. (Incidentally I was only able to make this discovery because I had taken a break, allowing the system to cool. I never pointed this out to the manager--realized it just now--but had I not left the lab for a break I would not have made this breakthrough.)

Ultimately the problem had to be fixed by the modem chip vendor. The circuit included an unused three-state input/output register pin where proprietary internal software carried out these steps:

Enable register output

Read input <--Hangs here on unexpected value being returned

Initialize register which was being read

instead of

Enable register output

Initialize register which was being read

Read input <--Now reads the properly initialized value
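For anyone who wants to see the ordering problem in miniature, here is a rough sketch in Python. It is only a toy model, not the vendor's code: the `TriStateRegister` class, the `EXPECTED` value, and the idea of modeling the undefined power-on contents as a random byte are all my own assumptions for illustration.

```python
import random

EXPECTED = 0x00  # hypothetical value the start-up code expects to read back


class TriStateRegister:
    """Toy model of an I/O register whose contents are undefined at power-on."""

    def __init__(self, rng):
        # Assumption: undefined power-on state is modeled as a random byte.
        self.value = rng.randrange(256)
        self.output_enabled = False

    def enable_output(self):
        self.output_enabled = True

    def initialize(self):
        self.value = EXPECTED

    def read(self):
        return self.value


def boot_buggy(reg):
    """Original order: enable, read, then initialize."""
    reg.enable_output()
    ok = reg.read() == EXPECTED  # reads whatever power-on left behind
    reg.initialize()             # too late to help the read above
    return ok


def boot_fixed(reg):
    """Corrected order: enable, initialize, then read."""
    reg.enable_output()
    reg.initialize()
    return reg.read() == EXPECTED


def count_good_boots(boot, trials=1000, seed=1):
    """Count how many simulated power-ups start without hanging."""
    rng = random.Random(seed)
    return sum(boot(TriStateRegister(rng)) for _ in range(trials))
```

Calling `count_good_boots(boot_fixed)` returns every trial as a good boot, while `count_good_boots(boot_buggy)` succeeds only when the power-on value happens to match, which is the "works sometimes" behavior we were chasing.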

Had we not found the thermal sensitivity we would never have been able to get the modem vendor's attention.

I presume that your customer was fairly competent and that they had some credibility when they said that the product failed. If the fault was as well defined as one motor ceasing to run, why in the world would you return it until you understood why? A complaint about a malfunction as obvious as a motor not running would certainly merit a much more intense investigation, such as running the machine all night. Every machine that I have delivered or repaired gets a 24-hour constant run after being "completed" and otherwise ready to ship. The 24-hour run would have spotted the fault before it was shipped initially.

The only time to ship a product back and claim "no trouble found" is when the flaw that is repaired is such a dumb thing that you are embarrassed to tell anyone. But if the machine does not fail again they will probably not complain too much.

In the software debugging world there's a saying: "My computer is not the same as your computer." The idea is the same as in the article: you need to test under the same conditions as the end user. In our lab our technical computing people have spent a lot of time and effort to make sure everyone's computer is the "same" so that our DA tools work (or fail!) the same way for every engineer!
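One cheap way to check how "same" two machines really are is to collect a small fingerprint on each and compare them. The sketch below is just an illustration using Python's standard library; the function names and the particular fields chosen are my own assumptions, not anything our computing group actually runs.

```python
import platform
import sys


def environment_fingerprint():
    """Collect a few facts that commonly differ between machines."""
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        "byteorder": sys.byteorder,
    }


def diff_fingerprints(mine, yours):
    """Return {key: (mine, yours)} for every key whose values differ."""
    keys = sorted(set(mine) | set(yours))
    return {
        k: (mine.get(k), yours.get(k))
        for k in keys
        if mine.get(k) != yours.get(k)
    }
```

Run `environment_fingerprint()` on both machines, save the results, and pass them to `diff_fingerprints`. An empty dictionary means the two agree on the fields checked, which of course says nothing about fields you did not think to include.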

I actually have had similar problems with motor controllers, especially for motors that have high current ratings and run under high load. The heating problem is the worst of them all. Ultimately, I ended up using huge heat sinks to solve my problem.

"This is the kind of problem that causes engineers to search for weeks and examine every possibility, no matter how complex. Only an experienced engineer would have thought of something as simple as vent holes."

Until it happens once - then you never forget and overheating comes to the top of your list...I can't tell you how many times I have had to tell my son to quit covering the vents on any of the numerous electronic boxes that are common to any household from TVs to stereos etc. or to not use his laptop on the carpeting. Air circulation is your friend and covering vents with books or DVD cases is not a friendly act...

Freon was also standard on every tech bench in the old days. When I worked in product engineering - we also tested chips across temp. We were either forcing them cold with liquid nitrogen or burning them in. Spec sheets will tell you a chip's operating temperature tolerances but it's harder at the system level for the reasons stated in various comments. Keeping electronics cool is definitely an issue that needs to be considered in every design from chip to chassis...

Yes, I used a lot of the spray Freon over the years of my career. In this case I just failed to give thought to heat being the source of the problem, mainly since both controllers were mounted in what I thought were nearly identical environments. It turned out that there was leakage nearer to the one that allowed the cooling air to be concentrated on that controller. Lesson learned!
