Component Count & Reliability

System component count is often one of the larger determining factors in system reliability. Let's compare three designs: two done largely with discrete components and one done with ICs.

In the first version, the solution is a low-integration, modular, discrete-based design. This is for a transceiver with multiple bands handled via multiple LRUs (line replaceable units). In the second version, the integration is higher, but the design is still largely discrete-based; everything is in one LRU with much less wiring. In the last version, the system is largely a VLSI-based implementation that significantly reduces parts count compared with the first two approaches.

Version 1 has about 4,500 components and very limited built-in test/built-in self-test (BIT/BIST) capability. Instead, it relies on pre/post-dispatch checks. In this approach it is not unusual for wires to work loose and cause tuning errors or other faults in the system. With 4,500 parts, even if each part were to fail only once per 100,000,000 hours, that works out to a failure about every two and a half years.
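A minimal sketch of the arithmetic above, assuming constant failure rates, a pure series system (any single part failure is a system failure), identical parts, and 8,760 operating hours per year. The function and constant names are hypothetical.

```python
HOURS_PER_YEAR = 8760  # continuous operation: 24 h x 365 days (assumption)


def series_mtbf_hours(part_count: int, part_mtbf_hours: float) -> float:
    """MTBF of a series system of identical, constant-failure-rate parts.

    Any single part failure fails the system, so the per-part failure
    rates simply add: lambda_sys = part_count / part_mtbf_hours.
    """
    part_failure_rate = 1.0 / part_mtbf_hours  # failures per hour, per part
    system_failure_rate = part_count * part_failure_rate
    return 1.0 / system_failure_rate


mtbf = series_mtbf_hours(4500, 100_000_000)
print(f"{mtbf:,.0f} hours, about {mtbf / HOURS_PER_YEAR:.1f} years")
```

Under these assumptions, 4,500 parts at one failure per 10^8 hours gives roughly 22,000 hours between failures; loose wiring and other interconnect faults would only shorten that.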

Version 2 drastically reduces the wiring count and adds BIT/BIST to the wire bundle, in addition to the manual pre/post-dispatch checks. With fewer interconnections, and with error checking added to the signaling over the wires, more attention can be given to each remaining wire. Undetected tuning or other errors are much less common.

With about 3,500 parts and the reduced wiring count for a basic system -- again assuming each part were to fail only once per 100,000,000 hours -- we could expect a fault approximately every 2.5 to 3.5 years.
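Worked through with the same series-system, constant-failure-rate assumptions (parts only, wiring ignored, 8,760 operating hours per year):

```latex
\lambda_{\text{sys}} = N\,\lambda_{\text{part}} = 3500 \times 10^{-8}\,\text{h}^{-1} = 3.5 \times 10^{-5}\,\text{h}^{-1}
\qquad
\text{MTBF} = \frac{1}{\lambda_{\text{sys}}} \approx 28{,}600\ \text{h} \approx \frac{28{,}600}{8760}\ \text{yr} \approx 3.3\ \text{yr}
```

This lands inside the 2.5 to 3.5 year range quoted above.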

In the VLSI version, the component count is reduced to about 450 parts. Error checking is prevalent in much of the system, and the reduced number of interconnections also makes it practical to add error checking to the wiring. Undetected errors are quite rare.

Using the same criteria and calculations as for Versions 1 and 2, and considering the greatly reduced wiring and interconnections, we could expect a fault approximately every 10 to 20 years (some reduction is taken for the use environment and the SMD packaging). For the industrial environment, we use the low end, 10 years, which we round to about 100,000 hours of operation (10 years of continuous operation is strictly 87,600 hours).
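A sketch comparing the three versions on parts count alone, again assuming a series system and one failure per 10^8 hours per part. The "rate multiplier" is a hypothetical single factor standing in for the environment and SMD-packaging reduction mentioned above; the post does not say how that reduction was derived.

```python
HOURS_PER_YEAR = 8760      # continuous operation (assumption)
PART_FAILURE_RATE = 1e-8   # one failure per 100,000,000 hours per part, as in the text


def parts_only_years(part_count: int, rate_per_hour: float = PART_FAILURE_RATE) -> float:
    """Series-system MTBF in years, counting parts only (no wiring, no derating)."""
    return 1.0 / (part_count * rate_per_hour) / HOURS_PER_YEAR


def implied_multiplier(part_count: int, target_years: float) -> float:
    """Hypothetical single factor on the failure rate needed to bring the
    parts-only estimate down to target_years."""
    return parts_only_years(part_count) / target_years


versions = {"Version 1": 4500, "Version 2": 3500, "VLSI version": 450}
for name, parts in versions.items():
    print(f"{name}: {parts} parts -> {parts_only_years(parts):.1f} years (parts only)")

for target in (20, 10):
    print(f"VLSI at {target} years implies a x{implied_multiplier(450, target):.2f} rate multiplier")
```

The parts-only figure for the VLSI version is about 25 years, so the quoted 10 to 20 years corresponds to a combined multiplier of roughly 1.3 to 2.5 on the failure rate.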

How does this compare with past generations of products you have worked with? Do VLSI/high integration solutions improve reliability in your situation? What are your experiences with reducing component count to improve reliability?

If the component count is decreased in a design, there are fewer potential points of failure, which increases reliability. Systems engineering often includes efforts to reduce part count with the objective of cutting costs, enhancing performance, or improving reliability.

All - The most common reasons ICs fail before reaching their rated MTBF hours are:
1) High delta-T thermal cycles
2) ESD during test and maintenance
3) Thermal shock/vibration
4) Exceeding design ratings (for example, an internal hotspot due to overvoltage or overcurrent)
5) Over-temperature of the die (electromigration at temperature, etc.)
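Item 5 (die over-temperature) is commonly modeled with the Arrhenius relation. A minimal sketch follows; the 0.7 eV activation energy and the temperatures are assumptions for illustration, since real activation energies depend on the failure mechanism.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K


def arrhenius_af(ea_ev: float, t_use_c: float, t_hot_c: float) -> float:
    """Arrhenius acceleration factor: how much faster a thermally activated
    failure mechanism proceeds at junction temperature t_hot_c than at t_use_c."""
    t_use_k = t_use_c + 273.15
    t_hot_k = t_hot_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_hot_k))


# Hypothetical example: 0.7 eV activation energy, die at 85 C instead of 55 C.
af = arrhenius_af(0.7, 55.0, 85.0)
print(f"acceleration factor ~{af:.1f}; 100,000 h becomes ~{100_000 / af:,.0f} h")
```

With these assumed values, running the die 30 °C hotter speeds up the mechanism by roughly a factor of 8. Electromigration specifically is usually modeled with Black's equation, which adds a current-density term; the Arrhenius factor here is only its temperature part.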

If kept within design ratings, modern ICs that have been designed to pass, and have passed, a reliability qualification test in a benign room-temperature environment are quite reliable, with numbers easily exceeding 100,000 hours even for a really hot part like a PC processor, and ultra-low-power processors easily exceeding 100 million hours. It is where one cycles the part 5 to 10 times a day over a large temperature delta, as with a delivery truck's engine controller IC, that one must pay close attention to everything, and numbers down around 20K hours are sometimes what one ends up with.
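The thermal-cycling point can be illustrated with the Coffin-Manson relation, a standard model for fatigue under repeated temperature swings (for example, in solder joints and bond wires). The choice of model is mine, not the post's, and the reference cycle count, temperature swings, and exponent below are all hypothetical.

```python
def cycles_to_failure(n_ref: float, dt_ref: float, dt_field: float, exponent: float) -> float:
    """Coffin-Manson scaling of cycles to failure with temperature swing:
    N_field = N_ref * (dT_ref / dT_field) ** exponent."""
    return n_ref * (dt_ref / dt_field) ** exponent


def calendar_hours(n_cycles: float, cycles_per_day: float) -> float:
    """Convert a cycle budget into calendar hours at a fixed daily cycling rate."""
    return n_cycles / cycles_per_day * 24.0


# Hypothetical inputs: 1,000 cycles survived at a 165 C swing (-40 C to 125 C),
# an 80 C swing in the truck, and an assumed exponent of 2.
n_field = cycles_to_failure(1000, 165.0, 80.0, 2.0)
for per_day in (5, 10):
    print(f"{per_day} cycles/day -> ~{calendar_hours(n_field, per_day):,.0f} hours")
```

With these made-up inputs, 5 cycles a day comes out near 20,000 hours and 10 cycles a day halves that: the fixed cycle budget, not operating time, is what runs out.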

Hi William--your post reminded me of when I learned about tolerance stack-ups and how to do RSS (root-sum-square) calculations to get more realistic estimates of whether an assembly might have a problem. Young engineers faced with their first stack-up, and not yet street-savvy in statistics, look at all those parts and all those tolerances and start to worry that the assembly will never work.

Later, I learned about Monte Carlo simulations and how you could model combinations of possible distributions in an even more realistic way. However, in both cases you still have one value for every part--the difference is in how you add it in.
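A small sketch of the three ways of adding tolerances mentioned above (worst case, RSS, Monte Carlo) on a hypothetical five-part stack. Treating each tolerance as ±3σ of a normal distribution is an assumption, and it makes the simulation agree with RSS.

```python
import math
import random
import statistics

# Hypothetical five-part stack: (nominal length in mm, symmetric tolerance +/- mm)
STACK = [(10.0, 0.10), (25.0, 0.15), (5.0, 0.05), (12.0, 0.10), (8.0, 0.08)]


def worst_case(stack):
    """Every part at its tolerance limit in the same direction."""
    return sum(tol for _, tol in stack)


def rss(stack):
    """Root-sum-square of the tolerances."""
    return math.sqrt(sum(tol ** 2 for _, tol in stack))


def monte_carlo_3sigma(stack, trials=50_000, seed=1):
    """Simulate the stack, treating each tolerance as +/-3 sigma of a normal
    distribution (an assumption), and return 3 sigma of the total length."""
    rng = random.Random(seed)
    totals = [sum(rng.gauss(nom, tol / 3.0) for nom, tol in stack) for _ in range(trials)]
    return 3.0 * statistics.stdev(totals)


print(f"worst case  +/-{worst_case(STACK):.3f} mm")
print(f"RSS         +/-{rss(STACK):.3f} mm")
print(f"Monte Carlo +/-{monte_carlo_3sigma(STACK):.3f} mm")
```

Worst case gives ±0.48 mm against roughly ±0.23 mm for RSS. Monte Carlo earns its keep when a part is not normal: swapping `rng.gauss` for `rng.uniform(nom - tol, nom + tol)` on that part is something RSS cannot express.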

Which brings me to your case, and to my question. I think I'm along the same lines as DaeJ--is it correct to consider an ASIC or other IC a lumped component with a count of 1? If the feature count is the same, what are the reliabilities of each resistor, capacitor, etc. in the integrated part? I think that either the features in ICs are extremely reliable compared with assemblies of discrete parts (where I assume some failures come from connectivity and soldering issues, as you note), or the ICs are tested directly (I know some are; vendors do reliability testing at the IC level and publish MTTF, etc.), or there is some other method known to give more realistic estimates when combining the reliability of all the features.
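One common way to handle this in practice, sketched here rather than taken from any particular standard: the IC enters the sum as a single line whose failure rate comes from the vendor's own reliability qualification, typically published as FIT (failures per 10^9 device-hours), so its internal features are already folded into that one number. The bill of materials and FIT values below are hypothetical.

```python
# 1 FIT = 1 failure per 1e9 device-hours. All values are hypothetical placeholders.
BOM = [
    # (description, quantity, FIT per unit)
    ("VLSI transceiver IC, vendor-qualified (entered once)", 1, 20.0),
    ("ceramic capacitor", 120, 0.5),
    ("thin-film resistor", 80, 0.2),
    ("connector", 4, 10.0),
]


def system_fit(bom):
    """Series system: total failure rate is the sum of quantity x FIT per line."""
    return sum(qty * fit for _, qty, fit in bom)


def mtbf_hours(fit_total):
    """Convert a total FIT rate into MTBF in hours."""
    return 1e9 / fit_total


total = system_fit(BOM)
print(f"{total:.1f} FIT -> MTBF ~{mtbf_hours(total):,.0f} hours")
for desc, qty, fit in BOM:
    print(f"  {desc}: {100 * qty * fit / total:.0f}% of the total")
```

In this made-up bill of materials, the 120 capacitors together contribute more failure rate than the single IC line, which is the parts-count argument in miniature.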

MIL-HDBK-217 is one method for determining the reliability of electronics. There have also been Bellcore standards and others. With that said, I have seen some really cooked numbers come out of MIL-HDBK-217 compared with what one got in actual service. Some of the reasons for this: 1) MIL-HDBK-217 does not specify a thermal cycling profile for the analysis. 2) The customer may fail to specify a normal operating temperature for the equipment location for the analysis. 3) Factors like humidity, lightning (for aircraft, which are often struck several times a year), and others. Still, if one puts 50,000 parts into MIL-HDBK-217, one will get a much worse reliability number than if one puts in only 4,500 similar parts.
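For reference, the parts count method in MIL-HDBK-217F has, as I understand it, the general form below for a single use environment, where n is the number of generic part categories, N_i the quantity in category i, λ_g the generic failure rate for that category in the chosen environment, and π_Q its quality factor:

```latex
\lambda_{\text{EQUIP}} = \sum_{i=1}^{n} N_i \,\bigl(\lambda_g\,\pi_Q\bigr)_i
```

Because N_i enters linearly, 50,000 parts of the same mix give about 11 times (50,000 / 4,500 ≈ 11.1) the failure rate of 4,500 parts, and there is no term for a thermal cycling profile, which is one of the gaps listed above.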

It is not easy to determine the normal operating hours of each component unless the datasheet indicates it. I think the chip maker defines this kind of requirement, and a reliability engineer might work out the long-term reliability of the system. Secondly, all the discrete components of our previous system were replaced with a single microcontroller. We did not measure the reliability of the updated system, even though the system is simplified. I think the industry has a process for this.