FAILURE RATE • The number of failures of an item within the population per unit of operation (time, cycles, miles, runs, etc.)
ELECTRONIC SYSTEM RELIABILITY - WHY IMPORTANT? • PROBLEMS – Electronic systems use very large numbers of very similar components. – The designer has little control over their production and manufacture and must specify catalogue items. – The designer has little control over device reliability. – Control of the production process is a major determinant of reliability. – It is difficult to test for electronic component defects that do not immediately affect performance. • SOLUTION: Very close attention must be paid to electronic part reliability. The design must involve a reliability team.
OUTLINE • DEFINITIONS • CAUSES OF ELECTRONIC COMPONENT FAILURE • PREDICTION METHODS - TEST • MIL-HDBK-217 PREDICTION METHODS - CALCULATIONS – PARTS STRESS ANALYSIS PREDICTIONS – PARTS COUNT RELIABILITY METHOD – LIMITATIONS • ADDITIONAL INFORMATION – Other Failure Rate Data Sources – Arrhenius Model
DEFINITIONS • OPERATING STRESS – The actual stress (or load) applied during operation of the part (e.g., voltage for capacitors, dissipated power for resistors). • RATED STRESS – The manufacturer's rating for the part. • STRESS RATIO – Ratio of operating stress to rated stress. • PART GRADES – Grade 1, 2, etc. designate high-quality standard parts. – JAN, Industrial, and Commercial grade designations apply to other parts that can be used.
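The stress ratio defined above can be illustrated with a minimal Python sketch. The function name and the example values (a capacitor rated at 50 V and operated at 25 V) are hypothetical and are not taken from the slides.

```python
def stress_ratio(operating_stress: float, rated_stress: float) -> float:
    """Return operating stress divided by rated stress (same units for both)."""
    if rated_stress <= 0:
        raise ValueError("rated stress must be positive")
    return operating_stress / rated_stress


# Hypothetical example: a capacitor rated at 50 V operated at 25 V.
ratio = stress_ratio(operating_stress=25.0, rated_stress=50.0)
print(f"Stress ratio: {ratio:.2f}")
```

A ratio below 1.0 means the part is operated under its rating; the ratio is one of the inputs to the stress-based prediction models discussed later.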
BACKGROUND • Reliability engineering and management grew up largely in response to the problems of electronic equipment reliability. • Many reliability techniques have been developed from electronics applications.
CAUSES OF ELECTRONIC COMPONENT FAILURES
Electronic Failures = f(design, mfg. process, quality type, temperature, electrical load, vibration, chemical, stresses)
OTHER CAUSES OF ELECTRONIC COMPONENT FAILURES (con't)
Electrical Load • Higher than anticipated voltage or current loads can cause arcing and other damage. Vibration • Shock and vibration can cause fatigue damage even to properly made components. Chemical • Contaminants introduced in the manufacturing process may eventually degrade an IC or other device. • Environmental contaminants (moisture, etc.) may promote chemical attack on components.
MIL-HDBK-217 PREDICTION METHODS
PARTS STRESS ANALYSIS PREDICTIONS • This method is applicable when most of the design is completed and a detailed parts list including parts stresses is available. • This model takes into account part quality, use environment, and the base failure rate (which includes electrical and temperature stresses).
PARTS STRESS ANALYSIS (con't)
$$\lambda_p = \lambda_b \, \pi_T \, \pi_A \, \pi_R \, \pi_s \, \pi_c \, \pi_Q \, \pi_E \quad \text{(failures per } 10^6 \text{ hours)}$$
where:
$\lambda_p$ = part failure rate (failures per $10^6$ hours)
$\lambda_b$ = base failure rate (often includes electrical and temperature stress)
$\pi_T$ = Temperature Factor (dimensionless, typical 1 - 150)
$\pi_A$ = Application Factor (dimensionless, typical 1 - 5)
$\pi_R$ = Power Rating Factor (dimensionless, typical 0.5 - 1.0)
$\pi_s$ = Voltage Stress Factor (dimensionless, typical 0.1 - 1.0)
$\pi_c$ = Construction Factor (dimensionless, typical 1 - 5)
$\pi_Q$ = Quality Factor (dimensionless, typical 0.7 - 8.0)
$\pi_E$ = Environmental Factor (dimensionless, typical 1 - 450)
Each device uses some or all of these factors. Other factors are also used.
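The part stress model above is a base rate multiplied by whichever π factors apply to the device type. A minimal Python sketch follows; the function name and all numeric values are hypothetical placeholders chosen for illustration, not values from MIL-HDBK-217 tables.

```python
from math import prod
from typing import Mapping


def part_failure_rate(base_rate: float, pi_factors: Mapping[str, float]) -> float:
    """lambda_p = lambda_b * (product of the applicable pi factors).

    base_rate is in failures per 1e6 hours; the pi factors are dimensionless.
    """
    return base_rate * prod(pi_factors.values())


# Hypothetical factors for one part; real values come from the handbook tables.
factors = {"pi_T": 2.0, "pi_Q": 1.0, "pi_E": 4.0}
lambda_p = part_failure_rate(base_rate=0.01, pi_factors=factors)
print(f"lambda_p = {lambda_p:.3f} failures per 1e6 hours")
```

Passing the factors as a mapping reflects the note that each device uses only some of the factors: a part type without a given factor simply omits that key.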
COMBINING RESULTS • The general procedure for determining the board-level failure rate is to: • Sum the individually calculated failure rates for each component. • Add this sum to a failure rate for the circuit board (which includes the effects of soldering parts to it). • Account for the effects of connecting circuit boards together by adding a failure rate for each connector.
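The combining procedure above can be sketched in a few lines of Python. The function name and the example rates are hypothetical; all rates are assumed to share one unit (failures per 10^6 hours).

```python
from typing import Iterable


def board_failure_rate(
    component_rates: Iterable[float],
    board_rate: float,
    connector_rates: Iterable[float],
) -> float:
    """Sum component rates, then add the bare-board rate and one rate per connector."""
    return sum(component_rates) + board_rate + sum(connector_rates)


# Hypothetical rates in failures per 1e6 hours.
total = board_failure_rate(
    component_rates=[0.08, 0.02, 0.15],
    board_rate=0.05,
    connector_rates=[0.01, 0.01],
)
print(f"Board-level failure rate: {total:.2f} failures per 1e6 hours")
```

Simple addition of rates treats the board as a series system: any single part, solder joint, or connector failure counts as a board failure.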
Non-operating Failures • Parts continue to fail even when not in use. In general, electronic parts fail less frequently when not operating because failures are related to operating stress. But other components tend to degrade even when not in use. Examples: – Hydraulic parts fail because organic rubber seals outgas and cross-link when exposed to heat and ultraviolet light. – Solid rocket engines undergo chemical degradation and can develop cracks. • $R_s = R_{\text{operating}} \times R_{\text{non-operating}}$
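The product rule for operating and non-operating reliability can be illustrated as follows. This is a sketch under an added assumption that the slides do not state: each phase has a constant failure rate, so that R = exp(-λt). The failure rates and durations are hypothetical.

```python
import math


def reliability(failure_rate_per_hour: float, hours: float) -> float:
    """R = exp(-lambda * t). ASSUMPTION: constant failure rate (not from the slides)."""
    return math.exp(-failure_rate_per_hour * hours)


def system_reliability(r_operating: float, r_non_operating: float) -> float:
    """R_s = R_operating * R_non_operating, as given in the slide."""
    return r_operating * r_non_operating


# Hypothetical one-year profile: 1,000 h operating and 7,760 h dormant.
r_op = reliability(failure_rate_per_hour=2.0e-6, hours=1_000)
r_nonop = reliability(failure_rate_per_hour=2.0e-7, hours=7_760)
r_s = system_reliability(r_op, r_nonop)
print(f"R_op = {r_op:.5f}, R_nonop = {r_nonop:.5f}, R_s = {r_s:.5f}")
```

Even with a dormant failure rate ten times lower than the operating rate, the long dormant period contributes a comparable share of the total unreliability in this example, which is why non-operating failures cannot be ignored.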
Parts Count Reliability Method • Used early in the design or when detailed data are not available. • Uses the generic part type, a quality factor, and an environmental factor. • Information needed: (1) generic part types (including complexity for microcircuits) and quantities, (2) part quality levels, and (3) equipment environment.
Parts Count Reliability Method
$$\lambda_{EQUIP} = \sum_{i=1}^{n} N_i \, (\lambda_g \pi_Q)_i$$
where:
$\lambda_{EQUIP}$ = total equipment failure rate (failures per $10^6$ hours)
$\lambda_g$ = generic failure rate for the $i$th generic part
$\pi_Q$ = quality factor for the $i$th generic part
$N_i$ = quantity of the $i$th generic part
$n$ = number of different generic part categories in the equipment
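The parts count summation above maps directly onto a short Python sketch. The `GenericPart` type, the part names, and every numeric value are hypothetical; real generic failure rates and quality factors come from the handbook tables for the chosen environment.

```python
from typing import Iterable, NamedTuple


class GenericPart(NamedTuple):
    name: str
    quantity: int          # N_i
    generic_rate: float    # lambda_g, failures per 1e6 hours
    quality_factor: float  # pi_Q, dimensionless


def equipment_failure_rate(parts: Iterable[GenericPart]) -> float:
    """lambda_EQUIP = sum over part categories of N_i * (lambda_g * pi_Q)_i."""
    return sum(p.quantity * p.generic_rate * p.quality_factor for p in parts)


# Hypothetical parts list for one environment.
parts_list = [
    GenericPart("resistor", 40, 0.002, 1.0),
    GenericPart("capacitor", 25, 0.004, 1.0),
    GenericPart("microcircuit", 5, 0.05, 2.0),
]
lambda_equip = equipment_failure_rate(parts_list)
print(f"lambda_EQUIP = {lambda_equip:.2f} failures per 1e6 hours")
```

In this sketch the environment is assumed to be already reflected in each `generic_rate`, since the equation has no separate environmental term.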
LIMITATIONS • RELIABILITY PREDICTION MUST BE USED INTELLIGENTLY, WITH DUE CONSIDERATION TO ITS LIMITATIONS • FAILURE RATE MODELS ARE POINT ESTIMATES WHICH ARE BASED ON AVAILABLE DATA – THEY ARE VALID FOR THE CONDITIONS UNDER WHICH THE DATA WERE OBTAINED AND FOR THE DEVICES COVERED. – MODELS ARE INHERENTLY EMPIRICAL

1.1 Purpose - The purpose of this handbook is to establish and maintain consistent and uniform methods for estimating the inherent reliability (i.e., the reliability of a mature design) of military electronic equipment and systems. It provides a common basis for reliability predictions during acquisition programs for military electronic systems and equipment. It also establishes a common basis for comparing and evaluating reliability predictions of related or competitive designs. The handbook is intended to be used as a tool to increase the reliability of the equipment being designed.

1.2 Application - This handbook contains two methods of reliability prediction - "Part Stress Analysis" in Sections 5 through 23 and "Parts Count" in Appendix A. These methods vary in degree of information needed to apply them. The Part Stress Analysis Method requires a greater amount of detailed information and is applicable during the later design phase when actual hardware and circuits are being designed. The Parts Count Method requires less information, generally part quantities, quality level, and the application environment. This method is applicable during the early design phase and during proposal formulation. In general, the Parts Count Method will usually result in a more conservative estimate (i.e., higher failure rate) of system reliability than the Parts Stress Method.

1.3 Computerized Reliability Prediction - Rome Laboratory - ORACLE is a computer program developed to aid in applying the part stress analysis procedure of MIL-HDBK-217. Based on environmental use characteristics, piece part count, thermal and electrical stresses, subsystem repair rates and system configuration, the program calculates piece part, assembly and subassembly failure rates. It also flags overstressed parts, allows the user to perform tradeoff analyses and provides system mean-time-to-failure and availability. The ORACLE computer program software (available in both VAX and IBM compatible PC versions) is available at replacement tape/disc cost to all DoD organizations, and to contractors for application on specific DoD contracts as government furnished property (GFP). A statement of terms and conditions may be obtained upon written request to: Rome Laboratory/ERSR, Griffiss AFB, NY 13441-5700.
What is MTBF?
MTBF is an acronym for Mean Time Between Failures. In general, a higher MTBF number indicates a more reliable product. Beyond this simple definition, you’ll find a wide variety of special meanings.
In the military/aerospace industries, MTBF is defined by a specific set of calculations. The formula for system longevity is based on the thermal, electrical and environmental stresses on each component. The engineer evaluates the components and subassemblies in a particular product by these formulas and produces an overall number called calculated MTBF.
Another way to compute MTBF is to evaluate product reliability based on the product’s actual performance in the field. Instead of theoretical calculations of what might occur, field MTBF is a measure of the numbers and types of failures that the products actually experience in real applications. At Liebert, we track two types of field MTBF statistics: critical bus MTBF and hardware MTBF. In the next few paragraphs, we will explain each of these.

Critical Bus MTBF
Our primary focus is on critical bus MTBF. This measures how effectively the UPS, batteries and bypass source can support the customer’s critical load without a failure attributable to the UPS or System Control Cabinet. Liebert maintains a database with information on every Series 600 UPS ever shipped. We also keep records of all reported failures. Each quarter we evaluate the reliability information and tally up the critical bus outages that were attributable to the UPS or System Control Cabinet.
Some events are excluded from the total. For example, if a UPS experiences an alarm condition and successfully transfers the load to bypass, there is no critical bus outage. Likewise, if utility input power fails and the UPS and batteries support the critical load for the proper number of minutes, the UPS has done its job. If the utility power (or backup diesel generator) is not available when the UPS has drained the batteries, the UPS -- with ample warning to the operator -- will perform an orderly shutdown. This is not a chargeable critical bus outage since the equipment performed as designed.
Other excluded situations are those caused by site conditions or operator error. For example, one customer wired his facility fire alarm system to trigger the Emergency Power Off circuit on the UPS. Unfortunately, he forgot to disconnect the circuit before performing a routine test of the fire alarm system. This caused a critical bus outage, but did not count against UPS MTBF.

What have we done lately?
Each quarter we tally up the cumulative system operating hours and the total number of critical bus outages reported since the introduction of the Series 600 UPS. As of this writing, we have records of more than 7,000 Series 600 modules in more than 5,500 systems. Cumulative system operating hours exceed 220 million. Since shipments began in 1989, we have records of just 80 critical bus failures. Considering our exposure is approximately 4 million system operating hours per month, this is a remarkably small number of failures.
We compute our field MTBF numbers by dividing system operating hours by “failures plus one.” We do this to be conservative and to be consistent with earlier published documents. Dividing 220 million hours by 81 (80 + 1) gives us a number considerably in excess of 2 million hours. We recognize that some Series 600 sites are not under contract to Liebert Global Services and might not be reporting all failures. Therefore we choose not to advertise the exact calculated number. “In excess of one million hours” is sufficient.

Module MTBF
The other way we track reliability is the field MTBF of the UPS modules. For these purposes, we count every type of module or System Control Cabinet failure that causes the module to take itself off-line. As before, we exclude incidents of operator error, site problems or instances of shutdown after successful discharge of batteries.
To compile this number, we have taken various sample periods. For a challenge, one of the periods was chosen to coincide with one of the worst heat waves on record in large portions of the Midwest and Northeast. A difficult test indeed! During the sample periods, Series 600 UPS modules accumulated approximately 6 million operating hours and 35 hardware failures. Of these, only one caused a critical bus outage. The other 34 events featured the UPS successfully transferring the load to the bypass source. Dividing 6 million hours by 35 gives a module MTBF of approximately 170,000 hours.
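The two field MTBF figures quoted above can be reproduced with a short Python sketch. The function name and the `conservative` flag are hypothetical; the hours and failure counts are the ones stated in the text. Note that the text divides by "failures plus one" for the critical bus figure but by the raw failure count for the module figure, and the sketch mirrors that.

```python
def field_mtbf(operating_hours: float, failures: int, conservative: bool = True) -> float:
    """Field MTBF in hours.

    conservative=True divides by (failures + 1), as described for critical bus MTBF.
    conservative=False divides by the raw failure count, as in the module example.
    """
    divisor = failures + 1 if conservative else failures
    if divisor <= 0:
        raise ValueError("need at least one failure when conservative=False")
    return operating_hours / divisor


# Critical bus: 220 million system hours, 80 outages -> divide by 81.
critical_bus_mtbf = field_mtbf(220e6, 80)

# Module: 6 million module hours, 35 hardware failures -> divide by 35.
module_mtbf = field_mtbf(6e6, 35, conservative=False)

print(f"Critical bus MTBF: {critical_bus_mtbf:,.0f} hours")
print(f"Module MTBF:       {module_mtbf:,.0f} hours")
```

The results are roughly 2.7 million hours and 171,000 hours, consistent with the "considerably in excess of 2 million" and "approximately 170,000" figures in the text.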
Description of Methodology
The parts count method is a technique for developing an estimate or prediction of the average life, the Mean Time Between Failures (MTBF), of an assembly. It is a prediction process whereby a numerical estimate is made of the ability, with respect to failure, of a design to perform its intended function. Once the failure rate is determined, MTBF is easily calculated as the inverse of the failure rate, as follows:
$$\text{MTBF} = \frac{1}{FR_1 + FR_2 + FR_3 + \cdots + FR_n}$$
where $FR_i$ is the failure rate of the $i$th component and $n$ is the total number of components in the system.
The general procedure for determining a board-level (or system-level) failure rate is to sum the individual failure rates of each component. For MIL-HDBK-217, the summation is then added to a failure rate for the circuit board, which includes the effect of solder joints. Component failure rates are taken from the standard part failure rate models in MIL-HDBK-217, "Military Handbook, Reliability Prediction of Electronic Equipment", or directly from the manufacturers. The failure rates presented apply to equipment under normal operating conditions, i.e., with power on and performing its intended function in its intended environment. Consideration is given to various environments, component quality, and thermal aspects.
The Equations
A sample calculation for integrated circuits taken from MIL-HDBK-217 is as follows:
Failure Rate = (C1 * PiT + C2 * PiE) * PiQ * PiL
Each factor in this equation depends on a particular part parameter. The result of this equation is the failure rate of the integrated circuit.

Failure Rate, MTBF, and FITs
For this discussion, we will assume that the resulting failure rate is expressed in failures per million hours. This is simply the number of failures that you would expect to have in a million hours of operation of your equipment. Failure rates for many basic devices are well below 1 failure per million hours, so these values may seem insignificant. But if you have hundreds of parts in your design and a thousand systems operating in the field, the failure rates quickly add up.
MTBF, or Mean Time Between Failures, is the inverse of the failure rate and is the average time between failures. With the failure rate in failures per million hours, MTBF in hours is calculated as follows:
MTBF = 1,000,000/Failure Rate
You can choose the units in which the failure rate is shown. Another common unit, besides failures per million hours, is failures per billion hours, also known as FITs (Failures In Time).

What is MIL-HDBK-217?
MIL-HDBK-217 is a reliability prediction standard originally developed for defense and aerospace related organizations, but later adopted by many commercial and industrial companies. Often referred to simply as 217, MIL-HDBK-217 includes mathematical reliability models for nearly all types of electrical and electronic components. These reliability models are based on parameters of the components such as number of pins, number of transistors, power dissipation, and environmental factors. Results from MIL-HDBK-217 are provided as both a failure rate and as an MTBF (Mean Time Between Failures), where the MTBF is the mathematical inverse of the failure rate.
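The integrated circuit equation and the unit conversions described above (failure rate, MTBF, FITs) can be tied together in a minimal Python sketch. The function names and all input values are hypothetical placeholders; real C1, C2, and Pi values come from the MIL-HDBK-217 tables for a specific device.

```python
def ic_failure_rate(c1: float, pi_t: float, c2: float, pi_e: float,
                    pi_q: float, pi_l: float) -> float:
    """Failure Rate = (C1 * PiT + C2 * PiE) * PiQ * PiL, in failures per 1e6 hours."""
    return (c1 * pi_t + c2 * pi_e) * pi_q * pi_l


def mtbf_hours(rate_per_million_hours: float) -> float:
    """MTBF = 1,000,000 / failure rate (rate in failures per 1e6 hours)."""
    return 1_000_000 / rate_per_million_hours


def to_fits(rate_per_million_hours: float) -> float:
    """Convert failures per 1e6 hours to FITs (failures per 1e9 hours)."""
    return rate_per_million_hours * 1_000


# Hypothetical inputs, not taken from the handbook tables.
rate = ic_failure_rate(c1=0.02, pi_t=1.5, c2=0.005, pi_e=4.0, pi_q=1.0, pi_l=1.0)
print(f"Failure rate: {rate:.3f} failures per 1e6 hours")
print(f"MTBF:         {mtbf_hours(rate):,.0f} hours")
print(f"FITs:         {to_fits(rate):.0f}")
```

With these placeholder inputs the rate is 0.05 failures per million hours, which corresponds to an MTBF of 20 million hours, or 50 FITs.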