Class 304: Fantastic Failures
Embedded Systems Conference
Wednesday, 31 March 2004
By Kim R. Fowler
Historical Case Studies
The De Havilland Comet
(courtesy of Marc Schaeffer, find this photo at the following website: www.geocities.com/CapeCanaveral/Lab/8803/)
Comet recounted
• First jet airliner – speed and comfort
• 3 crashes between May 1953 and April 1954
• Extensive testing
• Catastrophic cracking from metal fatigue
• Fixes – rounded corners, reinforcing plates
• New understanding of metal fatigue
Ariane 5
(Photographic source is ESA/CNES. You can find these photos at the following website: www.mssl.ucl.ac.uk/www_plasma/missions/cluster/about_cluster/cluster1/cluster1_images.html)
Ariane 5 recounted
• Dual-redundant processors
• 3 unprotected variables, one of which overflowed
• Processors shut down on overflow, no graceful recovery
• Software reused from Ariane 4, no check against Ariane 5 flight dynamics
• Ariane 5 had greater horizontal drift velocities
• Reuse is tricky, end-to-end system test necessary
• Find report at: www.esa.int/export/esaLA/Pr_33_1996_p_EN.html
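A minimal Python sketch of the overflow lesson. The inquiry report describes a 64-bit floating-point value being converted to a 16-bit signed integer without range protection. The function names and sample values below are illustrative assumptions, not the flight code (which was written in Ada). The first function models a conversion that traps on overflow; the second shows one possible graceful alternative that clamps and flags the result.

```python
import math

INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_trapping(value):
    """Convert to a 16-bit signed integer; raise if the value does not fit."""
    result = int(value)  # truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return result


def to_int16_saturating(value):
    """Convert to a 16-bit signed integer, clamping out-of-range input.

    Returns (result, clamped) so the caller can log or react to the event
    instead of losing the whole computation.
    """
    if math.isnan(value):
        return 0, True
    if value >= INT16_MAX + 1:
        return INT16_MAX, True
    if value <= INT16_MIN - 1:
        return INT16_MIN, True
    return int(value), False


# Hypothetical values: one in range, one far beyond the 16-bit limit.
in_range = to_int16_saturating(1234.9)
too_large = to_int16_saturating(40000.0)
```

Clamping is only one design choice. Whether a saturated value is safe to use depends on what the downstream control law does with it, which is why the slide's point about end-to-end system testing still applies.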
Therac 25
• Medical linear accelerator for treating tumors
• Overdosed six patients in the mid-1980s
• Problems
  – Quick editing by operator caused race condition
  – Cryptic error messages ignored
  – No explanation of error codes in the user's manual
  – 50 times full dose but displayed “no dose given”
  – No mechanical interlocks
  – No software reviews or audits, little documentation
Therac 25 Lessons
• Need general plan for system development
• The operator interface must be clear, intuitive, and explained
• Hardware safeguards must limit software faults
• Good design, not testing, makes a safe system
• See Appendix A – Medical Devices: The Therac-25, from Nancy Leveson, Safeware: System Safety and Computers, Addison-Wesley, 1995.
Chernobyl
(Courtesy of the U.S. Department of Energy, http://insp.pnl.gov, photographs UK-CH-002, UK-CH-003, UK-CH-015, UK-CH-100)
Chernobyl recounted
• Chernobyl reactor 4 exploded, April 1986
• Released clouds of radioactive material for 10 days
• 100 times the exposure from the Hiroshima bomb
• Background
  – Graphite block reactors unstable at low power
  – Safety rules require power above 20% of capacity at all times
Chernobyl events
  – Experiment called for by engineers in Moscow
  – Manual shutdown, automatic control turned off
  – Power dropped to 1% capacity
  – Removed more control rods
  – Power crept up to 7%
  – Turned on more water to produce more steam
  – Water cooled reactor, dropping steam and reactivity
  – Removed even more control rods
  – Steam production rose until 1:22 a.m. when operators shut off water flow
  – Heat built up quickly, control rod sleeves bent
  – Could not insert control rods
  – Steam explosion
Chernobyl Lessons
• Theoretical knowledge vs. hands-on
• Humans “over-steer” dynamic systems
• Humans don’t handle interacting, nonlinear problems well
• “Groupthink”
• Understand human nature
  – Clarity of function
  – Reduce confounding problems
  – Accommodate in system design
Apple Lisa
(Part of the computer collection of Giorgio Ungarelli, photograph used with permission.)
Apple Lisa Legacy
• Brilliant concept before its time
  – Mouse
  – Graphical file management
• People not ready for paradigm shift
Apple Lisa Lessons
• Prohibitive price for unappreciated capability
• Cost-effective solutions rely on users’ understanding
• Failure falls into business/political arena – difficult to predict and avoid
Navy Terrier/LEAP
Terrier LEAP outline
• Concept for ballistic missile intercept
• Use current (early-mid 1990s) technology
• Prepare and test quickly
• Target launched from Wallops Island
• Interceptor launched from cruiser in Atlantic
• Basic human error foiled success
LEAP Target
(Photograph courtesy of Raytheon, Inc.)
LEAP General Operation
• High-resolution radars at Wallops Island track target (shipboard radars insufficient)
• Wallops Island processor collected data from the radars, filtered the target track with a six-state Kalman filter, and transmitted the track to the ship.
• Sent target tracks to ship via redundant telephone landlines and Inmarsat satellite links
• Ship processor received the data, predicted the intercept time and point, and indicated when to launch the interceptor missile.
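A rough Python sketch of what a six-state track filter can look like. The slide says only "six-state Kalman filter"; everything else here is an assumption: that the six states are 3-D position and velocity, that a constant-velocity model with white-noise acceleration is adequate, and that the axes are independent, so the filter splits into three two-state filters. A real ballistic target accelerates under gravity, so this is an illustration of the structure and not the LEAP design. All class names, noise values, and track numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AxisFilter:
    """Two-state (position, velocity) Kalman filter for one axis."""
    pos: float
    vel: float
    # Covariance P = [[p_pp, p_pv], [p_pv, p_vv]], started large (little prior knowledge).
    p_pp: float = 1.0e6
    p_pv: float = 0.0
    p_vv: float = 1.0e4

    def predict(self, dt, accel_var):
        # State: x = F x with F = [[1, dt], [0, 1]].
        self.pos += self.vel * dt
        # Covariance: P = F P F^T + Q, Q from white-noise acceleration.
        p_pp = self.p_pp + 2.0 * dt * self.p_pv + dt * dt * self.p_vv
        p_pv = self.p_pv + dt * self.p_vv
        self.p_pp = p_pp + accel_var * dt**4 / 4.0
        self.p_pv = p_pv + accel_var * dt**3 / 2.0
        self.p_vv = self.p_vv + accel_var * dt**2

    def update(self, measured_pos, meas_var):
        # Measurement is position only: H = [1, 0].
        innovation = measured_pos - self.pos
        s = self.p_pp + meas_var
        k_pos = self.p_pp / s
        k_vel = self.p_pv / s
        self.pos += k_pos * innovation
        self.vel += k_vel * innovation
        # P = (I - K H) P
        p_pp, p_pv, p_vv = self.p_pp, self.p_pv, self.p_vv
        self.p_pp = (1.0 - k_pos) * p_pp
        self.p_pv = (1.0 - k_pos) * p_pv
        self.p_vv = p_vv - k_vel * p_pv


class TrackFilter:
    """Three independent axis filters: six states in total."""

    def __init__(self, first_fix):
        self.axes = [AxisFilter(pos=p, vel=0.0) for p in first_fix]

    def step(self, fix, dt, meas_var=1.0, accel_var=0.01):
        for axis, measured in zip(self.axes, fix):
            axis.predict(dt, accel_var)
            axis.update(measured, meas_var)

    def extrapolate(self, t_ahead):
        """Straight-line prediction of position t_ahead seconds from now."""
        return [axis.pos + axis.vel * t_ahead for axis in self.axes]


def truth(t):
    # Hypothetical noise-free straight-line track, for illustration only.
    return (100.0 + 50.0 * t, -20.0 * t, 1000.0 + 5.0 * t)


track = TrackFilter(truth(0))
for k in range(1, 21):
    track.step(truth(k), dt=1.0)
predicted = track.extrapolate(10.0)
```

The `extrapolate` call stands in for the shipboard step of predicting where the target will be. An actual intercept solution would also need the interceptor's fly-out model, which the slides do not describe.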
LEAP Missile & Intercept
(Photograph courtesy of Raytheon, Inc.)
LEAP Testing Finds Problems
• End-to-end tests of the system
  – simulated a target launch,
  – transmitted the simulated data through the entire system to the ship,
  – calculated an intercept as if we were at sea.
• Redundant landlines
  – switch maintenance in New Jersey cut off early test
• Separate landlines
  – one through New Jersey
  – other through Pennsylvania
Richmond K. Turner, CG-20
(Photograph courtesy of the Johns Hopkins University Applied Physics Laboratory.)
Testing Finds Problems (cont’d.)
• Two shipboard radars caused problems
  – SPS-49 jammed the Inmarsat receivers
  – SPS-20 jammed the GPS receivers
• Inmarsat situated on port and starboard bridge to reduce superstructure blockage
• Too many dropouts with commercial modems, switched to cell phone modems
LEAP Targeting Processor and laboratory test set
(Photographs courtesy of the Johns Hopkins University Applied Physics Laboratory.)
LEAP: Lessons Learned
• Technical failure
• Simple, human error can interrupt the best designs
• Careful development and thorough testing necessary
• All components must be tested within the system to uncover interactions
Aegis LEAP
• A success story
• Three successful intercepts in 2002, more in 2003
• Carefully planned development
Aegis LEAP Flight Profile
(Figure courtesy of the Johns Hopkins University Applied Physics Laboratory.)
Aegis LEAP Missile
(Photograph courtesy of the Johns Hopkins University Applied Physics Laboratory.)
Kinetic Kill Vehicle and Target Image
(Figure and photograph courtesy of the Johns Hopkins University Applied Physics Laboratory.)
Aegis LEAP Launch
(Photographs courtesy of the Johns Hopkins University Applied Physics Laboratory.)
Thorough Ground Test Program
• Separation tests – squibs, batteries, explosive bolts
• KW hover test for the closed loop pointing
• Air bearing tests of maneuvers: pitch-to-ditch, IR seeker calibration, and pointing before separation
• Hardware-in-the-loop simulation and test of avionics
• KW tests for the IR seeker characterization, stabilization, third stage interfaces
• Vacuum tests – PCB delamination, arcing, and outgassing
• Aerothermal testing in a hypersonic wind tunnel for nosecone heating and outgassing, seeker shield function, strake heating and insulation
Types of Failure
Examples: Product Recalls
• [. . .] recalled 45,000 heaters for defective thermostats that were improperly positioned, which could lead to overheating.
• [. . .] recalled 3.1 million dishwashers. The slide switch (the lever that selects between heat drying and energy saving) can melt and ignite over time, posing a fire hazard.
• [. . .] recalled 5,500 toy flashlights because the batteries may overheat or leak and children can suffer burns from the leaking battery.
• [. . .] recalled upright vacuum cleaners because the power cord may break inside the handle, posing electrical shock and burn injury hazards.
• http://www.matthewslawfirm.com
Examples: Automotive Recalls
• March 12, 2002 [. . .] recalled the [. . .] trailer hitch – circuitry in the converter is inadequate to properly manage voltage spikes that can lead to an electrical short or open circuit within the converter, causing a failure and an inoperative trailer light.
• September 11, 2000 [. . .] recalled about 270,000 [cars] – air bags that may deploy unexpectedly because of corrosion in the inflator.
• During 2000 [. . .] recalled ignition modules that could cause a car to stall. When the temperature of the ignition module rises above a certain level, the chance of the module cutting out also increases.
• http://www.crash-worthiness.com
Examples: More Automotive Recalls
• [. . .] recalled 263,000 1995-97 [vehicles] . . . The airbag electronic control module (AECM) could corrode from water or road salt and then accidentally fire the driver side airbag.
• [. . .] recalled 757,000 1992-97 [vehicles] because a higher than specified electrical load through the accessory power feed circuit may cause a short circuit and allow current to flow through ground wiring. This could cause overheating and an electrical fire.
• [. . .] recalled 1995-97 [vehicles] because an improperly routed wire harness for the air-conditioner may permit wires to rub together and short circuit, resulting in a blown fuse, dead battery, or fire.
• http://www.matthewslawfirm.com
Examples: More Automotive Recalls
• December 11, 1998 [. . .] recalled 226 [electric vehicles] to reprogram the logic in the motor electronic control unit (ECU), which can mistakenly detect a failure of an electrical current sensor at speeds above 50 mph. This can cause a sudden loss of power and unexpected deceleration.
• http://autorepair.about.com/library/recalls/
Elements of Unintended Consequences in Previous Examples
• Passage of time – usually fielded units
• Nonobvious or obscure causes
• Environmental interactions, e.g. corrosion, overheating
• Failure modes with significant effects, e.g. fire or injury
The Nature of Problems
• Confounding complexity
  – unforeseen circumstances
  – multiple causes
• Human error
  – nonobviousness to user
  – improper use
  – design oversight, even if it appears to be a manufacturing problem
Example: Complexity or Oversight?
• September 2003, Hurricane Isabel
• Power outages – trees down on power lines.
• NIST experienced 180 VAC for 20 minutes, which destroyed thousands of fluorescent lamp ballasts
• Protective mechanisms for AC power were controlled over telephone lines.
• Guess what was also knocked down by windblown trees?
Causes and Factors
• Dishonest portrayal of capabilities
  – expertise, schedule estimation, unreasonable professional relationships, e.g. management/engineering
• Inadequate schedule for review and testing
• Reinventing the wheel – building your own custom design
• Creeping featurism – the continual addition of new capabilities
• Perception is reality
Remedies
• Truth in advertising
  – expertise, schedule estimation, management style/employee responses
• Work hard to develop reasonable schedules
  – review and testing
  – plan for contingencies
• Continuous learning
  – lessons learned, your own experience
  – others’ experiences
• Reduce complexity
  – understand and define interactions
  – do not “reinvent the wheel”
  – limit features
• Teamwork
Integrity
• The “Big Picture”
• Truth in advertising (your capability and skills)
• Estimation and scheduling
• Plan for the long term
  – your success and reputation
  – your product’s viability
  – your company’s reputation
Failure and How to Handle It
• Types of failure (progression toward less control)
  – technical
  – professional
  – political/societal
• Embrace failure
  – admit and accept responsibility
  – understand and learn
  – put past behind you because others won’t
  – forgive others’ failures; help them to rebound
Personal Examples
Technical Failure
• Ultraviolet satellite camera with image intensifier
• Automatic gain control for image intensifier
• Nonlinear control problem
• First version – blooming/collapsing picture
• Second version – unreliable transmission of gain value
Technical Failure – 1st Version
[Figure: block diagram of the first AGC version. Labeled elements: image intensifier, camera, video signal, hi-threshold comparator, up-down counter (Up, Dn, reset), DAC, frame sync, pixel clock.]
(© 2002, Figure courtesy of the Johns Hopkins University Applied Physics Laboratory.)
Technical Failure – 1st Version
• Problem: blooming/collapsing picture
• Background:
  – Discrete logic, up-down counters
  – Unstable for bright objects
  – Not fully simulated or analyzed
  – Short development time (flew breadboards)
• Should’a: analyzed/simulated expected scenes during design
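The kind of scene simulation this slide wishes had been done can be sketched in a few lines of Python. This is a toy model and not the flown circuit: it assumes, hypothetically, that the counter steps down once for every pixel over the high threshold and steps up once per frame when no pixel is over it. The threshold, scene radiances, and counter width are all made-up numbers.

```python
def simulate_agc(scene, frames, threshold=200.0, full_scale=255.0, start_gain=128):
    """Return the counter (gain) value after each frame.

    scene: relative radiance of each pixel. Video level = radiance * gain,
    clipped at full_scale. All numbers are hypothetical.
    """
    gain = start_gain
    history = []
    for _ in range(frames):
        video = [min(full_scale, radiance * gain) for radiance in scene]
        bright = sum(1 for level in video if level > threshold)
        if bright:
            gain -= bright  # assumed: one down-count per over-threshold pixel
        else:
            gain += 1       # assumed: one up-count per frame with no bright pixel
        gain = max(0, min(255, gain))  # 8-bit counter feeding the DAC
        history.append(gain)
    return history


def swing(history):
    """Peak-to-peak variation of the gain over the given frames."""
    return max(history) - min(history)


# Hypothetical 1000-pixel scenes: a large bright object and a single bright pixel.
big_object = [2.0] * 50 + [0.1] * 950
small_object = [2.0] * 1 + [0.1] * 999

big_swing = swing(simulate_agc(big_object, 200)[-100:])
small_swing = swing(simulate_agc(small_object, 200)[-100:])
```

Under these assumptions the large object drives a repeating collapse-and-recover cycle of tens of counts, while the single bright pixel settles into a one-count dither. Running a handful of expected scenes through even a crude model like this is the sort of check the "Should'a" bullet points to.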
Technical Failure – 2nd Version
(© 1996, Oxford University Press, used with permission.)
Technical Failure – 2nd Version
• Problem: unreliable transmission of gain value
• Background:
  – Microcontroller implementation of AGC
  – AGC stable for all scenes
  – Readout of gain by ground equipment unreliable
  – Analog encoding of gain into video frame
• Should’a:
  – Used digital encoding into video frame for noise margin
  – Gained a better understanding of the noise environment
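A small Python sketch of why digital encoding buys noise margin. It assumes, for illustration only, an 8-bit gain value written into eight pixels at black or white level and read back by thresholding at mid-scale; the analog alternative puts the gain directly into one pixel level. Pixel levels, the noise pattern, and function names are hypothetical and do not describe the actual camera format.

```python
def encode_gain_digital(gain, low=0, high=255):
    """Write an 8-bit gain as eight pixels, most significant bit first."""
    return [high if (gain >> bit) & 1 else low for bit in range(7, -1, -1)]


def decode_gain_digital(pixels, threshold=128):
    """Recover the gain by thresholding each pixel at mid-scale."""
    gain = 0
    for level in pixels:
        gain = (gain << 1) | (1 if level >= threshold else 0)
    return gain


def add_noise(pixels, noise):
    """Add per-pixel noise and clip to the 0..255 video range."""
    return [max(0, min(255, p + n)) for p, n in zip(pixels, noise)]


gain = 173  # hypothetical gain value, 0b10101101

# Digital: each pixel tolerates up to about half of full scale before a bit flips.
noise = [-60, 60, -45, 30, -60, 10, 55, -20]
digital_readout = decode_gain_digital(add_noise(encode_gain_digital(gain), noise))

# Analog: the gain is the pixel level itself, so noise shifts the value directly.
analog_readout = add_noise([gain], [-20])[0]
```

The trade is bandwidth for margin: eight pixels instead of one. A real design would likely add a checksum or repeat the field, since noise larger than the threshold margin still flips bits.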
Professional Failure
• Asked to finish programming effort while original designer moved on to other projects
• False starts and procrastination
• Finally removed myself from project
Professional Failure
• Problem: did not complete assignment
• Background:
  – Mounds of documentation to plow through
  – Early realization of no-win situation
    • Lost motivation
    • No real recognition of work obvious to me
• Should’a:
  – Either not taken the job in the first place
  – Or if no choice, plowed through assignment while finding another job (setting a precedent)
Professional/Business Failure
• Business deal
• My personal performance
  – Technical excellence
  – Professional excellence
  – Maintained integrity
• Accused of bad stuff, which I did not do
• Deal fell through
Professional/Business Failure
• Problem: business politics outside my control
• Background:
  – Interesting proposition and product
  – Long-term relationships
  – Unknown quantities introduced early in deal
  – Weirdnesses grew
• Should’a:
  – Either not made the deal in the first place
  – Or left earlier before weirdness got out of hand
• Note: always deal with integrity or don’t deal
Political Failure
• Satellite subsystem
• Team’s performance
  – Technical excellence
  – Professional excellence
• NASA sponsor pulled project in-house
Political Failure
• Problem: politics outside my company’s control
• Background:
  – 6-month long set of trade studies to define architecture
  – Thorough studies and review
  – Schedule well understood, team prepared to build system
  – Groups at NASA out of work
  – NASA pulled project in-house to feed their own
• Should’a:
  – None, politics happen
A Success Story
The Sidewinder Missile – A Success Story
(Courtesy of the U.S. Navy. All U.S. Navy photos are public domain. http://library.thinkquest.org/jo113065/citations.htm)
Sidewinder recounted
• Goal: simple, sturdy, cheap missile
• Small development team, 1949 – 1953
• Simple, clever combination of ideas
  – Rollerons: simple but important control
  – Proportional navigation simplified circuitry
  – Torque-balance servo for maneuvering
  – Canard control fins reduced wiring and connectors
  – Simple data acquisition equipment
• Extensive testing and prototyping
Sidewinder Lessons
• Breakthroughs require vision
• Small teams facilitate commitment and communications
• Simple and robust design
• Careful, thorough, and extensive testing and integration