Class 304: Fantastic Failures

Embedded Systems Conference, Wednesday, 31 March 2004
By Kim R. Fowler

Historical Case Studies

The De Havilland Comet

(courtesy of Marc Schaeffer, find this photo at the following website: www.geocities.com/CapeCanaveral/Lab/8803/)

Comet recounted
• First jet airliner – speed and comfort
• 3 crashes between May 1953 and April 1954
• Extensive testing
• Catastrophic cracking from metal fatigue
• Fixes – rounded corners, reinforcing plates
• New understanding of metal fatigue

Ariane 5

(Photographic source is ESA/CNES. You can find these photos at the following website: www.mssl.ucl.ac.uk/www_plasma/missions/cluster/about_cluster/cluster1/cluster1_images.html)

Ariane 5 recounted
• Dual-redundant processors
• 3 unprotected variables that overflowed
• Processors reset on overflow, no graceful recovery
• Used in Ariane 4, no check of flight dynamics
• Ariane 5 had greater horizontal drift velocities
• Reuse is tricky, end-to-end system test necessary
• Find report at: www.esa.int/export/esaLA/Pr_33_1996_p_EN.html
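The difference between an unprotected conversion and one that recovers gracefully can be sketched in a few lines. This is only an illustration, not the flight code: the 16-bit width, the function names, and the choice of saturation as the recovery strategy are all assumptions made for the example.

```python
INT16_MIN, INT16_MAX = -32768, 32767  # assumed 16-bit signed target, for illustration


def to_int16_unprotected(value: float) -> int:
    """Convert with no recovery path: an out-of-range value raises.

    In a system where an unhandled exception resets the processor,
    this is the failure mode described on the slide.
    """
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in 16 bits")
    return result


def to_int16_saturating(value: float) -> tuple:
    """Convert with a guard: clamp to the range and report that clamping occurred.

    The caller receives (converted_value, was_clamped) and can degrade
    gracefully instead of shutting down.
    """
    result = int(value)
    if result > INT16_MAX:
        return INT16_MAX, True
    if result < INT16_MIN:
        return INT16_MIN, True
    return result, False
```

A guard like this does not replace the end-to-end test: it only limits the damage when a reused component meets inputs that its original system never produced.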

Therac 25
• Medical linear accelerator for treating tumors
• Overdosed six patients in the mid-1980s
• Problems
  – Quick editing by operator caused race condition
  – Cryptic error messages ignored
  – No explanation of error codes in the User's Manual
  – 50 times full dose but displayed “no dose given”
  – No mechanical interlocks
  – No software reviews or audits, little documentation

Therac 25 Lessons
• Need general plan for system development
• The operator interface must be clear, intuitive, and explained
• Hardware safeguards must limit software faults
• Good design, not testing, makes a safe system
• See Appendix A – Medical Devices: The Therac-25, from Nancy Leveson, Safeware: System Safety and Computers, Addison-Wesley, 1995.

Chernobyl

(Courtesy of the U.S. Department of Energy, http://insp.pnl.gov, photographs UK-CH-002, UK-CH-003, UK-CH-015, UK-CH-100)

Chernobyl recounted
• Chernobyl reactor 4 exploded, April 1986
• Released clouds of radioactive material for 10 days
• 100 times the exposure of the Hiroshima bomb
• Background
  – Graphite block reactors unstable at low reactivity
  – Safety rules require power > 20% capacity at all times

Chernobyl events
– Experiment called for by engineers in Moscow
– Manual shutdown, automatic control turned off
– Power dropped to 1% capacity
– Removed more control rods
– Power crept up to 7%
– Turned on more water to produce more steam
– Water cooled reactor, dropping steam and reactivity
– Removed even more control rods
– Steam production rose until 1:22 a.m. when operators shut off water flow
– Heat built up quickly, control rod sleeves bent
– Could not insert control rods
– Steam explosion

Chernobyl Lessons
• Theoretical knowledge vs. hands-on
• Humans “over-steer” dynamic systems
• Humans don’t handle interacting, nonlinear problems well
• “Groupthink”
• Understand human nature
  – Clarity of function
  – Reduce confounding problems
  – Accommodate in system design

Apple Lisa

(Part of the computer collection of Giorgio Ungarelli, photograph used with permission.)

Apple Lisa Legacy
• Brilliant concept before its time
  – Mouse
  – Graphical file management
• People not ready for paradigm shift

Apple Lisa Lessons
• Prohibitive price for unappreciated capability
• Cost-effective solutions rely on users’ understanding
• Failure falls into business/political arena – difficult to predict and avoid

Navy Terrier/LEAP

Terrier LEAP outline
• Concept for ballistic missile intercept
• Use current (early-mid 1990s) technology
• Prepare and test quickly
• Target launched from Wallops Island
• Interceptor launched from cruiser in Atlantic
• Basic human error foiled success

LEAP Target

(Photograph courtesy of Raytheon, Inc.)

LEAP General Operation
• High-resolution radars at Wallops Island track target (shipboard radars insufficient)
• Wallops Island processor collected data from the radars, filtered the target track with a six-state Kalman filter, and transmitted the track to the ship
• Sent target tracks to ship via redundant telephone landlines and Inmarsat satellite links
• Ship processor received the data, predicted the intercept time and point, and indicated when to launch the interceptor missile
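To make the track-filtering step concrete, here is a minimal sketch of a six-state tracker. It assumes the six states are position and velocity along three axes, that the target follows a constant-velocity model between radar updates, and that the axes are independent so the filter splits into three two-state filters. None of these details, nor the names or noise values, come from the LEAP system itself.

```python
from dataclasses import dataclass


@dataclass
class AxisFilter:
    """Position/velocity Kalman filter for one axis (two of the six states)."""
    pos: float = 0.0
    vel: float = 0.0
    p00: float = 1e6  # variance of position (large: nothing known at start)
    p01: float = 0.0  # covariance of position and velocity
    p11: float = 1e6  # variance of velocity

    def predict(self, dt: float, q: float) -> None:
        """Propagate the state dt seconds ahead; q is added velocity variance."""
        self.pos += self.vel * dt
        p00 = self.p00 + 2.0 * dt * self.p01 + dt * dt * self.p11
        p01 = self.p01 + dt * self.p11
        self.p00, self.p01, self.p11 = p00, p01, self.p11 + q

    def update(self, measured_pos: float, r: float) -> None:
        """Blend in a position measurement whose noise variance is r."""
        s = self.p00 + r          # innovation variance
        k0 = self.p00 / s         # gain applied to position
        k1 = self.p01 / s         # gain applied to velocity
        residual = measured_pos - self.pos
        self.pos += k0 * residual
        self.vel += k1 * residual
        p00 = (1.0 - k0) * self.p00
        p01 = (1.0 - k0) * self.p01
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p11 = p00, p01, p11


class TargetTracker:
    """Three independent axis filters: six states in total."""

    def __init__(self) -> None:
        self.axes = [AxisFilter(), AxisFilter(), AxisFilter()]

    def step(self, dt, measurement, q=1.0, r=100.0):
        """Process one radar position report (x, y, z) and return the state."""
        for axis, z in zip(self.axes, measurement):
            axis.predict(dt, q)
            axis.update(z, r)
        return self.state()

    def state(self):
        """Return [x, y, z, vx, vy, vz]."""
        return [a.pos for a in self.axes] + [a.vel for a in self.axes]
```

Calling `step` once per radar report yields a smoothed position and a velocity estimate, which is the kind of track a downstream processor would need in order to extrapolate toward an intercept point.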

LEAP Missile & Intercept

(Photograph courtesy of Raytheon, Inc.)

LEAP Testing Finds Problems
• End-to-end tests of the system
  – simulated a target launch,
  – transmitted the simulated data through the entire system to the ship,
  – calculated an intercept as if we were at sea.
• Redundant landlines – switch maintenance in New Jersey cut off early test
• Separate landlines
  – one through New Jersey
  – other through Pennsylvania

Richmond K. Turner, CG-20

(Photograph courtesy of the Johns Hopkins University Applied Physics Laboratory.)

Testing Finds Problems (cont’d.)
• Two shipboard radars caused problems
  – SPS-49 jammed the Inmarsat receivers
  – SPS-20 jammed the GPS receivers
• Inmarsat situated on port and starboard bridge to reduce superstructure blockage
• Too many dropouts with commercial modems, switched to cell phone modems

LEAP Targeting Processor and laboratory test set

(Photographs courtesy of the Johns Hopkins University Applied Physics Laboratory.)

LEAP: Lessons Learned
• Technical failure
• Simple, human error can interrupt the best designs
• Careful development and thorough testing necessary
• All components must be tested within the system to uncover interactions

Aegis LEAP
• A success story
• Three successful intercepts in 2002, more in 2003
• Carefully planned development

Aegis LEAP Flight Profile

(Figure courtesy of the Johns Hopkins University Applied Physics Laboratory.)

Aegis LEAP Missile

(Photograph courtesy of the Johns Hopkins University Applied Physics Laboratory.)

Kinetic Kill Vehicle and Target Image

(Figure and photograph courtesy of the Johns Hopkins University Applied Physics Laboratory.)

Aegis LEAP Launch

(Photographs courtesy of the Johns Hopkins University Applied Physics Laboratory.)

Thorough Ground Test Program
• Separation tests – squibs, batteries, explosive bolts
• KW hover test for the closed-loop pointing
• Air bearing tests of maneuvers: pitch-to-ditch, IR seeker calibration, and pointing before separation
• Hardware-in-the-loop simulation and test of avionics
• KW tests for the IR seeker characterization, stabilization, third stage interfaces
• Vacuum tests – PCB delamination, arcing, and outgassing
• Aerothermal testing in a hypersonic wind tunnel for nosecone heating and outgassing, seeker shield function, strake heating and insulation

Types of Failure

Examples: Product Recalls
• [. . .] recalled 45,000 heaters for defective thermostats that were improperly positioned, which could lead to overheating.
• [. . .] recalled 3.1 million dishwashers. The slide switch (the lever that selects between heat drying and energy saving) can melt and ignite over time, posing a fire hazard.
• [. . .] recalled 5,500 toy flashlights because the batteries may overheat or leak and children can suffer burns from the leaking battery.
• [. . .] recalled upright vacuum cleaners because the power cord may break inside the handle, posing electrical shock and burn injury hazards.
• http://www.matthewslawfirm.com

Examples: Automotive Recalls
• March 12, 2002 [. . .] recalled the [. . .] trailer hitch – circuitry in the converter is inadequate to properly manage voltage spikes that can lead to an electrical short or open circuit within the converter, causing a failure and an inoperative trailer light.
• September 11, 2000 [. . .] recalled about 270,000 [cars] – air bags that may deploy unexpectedly because of corrosion in the inflator.
• During 2000 [. . .] recalled ignition modules that could cause a car to stall. When the ignition module rises above a certain temperature, the chance of the module cutting out increases.
• http://www.crash-worthiness.com

Examples: More Automotive Recalls
• [. . .] recalled 263,000 1995-97 [vehicles] . . . The airbag electronic control module (AECM) could corrode from water or road salt and then accidentally fire the driver side airbag.
• [. . .] recalled 757,000 1992-97 [vehicles] because higher than specified electrical load through accessory power feed circuit may cause a short circuit and allow current to flow through ground wiring. This could cause overheating and an electrical fire.
• [. . .] recalled 1995-97 [vehicles] because improperly routed wire harness for the air-conditioner may permit wires to rub together and short circuit, resulting in a blown fuse, dead battery, or fire.
• http://www.matthewslawfirm.com

Examples: More Automotive Recalls
• December 11, 1998 [. . .] recalled 226 [electric vehicles] to reprogram the logic in the motor electronic control unit (ECU), which can mistakenly detect a failure of an electrical current sensor at speeds above 50 mph. This can cause a sudden loss of power and unexpected deceleration.
• http://autorepair.about.com/library/recalls/

Elements of Unintended Consequences in Previous Examples
• Passage of time – usually fielded units
• Nonobvious or obscure causes
• Environmental interactions, e.g. corrosion, overheating
• Failure modes with significant effects, e.g. fire or injury

The Nature of Problems
• Confounding complexity
  – unforeseen circumstances
  – multiple causes
• Human error
  – nonobviousness to user
  – improper use
  – design oversight – even if it appears to be a manufacturing problem

Example: Complexity or Oversight?
• September 2003, Hurricane Isabel
• Power outages – trees down on power lines
• NIST experienced 180 VAC for 20 minutes that destroyed thousands of fluorescent lamp ballasts
• Protective mechanisms for AC power were controlled over telephone lines
• Guess what was also knocked down by windblown trees?

Causes and Factors
• Dishonest portrayal of capabilities – expertise, schedule estimation, unreasonable professional relationships, i.e. management/engineering
• Inadequate schedule for review and testing
• Reinventing the wheel – building your own custom design
• Creeping featurism – the continual addition of new capabilities
• Perception is reality

Remedies
• Truth in advertising – expertise, schedule estimation, management style/employee responses
• Work hard to develop reasonable schedules
  – review and testing
  – plan for contingencies
• Continuous learning
  – lessons learned, your own experience
  – others’ experiences
• Reduce complexity
  – understand and define interactions
  – do not “reinvent the wheel”
  – limit features
• Teamwork

Integrity
• The “Big Picture”
• Truth in advertising (your capability and skills)
• Estimation and scheduling
• Plan for the long term
  – your success and reputation
  – your product’s viability
  – your company’s reputation

Failure and How to Handle It
• Types of failure (progression toward less control)
  – technical
  – professional
  – political/societal
• Embrace failure
  – admit and accept responsibility
  – understand and learn
  – put past behind you because others won’t
  – forgive others’ failures; help them to rebound

Personal Examples

Technical Failure
• Ultraviolet satellite camera with image intensifier
• Automatic gain control for image intensifier
• Nonlinear control problem
• First version – blooming/collapsing picture
• Second version – unreliable transmission of gain value

Technical Failure – 1st Version

[Block diagram. Labels: image intensifier, camera, video signal, frame sync, reset, hi-threshold comparator, up-down counter (Up, Dn), DAC, pixel clock.]

(© 2002, Figure courtesy of the Johns Hopkins University Applied Physics Laboratory.)

Technical Failure – 1st Version
• Problem: blooming/collapsing picture
• Background:
  – Discrete logic, up-down counters
  – Unstable for bright objects
  – Not fully simulated or analyzed
  – Short development time (flew breadboards)
• Should’a: analyzed/simulated expected scenes during design
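The kind of simulation this slide wishes for can be very small. The sketch below is a hypothetical model, not the flown circuit: it assumes the counter steps down once for every pixel over the threshold, steps up by a fixed amount on a frame with no hot pixels, and that the intensifier and camera act as a simple clipped multiplication. Every constant and the function name are invented for the illustration.

```python
def simulate_agc(scene, frames=12, threshold=200, start=128, up_step=16):
    """Return the gain code after each frame for a counter-based AGC model.

    scene is a list of pixel brightness values; the gain code is an
    8-bit value that would feed the DAC driving the intensifier.
    """
    code = start
    history = []
    for _ in range(frames):
        # Assumed intensifier/camera response: multiply, then clip at 255.
        video = [min(255, pixel * code // 64) for pixel in scene]
        hot = sum(1 for level in video if level > threshold)
        if hot:
            code -= hot       # one down-count per pixel over the threshold
        else:
            code += up_step   # nothing over the threshold: raise the gain
        code = max(0, min(255, code))
        history.append(code)
    return history


dim_scene = [40] * 100                    # uniformly dim background
bright_object = [40] * 60 + [220] * 40    # same background plus a bright object
```

Under these assumptions `simulate_agc(dim_scene)` climbs steadily and rests at the maximum code, while `simulate_agc(bright_object)` never settles: the many hot pixels drive the code far down in one frame, the picture goes dark, and the counter climbs back until it overshoots again. Running expected scenes through even a crude model like this exposes the hunting before hardware is built.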

Technical Failure – 2nd Version

(© 1996, Oxford University Press, used with permission.)

Technical Failure – 2nd Version
• Problem: unreliable transmission of gain value
• Background:
  – Microcontroller implementation of AGC
  – AGC stable for all scenes
  – Readout of gain by ground equipment unreliable
  – Analog encoding of gain into video frame
• Should’a:
  – Used digital encoding into video frame for noise margin
  – Better understood the noise environment
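The noise-margin argument for digital encoding can be shown with a short sketch. It assumes an 8-bit gain value written most significant bit first, one pixel per bit, with two widely separated pixel levels and a midpoint decision threshold. The levels, the bit order, and the function names are hypothetical and were chosen only for the example.

```python
LOW, HIGH = 32, 224   # assumed pixel levels representing bit 0 and bit 1
MID = 128             # decision threshold halfway between the two levels


def encode_gain(gain):
    """Return eight pixel levels carrying the 8-bit gain, MSB first."""
    if not 0 <= gain <= 255:
        raise ValueError("gain must fit in 8 bits")
    return [HIGH if (gain >> bit) & 1 else LOW for bit in range(7, -1, -1)]


def decode_gain(pixels):
    """Recover the gain by thresholding each received pixel level."""
    gain = 0
    for level in pixels:
        gain = (gain << 1) | (1 if level >= MID else 0)
    return gain
```

With these levels each bit survives an error of up to 95 counts in either direction, whereas an analog encoding that stores the gain directly as one pixel level is corrupted by an error of a single count.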

Professional Failure
• Asked to finish programming effort while original designer moved on to other projects
• False starts and procrastination
• Finally removed myself from project

Professional Failure
• Problem: did not complete assignment
• Background:
  – Mounds of documentation to plow through
  – Early realization of no-win situation
    • Lost motivation
    • No real recognition of work obvious to me
• Should’a:
  – Either not taken the job in the first place
  – Or if no choice, plowed through assignment while finding another job (setting precedent)

Professional/Business Failure
• Business deal
• My personal performance
  – Technical excellence
  – Professional excellence
  – Maintained integrity
• Accused of bad stuff, which I did not do
• Deal fell through

Professional/Business Failure
• Problem: business politics outside my control
• Background:
  – Interesting proposition and product
  – Long-term relationships
  – Unknown quantities introduced early in deal
  – Weirdnesses grew
• Should’a:
  – Either not made the deal in the first place
  – Or left earlier before weirdness got out of hand
• Note: always deal with integrity or don’t deal

Political Failure
• Satellite subsystem
• Team’s performance
  – Technical excellence
  – Professional excellence
• NASA sponsor pulled project in-house

Political Failure
• Problem: politics outside my company’s control
• Background:
  – 6-month-long set of trade studies to define architecture
  – Thorough studies and review
  – Schedule well understood, team prepared to build system
  – Groups at NASA out of work
  – NASA pulled project in-house to feed their own
• Should’a:
  – None, politics happen

A Success Story

The Sidewinder Missile – A Success Story

(Courtesy of the U.S. Navy. All U.S. Navy photos are public domain. http://library.thinkquest.org/jo113065/citations.htm)

Sidewinder recounted
• Goal: simple, sturdy, cheap missile
• Small development team, 1949–1953
• Simple, clever combination of ideas
  – Rollerons: simple but important control
  – Proportional navigation simplified circuitry
  – Torque-balance servo for maneuvering
  – Canard control fins reduced wiring and connectors
  – Simple data acquisition equipment
• Extensive testing and prototyping

Sidewinder Lessons
• Breakthroughs require vision
• Small teams facilitate commitment and communications
• Simple and robust design
• Careful, thorough, and extensive testing and integration
