Availability Management
- Premanand Lotlikar 31st July, 2007
Agenda • Introduction • Objective of Availability Mgmt • Basic Concepts • Benefits • Relationship with other processes • Activities in Availability Mgmt • Process Control • Key Performance Indicators • Cost • Possible Problems
Objectives • Determining availability requirements in close collaboration with customers • Guaranteeing the level of availability established for the IT services • Monitoring the availability of the IT services • Proposing improvements in the IT infrastructure and services with a view to increasing levels of availability • Supervising compliance with the OLAs and UCs agreed with internal and external service providers
Basic Concepts
Basic Concepts • High Availability means – IT service is continuously available to the customer – Little downtime – Rapid service recovery
• Availability of service depends on – Complexity of the IT infrastructure architecture – Reliability of the components – Ability to respond quickly and effectively to faults – Quality of maintenance by support and suppliers
Basic Concepts • Reliability means – Service is available for an agreed period without interruptions
• Includes resilience • Calculated using statistics • Determined by – Reliability of the components – Ability of service/component to operate despite failure (resilience) – Preventive maintenance
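How component reliability and resilience combine can be sketched with the usual series/parallel arithmetic. This is a minimal illustration, assuming component failures are independent; the function names and the availability figures are hypothetical and not taken from the slides.

```python
from math import prod


def serial_availability(components):
    """All components are required: the chain is up only if every one is up."""
    return prod(components)


def parallel_availability(components):
    """Redundant components: the group is down only if every one is down."""
    return 1 - prod(1 - a for a in components)


# Hypothetical figures: one server (99%) behind one network link (98%).
single_chain = serial_availability([0.99, 0.98])

# Same service with a second, redundant server added for resilience.
resilient_chain = serial_availability(
    [parallel_availability([0.99, 0.99]), 0.98]
)

print(f"single server:    {single_chain:.4f}")
print(f"redundant server: {resilient_chain:.4f}")
```

Under these assumptions the redundant design is limited mainly by the single network link, which is the kind of weak point the availability design work aims to expose.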
Basic Concepts • Maintainability is needed to – Keep the services in operation – Restore services when they fail
• Includes – Taking measures to prevent faults – Detecting faults – Making a diagnosis, including self-diagnosis by the components – Resolving the fault – Restoring the service
Basic Concepts • Mean Time to Repair (MTTR) – Average time between the occurrence of a fault and service recovery
• Mean Time Between Failures (MTBF) – Average time between recovery from one incident and the occurrence of the next
• Mean Time Between System Incidents (MTBSI) – Average time between the occurrence of two consecutive incidents
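The three averages can be computed directly from an incident log. The sketch below assumes each incident is recorded as a pair (time the fault occurred, time service was restored); the log entries and function names are hypothetical. By these definitions, each interval between two consecutive incidents consists of one repair time followed by one failure-free period.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (fault occurred, service restored), in time order.
incidents = [
    (datetime(2007, 7, 1, 8, 0), datetime(2007, 7, 1, 10, 0)),
    (datetime(2007, 7, 11, 10, 0), datetime(2007, 7, 11, 11, 0)),
    (datetime(2007, 7, 21, 11, 0), datetime(2007, 7, 21, 14, 0)),
]


def mean(deltas):
    """Average of a non-empty list of timedelta values."""
    return sum(deltas, timedelta()) / len(deltas)


def mttr(log):
    """Mean Time to Repair: fault occurrence to service recovery."""
    return mean([restored - fault for fault, restored in log])


def mtbf(log):
    """Mean Time Between Failures: recovery to the next fault."""
    return mean([log[i + 1][0] - log[i][1] for i in range(len(log) - 1)])


def mtbsi(log):
    """Mean Time Between System Incidents: fault to the next fault."""
    return mean([log[i + 1][0] - log[i][0] for i in range(len(log) - 1)])


print("MTTR :", mttr(incidents))
print("MTBF :", mtbf(incidents))
print("MTBSI:", mtbsi(incidents))
```

MTBF and MTBSI need at least two incidents in the log, since both are measured between consecutive incidents.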
Benefits • Fulfillment of the agreed service levels. • Reduction in the costs associated with a given level of availability. • The customer perceives a better quality of service. • The levels of availability progressively increase. • The number of incidents is reduced.
Inputs - Outputs
Relationship with other processes • Service Level Mgmt is responsible for negotiating & managing availability • Availability is one of the most important elements in an SLA
Relationship with other processes • Configuration Mgmt has information about the infrastructure and can provide valuable information to Availability Mgmt
Relationship with other processes • Changes in capacity can often affect the availability of a service • Changes to availability will affect capacity • Capacity Mgmt and Availability Mgmt exchange info about – Scenarios for upgrading – Phasing out IT components – Availability trends that may need changes to capacity
Relationship with other processes • Problem Mgmt is directly involved in identifying and resolving the causes of actual or potential availability problems
Relationship with other processes • Incident Mgmt provides reports with information about recovery times, repair times etc. This information is used to determine the achieved availability.
Relationship with other processes • Change Mgmt informs Availability Mgmt about the Forward Schedule of Changes (FSC) • Availability Mgmt informs Change Mgmt about maintenance related to new services and elements.
Activities • Planning • Monitoring
Planning • Determining the availability requirements • Designing for availability • Designing for recoverability • Security issues • Maintenance management • Developing the Availability Plan
Determining the availability requirements • Must be undertaken before SLA is concluded • Should address both new IT services and changes to existing services • Clearly defining availability requirements early is essential to prevent confusion and differences
Determining the availability requirements • Should identify: – Key business functions – Agreed definition of IT service downtime – Quantifiable availability requirements – Quantifiable impact on the business functions of unscheduled IT service downtime – Business hours of customer – Agreements about maintenance windows
Designing for availability • Vulnerabilities affecting availability standards should be identified early • This will prevent – Excessive development costs – Unplanned expenditure at later stages – Additional cost by suppliers – Overall delays
Designing for recoverability • Uninterrupted availability is rarely feasible • Design for recoverability involves – Effective Incident Mgmt – Appropriate escalation – Communication – Backup and recovery procedures – Tasks, responsibilities and authority clearly defined
Key Security issues • Security and reliability are closely linked • High availability can be supported by effective information security • This includes: – Determining who is authorized to access secure areas – Determining which critical authorizations may be issued
Maintenance management • There will always be scheduled windows of unavailability • These periods can be used for preventive actions • Maintenance must be carried out when impact on services can be minimized
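Whether time inside a maintenance window counts as downtime is a matter for the agreed definition of downtime. As a minimal sketch, assuming the agreement excludes agreed windows and that the windows do not overlap one another, the unscheduled part of an outage could be separated like this (the window, outage times and function name are hypothetical):

```python
from datetime import datetime, timedelta

# Hypothetical agreed maintenance window: Sunday 02:00 to 06:00.
windows = [(datetime(2007, 8, 5, 2, 0), datetime(2007, 8, 5, 6, 0))]


def unscheduled_downtime(start, end, windows):
    """Portion of the outage [start, end) lying outside every window.

    Assumes the maintenance windows do not overlap one another.
    """
    covered = timedelta()
    for w_start, w_end in windows:
        overlap = min(end, w_end) - max(start, w_start)
        if overlap > timedelta():
            covered += overlap
    return (end - start) - covered


# Maintenance overran the window by two hours: only the overrun is unscheduled.
overrun = unscheduled_downtime(
    datetime(2007, 8, 5, 4, 0), datetime(2007, 8, 5, 8, 0), windows
)
print(overrun)
```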
Developing the Availability Plan • Long term plan concerning availability over the next few years • It is not the implementation plan for Availability Mgmt • The plan requires liaison with areas such as – Service Level Mgmt – IT Service Continuity Mgmt – Capacity Mgmt – Change Mgmt
Methods and Techniques • Component Failure Impact Analysis (CFIA) – Uses an Availability matrix with strategic components and their roles in each service – Horizontal Analysis – Vertical Analysis
CFIA
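A CFIA availability matrix can be sketched as a mapping from components to the services that fail with them. The example assumes components as rows and services as columns, so horizontal analysis reads along a component and vertical analysis reads down a service; the components, services and function names are hypothetical.

```python
# Hypothetical availability matrix: component -> services that go down
# when that component fails (the "X" marks in a CFIA grid).
matrix = {
    "router": {"email", "web shop", "payroll"},
    "db server": {"web shop", "payroll"},
    "app server": {"web shop"},
    "mail server": {"email"},
}


def horizontal_analysis(matrix):
    """Per component: number of services affected by its failure."""
    return {component: len(services) for component, services in matrix.items()}


def vertical_analysis(matrix):
    """Per service: number of components whose failure brings it down."""
    counts = {}
    for services in matrix.values():
        for service in services:
            counts[service] = counts.get(service, 0) + 1
    return counts


print(horizontal_analysis(matrix))
print(vertical_analysis(matrix))
```

In this sample the router stands out horizontally (it affects every service) and the web shop stands out vertically (it depends on the most components).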
Fault Tree Analysis • Used to identify chain of events leading to failure of IT service • Distinguishes following events: – Basic Event: power outages or operator error – Resulting Event: resulting from combination of earlier events – Conditional Event: events that occur only in certain conditions – Trigger Event: events that cause other events
Fault Tree Analysis
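A small fault tree can be evaluated in code to see which combinations of basic events lead to failure of the service. The sketch below assumes AND/OR combinations, as is common in fault tree practice but not stated on the slide; the events and the tree itself are hypothetical.

```python
def service_fails(events):
    """Evaluate a small fault tree for the set of basic events that occurred."""
    # Resulting event (AND): power is lost only if mains and the UPS both fail.
    power_lost = "mains outage" in events and "ups failure" in events
    # Resulting event (AND): data is lost only if both mirrored disks fail.
    data_lost = "disk A failure" in events and "disk B failure" in events
    # Top event (OR): any one branch is enough to bring the service down.
    return power_lost or data_lost or "operator error" in events


print(service_fails({"mains outage"}))                 # UPS still carries the load
print(service_fails({"mains outage", "ups failure"}))  # both power sources gone
```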
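Availability Calculations • Availability is commonly defined as a percentage as follows: Availability (%) = (AST − DT) / AST × 100, where AST is the agreed service time and DT is the downtime during that time • For example, if the service is 24/7 and over the last month the system has been down for four hours to carry out maintenance, the real availability of the system was (assuming a 30-day month, i.e. 720 hours): (720 − 4) / 720 × 100 ≈ 99.44%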
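The percentage calculation is easy to script. This sketch assumes the common definition, availability = (agreed service time − downtime) / agreed service time × 100, and a 30-day month for the 24/7 example; the function names are hypothetical. The second function turns the formula around to show how much downtime a target percentage allows.

```python
def availability_percent(agreed_service_hours, downtime_hours):
    """Availability as a percentage of the agreed service time."""
    if agreed_service_hours <= 0:
        raise ValueError("agreed service time must be positive")
    if not 0 <= downtime_hours <= agreed_service_hours:
        raise ValueError("downtime must lie within the agreed service time")
    return (agreed_service_hours - downtime_hours) / agreed_service_hours * 100


def allowed_downtime_hours(agreed_service_hours, target_percent):
    """Downtime budget implied by an availability target."""
    return agreed_service_hours * (1 - target_percent / 100)


month_hours = 24 * 30  # assumption: 24/7 service over a 30-day month

print(round(availability_percent(month_hours, 4), 2))   # 99.44
print(round(allowed_downtime_hours(month_hours, 99.9), 2))
```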
Process Control • Critical Success Factors – Business must have clearly defined availability objectives – SLM must have been set up to formalize agreements – Both parties must use the same definitions of availability and downtime
Process Control • Key Performance Indicators – Percentage availability per service – Downtime duration – Downtime frequency
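The three indicators can be derived per service from a list of outage records. A minimal sketch, assuming each record holds a service name and a downtime in hours for one reporting period with 720 agreed service hours; the records, the constant and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical outage records for one period: (service, downtime in hours).
outages = [
    ("email", 1.5),
    ("web shop", 0.5),
    ("email", 2.5),
    ("web shop", 1.0),
    ("email", 0.5),
]
AGREED_SERVICE_HOURS = 720  # assumption: 24/7 service over a 30-day month


def kpi_report(outages, agreed_hours):
    """Availability %, total downtime and outage count for each service."""
    durations = defaultdict(list)
    for service, hours in outages:
        durations[service].append(hours)

    report = {}
    for service, hours_list in durations.items():
        total = sum(hours_list)
        report[service] = {
            "availability_pct": (agreed_hours - total) / agreed_hours * 100,
            "downtime_hours": total,
            "downtime_frequency": len(hours_list),
        }
    return report


for service, kpis in kpi_report(outages, AGREED_SERVICE_HOURS).items():
    print(service, kpis)
```

Reporting duration and frequency next to the percentage matters because many short outages and one long outage can yield the same availability figure while affecting the customer very differently.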
Cost
Possible Problems • The real availability of the service is not monitored correctly. • There is no commitment to the process in the IT organization. • The appropriate software tools and personnel are not available. • The availability objectives do not match the customer's needs. • There is a lack of coordination with other processes. • Internal and external service providers do not recognize the authority of the Availability Manager as a result of a lack of support from management
Thank you!