Crs-rac Troubleshooting


CRS & RAC Troubleshooting
Krishnadev Telikicherla
Cluster & Parallel Storage Technology
Oracle Corporation


Topics:
• Defining the Issue
• Creating a Timeline
• Hang or Slowdown
• Performance Issues
• Gathering Data
• Testcases
• Rediscovery
• Engaging Oracle Support
• Examples


Defining the Issue: Layers
• What layers are involved in the issue:
  – Oracle Clusterware
    • CRS daemon
    • CSS daemon
    • HangCheckTimer [Linux] / Oprocd (not Linux)
    • EVM
    • OCR
    • Voting
  – General RDBMS
  – Operating System
  – Hardware

Defining the Issue: Cause vs. Effects
• Causes:
  – Resource issues
  – Oracle issues
  – OS issues
• Effects:
  – Hangs/Spins
  – Instance Crashes and Evictions
  – Node Reboots and Evictions
  – Oracle Errors (ORA-600, ORA-7445, ORA-29740)
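
The Oracle errors listed as effects can be picked out of an alert log mechanically. The following is a minimal sketch, assuming the log is available as plain text lines. The function name `find_oracle_errors` and the sample lines are illustrative assumptions, not output from a real system.

```python
import re
from collections import Counter

# Error codes named on the slide; extend the tuple as needed.
WATCHED = ("ORA-600", "ORA-7445", "ORA-29740")

# Oracle zero-pads error numbers to five digits in messages (ORA-00600),
# so leading zeros are skipped before the number is captured.
ORA_RE = re.compile(r"ORA-0*(\d+)")

def find_oracle_errors(lines):
    """Count occurrences of the watched ORA- errors in an iterable of lines."""
    counts = Counter()
    for line in lines:
        for num in ORA_RE.findall(line):
            code = "ORA-" + num
            if code in WATCHED:
                counts[code] += 1
    return counts

# Made-up sample lines for illustration only.
sample = [
    "ORA-00600: internal error code, arguments: [2103], [0], [0]",
    "ORA-07445: exception encountered: core dump",
    "ORA-29740: evicted by member 1, group incarnation 6",
    "ORA-01555: snapshot too old",
]
error_counts = find_oracle_errors(sample)
print(dict(error_counts))
```

Counting per error code is only a first pass; the surrounding log lines and trace file names are still needed to tell a cause from an effect.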


Defining the Issue: Description
• When describing the problem while creating the SR via Metalink, use phrases that will help identify known issues in either bugs or Metalink content.
• In the body of the SR, be as detailed as possible about the environment.
• Nobody knows the system better than you.
• Talk to the sysadmin as well about OS- and network-related issues.


Creating a Timeline
• A timeline helps identify the times to concentrate on when reviewing files.
• Support can build a timeline from the files once they are provided, but doing so slows resolution.
• Timelines should order causes and effects and include all participating nodes.
• Include specific times, for example:
  – At 3:00am PST we noticed that node2 was hanging.
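
A timeline that spans several nodes is easiest to read when the observations are merged into one chronological list. The sketch below is one way to do that, assuming every timestamp is already in the same time zone and uses the format shown. The function name `build_timeline`, the dates, and the event texts are hypothetical.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"

def build_timeline(events):
    """Sort (timestamp_string, node, description) tuples chronologically
    and return one formatted line per event."""
    parsed = [
        (datetime.strptime(ts, TS_FORMAT), node, text)
        for ts, node, text in events
    ]
    parsed.sort(key=lambda entry: entry[0])
    return [
        f"{when.strftime(TS_FORMAT)} [{node}] {text}"
        for when, node, text in parsed
    ]

# Hypothetical observations, entered in the order they were reported.
events = [
    ("2007-01-15 03:00:00", "node2", "node2 noticed hanging"),
    ("2007-01-15 02:58:30", "node1", "interconnect errors in OS log"),
    ("2007-01-15 03:04:10", "node1", "node2 evicted"),
]
timeline = build_timeline(events)
for line in timeline:
    print(line)
```

If the nodes' clocks are not synchronized, the measured offset has to be applied before sorting, otherwise cause and effect can appear reversed.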


Hang or Slowdown
• Differentiate between a database hang and a database slowdown
• Identify the extent of a hang


Is it a Hang or a Slowdown?
• Check:
  – System states, to see if there is any change over a short period of time
  – V$SESSION_WAIT where wait_time=0
  – Overall machine load, including CPU, memory, swap, and I/O
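
One way to apply the "any change over a short period" check is to take two snapshots of V$SESSION_WAIT a short interval apart and compare them. The sketch below assumes the rows have already been fetched into dictionaries whose keys mirror the SID, SEQ#, EVENT, and WAIT_TIME columns. The function name `stuck_sessions` and the sample rows are hypothetical, and the comparison is a heuristic rather than a definitive test.

```python
def stuck_sessions(snap_a, snap_b):
    """Return SIDs that are in the very same wait in both snapshots.

    Each snapshot is a list of dicts with keys sid, seq, event, wait_time.
    An unchanged sequence number and event between snapshots suggests the
    session made no progress in the interval.
    """
    before = {
        row["sid"]: (row["seq"], row["event"])
        for row in snap_a
        if row["wait_time"] == 0  # 0 means the session is currently waiting
    }
    stuck = []
    for row in snap_b:
        if row["wait_time"] != 0:
            continue
        if before.get(row["sid"]) == (row["seq"], row["event"]):
            stuck.append(row["sid"])
    return sorted(stuck)

# Hypothetical snapshots taken a short interval apart.
snap_a = [
    {"sid": 101, "seq": 4521, "event": "enq: TX - row lock contention", "wait_time": 0},
    {"sid": 102, "seq": 877, "event": "db file sequential read", "wait_time": 0},
]
snap_b = [
    {"sid": 101, "seq": 4521, "event": "enq: TX - row lock contention", "wait_time": 0},
    {"sid": 102, "seq": 912, "event": "db file sequential read", "wait_time": 0},
]
stuck = stuck_sessions(snap_a, snap_b)
print(stuck)
```

Here session 101 has not moved, while session 102 is still waiting but has progressed through other waits, which points to a slowdown rather than a hang for that session.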


Is it a Hang or a Slowdown?
• Single or multiprocess hang:
  – Usually characterized by a particular job hanging or not completing
  – Essentially the same as in single instance unless it’s internode parallel query
• Instance hang: a single instance is unusable.
• Multi-instance or full database hang: the entire database is hung or not responding.

Performance
• Single process or statement
• Instance
• Multi-Instance


Single Process or Single Statement
• Find the wait event
• 10046 level 12
  - oradebug setorapid <orapid>
  - oradebug event 10046 trace name context forever, level 12
  - oradebug tracefile_name
• Explain plan
• 10053 if plan problems are found
• V$SESSTAT
• Truss/trace/dbx/pstack if OS-related problems are suspected
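
When the same trace has to be enabled on several processes or nodes, generating the oradebug command sequence avoids typing mistakes. The sketch below only builds the text of the commands, to be pasted into a SQL*Plus session connected as SYSDBA. The function names are hypothetical, and the Oracle process ID passed in is an assumed value that has to be looked up for the real process.

```python
def trace_10046_on(orapid, level=12):
    """Return the oradebug commands that enable 10046 tracing on one process."""
    if not isinstance(orapid, int) or orapid <= 0:
        raise ValueError("orapid must be a positive integer")
    return "\n".join([
        f"oradebug setorapid {orapid}",
        f"oradebug event 10046 trace name context forever, level {level}",
        "oradebug tracefile_name",
    ])

def trace_10046_off(orapid):
    """Return the oradebug commands that switch the 10046 trace off again."""
    if not isinstance(orapid, int) or orapid <= 0:
        raise ValueError("orapid must be a positive integer")
    return "\n".join([
        f"oradebug setorapid {orapid}",
        "oradebug event 10046 trace name context off",
    ])

# 21 is an arbitrary example value, not a real process ID.
on_script = trace_10046_on(21)
off_script = trace_10046_off(21)
print(on_script)
print(off_script)
```

Generating the matching "off" commands alongside the "on" commands makes it less likely that a level 12 trace is left running and fills the dump destination.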

Instance Slowdown
• Statspack / AWR
• OS performance statistics: CPU, memory, and I/O
• Characteristics:
  – Related to a particular job?
  – Certain time of day?
  – What’s changed?


Multi-Instance Slowdowns
• AWR from each node can be of use:
  – AWR collects instance-specific data
  – Examine and correlate the reports


Multi-Instance Slowdowns
• In cases of extreme slowdowns, gather:
  – System states on all nodes
  – V$SESSION_WAIT
  – Alert logs and any trace files
  – Process states, or stack traces where applicable


Debugging Techniques
• v$session_wait
• System states from all nodes
• 10046 level 12 trace of the hung process
• ORADEBUG
• Lock layer and DLM tracing
• Get any traces:
  – DLM traces
  – Background processes, alert logs, and init.ora
  – User traces

Debugging and Diagnostics
• Performance issues or hangs:
  – Identify the resource being requested.
  – Identify who holds the resource.
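
The two questions above, which resource is requested and who holds it, amount to walking a wait-for chain. The sketch below illustrates the idea on plain dictionaries. It is not how Oracle's own diagnostics are implemented, and the function name `blocking_chain` plus the session and resource names are hypothetical.

```python
def blocking_chain(waits_for, holder_of, session):
    """Follow waiter -> resource -> holder links starting at `session`.

    waits_for: dict mapping a waiting session to the resource it requests.
    holder_of: dict mapping a resource to the session that holds it.
    Returns (chain, is_cycle); the last entry of a non-cyclic chain is the
    session at the root of the blocking.
    """
    chain = [session]
    seen = {session}
    current = session
    while current in waits_for:
        holder = holder_of.get(waits_for[current])
        if holder is None:
            break  # requested resource has no known holder
        if holder in seen:
            return chain + [holder], True  # wait-for cycle, i.e. a deadlock
        chain.append(holder)
        seen.add(holder)
        current = holder
    return chain, False

# Hypothetical data: s1 waits on r1 held by s2, which waits on r2 held by s3.
waits_for = {"s1": "r1", "s2": "r2"}
holder_of = {"r1": "s2", "r2": "s3"}
chain, is_cycle = blocking_chain(waits_for, holder_of, "s1")
print(" -> ".join(chain), "cycle" if is_cycle else "no cycle")
```

In this example s3 is not waiting on anything, so it is the session to examine first.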


ORADEBUG and Tools
• Hang analyze:
  – hanganalyze
• Note 301137.1: OS Watcher User Guide
• Note 135714.1: Script to Collect RAC Diagnostic Information (racdiag.sql)


Gathering Data: Best Practices
• Single most important step
• There is never too much data, but large amounts of irrelevant data increase both the download time and the time needed to process the data.
• Always err on the side of getting too much data, but be aware of the impact on resolution time.
• Too little data increases resolution time more than too much data.
• Always include a readme.txt file that explains the contents of the provided files.
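
The readme.txt can be generated from the upload directory so that no file goes undescribed. The sketch below is one possible approach. The function name `write_readme`, the file names, and the descriptions are hypothetical, and the demonstration runs in a temporary directory.

```python
import os
import tempfile

def write_readme(directory, descriptions):
    """Write readme.txt listing each file in `directory` with its size and
    a short description; files without one are flagged explicitly."""
    lines = ["Contents of this upload:", ""]
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name == "readme.txt" or not os.path.isfile(path):
            continue
        note = descriptions.get(name, "(no description provided)")
        lines.append(f"{name}\t{os.path.getsize(path)} bytes\t{note}")
    readme = os.path.join(directory, "readme.txt")
    with open(readme, "w") as fh:
        fh.write("\n".join(lines) + "\n")
    return readme

# Demonstration with made-up file names in a throwaway directory.
with tempfile.TemporaryDirectory() as upload_dir:
    for name in ("alert_node1.log", "systemstate_node2.trc"):
        with open(os.path.join(upload_dir, name), "w") as fh:
            fh.write("sample\n")
    readme_path = write_readme(
        upload_dir,
        {"alert_node1.log": "alert log from node1 covering the hang"},
    )
    with open(readme_path) as fh:
        readme_text = fh.read()
print(readme_text)
```

Flagging undescribed files keeps the "too much data" case manageable, because the reader can see at a glance which files were included deliberately.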


Gathering Data: Processes
• Always get stacks from processes that seem to be spinning, hanging, or unresponsive:
  – oradebug
  – gdb
  – pstack
• ps and top output can be very useful when determining whether a process exhibits issues such as memory leaks, spinning, or hanging.
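
Repeated ps samples can be compared to spot a process whose memory only ever grows. The sketch below assumes the resident set sizes have already been extracted per PID from successive samples. The function name `rss_growth` and the numbers are hypothetical, and steady growth is only a hint of a leak, not proof.

```python
def rss_growth(samples):
    """Given {pid: [rss_kb at t0, t1, ...]}, return {pid: growth_kb} for
    processes whose RSS never shrinks and ends higher than it started."""
    suspects = {}
    for pid, series in samples.items():
        if len(series) < 2:
            continue  # a single sample says nothing about a trend
        non_decreasing = all(b >= a for a, b in zip(series, series[1:]))
        if non_decreasing and series[-1] > series[0]:
            suspects[pid] = series[-1] - series[0]
    return suspects

# Hypothetical RSS values in KB from four successive ps samples.
samples = {
    4012: [150000, 162000, 175500, 190000],  # grows in every sample
    4013: [98000, 97500, 98200, 97900],      # fluctuates around a level
    4014: [50000],                           # only one sample
}
suspects = rss_growth(samples)
print(suspects)
```

A process that shows up here is a candidate for stack collection with the tools listed above.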

Gathering Data: RAC
• For instance evictions, review Metalink note 219361.1
• See Metalink note 203226.1: RAC Survival Kit: Real Application Clusters Troubleshooting and Information
• See Metalink note 289690.1: Data Gathering for Troubleshooting RAC and CRS issues


Gathering Data: Tools
• RDA – system and Oracle configuration information
• racdiag – modifiable SQL script for gathering RAC data. See Metalink note 135714.1, “Script to Collect RAC Diagnostic Information”
• OSW – OS Watcher gathers top, slabinfo, netstat, and ps data over programmable intervals. See Metalink note 301137.1, “OS Watcher User Guide”


Gathering Data: CRS 10.2.0.x (continued)
• CRS and other resource issues:
  – ORA_CRS_HOME
    • log//cssd/oclsmon
    • log//cssd
    • log//client
    • log//crsd
    • log//evmd
    • log//racg
  – ORACLE_HOME (rdbms)
    • racg/dump
    • ORACLE_BASE//hdump


Gathering Data: Tools (continued)
• Starting with 10.2.0.1, $ORA_CRS_HOME/bin/diagcollection.pl collects all RAC-relevant files (run as root):

    oracle10@stnsp010>./diagcollection.pl
    Production Copyright 2004, 2005, Oracle. All rights reserved
    Cluster Ready Services (CRS) diagnostic collection tool
    diagcollection
        --collect
            [--crs] For collecting crs diag information
            [--oh] For collecting oracle home diag information
            [--ob] For collecting oracle base diag information
            [--all] Default.For collecting all diag information
        NOTE:
            1. You can also do the following
               ./diagcollection.pl --collect --crs --oh
            2. ORA_CRS_HOME,ORACLE_HOME and ORACLE_BASE env variables need to be set.
        --clean cleans up the diagnosability information gathered by this script
        --coreanalyze extracts information from core files and stores it in a text file
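
The usage text notes that three environment variables must be set before the script runs. The sketch below checks that precondition and assembles the command line, using only the options shown in the usage output. It does not execute anything. The function name `diagcollection_argv` and the directory paths are hypothetical.

```python
import posixpath

REQUIRED_ENV = ("ORA_CRS_HOME", "ORACLE_HOME", "ORACLE_BASE")
COLLECT_FLAGS = ("--crs", "--oh", "--ob", "--all")  # from the usage text

def diagcollection_argv(env, flags=("--all",)):
    """Return the argument list for a diagcollection.pl --collect run,
    or raise if the documented preconditions are not met."""
    missing = [name for name in REQUIRED_ENV if not env.get(name)]
    if missing:
        raise RuntimeError("environment variables not set: " + ", ".join(missing))
    unknown = [flag for flag in flags if flag not in COLLECT_FLAGS]
    if unknown:
        raise ValueError("unsupported options: " + ", ".join(unknown))
    script = posixpath.join(env["ORA_CRS_HOME"], "bin", "diagcollection.pl")
    return [script, "--collect", *flags]

# Hypothetical installation paths, for illustration only.
env = {
    "ORA_CRS_HOME": "/u01/crs",
    "ORACLE_HOME": "/u01/app/oracle/product/10.2.0/db_1",
    "ORACLE_BASE": "/u01/app/oracle",
}
argv = diagcollection_argv(env, ("--crs", "--oh"))
print(" ".join(argv))
```

The resulting list could be handed to a process launcher by a root-owned wrapper; in practice `os.environ` would be passed in place of the hand-built dictionary.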


Testcases
• Not always feasible
• If provided, can greatly influence resolution time
• When providing a testcase:
  – Include a readme file
  – Try to strip the testcase down to the minimal elements that are needed to reproduce the problem
• If at all possible, always try to build a testcase
• Testcases are your friends!


Rediscovery
• Expensive for a support organization
• Issue rediscovery is not always obvious
• Use Metalink to identify possible causes for issues as well as workarounds and patch availability
• Communicate new issues between DBAs


Engaging Oracle Support
• Try to be responsive to all TARs when they are set to CUS status. Delays inherently cause two problems:
  1. The issue loses momentum
  2. A new engineer may have to take over the issue


Examples
• 10.2.0.2 HP-UX/Itanium ServiceGuard, CRS, CFS and RAC
• Delays in reconfiguration


Examples
• 10.2.0.2 Linux CRS, RAC and ASM
• ORA-600[2103] and one instance crashed


Questions?

