CRS & RAC Troubleshooting
Krishnadev Telikicherla
Cluster & Parallel Storage Technology
Oracle Corporation
Topics:
• Defining the Issue
• Creating a Timeline
• Hang or Slowdown
• Performance Issues
• Gathering Data
• Testcases
• Rediscovery
• Engaging Oracle Support
• Examples
Defining the Issue: Layers
What layers are involved in the issue?
• Oracle Clusterware
  – CRS daemon
  – CSS daemon
  – HangCheckTimer (Linux) / Oprocd (non-Linux platforms)
  – EVM
  – OCR
  – Voting disk
• General RDBMS
• Operating System
• Hardware
Defining the Issue: Cause vs. Effects
Causes:
  – Resource issues
  – Oracle issues
  – OS issues
Effects:
  – Hangs/Spins
  – Instance Crashes and Evictions
  – Node Reboots and Evictions
  – Oracle Errors (ORA-600, ORA-7445, ORA-29740)
Defining the Issue: Description
• When describing the problem while creating the SR via Metalink, it is important to use phrases that will help identify known issues, either in bugs or in Metalink content.
• In the body of the SR, try to be as detailed as possible about the environment.
• Nobody knows the system better than you.
• Talk to the sys-admin as well regarding OS- and network-related issues.
Creating a Timeline
• A timeline helps identify the times to concentrate on when reviewing files
• Support can build a timeline from the files themselves once they are provided, but this only slows down resolution
• Timelines should order causes and effects and include all participating nodes
• Include specific times, e.g.:
  – At 3:00am PST we noticed that node2 was hanging.
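The timeline advice above can be sketched in code. This is only an illustration in Python, assuming the observations from each node have already been written down as (timestamp, node, note) entries in a single time zone; the `build_timeline` helper and the sample events are hypothetical and not part of any Oracle tool.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format for this sketch

def build_timeline(events):
    """Return the events as text lines in chronological order.

    Each event is a (timestamp_string, node, note) tuple.
    """
    parsed = [(datetime.strptime(ts, FMT), node, note) for ts, node, note in events]
    parsed.sort(key=lambda e: e[0])
    return [f"{ts.strftime(FMT)} {node}: {note}" for ts, node, note in parsed]

# Hypothetical observations gathered from two nodes, deliberately out of order
events = [
    ("2006-05-01 03:02:10", "node1", "alert log reports eviction of node2"),
    ("2006-05-01 03:00:00", "node2", "users report the instance is hanging"),
    ("2006-05-01 03:04:30", "node2", "node reboots"),
]

timeline = build_timeline(events)
for line in timeline:
    print(line)
```

Sorting all nodes into one list makes the cause-and-effect ordering visible at a glance, which is the part that is slow to reconstruct from raw files.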
Hang or Slowdown
• Differentiate between a database hang and a database slowdown
• Identify the extent of a hang
Is it a Hang or a Slowdown?
Check:
• System states, to see if there is any change over a short period of time
• V$SESSION_WAIT where wait_time=0
• Overall machine load, including cpu, memory, swap, I/O
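One way to apply the "any change over a short period of time" check is to compare two V$SESSION_WAIT snapshots taken a short time apart. The sketch below is a hedged illustration in Python: it assumes the rows have already been fetched into dictionaries keyed by SID, with `seq` standing for the SEQ# column, and the `classify_sessions` helper and sample values are hypothetical. A session still waiting (wait_time=0) on the same event with an unchanged SEQ# has made no progress between snapshots; a session whose SEQ# advanced is slow rather than hung.

```python
def classify_sessions(first, second):
    """Compare two V$SESSION_WAIT snapshots keyed by SID.

    Each value is a dict with 'seq' (SEQ#), 'event' and 'wait_time'.
    Returns (stuck_sids, moving_sids).
    """
    stuck, moving = [], []
    for sid, before in first.items():
        after = second.get(sid)
        if after is None:
            continue  # session ended between the two snapshots
        still_waiting = before["wait_time"] == 0 and after["wait_time"] == 0
        same_wait = (before["seq"] == after["seq"]
                     and before["event"] == after["event"])
        if still_waiting and same_wait:
            stuck.append(sid)
        else:
            moving.append(sid)
    return sorted(stuck), sorted(moving)

# Hypothetical snapshots taken a minute apart
snap1 = {
    101: {"seq": 57, "event": "enq: TX - row lock contention", "wait_time": 0},
    102: {"seq": 900, "event": "db file sequential read", "wait_time": 0},
}
snap2 = {
    101: {"seq": 57, "event": "enq: TX - row lock contention", "wait_time": 0},
    102: {"seq": 1450, "event": "db file sequential read", "wait_time": 0},
}

stuck, moving = classify_sessions(snap1, snap2)
print("no progress:", stuck)
print("progressing:", moving)
```

Idle sessions (for example, those waiting on client input) would also show up as making no progress, so a real check would need to filter idle wait events first.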
Is it a Hang or a Slowdown?
• Single or multiprocess hang:
  – Usually characterized by a particular job hanging or not completing
  – Essentially the same as in single instance unless it’s internode parallel query
• Instance hang: a single instance is unusable
• Multi-instance or full database hang: the entire database is hung or not responding
Performance
• Single process or statement
• Instance
• Multi-Instance
Single Process or Single Statement
• Find the wait event
• 10046 level 12
  - oradebug setorapid <Oracle process id>
  - oradebug event 10046 trace name context forever, level 12
  - oradebug tracefile_name
• Explain plan
• 10053 if plan problems are found
• V$SESSTAT
• Truss/trace/dbx/pstack if OS-related problems are suspected
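Once a 10046 trace has been produced, finding the dominant wait event is mostly a matter of adding up the WAIT lines. The following Python sketch is illustrative only: it assumes WAIT lines of the form `WAIT #n: nam='event' ela= microseconds ...`, which varies between versions, and the `summarize_waits` helper and sample lines are hypothetical stand-ins for a real trace file.

```python
import re
from collections import defaultdict

# Assumed layout of a WAIT line; adjust the pattern to the actual trace format
WAIT_RE = re.compile(r"^WAIT #\d+: nam='([^']+)' ela=\s*(\d+)")

def summarize_waits(lines):
    """Sum elapsed time per wait event, largest total first."""
    totals = defaultdict(int)
    for line in lines:
        match = WAIT_RE.match(line)
        if match:
            totals[match.group(1)] += int(match.group(2))
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical trace excerpt; non-WAIT lines are ignored
sample = [
    "WAIT #1: nam='db file sequential read' ela= 12000 p1=4 p2=1234 p3=1",
    "WAIT #1: nam='gc cr request' ela= 45000 p1=4 p2=1235 p3=1",
    "FETCH #1:c=0,e=100,p=1,cr=3,cu=0,mis=0,r=1,dep=0,og=1,tim=1",
    "WAIT #1: nam='db file sequential read' ela= 8000 p1=4 p2=1236 p3=1",
]

summary = summarize_waits(sample)
for event, elapsed in summary:
    print(f"{event}: {elapsed}")
```

With a real trace, the same function could be fed the lines of the file named by `oradebug tracefile_name`.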
Instance Slowdown
• Statspack / AWR
• OS performance statistics: cpu, memory, and I/O
• Characteristics:
  – Related to a particular job?
  – Certain time of day?
  – What’s changed?
Multi-Instance Slowdowns
• AWR from each node can be of use:
  – AWR collects instance-specific data
  – Examine and correlate the reports
Multi-Instance Slowdowns
In cases of extreme slowdowns:
• Systemstates on all nodes
• V$SESSION_WAIT
• Alert logs and any trace files
• Process states, or stack traces if determined and applicable
Debugging Techniques
• V$SESSION_WAIT
• System states from all nodes
• 10046 level 12 trace of the hung process
• ORADEBUG
• Lock layer and DLM tracing
• Get any traces:
  – DLM traces
  – Background processes, alert logs, and init.ora
  – User traces
Debugging and Diagnostics
Performance issues or hangs:
• Identify the resource being requested.
• Identify who holds the resource.
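The two questions above (what is being requested, and who holds it) amount to following a chain of waiters back to the session at its head. A minimal sketch in Python, assuming the waiter-to-holder pairs have already been extracted by hand from system states or hang analysis output; the `find_root_blockers` helper and the session labels are hypothetical.

```python
def find_root_blockers(waits_for):
    """Map each waiting session to the session at the head of its chain.

    waits_for maps a waiter to the holder of the resource it requests.
    A value of None marks a cycle, i.e. a deadlock with no single root.
    """
    roots = {}
    for waiter in waits_for:
        seen = {waiter}
        current = waiter
        while current in waits_for:
            current = waits_for[current]
            if current in seen:
                current = None  # walked back onto the chain: deadlock
                break
            seen.add(current)
        roots[waiter] = current
    return roots

# Hypothetical chain: 12 waits on 40, while 40 and 33 both wait on 7
waits_for = {"sid 12": "sid 40", "sid 40": "sid 7", "sid 33": "sid 7"}
roots = find_root_blockers(waits_for)
for waiter, root in roots.items():
    print(f"{waiter} is ultimately blocked by {root}")
```

The session that appears as a root but never as a waiter is the one whose stack and wait event deserve attention first.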
ORADEBUG and Tools
• Hang analyze:
  – hanganalyze
• Note 301137.1: OS Watcher User Guide
• Note 135714.1: Script to Collect RAC Diagnostic Information (racdiag.sql)
Gathering Data: Best Practices
• Single most important step
• There is never too much data, but lots of useless data increases both the download time and the time needed to process the data.
• Always err on the side of getting too much data, but be aware of the impact on resolution time.
• Too little data increases resolution time more than too much data.
• Always include a readme.txt file that explains the contents of the provided files.
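The readme.txt step is easy to automate so that it is never skipped. The Python sketch below is one possible approach, not an Oracle-provided tool: the `write_readme` helper, the file name and the description text are all hypothetical, and the demo runs in a temporary directory.

```python
import os
import tempfile

def write_readme(directory, descriptions):
    """Write readme.txt listing every file in directory with size and description.

    descriptions maps a file name to a one-line explanation; files without
    an entry are flagged so that they can be explained or removed.
    """
    lines = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name == "readme.txt" or not os.path.isfile(path):
            continue
        size = os.path.getsize(path)
        note = descriptions.get(name, "NO DESCRIPTION PROVIDED")
        lines.append(f"{name}\t{size} bytes\t{note}")
    readme = os.path.join(directory, "readme.txt")
    with open(readme, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return readme

# Demo with a hypothetical file in a throwaway directory
with tempfile.TemporaryDirectory() as upload_dir:
    with open(os.path.join(upload_dir, "alert_node1.log"), "w") as handle:
        handle.write("example\n")
    readme_path = write_readme(
        upload_dir, {"alert_node1.log": "alert log from node1 covering the hang"}
    )
    with open(readme_path) as handle:
        readme_text = handle.read()

print(readme_text)
```

Flagging undescribed files is a deliberate choice: it surfaces the "useless data" that increases download and processing time.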
Gathering Data: Processes
• Always get stacks from processes that seem to be spinning, hanging or unresponsive:
  – oradebug
  – gdb
  – pstack
• ps and top info can be very useful when trying to determine whether a process exhibits issues such as memory leaks, spinning or hanging
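As an illustration of how repeated ps samples can point to a memory leak, the Python sketch below looks for processes whose resident size only ever grows across samples. It assumes the samples have already been parsed into {pid: rss_kb} dictionaries (for example from periodic ps output); the `rss_growth` helper, the pids and the sizes are hypothetical.

```python
def rss_growth(samples):
    """Return {pid: growth_kb} for processes whose RSS never shrank and grew overall.

    samples is a list of {pid: rss_kb} dicts taken at successive intervals.
    """
    suspects = {}
    for pid in samples[0]:
        series = [sample[pid] for sample in samples if pid in sample]
        if len(series) < len(samples):
            continue  # process exited during sampling
        never_shrinks = all(later >= earlier
                            for earlier, later in zip(series, series[1:]))
        if never_shrinks and series[-1] > series[0]:
            suspects[pid] = series[-1] - series[0]
    return suspects

# Hypothetical RSS values in KB from three successive samples
samples = [
    {4101: 51200, 4102: 80300},
    {4101: 58800, 4102: 80100},
    {4101: 66900, 4102: 80350},
]

suspects = rss_growth(samples)
print(suspects)
```

Steady growth alone does not prove a leak, since caches also grow, so a flagged process is a candidate for stack and heap collection rather than a diagnosis.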
Gathering Data: RAC
• For instance evictions, please review Metalink note 219361.1
• See Metalink note 203226.1: RAC Survival Kit: Real Application Clusters Troubleshooting and Information
• See Metalink note 289690.1: Data Gathering for Troubleshooting RAC and CRS issues
Gathering Data: Tools
• RDA – system and Oracle configuration information
• racdiag – modifiable SQL script for gathering RAC data. See Metalink note 135714.1, “Script to Collect RAC Diagnostic Information”
• OSW – OS Watcher gathers top, slabinfo, netstat and ps data over programmable intervals. See Metalink note 301137.1, “OS Watcher User Guide”
Gathering Data: CRS 10.2.0.x (continued)
CRS and other resource issues:
  – ORA_CRS_HOME
      log//cssd/oclsmon
      log//cssd
      log//client
      log//crsd
      log//evmd
      log//racg
  – ORACLE_HOME (rdbms)
      racg/dump
      ORACLE_BASE//hdump
Gathering Data: Tools (continued)
Starting with 10.2.0.1, $ORA_CRS_HOME/bin/diagcollection.pl collects all RAC-relevant files (run as root):

    oracle10@stnsp010>./diagcollection.pl
    Production Copyright 2004, 2005, Oracle. All rights reserved
    Cluster Ready Services (CRS) diagnostic collection tool
    diagcollection
        --collect
            [--crs] For collecting crs diag information
            [--oh]  For collecting oracle home diag information
            [--ob]  For collecting oracle base diag information
            [--all] Default.For collecting all diag information
            NOTE: 1. You can also do the following
                     ./diagcollection.pl --collect --crs --oh
                  2. ORA_CRS_HOME,ORACLE_HOME and ORACLE_BASE env variables need to be set.
        --clean cleans up the diagnosability information gathered by this script
        --coreanalyze extracts information from core files and stores it in a text file
Testcases
• Not always feasible
• If provided, can greatly influence resolution time
• When providing a testcase:
  – Include a readme file
  – Try to strip the testcase down to the minimal elements that are needed to reproduce the problem
• If at all possible, always try to build a testcase
• Testcases are your friends!
Rediscovery
• Expensive for a support organization
• Issue rediscovery is not always obvious
• Use Metalink to identify possible causes for issues as well as workarounds and patch availability
• Communicate new issues between DBAs
Engaging Oracle Support
Try to be responsive to all TARs when they are set to CUS status. Delays inherently cause two problems:
1. The issue loses momentum
2. A new engineer may have to take over the issue
Examples
• 10.2.0.2 HP-UX/Itanium
• ServiceGuard, CRS, CFS and RAC
• Delays in reconfiguration
Examples
• 10.2.0.2 Linux
• CRS, RAC and ASM
• ORA-600[2103] and one instance crashed
Questions?