What to Do if 11gR2 Clusterware is Unhealthy [ID 1068835.1]
 

Modified 08-JUL-2010     Type BULLETIN     Status PUBLISHED

 

In this Document
  Purpose
  Scope and Application
  What to Do if 11gR2 Clusterware is Unhealthy
    1. Clusterware Process
    2. Clusterware Exclusive Mode
    3. User Resource
    Appendix
  References



Applies to:


Oracle Server - Enterprise Edition - Version: 11.2.0.1 and later   [Release: 11.2 and later ]
Information in this document applies to any platform.


Purpose


11gR2 Grid Infrastructure clusterware (CRS) may become unhealthy if, for example, the filesystem becomes 100% full on "/" or on the mount point where the clusterware home is installed, the OS runs out of memory, or the network is not performing well.

Generally speaking, clusterware should recover automatically from this kind of situation, but in some cases it may fail.  The purpose of this document is to provide a list of troubleshooting actions in the event that clusterware auto recovery fails.


Scope and Application


This document is intended for RAC Database Administrators and Oracle support engineers.


What to Do if 11gR2 Clusterware is Unhealthy


Common symptoms of unhealthy clusterware include srvctl or crsctl returning unexpected results or becoming unresponsive. Common underlying causes include (a quick way to check these at the OS level is sketched after this list):

  • OS running out of space.
  • OS running out of memory.
  • OS running out of CPU resource.
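
A quick way to confirm these at the OS level before touching clusterware; this is a minimal sketch for Linux, and the /u01 path for the Grid Infrastructure home is an assumption - adjust it to your installation:

df -h / /u01              # filesystem space on "/" and the Grid home mount point
free -m                   # free and used memory in MB
vmstat 5 3                # CPU and memory pressure sampled over 15 seconds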

 

NOTE:  The following note provides a list of common causes of individual clusterware process failures:
Note 1050908.1 - How to Troubleshoot Grid Infrastructure Startup Issues

 

1. Clusterware Process:

Once the issue is identified and fixed, please wait a few minutes and verify the state of the clusterware processes - all processes should show up as ONLINE.

1A. To find out the state of the clusterware processes:

$GRID_HOME/bin/crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        OFFLINE OFFLINE                               Instance Shutdown
ora.crsd
      1        OFFLINE OFFLINE
ora.cssd
      1        ONLINE  ONLINE       rac002f
ora.cssdmonitor
      1        ONLINE  ONLINE       rac002f
ora.ctssd
      1        ONLINE  ONLINE       rac002f                  OBSERVER
ora.diskmon
      1        ONLINE  ONLINE       rac002f
ora.drivers.acfs
      1        ONLINE  ONLINE       rac002f
ora.evmd
      1        OFFLINE OFFLINE
ora.gipcd
      1        ONLINE  ONLINE       rac002f
ora.gpnpd
      1        ONLINE  ONLINE       rac002f
ora.mdnsd
      1        ONLINE  ONLINE       rac002f

1B. In the above example, ora.asm, ora.crsd and ora.evmd remained OFFLINE, which means manual intervention is needed to bring them up:

$GRID_HOME/bin/crsctl start res ora.crsd -init
CRS-2672: Attempting to start 'ora.asm' on 'rac002f'
CRS-2676: Start of 'ora.asm' on 'rac002f' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'rac002f'
CRS-2676: Start of 'ora.crsd' on 'rac002f' succeeded

As ora.crsd depends on ora.asm, ora.asm is started automatically when ora.crsd is started.

To bring up ora.evmd:

$GRID_HOME/bin/crsctl start res ora.evmd -init
CRS-2672: Attempting to start 'ora.evmd' on 'rac001f'
CRS-2676: Start of 'ora.evmd' on 'rac001f' succeeded

1C. If a process resource fails to start up, please refer to Note 1050908.1 (referenced above) for troubleshooting steps; then try to stop it and restart it:

$GRID_HOME/bin/crsctl stop res ora.evmd -init

If this fails, try with "-f" option:

$GRID_HOME/bin/crsctl stop res ora.evmd -init -f

If stop fails with "-f" option, please refer to Appendix.

If the process is already stopped, the following errors will be reported:

CRS-2500: Cannot stop resource 'ora.evmd' as it is not running
CRS-4000: Command Stop failed, or completed with errors.

1D. If a critical clusterware process fails to start and there is no obvious reason, the next action is to restart clusterware on the local node:

$GRID_HOME/bin/crsctl stop crs -f
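
Once the stack has stopped cleanly (or after the forced cleanup in 1E/1F below), start clusterware on the node again as root; this is the standard counterpart of the stop above, shown here as a sketch:

$GRID_HOME/bin/crsctl start crs

Then run "crsctl stat res -t -init" again (step 1A) to confirm that all processes come back ONLINE.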

1E. If the stop command in 1D fails, you may kill all clusterware processes by executing:

ps -ef | grep keyword | grep -v grep | awk '{print $2}' | xargs kill -9
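
Here "keyword" is a placeholder for the clusterware process name. As an illustration only (an assumption, not part of the original note): the 11gR2 daemon binaries end in "d.bin" (ohasd.bin, ocssd.bin, crsd.bin, evmd.bin, ...), so one pattern that catches them is shown below. Double-check the matched PIDs before sending kill -9.

ps -ef | grep 'd\.bin' | grep -v grep | awk '{print $2}' | xargs kill -9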

1F. As a last resort, you can take the local node out of the cluster by rebooting it.

1G. If clusterware is unhealthy on more than one node, repeat the same procedure on all other nodes. Once the clusterware daemons are up on all nodes, the next thing to verify is the user resources.

2. Clusterware Exclusive Mode

Certain tasks require clusterware to be in exclusive mode. To bring CRS into exclusive mode, shut down CRS on all nodes (refer to Steps 1D, 1E and 1F above), then as root, issue the following command on one node only:

$GRID_HOME/bin/crsctl start crs -excl

If cssd fails to come up, as root, issue the following command:

$GRID_HOME/bin/crsctl start res ora.cssd -init -env "CSSD_MODE=-X"
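
One example of a task that needs exclusive mode (an assumption for illustration; this note does not list such tasks) is restoring the OCR from a backup with ocrconfig. A minimal sketch of the surrounding steps, all run as root:

$GRID_HOME/bin/crsctl stop crs -f            # on every node (steps 1D/1E/1F)
$GRID_HOME/bin/crsctl start crs -excl        # on one node only
# ... perform the maintenance task, e.g. ocrconfig -restore <backup_file> ...
$GRID_HOME/bin/crsctl stop crs -f            # leave exclusive mode
$GRID_HOME/bin/crsctl start crs              # restart normally, then start the remaining nodes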

 

3. User Resource:

3A. The crs_stat command has been deprecated in 11gR2; please do not use it anymore.  Use the following command to query the state of all user resources:

$GRID_HOME/bin/crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.GI.dg
               ONLINE  ONLINE       rac001f
               ONLINE  ONLINE       rac002f
ora.LISTENER.lsnr
               ONLINE  ONLINE       rac001f
               ONLINE  ONLINE       rac002f
..
ora.gsd
               OFFLINE OFFLINE      rac001f
               OFFLINE OFFLINE      rac002f
ora.net1.network
               ONLINE  ONLINE       rac001f
               ONLINE  ONLINE       rac002f
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       rac002f
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  OFFLINE
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  OFFLINE
ora.b2.db
      1        ONLINE  ONLINE       rac001f
      2        ONLINE  ONLINE       rac002f                  Open
ora.b2.sb2.svc
      1        ONLINE  ONLINE       rac001f
      2        ONLINE  ONLINE       rac002f
ora.rac001f.vip
      1        ONLINE  ONLINE       rac001f
ora.rac002f.vip
      1        ONLINE  ONLINE       rac002f
ora.oc4j
      1        OFFLINE OFFLINE
ora.scan1.vip
      1        ONLINE  ONLINE       rac002f
ora.scan2.vip
      1        ONLINE  OFFLINE
ora.scan3.vip
      1        ONLINE  OFFLINE

 

NOTE:  ora.gsd is OFFLINE by default if there is no 9i database in the cluster.  ora.oc4j is OFFLINE in 11.2.0.1 as Database Workload Management (DBWLM) is unavailable.

3B. In the example above, resources ora.scan2.vip, ora.scan3.vip, ora.LISTENER_SCAN2.lsnr and ora.LISTENER_SCAN3.lsnr are OFFLINE.

To start them:

$GRID_HOME/bin/srvctl start scan
PRCC-1014 : scan1 was already running

3C. To start other OFFLINE resources:

$RESOURCE_HOME/bin/srvctl start resource_type <options>

$RESOURCE_HOME refers to the home that the resource runs out of; for example, a VIP runs out of $GRID_HOME, an 11.2 database (.db) resource out of the 11.2 RDBMS home, and an 11.1 database (.db) resource out of the 11.1 RDBMS home.

For srvctl syntax, please refer to the Server Control Utility Reference.
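
For illustration only, a few common forms, reusing the node and database names from the example output above (treat them as placeholders for your own environment):

$GRID_HOME/bin/srvctl start vip -n rac001f           # node VIP, runs out of the Grid home
$GRID_HOME/bin/srvctl start listener -n rac001f      # local listener on a node
$ORACLE_HOME/bin/srvctl start database -d b2         # 11.2 database, runs out of the 11.2 RDBMS home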

3D. To stop a user resource, try the following commands sequentially until the resource stops successfully:

$RESOURCE_HOME/bin/srvctl stop resource_type <options>
$RESOURCE_HOME/bin/srvctl stop resource_type <options> -f
$GRID_HOME/bin/crsctl stop res resource_name
$GRID_HOME/bin/crsctl stop res resource_name -f

Where resource_name is the name shown in the "crsctl stat res" output.
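
As a sketch, using the 11.2 database resource ora.b2.db from the example output above (the database name b2 is only illustrative), the sequence would be:

$ORACLE_HOME/bin/srvctl stop database -d b2
$ORACLE_HOME/bin/srvctl stop database -d b2 -f
$GRID_HOME/bin/crsctl stop res ora.b2.db
$GRID_HOME/bin/crsctl stop res ora.b2.db -f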

Appendix

A. Process Resource Fails to Stop even with "-f" option:

 

$GRID_HOME/bin/crsctl stat res  -w 'NAME = ora.ctssd' -t -init

ora.ctssd
     1        ONLINE  UNKNOWN      node1                    Wrong check return.

 

$GRID_HOME/bin/crsctl stop res ora.ctssd -init
CRS-2673: Attempting to stop 'ora.ctssd' on 'node1'
CRS-2675: Stop of 'ora.ctssd' on 'node1' failed
CRS-2679: Attempting to clean 'ora.ctssd' on 'node1'
CRS-2680: Clean of 'ora.ctssd' on 'node1' failed
Clean action for daemon aborted

 

The following entries are from $GRID_HOME/log/$HOST/agent/ohasd/orarootagent_root/orarootagent_root.log:

2010-05-19 15:58:39.803: [ora.ctssd][1155352896] [check] PID will be looked for in /ocw/grid/ctss/init/node1.pid
2010-05-19 15:58:39.835: [ora.ctssd][1155352896] [check] PID which will be monitored will be 611
..
2010-05-19 15:58:40.016: [ COMMCRS][1239271744]clsc_connect: (0x2aaaac052ed0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=node1DBG_CTSSD))

[  clsdmc][1155352896]Fail to connect (ADDRESS=(PROTOCOL=ipc)(KEY=node1DBG_CTSSD)) with status 9
2010-05-19 15:58:40.016: [ora.ctssd][1155352896] [check] Error = error 9 encountered when connecting to CTSSD
..
2010-05-19 15:58:40.039: [ora.ctssd][1155352896] [check] Calling PID check for daemon
2010-05-19 15:58:40.039: [ora.ctssd][1155352896] [check] Trying to check PID = 611
..
2010-05-19 15:58:40.219: [ora.ctssd][1155352896] [check] PID check returned ONLINE CLSDM returned OFFLINE
2010-05-19 15:58:40.219: [ora.ctssd][1155352896] [check] Check error. Return = 5, state detail = Wrong check return.
2010-05-19 15:58:40.220: [    AGFW][1155352896] check for resource: ora.ctssd 1 1 completed with status: FAILED
2010-05-19 15:58:40.220: [    AGFW][1165842752] ora.ctssd 1 1 state changed from: UNKNOWN to: FAILED

 

ps -ef|grep 611|grep -v grep
root       611     7  0 May19 ?        00:00:00 [kmpathd/0]

 

cat /ocw/grid/ctss/init/node1.pid
611

 

In the above example, the stop of ora.ctssd fails because the daemon pid file shows the pid of octssd as 611, but "ps -ef" shows that pid 611 is kmpathd, which is not octssd.bin; the connection to ctssd via IPC key node1DBG_CTSSD also fails.

To fix the issue, nullify the ctssd pid file:

 

> /ocw/grid/ctss/init/node1.pid

 

The pid file of a process resource is located at either $GRID_HOME/log/$HOST/$DAEMON/$HOST.pid or $GRID_HOME/$DAEMON/init/$HOST.pid.
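
If it is unclear which location applies, one way (a sketch, assuming the file keeps the default $HOST.pid name under the Grid home) to locate the pid file on the current node is:

find $GRID_HOME -name "$(hostname -s).pid" 2>/dev/null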


References


NOTE:1050908.1 - How to Troubleshoot Grid Infrastructure Startup Issues
NOTE:1053147.1 - 11gR2 Clusterware and Grid Home - What You Need to Know
NOTE:942166.1 - How to Proceed from Failed 11gR2 Grid Infrastructure (CRS) Installation
NOTE:969254.1 - How to Proceed from Failed Upgrade to 11gR2 Grid Infrastructure (CRS)
NOTE:1069369.1 - How to Delete or Add Resource in 11gR2 Grid Infrastructure
