[Oracle RAC Fault Analysis and Handling]
Original article: [Oracle RAC Fault Analysis and Handling]. Author: 蟻巡運維平臺
I. RAC Environment
A two-node RAC architecture. Instance parameters for each node:
Node 1
SQL> show parameter instance
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
active_instance_count                integer
cluster_database_instances           integer     2
instance_groups                      string
instance_name                        string      RACDB1
instance_number                      integer     1
instance_type                        string      RDBMS
open_links_per_instance              integer     4
parallel_instance_group              string
parallel_server_instances            integer     2
Node 2
SQL> show parameter instance
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
active_instance_count                integer
cluster_database_instances           integer     2
instance_groups                      string
instance_name                        string      RACDB2
instance_number                      integer     2
instance_type                        string      RDBMS
open_links_per_instance              integer     4
parallel_instance_group              string
parallel_server_instances            integer     2
Database version
SQL> select * from v$version;
BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Prod
PL/SQL Release 10.2.0.1.0 - Production
CORE    10.2.0.1.0      Production
TNS for Linux: Version 10.2.0.1.0 - Production
NLSRTL Version 10.2.0.1.0 - Production
Operating system information
Node 1
[oracle@rac1 ~]$ uname -a
Linux rac1 2.6.18-53.el5 #1 SMP Wed Oct 10 16:34:02 EDT 2007 i686 i686 i386 GNU/Linux
Node 2
[oracle@rac2 ~]$ uname -a
Linux rac2 2.6.18-53.el5 #1 SMP Wed Oct 10 16:34:02 EDT 2007 i686 i686 i386 GNU/Linux
All RAC resources
[oracle@rac2 ~]$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....B1.inst application    ONLINE    ONLINE    rac1
ora....B2.inst application    ONLINE    ONLINE    rac2
ora....DB1.srv application    ONLINE    ONLINE    rac2
ora.....TAF.cs application    ONLINE    ONLINE    rac2
ora.RACDB.db   application    ONLINE    ONLINE    rac2
ora....SM1.asm application    ONLINE    ONLINE    rac1
ora....C1.lsnr application    ONLINE    ONLINE    rac1
ora.rac1.gsd   application    ONLINE    ONLINE    rac1
ora.rac1.ons   application    ONLINE    ONLINE    rac1
ora.rac1.vip   application    ONLINE    ONLINE    rac1
ora....SM2.asm application    ONLINE    ONLINE    rac2
ora....C2.lsnr application    ONLINE    ONLINE    rac2
ora.rac2.gsd   application    ONLINE    ONLINE    rac2
ora.rac2.ons   application    ONLINE    ONLINE    rac2
ora.rac2.vip   application    ONLINE    ONLINE    rac2
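II. Simulate an interconnect failure between the two nodes: what does RAC do? The full fault-location process
This section simulates a failure of the RAC private network, then locates the cause of the fault, and finally clears it.
1. First, confirm that RAC is in a fully healthy state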
[oracle@rac2 ~]$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....B1.inst application    ONLINE    ONLINE    rac1
ora....B2.inst application    ONLINE    ONLINE    rac2
ora....DB1.srv application    ONLINE    ONLINE    rac2
ora.....TAF.cs application    ONLINE    ONLINE    rac2
ora.RACDB.db   application    ONLINE    ONLINE    rac2
ora....SM1.asm application    ONLINE    ONLINE    rac1
ora....C1.lsnr application    ONLINE    ONLINE    rac1
ora.rac1.gsd   application    ONLINE    ONLINE    rac1
ora.rac1.ons   application    ONLINE    ONLINE    rac1
ora.rac1.vip   application    ONLINE    ONLINE    rac1
ora....SM2.asm application    ONLINE    ONLINE    rac2
ora....C2.lsnr application    ONLINE    ONLINE    rac2
ora.rac2.gsd   application    ONLINE    ONLINE    rac2
ora.rac2.ons   application    ONLINE    ONLINE    rac2
ora.rac2.vip   application    ONLINE    ONLINE    rac2
Check the CRS daemon status (CRS, CSS, EVM)
[oracle@rac2 ~]$ crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
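The three-line health check above is easy to script. The following is a minimal sketch that reads the text printed by `crsctl check crs` and succeeds only when CSS, CRS and EVM all report "appears healthy". The function name `crs_is_healthy` is hypothetical, and the matching assumes the 10.2 message wording shown in this article.

```shell
# crs_is_healthy: read `crsctl check crs` output on stdin.
# Returns 0 only if CSS, CRS and EVM each have an "appears healthy" line.
# Assumption: message wording as printed by the 10.2 crsctl in this article.
crs_is_healthy() {
    crs_out=$(cat)
    for daemon in CSS CRS EVM; do
        printf '%s\n' "$crs_out" | grep -q "^${daemon} appears healthy" || return 1
    done
    return 0
}

# Typical use on a cluster node (needs crsctl in PATH):
#   crsctl check crs | crs_is_healthy && echo "clusterware stack healthy"
```

Checking each daemon by name, rather than counting matching lines, avoids a false pass if one daemon's line is missing and another is duplicated.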
Check the OCR disk status; no problems found
[oracle@rac2 ~]$ ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     104344
         Used space (kbytes)      :       4344
         Available space (kbytes) :     100000
         ID                       : 1752469369
         Device/File Name         : /dev/raw/raw1
                                    Device/File integrity check succeeded
                                    Device/File not configured
         Cluster registry integrity check succeeded
Check the voting disk status
[oracle@rac2 ~]$ crsctl query css votedisk
 0.     0    /dev/raw/raw2        <-- raw device 2 is the voting disk
located 1 votedisk(s).            <-- only one voting disk is configured
2. Manually disable the private network interface
[oracle@rac2 ~]$ cat /etc/hosts
127.0.0.1       localhost.localdomain   localhost
::1     localhost6.localdomain6 localhost6
##Public Network - (eth0)
##Private Interconnect - (eth1)
##Public Virtual IP (VIP) addresses - (eth0)
192.168.1.101   rac1            <-- RAC public network address
192.168.1.102   rac2
192.168.2.101   rac1-priv       <-- RAC private (interconnect) address
192.168.2.102   rac2-priv
192.168.1.201   rac1-vip        <-- RAC virtual IP (VIP) address
192.168.1.202   rac2-vip
Check which IP address maps to which network interface
[oracle@rac2 ~]$ ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0C:29:8F:F1:87
          inet addr:192.168.1.102  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe8f:f187/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:360 errors:0 dropped:0 overruns:0 frame:0
          TX packets:593 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:46046 (44.9 KiB)  TX bytes:62812 (61.3 KiB)
          Interrupt:185 Base address:0x14a4
eth0:1    Link encap:Ethernet  HWaddr 00:0C:29:8F:F1:87
          inet addr:192.168.1.202  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:185 Base address:0x14a4
eth1      Link encap:Ethernet  HWaddr 00:0C:29:8F:F1:91
          inet addr:192.168.2.102  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe8f:f191/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:76588 errors:0 dropped:0 overruns:0 frame:0
          TX packets:58002 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:65185420 (62.1 MiB)  TX bytes:37988820 (36.2 MiB)
          Interrupt:193 Base address:0x1824
eth2      Link encap:Ethernet  HWaddr 00:0C:29:8F:F1:9B
          inet addr:192.168.203.129  Bcast:192.168.203.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe8f:f19b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:339 errors:0 dropped:0 overruns:0 frame:0
          TX packets:83 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:42206 (41.2 KiB)  TX bytes:10199 (9.9 KiB)
          Interrupt:169 Base address:0x18a4
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:99403 errors:0 dropped:0 overruns:0 frame:0
          TX packets:99403 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:18134658 (17.2 MiB)  TX bytes:18134658 (17.2 MiB)
eth0 is the RAC public interface
eth1 is the RAC private interface
eth0:1 carries the RAC virtual IP (VIP)
We now disable the private interface eth1 to simulate an interconnect outage. The method is simple:
ifdown eth1        disable the interface
ifup eth1          enable the interface
[oracle@rac2 ~]$ su - root        <-- root is required; otherwise ifdown reports "Users cannot control this device."
Password:
[root@rac2 ~]# ifdown eth1
I entered this command at 17:18:51. Four minutes later node 2 rebooted. Can you tell what happened?
This is the well-known RAC split-brain problem. When the interconnect between the nodes is down, the nodes can no longer share information and a split-brain occurs. RAC must evict some of the nodes to protect data consistency, and an evicted node is forcibly rebooted, which is why node 2 rebooted on its own. That raises the next question: why did node 2 reboot rather than the other node?
There are eviction rules: (1) the sub-cluster with fewer nodes is evicted
                          (2) the node with the higher node number is evicted
                          (3) the node with the higher load is evicted
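To make the rules above concrete, here is a small sketch that applies rules (1) and (2) to two sub-clusters, each given as a comma-separated list of node numbers. It is only an illustration of the rules as this article states them, not Oracle's actual implementation. For sub-clusters of equal size it assumes that the one holding the lowest node number survives, and it leaves out rule (3) because load is not measured here. All function names are hypothetical.

```shell
# Illustrative only: encodes eviction rules (1) and (2) as stated in the article.
# Each argument is a comma-separated list of node numbers in one sub-cluster.
# Function names are hypothetical; this is not Oracle's implementation.

count_nodes() {
    # Put one node number per line, then count the non-empty lines.
    printf '%s\n' "$1" | tr ',' '\n' | grep -c .
}

lowest_node() {
    # Smallest node number in the list.
    printf '%s\n' "$1" | tr ',' '\n' | sort -n | head -n 1
}

# Print the sub-cluster that would be evicted.
pick_evicted() {
    size_a=$(count_nodes "$1")
    size_b=$(count_nodes "$2")
    if [ "$size_a" -lt "$size_b" ]; then
        echo "$1"        # rule (1): the smaller sub-cluster is evicted
    elif [ "$size_b" -lt "$size_a" ]; then
        echo "$2"
    elif [ "$(lowest_node "$1")" -lt "$(lowest_node "$2")" ]; then
        echo "$2"        # rule (2): on a tie, the higher node number is evicted
    else
        echo "$1"
    fi
}

pick_evicted 1 2    # the two-node case in this article; prints 2
```

With one node on each side, rule (1) cannot decide, so the tie-break on node number selects node 2, which matches the reboot observed in this experiment.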
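Our case falls under rule (2). Node 2 has finished rebooting, so we log in to the system with the username and password.
3. Locate the cause of the fault
(1) Check the operating system log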
[oracle@rac2 ~]$ su - root
Password:
[root@rac2 ~]# tail -30f /var/log/messages
I reproduced the failure a second time. Because the log is large, I picked out the network-related messages:
Jul 17 20:05:25 rac2 avahi-daemon[3659]: Withdrawing address record for 192.168.2.102 on eth1.
The IP address of eth1 is withdrawn; as a result node 1 evicts node 2, and node 2 reboots automatically
Jul 17 20:05:25 rac2 avahi-daemon[3659]: Leaving mDNS multicast group on interface eth1.IPv4 with address 192.168.2.102.
eth1 leaves the multicast group
Jul 17 20:05:25 rac2 avahi-daemon[3659]: iface.c: interface_mdns_mcast_join() called but no local address available.
Jul 17 20:05:25 rac2 avahi-daemon[3659]: Interface eth1.IPv4 no longer relevant for mDNS.
eth1 is no longer relevant for mDNS
Jul 17 20:09:54 rac2 logger: Oracle Cluster Ready Services starting up automatically.
Oracle Clusterware starts automatically after the reboot
Jul 17 20:09:59 rac2 avahi-daemon[3664]: Registering new address record for fe80::20c:29ff:fe8f:f191 on eth1.
Jul 17 20:09:59 rac2 avahi-daemon[3664]: Registering new address record for 192.168.2.102 on eth1.
New IP address records are registered
Jul 17 20:10:17 rac2 logger: Cluster Ready Services completed waiting on dependencies.
CRS has finished waiting on its dependencies
From the messages above we can roughly tell that a problem with the eth1 interface caused node 2 to reboot. To analyze the problem further, we also need to look at the CRS diagnostic log:
[root@rac2 crsd]# tail -100f $ORA_CRS_HOME/log/rac2/crsd/crsd.log
Abnormal termination by CSS, ret = 8
crsd was terminated abnormally by CSS
2013-07-17 20:11:18.115: [ default][1244944]0CRS Daemon Starting
2013-07-17 20:11:18.116: [ CRSMAIN][1244944]0Checking the OCR device
2013-07-17 20:11:18.303: [ CRSMAIN][1244944]0Connecting to the CSS Daemon
The CRS daemon restarts and reconnects to the CSS daemon
[root@rac2 cssd]# pwd
/u01/crs1020/log/rac2/cssd
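[root@rac2 cssd]# more ocssd.log        <-- view the cssd daemon log
[CSSD]2013-07-17 17:26:18.319 [86104976] >TRACE:   clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_rac2_crs))
The cssd client listener on rac2 is listening on its IPC endpoint again, which suggests cssd was restarted on this node after the reboot
[CSSD]2013-07-17 17:26:19.296 [75615120] >TRACE:   clssnmHandleSync: Acknowledging sync: src[1] srcName[rac1] seq[13] sync[12]
rac2 acknowledges a synchronization request from rac1, so the two nodes are resynchronizing
Taken together, these messages point to an interconnect communication problem: the two nodes could no longer synchronize, could not share information, and a split-brain followed.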
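Picking the relevant lines out of /var/log/messages by hand, as done above, can be narrowed with a filter. The sketch below keeps only lines that mention a given interface as a whole word, or that mention Cluster Ready Services. The function name `interconnect_events` and the choice of patterns are assumptions made for illustration, based on the log lines shown in this article.

```shell
# interconnect_events: filter syslog text on stdin.
# Keeps lines that mention the interface (default eth1) as a whole word,
# plus lines about Cluster Ready Services start-up.
# Assumption: patterns chosen from the log lines quoted in this article.
interconnect_events() {
    iface=${1:-eth1}
    grep -E "(^|[^[:alnum:]])${iface}([^[:alnum:]]|\$)|Cluster Ready Services"
}

# Typical use as root on the evicted node:
#   interconnect_events eth1 < /var/log/messages
```

The whole-word match keeps `eth1.IPv4` and `on eth1.` but skips unrelated names such as `eth10`.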
4. After rebooting, node 2 returns to a normal state automatically
[root@rac2 cssd]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0C:29:8F:F1:87
          inet addr:192.168.1.102  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe8f:f187/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:567 errors:0 dropped:0 overruns:0 frame:0
          TX packets:901 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:65402 (63.8 KiB)  TX bytes:96107 (93.8 KiB)
          Interrupt:185 Base address:0x14a4
eth0:1    Link encap:Ethernet  HWaddr 00:0C:29:8F:F1:87
          inet addr:192.168.1.202  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:185 Base address:0x14a4
eth1      Link encap:Ethernet  HWaddr 00:0C:29:8F:F1:91
          inet addr:192.168.2.102  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe8f:f191/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:76659 errors:0 dropped:0 overruns:0 frame:0
          TX packets:51882 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:61625763 (58.7 MiB)  TX bytes:26779167 (25.5 MiB)
          Interrupt:193 Base address:0x1824
eth2      Link encap:Ethernet  HWaddr 00:0C:29:8F:F1:9B
          inet addr:192.168.203.129  Bcast:192.168.203.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe8f:f19b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:409 errors:0 dropped:0 overruns:0 frame:0
          TX packets:58 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:45226 (44.1 KiB)  TX bytes:9567 (9.3 KiB)
          Interrupt:169 Base address:0x18a4
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:49025 errors:0 dropped:0 overruns:0 frame:0
          TX packets:49025 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:11292111 (10.7 MiB)  TX bytes:11292111 (10.7 MiB)
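Looking at the interface addresses, the IP on the private interface eth1 that was withdrawn is back. This is because node 2 has just rebooted: a reboot reinitializes all network interfaces, so the eth1 interface we disabled is brought up again and regains its IP.
Check the CRS daemon status: all healthy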
[root@rac2 cssd]# crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
Check the status of the cluster, instances, database, listeners and ASM services: all are intact and all have started
[root@rac2 cssd]# crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....B1.inst application    ONLINE    ONLINE    rac1
ora....B2.inst application    ONLINE    ONLINE    rac2
ora....DB1.srv application    ONLINE    ONLINE    rac1
ora.....TAF.cs application    ONLINE    ONLINE    rac1
ora.RACDB.db   application    ONLINE    ONLINE    rac1
ora....SM1.asm application    ONLINE    ONLINE    rac1
ora....C1.lsnr application    ONLINE    ONLINE    rac1
ora.rac1.gsd   application    ONLINE    ONLINE    rac1
ora.rac1.ons   application    ONLINE    ONLINE    rac1
ora.rac1.vip   application    ONLINE    ONLINE    rac1
ora....SM2.asm application    ONLINE    ONLINE    rac2
ora....C2.lsnr application    ONLINE    ONLINE    rac2
ora.rac2.gsd   application    ONLINE    ONLINE    rac2
ora.rac2.ons   application    ONLINE    ONLINE    rac2
ora.rac2.vip   application    ONLINE    ONLINE    rac2
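This concludes the whole RAC fault analysis and resolution process.
III. Simulate an unavailable OCR disk: what does RAC do? The full fault-location process
OCR disk: the OCR disk registers all RAC resource information, including the cluster, databases, instances, listeners, services, ASM, storage, network and so on. Only resources registered in the OCR can be managed by CRS, and the CRS daemons manage resources according to what is recorded there. During operations the OCR contents can be lost, for example when adding or removing nodes, or when adding or removing OCR disks. Next we simulate a loss of the OCR contents and show how to locate and resolve the fault.
Experiment
1. Check the OCR disk and the CRS daemons
(1) Check the OCR disk. CRS can only manage resources properly when the OCR disk is intact.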
[root@rac2 cssd]# ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     104344
         Used space (kbytes)      :       4344
         Available space (kbytes) :     100000
         ID                       : 1752469369
         Device/File Name         : /dev/raw/raw1        <-- the raw device that holds the OCR
                                    Device/File integrity check succeeded
                                    Device/File not configured
         Cluster registry integrity check succeeded        <-- the integrity check passed
(2) Check the CRS status
[root@rac2 cssd]# crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
All cluster daemons are healthy
(3) Stop the CRS daemons
[root@rac2 sysconfig]# crsctl stop crs
Stopping resources.
Successfully stopped CRS resources
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.
The shutdown request succeeded
[root@rac2 sysconfig]# crsctl check crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM
2. As root, export the OCR contents to take an OCR backup
[root@rac2 sysconfig]# ocrconfig -export /home/oracle/ocr.exp
[oracle@rac2 ~]$ pwd
/home/oracle
[oracle@rac2 ~]$ ll
total 108
-rw-r--r-- 1 root   root     98074 Jul 18 11:20 ocr.exp        <-- the OCR export file has been created
3. Restart the CRS daemons
[root@rac2 sysconfig]# crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly
Check the CRS status
[root@rac2 sysconfig]# crsctl check crs        <-- good: after the restart everything is back to normal
CSS appears healthy
CRS appears healthy
EVM appears healthy
4. Use dd to overwrite the OCR raw device with zero bytes, simulating loss of the OCR contents
[root@rac2 sysconfig]# dd if=/dev/zero of=/dev/raw/raw1 bs=1024 count=102400
102400+0 records in
102400+0 records out
104857600 bytes (105 MB) copied, 76.7348 seconds, 1.4 MB/s
Command explanation
dd                   copies a file in blocks of a given size, optionally applying conversions while copying
if=/dev/zero         the input file: the zero device
of=/dev/raw/raw1     the output file: the OCR disk
bs=1024              the block size: 1024 bytes, i.e. 1 KiB
count=102400         the number of blocks to copy: 102400
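The byte total that dd reports follows directly from these two parameters: bytes written = bs × count. A quick check with shell arithmetic (the helper name `dd_bytes` is hypothetical):

```shell
# dd_bytes: bytes written by dd for a given block size (bs) and block count.
dd_bytes() {
    echo $(( $1 * $2 ))
}

bytes=$(dd_bytes 1024 102400)
# Prints: 104857600 bytes = 100 MiB
echo "$bytes bytes = $(( bytes / 1024 / 1024 )) MiB"
```

This matches the "104857600 bytes (105 MB) copied" line in the dd output above; dd reports decimal megabytes, which is why it shows 105 MB rather than 100 MiB.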
5. Check the OCR disk status again
[root@rac2 sysconfig]# ocrcheck
PROT-601: Failed to initialize ocrcheck
Check the CRS status
[root@rac2 sysconfig]# crsctl check crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
EVM appears healthy
It is no surprise that CRS fails: once the recorded resource information is lost, there is nothing left for it to manage.
6. Restore the OCR contents with import
[root@rac2 crs1020]# ocrconfig -import /home/oracle/ocr.exp
7. Finally, check the OCR disk status
Fortunately, the restore went through without a hitch
[root@rac2 crs1020]# ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     104344
         Used space (kbytes)      :       4348
         Available space (kbytes) :      99996
         ID                       :  425383787
         Device/File Name         : /dev/raw/raw1
                                    Device/File integrity check succeeded
                                    Device/File not configured
         Cluster registry integrity check succeeded
8. Check the CRS daemons
[root@rac2 crs1020]# crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
Very good: once the OCR disk was restored, the CRS daemons came back up automatically
[root@rac2 crs1020]# crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....B1.inst application    ONLINE    ONLINE    rac1
ora....B2.inst application    ONLINE    OFFLINE
ora....DB1.srv application    ONLINE    ONLINE    rac1
ora.....TAF.cs application    ONLINE    ONLINE    rac1
ora.RACDB.db   application    ONLINE    ONLINE    rac1
ora....SM1.asm application    ONLINE    ONLINE    rac1
ora....C1.lsnr application    ONLINE    ONLINE    rac1
ora.rac1.gsd   application    ONLINE    ONLINE    rac1
ora.rac1.ons   application    ONLINE    ONLINE    rac1
ora.rac1.vip   application    ONLINE    ONLINE    rac1
ora....SM2.asm application    ONLINE    OFFLINE
ora....C2.lsnr application    ONLINE    OFFLINE
ora.rac2.gsd   application    ONLINE    OFFLINE
ora.rac2.ons   application    ONLINE    OFFLINE
ora.rac2.vip   application    ONLINE    ONLINE    rac2
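In a listing like the one above, the resources that are not ONLINE are easy to miss by eye. A minimal sketch that extracts them from `crs_stat -t` output follows. It assumes the five-column layout shown in this article (Name, Type, Target, State, Host) with two header lines; the function name `offline_resources` is hypothetical.

```shell
# offline_resources: read `crs_stat -t` output on stdin and print the name
# of every resource whose State (4th column) is not ONLINE.
# Assumption: two header lines, then Name Type Target State [Host] rows.
offline_resources() {
    awk 'NR > 2 && NF >= 4 && $4 != "ONLINE" { print $1 }'
}

# Typical use on a cluster node (needs crs_stat in PATH):
#   crs_stat -t | offline_resources
```

Applied to the listing above, this would print the five rac2 resources whose State is OFFLINE (the instance, ASM, listener, gsd and ons).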
Several rac2 resources were still OFFLINE, so I restarted the CRS cluster services
[root@rac2 init.d]# ./init.crs stop
Shutting down Oracle Cluster Ready Services (CRS):
Stopping resources.
Successfully stopped CRS resources
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.
Shutdown has begun. The daemons should exit soon.
[root@rac2 init.d]# crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
[root@rac2 init.d]# ./init.crs start
Startup will be queued to init within 90 seconds.
Now everything has recovered
[oracle@rac2 ~]$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....B1.inst application    ONLINE    ONLINE    rac1
ora....B2.inst application    ONLINE    ONLINE    rac2
ora....DB1.srv application    ONLINE    ONLINE    rac2
ora.....TAF.cs application    ONLINE    ONLINE    rac2
ora.RACDB.db   application    ONLINE    ONLINE    rac2
ora....SM1.asm application    ONLINE    ONLINE    rac1
ora....C1.lsnr application    ONLINE    ONLINE    rac1
ora.rac1.gsd   application    ONLINE    ONLINE    rac1
ora.rac1.ons   application    ONLINE    ONLINE    rac1
ora.rac1.vip   application    ONLINE    ONLINE    rac1
ora....SM2.asm application    ONLINE    ONLINE    rac2
ora....C2.lsnr application    ONLINE    ONLINE    rac2
ora.rac2.gsd   application    ONLINE    ONLINE    rac2
ora.rac2.ons   application    ONLINE    ONLINE    rac2
ora.rac2.vip   application    ONLINE    ONLINE    rac2
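IV. Simulate an unavailable voting disk: what does RAC do? The full fault-location process
Voting disk: when a split-brain occurs, the voting disk is used to decide which node to evict. This is split-brain at the cluster layer.
Control file: if the split-brain occurs at the instance layer, the control file is used to decide which node to evict.
Voting disk redundancy strategies:
(1) External redundancy: the voting disk is protected by an external mechanism
(2) Oracle's own internal redundancy: implemented by adding voting disk mirrors
Experiment
1. Check the voting disk status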
[oracle@rac1 ~]$ crsctl query css votedisk
 0.     0    /dev/raw/raw2        <-- raw device 2 is the voting disk
located 1 votedisk(s).            <-- only one voting disk is configured
2. Stop the CRS cluster
[root@rac1 sysconfig]# crsctl stop crs
Stopping resources.
Successfully stopped CRS resources
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.
3. Add a voting disk to provide internal redundancy
crsctl add css votedisk /dev/raw/raw3 -force        <-- adds raw device raw3 as an additional voting disk
After it is added, Oracle copies the contents of the original voting disk to the new one
4. Check the voting disk status again
crsctl query css votedisk
5. Start the CRS cluster
[root@rac2 sysconfig]# crsctl start crs
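Attempting to start CRS stack
The CRS stack will be started shortly
Summary: if the voting disk /dev/raw/raw2 is damaged, its mirror /dev/raw/raw3 can take its place so that RAC keeps providing service. Note that Clusterware requires each node to access more than half of the configured voting disks, so in practice an odd number (three or more) is used; two disks alone do not survive the loss of one.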
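How much redundancy a voting disk configuration gives can be read off the "located N votedisk(s)." line. The sketch below parses that line and prints how many voting disk failures can be tolerated, assuming the strict-majority rule (a node must still access more than half of the configured voting disks), which gives (N - 1) / 2 rounded down. The function names are hypothetical.

```shell
# votedisk_count: read `crsctl query css votedisk` output on stdin and
# print N from the "located N votedisk(s)." line.
votedisk_count() {
    sed -n 's/^located \([0-9][0-9]*\) votedisk(s)\..*$/\1/p'
}

# failures_tolerated: voting disk losses that still leave a strict majority.
# Assumption: majority rule, so the result is (N - 1) / 2, rounded down.
failures_tolerated() {
    n=$(votedisk_count)
    echo $(( (${n:-1} - 1) / 2 ))
}

# Typical use on a cluster node (needs crsctl in PATH):
#   crsctl query css votedisk | failures_tolerated
```

Under this rule one or two voting disks tolerate no failures, and three tolerate one, which is why an odd count is the usual choice.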
Source: Internet
Reposted from: https://blog.51cto.com/linuxzkq/1583890