由 BUG 引发 ORA-15064 进而导致数据库实例意外中止

Linux大全评论6.8K views阅读模式

今天早上刚到公司一会,就接到南京客户打来的电话,说他们的核心会员数据库宕机了,让我远程帮忙查看一下。还没来得急看报纸的我赶紧打开电脑,远程连接到客户的服务器进行诊断。

客户的生产环境是AIX 6.1上的Oracle 11.2.0.3.0,在凌晨1:40分左右发生的故障。
为了不涉及泄露客户隐私,把数据库实例名进行了替换。
下面看具体的分析:  
1. 数据库alert.log分析   Mon Jan 05 01:40:50 2015
WARNING: ASM communication error: op 18 state 0x50 (3113)
ERROR: slave communication error with ASM
NOTE: Deferred communication with ASM instance Errors in file /u01/app/oracle/diag/rdbms/test5/test5/trace/
test5_ora_16581034.trc: ORA-03113: end-of-file on communication channel Process ID:  Session ID: 288 Serial number: 5649 NOTE: deferred map free for map id 4422 Mon Jan 05 01:40:55 2015
NOTE: ASMB terminating Mon Jan 05 01:40:55 2015     ***********************************************************************   Fatal NI connect error 12170.     VERSION INFORMATION: TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production   Time: 05-JAN-2015 01:40:55   Tracing not turned on.   Tns error struct:     ns main err code: 12535     TNS-12535: TNS:operation timed out     ns secondary err code: 12606     nt main err code: 0     nt secondary err code: 0     nt OS err code: 0   Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=test2)(PORT=64460)) WARNING: inbound connection timed out (ORA-3136) Mon Jan 05 01:40:55 2015     ***********************************************************************   Fatal NI connect error 12170.     VERSION INFORMATION: TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production   Time: 05-JAN-2015 01:40:55   Tracing not turned on.   Tns error struct:     ns main err code: 12535     TNS-12535: TNS:operation timed out     ns secondary err code: 12606     nt main err code: 0     nt secondary err code: 0     nt OS err code: 0 Mon Jan 05 01:40:55 2015     ***********************************************************************   Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=test2)(PORT=64530)) Errors in file /u01/app/oracle/diag/rdbms/test5/test5/trace/
test5_asmb_5898342.trc: ORA-15064: communication failure with ASM instance ORA-03135: connection lost contact Process ID:  Session ID: 216 Serial number: 7   Fatal NI connect error 12170. WARNING: inbound connection timed out (ORA-3136) Errors in file /u01/app/oracle/diag/rdbms/test5/test5/trace/test5_asmb_5898342.trc: ORA-15064: communication failure with ASM instance ORA-03135: connection lost contact Process ID:  Session ID: 216 Serial number: 7     VERSION INFORMATION: TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production   Time: 05-JAN-2015 01:40:55   Tracing not turned on.   Tns error struct:
ASMB (ospid: 5898342): terminating the instance due to error 15064     ns main err code: 12535     TNS-12535: TNS:operation timed out Mon Jan 05 01:40:55 2015     ***********************************************************************     ns secondary err code: 12606   Fatal NI connect error 12170.     nt main err code: 0     VERSION INFORMATION: TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production     nt secondary err code: 0   Time: 05-JAN-2015 01:40:55   Tracing not turned on.     nt OS err code: 0 Mon Jan 05 01:40:56 2015 System state dump requested by (instance=1, osid=5898342 (ASMB)), summary=[abnormal instance termination]. System State dumped to trace file /u01/app/oracle/diag/rdbms/test5/test5/trace/
test5_diag_20578640.trc Mon Jan 05 01:40:59 2015
ORA-1092 : opitsk aborting process Mon Jan 05 01:41:00 2015
ORA-1092 : opitsk aborting process Instance terminated by ASMB, pid = 5898342
Mon Jan 05 08:46:45 2015
Starting ORACLE instance (normal)   由于在故障点之前也会报一些ORA-3136的错误,故这里不做说明,与本次错误干系不大。从alert日志中发现,数据库实例在1:40:55时,被ASMB(进程号:5898342)进程终止,直到8点46分左右才开始启动(客户自己手动重启)。  
2.相关trace文件分析  
test5_ora_16581034.trc:   *** 2015-01-05 01:40:50.900 *** SESSION ID:(424.19913) 2015-01-05 01:40:50.900 *** CLIENT ID:() 2015-01-05 01:40:50.900 *** SERVICE NAME:(SYS$USERS) 2015-01-05 01:40:50.900 *** MODULE NAME:(backup incr datafile) 2015-01-05 01:40:50.900 *** ACTION NAME:(0000097 STARTED16) 2015-01-05 01:40:50.900  
WARNING: ASM communication error: op 18 state 0x50 (3113)
ERROR: slave communication error with ASM
NOTE: Deferred communication with ASM instance
ORA-03113: end-of-file on communication channel Process ID:  Session ID: 288 Serial number: 5649   基本与alert日志中报的错误相同,就是获知在1:40分这个故障点,数据库与ASM实例的连接中断了  
test5_asmb_5898342.trc:   *** 2015-01-05 01:40:55.870 NOTE: ASMB terminating   *** 2015-01-05 01:40:55.922
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact Process ID:  Session ID: 216 Serial number: 7
error 15064 detected in background process
ORA-15064: communication failure with ASM instance
ORA-03135: connection lost contact Process ID:  Session ID: 216 Serial number: 7 kjzduptcctx: Notifying DIAG for crash event ----- Abridged Call Stack Trace ----- ksedsts()+360<-kjzdssdmp()+240<-kjzduptcctx()+228<-kjzdicrshnfy()+100<-ksuitm()+5124<-ksbrdp()+4508<-opirip()+1624<-opidrv()+608<-sou2o()+136<-opimai_real()+188<-ssthrdmain()+268<-main()+204<-__start()+112  ----- End of Abridged Call Stack Trace -----   *** 2015-01-05 01:40:55.954
ASMB (ospid: 5898342): terminating the instance due to error 15064 ksuitm: waiting up to [5] seconds before killing DIAG(20578640)   从这个trace文件可以获知,由于遭遇ora-15064错误,ASMB进程终止了数据库实例。  
test5_diag_20578640.trc:   Session Wait History:         elapsed time of 1.543354 sec since last wait      
0: waited for 'ASM background timer'         =0x0, =0x0, =0x0         wait_id=881966 seq_num=30034 snap_id=1         wait times: snap=0.000422 sec, exc=0.000422 sec, total=0.000422 sec         wait times: max=infinite         wait counts: calls=0 os=0         occurred after 0.000016 sec of elapsed time    
  1: waited for 'ASM file metadata operation'         msgop=0x2, locn=0x3, =0x0         wait_id=881964 seq_num=30033 snap_id=2         wait times: snap=0.000082 sec, exc=0.000127 sec, 
total=4 min 4 sec         wait times: max=infinite         wait counts: calls=0 os=0         occurred after 0.000000 sec of elapsed time      
2: waited for 'latch: shared pool'         address=0x700000000107048, number=0x133, tries=0x0         wait_id=881965 seq_num=30032 snap_id=1         wait times: snap=4 min 4 sec, exc=4 min 4 sec, 
total=4 min 4 sec         wait times: max=infinite         wait counts: calls=0 os=0         occurred after 0.000000 sec of elapsed time      
3: waited for 'ASM file metadata operation'         msgop=0x2, locn=0x3, =0x0         wait_id=881964 seq_num=30031 snap_id=1         wait times: snap=0.000045 sec, exc=0.000045 sec, total=0.000045 sec         wait times: max=infinite         wait counts: calls=0 os=0         occurred after 0.000019 sec of elapsed time      
4: waited for 'ASM background timer'         =0x0, =0x0, =0x0         wait_id=881963 seq_num=30030 snap_id=1         wait times: snap=3.285876 sec, exc=3.285876 sec, total=3.285876 sec         wait times: max=infinite         wait counts: calls=0 os=0         occurred after 0.000011 sec of elapsed time      
5: waited for 'ASM file metadata operation'         msgop=0xc, locn=0x3, =0x0         wait_id=881962 seq_num=30029 snap_id=1         wait times: snap=0.000034 sec, exc=0.000034 sec, total=0.000034 sec         wait times: max=infinite         wait counts: calls=0 os=0         occurred after 0.000019 sec of elapsed time    
 6: waited for 'ASM background timer'         =0x0, =0x0, =0x0         wait_id=881961 seq_num=30028 snap_id=1         wait times: snap=5.000528 sec, exc=5.000528 sec, total=5.000528 sec         wait times: max=infinite         wait counts: calls=0 os=0         occurred after 0.000009 sec of elapsed time      
7: waited for 'ASM file metadata operation'         msgop=0x0, locn=0x3, =0x0         wait_id=881960 seq_num=30027 snap_id=1         wait times: snap=0.000003 sec, exc=0.000003 sec, total=0.000003 sec         wait times: max=infinite         wait counts: calls=0 os=0         occurred after 0.000016 sec of elapsed time      
8: waited for 'ASM background timer'         =0x0, =0x0, =0x0         wait_id=881959 seq_num=30026 snap_id=1         wait times: snap=5.000493 sec, exc=5.000493 sec, total=5.000493 sec         wait times: max=infinite         wait counts: calls=0 os=0         occurred after 0.000011 sec of elapsed time      
9: waited for 'ASM file metadata operation'         msgop=0x0, locn=0x3, =0x0         wait_id=881958 seq_num=30025 snap_id=1         wait times: snap=0.000002 sec, exc=0.000002 sec, total=0.000002 sec         wait times: max=infinite         wait counts: calls=0 os=0         occurred after 0.000017 sec of elapsed time   sample interval: 1 sec, max history 120 sec     ---------------------------------------------------       [120 samples,                                        01:38:56 - 01:40:55]         waited for 'latch: shared pool', seq_num: 30032           p1: 'address'=0x700000000107048           p2: 'number'=0x133           p3: 'tries'=0x0          
time_waited: >= 119 sec (still in wait)     ---------------------------------------------------     Sampled Session History Summary:       longest_non_idle_wait: 'latch: shared pool'       [120 samples, 01:38:56 - 01:40:55]          
time_waited: >= 119 sec (still in wait)   在ASMB进程的的dump文件中,
发现有2个4分4秒的等待,waited for 'ASM file metadata operation'和
waited for 'latch: shared pool',而其他的相关操作都是只有几微秒,甚至0秒的  
3.ASM实例的alert日志:   Mon Jan 05 01:40:56 2015     ***********************************************************************
TNS-12537: TNS:connection closed   Fatal NI connect error 12537, connecting to:  (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))     ns secondary err code: 12560     VERSION INFORMATION: TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production     nt main err code: 0   Time: 05-JAN-2015 01:40:56     nt secondary err code: 0   Tracing not turned on.     nt OS err code: 0   Tns error struct:     ns main err code: 12537    
TNS-12537: TNS:connection closed     ns secondary err code: 12560     nt main err code: 0     nt secondary err code: 0     nt OS err code: 0 Mon Jan 05 04:42:23 2015
NOTE: ASMB process exiting due to lack of ASM file activity for 305 seconds Mon Jan 05 08:46:53 2015 NOTE: client test5:test5 registered, osid 23593052, mbr 0x0 Mon Jan 05 08:47:24 2015 Starting background process ASMB Mon Jan 05 08:47:24 2015 ASMB started with pid=26, OS id=20185292  Mon Jan 05 08:47:25 2015 NOTE: client +ASM:+ASM registered, osid 16384406, mbr 0x1   在故障点之前,ASM的alert日志中未报过TNS错误,但故障点的时候,ASM中也出现了TNS连接关闭的错误,因此,也不排除网络因素引起的ASM实例故障。4点多的时候有一个
NOTE: ASMB process exiting due to lack of ASM file activity for 305 seconds,根据文档
NOTE ASMB process exiting due to lack of ASM file activity (文档 ID 754110.1)”,这个NOTE可以忽略。到了8:46的时候,ASM实例随着数据库的重启也正常启动了   最终,找到了一篇MOS文档,是和这次遇到的故障相关的,应该就是
Bug 13914613导致的这个问题:  

Bug 13914613  Excessive time holding shared pool latch in kghfrunp with auto memory management

 This note gives a brief overview of bug 13914613. 


 The content was last updated on: 17-DEC-2014


 
Click here for details of each of the sections below.

Affects:

Product (Component) Oracle Server (Rdbms)
Range of versions believed to be affected (Not specified)
Versions confirmed as being affected
  • 11.2.0.3
  • 11.2.0.2
  • 10.2.0.5
Platforms affected Generic (all / most platforms affected)

Fixed:

The fix for 13914613 is first included in
  • 12.1.0.1 (Base Release)
  • 11.2.0.4 (Server Patch Set)
  • 11.2.0.3.6 Database Patch Set Update
  • 11.2.0.3 Bundle Patch 12 for Exadata Database
  • 11.2.0.3 Patch 18 on Windows Platforms
  • 11.2.0.2 Patch 25 on Windows Platforms
  • 来源: <https://support.oracle.com/epmos/faces/DocumentDisplay?_afrLoop=182220034715943&id=13914613.8&_adf.ctrl-state=14dpmawvgr_260>  

 

Description

A session may spend excessive time holding the shared pool latch under
kghfrunp when auto memory management is used. This can ultimately 
result in an instance crash due to other sessions holding critical 
resources too long. 
eg: database crash due to ORA-240 and ORA-15064
 
Rediscovery Notes
 Session wait chains show "latch: shared pool" waits
 Wait times for shared pool latch over 10 seconds for the same holder
 Call stacks of shared pool latch holder show kghfrunp as a currently 
  executing function.  
 Auto memory management is in use such that "duration" subheaps are used
 eg: ASMM (sga_target) or AMM (memory_target) being used.
 
Workaround
 Set init.ora parameter _enable_shared_pool_durations=false
 
Getting a Fix
 Use one of the "Fixed" versions listed above
 (for Patch Sets / bundles use the latest version available as
  contents are cumulative - the "Fixed" version listed above is
  the first version where the fix is included)

 
分析了产生故障的原因,我给客户的解决方案就是以下两条:  
1.设置隐含参数"_enable_shared_pool_durations"=false
2.升级到11.2.0.4的patchset 

在CentOS 5.5 i386 上安装 Oracle 10G XE  http://www.linuxidc.com/Linux/2011-02/31928p2.htm

Linux下Oracle 11G XE 安装笔记  http://www.linuxidc.com/Linux/2012-08/66974.htm

在CentOS 6.4下安装Oracle 11gR2(x64) http://www.linuxidc.com/Linux/2014-02/97374.htm

Oracle 11gR2 在VMWare虚拟机中安装步骤 http://www.linuxidc.com/Linux/2013-09/89579p2.htm

Debian 下 安装 Oracle 11g XE R2 http://www.linuxidc.com/Linux/2014-03/98881.htm

企鹅博客
  • 本文由 发表于 2019年9月8日 13:16:16
  • 转载请务必保留本文链接:https://www.qieseo.com/190790.html

发表评论