临近中午,收到某客户生产库的故障请求:数据库出现了ORA-00600的错误!现在数据库启不来,影响业务的运行。
故障场景描述:这是我之前服务过的一家客户,是一套跑在Linux机器上的11gR2单实例数据库,由于服务器异常断电导致的故障!印象较为深刻的是就在2011年2月14日情人节那天,这个库也曾出现过另外一则ORA-00600的错误。较为不幸的是,就在今天又遭遇到这个ORA-600的错误!
下面记录一下这次故障的处理过程:
1 启库时遭遇ORA-600的错误:
[oracle@os5 ~]$ uname -rm 2.6.18-128.el5PAE i686 [oracle@os5 ~]$ ps -ef|grep asm_ oracle 9099 9068 0 10:54 pts/8 00:00:00 grep asm_ [oracle@os5 ~]$ ps -ef|grep ora_ oracle 4797 1 0 10:52 ? 00:00:00 ora_pmon_database oracle 4799 1 0 10:52 ? 00:00:00 ora_vktm_database oracle 4803 1 0 10:52 ? 00:00:00 ora_gen0_database oracle 4805 1 0 10:52 ? 00:00:00 ora_diag_database oracle 4807 1 0 10:52 ? 00:00:00 ora_dbrm_database oracle 4809 1 0 10:52 ? 00:00:00 ora_psp0_database oracle 4811 1 0 10:52 ? 00:00:01 ora_dia0_database oracle 4813 1 0 10:52 ? 00:00:00 ora_mman_database oracle 4815 1 0 10:52 ? 00:00:00 ora_dbw0_database oracle 4817 1 0 10:52 ? 00:00:00 ora_lgwr_database oracle 4819 1 0 10:52 ? 00:00:00 ora_ckpt_database oracle 4821 1 0 10:52 ? 00:00:00 ora_smon_database oracle 4823 1 0 10:52 ? 00:00:00 ora_reco_database oracle 4825 1 0 10:52 ? 00:00:00 ora_mmon_database oracle 4827 1 0 10:52 ? 00:00:00 ora_mmnl_database oracle 4829 1 0 10:52 ? 00:00:00 ora_d000_database oracle 4831 1 0 10:52 ? 00:00:00 ora_s000_database oracle 9105 9068 0 11:55 pts/8 00:00:00 grep ora_ [oracle@os5 ~]$ sqlplus / as sysdba SQL*Plus: Release 11.2.0.1.0 Production on Wed Nov 21 10:54:16 2012 Copyright (c) 1982, 2009, Oracle. All rights reserved. Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production With the Partitioning, OLAP, Data Mining and Real Application Testing options SQL> alter database open; alter database open * ERROR at line 1: ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [3464], [12432], [12534], [], [], [], [], [], [], [] SQL> select open_mode from v$database; OPEN_MODE -------------------- MOUNTED SQL>
从上可以看到,当前数据库停留在MOUNT状态,没法OPEN,然后,就用RMAN重新做了个全备,并手工拷贝了联机日志文件,保留现场!
2 接下来,重新尝试以RESETLOGS方式也打不开,尝试RECOVER依然报错!
SQL> alter database open resetlogs; alter database open resetlogs * ERROR at line 1: ORA-01139: RESETLOGS option only valid after an incomplete database recovery SQL> recover database; ORA-00283: recovery session canceled due to errors ORA-00264: no recovery required SQL> alter database open; alter database open * ERROR at line 1: ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [3464], [12432], [12534], [], [], [], [], [], [], [] SQL>
3 那么接下来,看看ALERT日志都记录哪些有用的信息:
Wed Nov 21 10:59:16 2012 alter database open Beginning crash recovery of 1 threads parallel recovery started with 2 processes Started redo scan Completed redo scan read 1642 KB redo, 0 data blocks need recovery Errors in file /oracle/ora11gR2/diag/rdbms/database/database/trace/database_ora_4848.trc (incident=129790): ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [3464], [12432], [12534], [], [], [], [], [], [], [] Incident details in: /oracle/ora11gR2/diag/rdbms/database/database/incident/incdir_129790/database_ora_4848_i129790.trc Aborting crash recovery due to error 600 Errors in file /oracle/ora11gR2/diag/rdbms/database/database/trace/database_ora_4848.trc: ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [3464], [12432], [12534], [], [], [], [], [], [], [] Errors in file /oracle/ora11gR2/diag/rdbms/database/database/trace/database_ora_4848.trc: ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [3464], [12432], [12534], [], [], [], [], [], [], [] ORA-600 signalled during: alter database open... Wed Nov 21 10:59:17 2012 Sweep [inc][129790]: completed Sweep [inc2][129790]: completed Wed Nov 21 10:59:17 2012 Trace dumping is performing id=[cdmp_20121121105917] Wed Nov 21 11:07:50 2012 db_recovery_file_dest_size of 3852 MB is 0.00% used. This is a user-specified limit on the amount of space that will be used by this database for recovery-related files, and does not reflect the amount of space available in the underlying filesystem or ASM diskgroup.
从alert里,一直看到ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [3464], [12432], [12534], [], [], [], [], [], [], []的错误,还有一些提示就是Errors in file /oracle/ora11gR2/diag/rdbms/database/database/trace/database_ora_4848.trc。
4 就去看看上述的TRACE文件:
*** 2012-11-21 10:59:16.123 Successfully allocated 2 recovery slaves Using 66 overflow buffers per recovery slave Thread 1 checkpoint: logseq 3464, block 4945, scn 203336710 cache-low rba: logseq 3464, block 9148 on-disk rba: logseq 3464, block 12534, scn 203340512 start recovery at logseq 3464, block 9148, scn 0 *** 2012-11-21 10:59:16.144 Started writing zeroblks thread 1 seq 3464 blocks 12432-12439 *** 2012-11-21 10:59:16.144 Completed writing zeroblks thread 1 seq 3464 ==== Redo read statistics for thread 1 ==== Total physical reads (from disk and memory): 4096Kb -- Redo read_disk statistics -- Read rate (ASYNC): 1642Kb in 0.02s => 80.18 Mb/sec Longest record: 14Kb, moves: 0/1953 (0%) Change moves: 22/192 (11%), moved: 0Mb Longest LWN: 445Kb, moves: 0/733 (0%), moved: 0Mb Last redo scn: 0x0000.0c1eba6c (203340396) ---------------------------------------------- ----- Recovery Hash Table Statistics --------- Hash table buckets = 32768 Longest hash chain = 0 Average hash chain = 0/0 = 0.0 Max compares per lookup = 0 Avg compares per lookup = 0/0 = 0.0 ---------------------------------------------- WARNING! Crash recovery of thread 1 seq 3464 is ending at redo block 12432 but should not have ended before redo block 12534 Incident 129790 created, dump file: /oracle/ora11gR2/diag/rdbms/database/database/incident/incdir_129790/database_ora_4848_i129790.trc ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [3464], [12432], [12534], [], [], [], [], [], [], [] ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [3464], [12432], [12534], [], [], [], [], [], [], [] ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [3464], [12432], [12534], [], [], [], [], [], [], []
5 结合ALERT里的错误ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [3464], [12432], [12534], 和TRACE里的错误提示WARNING! Crash recovery of thread 1 seq 3464 is ending at redo block 12432 but should not have ended before redo block 12534 以及查询MetaLink文档ID 1299564.1获取的指导性信息,应该是由于服务器异常断电,导致LGWR写联机日志文件时失败,下次重新启动数据库时,需要做实例级恢复,而又无法从联机日志文件里获取到这些redo信息,因为上次断电时,写日志失败了。
那么ORA-00600的错误里,那几个参数 [1], [3464], [12432], [12534]又表示什么呢? 从EYGLE的网站上查询到类似的案例信息,结合本故障场景分析,原来是实例需要恢复日志序列号为3464的联机日志文件,需要恢复到编号为12534的日志块,而实际上只能恢复到第12432个日志块儿,所以库就启不来了。不过,从当前日志文件信息,可以看到,当前日志组的确是3464:
SQL> alter session set nls_date_format='yyyy/mm/dd hh24:mi:ss'; Session altered. SQL> select * from v$log; GROUP# THREAD# SEQUENCE# BYTES BLOCKSIZE MEMBERS ARC STATUS FIRST_CHANGE# FIRST_TIME NEXT_CHANGE# NEXT_TIME ---------- ---------- ---------- ---------- ---------- ---------- --- ---------------- ------------- ------------------- ------------ ------------------- 1 1 3463 52428800 512 1 YES INACTIVE 203285629 2012/11/21 03:00:32 203335249 2012/11/21 08:59:46 3 1 3462 52428800 512 1 YES INACTIVE 203238520 2012/11/20 21:00:52 203285629 2012/11/21 03:00:32 2 1 3464 52428800 512 1 NO CURRENT 203335249 2012/11/21 08:59:46 2.8147E+14 SQL>
6 参照MetaLink文档ID 1299564.1的方案来恢复数据库:
SQL> recover database until cancel using backup controlfile; ORA-00279: change 203360397 generated at 11/21/2012 09:21:51 needed for thread 1 ORA-00289: suggestion : /home/oracle/archive_no_delete/ARC_743097220_0000003464_1.arc ORA-00280: change 203360397 for thread 1 is in sequence #3464 Specify log: {=suggested | filename | AUTO | CANCEL} /oracle/ora11gR2/oradata/database/redo02.log Log applied. Media recovery complete. SQL> alter database open ; alter database open * ERROR at line 1: ORA-01589: must use RESETLOGS or NORESETLOGS option for database open SQL> alter database open resetlogs; Database altered. SQL>
至此,这个库是成功恢复,并且启动了。恢复之后,再次对数据库做了备份。看到Alert日志信息:
Wed Nov 21 12:43:44 2012 ALTER DATABASE RECOVER LOGFILE '/oracle/ora11gR2/oradata/database/redo02.log' Media Recovery Log /oracle/ora11gR2/oradata/database/redo02.log Wed Nov 21 12:43:44 2012 Incomplete recovery applied all redo ever generated. Recovery completed through change 203360398 time 11/21/2012 09:21:51 Media Recovery Complete (database) Completed: ALTER DATABASE RECOVER LOGFILE '/oracle/ora11gR2/oradata/database/redo02.log' Wed Nov 21 12:43:55 2012 alter database open Errors in file /oracle/ora11gR2/diag/rdbms/database/database/trace/database_ora_10698.trc: ORA-01589: must use RESETLOGS or NORESETLOGS option for database open ORA-1589 signalled during: alter database open ... Wed Nov 21 12:44:37 2012 alter database open resetlogs Archived Log entry 3382 added for thread 1 sequence 3463 ID 0xd9842c95 dest 1: Archived Log entry 3383 added for thread 1 sequence 3464 ID 0xd9842c95 dest 1: Archived Log entry 3384 added for thread 1 sequence 3462 ID 0xd9842c95 dest 1: RESETLOGS after complete recovery through change 203360398 Resetting resetlogs activation ID 3649318037 (0xd9842c95) Wed Nov 21 12:44:43 2012 Setting recovery target incarnation to 4 Wed Nov 21 12:44:43 2012 Assigning activation ID 3706166088 (0xdce79b48) LGWR: STARTING ARCH PROCESSES Wed Nov 21 12:44:43 2012 ARC0 started with pid=24, OS id=11236 ARC0: Archival started LGWR: STARTING ARCH PROCESSES COMPLETE ARC0: STARTING ARCH PROCESSES Wed Nov 21 12:44:44 2012 ARC1 started with pid=25, OS id=11238 Wed Nov 21 12:44:44 2012 ARC2 started with pid=26, OS id=11240 Thread 1 opened at log sequence 1 Current log# 1 seq# 1 mem# 0: /oracle/ora11gR2/oradata/database/redo01.log Successful open of redo thread 1 MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set Wed Nov 21 12:44:44 2012 SMON: enabling cache recovery ARC1: Archival started ARC2: Archival started Wed Nov 21 12:44:44 2012 ARC3 started with pid=29, OS id=11242 ARC1: Becoming the 'no FAL' ARCH ARC1: Becoming the 'no SRL' ARCH ARC2: Becoming the heartbeat ARCH Successfully onlined Undo Tablespace 2. Dictionary check beginning Dictionary check complete Verifying file header compatibility for 11g tablespace encryption.. Verifying 11g file header compatibility for tablespace encryption completed
从Metalink上看到,目前这个ORA-600的错误应该只有在Oracle 11g上才有的。
评论 (2)
Lonion| 2012年11月23日
从上可以看到,当前数据库停留在MOUNT状态,没法OPEN,然后,就用RMAN重新做了个全备,并手工拷贝了联机日志文件,保留现场!
这个习惯挺good的。
BTW:最后,数据库开启后有数据丢失的现象吗?
admin| 2012年11月26日
经客户确认,没有数据丢失现象。
不过,从理论上来讲,应该会有少量丢失数据的:
第1:数据库是RESTLOGS方式打开的;
第2:该案例中,数据库在做实例恢复时,需要应用到当前日志组3464的第12534个日志块儿,事实上最终只应用到第12432个块上。
在做故障诊断时,保留现场很重要,所以我就做了个RMAN备份。