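This post is not clickbait; it describes a real case from my day-to-day work.
Scenario: a Windows Server 2008 R2 x64 system hosting an Oracle 10.2.0.5.0 physical standby database, with everything running normally. At the customer's request, the machine name (hostname) of this server had to be changed.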
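Steps: after working out a plan for the change, I started the operation [including stopping the standby database and the ASM instance, and editing the hosts file, tnsnames.ora, and so on]. Once the customer's IT staff had changed the machine name and rebooted the server, things went badly wrong: the machine would not start up properly. Clients could still ping the server, but Remote Desktop connections failed.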
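What now? After some analysis, the most likely culprit appeared to be the OracleCSService service, whose startup type is Automatic. In other words, the service is started as part of Windows startup. The service is also hard-coded and is presumably "bound" to the machine name. Because the machine name had changed, OracleCSService could not start properly, which in turn kept the operating system from starting up properly.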
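As a precaution before a rename, one could check whether the service is set to start automatically and, if so, switch it to manual start. The following is only a minimal sketch and was not part of the procedure actually carried out here. It assumes the English-language output layout of `sc qc`; the helper names parse_start_type and manual_start_command are hypothetical, and the sample text is illustrative rather than captured from this server.

```python
import re


def parse_start_type(sc_qc_output):
    """Return the START_TYPE name (e.g. 'AUTO_START') found in `sc qc` output, or None."""
    match = re.search(r"START_TYPE\s*:\s*\d+\s+(\w+)", sc_qc_output)
    return match.group(1) if match else None


def manual_start_command(service):
    """Build the argument list that switches a service to manual (demand) start.

    sc.exe expects "start=" and its value as two separate tokens.
    """
    return ["sc", "config", service, "start=", "demand"]


# Illustrative sample of what `sc qc OracleCSService` might print (assumed layout).
SAMPLE = """\
[SC] QueryServiceConfig SUCCESS

SERVICE_NAME: OracleCSService
        TYPE               : 10  WIN32_OWN_PROCESS
        START_TYPE         : 2   AUTO_START
"""

if __name__ == "__main__":
    if parse_start_type(SAMPLE) == "AUTO_START":
        # Print the command instead of running it; run it by hand on the server.
        print(" ".join(manual_start_command("OracleCSService")))
```

On the real server, the text passed to parse_start_type would come from running `sc qc OracleCSService` in an elevated command prompt. The sketch only prints the `sc config` command so that nothing is changed without a human deciding to run it.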
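With a likely cause identified, the plan was to reboot the server into Safe Mode, disable the service, and then reboot normally. However, the machine could no longer enter Safe Mode, even though it had done so earlier. The reason is unknown; the customer's IT hardware staff were the ones operating the machine.
So, while looking for a way to get into Safe Mode, we also began sizing up the last-resort option: reinstalling Windows and rebuilding Data Guard. Then, even more incredibly, the server came up normally all by itself, with nobody having done anything. I logged in and immediately rebuilt the OracleCSService service: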
Delete the service:
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.

C:\Users\Administrator>C:\oracle\product\10.2.0\db_1\BIN\localconfig.bat
usage: crssetup
config - configure and startup the cluster on nodes
add - add specified nodes to the cluster
del - delete the specified nodes from the cluster
deconfig - wipe out all cluster configuration information
ldel - local css delete from oracle home
lres - local css home reset to new oracle home
ladd - local css add to oracle home
shutdown - shutdown the selected nodes
upgrade - upgrade the specified nodes
help - print out this information

C:\Users\Administrator>C:\oracle\product\10.2.0\db_1\BIN\localconfig.bat deconfig
GetConfiguredClusterNodes: failed to initialize subsystem, rc(21)
failed to determine remaining nodes in the cluster
failed during critical configuration information
please supply <-force> option to continue

C:\Users\Administrator>C:\oracle\product\10.2.0\db_1\BIN\localconfig.bat deconfig -force
GetConfiguredClusterNodes: failed to initialize subsystem, rc(21)
failed to determine remaining nodes in the cluster
failed during critical configuration information
<-force> option specified, continuing
Step 1: shutting down node apps
failed executing check for CRS resources
[ 2 ] The system cannot find the file specified.
failed executing check for CRS resources
failure determining CRS resources state, continuing due to FORCE option
DEBRESTDDB Removing node apps
PRKC-1056 : Failed to get the hostname for node DEBRESTDDB
PRKH-1010 : Unable to communicate with CRS services.
[Communications Error(Native: prsr_initCLSS:[3])]
DEBRESTDDB Removing ONS configuration
failed to remove ONS configuration
[ 2 ] The system cannot find the file specified.
DEBRESTDDB failed to execute removal of ONS configuration
failuring during delete of node apps, continuing
Step 2: shutting down local CRS stack
DEBRESTDDB failed to located service OracleEVMService, err(1060)
failed to stop CRS stack on all nodes to be removed, continuing
Step 3: removing CRS stack from requested nodes
Step 4: stopping extra CRS services
Step 5: cleanup up registry keys
Step 6: perform cleanup of the OCR repository
C:\oracle\product\10.2.0\db_1\cdata\localhost\local.ocr
successful deconfiguration of the cluster

C:\Users\Administrator>
Recreate the service:
C:\Users\Administrator>C:\oracle\product\10.2.0\db_1\BIN\localconfig.bat add
Step 1: creating new OCR repository
Successfully accumulated necessary OCR keys.
Creating OCR keys for user 'administrator', privgrp ''..
Operation successful.
Step 2: creating new CSS service
successfully created local CSS service
successfully added CSS to home

C:\Users\Administrator>
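Finally, I started the ASM instance, started the physical standby database, re-enabled synchronization with the primary, and let the standby catch up.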
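Points worth remembering:
① Do not change a machine name lightly; do it only when it is really necessary. Before making the change, be absolutely sure the checklist is complete. Do not, as in this case, leave out rebuilding the OracleCSService service;
② Think twice before performing any operation in a production environment;
③ While writing up this short note, I found that Metalink has detailed instructions for exactly this case: How to change the Hostname when Oracle 10G and ASM are used [ID 422729.1]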
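For point ①, part of the checklist can be checked mechanically. The sketch below scans a few configuration files for lines that still mention the old machine name. It is only an illustration under stated assumptions: the function find_hostname_references, the placeholder hostname OLDHOST, and the NETWORK\ADMIN file locations (including listener.ora, which this post does not mention) are all hypothetical and must be adapted to the actual installation.

```python
from pathlib import Path


def find_hostname_references(paths, old_hostname):
    """Return (path, line_number, line) for every line that mentions old_hostname.

    The match is case-insensitive; files that do not exist are skipped.
    """
    needle = old_hostname.lower()
    hits = []
    for path in map(Path, paths):
        if not path.is_file():
            continue  # this file is not present on this machine
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                hits.append((str(path), number, line.strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical file locations and hostname; adjust to the real environment.
    candidates = [
        r"C:\Windows\System32\drivers\etc\hosts",
        r"C:\oracle\product\10.2.0\db_1\NETWORK\ADMIN\tnsnames.ora",
        r"C:\oracle\product\10.2.0\db_1\NETWORK\ADMIN\listener.ora",
    ]
    for path, number, line in find_hostname_references(candidates, "OLDHOST"):
        print(f"{path}:{number}: {line}")
```

A text scan like this only finds references in plain configuration files. It would not have caught the OracleCSService binding in this case, so the localconfig rebuild still has to be a separate, explicit checklist item.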