It has been a while since I wrote on Renaps’ Blog .. most probably because I didn’t run through any interesting case to talk about for a while !
Yesterday, at one of our main clients, the production (soon to be – on Monday) database hung. Was it because it was a Feb 29th of a bisextile year ? or simply because the week-end arriving in the next 4 hours was the go-live of the main data warehouse ? — Its hard to say, I believe that it was just a little wink from our old friend Murphy !
The environment is a 16 Itanium CPUs Windows Server running 10.2.0.3.0; ASM is used even though the database runs on a single instance mode.
When the database hung, the alert.log showed the following error messages (of course, no one could connect or run any transaction on the database):
ARC0: All Archive destinations made inactive due to error 354
Committing creation of archivelog ‘+FLASH/inf1/archivelog/arc17972_0627748574.001′ (error 354)
ARCH: Archival stopped, error occurred. Will continue retrying
Fri Feb 29 11:48:05 2008
Errors in file c:oraadmininf1bdumpinf1_arc0_10236.trc:
ORA-16038: Message 16038 not found; No message file for product=RDBMS, facility=ORA; arguments:  
ORA-00354: Message 354 not found; No message file for product=RDBMS, facility=ORA
ORA-00312: Message 312 not found; No message file for product=RDBMS, facility=ORA; arguments:   [+FLASH/inf1/onlinelog/group_1.257.627748575]
The first line of this message stack made me believe that the database just hung because there was no more space for the archiver to do its job. But after further analysis, it appeared that the problem was much more serious.
The actual cause of the issue was related to Redo-Log Corruption. The cause of the corruption was a “write or rewrite” SAN related error.
When Redo-Log corruption occurs, Oracle first tries to use the second (or third) member of the same group showing corruption to avoid un-availability of the database. In our case, this was impossible since there was only one member per group of Redo-Log — I was surprised to see that this client did not multiplex the Redo-Logs …
My first attempt to get the database back on its legs was to dump the content of the redo-log showing corruption issues (in this case, group 1).
SQL> ALTER SYSTEM DUMP LOGFILE ‘+FLASH/inf1/onlinelog/group_1.257.627748575′;
ALTER SYSTEM DUMP LOGFILE ‘+FLASH/inf1/onlinelog/group_1.257.627748575′
ERREUR à la ligne 1 :
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 53504 change 70389305 time 02/29/2008 11:48:02
ORA-00334: archived log: ‘+FLASH/inf1/onlinelog/group_1.257.627748575′
As you can see, this attempt failed, I then attempted to clear the unarchived redo-log:
SQL> alter database clear unarchived logfile ‘+FLASH/inf1/onlinelog/group_1.257.627748575′;
at this point, my command hung indefinitely, and I realized that pmon was holding a lock on the Redo-log Header block…
My margin of maneuver has become much smaller, here is the sequence of commands that brought the database back on track:
- shutdown abort;
- startup mount;
- issue the same command that was previously hanging (this time, it worked right away) :
alter database clear unarchived logfile ‘+FLASH/inf1/onlinelog/group_1.257.627748575′;
- alter database open;
- shutdown immediate;
- Immediately take a FULL backup at this point;
- Don’t forget to multiplex your Redo-Logs to avoid to run into this situation again !
Here you go, Redo-Log corruption on the header block of a group when only one member is used does look bad at first sight but I hope this article can help you if you run into a similar scenario.