Interesting Corrupted Redo Log Case

It has been a while since I last wrote on Renaps’ Blog, most probably because I hadn’t run into any interesting case to talk about for a while!

Yesterday, at one of our main clients, the production database (officially in production as of Monday) hung. Was it because it was February 29th of a leap year? Or simply because the weekend starting four hours later was the go-live of the main data warehouse? It’s hard to say; I believe it was just a little wink from our old friend Murphy!

The environment is a Windows server with 16 Itanium CPUs running Oracle 10.2.0.3.0; ASM is used even though the database runs in single-instance mode.

When the database hung, the alert.log showed the following error messages (of course, no one could connect or run any transaction on the database):

ARC0: All Archive destinations made inactive due to error 354
Committing creation of archivelog '+FLASH/inf1/archivelog/arc17972_0627748574.001' (error 354)
ARCH: Archival stopped, error occurred. Will continue retrying
Fri Feb 29 11:48:05 2008
Errors in file c:\ora\admin\inf1\bdump\inf1_arc0_10236.trc:
ORA-16038: Message 16038 not found; No message file for product=RDBMS, facility=ORA; arguments: [1] [17972]
ORA-00354: Message 354 not found; No message file for product=RDBMS, facility=ORA
ORA-00312: Message 312 not found; No message file for product=RDBMS, facility=ORA; arguments: [1] [1] [+FLASH/inf1/onlinelog/group_1.257.627748575]

The first line of this message stack made me believe that the database just hung because there was no more space for the archiver to do its job. But after further analysis, it appeared that the problem was much more serious.

The actual cause of the issue was Redo-Log corruption, which was itself caused by a “write or rewrite” SAN-related error.

When Redo-Log corruption occurs, Oracle first tries to use the second (or third) member of the affected group to keep the database available. In our case this was impossible, since each Redo-Log group had only one member. I was surprised to see that this client did not multiplex the Redo-Logs…
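
To see whether a database is exposed in the same way, you can count the members of each group. The query below is a minimal sketch against the standard V$LOGFILE view; the column alias is only illustrative. A count of 1 means the group is not multiplexed.

```sql
-- Count the members of each redo log group.
-- A member_count of 1 means the group has no mirror copy to fall back on.
SELECT   group#,
         COUNT(*) AS member_count
FROM     v$logfile
GROUP BY group#
ORDER BY group#;
```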

My first attempt to get the database back on its legs was to dump the content of the redo-log showing corruption issues (in this case, group 1).

SQL> ALTER SYSTEM DUMP LOGFILE '+FLASH/inf1/onlinelog/group_1.257.627748575';
ALTER SYSTEM DUMP LOGFILE '+FLASH/inf1/onlinelog/group_1.257.627748575'
*
ERROR at line 1:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 53504 change 70389305 time 02/29/2008 11:48:02
ORA-00334: archived log: '+FLASH/inf1/onlinelog/group_1.257.627748575'

As you can see, this attempt failed. I then attempted to clear the unarchived Redo-Log:

SQL> alter database clear unarchived logfile '+FLASH/inf1/onlinelog/group_1.257.627748575';

At this point, my command hung indefinitely, and I realized that pmon was holding a lock on the Redo-Log header block…
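
Before clearing a group, it can help to look at its state first. The query below is a sketch against the standard V$LOG view, not something that was run during this incident. ARCHIVED = NO is what makes the UNARCHIVED keyword necessary. A group reported as CURRENT or ACTIVE may still be needed for crash recovery, in which case Oracle is likely to refuse the clear.

```sql
-- Show the state of every online redo log group.
-- STATUS is typically CURRENT, ACTIVE, INACTIVE or UNUSED;
-- ARCHIVED shows whether the archiver has already copied the group.
SELECT   group#,
         thread#,
         sequence#,
         members,
         archived,
         status
FROM     v$log
ORDER BY group#;
```

Keep in mind that clearing an unarchived log leaves a gap in the archived redo stream, so earlier backups cannot be recovered past that point. This is why a full backup right after the clear matters.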

My room for maneuver had become much smaller. Here is the sequence of commands that brought the database back on track:

  1. shutdown abort;
  2. startup mount;
  3. issue the same command that was previously hanging (this time, it worked right away):
    alter database clear unarchived logfile '+FLASH/inf1/onlinelog/group_1.257.627748575';
  4. alter database open;
  5. shutdown immediate;
  6. Immediately take a FULL backup at this point;
  7. startup;
  8. Don’t forget to multiplex your Redo-Logs to avoid running into this situation again!
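
As a rough sketch of that last step, a second member can be added to each group with ALTER DATABASE ADD LOGFILE MEMBER. The disk group name '+DATA' and the assumption that there are three groups (1 to 3) are hypothetical; only group 1 and the '+FLASH' disk group appear in this case. Ideally the new members live on different physical storage from the existing ones.

```sql
-- Hypothetical: '+DATA' is a placeholder for a second ASM disk group,
-- and groups 1 to 3 are assumed; adjust both to your own configuration.
ALTER DATABASE ADD LOGFILE MEMBER '+DATA' TO GROUP 1;
ALTER DATABASE ADD LOGFILE MEMBER '+DATA' TO GROUP 2;
ALTER DATABASE ADD LOGFILE MEMBER '+DATA' TO GROUP 3;

-- Verify that every group now lists two members.
SELECT   group#, member, status
FROM     v$logfile
ORDER BY group#, member;
```

Newly added members usually show a status of INVALID until the group is written to for the first time; a log switch or two should clear it.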

There you go. Redo-Log corruption on the header block of a group with only one member does look bad at first sight, but I hope this article can help you if you run into a similar scenario.

References:

Metalink Notes:
145769.1
332672.1
438176.1
114734.1

Patrick Hamou
E-Mail: phamou@renaps.com
Web: www.renaps.com

  1. #1 by Eric Grancher on March 3, 2008 - 5:45 am

    Patrick,

    thank you for your note,

    do you have an idea on why you got such a corruption, did you have corruptions on some other ASM files in the same diskgroup, do you have an error at the OS/HW level?

    (btw; is it flash drive as the ASM disk group name implies?)

    regards,
    eric

  2. #2 by renaps on March 3, 2008 - 5:08 pm

    Hi Eric,

    We don’t know at this point what the actual cause of this corruption was (we are still investigating in that direction); there was not a single error on the SAN or on any other file of the same ASM disk group.

    Flash is simply the name of the ASM disk group that is used to store the Flash Recovery Area.

    Thanks,

    Patrick

  3. #3 by Erica on June 3, 2008 - 2:11 pm

    Patrick,

    We had the same log file corruption issue. In our case the corrupted online log file was in the ‘ACTIVE’ state. An ACTIVE log file cannot be cleared, so we had to fall back to the previous good backup.
    Facts: Oracle 9i EE, non-archivelog mode, on a Microsoft platform. Redo logs were not multiplexed.

    Thanks,

  4. #4 by shao hua.feng on April 13, 2009 - 1:09 am

    thanks
    This article enlightened me and is useful :)
