Never Assume
Posted by David Aldridge on 2006-10-12
Just a direct link to a Shark Tank story at Computerworld. I can’t improve on their own title so I just copied it.
I was involved with a company that suffered a similar loss when it turned out that operators were re-using the same backup tape every day. As I recall, that was three months of backups lost, and it was discovered when an errant script performed an rm -r from the ORACLE_HOME directory while the database was open. Happened on a Friday, of course.
Ah, happy days.
Noons said
Sure, happy days. But not long gone.
Just found three consecutive “alter system switch logfile” followed by an “alter database checkpoint” at the end of a backup script that is being run on most of our 9i databases.
You know, the old “to-be-sure-to-be-sure” approach.
And a very funny message in the alert logs. Something about “not being able to archive redo log because another process is using it”. And a 1002-byte archived redo log, the second one of the sequence above. With a timestamp later than that of the archived redo log from the third switch above.
Hmmm: 1002 bytes… finished writing it AFTER the next archive…
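The two symptoms described here (a tiny archived log, and a log finished after its successor) are the kind of thing a small check could flag automatically. What follows is only a rough sketch, not anything that was actually run on those systems: the `ArchivedLog` record, the `suspicious_logs` function and the 4096-byte threshold are all hypothetical, and how the sequence number, size and timestamp are collected from the archive destination is left out.

```python
from typing import NamedTuple


class ArchivedLog(NamedTuple):
    sequence: int    # log sequence number (assumed already parsed from the file name)
    size_bytes: int  # file size on disk
    mtime: float     # last-modified time, seconds since the epoch


def suspicious_logs(logs, min_size_bytes=4096):
    """Return (sequence, reason) pairs for archived logs that look wrong.

    min_size_bytes is an arbitrary illustrative threshold, not an Oracle limit.
    """
    findings = []
    ordered = sorted(logs, key=lambda log: log.sequence)
    for i, log in enumerate(ordered):
        # Symptom 1: an implausibly small archived log.
        if log.size_bytes < min_size_bytes:
            findings.append((log.sequence, "tiny file"))
        # Symptom 2: this log was last written after a log with a higher sequence.
        if any(log.mtime > later.mtime for later in ordered[i + 1:]):
            findings.append((log.sequence, "written after a later sequence"))
    return findings
```

Fed three made-up logs where the middle one is 1002 bytes and carries a later timestamp than the third, the function reports that middle sequence twice, once for each symptom. A clean result proves nothing about recoverability, of course; only a test restore does that.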
This stinks to high heaven, me theenks.
So: I take the backup to another system, start the db, re-create the control file, start recovery, try to roll forward past the second archived redo log.
Bang! Unreadable redo log. Recovery can’t proceed.
Lovely: ALL our daily backups and archived redo logs protect our databases for the long period of a few seconds!
The redundant switches and checkpoint have now been removed from all the scripts. I can now recover databases to any point in time since the last backup.
Turns out these scripts were written by many generations of prior DBAs, never tested, never audited.
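An audit of that sort can start very small. As a hedged illustration only, the sketch below counts the longest run of back-to-back “alter system switch logfile” statements in a script's text; the function name `longest_switch_run` is made up for this example, and it assumes one statement per line, which real backup scripts may not respect.

```python
import re

# Matches the statement regardless of case or extra spacing.
SWITCH = re.compile(r"alter\s+system\s+switch\s+logfile", re.IGNORECASE)


def longest_switch_run(script_text):
    """Return the longest run of consecutive lines that switch the logfile."""
    longest = current = 0
    for line in script_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines do not break a run
        if SWITCH.search(stripped):
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # any other statement ends the run
    return longest
```

Anything above 1 would be worth a second look. It is no substitute for actually restoring a backup on another system, which is what exposed the problem here.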
What would have happened if I hadn’t bothered to check?
Sometimes not even operator error can save the day…