I’ve blogged before about the importance of checking database backups. Over 90% of new clients that we assess have backups that are either incomplete or totally unusable (true statistic!).

The following story of a recent backup failure, and the successful double-check by our DBA Scott, provides a noteworthy example of how bad backups can happen even when you think your process is working.

We recently had a client who wanted to reduce storage costs for their Oracle RDBMS system while still meeting a legally mandated seven-year retention requirement. They also wanted to leverage AWS S3 and Glacier.

The majority of their Oracle data resided in a single history table partitioned on date, and this data was rarely, if ever, accessed once it was more than one year old. S3 and Glacier were therefore a perfect fit for this use case.
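To make the setup concrete, here is a hypothetical sketch of what such a table might look like (the table, column, and partition names are invented, not the client's actual schema). The point is simply that each date range lives in its own partition, which can be exported and dropped independently of the rest of the table.

    -- Hypothetical illustration only (invented names): a history table
    -- range-partitioned by date, with Oracle creating a new monthly
    -- partition automatically as rows arrive.
    CREATE TABLE order_history (
      order_id     NUMBER         NOT NULL,
      order_date   DATE           NOT NULL,
      order_detail VARCHAR2(4000)
    )
    PARTITION BY RANGE (order_date)
    INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
    (
      PARTITION p_initial VALUES LESS THAN (DATE '2015-01-01')
    );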

It was decided that data would be retained in the RDBMS until it was three years old. After that, the data would be exported via Oracle Data Pump and zipped via the Linux zip utility. (A side note: in case you’re wondering why we didn’t just go with Data Pump’s native compression functionality, in our testing it yielded a compression ratio about 20% lower than the Linux zip utility.)

Scott set about finding a way to automate the process, using Oracle Data Pump to export partitions starting with the oldest partitions first. To get started, he did what any good DBA does—he scoured the internet using Google and came up with this great example code posted by Michael Dinh to use as a starting point.

The process we decided to use was to export the identified partition, check the job_state value returned by dbms_datapump.wait_for_job to ensure the export completed successfully, and then drop the partition from the database.
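To show the shape of that logic, here is a minimal, hedged sketch; it is not Scott's actual script, and the table, partition, directory, and file names are invented placeholders. It exports a single partition with dbms_datapump and drops it only when the job reports COMPLETED.

    -- Minimal sketch only (hypothetical names), not the production script.
    DECLARE
      l_handle    NUMBER;
      l_job_state VARCHAR2(30);
    BEGIN
      -- Create a table-mode export job for a single partition
      l_handle := DBMS_DATAPUMP.OPEN(operation => 'EXPORT',
                                     job_mode  => 'TABLE',
                                     job_name  => 'EXP_ORDER_HISTORY_P201501');

      -- Dump file and log file go to a pre-created directory object
      DBMS_DATAPUMP.ADD_FILE(l_handle, 'order_history_p201501.dmp', 'ARCHIVE_DIR',
                             filetype => DBMS_DATAPUMP.KU$_FILE_TYPE_DUMP_FILE);
      DBMS_DATAPUMP.ADD_FILE(l_handle, 'order_history_p201501.log', 'ARCHIVE_DIR',
                             filetype => DBMS_DATAPUMP.KU$_FILE_TYPE_LOG_FILE);

      -- Limit the job to one table and one partition
      DBMS_DATAPUMP.METADATA_FILTER(l_handle, 'NAME_EXPR', 'IN (''ORDER_HISTORY'')');
      DBMS_DATAPUMP.DATA_FILTER(l_handle, 'PARTITION_EXPR', 'IN (''P201501'')',
                                'ORDER_HISTORY');

      DBMS_DATAPUMP.START_JOB(l_handle);

      -- Block until the job reaches a terminal state, then inspect that state
      DBMS_DATAPUMP.WAIT_FOR_JOB(l_handle, l_job_state);
      DBMS_DATAPUMP.DETACH(l_handle);

      -- Only prune the partition if Data Pump reports success
      IF l_job_state = 'COMPLETED' THEN
        EXECUTE IMMEDIATE
          'ALTER TABLE order_history DROP PARTITION p201501 UPDATE INDEXES';
      END IF;
    END;
    /

The part that matters for this story is the final gate: the partition is dropped only when the returned state is COMPLETED.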

After many modifications to the example code, it was time to test. Scott tested what would happen if everything went well. He also tested what would happen:

    • If the utility tried to export to a missing OS directory
    • If the directory was full
    • If the directory was read-only
    • If the export dump file was removed during the export process
    • If the export process was killed while the export job was running

The testing went smoothly: in each failure scenario, dbms_datapump.wait_for_job returned a state other than COMPLETED. Since the process would only drop the partition when the returned state was COMPLETED, it appeared we were ready to put this process into production use.

What we did not account for was the possibility that an Oracle bug would somehow cause dbms_datapump to fail to export the table partition rows but still return a COMPLETED code to the calling process—which is exactly what was about to happen.

The routine ran perfectly for a few weeks. Fortunately, Scott continued to closely monitor the job runs. Then he noticed that the export files had all started to report the exact same size, and a very small one at that.

After checking the process, we found the issue and opened a ticket with Oracle support. They informed us that they believed an Oracle bug was to blame and recommended we upgrade the database to a later release.

No reason was ever found for why this suddenly started to happen after weeks of continuous use. We did, however, learn an important lesson: when it comes to dropping table partitions after a Data Pump export, never rely solely on the dbms_datapump.wait_for_job return state. Always take the extra step of interrogating the export log file for the number of rows exported and for the “successfully completed” message.
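As a rough illustration of that extra step, here is a hedged sketch (reusing the invented names from the earlier example) that reads the Data Pump log back through UTL_FILE and refuses to proceed unless it finds both a non-zero exported row count and the “successfully completed” message. The exact log wording can vary between Oracle versions, so treat the pattern matching here as an assumption to verify against your own logs.

    -- Sketch only: directory, file, and log-message patterns are assumptions.
    DECLARE
      l_file      UTL_FILE.FILE_TYPE;
      l_line      VARCHAR2(32767);
      l_rows      NUMBER;
      l_rows_ok   BOOLEAN := FALSE;
      l_completed BOOLEAN := FALSE;
    BEGIN
      l_file := UTL_FILE.FOPEN('ARCHIVE_DIR', 'order_history_p201501.log', 'r', 32767);
      BEGIN
        LOOP
          UTL_FILE.GET_LINE(l_file, l_line);

          -- Per-partition detail line, roughly:
          -- . . exported "APP"."ORDER_HISTORY":"P201501"  152.3 MB  1048576 rows
          IF l_line LIKE '%exported%rows' THEN
            l_rows    := TO_NUMBER(REGEXP_SUBSTR(l_line, '([0-9]+) rows$', 1, 1, NULL, 1));
            l_rows_ok := (l_rows > 0);

          -- Final status line, roughly:
          -- Job "APP"."EXP_ORDER_HISTORY_P201501" successfully completed at ...
          ELSIF l_line LIKE '%successfully completed%' THEN
            l_completed := TRUE;
          END IF;
        END LOOP;
      EXCEPTION
        WHEN NO_DATA_FOUND THEN
          NULL;  -- reached the end of the log file
      END;
      UTL_FILE.FCLOSE(l_file);

      IF l_rows_ok AND l_completed THEN
        DBMS_OUTPUT.PUT_LINE('Log check passed - partition is safe to drop.');
      ELSE
        RAISE_APPLICATION_ERROR(-20001,
          'Export log check failed - do NOT drop the partition.');
      END IF;
    END;
    /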

In addition, the fact that we had a good current RMAN backup of the database that contained the data in the dropped partitions made this a good learning experience instead of a tragedy.  Nevertheless, this experience illustrates the importance of frequently checking your exports.

Because Scott checked that the exports were good, we were able to preserve the most recent backup taken prior to the pruning, so we could recover the data in those partitions if we ever needed to.

This is an extreme case because Oracle actually returned an incorrect success code. In most of the bad-backup situations we see, the backups were either:

    • Not configured to run at all, or
    • Configured to exclude part of the database, or
    • Relocated to new volumes that were not included in the operating system backup process, or
    • Taken at the operating system level without informing the DBMS, rendering the backup unusable.

Whatever the reasons, they are very hard to predict. The only safeguard is to check weekly (or daily, in some mission-critical cases) to make sure that what you think is happening really is.

If you don’t double-check your backups, the false sense of security you have today will end in sudden panic when you need to restore the database and can’t.