Donuts and Data Backups

The year was 1982. I was a Computer Science student by day and a baker by night (well, early morning). My boss Al at the Tiffany’s Bakery in the Staten Island Mall had asked me to help him figure out how much each of his products actually cost him, so he could understand which were the most profitable.

With my TRS-80 Model II computer and its 8-inch floppy disk drive, I got to work developing a program to provide the information Al needed.

It was a pretty cool program—the user would enter all the suppliers and the prices for the ingredients they supply, and the program would calculate an average price for a pound of flour, etc., across all suppliers.

The user would also enter the recipe and the yield for all products; i.e., how much sugar, flour, etc. went into the cake recipe and how many cakes the recipe yielded.

Out would pop the cost of each cake, Danish, cupcake and donut that the bakery sold.

It was a great little program built (I think) in Pascal. This was before database management systems like Oracle or SQL Server were widely available, even before dBASE and R:BASE were common, so I built my own database into the application.

I was so proud of my creation. Then the day came for me to demonstrate the product to the boss. I still remember vividly how the night before I was working feverishly in my mother’s basement on a few last-minute touchups, getting everything ready for the big reveal.

But then…

I accidentally pressed Delete instead of Save. Sheer panic! That moment is seared into my memory.

I had no backup. I don’t know if backups were even “a thing” at the time. I didn’t have an old copy of the software saved under a different name anywhere on the floppy disk. My program was gone, the whole thing! Weeks, maybe months, of hard work disappeared in an instant.

Worse yet, I had already missed the first, and maybe even the second, deadline. Al had been very patient, but I had promised him it was really done this time, and now I had nothing to give him!

A wise friend once told me that nothing is ever as good or as bad as it seems. That was true of this disaster. Al was very understanding, and though it took many more hours of my time than I would have liked, I was able to rebuild the application, probably better than it was before. And it turned out to be very valuable to the bakery.

But I would not wish that feeling of dread on anyone. Ever since then, it has been my passion to make sure that everyone is protected against losing the applications or the data they spent their precious time creating.

So back up your work, double-check your backups, and test them on a regular basis.

Then go have a cup of coffee and a donut and think of this story with a smile, knowing you are safe.

Why Your Database Management Team Should Regularly Double-Check Your Backups

I’ve blogged before about the importance of checking database backups. Over 90% of new clients that we assess have backups that are either incomplete or totally unusable (true statistic!).

The following story of a recent backup failure, and the successful double-check by our DBA, Scott, shows how bad backups can happen even when you think your process is working.

Recently we had a client that was looking to reduce storage costs for their Oracle RDBMS while still meeting a legally mandated seven-year retention requirement. They were also looking to leverage AWS S3 and Glacier.

The majority of their Oracle data resided in a single history table partitioned by date, and this data was rarely, if ever, accessed once it was more than one year old. Thus S3 and Glacier were a perfect fit for this use case.

It was decided that data would be retained in the RDBMS until it was three years old. After that, the data would be exported via Oracle Data Pump and zipped via the Linux zip utility. (A side note: In case you’re wondering why we didn’t just go with Data Pump’s native compression functionality, our testing showed that it yielded a 20% lower compression ratio than the Linux zip utility.)

Scott set about finding a way to automate the process, using Oracle Data Pump to export partitions starting with the oldest. To get started, he did what any good DBA does: he scoured the internet using Google and came up with some great example code posted by Michael Dinh to use as a starting point.

The process we decided to use was to export the identified partition, check the job_state value returned by dbms_datapump.wait_for_job to ensure the export completed successfully, and then drop the partition from the database.

After many modifications to the example code, it was time to test. Scott tested what would happen if everything went well. He also tested what would happen:

    • If the utility tried to export to a missing OS directory
    • If the directory was full
    • If the directory was read-only
    • If the export dump file was removed during the export process
    • If the export process was killed while the export job was running

The testing went smoothly, and in each failure case dbms_datapump.wait_for_job returned a code other than COMPLETED. The only time the process would drop the partition was when the return code was equal to COMPLETED, so it appeared we were ready to put this process into production use.
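In outline, the drop decision worked like the sketch below. This is a Python illustration, not the PL/SQL that Scott actually ran: run_export and drop_partition are hypothetical stand-ins for the Data Pump job and the partition drop, and the state strings are assumed to be the COMPLETED and STOPPED values that wait_for_job reports.

```python
def prune_partition(partition, run_export, drop_partition):
    """Export one partition and drop it only if the job reports COMPLETED.

    run_export and drop_partition are hypothetical callables standing in
    for the real Data Pump export and the partition drop statement.
    """
    job_state = run_export(partition)  # e.g. "COMPLETED" or "STOPPED"
    if job_state == "COMPLETED":
        drop_partition(partition)
        return True
    # Any other state leaves the partition in place for investigation.
    return False


# Usage with fake callables: only the COMPLETED run triggers a drop.
dropped = []
prune_partition("P2015_01", lambda p: "COMPLETED", dropped.append)
prune_partition("P2015_02", lambda p: "STOPPED", dropped.append)
```

Notice that the fate of the partition rests entirely on one status string.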

What we did not account for was the possibility that an Oracle bug would somehow cause dbms_datapump to fail to export the table partition rows but still return a COMPLETED code to the calling process—which is exactly what was about to happen.

The routine ran perfectly for a few weeks. Fortunately, Scott continued to closely monitor the job runs. All of a sudden he noticed that all of the export files had started to report the exact same size, which was very small.
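That kind of eyeball check can be automated. Here is a minimal sketch, assuming you collect the dump file sizes in run order (for example with os.path.getsize). The 1 MB floor and the three-run window are made-up thresholds that you would tune for your own data.

```python
def suspicious_exports(sizes, min_bytes=1_000_000, repeat=3):
    """Return a list of reasons the recent export dump sizes look wrong.

    sizes is a list of dump file sizes in bytes, oldest run first.
    min_bytes and repeat are illustrative thresholds, not Oracle defaults.
    """
    reasons = []
    recent = sizes[-repeat:]
    # Real partitions rarely produce byte-identical dump files run after run.
    if len(recent) == repeat and len(set(recent)) == 1:
        reasons.append(f"last {repeat} dump files are all {recent[0]} bytes")
    # A dump far smaller than expected may contain metadata but no rows.
    if sizes and sizes[-1] < min_bytes:
        reasons.append(f"latest dump file is only {sizes[-1]} bytes")
    return reasons


healthy = suspicious_exports([52_000_000, 48_500_000, 51_200_000])
broken = suspicious_exports([52_000_000, 40_960, 40_960, 40_960])
```

An empty list means nothing looked odd. Anything else is worth a human look before the next partition is dropped.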

After checking the process, we found the issue and opened a ticket with Oracle support. They informed us that they believed an Oracle bug was to blame and recommended we upgrade the database to a later release.

No reason was ever found for why this suddenly started to happen after weeks of continuous use. We did, however, learn an important lesson: when it comes to dropping table partitions after a Data Pump export, never rely solely on the dbms_datapump.wait_for_job return code. Always take the extra step of checking the export log file for the number of rows exported and for the "successfully completed" message.
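A minimal sketch of that extra step in Python follows. It assumes the log uses the usual Data Pump export layout: a ". . exported" line per object ending in a row count, and a final "Job ... successfully completed" line. Check the patterns against your own logs before trusting them. The expected_rows argument is also an assumption: a row count you would take from the partition yourself before the export.

```python
import re

# Assumed log line shape, e.g.:
# . . exported "APP"."HISTORY":"P2015_01"    12.5 MB  250000 rows
EXPORTED_LINE = re.compile(
    r'^\. \. exported\s+(?P<object>\S+)\s+[\d.]+\s+\w+\s+(?P<rows>\d+)\s+rows',
    re.MULTILINE,
)
# Assumed final status line, e.g.:
# Job "APP"."SYS_EXPORT_TABLE_01" successfully completed at ...
COMPLETED_LINE = re.compile(r'^Job\s+\S+\s+successfully completed', re.MULTILINE)


def summarize_export_log(log_text):
    """Return (completed, rows_by_object) parsed from an export log."""
    rows = {
        m.group("object"): int(m.group("rows"))
        for m in EXPORTED_LINE.finditer(log_text)
    }
    completed = COMPLETED_LINE.search(log_text) is not None
    return completed, rows


def safe_to_drop(log_text, expected_rows):
    """True only if the log reports success AND the expected row count."""
    completed, rows = summarize_export_log(log_text)
    return completed and sum(rows.values()) == expected_rows


good_log = (
    '. . exported "APP"."HISTORY":"P2015_01"    12.5 MB  250000 rows\n'
    'Job "APP"."SYS_EXPORT_TABLE_01" successfully completed at (timestamp)\n'
)
bad_log = (
    '. . exported "APP"."HISTORY":"P2015_01"    0 KB  0 rows\n'
    'Job "APP"."SYS_EXPORT_TABLE_01" successfully completed at (timestamp)\n'
)
```

With these sample logs, safe_to_drop(good_log, 250000) passes while safe_to_drop(bad_log, 250000) does not, even though both logs claim the job completed.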

In addition, the fact that we had a good current RMAN backup of the database that contained the data in the dropped partitions made this a good learning experience instead of a tragedy. Nevertheless, this experience illustrates the importance of frequently checking your exports.

By checking to make sure that the exports were good, Scott was able to ensure that we saved the most recent backup prior to the pruning so that we could recover the data in those partitions if we ever needed to.

This is an extreme case because Oracle actually returned an incorrect success code. In most of the cases we see, the cause is more mundane:

    • The backups were not configured to run at all, or
    • The backups were configured to exclude part of the database, or
    • The backup files were relocated to new volumes that were not included in the operating system backup process, or
    • The operating system backup was used without informing the DBMS that this was taking place, rendering the backup unusable.

Whatever the reasons, they are very hard to predict. The only safeguard is to check weekly (or daily, in some mission-critical cases) to make sure that what you think is happening really is.

If you don’t double-check your backups, the false sense of security you have today will end in sudden panic when you need to restore the database and can’t.