Watch Your FRA!

When it comes to Oracle database administration, one of the most important parts of your database structure is the fast recovery area (FRA). This is an Oracle-managed area where DBAs typically store some or all of the following files:

  • Redo logs
  • Archive logs
  • Flashback logs
  • Control files
  • RMAN backups

The purpose of the Oracle FRA is to simplify database recovery. The Oracle database automatically manages the files stored in the FRA and deletes those that are no longer needed.

Oracle FRA Settings

Now the Oracle FRA may sound like a magical area that should never run into storage-related issues, but that could not be further from the truth. Even though the Oracle database will manage the files it stores and attempt to delete the ones that are no longer needed, the DBA also has to be aware of instance settings that can block Oracle from removing those files. Some settings that can cause storage issues with your FRA include:

  • RMAN backup retention setting – If you set your backup retention to store two weeks’ worth of RMAN backups, but your FRA fills up to 100% before any backups can be purged, this will cause your database to halt.
  • RMAN archive log deletion policy – If you set the deletion policy to delete archive logs after they are applied to all standby databases, but haven’t noticed that your primary and standby databases have been out of sync for a long period of time, your FRA can fill to 100% and cause your database to halt.
  • RMAN archive log backup copies setting – By default, backup copies are set to 1. But what if you want your backups to contain more than one copy of your archive logs in case one of your incremental backups becomes corrupted? If you raise this setting above 1, an archive log cannot be deleted until it has been backed up that many times. For example, if you set it to 3, each archive log must have been included in at least three backups before it can be removed from your system. If you opted to store archive logs in your FRA, this can fill the FRA to 100% and cause your database to halt.
  • db_flashback_retention_target setting – If you have enabled the Flashback Database feature, flashback logs are stored in the FRA by default. As with archive logs, Oracle retains all of the flashback logs needed to meet the retention target you set. If you set the target too high, this too can fill the FRA to 100% and cause your database to halt. (A quick way to review all of these settings is shown after this list.)
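
To see where these settings stand in your environment, a quick check along these lines can help. This is a minimal sketch using standard RMAN SHOW commands and initialization parameters; the output, of course, depends on your own configuration:

    -- From RMAN: the retention policy, archive log deletion policy, and backup copies settings
    RMAN> SHOW RETENTION POLICY;
    RMAN> SHOW ARCHIVELOG DELETION POLICY;
    RMAN> SHOW ARCHIVELOG BACKUP COPIES;

    -- From SQL*Plus: the flashback retention target (in minutes) and the FRA location and size
    SQL> show parameter db_flashback_retention_target
    SQL> show parameter db_recovery_file_dest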

Those are just a handful of the many ways you can accidentally fill your Oracle FRA, which is why you need to make sure that your FRA is adequately sized to hold all of the files required by your retention settings. You should also create a script that queries v$recovery_area_usage and emails the result to all of the DBAs, since this view tells you how much of your FRA is used and which file types are taking up the space. For example:
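
The query itself can be as simple as the following sketch, which lists usage by file type along with overall FRA consumption (percentages are relative to the total FRA size set by db_recovery_file_dest_size):

    SQL> select file_type,
                percent_space_used,
                percent_space_reclaimable,
                number_of_files
           from v$recovery_area_usage
          order by percent_space_used desc;

    SQL> select name,
                round(space_used / space_limit * 100, 1) as pct_of_fra_used
           from v$recovery_file_dest;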

For remote and onsite DBA support to help keep your databases running smoothly, including 24×7 live support, contact Buda Consulting.

Oracle Active Data Guard Replication Error with Cisco Firewall: Root Cause and Workaround

One of our customers had issues over the past several months with Oracle Active Data Guard replication to a standby server. The database on the standby server would intermittently fall out of sync with the database on the primary server, and log files were not consistently shipping to the standby server. 

Active Data Guard Replication Error

The Oracle database version was 12.2.0.1 running on Oracle Linux 6. The firewall was a Cisco ASA-5585-SSP-40, and the ASA version was 9.6(4)8.

TNS tracing showed: CORRUPTION DETECTED: In redo blocks starting at block #…

By the time I got involved, the firewall administrators had already implemented all the recommended firewall changes to disable the following:

    •     SQLNet fixup protocol 
    •     Deep Packet Inspection (DPI)
    •     SQLNet packet inspection 
    •     SQL Fixup

The following errors were noted in the primary database alert log:

    •     ORA-16055: FAL request rejected 
    •     ARC6: Standby redo logfile selected for thread 2 sequence 33351 for destination LOG_ARCHIVE_DEST_2 
    •     ARC6: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (12152) 
    •     ARC6: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned 

The following errors were noted in the standby database alert log:

    •     CORRUPTION DETECTED: In redo blocks starting at block #…
    •     RFS[20]: Possible network disconnect with primary database
    •     Error 1017 received logging on to the standby 
    •     FAL[client, USER]: Error 16191 connecting to…
    •     ORA-16191: Primary log shipping client not logged on standby 

The root cause of the problem turned out to be a bug in the Cisco firewall. For reasons unknown, when the primary and standby database listeners were using port 1521, the firewall would ignore the settings the admins had implemented for the Oracle Data Guard connections and revert to its default behavior. As a workaround, we moved the listeners to a different port.
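
The change itself is straightforward. As a rough sketch (the host names, service name, and port 1522 below are placeholders rather than the client's actual values), the listeners on both sides move off 1521 and the Data Guard connect strings follow:

    # listener.ora (primary and standby) - listen on a non-default port
    LISTENER =
      (DESCRIPTION =
        (ADDRESS = (PROTOCOL = TCP)(HOST = primary-host)(PORT = 1522))
      )

    # tnsnames.ora on the primary - alias referenced by log_archive_dest_2 and fal_server
    STBY =
      (DESCRIPTION =
        (ADDRESS = (PROTOCOL = TCP)(HOST = standby-host)(PORT = 1522))
        (CONNECT_DATA = (SERVICE_NAME = stby))
      )

    -- Because the port is non-default, make sure each instance registers with its listener
    SQL> alter system set local_listener = '(ADDRESS=(PROTOCOL=TCP)(HOST=primary-host)(PORT=1522))';

Remember to have the firewall rules updated for the new port as well.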

If you are experiencing intermittent or hard-to-diagnose database issues in your environment, contact Buda Consulting.

Test Your Disaster Recovery Strategies before Disaster Strikes

I’m sure you have heard of, if not experienced, the following scenario. A student is working on a research paper when suddenly her PC crashes. Because she did not follow the golden rule of saving your document every sixty seconds, she loses hours of work.

You would think by the time you are well into your career things like this would no longer happen, but unfortunately this kind of thing still happens all the time. When it comes to data, the one thing most people think about is backups. As long as the backups complete without error you feel safe, as you believe you have all of the files you need to restore your database in the event of a disaster. 

But what if I told you that is not a complete disaster recovery strategy?  

We saw this issue play out recently when we were contacted by a company that needed our assistance. The client was trying to restore a database, which happened to be a standby database, after temporarily losing power, and they hit a software bug that requested an old archive log that had already been applied to the database. Because the requested archive log was so old (a search of emailed backup job logs found it had been backed up 9 months prior) and their retention policy only kept backups for 14 days, there was no way for them to get the archive log back. This meant they were not able to restore their data. Long story short: the company lost all of the data in the database.

When one side of our disaster recovery strategy is working, we often overlook the other side of the strategy, which is making sure we can actually restore our database using the files the backups created. While the backup job may complete without errors, file corruption or the accidental deletion of a backup file can render your recovery plan, and your data, useless. This is why we here at Buda Consulting always recommend that our clients perform disaster recovery restore tests at least twice a year, and ideally every quarter.
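
For an Oracle database backed up with RMAN, a restore test does not have to be elaborate. The sketch below is one common shape for such a test, run on a scratch server against a copy of the backup pieces; the paths and the point in time are placeholders you would replace with your own values:

    RMAN> startup nomount;
    RMAN> restore controlfile from '/backup/testdb/controlfile.bkp';
    RMAN> alter database mount;
    RMAN> catalog start with '/backup/testdb/';
    RMAN> run {
            set until time "to_date('2024-01-15 03:00:00','YYYY-MM-DD HH24:MI:SS')";
            restore database;
            recover database;
          }
    RMAN> alter database open resetlogs;

The real value comes from doing this end to end, timing it, and confirming the application can actually use the restored data.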

As the old saying goes, “It’s better to be safe than sorry,” and testing your disaster recovery data is essential to keeping you and your data safe!

Concerned (as you should be) about your disaster recovery and business continuity capability? Contact Buda Consulting to schedule a free consultation.

Donuts and Data Backups

The year was 1982. I was a Computer Science student by day and a baker by night (well, early morning). My boss Al at the Tiffany’s Bakery in the Staten Island Mall had asked me to help him figure out how much each of his products actually cost him, so he could understand which were the most profitable.

With my TRS-80 Model II computer and its 8-inch floppy disk drive, I got to work developing a program to provide the information Al needed.

It was a pretty cool program—the user would enter all the suppliers and the prices for the ingredients they supply, and the program would calculate an average price for a pound of flour, etc., across all suppliers.

The user would also enter the recipe and the yield for all products; i.e., how much sugar, flour, etc. went into the cake recipe and how many cakes the recipe yielded.

Out would pop the cost of each cake, Danish, cupcake and donut that the bakery sold.

It was a great little program built (I think) in Pascal. This was before database management systems like Oracle or SQL Server, even before DBASE and RBASE, so I built my own database into the application.

I was so proud of my creation. Then the day came for me to demonstrate the product to the boss. I still remember vividly how the night before I was working feverishly in my mother’s basement on a few last-minute touchups, getting everything ready for the big reveal.

But then…

I accidentally pressed Delete instead of Save. Sheer panic! That moment is seared into my memory.

I had no backup. I don’t even know if backup was “a thing” at the time.  I didn’t even have an old copy of the software named something different on the floppy drive anywhere. My program was gone, the whole thing! Weeks, maybe months of hard work disappeared in an instant.

Worse yet, I had already missed the first, and maybe even the second deadline. Al had been very patient but I had promised him it was really done this time, and now I had nothing to give him!

A wise friend once told me that nothing is ever as good or as bad as it seems. That was true of this disaster.  Al was very understanding and though it took many more hours of my time than I would have liked, I was able to rebuild the application, probably better than it was before. And it turned out to be very valuable to the bakery.

But I would not wish that feeling of dread on anyone. Ever since then, it has been my passion to make sure that everyone is protected against losing the applications or the data they spent their precious time creating.

So back up your work, double-check your backups, and test them on a regular basis.

Then go have a cup of coffee and a donut and think of this story with a smile knowing you are safe.

Why Your Database Management Team Should Regularly Double-Check Your Backups

I’ve blogged before about the importance of checking database backups. Over 90% of new clients that we assess have backups that are either incomplete or totally unusable (true statistic!).

The following story of a backup failure that happened recently, and the successful double-check by our DBA Scott, provides a noteworthy example of how bad backups can happen even when you think your process is working.

The client in question was looking to reduce storage costs for their Oracle RDBMS while still meeting a legally mandated seven-year retention requirement. They were also looking to leverage AWS S3 and Glacier.

The majority of their Oracle data resided in a single history table partitioned on date, and this data was rarely, if ever, accessed once it was more than one year old. That made S3 and Glacier a perfect fit for this use case.

It was decided that data would be retained in the RDBMS until it was three years old. After that, the data would be exported via Oracle Data Pump and zipped with the Linux zip utility. (A side note: in case you’re wondering why we didn’t just use Data Pump’s native compression functionality, our testing showed it yielded a compression ratio about 20% lower than the Linux zip utility.)

Scott set about finding a way to automate the process, using Oracle Data Pump to export partitions starting with the oldest partitions first. To get started, he did what any good DBA does—he scoured the internet using Google and came up with this great example code posted by Michael Dinh to use as a starting point.

The process we decided to use was to export the identified partition, check the job_state value returned by dbms_datapump.wait_for_job to make sure the export completed successfully, and then drop the partition from the database.
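
For readers who want to picture the moving parts, here is a heavily simplified sketch of that flow. The schema, table, partition, and directory names are made up for illustration, and the real script (based on Michael Dinh’s example) handles many more edge cases:

    declare
      h          number;
      job_state  varchar2(30);
    begin
      -- export one partition of a hypothetical HIST_OWNER.HISTORY table
      h := dbms_datapump.open(operation => 'EXPORT', job_mode => 'TABLE');

      dbms_datapump.add_file(handle => h, filename => 'history_p2018.dmp',
                             directory => 'EXPORT_DIR',
                             filetype  => dbms_datapump.ku$_file_type_dump_file);
      dbms_datapump.add_file(handle => h, filename => 'history_p2018.log',
                             directory => 'EXPORT_DIR',
                             filetype  => dbms_datapump.ku$_file_type_log_file);

      dbms_datapump.metadata_filter(handle => h, name => 'SCHEMA_EXPR', value => '= ''HIST_OWNER''');
      dbms_datapump.metadata_filter(handle => h, name => 'NAME_EXPR',   value => '= ''HISTORY''');
      dbms_datapump.data_filter(handle => h, name => 'PARTITION_EXPR',
                                value => '= ''P2018''', table_name => 'HISTORY');

      dbms_datapump.start_job(h);
      dbms_datapump.wait_for_job(handle => h, job_state => job_state);

      -- only prune the partition when Data Pump reports success
      if job_state = 'COMPLETED' then
        execute immediate 'alter table hist_owner.history drop partition p2018';
      else
        raise_application_error(-20001,
          'Export ended with state ' || job_state || ' - partition not dropped');
      end if;
    end;
    /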

After many modifications to the example code, it was time to test. Scott tested what would happen if everything went well. He also tested what would happen:

    • If the utility tried to export to a missing OS directory
    • If the directory was full
    • If the directory was read-only
    • If the export dump file was removed during the export process
    • If the export process was killed while the export job was running

The testing went smoothly: in every failure scenario, dbms_datapump.wait_for_job returned a code other than COMPLETED. Since the process would only drop the partition when the return code was equal to COMPLETED, it appeared we were ready to put this process into production use.

What we did not account for was the possibility that an Oracle bug would somehow cause dbms_datapump to fail to export the table partition rows but still return a COMPLETED code to the calling process—which is exactly what was about to happen.

The routine ran perfectly for a few weeks. Fortunately, Scott continued to closely monitor the job runs. All of a sudden he noticed that all of the export files started to report the exact same size, which was very small.

After checking the process, we found the issue and opened a ticket with Oracle support. They informed us that they believed an Oracle bug was to blame and recommended we upgrade the database to a later release.

No reason was ever found for why this suddenly started to happen after weeks of continuous use. We did, however, learn an important lesson: when it comes to dropping table partitions after a Data Pump export, never rely solely on the dbms_datapump.wait_for_job return code. Always take the extra step of interrogating the export log file for the number of rows exported and for the “successfully completed” message.
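
That extra check can be automated too. Below is a rough sketch; the directory object and log file name are placeholders, and the strings being matched follow the usual Data Pump log format (the “. . exported … rows” lines and the final “successfully completed” message):

    declare
      f     utl_file.file_type;
      line  varchar2(4000);
      ok    boolean := false;
    begin
      f := utl_file.fopen('EXPORT_DIR', 'history_p2018.log', 'r');
      loop
        begin
          utl_file.get_line(f, line);
        exception
          when no_data_found then exit;   -- end of file
        end;
        if line like '%. . exported%rows%' then
          dbms_output.put_line(line);     -- row count for the exported partition
        elsif line like '%successfully completed%' then
          ok := true;
        end if;
      end loop;
      utl_file.fclose(f);

      if not ok then
        raise_application_error(-20002,
          'Export log does not show a successful completion - do not drop the partition');
      end if;
    end;
    /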

In addition, the fact that we had a good current RMAN backup of the database that contained the data in the dropped partitions made this a good learning experience instead of a tragedy.  Nevertheless, this experience illustrates the importance of frequently checking your exports.

By checking to make sure that the exports were good, Scott was able to ensure that we saved the most recent backup prior to the pruning so that we could recover the data in those partitions if we ever needed to.

This is an extreme case because Oracle actually returned an incorrect success code. In most of the cases we have seen, one of the following was true:

    • The backups were not configured to run at all, or
    • The backups were configured to exclude part of the database, or
    • Backup files were relocated to new volumes that were not included in the operating system backup process, or
    • The operating system backup was used without informing the DBMS that this was taking place, rendering the backup unusable.

Whatever the reasons, they are very hard to predict. The only safeguard is to check weekly (or daily, in some mission-critical cases) to make sure that what you think is happening really is.
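
For an Oracle database backed up with RMAN, that periodic check can be as simple as confirming that the jobs really ran and that the pieces needed for a restore are still readable. A minimal sketch, assuming RMAN is the backup tool (adjust the seven-day window to your own schedule):

    -- Did backup jobs actually run this week, and did they succeed?
    SQL> select start_time, end_time, input_type, status
           from v$rman_backup_job_details
          where start_time > sysdate - 7
          order by start_time;

    -- Can the backup pieces needed for a restore still be read?
    RMAN> restore database validate;
    RMAN> restore archivelog from time 'sysdate-7' validate;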

If you don’t double-check your backups, the false sense of security you have today will end in sudden panic when you need to restore the database and can’t.