How Poor Communication Brought an Oracle System Down

Early on a cold Monday morning, I received a call from one of my fellow system administrators. He reported that one of our production databases would not come back online after the server hosting it was restarted.

Most DBAs would start investigating this issue by looking at database alert logs. But my experience led me to ask my fellow system admin the following question: “What changes did you make on the server prior to the reboot?”

It was his answer to that question that allowed me to quickly understand the issue and fix it in just a few minutes. 

Apparently the system admin (not the DBA) was conducting vulnerability testing and, as a result, made a change to the main listener.ora file that prevented all databases on the server from dynamically registering with the Oracle database listener.

By default, an Oracle database will try to dynamically register with an Oracle database listener on port 1521. This registration process is what allows connections to the database from outside the server. The database was online and operational, but because dynamic registration was disabled it could no longer register with the listener, so no users could connect to the database.
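To make this concrete, here is how a DBA might check registration and force an immediate re-registration attempt. This is a minimal sketch; it assumes you are on the database server with the Oracle environment set, and that the listener is named LISTENER:

  # From the OS: list the services the listener currently knows about.
  # A dynamically registered database appears under the Services Summary.
  lsnrctl services LISTENER

  -- From SQL*Plus as SYSDBA: ask the instance to register with the
  -- listener now, rather than waiting for the next periodic attempt.
  ALTER SYSTEM REGISTER;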

The fix was to add a static listener entry for the database to the listener.ora file, allowing it to receive connections again. Once the static entry was added, all users were able to connect to the production database without error.

The Technical Problem

Let’s break this incident down in more detail:

This is the original listener.ora file:

LISTENER=
  (DESCRIPTION=
    (ADDRESS_LIST=
      (ADDRESS=(PROTOCOL=tcp)(HOST=MyServer)(PORT=1521))
      (ADDRESS=(PROTOCOL=ipc)(KEY=extproc))))

The administrator added one line (the last line below):

LISTENER=
  (DESCRIPTION=
    (ADDRESS_LIST=
      (ADDRESS=(PROTOCOL=tcp)(HOST=MyServer)(PORT=1521))
      (ADDRESS=(PROTOCOL=ipc)(KEY=extproc))))

DYNAMIC_REGISTRATION_LISTENER=OFF

This prevented any database that did not have a static entry in the listener.ora file from accepting connections.

The Technical Solution

To correct the problem, I added a static listener entry to the listener.ora file (the SID_LIST_LISTENER block below):

LISTENER=
  (DESCRIPTION=
    (ADDRESS_LIST=
      (ADDRESS=(PROTOCOL=tcp)(HOST=MyServer)(PORT=1521))
      (ADDRESS=(PROTOCOL=ipc)(KEY=extproc))))

DYNAMIC_REGISTRATION_LISTENER=OFF

SID_LIST_LISTENER=
  (SID_LIST=
    (SID_DESC=
      (GLOBAL_DBNAME=MyDBName)
      (ORACLE_HOME=/u01/app/oracle/product/12.1.0/dbhome_1)
      (SID_NAME=MySID)))
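Note that edits to listener.ora do not take effect on their own. A minimal sketch of the follow-up steps (again assuming the listener is named LISTENER):

  # Re-read listener.ora without dropping existing connections
  lsnrctl reload LISTENER

  # Confirm the statically registered service is now listed; static
  # entries show a status of UNKNOWN, which is normal
  lsnrctl services LISTENER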

You can find detailed information about listener.ora parameters in the Oracle 19c Net Services Reference documentation.

The Communication Problem

We have mentioned in this blog before that almost all problems with technology projects are the result of poor communication. This principle holds here as well. Because the system administrator did not keep any of the DBAs on our team “in the loop” about his vulnerability testing or the resulting changes, those changes caused production downtime.

The Communication Solution

Any change to a server, database, or application must be communicated to all responsible parties beforehand. In fact, a better approach in this case would have been to ask the DBA to make the change to the listener file rather than the administrator making the change himself. This would have ensured that an experienced DBA had reviewed the change and understood the potential impact.

The moral of the story is: Keep your DBAs in the loop when you’re making system changes. It’s our job to proactively prevent database issues others might miss.

A Word on Database Security

While the system administrator’s change caused a problem in this situation, the vulnerability testing itself should be applauded from a database security standpoint, because it exposed a potential vulnerability (dynamic registration). It is a best practice to disable dynamic registration unless the organization needs it; where it is needed, the associated risk should be mitigated by other practices, such as changing the default listener port.

Database vulnerability testing is a crucial part of a comprehensive IT security plan and is often overlooked. For the reasons described above, the process should always include a member of the DBA team. See a few of our database security related blog posts for more on this topic.

 

Database Patch News — February 2021 (Issue 6)

Welcome to Database Patch News, Buda Consulting’s newsletter of current patch information for Oracle and Microsoft SQL Server. Here you’ll find information recently made available on patches—including security patches—and desupported versions.

Why should you care about patching vulnerabilities and bugs? Two big reasons:

  • Unpatched systems are a top cyber attack target. Patch releases literally advertise vulnerabilities to the hacker community. The longer you wait to patch, the greater your security risk.
  • Along with running a supported database version, applying the latest patches ensures that you can get support from the vendor in case of an issue. Patching also helps eliminate downtime and lost productivity associated with bugs.
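Not sure where you stand? The queries below are a minimal sketch of how you might check your current patch level before comparing against the lists that follow (assuming SQL*Plus access with DBA privileges on Oracle, and a sysadmin login on SQL Server):

  -- Oracle (12c and later): list applied release updates and patches
  SELECT patch_id, description, status, action_time
  FROM   dba_registry_sqlpatch
  ORDER  BY action_time;

  -- SQL Server: report version, service pack, and cumulative update level
  SELECT SERVERPROPERTY('ProductVersion')     AS product_version,
         SERVERPROPERTY('ProductLevel')       AS product_level,
         SERVERPROPERTY('ProductUpdateLevel') AS update_level;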

Here are the latest patch updates for Oracle and SQL Server:

Oracle Patches:

January 19, 2021 Quarterly Patch Updates:
21c – Released January 13, 2021, Version 21.1; no Quarterly patch yet

19c – Release Update 19.10 is available (32218494 and 321266828)

18c – Release Update 18.13 is available (32204699 and 32126855)

12cR2 – Release Update 210119 is available (32228578 and 32126871)
Regular support ends in Mar 2023 and extended support ends in Mar 2026.

12cR1 – Release Update 210119 is available (32132231 and 32126908)
Regular support ended in July 2019 and extended support ends in July 2021.

11gR2 – Patch Set Update 201020 is available (31720776)
Regular support ended in October 2018 and extended support ended December 31, 2020.

SQL Server Patches:

SQL Server 2019
Cumulative Update 8 (latest build), released Oct 1, 2020
Mainstream support ends Jan 7, 2025
Extended support ends Jan 8, 2030

SQL Server 2017
Cumulative Update 22 (latest build), released Sept 10, 2020
Mainstream support ends Oct 11, 2022
Extended support ends Oct 12, 2027

SQL Server 2016 Service Pack 2
Cumulative Update 15, released Sept 28, 2020
Mainstream support ends Jul 13, 2021
Extended support ends Jul 14, 2026

SQL Server 2014 Service Pack 3
Cumulative Update 4, released Feb 11, 2019
Mainstream support ended Jul 9, 2019
Extended support ends Jul 9, 2024

SQL Server 2012 Service Pack 4
Released Oct 5, 2017
Mainstream support ended Jul 11, 2017
Extended support ends Jul 12, 2022

Note: All other SQL Server versions not mentioned are no longer supported.

4 Keys to Avoiding the Number 1 Cause of Database Project Failure

We all know that database projects and other technical/IT projects often fail. They are never completed, the results fall far short of expectations, nobody uses the new application, and so on.

Why? At the end of the day, if we look beneath the surface-level issues, the main reason for database project failure — by far — is poor communication. 

Case in point: If a project fails because of technical errors or deficiencies, it’s either because the technical resources did not have the right skill set, or the requirements they were working from were incorrect or incomplete.

If it’s the former, then there was a breakdown in communication between the resources and the project manager regarding the resources’ actual abilities, or between the project manager and the business analyst regarding what skill sets the project required. If it’s the latter, then there was a breakdown in communication between the business analyst and the project manager regarding the overall system requirements.

Another typical project failure involves missing deadlines. Typical causes include resources not being available when needed, infrastructure not being ready in time, or business users not being ready for testing or migration activities.

Again, in all of these cases the root cause is communication. If one of the parties is not ready when they need to be, it is either because they didn’t know when they would be needed, or they incorrectly stated their availability. If the infrastructure is not available when it is needed, then either the requirements or the deadline for the infrastructure were not properly communicated to the infrastructure team, or the infrastructure team miscommunicated their ability to get the work done in time.

If you look deeper and break down the presenting problems, in almost all cases the root cause of project failures is communication. Often the communication failures occur in the very beginning of the project, during the scoping and estimate or quotation process.

Here are 4 key approaches that I use to mitigate the significant risks to project success caused by poor communication:

  1. When asking someone for a decision on an important point, I always ask twice. If the two answers differ, I ask a third time, and I continue that process until the answers become consistent. If I receive two different answers from two different critical stakeholders, I will find a reason to send a joint email or have a conversation with both present, and I will re-ask the question in hopes of gaining consensus. (Political sensitivity and tact are critical here… perhaps that’s the subject of another blog post.)
  2. When nailing down an important decision, I follow up in writing to validate and underscore everyone’s understanding, especially for something for which I have received two different answers over time.
  3. I treat decisions differently than statements of fact. If I ask a client, “Do your customers connect directly to your database?”, the answer is a statement of fact. There is a right and wrong answer to this question, and it can be validated independently. However, if I ask the customer, “How many customers do you want the database to support in five years?”, the answer is a decision or a target. There is no right or wrong answer, and it cannot be validated except by the same individual (assuming they are the decision-maker).

    I treat statements of fact very differently from decisions/targets:

    • I validate a statement of fact in a variety of ways. I might look at the user accounts on the existing system, or I might ask someone else in the organization, or I might look at the application for clues. 
    • For decisions or targets, validation can be more difficult. As mentioned above, I ask at least twice for any decision that can impact the scope of the project. If the answers differ, or if I feel like the answer is not solid and may change (based on my client’s tone of voice, hesitation, inconsistencies with other statements or requests, or other factors), I will ask again until I am satisfied that the answer is solid.
  4. For all important points that can impact the project time or cost estimate, or the database design or implementation, I always validate them in one fashion or another before we act on them. If I can’t validate a point for some reason, I call it out separately as an assumption in the estimate or quote, to bring it to the client’s and the team’s attention, and then I mention it directly when reviewing the document with them.

To sum up: as you might expect, the antidote to poor communication is good communication. Especially going into a project, keep the above in mind: get clarity and validate what you’re hearing. This will make you look good, your customers and technical team members will appreciate it, and your projects will be much more likely to succeed.

To get optimum value and results from your database project investments, contact Buda Consulting.

Oracle Active Data Guard Replication Error with Cisco Firewall: Root Cause and Workaround

One of our customers had issues over the past several months with Oracle Active Data Guard replication to a standby server. The database on the standby server would intermittently fall out of sync with the database on the primary server, and log files were not consistently shipping to the standby server. 
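When a standby falls behind like this, a couple of standard Data Guard views make the gap visible from SQL*Plus. A minimal sketch (run on the standby, as SYSDBA):

  -- Any rows returned here indicate a gap in the redo received from the primary
  SELECT thread#, low_sequence#, high_sequence#
  FROM   v$archive_gap;

  -- How far redo transport and apply are lagging behind the primary
  SELECT name, value
  FROM   v$dataguard_stats
  WHERE  name IN ('transport lag', 'apply lag');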

Active Data Guard Replication Error

The Oracle database version was 12.2.0.1 running on Oracle Linux 6. The firewall was a Cisco ASA-5585-SSP-40 running ASA version 9.6(4)8.

TNS tracing showed: CORRUPTION DETECTED: In redo blocks starting at block #…
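As an aside, this kind of trace output comes from Oracle Net server tracing, which is enabled in sqlnet.ora on the database servers. The parameter names below are standard; the level (16 is Oracle’s “support” level, and very verbose) and directory are assumptions to adjust for your environment:

  # sqlnet.ora on the primary and standby servers
  TRACE_LEVEL_SERVER=16
  TRACE_DIRECTORY_SERVER=/u01/app/oracle/network/trace

Remember to set TRACE_LEVEL_SERVER back to off once the traces are captured.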

By the time I got involved, the firewall administrators had already implemented all the recommended firewall changes to disable the following:

  • SQLNet fixup protocol
  • Deep Packet Inspection (DPI)
  • SQLNet packet inspection
  • SQL Fixup

The following errors were noted in the primary database alert log:

  • ORA-16055: FAL request rejected
  • ARC6: Standby redo logfile selected for thread 2 sequence 33351 for destination LOG_ARCHIVE_DEST_2
  • ARC6: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (12152)
  • ARC6: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned

The following errors were noted in the standby database alert log:

  • CORRUPTION DETECTED: In redo blocks starting at block #…
  • RFS[20]: Possible network disconnect with primary database
  • Error 1017 received logging on to the standby
  • FAL[client, USER]: Error 16191 connecting to…
  • ORA-16191: Primary log shipping client not logged on standby

The root cause of the problem turned out to be a bug in the Cisco firewall. For reasons unknown, when the primary and standby database listeners were using port 1521, the firewall would ignore the settings the admins had implemented for the Oracle Data Guard connections and revert to the default settings. As a workaround, we changed the listeners to a different port.
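For illustration, the port change itself is small: update the PORT value in listener.ora on both servers, then update every connect descriptor that references it (the Data Guard destinations in particular). A minimal sketch, where port 1522, StandbyServer, and the STANDBY_DB alias are assumptions:

  # listener.ora on each server: move the listener off the default port
  (ADDRESS=(PROTOCOL=tcp)(HOST=MyServer)(PORT=1522))

  # tnsnames.ora entry referenced by the Data Guard archive destination
  STANDBY_DB=
    (DESCRIPTION=
      (ADDRESS=(PROTOCOL=tcp)(HOST=StandbyServer)(PORT=1522))
      (CONNECT_DATA=(SERVICE_NAME=MyDBName)))

After the change, restart the listeners and re-enable the archive destinations so log shipping resumes on the new port.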

If you are experiencing intermittent or hard-to-diagnose database issues in your environment, contact Buda Consulting.

Test Your Disaster Recovery Strategies before Disaster Strikes

I’m sure you have heard of—if not experienced—the following scenario. A student is working on a research paper when suddenly her PC crashes. Because she did not follow the golden rule of saving your document every sixty seconds, she loses hours of work.

You would think that by the time you are well into your career, things like this would no longer happen. Unfortunately, they still happen all the time. When it comes to data, the one thing most people think about is backups. As long as the backups complete without error, you feel safe, because you believe you have all the files you need to restore your database in the event of a disaster.

But what if I told you that is not a complete disaster recovery strategy?  

We saw this issue play out recently when we were contacted by a company that needed our assistance. The client was trying to restore a standby database after temporarily losing power, and hit a software bug that requested an old archive log that had already been applied to the database. The requested archive log was so old (a search of emailed backup job logs showed it had been backed up nine months prior) and their retention policy kept backups for only 14 days, so there was no way to get the archive log back. This meant they were not able to restore their data. Long story short: the company lost all of the data in the database.

When one side of our disaster recovery strategy is working, we often overlook the other side: making sure we can actually restore the database from the files created. While the backup job may complete without errors, file corruption or the accidental deletion of a backup file can render your recovery plan, and your data, useless. This is why we at Buda Consulting recommend that our clients perform disaster recovery restore tests at least twice a year, and ideally quarterly.
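As a lighter-weight complement to a full test restore, Oracle’s RMAN can read through every backup piece a restore would need and confirm it exists and is usable, without writing anything. A minimal sketch, assuming RMAN-based backups:

  # Run from RMAN while connected to the target database.
  # VALIDATE reads the backup pieces a restore would use but restores nothing.
  RESTORE DATABASE VALIDATE;
  RESTORE ARCHIVELOG ALL VALIDATE;

Validation like this, run on a schedule, catches missing or corrupt backup pieces early. That said, only a periodic full restore to a scratch server proves the end-to-end procedure, so treat validation as a supplement, not a substitute.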

As the old saying goes, “It’s better to be safe than sorry,” and testing your disaster recovery data is essential to keeping you and your data safe!

Concerned (as you should be) about your disaster recovery and business continuity capability? Contact Buda Consulting to schedule a free consultation.