Tales of Business Continuity and Disaster Recovery Planning
Planning for Disaster
As we think about the enormous cleanup effort taking place in the wake of Hurricanes Harvey, Irma, and Maria, we are reminded of the importance of planning to keep your business up and running when disaster strikes. In this blog post I share the thoughts of an expert in the field of Business Continuity Planning and Disaster Recovery.
I recently had an interesting and fun conversation with Bob Cohen, Business Continuity and Disaster Recovery Practice Director at Pivot Point Security, about some key aspects of Business Continuity Planning and some cautionary tales about the risks of not doing so.
|Bob Buda:||Thanks for joining me today. I would like to ask you some questions about Business Continuity Planning and at times I would like to touch on the intersection of BCP and database management. You lead the Business Continuity Planning practice for Pivot Point Security. So what does business continuity mean to you? How do you define business continuity?|
|Bob Cohen:||It's good business. It's making sure that you can provide your goods and services to your clients even when things go wrong. So it's really just good business. You know, we now live in a global market; if you're not there, somebody else will be.|
|Bob Buda:||All right. I looked over the presentation that you use when you present to clients and one of the key elements is business impact analysis.|
|Bob Buda:||So can you talk a little bit about what business impact analysis is, and about the impact of, or dependency upon, data in this part of business continuity preparations?|
|Bob Cohen:||Sure. Business impact analysis is the foundation for recovery planning.
So my favorite analogy is: you contract with me to keep your car running regardless of what happens. And I say, “OK, fork over your premiums and I'll take care of you.” So you call me up one day and say, “My car got struck by lightning and it fried my car's computer.”
Then I come back and I give you a computer for a Tiguan. And you say “But I have a Malibu. This is not going to work for me”. OK. So I go off and I get you a computer for a Malibu. We install it and you're up and running.
Then you come back to me and say, “I rolled my car.” OK, so I get you a Malibu to replace it. “No, no, I didn't roll the Malibu. I rolled my other car.” OK, so I come back and bring you a Honda Accord. You say, “No, no, no, that was my family vehicle. I've got to be able to ferry around eight people.” OK. So I come back with an SUV. And you say, “This isn't going to work because I live on a mountain. I need a four-wheel-drive vehicle that can accommodate six people.” Oh, OK. So I come back and you say, “Damn it, I needed it four hours ago, because my wife is sick and I've got to go pick up my kids from school.”
That's what happens if you write a recovery plan without doing a Business Impact Analysis.
You're fishing for the requirements. So you write a plan without knowing what the requirements are, and chances are that when the feces hits the oscillating ventilation device and you implement your plan without clear requirements, you're going to wind up missing your recovery time objectives or your recovery windows.
|Bob Buda:||So what are the inputs to the business impact analysis?|
|Bob Cohen:||Well the inputs really come through interviews that I conduct with folks on the functional side.
It's not IT. I'm not looking to figure out from the data center folks what systems need to be recovered, because at the end of the day they probably don't know any more than who screams the loudest when systems fail. I go to the operations/functional side and get them to tell me what functions they perform and how quickly those functions need to be recovered, based on how long the company or the customers can live without them. Once I understand that, I have them identify the recovery requirements of the functions, and then I understand what IT supports and enables those functions. Then I can give the shopping list to IT. So when somebody says, “Yeah, I need to do a business continuity plan, so go talk to my CIO because he runs it,” I'm talking to somebody who doesn't understand business continuity planning.
They understand disaster recovery, which is implementing the procedures to recover the IT, but the recovery windows for IT must be driven by the ops side.
|Bob Buda:||So the business impact analysis is really the business side of the house?|
|Bob Cohen:||It's the functional recovery requirements document, which provides the prioritization to IT so that IT can write its DR plan and implement recovery in an order that supports the functional recovery priorities.|
|Bob Buda:||And as part of Pivot Point’s services, do you provide that DR element?|
|Bob Cohen:||Yes, I can provide the framework for the DR planning. What I can’t do is provide the button-pushing, knob-twisting, switch-flipping procedures for system recovery, because I'm not the IT guy who lives, breathes, and sleeps the customer’s environment. So I can do a functional BCP, and I can give the framework for the DR plan. But the appendices that provide the technical system recovery procedures are further into the weeds than I typically go.|
|Bob Buda:||So let's talk about that framework for the DR. We can drill a little bit into that for a moment. From a high level, what does that framework look like and what are the elements of that framework?|
|Bob Cohen:||OK, what are the strategies that you're going to implement if you can’t compute? That includes several elements: hardware, software, and data.
So for each of those three components you need a recovery strategy for each of the systems within your discrete recovery tiers, whether it's 0 to 12 hours, 12 to 24 hours, three days, etc. You need three strategies for each of the systems within those tiers. So I can develop the framework. I can talk with you about what strategies you need to implement. I can give you the high-level procedures for implementing those strategies.
Because let's face it if you're using a cloud provider for a SaaS implementation, the recovery strategy is going to be cracking the whip over your SaaS provider.
If you've contracted for telecommunications whether it's voice over IP or your internet provider your strategy is going to be the same, ensuring they live up to your Service Level Agreement.
So I can do all of those things for you. When it comes to the configurations for your firewalls, or the configurations for your Active Directory that ensure you're maintaining access controls for your users when you shift from your primary system to your alternate systems, that's where I have to rely on you as my customer to provide those.
|Bob Buda:||I know that two of the key elements of your plan are RTO and RPO. Can you describe what these are in the context of your business continuity framework and talk about how they relate to database management systems in particular?|
|Bob Cohen:||Okay. Recovery time objective (RTO): the point in time at which the loss of a function, or the degradation of a function below acceptable levels, becomes unacceptable. So when I'm talking to my clients I'll say, “OK, on a scale of 1 to 5, level 1 being operations are normal and level 5 being you're down hard, belly up, you’re dead, with 2, 3, and 4 as gradations in between: if you can live with level 1, can you live with level 2 for a period of time? If you can live with a level 2 impact for a period of time, can you live with a level 3 impact for a period of time? If the answer to that is yes, then can you live with a level 4 impact for a period of time? Where's the line in the sand where you say, ‘I can't ever get there’?” That’s the recovery time objective. And I include degradation below acceptable levels along with cessation of functionality, because a lot of people don't take that into account.
RPO: Recovery point objective is how much data you can afford to lose. So from a database management perspective, let's say you've got a database processing 24,000 transactions in a day, you do backups once per day, and your system fails 10 minutes before the backup. Can you afford to lose more than 23,800 transactions? If the answer is no, then you're not backing up frequently enough.
Now let's take a much more granular look. If you're processing a thousand transactions an hour, how many transactions can you afford to lose? Because you've got to recreate those transactions while you're still processing new stuff. So you get a backlog building behind you while you're continuing to process new transactions. So where is the point where you say, “I can't afford to lose X number of transactions, because I know if that happens I can never recover them and still maintain the work”? That's your RPO.
|Bob Buda:||In your framework, when it comes time to put the plan into action, the recovery organization is critical to successful recovery. Can you talk about who should be part of that team, from an IT and data management perspective?|
|Bob Cohen:||You’re talking about data management and IT, so you're talking about your disaster recovery team. When I develop a plan, depending upon the complexity of the organization, I may recommend to the client a recovery organization with multiple teams. And I always have a command team, because if you've got more than two groups doing recovery there needs to be coordination; more on that later. So focusing on IT, you need your disaster recovery coordinator, who is subordinate to your business continuity coordinator. Think of the org chart: your COO is your number three man in the organization, underneath the CEO/president, and off to the side in that first echelon of direct reports you’ve got your CIO. OK. Now let's look at it from a disaster recovery perspective. The CIO is your disaster recovery coordinator, who reports up to your business continuity coordinator, who’s managing the recovery of the entire operations of the organization.
So the DR guy is another piece of the overall recovery. You see, it's actually a compartment or piece of the overall recovery.
So in business continuity you've got your command team. You may have functional or departmental recovery teams, again depending upon the size of the departments. If you've got a one-person sales force, I'm not going to create a recovery team for sales. But if I've got a 45-person call center, then yes, that group will probably get their own plan. So I keep it very squishy to make it client-specific as I develop these recovery organizations for individual clients.
|Bob Buda:||And when you make these organizations do you specify particular individuals within the organization or do you specify positions?|
|Bob Cohen:||I go by specific people, because when the balloon goes up things get very confusing very quickly. So if I hand you a plan and say, “Hey Bob, we're going to deputize you to help Pivot Point recover. Call the SharePoint system administrator,” who are you going to call?|
|Bob Buda:||You don't know who that is, right?|
|Bob Cohen:||So if you've got a plan that says call Mike Gargiullo at 609 555-1212 now you know exactly who you're looking for. It's got to be by name.|
|Bob Buda:||I've been asking a lot of database-specific questions. This one is not really database-specific, but it's a really important part of the process of business continuity planning. Can you tell us about the importance of exercising the plan before the problem occurs?|
|Bob Cohen:||I'm trying to come up with a good analogy here.|
|You're going to go play baseball and you haven't played in a couple of years. You break out your favorite bat, you go to home plate, and you step up. Bases are loaded, two out, bottom of the ninth. You get a fastball, you swing the bat, and it cracks in half. Do you think you should have looked at the bat before you stepped up to the plate?|
|Bob Buda:||Good analogy, thank you. We're almost done. Are there any stories about either successes or failures in business continuity that come to mind that you'd like to share?|
|Bob Cohen:||Sure. Are you familiar with the debacle of Ericsson? You know, they used to be a huge cell phone manufacturer. Used to be! One of their contractors had a fire in a warehouse and couldn't provide a critical chip. It wasn't Ericsson that was at fault, but Ericsson is no longer a player because they lost 80 percent of the market share. And they never recovered.|
|Bob Buda:||Now do you have any success stories that come to mind?|
|Bob Cohen:||Let me give you another fun failure.|
|Bob Buda:||OK sure.|
|Bob Cohen:||In 2012, a federal organization in Silver Spring, Maryland had a minor problem at 4 a.m.
What had happened was they had a fault in a junction box, insulation caught on fire. OK.
That small problem took them down for two days because there were so many serial issues. It was like the perfect storm. They wound up losing power. They've got a five-building campus, and the repeater for the security guard radios was in the building that lost power, so the security guards couldn't talk to each other. The security guard force is owned and operated by the parent department downtown.
So the fire department was called. They came in, put the fire out, checked everything out, and left without giving the all clear. So the security guards' last order was: nobody goes in. So the executives show up, it is now 7:00 a.m., and the executives are saying, “I want to go in,” and the guards said, “No, my last order was you can't go in.” “But there's no fire.” “Yeah, but we never got the all clear.” “Where's the fire department?” “They left.” You see a problem with this? “We're not letting you in.” And because the security guards were managed by another entity, there was no local override authority. So by the time they got everything fixed it was two days later. They lost productivity for two full days. And then the epilogue is the same thing happened last year. Except it wasn’t the exact same thing: last year they were down for three days.
|Bob Buda:||So let's wrap up by talking just for a minute about how having a well-thought-out business continuity plan and then practicing that plan could have avoided that situation.|
|Bob Cohen:||OK. If they had had a viable recovery plan, they would have had a very clearly defined failover management organization. Instead, for two days they were having these coordination calls that were attended by 65 senior executives. So these guys got no sleep for two days because they didn't have a recovery management capability.
They had facilities running emergency management, the CIO running the disaster recovery, and a third office doing business continuity and all of these folks needed to play because they weren’t set up for a single consolidated overall coordinator.
So they're trying to do decisions by committee 65 to 70 strong.
The 65 or 70 people, parsed out over three organizations, were sending taskers to the subordinate services. Since there was no communication among them, the subordinate services were getting the same taskers three or four times. So rather than doing things efficiently and effectively and being down for three or four hours, maybe half a day, they had 70 or 75 people managing everything for two days, trying to brute force it.
So in recap: the requirement for having a recovery plan is really simple. Given unlimited resources and unlimited time any organization can recover from anything. But the last thing you want to do as a CEO is go into the board of directors and try to explain why it took you three weeks and three million dollars to recover instead of taking four days and $250,000.
|Bob Buda:||Thank you very much for your time. This was great. I really enjoyed our conversation.|
- In our conversation I referenced Bob’s BCP/DR presentation. Here is a link: http://pvt.pt/ISO22301PPSOverview
- Here is a link to the story about the fire at the Ericsson contractor site that Bob mentioned: https://www.wsj.com/articles/SB980720939804883010
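
To make the RPO arithmetic from the conversation concrete, here is a minimal Python sketch. It is an illustration only, not part of Bob's framework: it assumes transactions arrive at a uniform rate, and the function names and the 1,200-per-hour processing capacity are hypothetical values I picked for the example. The first function estimates how many transactions sit between the last backup and the failure; the second estimates how long it takes to re-enter them while new work keeps arriving.

```python
def transactions_at_risk(tx_per_day: float,
                         backup_interval_hours: float,
                         minutes_before_next_backup: float) -> float:
    """Transactions written since the last backup, assuming a uniform rate."""
    tx_per_minute = tx_per_day / (24 * 60)
    minutes_since_backup = backup_interval_hours * 60 - minutes_before_next_backup
    return tx_per_minute * minutes_since_backup


def catch_up_hours(lost_tx: float,
                   incoming_per_hour: float,
                   capacity_per_hour: float) -> float:
    """Hours needed to recreate lost transactions while new ones keep arriving."""
    spare_capacity = capacity_per_hour - incoming_per_hour
    if spare_capacity <= 0:
        # No headroom: the backlog never shrinks, so recovery never completes.
        return float("inf")
    return lost_tx / spare_capacity


# Scenario from the interview: 24,000 transactions/day, failure 10 minutes
# before the next backup.
daily = transactions_at_risk(24_000, backup_interval_hours=24, minutes_before_next_backup=10)
hourly = transactions_at_risk(24_000, backup_interval_hours=1, minutes_before_next_backup=10)
print(round(daily), round(hourly))

# Hypothetical capacity of 1,200 transactions/hour against 1,000/hour incoming.
print(round(catch_up_hours(daily, incoming_per_hour=1_000, capacity_per_hour=1_200), 1))
```

Under these assumptions, daily backups put roughly 23,833 transactions at risk, hourly backups about 833, and clearing the daily-backup backlog with only 200 transactions per hour of spare capacity takes about 119 hours. If capacity only matches incoming volume, the backlog never clears, which is the "I can never recover them and still maintain the work" point Bob describes.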
I hope you enjoyed the conversation that I had with Bob. Pivot Point and Buda work together to ensure that our clients' data assets are secure and protected against all types of threats.
If you have any questions about how to ensure that you are prepared to recover your databases in the event of a disaster, please give us a call at (888) 809-4803 x 700 and if you have further thoughts on the topic, please add comments.