
Resilience Engineering #14: Company Scorecard

by Gary Monti on September 20, 2011

How big of a hit can your organization take? Can you prevent it? What resilience score would you give your organization? Ron Westrum gives some good criteria in Resilience Engineering: Concepts and Precepts.

Threats and Timeframe

An important issue revolves around the time horizon surrounding the threat and when the organization responds to it. There are 3 categories to consider:

  • Foresight is the ability to prevent something bad from happening to the organization.
  • Coping is the ability to prevent something bad that has happened from getting worse.
  • Recovery is the ability to recover from something bad once it has happened.

Foresight

Foresight has two components. The first is profiting from lessons learned and dealing with threatening situations in a constructive way through avoidance (elimination of the threat) or mitigation (dampening the probability or impact of a risk) strategies. This is what could be considered standard risk management.

The second is more interesting. It has to do with weak signal analysis. This comprises sensitivity to emerging trends within the environment and taking steps early to fend off the threat or to be prepared to deal with it successfully should it turn into a problem.

The problem with weak signal analysis is that the findings may not fit cultural norms and can be dismissed out of hand as incorrect, over-reactive, or the work of a crackpot. The use of radar at Pearl Harbor in 1941 is a good example. Accurate information was generated regarding the incoming Japanese attack, and acting on it would have allowed better preparation. The problem was that advanced technologies such as radar weren’t yet part of the military culture and were considered “out there,” so the information was ignored and the opportunity to prepare for the attack was missed.

Do you do any weak signal analysis to see what trends might be developing? How familiar is your organization with the competitive environment? If you do get that information, what is done with it? Is it converted into something actionable?

Coping

Coping can comprise two approaches. The first is familiar to most of us. It is toughness in terms of being able to absorb, say, a no-cost change order. This is what would be called “robust” in previous blogs. There is a second intriguing aspect to coping, which can promote long-term survivability. It is the ability to redesign/restructure the organization right in the middle of the trouble. There is an everyday word for this – flexibility.

The trend to switch from being a computer company that provides services to a service company that uses computers is a very good example of coping.

Recovery

How is the recovery from a seriously damaging event handled? Is the focus on the principles that best serve the organization’s market niche, or is there a search for the guilty and punishment of the innocent? Apple is probably the best example of recovery. It has gone from about 2% market share in personal computers to being the second-biggest company on Wall Street by market capitalization, beaten out only by ExxonMobil.

So the questions are, “What would your organization’s score be when it comes to foresight, coping and recovery? What would you do to improve them?”
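To make the self-assessment concrete, here is a minimal sketch of such a scorecard in Python. The three categories come from Westrum; the 1-to-5 scale, the sample ratings, and the simple average are illustrative assumptions, not anything prescribed in the book.

```python
# Hypothetical resilience scorecard: rate each capability from 1 (weak) to 5 (strong).
# The categories are Westrum's; the scale and sample ratings are illustrative only.

scores = {
    "foresight": 2,   # e.g., weak signal analysis is rarely done
    "coping":    4,   # e.g., absorbs shocks and can restructure mid-crisis
    "recovery":  3,   # e.g., recovers, but slowly and with blame-seeking
}

overall = sum(scores.values()) / len(scores)

for capability, score in scores.items():
    print(f"{capability:>9}: {score}/5")
print(f"{'overall':>9}: {overall:.1f}/5")
```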

Two key computers crashed irreversibly last week and an unobservant driver hit my car. Business deadlines can’t be moved. The next 3 weeks are on the road. What to do? Pause, breathe, think and act. It’s just another project, one that is rather personal but still a project just the same.

Pause and Ask The Right Questions

A series of questions helped steer through this project:

  • Even if it is unrelated, did these events occur while pursuing what is best (to do)?
  • Separate from personal feelings and desires, can I accept myself, the situation, and the people involved?
  • Can an adequate list of the principles and constraints be drawn up for each stakeholder? This list started at the moment of the accident and computer crashes and includes the policeman, the other driver, insurance agents, the computer repairman, clients, etc.
  • Can personal limits along with available resources be listed?
  • Is there a risk management plan in place for dealing with loss of time, money, and resources?
  • Can an adequate plan be built to get back on track and stay on track? Can that plan adapt to new information?

Breathe and Think

Before getting on to using the questions, it is worth pointing out that the saving grace in all this was the “what ifs” thought through over the years, along with implementation of the associated strategies. It is in line with an earlier blog regarding the “Titanic,” i.e., instead of trying to design a ship that wouldn’t sink, it would have been better to design in response to the question, “What do we do if the ship does sink?” Applied here, it translates into saying well in advance, “It could happen, lean into it, generate a plan,” instead of just reacting to problems by saying, “This shouldn’t be happening to me because…!”

Take Action

Actions comprise weaving the results of pursuing the questions with the risk response strategies. Centeredness has taken shape in the midst of the anger, disappointment, frustration, etc. This centeredness surfaced the question:

Do I stay with what can be done or get lost in reacting?

One example of staying with what CAN be done involves some key databases and revolves around asking, “What if the hourly backups that should never corrupt actually do?” The worst-case costs led to additional backups on separate equipment for especially important files, beyond the imaged external hard drives. THAT strategy paid off handsomely. Somehow the hourly files were corrupted, and there has been no time yet to explore why. The additional belts-and-suspenders backups saved the day. They are running well on the new computer. The jury is still out on the second computer, which is being fixed under warranty.
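For anyone wanting to build the same belts-and-suspenders habit, here is a minimal Python sketch of that second-copy strategy: copy critical files to a separate physical device and verify each copy with a checksum. The paths are hypothetical placeholders, not the actual setup described above.

```python
# Minimal sketch of a belts-and-suspenders backup: copy critical files to a
# second, independent device and verify every copy end to end with a checksum.
# Both paths are hypothetical placeholders.

import hashlib
import shutil
from pathlib import Path

CRITICAL = Path("data/critical")      # assumed location of the key database files
SECONDARY = Path("/mnt/usb_backup")   # assumed separate physical device

def sha256(path: Path) -> str:
    """Hash a file so the copy can be verified against the original."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

for src in CRITICAL.rglob("*"):
    if src.is_file():
        dst = SECONDARY / src.relative_to(CRITICAL)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        # An unverified backup is a hope, not a backup.
        assert sha256(src) == sha256(dst), f"verification failed: {src}"
```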

The gods of blogging must have been watching all this. On walking into the computer shop, a conversation was under way. It went something like this: “We couldn’t recover any data. You can send them to a recovery specialist. Prices start at $700/hard drive and go up from there. Since you have several hard drives that need to be recovered…well, you can see where the math is going.”

Pause, breathe, think, and act. The more it is done when everything is okay the better it will be when things go south.

Did I mention my car was hit? With that there is repair, a rental, insurance adjusters, claim adjusters… whoa! Got to get packing! Plane to catch. It looks like more pausing, breathing, and thinking while on the road. Sleep will be sometime in May.

4 steps to effective Disaster Recovery planning

by Marc Watley on August 23, 2010

Question: A wildfire 10 miles away from your company headquarters is raging out of control. The fire captain just ordered everyone in your building to evacuate. All staff have safely evacuated the premises, and now you are likewise heading out, taking one final look at your datacenter – still humming away, unsuspecting. You have offsite data storage but no offsite server infrastructure, applications, etc.

What do you do?

I’m paraphrasing from a not-so-great movie here – Speed (Keanu may have been good in The Matrix but the predictable tête-à-tête between his and Dennis Hopper’s characters in Speed still makes me chuckle) – but IT executives today are, in fact, increasingly faced with the threat of disasters – whether natural (such as a wildfire) or man-made (e.g. some ding-dong crashing a vehicle into your datacenter). I may be taking a bit of creative license here, but this could not be a more serious issue. (Recall those horrible wildfires in the San Diego, California area a few years back? The example above was culled from situations experienced during that period.)

As organizations – and their customers – increasingly rely on databases, servers, and IP-connected applications and data sources, maintaining continuity of the business infrastructure and limiting costly downtime in the event of a disaster is paramount.

Though many organizations had active disaster recovery (DR) projects on the books a few years ago, the global financial crunch of the last 20 or so months has wreaked havoc on IT budgets everywhere; only now are many of these DR projects once again taking priority.

If you’re thinking that you can ‘wait it out’ and disaster won’t strike on your watch, think again. Apparently, some 93 percent of organizations have had to execute on their disaster recovery plans. Yep. This according to an annual DR survey from Symantec last year. A few more points from that survey:

  • In general it takes companies [with active DR plans] on average three hours to achieve skeleton operations after an outage, and four hours to be up and running
  • The average annual budget for DR initiatives is $50MM (including backup, recovery, clustering, archiving, spare servers, replication, tape, services, DR plan development and offsite costs)
  • Virtualization has caused 64 percent of organizations worldwide to reevaluate their DR plans

Whether your organization is a small, recently funded startup or well-entrenched in the Fortune 100, designing, implementing, and testing a DR plan is an endeavor that takes dedication, careful planning, and time (the entire process can take weeks or even months). There are many excellent resources available which can provide knowledge and detail as to the individual steps of a DR planning initiative. (Cisco’s DR Best Practices site or Disaster Recovery are great places to begin, by the way.) What follows is a high-level, best-practices overview of the planning process:

Executive Sponsorship

This first step of a successful DR plan involves two key components: One is to secure plan sponsorship and engagement from senior company leadership – CEO, COO, CIO, etc. The other is to establish a planning team that is representative of all functional units of the organization – sales, operations, finance, IT, etc.  This step is the catalyst to a smooth planning initiative, and requires focus and patience.  (The ability to herd cats wouldn’t hurt, either.) It may also be helpful to reduce the impact on internal resources by leveraging outside help from a consulting firm well-versed in DR planning.

Information Gathering

This portion of the planning process – information gathering, due diligence and assessment – is the most involved and most time-consuming, and a true test of teamwork across the organization.

The first step in this part of a DR planning initiative is performing a Business Impact Analysis (BIA), which helps to assess the overall risk to normal business operations (and revenue flow) should disaster strike right this second. The BIA typically comprises identifying and ranking all critical business systems, analyzing the impact of an interruption to each, and, most importantly, establishing the maximum length of time critical systems can remain unavailable without causing irreparable harm to the business. This length of time is also known as Maximum Tolerable Downtime (MTD). Working backwards from the MTD allows an acceptable Recovery Point Objective (RPO) and Recovery Time Objective (RTO) to be set.
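To make that working-backwards step concrete, here is a minimal sketch of the arithmetic in Python. Every number, including the safety margin, is an illustrative assumption rather than a standard:

```python
# Illustrative only: deriving recovery objectives by working backwards from MTD.

mtd_hours = 8        # assume: order system down more than 8 hours causes irreparable harm
safety_margin = 0.2  # assume: leave 20% headroom for verification and cutover surprises

rto_hours = mtd_hours * (1 - safety_margin)  # systems must be restored within this window
rpo_hours = 1.0                              # assume: at most one hour of data loss is tolerable

# The RPO dictates how often data must be captured offsite.
backup_interval_hours = rpo_hours

print(f"MTD {mtd_hours}h -> RTO {rto_hours:.1f}h, RPO {rpo_hours}h")
print(f"Backups/replication must run at least every {backup_interval_hours}h")
```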

With the BIA in hand, the next steps are conducting a risk assessment and developing the recovery strategy. The risk assessment will help to determine the probability of a critical system becoming severely disrupted, identify vulnerabilities, and document the acceptability of these risks to the organization. Engagement from the entire planning team is necessary in order to accurately review and record details for critical records, systems, processing requirements, support teams, vendors, etc. – all needed in order to develop the recovery strategy.

Also important in the recovery strategy is identifying the recovery infrastructure and outsourcing options – ideally alternate datacenter facilities from which critical systems and data can be recovered in the event of a serious interruption. This, as they say, is the point at which the bacon hits the frying pan: Many organizations are leveraging the power and abundance of Cloud-based IT resources to lower infrastructure costs, and Cloud is particularly applicable for DR. In fact, there are more than a few services that provide continuous data protection, typically accomplished via unobtrusive software agents residing on each server in a datacenter. These agents are connected to a black box also residing in the datacenter, incrementally taking images of each server, de-duplicating the data, then replicating that data via secure WAN to a remote data store, ultimately providing on-demand recovery (via secure web console) from the remote location at any time. Companies such as nScaled, iland, and Simply Continuous offer such services and can even help build a business case to illustrate the ROI. Point is, do thy homework and explore whether Cloud services such as these might be a sound fit for your organization’s DR plan.
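As a rough illustration of the deduplication-and-replication idea behind such continuous data protection services, here is a minimal Python sketch. The chunk size, the in-memory “remote store,” and the hashing scheme are simplifying assumptions; real agents work at the block or image level and ship data over a secure WAN:

```python
# Sketch of content-addressed deduplication: store each unique chunk once,
# keyed by its hash, so incremental snapshots ship only chunks the remote
# store has not yet seen. All details here are simplifications.

import hashlib

CHUNK_SIZE = 4 * 1024 * 1024          # 4 MB chunks; real agents size these adaptively
remote_store: dict[str, bytes] = {}  # stand-in for the replicated remote data store

def snapshot(data: bytes) -> list[str]:
    """Return a chunk-hash manifest for one snapshot, uploading only new chunks."""
    manifest = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in remote_store:   # dedup: skip chunks already replicated
            remote_store[digest] = chunk
        manifest.append(digest)
    return manifest

def restore(manifest: list[str]) -> bytes:
    """Rebuild a server image from the remote store, on demand, at any time."""
    return b"".join(remote_store[d] for d in manifest)

m1 = snapshot(b"a" * 10_000_000)         # first snapshot ships every chunk
m2 = snapshot(b"a" * 10_000_000 + b"b")  # next snapshot ships only the changed tail chunk
assert restore(m2) == b"a" * 10_000_000 + b"b"
```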

Planning and Testing

Armed with a full impact analysis, risk assessment, recovery goals, and outsourcing options, the actual DR plan can now be developed. The DR plan is a living document that identifies the criteria for invoking the plan, procedures for operating the business in contingency mode, steps to recovering lost data, and criteria and procedures for returning to normal business operations. A key activity in this step is to identify in the DR plan a recovery team (which should consist of both primary and alternate personnel from each business unit) and the recovery processes and procedures at each business-unit level. Also important is to ensure the DR plan itself is available offsite – both via the web and in permanent media form (print, CD-ROM, etc.).
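As a thumbnail of what such a living document covers, here is a hypothetical skeleton in Python. Every entry, including the names, is an illustrative placeholder rather than a prescription:

```python
# Hypothetical outline of a DR plan's contents; all entries are placeholders.

dr_plan = {
    "invocation_criteria": [
        "critical system outage expected to exceed the RTO",
        "primary datacenter inaccessible or destroyed",
    ],
    "contingency_operations": {
        "sales": "take orders via hosted web forms at the alternate site",
        "finance": "switch to the offsite payroll service",
    },
    "data_recovery_steps": [
        "restore the latest replicated images at the alternate datacenter",
        "confirm data loss is within the RPO",
    ],
    "return_to_normal": "criteria and procedures for failing back to the primary site",
    "recovery_team": {
        "IT": {"primary": "A. Smith", "alternate": "B. Jones"},  # fictional names
    },
    "offsite_copies": ["secure web portal", "print binder", "CD-ROM"],
}

print(f"plan covers {len(dr_plan)} sections")
```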

Equally important to having a DR plan is regular testing. This step includes designing disaster/disruption scenarios and the development and documentation of action plans for each scenario. Conducting tests regularly, with full operational participation, is key.
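A minimal sketch of what scenario-driven testing can look like in code follows; the scenarios, steps, and RTO budget are illustrative assumptions:

```python
# Sketch of a scenario-driven DR drill: walk each disruption scenario and time
# the rehearsal against an RTO budget. All scenarios and steps are illustrative.

import time

RTO_BUDGET_SECONDS = 4 * 3600  # assume the survey's "four hours to be up and running"

scenarios = {
    "datacenter fire": [
        "declare disaster and invoke the DR plan",
        "fail critical systems over to the alternate site",
        "verify services and notify stakeholders",
    ],
    "corrupted hourly backups": [
        "restore from the secondary backup set",
        "confirm data loss is within the RPO",
    ],
}

def run_drill(name: str) -> None:
    start = time.monotonic()
    for step in scenarios[name]:
        print(f"[{name}] {step}")  # in a real drill, each step is executed and verified
    elapsed = time.monotonic() - start
    print(f"[{name}] finished in {elapsed:.1f}s (budget {RTO_BUDGET_SECONDS}s)")

for name in scenarios:
    run_drill(name)
```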

Ongoing Plan Evaluation

A DR plan is only a good plan if continually kept in lock-step with all changes within the organization. Such changes include infrastructure, technology, and procedures – all of which must be kept under constant review, with the DR plan updated accordingly. DR plan testing should likewise be evaluated on a regular basis and any adjustments made (systems, applications, vendors, established procedures, etc.).

So there you have it – four key building blocks to tailoring a DR plan for your organization.  Of course, if the ‘disaster’ arrives in the form of a city-sized asteroid hurtling towards Earth, needless to say any plan will likely not make much difference. Anything short of such a global catastrophe, however, and a well-developed and maintained DR plan will keep employees and customers connected and business moving forward, with minimum downtime.

Again, this is by no means a complete recipe for designing and implementing a DR plan but instead is meant to serve as a high-level overview…offered as food for thought. I encourage you to learn more, explore options, ask for help if needed – whatever it takes to thoroughly prepare your organization for the worst, should the worst ever occur. To loosely paraphrase our man Keanu once again from another of his, er, more questionable films from back in the day – Johnny Mnemonic – this is one topic where you absolutely, positively don’t want to “get caught in the 404”.