Question: A wildfire 10 miles away from your company headquarters is raging out of control. The fire captain just ordered everyone in your building to evacuate. All staff have safely evacuated premises, and now you are likewise heading out, taking one final look at your datacenter – still humming away, unsuspectingly. You have offsite data storage but no offsite server infrastructure, applications, etc.
What do you do?
I’m paraphrasing from a not-so-great movie here – Speed (Keanu may have been good in The Matrix but the predictable tête-à-tête between his and Dennis Hopper’s characters in Speed still makes me chuckle) – but IT executives today are, in fact, increasingly faced with the threat of disasters – whether natural (such as a wildfire) or man-made (e.g. some ding-dong crashing a vehicle into your datacenter). I may be taking a bit of creative license here, but this could not be a more serious issue. (Recall those horrible wildfires in the San Diego, California area a few years back? The example above was culled from situations experienced during that period.)
As organizations – and their customers – increasingly rely on database, server, and IP-connected applications and data sources, maintaining continuity of the business infrastructure and limiting costly downtime in the event of a disaster is paramount.
Though many an organization had active disaster recovery (DR) projects on the books a few years ago, the global financial crunch of the last 20 or so months has wreaked havoc on IT budgets everywhere; only now are many of these DR projects once again taking priority.
If you’re thinking that you can ‘wait it out’ and disaster won’t strike on your watch, think again. Apparently, some 93 percent of organizations have had to execute on their disaster recovery plans. Yep. This according to an annual DR survey from Symantec last year. A few more points from this survey:
- Companies [with active DR plans] take on average three hours to achieve skeleton operations after an outage, and four hours to be up and running
- The average annual budget for DR initiatives is $50MM (including backup, recovery, clustering, archiving, spare servers, replication, tape, services, DR plan development and offsite costs)
- Virtualization has caused 64 percent of organizations worldwide to reevaluate their DR plans
Whether your organization is a small recently funded startup or well-entrenched in the Fortune 100, designing, implementing, and testing a DR plan is an endeavor that takes dedication, careful planning and time (the entire process can take weeks or even months). There are many excellent resources available which can provide knowledge and detail as to the individual steps of a DR planning initiative. (Cisco’s DR Best Practices site or Disaster Recovery are great places to begin, by the way.) What follows is a high-level, best-practices overview of the planning process:
This first step of a successful DR plan involves two key components: One is to secure plan sponsorship and engagement from senior company leadership – CEO, COO, CIO, etc. The other is to establish a planning team that is representative of all functional units of the organization – sales, operations, finance, IT, etc. This step is the catalyst to a smooth planning initiative, and requires focus and patience. (The ability to herd cats wouldn’t hurt, either.) It may also be helpful to reduce the impact on internal resources by leveraging outside help from a consulting firm well-versed in DR planning.
This portion of the planning process – information gathering, due diligence and assessment – is the most involved and most time-consuming, and a true test of teamwork across the organization.
The first step in this part of a DR planning initiative is performing a Business Impact Analysis (BIA), which helps to assess the overall risk to normal business operations (and revenue flow) should disaster strike right this second. The BIA typically consists of identifying and ranking all critical business systems, analyzing the impact of an interruption on those systems, and most importantly, establishing the maximum length of time critical systems can remain unavailable without causing irreparable harm to the business. This length of time is also known as Maximum Tolerable Downtime (MTD). Working backwards from the MTD allows an acceptable Recovery Point Objective (RPO, how much recent data the business can afford to lose) and Recovery Time Objective (RTO, how quickly a system must be restored) to be set for each system.
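To make the "working backwards" idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the system names and hour figures are made up, and the rule that restore time plus post-restore catch-up work (labeled `wrt_hours` here) must fit inside the MTD is one common convention, not the only way to set an RTO. The RPO is simply taken as an input from the business.

```python
from dataclasses import dataclass


@dataclass
class SystemImpact:
    """One row of a hypothetical BIA worksheet (all fields are illustrative)."""
    name: str
    mtd_hours: float   # Maximum Tolerable Downtime agreed with the business
    wrt_hours: float   # assumed time to verify data and catch up after restore
    rpo_hours: float   # how much recent data the business can afford to lose

    @property
    def rto_hours(self) -> float:
        # Assumed rule: restore time plus catch-up work must fit inside the MTD.
        return max(self.mtd_hours - self.wrt_hours, 0.0)


def rank_by_criticality(systems):
    """Most critical first: the shorter the MTD, the higher the priority."""
    return sorted(systems, key=lambda s: s.mtd_hours)


# Hypothetical figures, for illustration only.
systems = [
    SystemImpact("payroll", mtd_hours=72, wrt_hours=8, rpo_hours=24),
    SystemImpact("order-entry", mtd_hours=4, wrt_hours=1, rpo_hours=0.25),
    SystemImpact("email", mtd_hours=24, wrt_hours=2, rpo_hours=4),
]

for s in rank_by_criticality(systems):
    print(f"{s.name}: MTD {s.mtd_hours}h, RTO {s.rto_hours}h, RPO {s.rpo_hours}h")
```

The ranking that falls out of a worksheet like this is what drives the rest of the plan: the systems at the top of the list get the fastest (and usually most expensive) recovery options.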
With the BIA in hand, the next steps are conducting a risk assessment and developing the recovery strategy. The risk assessment will help to determine the probability of a critical system becoming severely disrupted, identify vulnerabilities, and document the acceptability of these risks to the organization. Engagement from the entire planning team is necessary in order to accurately review and record details for critical records, systems, processing requirements, support teams, vendors, etc. – all needed in order to develop the recovery strategy.
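One simple way to record the outcome of a risk assessment is a scored risk register. The sketch below is only an illustration: the 1-to-5 scales, the likelihood-times-impact score, the acceptance threshold of 8, and the sample threats are all assumptions, and a real planning team would agree on its own scales and thresholds.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    system: str
    threat: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)

    @property
    def score(self) -> int:
        return self.likelihood * self.impact


def build_register(risks, acceptable_below=8):
    """Return (system, threat, score, decision) rows, highest score first.

    Scores below the (assumed) threshold are documented as accepted;
    everything else is flagged for mitigation in the recovery strategy.
    """
    rows = []
    for r in sorted(risks, key=lambda r: r.score, reverse=True):
        decision = "accept" if r.score < acceptable_below else "mitigate"
        rows.append((r.system, r.threat, r.score, decision))
    return rows


# Hypothetical entries, for illustration only.
risks = [
    Risk("datacenter", "wildfire", likelihood=2, impact=5),
    Risk("datacenter", "vehicle collision", likelihood=1, impact=4),
    Risk("order-entry", "extended power outage", likelihood=4, impact=3),
]

for row in build_register(risks):
    print(row)
```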
Also important in the recovery strategy is identifying the recovery infrastructure and outsourcing options – ideally alternate datacenter facilities from which critical systems and data can be recovered in the event of a serious interruption. This, as they say, is the point at which the bacon hits the frying pan: Many organizations are leveraging the power and abundance of Cloud-based IT resources to lower infrastructure costs, and Cloud is particularly applicable for DR. In fact, there are more than a few services that provide continuous data protection, typically accomplished via unobtrusive software agents residing on each server in a datacenter. These agents connect to a black box, also residing in the datacenter, which incrementally takes images of each server, de-duplicates the data, and then replicates that data via secure WAN to a remote data store, ultimately providing on-demand recovery (via secure web console) from the remote location at any time. Companies such as nScaled, iland, and Simply Continuous offer such services and can even help build a business case to illustrate the ROI for this service. Point is, do thy homework and explore whether Cloud services such as these might make a sound fit for your organization’s DR plan.
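For a feel of why de-duplication matters before data crosses the WAN, here is a toy Python sketch of the idea. It is not how any of the vendors named above implement their products: the fixed-size chunking, the SHA-256 content addressing, and the plain dictionary standing in for the remote data store are all assumptions chosen to keep the example short.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; unrealistically small so the demo is easy to follow


def replicate_image(image: bytes, remote_store: dict) -> list:
    """Send only chunks the remote store has not seen; return a manifest.

    `remote_store` is a dict standing in for the offsite data store,
    keyed by the SHA-256 digest of each chunk.
    """
    manifest = []
    for i in range(0, len(image), CHUNK_SIZE):
        chunk = image[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in remote_store:   # de-duplication: skip known chunks
            remote_store[digest] = chunk
        manifest.append(digest)
    return manifest


def restore_image(manifest: list, remote_store: dict) -> bytes:
    """Rebuild a server image from its manifest of chunk digests."""
    return b"".join(remote_store[d] for d in manifest)


store = {}
monday = replicate_image(b"AAAABBBBCCCC", store)
tuesday = replicate_image(b"AAAABBBBDDDD", store)  # only one chunk changed

print(len(store))                     # unique chunks held offsite
print(restore_image(tuesday, store))  # Tuesday's image, rebuilt remotely
```

Because Tuesday's image shares two of its three chunks with Monday's, only the changed chunk has to travel to the remote site, which is the property that makes frequent incremental images affordable.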
Planning and Testing
Armed with a full impact analysis, risk assessment, recovery goals, and outsourced options, now the actual DR plan can be developed. The DR plan is a living document that identifies the criteria for invoking the plan, procedures for operating the business in contingency mode, steps to recovering lost data, and criteria and procedures for returning to normal business operations. The key activities in this step are to name a recovery team in the DR plan (which should consist of both primary and alternate personnel from each business unit) and to identify recovery processes and procedures at each business unit level. Also important is to ensure the DR plan itself is available offsite – both via the web and in permanent media form (print, CD-ROM, etc.).
Just as important as having a DR plan is testing it regularly. This step includes designing disaster/disruption scenarios and developing and documenting action plans for each scenario. Tests only prove the plan works if they are run on a regular schedule and with full participation from operations.
Ongoing Plan Evaluation
A DR plan is only effective if it is continually kept in lock-step with all changes within the organization. Such changes include infrastructure, technology, and procedures – all of which must be kept under constant review, and the DR plan updated accordingly. Also, the results of DR plan testing should be evaluated on a regular basis, and any adjustments made (systems, applications, vendors, established procedures, etc.).
So there you have it – four key building blocks to tailoring a DR plan for your organization. Of course, if the ‘disaster’ arrives in the form of a city-sized asteroid hurtling towards Earth, needless to say any plan will likely not make much difference. Anything short of such a global catastrophe, however, and a well-developed and maintained DR plan will keep employees and customers connected and business moving forward, with minimum downtime.
Again, this is by no means a complete recipe for designing and implementing a DR plan but instead is meant to serve as a high-level overview…offered as food for thought. I encourage you to learn more, explore options, ask for help if needed – whatever it takes to thoroughly prepare your organization for the worst, should the worst ever occur. To loosely paraphrase our man Keanu once again from another of his, er, more questionable films from back in the day – Johnny Mnemonic – this is one topic where you absolutely, positively don’t want to “get caught in the 404”.
—Written by Marc Watley, Co-Founder & CEO of Datacenter Trust and CMO at Reppify. Datacenter Trust is an IT consulting and services delivery firm, helping growing businesses make smart decisions from sound financial analysis and business intelligence. Reppify is a leading-edge technology company pioneering the use of social media data to drive better business decisions. Follow on Twitter: Datacenter Trust @datacentertrust and Reppify @reppify