Insights into creating a successful Disaster Recovery Test – Part 3: Metrics

Organizations with an active Disaster Recovery program conduct DR Tests to validate the Disaster Preparedness component of their IT Service Continuity strategies. Those exercises should validate – among other milestones – the overall recovery time (actual vs. planned RTO), the completeness of DR plan documentation, the level of preparedness among the recovery team, and the overall effectiveness of the DR response.

As a BCP/DR software solution provider, we are often called on to assist customers in the preparation, management and enhancement of their DR Tests. After one of our Utility customers completed their DR Test preparation steps (see Part 2 of this blog), the time came to test their ability to execute. Based on management objectives, in-scope recovery elements included 344 systems (Mainframe, Unix, Linux, Wintel, Storage devices, Networks, Databases, TSM, NBU, SAP) and 57 Tier-1 and Tier-2 Applications, and involved more than 275 IT staff. As part of this DR test, 137 distinct DR plans were activated with a planned Recovery Time Objective (RTO) of 72 hours.

Management thinker Peter Drucker is often quoted as saying “You can’t manage what you can’t measure.”

Infrastructure Subsystems: As part of the DR test, Milestone Dashboards (MDB) were set up to display real-time status updates of the infrastructure Subsystem recoveries. A Subsystem could be in one of three states:

  • System Down
  • Restoration In-Progress
  • System Restored

The visibility of restored infrastructure components allowed Application Owners to initiate Application Functionality Testing.

Application (IT Service): Certifying an application as “Available” required (a) restoration of the associated infrastructure Subsystems, (b) completion of Application Functional Testing and (c) completion of end-user Acceptance Testing. Metrics were adopted to measure Service Restoration status:

Application availability (100%) =

  1. Infrastructure sub-systems restored (60%) +
  2. Application Functional Testing complete (30%) +
  3. End-user Acceptance Testing (10%)
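The weighting above can be expressed as a small calculation. The sketch below is illustrative only: the milestone names and the `availability_percent` function are hypothetical, and it assumes each component is scored all-or-nothing, which the formula itself does not spell out.

```python
# Weights taken from the availability formula above.
# The milestone names are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "infrastructure_restored": 60,
    "functional_testing_complete": 30,
    "acceptance_testing_complete": 10,
}


def availability_percent(milestones):
    """Return the availability score (0-100) for one application.

    `milestones` maps a milestone name to True/False. Assumption: a
    milestone contributes its full weight once complete, and nothing
    before that.
    """
    return sum(
        weight
        for name, weight in WEIGHTS.items()
        if milestones.get(name, False)
    )


# Infrastructure restored and functional testing done, acceptance pending.
score = availability_percent(
    {"infrastructure_restored": True, "functional_testing_complete": True}
)
print(f"Application availability: {score}%")
```

Under these assumptions, an application whose infrastructure is restored and whose functional testing is complete, but which still awaits end-user acceptance, scores 90%.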

 



DR Plans: 137 IT DR plans were in scope for the exercise, comprising 1,921 distinct Tasks. Each Task had a planned duration and a sequence in which it was to be executed. Based on the intra-Plan Task sequence, and taking into consideration other inter-Plan linkages, dashboards displayed each Task’s status in a different color to indicate: (a) Task ready for execution, (b) Task in progress, or (c) Task completed successfully.
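One way to picture how a Task's dashboard status follows from its sequence is the sketch below. It is not the logic of any particular product: the function name, the Task identifiers and the extra "waiting" state (for Tasks whose predecessors have not yet finished) are assumptions made for illustration.

```python
def task_status(task_id, predecessors, started, completed):
    """Derive a dashboard status for one Task.

    predecessors: dict mapping a Task to the Tasks that must finish first
                  (these may belong to the same Plan or to another Plan).
    started / completed: sets of Task identifiers.
    """
    if task_id in completed:
        return "completed"
    if task_id in started:
        return "in-progress"
    # A Task becomes ready once every predecessor has completed.
    if all(p in completed for p in predecessors.get(task_id, [])):
        return "ready"
    return "waiting"  # assumed fourth state, not named in the article


# Hypothetical Tasks: the database restore depends on storage being back.
predecessors = {"restore_db": ["restore_storage"], "start_app": ["restore_db"]}
started = {"restore_db"}
completed = {"restore_storage"}

for task in ("restore_storage", "restore_db", "start_app"):
    print(task, "->", task_status(task, predecessors, started, completed))
```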

Each of the DR Plans could be viewed as a Gantt chart showing its “critical path” and any inherent “slack” time.
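For readers unfamiliar with the terms, the critical path is the longest chain of dependent Tasks, and slack is how long a Task can slip without delaying the Plan. The sketch below uses the standard forward-pass/backward-pass calculation on a made-up four-Task Plan; the Task names, durations (in hours) and the `schedule` function are hypothetical, and it requires Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter


def schedule(durations, predecessors):
    """Return (total_duration, slack_by_task, critical_path)."""
    graph = {t: list(predecessors.get(t, [])) for t in durations}
    order = list(TopologicalSorter(graph).static_order())

    # Forward pass: earliest start and finish for each Task.
    early_start, early_finish = {}, {}
    for t in order:
        early_start[t] = max((early_finish[p] for p in graph[t]), default=0)
        early_finish[t] = early_start[t] + durations[t]
    total = max(early_finish.values())

    # Backward pass: latest start that does not delay the Plan.
    successors = {t: [] for t in durations}
    for t, preds in graph.items():
        for p in preds:
            successors[p].append(t)
    late_start = {}
    for t in reversed(order):
        late_finish = min((late_start[s] for s in successors[t]), default=total)
        late_start[t] = late_finish - durations[t]

    slack = {t: late_start[t] - early_start[t] for t in order}
    critical = [t for t in order if slack[t] == 0]
    return total, slack, critical


# Hypothetical Plan; durations are in hours.
durations = {"restore_storage": 4, "restore_db": 6,
             "restore_network": 2, "start_app": 1}
predecessors = {"restore_db": ["restore_storage"],
                "start_app": ["restore_db", "restore_network"]}

total, slack, critical = schedule(durations, predecessors)
print("Plan duration (h):", total)
print("Critical path:", " -> ".join(critical))
print("Slack (h):", slack)
```

In this made-up Plan, the network restore can slip by eight hours without affecting the finish time, while any delay to the storage, database or application Tasks delays the whole Plan.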

Stakeholders: Our client’s Annual Disaster Recovery test involved more than 275 IT staff over three days. Broadly, the stakeholders were grouped by their role or function during the test:

  • Incident Commanders
  • Recovery Teams
  • Application Functionality Testers
  • End-user Acceptance Testers
  • Executives / Observers

Orchestration: The Disaster Recovery test involved Plan activation, Task allocation, critical-path monitoring, real-time system/service dashboards, staff scheduling and issue management, all of which required a high level of coordination among the various stakeholders. eBRP’s CommandCentre (a DR automation platform) was deployed to support their Annual DR Test. Using CommandCentre to manage the exercise, with its real-time activity logging, integrated notification and array of dashboard displays, improved the overall efficiency of the test and its reporting capacity.

 

More about this series: Part 1 of this blog series detailed the setting of DR Test Scope and Objectives. Part 2 focused on Test Preparation, Part 3 covers Test Execution, and the series concludes with Test Review in Part 4.

eBRP Insights


The eBRP Insights blog voice represents more than 50 years of BCM experience with corporate BCM program management and implementations. We've worked hand-in-hand with many government and private enterprises to assist them in developing viable BCM programs. eBRP is an active participant on LinkedIn and Twitter. The opinions expressed in this blog are ours and are intended to engage resiliency planners in conversations about the BCM industry, including its standards and future.
