Organizations with an active Disaster Recovery program conduct DR Tests to validate the Disaster Preparedness component of their IT Service Continuity strategies. Those exercises should validate, among other milestones, the overall recovery time (actual vs. planned RTO), the completeness of DR plan documentation, the level of preparedness among the recovery team and the overall effectiveness of the DR response.
As a BCP/DR software solution provider, we are often called on to assist customers in the preparation, management and enhancement of their DR Tests. After one of our Utility customers completed their DR Test preparation steps (see Part 2 of this blog), the time came to test their ability to execute. Based on management objectives, the in-scope recovery elements included 344 systems (Mainframe, Unix, Linux, Wintel, Storage devices, Networks, Databases, TSM, NBU, SAP) and 57 Tier-1 and Tier-2 Applications, and involved more than 275 IT staff. As part of this DR test, 137 distinct DR plans were activated with a planned Recovery Time Objective (RTO) of 72 hours.
Management thinker Peter Drucker is often quoted as saying “You can’t manage what you can’t measure.”
Infrastructure Subsystems: As part of the DR test, Milestone Dashboards (MDB) were set up to display the real-time status of the infrastructure Subsystem recoveries. A Subsystem could be in one of three states:
- System Down
- Restoration In-Progress
- System Restored
The visibility of restored infrastructure components allowed Application Owners to initiate Application Functionality Testing.
Application (IT Service): Certifying an application as “Available” required (a) restoration of the associated infrastructure Subsystems, (b) completion of Application Functional Testing and (c) completion of end-user Acceptance Testing. Metrics were adopted to measure Service Restoration status:
Application availability (100%) =
- Infrastructure sub-systems restored (60%) +
- Application Functional Testing complete (30%) +
- End-user Acceptance Testing (10%)
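To illustrate how such a weighted metric could feed a dashboard, here is a minimal Python sketch. The 60/30/10 weights come from the breakdown above; everything else is our assumption for illustration, in particular the function name and the rule that the infrastructure share is credited in proportion to the number of Subsystems restored (the exercise may equally have treated it as all-or-nothing).

```python
# Weights (in percentage points) taken from the breakdown above.
INFRASTRUCTURE_WEIGHT = 60
FUNCTIONAL_TESTING_WEIGHT = 30
ACCEPTANCE_TESTING_WEIGHT = 10


def application_availability(subsystems_restored: int,
                             subsystems_total: int,
                             functional_testing_done: bool,
                             acceptance_testing_done: bool) -> float:
    """Return an application's availability as a percentage (0-100).

    Assumption: infrastructure credit is proportional to the share of
    Subsystems restored. This rule is hypothetical, not from the exercise.
    """
    if subsystems_total <= 0:
        raise ValueError("an application needs at least one Subsystem")
    if not 0 <= subsystems_restored <= subsystems_total:
        raise ValueError("restored count must be between 0 and the total")

    infrastructure_share = subsystems_restored / subsystems_total
    return (INFRASTRUCTURE_WEIGHT * infrastructure_share
            + FUNCTIONAL_TESTING_WEIGHT * int(functional_testing_done)
            + ACCEPTANCE_TESTING_WEIGHT * int(acceptance_testing_done))


# All four Subsystems restored and functional testing complete,
# end-user Acceptance Testing still outstanding.
status = application_availability(4, 4, True, False)
```

Under these assumptions an application only reaches 100% once all three conditions are met, which matches the certification rule described above.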
DR Plans: 137 IT DR plans were in scope for the exercise, comprising 1,921 distinct Tasks. Each Task had a planned duration and a sequence in which it was to be executed. Based on the intra-plan Task sequence, and taking into consideration other inter-Plan linkages, dashboards displayed the Task status in different colors indicating: (a) Task ready for execution, (b) Task in progress, or (c) Task completed successfully.
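The readiness rule described here can be sketched in a few lines of Python. This is an illustration only, not how any particular product implements it: the Task names are made up, the predecessor map stands in for both intra-plan sequence and inter-Plan linkages, and the WAITING state is an assumed fourth state for Tasks whose predecessors have not yet finished.

```python
from enum import Enum


class TaskStatus(Enum):
    WAITING = "waiting on predecessors"  # assumed state, not named in the text
    READY = "ready for execution"
    IN_PROGRESS = "in progress"
    COMPLETED = "completed successfully"


def refresh_ready_tasks(status, predecessors):
    """Promote WAITING Tasks to READY once every predecessor is COMPLETED.

    `predecessors` maps a Task to the Tasks that must finish first,
    whether they sit in the same plan or in another plan.
    """
    updated = dict(status)
    for task, state in status.items():
        if state is TaskStatus.WAITING and all(
            status[p] is TaskStatus.COMPLETED
            for p in predecessors.get(task, ())
        ):
            updated[task] = TaskStatus.READY
    return updated


# Hypothetical Tasks spanning a storage, a network and an application plan.
status = {
    "storage.mount": TaskStatus.COMPLETED,
    "network.failover": TaskStatus.IN_PROGRESS,
    "db.restore": TaskStatus.WAITING,
    "app.start": TaskStatus.WAITING,
}
predecessors = {
    "db.restore": ["storage.mount"],
    "app.start": ["db.restore", "network.failover"],
}
refreshed = refresh_ready_tasks(status, predecessors)
```

In this example db.restore becomes ready because storage.mount has completed, while app.start keeps waiting on both the database restore and the network failover.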
Each of the DR Plans could be viewed as a Gantt chart showing its “critical path” and any inherent “slack” time.
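For readers unfamiliar with the terms, the critical path is the chain of dependent Tasks that determines the plan's total duration, and slack is how long a Task can slip without delaying the plan. The sketch below uses the standard critical path method (a forward and a backward pass over the Task graph). The Task names, durations and dependencies are invented for illustration, and it relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def critical_path(durations, predecessors):
    """Return (plan_duration, slack_per_task, zero_slack_tasks)."""
    graph = {task: predecessors.get(task, ()) for task in durations}
    order = list(TopologicalSorter(graph).static_order())

    # Forward pass: earliest start and finish for each Task.
    earliest_start, earliest_finish = {}, {}
    for task in order:
        earliest_start[task] = max(
            (earliest_finish[p] for p in graph[task]), default=0)
        earliest_finish[task] = earliest_start[task] + durations[task]
    plan_duration = max(earliest_finish.values())

    # Invert the dependency map so each Task knows its successors.
    successors = {task: [] for task in durations}
    for task, preds in graph.items():
        for p in preds:
            successors[p].append(task)

    # Backward pass: latest finish that does not delay the plan.
    latest_finish, slack = {}, {}
    for task in reversed(order):
        latest_finish[task] = min(
            (latest_finish[s] - durations[s] for s in successors[task]),
            default=plan_duration)
        slack[task] = latest_finish[task] - earliest_finish[task]

    # Tasks with zero slack lie on a critical path.
    critical = [task for task in order if slack[task] == 0]
    return plan_duration, slack, critical


# Hypothetical plan; durations are in hours.
durations = {"restore_storage": 4, "restore_network": 2, "restore_db": 6,
             "start_app": 1, "functional_test": 3}
predecessors = {"restore_db": ["restore_storage"],
                "start_app": ["restore_db", "restore_network"],
                "functional_test": ["start_app"]}
plan_duration, slack, critical = critical_path(durations, predecessors)
```

In this made-up plan the network restore can slip by eight hours without affecting the 14-hour total, whereas any delay to the storage, database, application start or functional test Tasks pushes the whole plan out.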
Stakeholders: Our client’s Annual Disaster Recovery test involved more than 275 IT staff over three days. Broadly, the stakeholders were grouped by their role or function during the test:
- Incident Commanders
- Recovery Teams
- Application Functionality Testers
- End-user Acceptance Testers
- Executives / Observers
Orchestration: The Disaster Recovery test involved Plan activation, Task allocation, monitoring of the critical path, real-time system/services dashboards, staff scheduling and issue management, all of which required a high level of coordination among the various stakeholders. eBRP’s CommandCentre (a DR automation platform) was deployed to support their Annual DR Test. Using CommandCentre to manage the exercise, with its real-time activity logging, integrated notification and array of dashboard displays, improved the overall efficiency of the test and its reporting capacity.
More about this series: Part 1 of this blog series detailed the setting of DR Test Scope and Objectives. Part 2 focused on Test Preparation, Part 3 covers Test Execution, and the series concludes with Test Review in Part 4.