Recently, eBRP was invited to observe the annual DR test of a southwest energy company, which was conducting the first Disaster Recovery test of its new Backup Data Centre. Our role included assisting and advising client teams on using features of our eBRP Suite (which holds all of their planning and Plan information) to monitor, measure and manage the test.
DR History
Previously, the Company contracted with a ‘warm site’ provider to practice “bare metal” recovery. Its plans were maintained in a legacy software system, and annual DR tests were limited to a select few applications.
After a 3-year backup data center construction project, the Company conducted its first failover test of all 135 applications classified as “Tier 0”, meaning those with RTOs of up to 120 hours. The associated Plans were migrated to eBRP Suite so that its CommandCentre EOC tool could be used to monitor and manage recovery tasks.
Stakeholders
Stakeholders involved in this Annual DR test included:
· Infrastructure Restoration Teams
· Application Validation Teams
· Client (Application) Testers
· Incident Commanders & Senior Managers
More than 200 staff members were engaged over the course of the test, which was originally estimated to require 120 hours. This was the first time this DR strategy had been tested.
Test Results
Over 240 servers, including mainframe, AIX, Linux and Windows servers, along with DB2, Oracle, HANA and SQL databases, were in the test scope. After an 8 am kickoff on Tuesday, October 18, Infrastructure recovery proceeded smoothly, and scripted Server restoration progressed according to planning assumptions.
Tier 0 apps with 24-hour RTOs were up and running (ready for end-user testing) within the first 6 hours. In all, more than 135 applications were restored, verified and presented for testing within 60 hours.
eBRP’s CommandCentre was used as intended by every group involved: Incident Managers, Recovery Teams and Client Testers, as well as Senior Managers monitoring the progress of the test.
Lessons Learned
A test of this scope (recovery of more than 100 critical applications, with 200+ recovery team members working round the clock in 8-hour shifts) provided insights that were very different from the assumptions made during the planning stages:
· Incident Managers and Senior Managers prefer a ‘35,000 ft.’ view of progress. They seldom wish to drill down into greater detail.
· Once the recovery is underway, RTO, RPO and other BIA parameters are largely irrelevant.
· A mass notification tool, integrated with eBRP, was effective for polling, sending periodic updates and providing relevant instructions to teams during recovery.
· Understanding dependencies (on infrastructure, other apps, etc.) is absolutely essential for ensuring efficient workflow between dependent teams.
· Plans are important, but to monitor & manage the incidents, Incident Commanders & Senior Managers depended on high-level Dashboards that were refreshed in real-time.
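The point about dependencies in the list above can be made concrete with a small sketch. It is illustrative only: the dependency map and the `recovery_waves` helper are hypothetical, invented for this example, and do not represent the Company's applications or any eBRP Suite feature. It uses Python's standard `graphlib` module (Python 3.9+) to group recovery items into "waves", where everything in a wave can be worked in parallel once the previous wave is complete.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key depends on the items in its set.
# The names are made up for illustration; they are not from the test.
dependencies = {
    "network": set(),
    "storage": {"network"},
    "database": {"storage"},
    "billing_app": {"database", "network"},
    "outage_portal": {"database"},
}


def recovery_waves(deps):
    """Group items into waves; items in the same wave have no
    unmet dependencies and can be restored in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as restored
    return waves


waves = recovery_waves(dependencies)
for number, wave in enumerate(waves, start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

In this toy map the two applications land in the same final wave, so their teams would know they can start together but only after the database is back. A circular dependency surfaces as an error at planning time rather than as a stalled hand-off during recovery.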
The entire exercise was managed and monitored using eBRP Suite. The test concluded with all Applications restored and Client Tests validated in less than half the original timeframe. Unlike earlier tests, there was no need for Conference Bridges or yellow Post-It notes lining the EOC walls.