May 2021

[originally published on ebrp.net; written by Colin Garrison]

A case study describing eBRP’s effort to help a Texas utility company eliminate a communications bottleneck and pass a mandated IT/DR test.

TROUBLE IN TEXAS

A Texas utility company had a stubborn problem: year after year, it failed to meet the recovery time objectives of its annual IT/DR tests, largely because of inefficiencies caused by the restricted visibility the participants had into each other’s activities.

Could eBRP Solutions help the company improve information-sharing and meet its IT/DR test objective?

A MANDATE TO DEMONSTRATE RECOVERABILITY

Our client in this case was a Texas utility company that provided power to more than three million consumer households.

The utility was under a mandate from its Board of Directors to regularly test and report on the IT service continuity and disaster preparedness of its business data center.

It was also obliged to prove to regulators that, in the event of an outage, it could recover services to its customers within a defined timeline.

POOR COMMUNICATION LEADS TO PERSISTENT FAILURE

Unfortunately, the company persistently took too long to recover. When the allotted recovery time of 72 hours expired, the recovery was typically only 80 to 90 percent complete. Full recovery routinely took between 76 and 80 hours.

The problem was not a matter of technical sophistication. The utility had the skills it needed to recover its operations.

The problem was simpler than that—and also more complicated. It involved a crippling bottleneck in communications.

A COMPLICATED RECOVERY EXERCISE

The utility’s annual DR exercise involved recovering over 50 applications, implementing over 140 recovery plans, executing 3,000 tasks, and managing hundreds of assets.

The work was managed by 200 participants divided among multiple stakeholder groups, including an IT infrastructure team, a team of business analysts who handled application validation, and a group responsible for performing end-user acceptance tests. A fourth group represented senior management.

The recovery was supposed to be completed within a Recovery Time Objective (RTO) of 72 hours.

This is a typical situation for an IT/DR exercise at a medium- or large-scale enterprise.

In the case of this company, it was also a recipe for paralysis because the utility’s business continuity platform was not robust enough to meet the needs of the exercise.

AN INADEQUATE LEGACY BCM PLATFORM

For 10 years, the company had used a BCM software product supplied by a vendor. This product was useful for helping the BC staff create, maintain, and update plans.

However, it was completely inadequate to the demands of managing an IT/DR test for an organization of the size and complexity of the utility company.

The legacy platform could not track the 3,000 tasks and hundreds of assets involved in the exercise, nor could it notify downstream participants when required predecessor actions were complete and they were clear to execute their own tasks.

The result was crippling delays: participants clogged the available communications channels trying to find out whether the required upstream tasks had been performed.

A COLORFUL WORKAROUND

To carry out the mandated IT/DR recovery exercise in the absence of a viable BCM platform, the utility hit on a colorful workaround. Unfortunately, it was also highly inefficient.

Recovery team participants placed red, yellow, or green sticky notes on the wall of their conference room to indicate the completion status of various tasks.

Inevitably, notes fell onto the floor or got mixed up, and the lack of any computerized search capability caused chronic delays.

The result of these limitations was to make the recovery team conference room resemble the floor of the New York Stock Exchange.

Members of the recovery team shouted back and forth while waving their arms to attract each other’s attention. The team wasted a great deal of time trying to reach their colleagues on the phone, waiting around for information, and trying to ascertain the status of various tasks from the wall covered with sticky notes.

This process chewed up vital hours, with the result that the utility never met its recovery time objective.

THE SEARCH FOR A SOLUTION

The company was well aware of the cause of its problems. Eventually it commenced a search among third-party providers for a BCM platform capable of solving its communication bottleneck.

None of the tools it looked at possessed the capability to see into the progress of multiple recovery plans simultaneously.

Then the utility learned about eBRP Solutions’ BCM platform eBRP Suite, with its focus on helping large, complex organizations achieve and validate timely recovery and its two modules, Toolkit (for planning) and CommandCentre (for testing, validation, and incident management).

The utility identified the following capabilities of CommandCentre as being particularly relevant to its need to eliminate its communications bottleneck:

  • Automatic Playbook Notification. Automatically notifies workers when upstream tasks on which they are dependent have been completed, greenlighting them to execute on their own tasks.
  • Realtime Status Information. Shows tasks that have been completed in green, those that are in progress in yellow, and those that have not yet started in red.
  • Task Allocation. Automatically allocates tasks based on roles and skills.
  • Decision Support. Calculates the impacts of alternative actions, giving leaders quantitative data for making decisions.
  • Expectation Management. Constantly calculates and recalculates when tasks are likely to be completed, giving downstream workers visibility into when they will be able to begin their own tasks.
  • Asset Status Information. Can track the status of hundreds of assets simultaneously, making the information available to every user.
  • Issue Management. Tracks and logs problems encountered in the completion of tasks.
  • Playback Functionality. Allows team members to go back after an exercise and rerun every task in order to identify speed bumps and improve future performance.
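
To make three of the capabilities above concrete (automatic notification, red/yellow/green status, and expected completion times), here is a minimal Python sketch. It is not eBRP's implementation or API. The `Task` and `Playbook` classes, the task names, the owners, and the hour estimates are all hypothetical, and the completion estimate simply assumes each unfinished task still needs its full estimated duration.

```python
from dataclasses import dataclass, field

# Status colors as described in the capability list (assumed mapping).
COLOR = {"not_started": "red", "in_progress": "yellow", "completed": "green"}


@dataclass
class Task:
    name: str
    owner: str                      # hypothetical team responsible for the task
    est_hours: float                # hypothetical duration estimate
    depends_on: list = field(default_factory=list)
    status: str = "not_started"


class Playbook:
    def __init__(self, tasks):
        self.tasks = {t.name: t for t in tasks}
        self.notifications = []     # (owner, task name) pairs, in send order

    def color(self, name):
        return COLOR[self.tasks[name].status]

    def is_ready(self, name):
        # A task is ready when it has not started and every upstream task is done.
        task = self.tasks[name]
        return task.status == "not_started" and all(
            self.tasks[d].status == "completed" for d in task.depends_on
        )

    def complete(self, name):
        # Mark a task done, then notify the owner of each task this unblocks.
        self.tasks[name].status = "completed"
        for t in self.tasks.values():
            if name in t.depends_on and self.is_ready(t.name):
                self.notifications.append((t.owner, t.name))

    def expected_finish(self, name):
        # Hours from now until the task finishes, given remaining upstream work.
        task = self.tasks[name]
        if task.status == "completed":
            return 0.0
        start = max((self.expected_finish(d) for d in task.depends_on), default=0.0)
        return start + task.est_hours


playbook = Playbook([
    Task("restore_db", "infrastructure", 4.0),
    Task("restore_app_server", "infrastructure", 2.0),
    Task("validate_app", "business_analysts", 3.0,
         ["restore_db", "restore_app_server"]),
    Task("user_acceptance", "end_users", 2.0, ["validate_app"]),
])

print(playbook.expected_finish("user_acceptance"))  # 9.0
playbook.complete("restore_db")
print(playbook.notifications)                       # []
playbook.complete("restore_app_server")
print(playbook.notifications)                       # [('business_analysts', 'validate_app')]
```

In this sketch the business analysts are notified only once both upstream restores are complete, which is the kind of signal the sticky-note wall could not provide. A real platform would also handle in-progress tasks, reassignment, and persistence, none of which are modeled here.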

TIME FOR THE BIG TEST

Following a training period, and newly equipped with CommandCentre, the utility’s recovery team embarked on its annual three-day IT/DR test. The exercise involved hundreds of people and over 140 recovery plans, and had the goal of recovering the 53 most critical applications within the required RTO of 72 hours.

As the exercise unfolded, the key CommandCentre functionalities gave the team members unprecedented visibility into the progress of the recovery.

This eliminated the clamor for information that had marred the previous recovery exercises, easing the strain on the recovery staff and leading to reduced crowding and confusion at the recovery center as well as smoother execution of the recovery playbook through all phases of the test.

THE FINAL RESULT

The final result of the exercise was highly gratifying to the members of the recovery team.

On previous runs of this test, the utility had recovered on average about 80 percent of the 53 or so most critical applications by the time the 72-hour RTO expired.

During its first full-scale IT/DR recovery exercise using eBRP Suite’s CommandCentre, the utility’s team recovered 100 percent of the 53 applications included in the test within 56 hours.

LOOKING AHEAD

Following the success of its first IT/DR test using eBRP Suite, the utility made use of CommandCentre’s Playback Functionality to further improve its performance.

On a subsequent exercise, it raised the number of applications in the test from 53 to 68 and still managed to recover all of them within 56 hours.

The utility continues to utilize eBRP CommandCentre and to improve its skills at recovering its operations.

eBRP is scheduled to support the utility in another full-scale IT/DR test this fall.

COULD eBRP HELP YOUR ORGANIZATION BECOME MORE RESILIENT?

Are you involved in business continuity and IT/Disaster Recovery at a large enterprise that is under obligations to demonstrate recoverability? Would you like to learn how eBRP Suite can help your organization become more resilient? Write to us at info@ebrp.net or telephone us toll-free at (888) 480-3277 (U.S.) or +1 (905) 677-0404 (International). Our team would be glad to hear about the BCM challenges facing your enterprise and explore how we can help you meet them.

eBRP Solutions . . . Because serious enterprises need a serious BCM platform.