The Resilience Within Disaster Recovery

Written by Tony Broughton | Apr 26, 2023 2:33:33 PM

Natural disasters are unplanned, unpredictable and can result in devastating loss. In 2022 alone, the United States endured 18 climate-related disasters. Storms, wildfires, and flooding are just a few of the things that can wreak havoc on our lives, our possessions, and our processes. Statistics show that the number of weather-related catastrophes continues to rise from year to year.

Much like natural disasters, cybercriminals are also unpredictable and can have a detrimental impact on business information and systems. Over 200 million cyberattacks were reported in 2022, with an attack happening roughly every 40 seconds.

Given the increasing occurrence of disasters, whether natural or human-generated, recoverability is quickly becoming a necessity in daily life. Staying protected takes diligence: businesses need to adopt best practices in order to protect systems and data from damage and theft.

Maintain vs. Restore

It is essential that companies have appropriate measures in place to plan, protect, and recover. A Business Continuity Plan (BCP) outlines the processes and procedures that maintain continuity during interruptions. It is a way of protecting against a complete outage by ensuring that essential functions can continue to be performed.

In comparison, a Disaster Recovery (DR) plan focuses on restoring the business as a whole by providing a short-term method designed to get systems back online as quickly as possible. DR serves as an organization's method of regaining access to, and the functionality of, its IT infrastructure after a chaotic event has occurred. Although BCPs focus more on operational resiliency, they do go hand-in-hand with a DR initiative. A BCP includes procedures for reporting a disruption, while pinpointing the resources and strategies needed to restore operations. A DR plan is the initiative that follows when the actual recovery takes place.

Although they work in tandem, disaster recovery does differ from Business Continuity (BC). Where the focus of BC is on operational resiliency, the primary objective of a DR plan is to sustain System Resilience (SR) within critical business operations on the production network. SR refers to the ability of a system to return to its previous operational state after an adverse event so that it can maintain an acceptable level of service.

DR Best Practice

It isn't so much a matter of if a company will experience a mechanical fault; it is more a matter of when. Systems incur wear and tear on their operating components and inevitably require repairs and tune-ups from time to time. Implementing a solid line of defense and recovery provides organizations with data security and reliability.

Enlisting High Availability (HA) empowers a system with the ability to provide uninterrupted service and minimize downtime. The HA mechanisms are proactive approaches that organizations typically rely on and include redundancy, load balancing, clustering, and virtualization.

Redundancy plays a major role in continuity by making components available in the event of a system failure. Much like a backup generator, if the main system fails, operations will avoid disruption and will continue to function as normal via the replicated system.
Load Balancing involves the distribution of workloads across multiple resources to ensure functionality and performance during times of disruption.
Clustering is a process of employing multiple redundant systems that go into action by taking over the work of a failed system during a disaster. Clusters involve connecting multiple computer systems together to work as one, with each having the ability to take over for the others. This allows for continued functionality even if one or more servers fail.
Virtualization involves creating a virtual version of a server, operating system, application, or storage device to be at-the-ready for deployment when the physical resource has been compromised.
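
As a rough illustration of how redundancy and load balancing work together, the sketch below rotates requests across several redundant backends and skips any that have been marked as failed. The class name, method names, and backend names (`app-1`, `app-2`, `app-3`) are hypothetical and chosen only for this example; production load balancers add health probes, weighting, and connection draining.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Spread requests across redundant backends, skipping failed ones."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)  # endless rotation over all backends

    def mark_down(self, backend):
        # Record that a backend has failed (hypothetical health signal).
        self.healthy.discard(backend)

    def mark_up(self, backend):
        # Return a known backend to service after repair.
        if backend in self.backends:
            self.healthy.add(backend)

    def next_backend(self):
        # Check each backend at most once per call so we never loop forever.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
first_round = [balancer.next_backend() for _ in range(3)]

# Simulate a failure: the remaining backends absorb the workload.
balancer.mark_down("app-2")
after_failure = [balancer.next_backend() for _ in range(4)]

print(first_round)
print(after_failure)
```

Once `app-2` is marked down, requests continue to flow to the remaining backends without any change on the caller's side, which is the behavior the HA mechanisms above aim for.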

Used in disaster recovery, RAID (Redundant Array of Independent Disks) combines multiple hard disks into a single virtual drive. In its redundant configurations, this keeps data online while providing a quick recovery from disk failure. The amount of time it takes to recover from a disk failure can also serve as a measure of the effectiveness of HA. The use of RAID is integral to an organization’s capability to access information because it distributes the storage load of data evenly amongst multiple storage devices. To the user, the array appears as a single drive.
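
The article does not name a specific RAID level, so the following is only a minimal sketch that assumes a parity-based layout, similar in spirit to RAID 5. It shows why one failed disk can be rebuilt: the parity block is the XOR of the data blocks, so XOR-ing the surviving blocks with the parity yields the missing block. The helper `xor_blocks` and the sample data are hypothetical; real arrays do this at the block level inside a controller or the operating system.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data "disks", each holding one equal-sized block of a stripe.
data_disks = [b"DISK", b"FAIL", b"TEST"]

# The parity block is stored on a fourth disk.
parity = xor_blocks(data_disks)

# Simulate losing the second disk, then rebuild it from the survivors plus parity.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_disks[1])
```

The same arithmetic works no matter which single disk is lost, but it cannot recover from two simultaneous failures, which is one reason RAID complements backups rather than replacing them.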

Another proactive approach for businesses is to enlist Fault Tolerance. This gives a system the capability to bounce back from an unexpected failure and continue operations as normal. Fault Tolerance is achieved through redundancy, which involves replicating components to compensate in the event of a hardware fault. Redundancy can be implemented in several ways, for example by using multiple hard drives, power supplies, or processors.

Different from redundancy, Failover transfers operations (automatically or manually) to a secondary system, which then becomes the primary system. The transition is seamless and happens behind the scenes, likely unnoticed by users. Proxy load balancers are a good example of failover: when the primary balancer is down, a standby balancer steps in and takes over the processes that the failed balancer previously performed.
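
The promotion step can be sketched in a few lines. This is an illustrative model only, assuming a simple active/standby pair and a caller-supplied health check; the class `FailoverPair`, the node names `lb-a` and `lb-b`, and the `is_healthy` callback are hypothetical and not taken from any particular product.

```python
class FailoverPair:
    """Active/standby pair: the standby is promoted when the active node fails."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def handle(self, request, is_healthy):
        if not is_healthy(self.primary):
            # Promote the standby; the failed node becomes the new standby.
            self.primary, self.secondary = self.secondary, self.primary
            if not is_healthy(self.primary):
                raise RuntimeError("both nodes are down")
        return f"{self.primary} handled {request}"


down = set()  # names of nodes currently considered failed


def is_healthy(node):
    return node not in down


pair = FailoverPair("lb-a", "lb-b")
before = pair.handle("req-1", is_healthy)

down.add("lb-a")  # simulate the primary balancer going down
after = pair.handle("req-2", is_healthy)

print(before)
print(after)
```

The caller issues both requests the same way; only the node that answers changes, which is what makes the transition appear seamless to users.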

A formal reciprocal agreement between partner organizations is another approach to DR. It outlines shared resources and the ways in which the organizations will collaborate in the event of a disaster. Processing time, hardware capability, and support amenities for the critical business functions and applications are defined. This provides a framework for how the organizations can work together to keep functioning and to recover during times of distress.

DR in the Foreground

Because a disaster can affect an entire organization, it is important to understand that DR isn’t just the responsibility of the company’s IT team. There are measures that can be taken across the board to ensure that staff can step in at times of concern.

A cross-trained workforce, for example, provides a strong protective measure within the scope of DR. Much like cross-training in fitness targets multiple muscle groups and adds variety and flexibility to a workout regime, cross-training within the realm of DR adds a layer of protection in coverage. It trains and enables additional employees to step in if something requires attention in the absence of the ‘regular’ personnel.

A crisis communication plan is essential for accessing employee contact information and providing a priority list of the chain of command. In conjunction, it is also imperative for companies to run quarterly or biannual alert drills. This will help employees become familiar with the procedures while providing information such as who is to be notified and what roles they play during a DR incident.

Plans must also be in place if criminal activity does occur. Upon arrival, the authorities must have an escort. During an evacuation, a headcount must be conducted to account for each employee. While the authorities conduct their search of the premises, security guards must be posted outside the building, as it is possible for a criminal to purposely set off the alarm as a distraction in order to access a back door entrance.

Safety is vital to DR as personnel are considered the highest valued organizational asset. Management needs to understand their roles and responsibilities, accounting for the team that they oversee, while making staff fully aware of their objective during a disaster recovery situation.

Secondary Storage

Secondary sites allow for data recovery and are often used as a protective measure in enforcing recovery after a disaster. There are three different types of sites available to businesses: hot, cold, and warm.

Hot sites are functional data centers, equipped with pre-built servers and other necessary components that provide a quick restart when disaster strikes. A hot site keeps its network communication links, servers, and workstations in constant working order, and the primary site’s servers continually send it replicated data so that workers can get back on track efficiently and effectively. This is the fastest way for an organization to get back online, as the workstations and servers are already preconfigured and loaded with the appropriate operating system and software applications. The servers used for hot sites are high-powered and expensive to maintain over time; however, they do allow for a fast and reliable restart. Hot sites are typically used by businesses that have a large customer base or offer services that require high levels of reliability and performance.

A cold site typically encompasses only the basic components and normally functions as a standby facility would. Cold sites offer an empty infrastructure that requires installation of the needed software and hardware before it can be used. Although inexpensive, cold sites require a generous amount of time to set up and become operational.

A warm site falls in the middle: it doesn’t require the same level of power as a hot site and doesn’t typically contain copies of a client’s data. However, it is preconfigured and ready for operation with appropriate tools installed. It does require transportation of backup media and restoration of critical data on the main site’s database servers. In essence, warm sites are partially operational data centers used as backup for data storage.

Types of Site Testing

Performing regular site testing allows organizations to identify areas for improvement in their disaster recovery plan. Determining performance, uncovering weaknesses, and testing effectiveness all help keep the plan up to date and functioning properly. There are four types of site testing: Structured Walkthrough, Parallel Testing, Full Interruption, and Simulated Testing.

Structured Walkthroughs (also known as tabletop exercises) involve planning of a disaster scenario and role-playing through the situation. Members of the DR team refer to the DR plan to review and analyze it for appropriateness and ability to respond to the type of disaster that was simulated. Structured walkthroughs involve a review of hardware, software, connectivity, and physical inspection. This could range from a routine rehearsal such as a fire evacuation, to something more in-depth such as total site devastation.
Parallel Testing involves site activation procedures and the relocation of personnel to an alternate recovery site. A parallel test is used whenever an organization wants to test disaster response capabilities without interrupting the primary site’s production processes. The original site keeps operating separately while the recovery systems run in parallel with it. The test is carried out by acquiring redundant resources and simulating realistic workloads to check whether each system is working as expected.
Full Interruption, on the other hand, is a simulation that mimics a complete data center failure. Requiring a full outage environment, it disrupts regular operations at the primary site and requires employees to fall back to the alternate site as a simulated DR exercise. During full interruption, the operations get shut down and transferred to the backup recovery site, along with the reverse process of bringing operations back online. Full interruption is a long and tedious measure which can result in lost revenue during the shutdown as well as great challenges in arranging and organizing the outage with the rest of the organization.
Simulated Testing is often a less intrusive way to assess an organization’s DR plan. It involves creation of a particular scenario along with development of the appropriate response as if it were an actual disaster interruption. However, the key difference is that the systems on the production network remain untouched during the simulation testing. The DR responsibilities and site activation procedures are still implemented on the segmented area of the network yet refrain from interacting with systems on the production side. Simulated Testing allows for employees to continue to conduct day-to-day business operations while the DR team assesses the infrastructure.

DR Due Care

Disaster recovery is a critical part of an organization’s disaster preparedness plan. More than just having an immediate directive, it involves identifying backup options and other measures to minimize the impact of a catastrophic event. It’s also about educating and preparing employees regarding the importance of DR and day-to-day vigilance.

A single cyberattack has the potential to cause big issues. Revenue loss is one of the main impacts of cybercrime, with businesses reporting a loss of 20% or more per event. Equipping your company and your staff with the knowledge and tools needed to add a layer of protection against criminals is good practice. Having a plan in place now will save time, money, and resources when the inevitable does occur.

Events can happen at a moment’s notice, and the proper engagement of a DR plan provides strategy, guidance, and restoration capabilities. In addition, regular review of the plan is vital in ensuring that data, systems, and personnel are well protected, and that the response to disaster is seamlessly orchestrated and effective.
