Disaster Recovery Strategy – The Backup Bible Review

The following post reviews The Backup Bible, a 204 page ebook written by Eric Siron and published by Altaro Software. I’ve decided to break this down into a short discussion around each section, 3 in total, to help with the flow of the text. The Backup Bible can be downloaded for free here.

Initial thoughts on downloading the ebook and flicking through; clearly a lot of research and work has gone into this based on content applicable to both the business outcomes and objectives as well as low level technical detail. A substantial number of example process documents are included that m§ean practical outputs can be applied by the reader straight away. The graphics, colour scheme, font, and text size all look like they’ll make the ebook easy to follow, so let’s jump in.

Introduction and part 1: creating a backup & disaster recovery strategy

Chapter 1 is short and to the point; providing a check list for the scope of disaster recovery. The check list serves as a nice starting point for organisations to expand into sub-categories based on their own goals, requirements, or industry specific regulations. An interesting observation made during the introduction is that human nature leans towards planning for best outcomes, rather than worst. As a result, organisations design systems with defences for known common failures (think disk, network adapter) but less often design with catastrophic failure in mind. Rather than designing systems with the expectation of failure principals, an assumption is made that they will operate as expected. We’ll discuss more points made in the ebook around shifting that mentality later.

A further key-takeaway from the introduction is that disaster recovery planning is not a one-time event, but an ongoing process. Plans and requirements will need to adapt as new services are added and older services change. This is why the ebook focuses heavily on the implementation of a disaster recovery strategy, staying agile, as oppose to a singular process or policy. The disaster recovery strategy will be driven by the overall business strategy.

Some great talking points are introduced in chapter 2 around popular responses as to why disaster recovery fails, or is not properly implemented. In most cases expect a combination of many, if not all, of these reasons. Building disaster recovery into the initial design of a greenfield environment can be relatively easy, in comparison with retrospectively fitting it into brownfield environments. Older systems and infrastructure were generally deployed in silos yet interlinked with dependencies. New services have been deployed over time, bolted onto existing infrastructure, with changing topologies. Disaster recovery plans need to be agile enough to cater for different systems, and keep up with both technical and business changes.

The chapter moves on to talk about common risks, and determining key stakeholders. Again, useful examples that can be adapted and increased based on your industry and organisational structure, such as protecting physical assets as well as digital. Gaining insights into business priorities and current risks from different people within the organisation boosts your chances of building successful plans and processes.

Chapter 3 starts out with some easy to understand explanations and graphics on Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO). It’s good to set out why these are important, since different systems will have different RTO and RPO values. Defining and setting these values helps with the trade-off between budget and availability.

The Backup Bible – Recovery Time Objective

Another tip here that really resonates with me is to translate business impact into monitory value. The suggested “How long can we operate without this system?” suffices in most cases, but to create urgency the ebook recommends asking “How much does this system cost us per hour when offline?”. Hearing that the business is losing £x an hour hits harder than saying part of the website is down. Establishing RPO’s for different events is also something often overlooked, for example recovery from a hardware failure will most likely differ from a site outage or ransomware attack.

The first 3 chapters laid the groundwork for capturing business requirements. Chapter 4 helps break business items down and map them to common technical solutions, applied at both the software and hardware level. Solid explanations of fault tolerance and high availability are followed up with in-depth descriptions of applying these terms to storage, memory, and other systems through clustering, replication, and load-balancing. Although fault tolerance and high availability contribute to IT operations, they absolutely do not replace backups. This point is positioned alongside an extensive list of how to assess potential backup solutions against your individual requirements when evaluating software.

The Backup Bible – Hard Drive Fault Tolerance

Part 2: backup best practices in action

The second part of The Backup Bible begins by setting out some example architectures for backup software placement. Security principals are also now being worked into the disaster recovery strategy, alongside backup software and storage, whilst those not currently in a backup or system administrator role will learn about the differences between crash-consistent, and application-consistent backups. Chapter 5 again allows for practical application, in the form of a 2-phase backup software selection and testing plan. The example tracking table provided can be used for feature comparisons, and kept for audit or reporting purposes.

Options for backup storage targets follow in chapter 6; another section of comprehensive explanations detailing the advantages and disadvantages of different storage types. The information allows consumers to apply each scenario to their own requirements and make informed decisions when implementing their disaster recovery strategy. Most examples are naturally data centre storage types, but perhaps where cloud storage targets differ is that they should be factored into an organisationals overall cloud strategy too. Factors like egress charges, connectivity, and data offshoring governance or regulations for cloud targets will likely need building into the wider cloud computing picture.

Chapter 7 moves on to a really important topic, and I’m glad this section looks pretty weighty. In many industries, securing and protecting backup data can be just as important as the availability of the data itself. The opening pages walk through securing backups using examples like encryption, account management, secret/key vaults, firewalls, network segmentation and air-gapping. At the same time keeping to the functionality within your existing backup, storage, network and security tools, so as not to over-engineer a solution that will be difficult to support in the future. The chapter moves on to talk about layering security practices, and how these can be weighed up using risk assessments, and then built into policy if desired. Very well made points that are realistic and applicable to real world scenarios.

An example backup software deployment checklist is covered in chapter 8, before chapter 9 expands more on documentation. This is another step often missed in the rush to get solutions out of the door. In disaster recovery, documentation is especially important so that on-call staff and those not familiar with the setup can carry out any required actions quickly and correctly.

The next stage in the process is to implement your organisations data protection needs in the form of backup schedule(s) – usually multiple. Various examples of backup schedules are used in chapter 10; placing the most value on full backups without dependencies on other data sources, and filtering through incremental, differential, or delta backups for convenience. A recommendation is made to time backups with business events, such as month end processes, when high value changes take place, or maintenance such as monthly operating system updates. Once again a real focus is placed firstly on building a technical strategy around business requirements, and secondly on communication and documentation to ensure that strategy is as effective as possible.

The final 2 chapters of this section describe maintaining the backup solution beyond its initial implementation. This includes monitoring and testing backups for data integrity, and keeping systems up to date and in good health. Security patches and upgrades will be in the form of scheduled or automatic regular updates, and less frequent one-off events like drivers, firmware, or hot patches for exposed vulnerabilities or fixes. Often newly implemented systems look great on day 1, but come back on day x in 6 or 12 months and things can look different. It’s refreshing to see the ebook call this out and plan for the ongoing lifecycle of a solution as well as the initial design and implementation.

The Backup Bible – Securing Backup Data

Part 3: disaster recovery & business continuity blueprint

In part 3 the ebook dives into business continuity, focusing on areas beyond technology; like making sure that office or physical space, people, and business dependencies are all taken care of to start recovery operations as quickly as possible. A theme that has remained consistent is asking questions and woking with multiple subject-matter experts, in turn exposing more questions and considerations. Example configurations of hot, warm, and cold secondary sites are used, alongside the required actions to prepare each site should business continuity ever be invoked. These topics really get the reader thinking of the bigger picture, and how some of the technical planning earlier feeds into business priorities.

Chapter 15 moves on to replication with a thorough description on replication types, and when they are or are not applicable. Replication enables rapid failover to another site, and does not replace but rather complements backups where budgets allow. Clear direction is given here to ensure replication practices are understood, along with which tasks can be automated but which need to remain manual. The ebook points out that if configured incorrectly replication can also add overhead, and put data at risk, so it’s good to see both sides of the coin. A helpful tip in this chapter is to use external monitoring where possible, rather than rely on built-in email alerting, which could fail along with the system it is supposed to be monitoring!

The use of cloud services is probably beyond the scope of this ebook, that said, because cloud storage is becoming an option for backup targets it is mentioned and implied currently as more of a comfort blanket. Chapter 16 adds some further considerations, like not making an assumption that backups or disaster recovery is included in a service just because it is hosted in the cloud. Cloud services need to be architected for availability in the same way they do on-premises. Some Software-as-a-Service (SaaS) offerings may include this functionality but it will be explicitly stated. The scope of using cloud for disaster recovery obviously varies between organisations already heavily invested in the cloud, or multiple cloud providers, and those not using it at all. Despite there being more talking points I think it’s right to keep the conversation on topic to avoid going down a rabbit hole.

We’re now at a stage where disaster recovery has been invoked and chapter 17 runs through what some of the business processes may look like. There’s a lot of compassion shown in this chapter and whilst flexibility and considerations for people is a given, going a step further and writing it into the business continuity documentation is something I must admit I hadn’t previously thought about. Having plans laid out ready as to how employees will be contacted in the event of a disaster will also save time and reduce confusion. There’s some good information here on implementing home working too, and this is particularly relevant as we still navigate through COVID-19 restrictions. Certain areas of industries like manufacturing and healthcare may need that secondary site mentioned earlier on, but a lot of jobs in the Internet-era can be carried out from home or distributed locations.

As the ebook begins to draw to a close, the latter 2 chapters produce guidance on how the disaster recovery strategy can be tested and maintained. This is crucial for many of the reasons we’ve read through, and we learn that automating testing in a ‘set and forget’ way is usually not sufficient. Some processes need to be manually checked and validated on a schedule to protect against unexpected behaviour. Chapter 19 calls once again on working together with teams outside of IT, and this is more good advice since rather than IT setting baseline values, for example on backup frequency and retention, it encourages line of business to take responsibility for their own data and applications.

The Backup Bible – Systems Fail

Summary

Overall The Backup Bible has been an enlightening read; containing some really useful guidance and personal narratives about situations the author has experienced. A substantial number of templates for documentation and process checklists are included at the end of the ebook, ranging from backup and restore documentation to data protection questionnaires to testing plans. The ebook does enough of the explanation leg work, and not assuming knowledge, to make sure that a reader from any area of the business can take something away from it.

There are of course references to Altaro software, but this in no way is a glorified advertisement. The points discussed are presented in a neutral manner with comprehensive detail for organisations to make informed decisions about technologies and processes most suited to their own business. Rather than publishing a list of ‘must-haves’ for disaster recovery, the ebook acknowledges that business have different requirements and provides the tools for the reader to go about implementing their own version based on what they have learnt through the course of the ebook.

From a small business with very little protection against a disaster, to an enterprise organisation with processes already in place, anybody interested in disaster recovery will be able to gain something from reading this ebook. The Backup Bible can be downloaded for free here.

Setting Manual DFW Override for NSX Restore

The recommended restore process for NSX Manager is to deploy a new OVA of the same version, and restore the configuration. After a recent failed upgrade we needed to restore NSX Manager, so deployed a new OVA with the same network settings. After the new NSX Manager was powered on we were unable to ping the IP address, this was because there were no default rules allowing access to the VM, and since the existing NSX Manager was down we were unable to connect to the UI or API to add the required firewall rules. NSX Manager is normally excluded from Distributed Firewall (DFW) by default, however at this point the hosts saw it as any other VM, since we had not yet restored the configuration. Therefore we needed to add a manual override to clear the filters applied to the new NSX Manager, allowing us to connect and restore the configuration. The following commands were run on the host where the new NSX Manager OVA was deployed, using SSH. For further guidance on the backup and restore process of NSX see the NSX Backup and Restore post.

Disclaimer: the steps below are advanced commands using vsipfwcli which is an extremely powerful tool. You should engage VMware GSS if doing this on anything other than a lab environment, you should also understand the impact of stopping the vsfwd service on the host and the impact this may have on any other VMs with a DFW policy of fail closed.

net-stats -l lists the NIC details of the VMs running on the host, verify the new NSX Manager is present.

/etc/init.d/vShield-Stateful-Firewall stop stops the vsfwd user world agent, you can also use status to display the status.

vsfwd

summarize-dvfilter lists port and filter details, we need the port name for the VM, e.g. nic-38549-eth0-vmware-sfw.2.

DFW_1

vsipioctl getrules -f nic-38549-eth0-vmware-sfw.2 lists the existing filters applied to the port, replace the port name with your own, from the output check to confirm the ruleset name, e.g. ruleset domain-c17.

DFW_2

vsipioctl vsipfwcli -f nic-38549-eth0-vmware-sfw.2 -c "create ruleset domain-c17;" creates a new empty ruleset with the same name, overriding the previous ruleset applied to the port. Replace the port name with your own and the ruleset name if it is different.

vsipioctl getrules -f nic-38549-eth0-vmware-sfw.2 again lists the existing filters applied to the port, the ruleset should now be empty as no filters are applied.

DFW_3

The NSX Manager is now pinging and the normal restore process can resume; connect to the web interface by browsing to the IP address or FQDN of the NSX Manager.

Restore_NSX_1

Select Backup & Restore.

Restore_NSX_2

Select the appropriate restore point and click Restore. Click Yes to confirm.

Restore_NSX_3

The restore generally takes 5-10 minutes, once complete you will see a restore completed successfully message in a blue banner on the Summary page. You may need to log out and log back in after the config is restored.

Restore_NSX_4

Once the NSX Manager services have started you can manage the DFW from the vSphere web client as normal. Remember to start the vsfwd service again on the host, after the vsfwd service is started the empty ruleset we created earlier is replaced with the original ruleset when the host syncs with NSX Manager.

/etc/init.d/vShield-Stateful-Firewall start starts the vsfwd user world agent, you can also use status to display the status.

EMC Networker Restore

This post explains the process of restoring a physical machine that has been backed up prior by Networker. The restore process will be a bare metal restore and is specific to environments where two NICs are used for backup, for example a private backup VLAN and a public VLAN. The Networker Bare Metal Recovery Wizard can only configure one physical NIC, therefore we need to use the CLI to configure the second NIC.

Procedure

To restore a physical server using a bare metal restore we first need to boot from the EMC Networker BMR ISO. Mount the ISO through the iLO or load the physical media into the server.

Once you have booted from the ISO you will be presented with the Networker Bare Metal Recovery Wizard. This may take 5 – 10 minutes to load after the command prompt window.

Set the current date and time, click Next.

restore1

Select the network interface that you want to configure, for this you should select the NIC that is connected to the public VLAN (i.e. the server OS netork settings) and click Next.

restore2

In the host and network screen enter the server name and domain if applicable. Enter the servers network settings and click Next. If the network is not configured correctly the restore will fail.

Confirm the physical disk on the server that you want to restore to and click Next. Note that the disk will be formatted and any existing data on the disk will be overwritten.

restore3

Before selecting the restore server we need to take some additional steps. In the background you will see a command prompt, you will need to do the following:

Configure the backup VLAN IP address using:

  • netsh interface ip set address name=”Ethernet Adapter” static <IP> <Subnet Mask> replacing Ethernet Adapter with the name of the interface connected to the backup VLAN, and the IP and subnetmask accordingly.
  • Run ipconfig /all to confirm the network settings are correct for both interfaces.
  • Test network connectivity by pinging other IP addresses on the same subnet.
  • If the server does not have access to DNS then you can type notepad.exe in command prompt and then open Boot (X:) > Windows\System32\drivers\etc\hosts. Add the IP and host names of any servers you need to resolve and save the file.

Go back to the Networker Bare Metal Recovery Wizard. You should be on the Select Server page. Enter the name of the Networker server you are restoring data from and click Next.

If you entered incorrect network settings in the previous step then this lookup will fail. If the client is able to connect then you will be asked to select the backup you want to restore from.

restore4

Confirm the partitions to be restored on the next page and click Next.

Double check the information on the summary page and start the restore. Again note the warning that data on the disk(s) will be overwritten.

restore5

When the restore has completed the results page will appear. If the result was successful then click Reboot, the server will be restarted and boot to the recovered operating system.

restore6