KB50012 - NETAPP LIVE Failover process

LIVE Failover/Failback

 

EnsureDR Live Failover is a feature that lets you resume business processes from the recovery site when the primary site is down for any reason. Failback is the ability to restore all data to the primary site after the disaster is resolved.

 

Some companies have a policy of testing the failover/failback process to ensure that, in the event of a real disaster, the system is ready to operate from the disaster recovery location. In these cases, the EnsureDR platform enables you to complete the scheduled event and move all data to the disaster recovery site. Before you can run the scheduled activity, the prerequisites for the live failover process must be met.

Support

 

In case you have any technical issues or questions, please access the EnsureDR support portal to open a case or send mail to support@ensuredr.com.

 





 

 

Prerequisites

 

Before you can perform the Live Failover/Failback process, your environment must meet several prerequisites.

 

EnsureDR prerequisites

 

To perform Live Failover, you must create two servers for the EnsureDR platform at the disaster recovery site. This document does not cover the configuration of the EDRM and EDRC servers; it assumes this was done before you started the NetApp EnsureDR Failover/Failback process.

 

If you need more information about the prerequisites and setup for EDRM/EDRC, see the full online user guide available on our site.

 

To start the failback process on the primary site after the failure there is resolved, some prerequisites must be met. Create and configure EDRC/EDRM servers on the primary site. On the newly created EDRM server, create a new EnsureDR job that will be used for the failback process. Before you can create an EnsureDR job for the NetApp failback process, you need to set up reverse SnapMirror replication (from the disaster recovery site to the primary site). After the replication is established and synchronized, you can create the EnsureDR job for the failback process.

 

NetApp prerequisites

 

Before running the Live Failover process, you must create NetApp ONTAP export policies named "EDR" for the selected SVMs on the disaster recovery site. All VMware ESXi hosts specified in the EnsureDR job must be included in the "EDR" export policies.

 

Before running the Live Failback process to the primary site after the disaster is resolved, you must create NetApp ONTAP export policies named "EDR" for the selected SVMs on the primary site. All VMware ESXi hosts specified in the EnsureDR job must be included in the "EDR" export policies.
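
As an illustration of the export policy requirement, the following minimal Python sketch builds the ONTAP CLI commands a NetApp administrator might run to create the "EDR" policy and add one rule per ESXi host. The SVM name, the host addresses, and the rule options (sys authentication over NFS) are assumptions for the example and should be adjusted to your environment and security policy. The sketch only prints the commands; it does not connect to ONTAP.

```python
def edr_export_policy_commands(svm, esxi_hosts, policy="EDR"):
    """Return ONTAP CLI command strings: one to create the export policy,
    then one rule per ESXi host that must be able to mount the volumes."""
    if not esxi_hosts:
        raise ValueError("at least one ESXi host is required")
    commands = [f"vserver export-policy create -vserver {svm} -policyname {policy}"]
    for host in esxi_hosts:
        # Rule options below are assumptions; review them before use.
        commands.append(
            f"vserver export-policy rule create -vserver {svm} -policyname {policy} "
            f"-clientmatch {host} -rorule sys -rwrule sys -superuser sys -protocol nfs"
        )
    return commands


if __name__ == "__main__":
    # "svm_dr" and the IP addresses are hypothetical placeholders.
    for line in edr_export_policy_commands("svm_dr", ["10.0.0.11", "10.0.0.12"]):
        print(line)
```

Run the same commands against the primary-site SVM before a failback, using the ESXi hosts specified in the failback job.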

 

You must also start the reverse NetApp SnapMirror replication from the disaster recovery site to the primary site before you can create the EnsureDR job for the failback process.

 

Now that you are familiar with all the prerequisites for the failover/failback process, we can look at the process in detail. If you have any further questions, please contact our support team at support@ensuredr.com before proceeding to the next step.

 

Failover process

 

The EnsureDR platform supports NetApp SnapMirror technology for the Live Failover/Failback process, whether the primary site is down or the customer wants to perform a failover/failback to verify that the environment is ready for a disaster recovery scenario.

 

Primary site is down

 

There's been a real disaster

 

In the event of a disaster, the primary site will be down and unreachable to end users, so you will use the EnsureDR platform to restore all selected servers (selected in the EDRM job) to the disaster recovery location. Because the primary site is down and inaccessible, no prerequisites need to be performed on the primary site before executing EnsureDR Live Failover on the disaster recovery site.

 

Prerequisites

    • On the disaster recovery site, you already have EDRM and EDRC servers configured
    • You already have an EDRM job on the disaster recovery site
      • This is a mandatory step because if a crash hits your primary site, you won't be able to create an EDRM job
      • Best practice would be to test the EnsureDR job at least once in test mode in an isolated/bubble network, but it’s not a mandatory step
    • An export policy named "EDR" is defined in NetApp at the disaster recovery site to allow mapping of the volume(s) to all ESXi hosts specified in the EDRM job
    • All VMs from the same NetApp volume must be selected in the EDRM job. A VM that is not selected in the job could be overwritten during the failback process, resulting in loss of data.
    • Only the servers defined in the EnsureDR job will be automatically restored. Servers that are not selected in the EnsureDR job will be available/replicated on the NetApp volume on the disaster recovery site, but the VMware administrator must manually register these virtual machines. To avoid this manual step, you must add all servers from the same NetApp volume to the EnsureDR job. Failure to include all virtual machines in an EDRM job may result in data loss during the failback process.
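
The requirement that every VM on a protected volume be part of the job can be checked before a failover. The following is a minimal Python sketch of such a check, assuming you have exported the VM-to-volume mapping and the job's VM list yourself; the dictionary layout and VM names are hypothetical and are not an EnsureDR API.

```python
def unselected_vms(vms_by_volume, job_vms):
    """Return {volume: [VMs missing from the job]} for every volume
    that has at least one VM selected in the job."""
    selected = set(job_vms)
    missing = {}
    for volume, vms in vms_by_volume.items():
        on_volume = set(vms)
        # Only volumes touched by the job are at risk during failback.
        if on_volume & selected:
            left_out = sorted(on_volume - selected)
            if left_out:
                missing[volume] = left_out
    return missing


if __name__ == "__main__":
    # Hypothetical inventory: volume name -> VMs stored on it.
    inventory = {"prod_nfs1": ["app01", "db01"], "prod_nfs2": ["web01"]}
    job = ["app01", "web01"]
    print(unselected_vms(inventory, job))  # {'prod_nfs1': ['db01']}
```

An empty result means every VM on the affected volumes is covered by the job; any listed VM should be added to the job before running the failover.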

When an EDRM Live Failover job is started on the disaster recovery site, the following workflow is executed:

    • Break the NetApp SnapMirror replication on the disaster recovery site for all volume(s) selected in the EDRM job
    • Create new Junction Path(s) in NetApp for all volume(s) on the disaster recovery site defined in the EDRM job
    • Map/register all newly created Junction Paths (volumes) in the VMware environment at the disaster recovery site
    • Register the virtual machines that are selected in the job
      • Registered virtual machines will be connected to the flat network that is defined in the second step (networking setup) of the EDRM job
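
To make the storage side of this workflow more concrete, the sketch below lists the ONTAP CLI commands that roughly correspond to the first two steps if they were performed by hand. This is an illustration only and not a description of what EnsureDR runs internally. The SVM and volume names are hypothetical, and using "/<volume name>" as the junction path is an assumption.

```python
def failover_storage_plan(svm, volumes):
    """Return ONTAP CLI command strings in workflow order:
    break every SnapMirror relationship first, then mount each volume."""
    plan = []
    for vol in volumes:
        # Step 1: make the DR copy writable by breaking replication.
        plan.append(f"snapmirror break -destination-path {svm}:{vol}")
    for vol in volumes:
        # Step 2: give the volume a junction path so ESXi hosts can mount it over NFS.
        plan.append(f"volume mount -vserver {svm} -volume {vol} -junction-path /{vol}")
    return plan


if __name__ == "__main__":
    # "svm_dr" and the volume names are hypothetical placeholders.
    for command in failover_storage_plan("svm_dr", ["prod_nfs1_dr", "prod_nfs2_dr"]):
        print(command)
```

The remaining steps (mounting the datastore and registering the virtual machines) happen in the VMware environment and are not covered by this sketch.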

 

Simulation of a real disaster

 

You can use the EnsureDR platform with NetApp SnapMirror technology to simulate a primary site outage and recover all servers at the disaster recovery site. The following prerequisites must be met before you can continue with the recovery process at the disaster recovery site:

Prerequisites

    • All virtual machines that will be recovered on the disaster recovery site must be powered off at the primary site
    • When all virtual machines are powered off, the NetApp administrator needs to verify that replication is in sync and that there is no lag between the selected volumes on the primary and disaster recovery sites
    • The network administrator needs to block access to the primary site from the disaster recovery site in order to simulate a real disaster
    • On the disaster recovery site, you already have EDRM and EDRC servers configured
    • You already have an EDRM job on the disaster recovery site
      • This is a mandatory step because if a crash hits your primary site, you won't be able to create an EDRM job
      • Best practice would be to test the EnsureDR job at least once in test mode in an isolated/bubble network, but it’s not a mandatory step
    • An export policy named "EDR" must be defined in NetApp at the disaster recovery site to allow mapping of the volume(s) to all ESXi hosts specified in the EDRM job
    • All VMs from the same NetApp volume must be selected in the EDRM job. A VM that is not selected in the job could be overwritten during the failback process, resulting in loss of data.
    • Only the servers defined in the EnsureDR job will be automatically restored. Servers that are not selected in the EnsureDR job will be available/replicated on the NetApp volume on the disaster recovery site, but the VMware administrator must manually register these virtual machines. To avoid this manual step, you must add all servers from the same NetApp volume to the EnsureDR job. Failure to include all virtual machines in an EDRM job may result in data loss during the failback process.
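
One way to interpret "in sync with no lag" for a simulation is that the last completed SnapMirror transfer of each volume finished after the virtual machines were powered off. The following minimal Python sketch expresses that check. The record fields ("volume", "state", "last_transfer_end") are hypothetical and would need to be filled in from your own NetApp monitoring or reporting; they are not an ONTAP or EnsureDR API.

```python
from datetime import datetime


def stale_relationships(relationships, shutdown_time):
    """Return the volumes whose replication is not healthy or whose last
    completed transfer ended before the VMs were powered off."""
    stale = []
    for rel in relationships:
        healthy = rel["state"] == "snapmirrored"
        up_to_date = rel["last_transfer_end"] >= shutdown_time
        if not (healthy and up_to_date):
            stale.append(rel["volume"])
    return stale


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    shutdown = datetime(2024, 1, 1, 12, 0)
    rels = [
        {"volume": "prod_nfs1", "state": "snapmirrored",
         "last_transfer_end": datetime(2024, 1, 1, 12, 5)},
        {"volume": "prod_nfs2", "state": "snapmirrored",
         "last_transfer_end": datetime(2024, 1, 1, 11, 50)},
    ]
    print(stale_relationships(rels, shutdown))  # ['prod_nfs2']
```

Any volume in the result should be updated and rechecked before the network link to the primary site is blocked.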

When an EDRM Live Failover job is started on the disaster recovery site, the following workflow is executed:

    • Break the NetApp SnapMirror replication on the disaster recovery site for all volume(s) selected in the EDRM job
    • Create new Junction Path(s) in NetApp for all volume(s) on the disaster recovery site defined in the EDRM job
    • Map/register all newly created Junction Paths (volumes) in the VMware environment at the disaster recovery site
    • Register the virtual machines that are selected in the job
      • Registered virtual machines will be connected to the flat network that is defined in the second step (networking setup) of the EDRM job

 

Primary site is recovered

 

Whether the primary site is restored after a real disaster or after a temporary disconnection during a disaster simulation, we must take the following steps to avoid problems in the VMware environment at the disaster recovery site before establishing a network connection to the restored primary site.

      • Validate that all servers have been recovered successfully at the disaster recovery site before continuing to the next steps
      • If all servers at the disaster recovery site are verified and running correctly, go to the primary site and remove from the VMware inventory (do not delete the VMs) all virtual machines that were powered off at the beginning of the failover process, to avoid duplicate servers in the infrastructure.
      • To avoid loss of data, verify that the NetApp volume at the primary site holds no important data that was not replicated to the disaster recovery site at the beginning of the Live Failover process (for example, a server that was not selected in the EDRM job). After all virtual machines from the previous step are removed from the VMware inventory on the primary site, unmount the NetApp volume(s) from the VMware environment at the primary site.
      • For all volume(s) unmounted in the previous step, remove the NetApp Junction Path(s) at the primary site
      • Network administrators can now re-establish communication between the primary and disaster recovery sites
      • When network connectivity is established, the NetApp administrator can set up reverse replication from the disaster recovery site to the primary site. Be aware that during reverse replication the NetApp volume at the primary site will be overwritten with data from the disaster recovery site.
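
Because reverse replication overwrites the primary volume, the order of these steps matters. The following small Python sketch models the list above as an ordered checklist that always reports the earliest unfinished step, so that reverse replication is only reached after the cleanup is done. The step names are paraphrased from this document and the helper itself is purely illustrative.

```python
RECOVERY_STEPS = [
    "validate servers at the disaster recovery site",
    "remove powered-off VMs from the primary VMware inventory",
    "unmount NetApp volumes from the primary VMware environment",
    "remove Junction Paths on the primary NetApp",
    "re-establish network connectivity between the sites",
    "start reverse replication (disaster recovery -> primary)",
]


def next_step(completed):
    """Return the first step that has not been completed, or None when all are done."""
    for step in RECOVERY_STEPS:
        if step not in completed:
            return step
    return None


if __name__ == "__main__":
    done = set(RECOVERY_STEPS[:3])
    print(next_step(done))  # remove Junction Paths on the primary NetApp
```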

Now that the primary site is back online and the network connection is established, we can fail back from the disaster recovery site to the primary site. During this process, all servers selected in the EDRM job will be restored to the primary site.

 

The Failback process

 

The failback process will be performed to migrate all selected virtual machines in the EnsureDR job from the disaster recovery site back to the primary site after the primary site is restored.

Prerequisites

    • All virtual machines that will be recovered on the primary site must be powered off at the disaster recovery site
    • The network administrator needs to establish a link between the disaster recovery site and the primary site, and vice versa.
    • When all virtual machines are offline in disaster recovery site, the NetApp administrator needs to establish a reverse synchronization from the disaster recovery site to the primary site. Verify that replication is in sync and that there is no lag between the selected volumes on the disaster recovery site and primary site. Once the sync is complete, you can move on to the next step.
    • On the primary site, create and configure new EDRM and EDRC servers
    • Create a job inside EDRM on primary site
      • Best practice would be to test the EnsureDR job at least once in test mode in an isolated/bubble network, but it’s not a mandatory step
    • Create an export policy named "EDR" on the primary NetApp site that allows mapping of the volume(s) to all ESXi hosts specified in the EDRM job
    • All VMs from the same NetApp volume must be selected in the EDRM job to avoid a situation where a virtual machine is not automatically recovered by EDRM
      • Only the servers defined in the EnsureDR job will be automatically restored. Servers that are not selected in the EnsureDR job will be available/replicated on the NetApp volume at the primary site, but the VMware administrator must register them manually in the VMware environment. To avoid this manual step, add all servers from the same NetApp volume to the EnsureDR job before the failback process.
    • Execute the job from EDRM server in primary site

When an EDRM failback process is started, the following workflow is executed:

    • Break the NetApp SnapMirror replication on the primary site for all volume(s) selected in the EDRM job
    • Create new Junction Path(s) in NetApp for all volume(s) on the primary site defined in the EDRM job
    • Map/register all newly created Junction Paths (volumes) in the VMware environment at the primary site
    • Register the virtual machines that are selected in the job
      • Registered virtual machines will be connected to the flat network that is defined in the second step (networking setup) of the EDRM job

 

Appendix - Example of Live Failover/Failback process

 

In this example, we will show you a step-by-step process so you can have a better understanding of the Failover/Failback process. For the demonstration, we will use VMware vSphere on-premises with NetApp ONTAP.

 

VMware environment pre-checks

 

In our test lab on the primary site, we have several virtual machines that reside in two NFS datastores:

  • prod_nfs1
  • prod_nfs2

On the disaster recovery site, we have set up the EDRM and EDRC servers that will be used in this demo.

 

The NetApp ONTAP environment pre-checks

 

The two volumes from the primary site that are visible in the VMware environment are defined in NetApp ONTAP.


For each of these two volumes, we set up replication using NetApp SnapMirror and replicate these two volumes to the disaster recovery site.

The minimum conditions on the primary site are met. On the disaster recovery site, we need to create an export policy named "EDR" and list all the VMware ESXi hosts that are defined in the EDRM job.

In the "Protection" menu select "Volume Relationships" where you can validate that ongoing replication from primary site to disaster recovery site is in a good state.

Now all the prerequisites in the environment setup are verified and allow us to move on to the LIVE Failover step.

 

Starting LIVE Failover process

 

Our goal is to realistically simulate a disaster event at the primary site that results in the primary site being unavailable and unreachable from the disaster recovery site.

 

Because of the simulation, we need to take measures on the primary site to avoid data loss during the Failback process. Before starting the EDRM job with the LIVE Failover option we need to do the following:

  • Shut down all virtual machines that reside on the NetApp volume defined in the EDRM job. All of these virtual machines must be selected in the EDRM job. Partial selection of virtual machines could lead to data loss in the Failback process
  • In NetApp manager, check that the volume is synchronized after the last virtual machine is shut down and that there is no lag in synchronization
  • If the previous two steps are successful, the network administrator can block access to the primary site from the disaster recovery site to simulate a real disaster

 

Shutting down virtual machines

 

Validate that all virtual machines that reside on NetApp volumes are in state “Powered Off”.

Now that all the VMs are offline, we can connect to NetApp ONTAP Manager to manually initiate replication of the selected volumes to ensure that synchronization takes place after all the VMs are offline.

 

NetApp ONTAP synchronization process

 

Select NetApp ONTAP on the disaster recovery site, click the "Protection" menu, and select the "Volume Relationships" option. All replicated volumes are now listed in NetApp ONTAP Manager. Right-click the desired volume and select the "Update" option, which initiates synchronization between the primary and disaster recovery sites.

The synchronization process starts.


When the synchronization is finished, we can be sure that all powered-off virtual machines have been replicated to the disaster recovery site with their latest changes.

Now we can disconnect the primary site at the network level, so that the primary site is inaccessible from the disaster recovery site. For the purposes of this tutorial, we will simulate a loss of connectivity by disconnecting NetApp from the local network. As you can see on the primary site, VMware vCenter reports that the NetApp volumes are unavailable.

The same applies to the virtual machines that were hosted on the NetApp volumes that are now unavailable.

As shown, there is no connection between VMware vCenter and the NetApp volumes on the primary site, nor between the NetApp systems on the primary and disaster recovery sites.

All conditions are met, so we can continue and run the LIVE Failover process on the disaster recovery site.

 

LIVE Failover

 

Log in to the EDRM server, select the desired job, right-click and select the Run LIVE Failover option.

 

A warning message pops up; you must click Yes to continue with LIVE Failover. Before clicking Yes, you can click the link in the message to read more information about the LIVE Failover process online.

After you confirm the start of the LIVE Failover process, the task will begin to run. Please wait until the job is finished. The result of LIVE Failover process will be:

  • A new Junction Path is created in NetApp on the disaster recovery site
  • The newly created Junction Path is automatically connected to VMware vCenter on the disaster recovery site
  • All virtual machines selected in the EDRM job are registered and powered on at the disaster recovery site

After all virtual machines are up and running, the LIVE failover process is completed. You must verify that all running servers are properly started and operating according to your internal procedures.

 

Post Failover process steps

 

Now that all virtual machines have been restored in the disaster recovery site, we will take steps to fix the primary site and set up reverse volume replication.

Connect to VMware vCenter on the primary site. In vCenter, select the NetApp volume configured for the LIVE Failover process and click the tab with virtual machines. Select each virtual machine and remove it from the inventory.

When all virtual machines are removed, select the volume configured in the EDRM job and unmount it.

Log in to NetApp on the primary site and select Junction Path under Storage. Then select the Junction Path that was configured in VMware vCenter and that we unmounted from VMware. Click the Junction Path and select the Unmount option. A new pop-up window appears where you need to confirm that you want to unmount the Junction Path in the NetApp Manager.

When you have removed all Junction Path definitions from the NetApp manager at the primary site, you are ready to continue with the Failback process.

 

Failback process

 

Start the Failback process only if all the previous steps have been completed successfully. If you don't follow the instructions, you may overwrite data on the primary site. Before proceeding, please double-check that:

  • all virtual machines are recovered on the disaster recovery site
  • the VMware inventory cleanup is done on the primary site

If all measures from the previous steps have been taken, we can proceed with the reverse synchronization process in NetApp. To do that, log in to the NetApp manager at the disaster recovery site. Select Volume Relationships to list all volumes that are in replication. As mentioned earlier in this document, the LIVE Failover process breaks the replication between the primary and secondary NetApp.

Select the NetApp volume with the status “Broken Off” and right-click on the selected volume to establish reverse synchronization, from disaster recovery to the primary site.


A pop-up window will appear. Confirm the reverse sync to begin replication from the disaster recovery site back to the primary site.

Now connect to NetApp on the primary site to confirm that reverse synchronization is configured and that there are no errors in replication.

Depending on the volume and size of the data, it may take time for the disaster recovery site to synchronize with the primary site. Once synchronization is finished successfully you are ready to start the Failback process.

Because the Failback process essentially means running the EDRM job in the opposite direction and taking the same actions as described for the LIVE Failover process, we will not repeat those steps in this tutorial.

 

If you have any questions, please contact us at support@ensuredr.com.