Complete Data Protection for the Private Cloud: Recovering from Disaster with SnapMirror and Element OS

Working with SnapMirror

The previous blog post in this series described in detail how to configure the NetApp® SolidFire® Element® OS storage platform to use NetApp SnapMirror® to replicate to NetApp ONTAP. Now that the NetApp HCI version that supports this replication has been released, I thought it would be a good time to focus on the most important part of the technology: RECOVERY!

I assume that you have read the first blog in the series, but if you haven’t, you can catch up here.

Disaster recovery exercise

To use the knowledge in the previous blog about how to configure Element OS for SnapMirror, consider the following hypothetical situation.

Suppose that I just realized that someone has deleted an important VMware datastore that housed our payroll information. I sent an urgent request to the IT help desk asking them to restore the datastore. They responded that no backup of this volume existed. Then I remembered that I had protected this volume myself with SnapMirror during a down week just after Christmas! I now have the following scenario to work with.

First, I checked VMware to find out which virtual machines needed to be recovered. The results showed that a few machines were missing from the application:

Looking back at the paired volumes in the SolidFire GUI (see the previous blog post), I see that this volume is missing on the SolidFire cluster.

Now, if I still had the volume and it simply contained corrupt data that needed to be restored, it would be an easy fix. I would simply select the option to “break” the relationship and then resync the target to the source. That is not the case here, because I am describing a worst-case scenario for a volume. Well, I like a challenge. First, I will break the relationship.

Heck yeah, I’m sure. I go ahead and break it. Notice that just seconds later the relationship state reports “broken-off.” I am now officially failed over to the ONTAP system. I can now mount the volume and get kudos for low RTO statistics. Or I can immediately bring the volume back from the dead, as I am doing right now.

It’s also an option to run the volume in production from the FAS until the next maintenance cycle. Also, in a real disaster in which the Element OS system was actually destroyed, some additional steps would be necessary: support would need to facilitate a full-system recovery onto an entirely new cluster. The exercise here assumes destruction at the volume level only.

Okay, before performing any sort of resync, I need to pay close attention to the existing policy and schedule. In this case, the policy type is mirror and vault, and the policy schedule is 5 minutes.
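Because the new relationship has to be edited later to match the original, it helps to write these two settings down before resyncing. The sketch below is only an illustration in Python: the `RelationshipSettings` record, its field names, and both helper functions are hypothetical and are not part of any NetApp API. It also encodes a rough rule of thumb, assuming asynchronous scheduled replication, that the newest data on the target can be about one schedule interval plus one transfer time old.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationshipSettings:
    """Settings noted from the GUI before a resync (hypothetical record)."""
    policy_type: str
    schedule_minutes: int


def worst_case_rpo_minutes(settings, transfer_minutes):
    """Rough upper bound on data age at the target: one full schedule
    interval plus the time the last transfer took to complete."""
    return settings.schedule_minutes + transfer_minutes


def settings_drift(original, current):
    """Return the names of the fields that differ between two records."""
    return [
        name
        for name in ("policy_type", "schedule_minutes")
        if getattr(original, name) != getattr(current, name)
    ]


# Values taken from the walkthrough: mirror and vault, every 5 minutes.
original = RelationshipSettings(policy_type="mirror and vault", schedule_minutes=5)
```

With an assumed transfer time of 2 minutes, `worst_case_rpo_minutes(original, 2)` gives 7. After the recovery, `settings_drift` can be used to confirm that the rebuilt relationship carries the same policy type and schedule as the one recorded here.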

Now I’m ready to perform a resync operation, which will allow me to establish the FAS cluster as the source and the Element OS cluster as the target. This resync effectively reverses the previous relationship of the volumes. It also allows me to run production workloads on the FAS until I have a maintenance window where I can fail back to the original configuration. (More about that later.) Using the volume options in the SnapMirror relationships portion of the GUI, I now reverse sync. Again, I’m very sure that I want this, and I click the blue button!

Notice that I didn’t delete the old relationship; I created a completely new one. Immediately on execution, this new relationship began transferring data, so there is now another volume with an inverted relationship that is happily receiving it. If I had decided to mount the target volume on the FAS and continue, then that volume would now be replicating to Element OS!

The transfer is complete after a few minutes, and the relationship is healthy. Notice the information link that gives a hint about the next step to return to full production.

The system also automatically created a new volume, with its own QoS policy and account ID. The main purpose of this new volume is to allow a restore in a controlled workspace that doesn’t have the potential to impact other workloads.

Next, I perform the following steps on the SnapMirror volume relationship to promote this new volume to be the primary production volume.

On the volume relationship that was created during the reverse sync operation, click Break.

On the volume that was just broken, use the Action button and click Reverse Resync.

Edit the new volume relationship to match the QoS policy and schedule settings of the original.

It is now possible to safely delete the original volume relationship that was reported as not found.

If I followed the steps correctly, I should have something that looks like this:

Remember, I deleted the original relationship, so I’m now left with only the new relationship and the earlier one that was sourced from the FAS. I can now safely delete that broken-off relationship, which I broke in the previous steps, as well.
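The sequence of breaks and reverse resyncs is easy to lose track of, so here is a toy Python model of it. This is purely an illustration of the bookkeeping, not the Element OS or ONTAP API: the `Relationship` class, the helper functions, and the volume names are all made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Relationship:
    source: str
    destination: str
    state: str = "snapmirrored"


def break_relationship(rel):
    """Stop replication so that the destination can be used directly."""
    rel.state = "broken-off"


def reverse_resync(rel, new_destination=None):
    """Return a NEW relationship with the endpoints swapped. The old one
    is left in place, as in the walkthrough. If the old source volume is
    gone, the caller names the newly created volume as the destination."""
    return Relationship(source=rel.destination,
                        destination=new_destination or rel.source)


# Original relationship: its Element OS source volume has been deleted.
original = Relationship("element:payroll", "ontap:payroll_dr", state="not found")
relationships = [original]

# Fail over to ONTAP, then reverse sync into a new recovery volume.
break_relationship(original)
restore = reverse_resync(original, new_destination="element:payroll_restored")
relationships.append(restore)

# Promote the recovery volume: break, then reverse resync once more.
break_relationship(restore)
production = reverse_resync(restore)
relationships.append(production)

# Clean up the two relationships that are no longer needed.
relationships.remove(original)
relationships.remove(restore)
```

At the end of the model, `relationships` holds a single healthy relationship that replicates from the restored Element OS volume to ONTAP, which mirrors the end state described above.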

There is one more important thing to do before mounting the volumes to the VMware host.

Remember, the volume is in a recovery workspace, and I need to move that volume back to the appropriate account and QoS volume policy. I already changed the SnapMirror policy, but I need to do the same at the volume level. To do this, I use the Volumes submenu of the Management menu. (Everyone who uses Element OS knows how to do this, but for some users moving volumes around to different accounts may be a new option.)

One more important note: when using volume access groups (VAGs) in Element OS, it’s also necessary to add the newly restored volume back into the correct VAG. Select the Management menu and the Access Groups submenu, then click the Modify Settings icon for the VAG in question.
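For anyone who prefers scripting to clicking, the two placement steps (account and QoS policy, then VAG membership) could also be expressed as Element API calls. The sketch below only builds the JSON-RPC request bodies and sends nothing. The method and parameter names are given as I recall them from the Element API reference, so treat them as assumptions and verify them against the documentation for your Element OS version. All of the IDs are placeholders.

```python
import json


def modify_volume_request(volume_id, account_id, qos_policy_id, request_id=1):
    """Build a request body that reassigns a volume's account and QoS policy.
    Method and parameter names are assumed; check your API reference."""
    return {
        "method": "ModifyVolume",
        "params": {
            "volumeID": volume_id,
            "accountID": account_id,
            "qosPolicyID": qos_policy_id,
        },
        "id": request_id,
    }


def add_to_access_group_request(volume_id, access_group_id, request_id=2):
    """Build a request body that adds one volume to a volume access group.
    Method and parameter names are assumed; check your API reference."""
    return {
        "method": "AddVolumesToVolumeAccessGroup",
        "params": {
            "volumeAccessGroupID": access_group_id,
            "volumes": [volume_id],
        },
        "id": request_id,
    }


# Placeholder IDs; real values come from the cluster.
bodies = [
    modify_volume_request(volume_id=42, account_id=7, qos_policy_id=3),
    add_to_access_group_request(volume_id=42, access_group_id=5),
]
for body in bodies:
    print(json.dumps(body))
```

Keeping the request construction separate from the transport makes it possible to review exactly what would be sent before pointing anything at a production cluster.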

Now all that’s left to do is to scan for the new storage on VMware, then add it with the existing signature.

I now have a completely recovered application!

Data Fabric Readiness

Backup is easy, but recovery is seldom straightforward. Recovery failures seem to happen at the worst possible time, in the worst possible manner. SnapMirror has the benefit of industrywide adoption and validation, and it has been integrated into most of the portfolio. The integration of this technology allows NetApp to realize the goal of complete data mobility in the hybrid cloud. This trend will continue as we push for a seamlessly integrated Data Fabric. Expect many more great things to come.

Shayne Williams

Shayne Williams is a Global Architect for the Cloud Infrastructure division at NetApp. Currently, his primary focus is to help provide validated private cloud solutions for NetApp’s Global 100 customers. He has worked for various industry storage vendors, focused on providing infrastructure solutions for large applications. As a former customer, he has direct experience with workflow automation and data center management, which gives customers a valuable firsthand perspective on solution delivery.