I’ve decided to explore what happens when a ZVM (Zerto Virtual Manager) in either the protected site or the recovery site is down for a period of time, what happens when it comes back into service, and most importantly, how an outage of either ZVM affects replication, journal history, and the ability to recover a workload.

Before getting into it, I have to admit that I was happy to see through this test how resilient the platform is, and how the ability to self-heal is a built-in “feature” that rarely gets talked about.

Questions:

Does ZVR still replicate when a ZVM goes down?

How does a ZVM being down affect checkpoint creation?

What can be recovered while the ZVM is down?

What happens when the ZVM is returned to service?

What happens if the ZVM is down longer than the configured Journal History setting?

Acronym Decoder & Explanations

ZVM

Zerto Virtual Manager

ZVR

Zerto Virtual Replication

VRA

Virtual Replication Appliance

VPG

Virtual Protection Group

RPO

Recovery Point Objective

RTO

Recovery Time Objective

BCDR

Business Continuity/Disaster Recovery

CSP

Cloud Service Provider

FOT

Failover Test

FOL

Failover Live

Does ZVR still replicate when a ZVM goes down?

The quick answer is yes. Once a VPG is created, the VRAs handle all replication. The ZVM takes care of inserting and tracking checkpoints in the journal, as well as automation and orchestration of Virtual Protection Groups (VPGs), whether it be for DR, workload mobility, or cloud adoption.

In the protected site, I took the ZVM down for over an hour via power-off to simulate a failure. Prior to that, I made note of the last checkpoint created. As the ZVM went down, within a few seconds, the protected site dashboard reported RPO as 0 (zero), VPG health went red, and I received an alert stating “The Zerto Virtual Manager is not connected to site Prod_Site…”

Great, so the protected site ZVM is down now and the recovery site ZVM noticed. The next step for me was to verify that despite the ZVM being down, the VRA continued to replicate my workload. To prove this, I opened the file server and copied the fonts folder (C:\Windows\Fonts) to C:\Temp (total size of data ~500MB).

As the copy completed, I then opened the performance tab of the sending VRA and went straight to see if the network transmit rate went up, indicating data being sent:

Following that, I opened the performance monitor on the receiving VRA and looked at two stats: Data receive rate, and Disk write rate, both indicating activity at the same timeframe as the sending VRA stats above:

As you can see, despite the ZVM being down, replication continues, though with caveats that you need to be aware of:

No new checkpoints are being created in the journal

Existing checkpoints up to the last one created are all still recoverable, meaning you can still recover VMs (VPGs), Sites, or files.

Even if replication is still taking place, you will only be able to recover to the latest checkpoint recorded before the ZVM went down. When the ZVM returns, checkpoints are once again created; however, you will not see checkpoints for the time the ZVM was unavailable. In my testing, the same was true when the recovery site ZVM went down while the protected site ZVM was still up.

How does the ZVM being down affect checkpoint creation?

If I take a look at the journal history for the target workload (file server), I can see that since the ZVM went away, no new checkpoints have been created. So, while replication continues on, no new checkpoints are tracked while the ZVM is down, since one of its jobs is to track checkpoints.

What can be recovered while the ZVM is down?

Despite no new checkpoints being created, FOT, FOL, VPG Clone, Move, and File Restore operations are all still available against the existing journal checkpoints. Given this was something I’d never tested before, this was really impressive.

One thing to keep in mind though is that this will all depend on how long your Journal history is configured for, and how long that ZVM is down. I provide more information about this specific topic further down in this article.

What happens when the ZVM is returned to service?

So now that I’ve shown what is going on when the ZVM is down, let’s see what happens when it is back in service. To do this, I just need to power it back up, and allow the services to start, then see what is reported in the ZVM UI on either site.

As soon as all services were back up on the protected site ZVM, the recovery site ZVM alerted that a Synchronization with site Prod_Site was initiated:

The next step here is to see what our checkpoint history looks like. Taking a look at the image below, we can see when the ZVM went down, and that there is a noticeable gap in checkpoints, however, as soon as the ZVM was back in service, checkpoint creation resumed, with only the time during the outage being unavailable.

What happens if the ZVM is down longer than the configured Journal History setting?

In my lab, for the above testing, I set the VPG history to 1 hour. That said, if you take a look at the last screenshot, older checkpoints are still available (showing 405 checkpoints). When I first tried to run a failover test after this experiment, I was presented with checkpoints that went beyond an hour. When I selected the oldest checkpoint in the list, a failover test would not start, even though the “Next” button in the FOT wizard did not gray out. This led me to believe that it may take a minute or two for the journal to be cleaned up.

Because I was not able to move forward with a failover test (FOT), I went back in to select another checkpoint, and this time, the older checkpoints (from over an hour ago) were gone. Selecting the oldest checkpoint at this point allowed me to run a successful FOT, because it was within range of the journal history setting. Lesson learned here, note to self: give Zerto a minute to figure things out, you just disconnected the brain from the spine!

Running a failover test to validate successful usage of checkpoints after ZVM outage:

And… a recovery report to prove it:

Summary and Next Steps

So in summary, Zerto is self-healing and can recover from a ZVM being down for a period of time. That said, there are some things to watch out for, which include knowing what your configured journal history setting is, and how a ZVM being down longer than that setting affects your ability to recover.

You can still recover; however, you will start losing older checkpoints as time goes on while the ZVM is down. This is because of the first-in-first-out (FIFO) nature of the journal. You will still have the replica disks, with the journal committing to them as time goes on, so losing history doesn’t mean you’re lost; you will just end up breaching your SLA for history, which will rebuild over time once the ZVM is back up.
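The FIFO behavior described above can be sketched as a rolling window over checkpoint timestamps. This is an illustrative model only, not Zerto's actual implementation: checkpoints older than the configured history are pruned, so when the ZVM stops inserting new ones, the recoverable window shrinks and eventually empties.

```python
from collections import deque
from datetime import datetime, timedelta

class JournalModel:
    """Toy FIFO model of a journal with a fixed history window."""

    def __init__(self, history: timedelta):
        self.history = history
        self.checkpoints = deque()  # oldest checkpoint on the left

    def add_checkpoint(self, ts: datetime):
        """Inserted by the ZVM; stops happening while the ZVM is down."""
        self.checkpoints.append(ts)
        self._prune(ts)

    def _prune(self, now: datetime):
        # Drop anything that has aged out of the history window.
        while self.checkpoints and now - self.checkpoints[0] > self.history:
            self.checkpoints.popleft()

    def recoverable(self, now: datetime):
        """Checkpoints still inside the history window at time 'now'."""
        self._prune(now)
        return list(self.checkpoints)

journal = JournalModel(history=timedelta(hours=1))
start = datetime(2017, 6, 1, 12, 0)

# ZVM inserts a checkpoint every 5 minutes for an hour, then goes down.
for m in range(0, 60, 5):
    journal.add_checkpoint(start + timedelta(minutes=m))

print(len(journal.recoverable(start + timedelta(minutes=60))))   # outage begins: 12 left
print(len(journal.recoverable(start + timedelta(minutes=90))))   # 30 min into outage: 6 left
print(len(journal.recoverable(start + timedelta(minutes=120))))  # an hour in: 0 left
```

This also matches the "give it a minute" lesson above: whether pruning happens eagerly or on the next query, the recoverable list only reflects the window once cleanup runs.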

As a best practice, it is recommended you have a ZVM in each of your protected sites and in each of your recovery sites for full resilience; after all, if you lose one of the ZVMs, you will need either the protected or recovery site ZVM available to perform a recovery. The case is different if you have a single ZVM. If you must run a single ZVM, put it in the recovery site, not the protected site, because chances are your protected site is the one you’re accounting for going down in any planned or unplanned event.

In the next article, I’ll be exploring this very example of a single ZVM and how that going down affects your resiliency. I’ll also be testing some ways to potentially protect that single ZVM in the event it is lost.

Thanks for reading! Please comment and share, because I’d like to hear your thoughts, and am also interested in hearing how other solutions handle similar outages.

Zerto is simple to install and simple to use, but it gets better with automation! While performing tasks within the UI can quickly become second nature, you can quickly find yourself spending a lot of time repeating the same tasks over and over again. I get it, repetition builds memory, but it gets old. As your environment grows, so does the amount of time it takes to do things manually. Why do things manually when there are better ways to spend your time?

Zerto provides great documentation for automation via PowerShell and REST APIs, along with Zerto Cmdlets that you can download and install to extend PowerShell and do more from the CLI. One of my favorite things is that the team has provided functional sample scripts that are pretty much ready to go, so you don’t have to develop them yourself for common tasks, including:

Querying and Reporting

Automating Deployment

Automating VM Protection (including vRealize Orchestrator)

Bulk Edits to VPGs or even NIC settings, including Re-IP and PortGroup changes

Offsite Cloning

For automated failover testing, Zerto includes an Orchestrator for vSphere, which I will cover in a separate set of posts.

To get started with PowerShell and RESTful APIs, head over to the Technical Documentation section of My Zerto and download the Zerto PowerShell Cmdlets (requires a MyZerto login) and the following guides. Stay tuned for future posts where I try these scripts out, offer a little insight into how to run them, and share how I’ve used them!

REST APIs Online Help – Zerto Virtual Replication

The REST APIs provide a way to automate many DR related tasks without having to use the Zerto UI.

This document includes an overview of how to use ZVR REST APIs with PowerShell to automate your virtual infrastructure. It also includes several functional scripts that take the hard work out of everyday tasks.
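As a taste of what those samples do under the hood, here is a minimal Python sketch of the session-based flow the ZVR REST API uses: authenticate against the ZVM, pick the session token out of a response header, then query VPGs. The ZVM address and credentials are placeholders, and the endpoints (`/v1/session/add` and `/v1/vpgs` on port 9669) are from Zerto's API documentation for this era of the product, so verify them against your version; the same calls map directly to PowerShell's Invoke-RestMethod.

```python
import base64
import json
import ssl
import urllib.request

ZVM = "https://zvm.example.com:9669"  # placeholder ZVM address

def basic_auth(user: str, password: str) -> str:
    """Build the HTTP Basic Authorization header value."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"

def get_session(zvm: str, user: str, password: str) -> str:
    """POST /v1/session/add; the token comes back in the x-zerto-session header."""
    req = urllib.request.Request(f"{zvm}/v1/session/add", data=b"", method="POST")
    req.add_header("Authorization", basic_auth(user, password))
    # ZVMs typically run self-signed certs; skip verification for lab use only.
    ctx = ssl._create_unverified_context()
    with urllib.request.urlopen(req, context=ctx) as resp:
        return resp.headers["x-zerto-session"]

def list_vpgs(zvm: str, session: str) -> list:
    """GET /v1/vpgs with the session header; returns VPG summaries as JSON."""
    req = urllib.request.Request(f"{zvm}/v1/vpgs")
    req.add_header("x-zerto-session", session)
    ctx = ssl._create_unverified_context()
    with urllib.request.urlopen(req, context=ctx) as resp:
        return json.load(resp)

# Example usage (requires a reachable ZVM; field names may vary by version):
#   session = get_session(ZVM, "administrator", "password")
#   for vpg in list_vpgs(ZVM, session):
#       print(vpg)
```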

If you’ve automated ZVR using PowerShell or REST APIs, I’d like to hear how you’re using it and how it’s changed your overall BCDR strategy.

I myself am still getting started with automating ZVR, but I’m really excited to share my experiences and hopefully help others along the way! In fact, I’ve already been working with bulk VRA deployment, so check back or follow me on Twitter @EugeneJTorres for updates!

This week was my first undertaking of preparing a vCenter environment for agent-less AV using Trend Micro Deep Security 9.6 SP1 and NSX Manager 6.2.4. So far, it’s been a great learning experience (to remain positive about it), so I wanted to share.

Initially (about 4 months ago), vShield was deployed, and since day 1, we had problems. Since I had never worked with Trend Micro Deep Security, I relied on the team that managed it to tell me what their requirements were. Sometimes this is the quickest way to get things done, but what I realized is that even if something works in one environment, it always helps to validate compatibility across the board. If I had done that, I would probably have saved myself a lot of headache…

So in anticipation of a “next time”, here are my notes from the installation and configuration. Hopefully others find this information useful, as some of it isn’t even covered in the Deep Security Installation Guide.

Please note, this is not a how-to. It’s more of a reference for the things we’ve discovered through this process that may not all be documented in one place. With that said, here’s my attempt at capturing what I’ve learned.

The process (after trial and error since the installation guide isn’t very detailed) that seems to work best is as follows:

Register the vCenter and NSX Manager from the Trend Micro Security Manager.

Deploy the Trend Micro Deep Security Service from vSphere Networking and Security in the Web Client.

We first set out to deploy this in our datacenter months ago. We initially started with vShield Manager, which is what was relayed to us from the team that manages Trend. We ran into issues deploying properly, and “things” were missing from the vSphere Web Client when documentation said otherwise. We had support tickets open with both VMware and Trend Micro for at least a few months. At one point, due to the errors we were getting, Trend and VMware both escalated the issue to their engineering/development teams. At the end of the day, we (the customers) eventually figured out what was causing the problem… DNS lookups.

The Trend Micro Deep Security installation guide does not cover this as a hard requirement. Although the product will allow you to input and save IP addresses instead of FQDNs, it just doesn’t work, so use DNS whenever possible!

vShield

First of all, after this experience I wouldn’t look at vShield anymore unless working in an older environment. In fact, I may just respond with:

If it’s EOL, is no longer supported, AND incompatible; I won’t even try. “Road’s closed pizza boy! Find another way home!”

You don’t gain anything from deploying EOL software. Very importantly, you don’t get any future security updates when vulnerabilities are discovered, and you won’t get any help if you call support about it.

In case you’re reading this and did the same thing I did, here are some things we noticed during this vShield experience:

Endpoint would not stay installed on some of the ESXi hosts, while it did on others.

There is additional work for this configuration if you’re using AutoDeploy and stateless hosts. (see VMware KB: 2036701)

When deploying the TM DSVAs to hosts where Endpoint did stay installed, installation failed as soon as we attempted to deploy the appliance.

This is where we discovered using FQDN instead of IP address is preferred.

After successfully deploying the DSVAs, we still had problems with virtual appliances and endpoint staying registered, so it never actually worked.

Since I didn’t do this up front, in the back of my mind I started questioning compatibility. Sure enough, vShield is EOL, and is not compatible with our vCenter and host versions.

VMware NSX Manager 6.2.4

With a little more research, I found that NSX with Guest Introspection has replaced vShield for endpoint/GI services, and as long as that’s all you’re using NSX for, the license is free. With NSX 6.1 and later, there is also no need to “prepare stateless hosts” (see VMware KB: 2120649).

Before simply deploying and configuring, I checked all the compatibility matrices involved and validated that our versions are supported and compatible. Be sure to check the resource links below, as there is some important information, especially around compatibility:

vCenter: v 6.0 build 5318203

ESXi: v 6.0 build 5224934

NSX: v 6.2.4 build 4292526

Trend Micro Deep Security: v 9.6 SP1

Note: NSX 6.3.2 can be deployed, but you will need at least TMDS 9.6 SP1 Update 3, which is why I went with 6.2.4; I will upgrade once TMDS is upgraded to support NSX 6.3.2.

What I’ve Learned

Here are some tips to ensure a smooth deployment for NSX Manager 6.2.4 and Trend Micro Deep Security 9.6 SP1.

Ensure your NTP servers are correct and reachable.

Use IP Pools if at all possible when deploying guest introspection services from NSX Manager. (makes deployment easier and quicker)

Set up a datastore that will house ONLY NSX related appliances. (makes deployment easier and quicker)

When you first set up NSX Manager, be sure to add your user account or domain group with admin access to it for management; otherwise, you won’t see it in the vSphere Web Client unless you’re logged in with the administrator@vsphere.local account.

Validate that there are DNS A and PTR records for the Trend Micro Security Manager, vCenter, and NSX Manager, otherwise anything you do in Deep Security to register your environment will fail.

Pay close attention to the known issues and workarounds in the “Compatibility Between NSX 6.2.3 and 6.2.4 with Deep Security” reference above, because you will see the error/failure they refer to.

If deploying in separate datacenters or across firewalls, be sure to allow all the necessary ports.

Unlike vShield Manager deploying Endpoint, NSX Manager deploys Guest Introspection at the cluster level. When using NSX, you can’t deploy GI to only one host; you can only select a cluster to deploy to.
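Since missing A and PTR records were the root cause of our months of pain, the forward-and-reverse lookup check from the tips above is worth scripting before you register anything. A minimal Python sketch; the commented hostnames are placeholders, so swap in your own Trend Micro Security Manager, vCenter, and NSX Manager names:

```python
import socket

def check_dns(hostname: str):
    """Return (forward_ip, reverse_name); a None entry marks a failed lookup."""
    try:
        ip = socket.gethostbyname(hostname)   # A record lookup
    except socket.gaierror:
        return None, None
    try:
        name = socket.gethostbyaddr(ip)[0]    # PTR record lookup
    except (socket.herror, socket.gaierror):
        return ip, None
    return ip, name

# Swap in your infrastructure hostnames, e.g.
# ("tmsm.corp.local", "vcenter.corp.local", "nsxmgr.corp.local"):
for host in ("localhost",):
    ip, rname = check_dns(host)
    print(f"{host}: A={ip} PTR={rname}")
```

If any host comes back with a None in either position, fix DNS before touching Deep Security.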

If you’ve found this useful in your deployment, please comment and share! I’d like to hear from others who have experienced the same!

Yesterday, we had one host in our recovery site PSOD, and that caused all kinds of errors in Zerto, primarily related to VPGs. In our case, this particular host had both inbound and outbound VPGs attached to its VRA, and we were unable to edit any of them to recover from the host failure (the edit button in the VPG view was grayed out, along with the “Edit VPG” link when clicking into the VPG). Previously when this would happen, we would just delete the VPG(s) and recreate them, preserving the disk files as pre-seeded data.

When you have a few of these to re-do, it’s not a big deal, however, when you have 10 or more, it quickly becomes a problem.

One thing I discovered that I didn’t know was in the product: if you click into the VRA associated with the failed host and go to the MORE link, there’s an option in there to “Change VM Recovery VRA.” This option allows you to tell Zerto that anything related to this VRA should now be pointed at a replacement. Once I did that, I was then able to edit the VPGs. I needed to edit the outbound VPGs because they were actually reverse-protected workloads that were missing some configuration details (NIC settings and/or Journal datastore).

Here’s how:

Log on to the Zerto UI.

Once logged on, click on the Setup tab.

In the “VRA Name” column, locate the VRA associated with the failed host, and then click the link (name of VRA) to open the VRA in a new tab in the UI.

Click on the tab at the top that contains VRA: Z-VRA-[hostName].

Once you’re looking at the VRA page, click on the MORE link.

From the MORE menu, click Change VM Recovery VRA.

In the Change VM Recovery VRA dialog, check the box beside the VPG/VM, then select a replacement host. Once all VPGs have been updated, click Save.

Once you’ve saved your settings, validate that the VPG can be edited, and/or is once again replicating.

Following an upgrade to ESXi 6.0 U2, this particular issue has popped up a few times, and while we still have a case open with VMware support in an attempt to understand root cause, we have found a successful workaround that doesn’t require any downtime for the running workloads or the host in question. This issue doesn’t discriminate between iSCSI and Fibre Channel storage, as we’ve seen it in both instances (SolidFire – iSCSI, IBM SVC – FC). One common theme where we are seeing this problem is that it happens in clusters with 10 or more hosts and many datastores. It may also be helpful to know that we have two datastores that are shared between multiple clusters; these datastores are for syslogs and ISOs/Templates.

Note: In order to perform the steps in this how-to, you will need to already have SSH running and available on the host, or access to the DCUI.

Observations

Following a host or cluster storage rescan, one or more ESXi hosts stop responding in vCenter while still having running VMs on them (host isolation)

Attempts to reconnect the host via vCenter don’t work

A direct client connection (thick client) to the host doesn’t work

Attempts to run services.sh restart from the CLI cause the script to hang after “running sfcbd-watchdog stop”. The last thing on the screen is “Exclusive access granted.”

At this point, /var/log/vmkernel.log displays the following: “Alert: hostd detected to be non-responsive”

Troubleshooting

Verify that the ESXi host is able to respond back to vCenter at the correct IP address and vice versa.

Verify that network connectivity exists from vCenter to the ESXi host’s management IP or FQDN

Verify that port 903 TCP/UDP is open between the vCenter and the ESXi host

Try to restart the ESXi management agents via DCUI or SSH to see if it resolves the issue

Verify if the hostd process has stopped responding on the affected host.

Verify if the vpxa agent has stopped responding on the affected host.

Verify if the host has experienced a PSOD (Purple Screen of Death).

Verify if there is an underlying storage connectivity (or other storage-related) issue.

Following these troubleshooting steps left me at verifying the hostd process, where I was able to determine that hostd was not responding on the host. The vmkernel.log entry above further supports this observation.
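The connectivity checks early in that list can be scripted instead of tested by hand. A quick sketch with placeholder targets; substitute your vCenter, your affected host, and whichever management ports your environment uses:

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Attempt a TCP connection; True means something is listening."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or unresolvable
        return False

# Placeholder targets; substitute e.g. ("esxi01.corp.local", 902) and the
# ports from the troubleshooting list above.
checks = [
    ("127.0.0.1", 1),
]
for host, port in checks:
    state = "open" if port_open(host, port) else "closed/unreachable"
    print(f"{host}:{port} is {state}")
```

Note this only proves TCP reachability; a host whose hostd has hung can still accept connections, which is exactly what makes this failure mode confusing.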

Resolution/Workaround Steps

These are the steps I’ve taken to remedy the problem without having to take the VMs down or reboot the host:

Since the hostd service is not responding, the first thing to do is run /etc/init.d/hostd restart from a second SSH session window (leaving the first one with the hung services.sh restart script process).

While the hostd restart command runs, the hung session will update and produce the following:

When you see that message, press enter to be returned to the shell prompt.

Now run /etc/init.d/vpxa restart, which is the vCenter Agent on the host.

After that completes, re-run services.sh restart and this time it should run all the way through successfully.

Once services are all restarted, return to the vSphere Web Client and refresh the screen. You should now see the host is back to being managed, and is no longer disconnected.

At this point, you can either leave the host running as-is, or put it into maintenance mode (vMotion all VMs off). Export the log bundle if you’d like VMware support to help analyze root cause.

Something I recently ran into with Zerto (though this can happen with anything else) was the dilemma of protecting remote sites that happen to have IP addresses identical to those in the recovery site, something that doesn’t happen often. And no, this wasn’t planned for; it was just discovered during my Zerto deployment in what we’ll call the protected sites.

Luckily, our network team had provisioned two new networks that are isolated, and connected to these protected sites via MPLS. Those two new networks do not have the ability to talk back to our existing enterprise network without firewalls getting involved, and this is by design since we are basically consolidating data centers while absorbing assets and virtual workloads from a recently acquired company.

When I originally installed the ZVM in my site (which we’ll call the recovery site), I had used IP addresses for the ZVM and VRAs that were part of our production network, not the isolated network set aside for this consolidation. Note: I installed the Zerto infrastructure in the recovery site ahead of time, before discussions about the isolated networks were brought up. Because I needed to get this onto the isolated network in order to replicate data from the protected sites to the recovery site, I set out to re-IP the ZVM and the VRAs. Before I could do that, I needed to provide justification for firewall exceptions in order for the ZVM in the recovery site to link to the vCenter, communicate with ESXi hosts for VRA deployment, and authenticate the computer, users, and service accounts in use on the ZVM. Oh, and I also needed DNS and time services.

The network and security teams asked if they could NAT the traffic, and my answer was “no” because Zerto doesn’t support replication using NAT. That was easy, and now the network team had to create firewall exceptions for the ports I needed.

Well, as expected, they delivered what I needed. To make a long story short, it all worked, and then about 12 hours before we were scheduled to perform our first VPG move, it all stopped working, and no one knew why. At this point, it was getting really close to us pulling the plug on the migration the following day, but I was determined to get this going and prevent another delay in the project.

When looking for answers, I contacted my Zerto SE, reached out on Twitter, and also contacted Zerto Support. At the time I was on the phone with support, we couldn’t do anything, because communication to the resources I needed was not working. We couldn’t perform a Zerto reconfigure to re-connect to the vCenter, and at this point, I had about 24 VPGs reporting they were in sync (lucky!), but ZVM-to-ZVM communication wasn’t working, and the recovery site ZVM was not able to communicate with vCenter, so I wouldn’t have been able to perform the cutover. Since support couldn’t help me in that instance, I scoured the Zerto KB looking for an alternate way of configuring this where I could get the best of both worlds and still stay isolated as needed.

I eventually found this KB article, which explained that not only is a dual-NIC ZVM supported, it’s also considered a best practice in CSP or large environments to separate management from replication traffic. I figured I was all out of ideas, and the back-and-forth with firewall admins wasn’t getting us anywhere, so I might as well give it a go. While the KB article offers the solution, it doesn’t tell you exactly how to do it beyond adding a second vNIC to the ZVM; there were some steps missing, which I figured out within a few minutes of completing the configuration. Part of this required me to re-IP the original NIC back to the original IP I had used, which was on our production network. Doing this re-opened the lines of communication to vCenter, ESXi hosts, AD, DNS, SMTP, etc. Now I had to focus on the vNIC that was to be used for all ZVM-to-ZVM and replication traffic. In a few short minutes, I was able to get communication going the way I needed it, so the final thing to do was re-configure Zerto to use the new vNIC for its replication-related activities. I did that, and while I was able to re-establish the production network communications I needed, I now wasn’t able to reach the remote sites (ZVM to ZVM) or the recovery site VRAs.

It turns out, what I needed here were some static, persistent routes to the remote networks, configured to use the specific interface I created for it.

Here’s how:

The steps I took are below the image. If the image is too small, consider downloading the PDF here.

On the ZVM:

Power it down, add a 2nd vNIC, and set its network to the isolated network. Set the primary vNIC to the production network.

Power it on. When it’s booted up, log in to Windows, and re-configure the IP address for the primary vNIC. Reboot to make sure everything comes up successfully now that it is on the correct production network.

After the reboot, edit the IP configuration of the second vNIC (the one on the isolated network). DO NOT configure a default gateway for it.

Open the Zerto Diagnostics Utility on the ZVM. You’ll find it in the start menu; on Windows Server 2008 or 2012, you can search for it by clicking the start menu and typing “Zerto.”

On the vCenter Server Connectivity screen, make any necessary changes and click Next. (Note: we’re only after changing the IP address the ZVM uses for replication and ZVM-to-ZVM communication, so in most cases you can just click Next on this screen.)

On the vCloud Director (vCD) Connectivity screen, make any necessary changes and click Next. (Note: same as the note in the vCenter Server Connectivity step.)

On the Zerto Virtual Manager Site Details screen, make any necessary changes and click Next. (Note: same as the note in the vCenter Server Connectivity step.)

On the Zerto Virtual Manager Communication screen, the only thing to change here is the “IP/Host Name Used by the Zerto User Interface.” Change this to the IP Address of your vNIC on the isolated Network, then click Next.

Continue to accept any defaults on following screens, and after validation completes, click Finish, and your changes will be saved.

Once the above step has completed, you will now need to add a persistent, static route to the Windows routing table. This will tell the ZVM that for any traffic destined for the protected site(s), it will need to send that traffic over the vNIC that is configured for the isolated network.

Use the following route statement syntax from the Windows CLI to create those static routes: route -p add <destination network> mask <subnet mask> <gateway> if <interface number> (the -p flag makes the route persist across reboots).

Note: To find out what the interface number is for your isolated network vNIC, run route print from the Windows CLI. It will be listed at the top of what is returned.
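If you have several remote networks to reach over the isolated vNIC, generating the route commands beats typing them one by one. A small sketch; the subnets, gateway, and interface index below are hypothetical placeholders for the values you pull from route print in your own environment:

```python
def route_cmd(network: str, mask: str, gateway: str, if_index: int) -> str:
    """Build a persistent Windows static-route command (-p survives reboots)."""
    return f"route -p add {network} mask {mask} {gateway} if {if_index}"

# Hypothetical values; substitute your remote subnets, the gateway on the
# isolated network, and the interface number reported by 'route print'.
REMOTE_SUBNETS = [("192.168.50.0", "255.255.255.0"), ("192.168.60.0", "255.255.255.0")]
ISOLATED_GW = "10.10.10.1"
IF_INDEX = 12

for net, mask in REMOTE_SUBNETS:
    print(route_cmd(net, mask, ISOLATED_GW, IF_INDEX))
```

Paste the printed commands into an elevated command prompt on the ZVM.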

Once you’ve configured your route(s), you can test by sending pings to remote site IP addresses that you would normally not be able to see.

After performing all of these steps, my ZVMs are now communicating without issue and replications are all taking place. A huge difference from hours before when everything looked like it was broken. The next day, we were able to successfully move our VPGs from protected sites to recovery sites without issue, and reverse protect (which we’re doing for now as a failback option until we can guarantee everything is working as expected).

If this is helpful or you have any questions/suggestions, please comment, and please share! Thanks for reading!

Continuing on from the previous blog post about configuring array-based replication with SRM, in this post we’ll go through configuring protection of a VM using vSphere Replication. The reason I’m doing this instead of jumping right into creating the protection groups and recovery plans is that vSphere Replication can function on its own without SRM. That said, we’ll go through the steps to protect a virtual workload using vSphere Replication, and follow up with creating protection groups and recovery plans, which come into play in either situation (ABR vs. vR) when we get to the orchestration functionality that SRM brings to the table.

vSphere Replication is included with VMware Essentials Plus and above, so chances are you have this feature available to you should you decide to use it to protect VMs with hypervisor-based replication. In my experience, vSphere Replication works great and can be used to either migrate or protect virtual workloads; however, as stated above, it can be limited. See this previous post for the details of what vSphere Replication can and can’t do without Site Recovery Manager.

Procedure

In this walkthrough for protecting a VM using vSphere Replication, I will be performing the steps using a decently sized Windows VM as the asset that needs protection. This VM is a plain installation of Windows; however, I used fsutil to generate files of different sizes to simulate data change.
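On Windows, fsutil file createnew produces those files quickly, but it allocates zeroed data. If you want change data that won't simply compress or dedupe away, a sketch like this writes pseudo-random content instead; the file name and size here are arbitrary:

```python
import os

def make_change_file(path: str, size_mb: int) -> None:
    """Write size_mb of pseudo-random data to simulate real block changes."""
    chunk = 1024 * 1024  # write in 1 MB chunks to keep memory flat
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(os.urandom(chunk))

make_change_file("change_sim.bin", 4)
print(os.path.getsize("change_sim.bin"))  # 4194304
```

Random data makes the replication traffic a more honest proxy for real workload churn than zero-filled files.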

In your vSphere Web Client, locate a VM that you wish to protect via hypervisor-based replication.

Right-click on the VM and go to All vSphere Replication Actions > Configure Replication.

When the wizard loads, the first screen asks for the replication type. Select Replicate to a vCenter Server, and click Next.

Select the Target Site and click Next.

Select the remote vSphere Replication server (or if you only have 1, then select auto-assign), wait for validation, then click Next.

On the target location screen, there are several options to configure, so we’ll go through them one by one:

- Expand the settings by clicking the arrow next to the VM, or click the info link.

- Click edit in the area labeled Target VM Location, select the target datastore and location for the recovery VM, then click OK to return to the previous screen.

- Typically, the previous step is enough; however, if you want to place VMDKs in specific datastores, edit their format (thick vs. thin provisioned), or assign a policy, use the edit links beside each hard disk. Once all your settings are how you want them, click Next.

Specify your replication options, then click Next.

Notes:
- Enable quiescing if your guest OS supports it; however, keep in mind that quiescing may affect your RPO times.
- Enable network compression to reduce required bandwidth and free up buffer memory on the vSphere Replication server; however, higher CPU usage may result, so it is best to test with both options to see what works best in your environment.

Configure RPO to meet customer requirements, enable point in time instances (snapshots in time as recovery points – maximum of 24) if needed, then click Next.
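Because vSphere Replication caps point-in-time instances at 24, the retention window you can cover is simply a function of how many instances you keep per day. A small illustrative helper (my own arithmetic sketch, not part of the product):

```python
def max_retention_days(instances_per_day: int, max_instances: int = 24) -> float:
    """Days of point-in-time history that fit within the instance cap.

    vSphere Replication retains at most 24 point-in-time instances, so
    (instances per day) x (days kept) must stay at or under 24.
    """
    if instances_per_day <= 0:
        raise ValueError("instances_per_day must be positive")
    return max_instances / instances_per_day

# Examples: 3 instances/day covers 8 days; 24/day covers only 1 day
```

This is worth working out before you commit to a schedule: a dense instance schedule gives fine-grained recovery points but a very short history.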

Review your configuration summary, make changes if necessary, but when you’re done, click Finish. As soon as you finish, a full sync will be initiated.

There you go: configuring vSphere Replication for a VM. The next post will cover creating protection groups and recovery plans, which we will then tie into what we’ve just performed here and into the array-based replication post.

Introduction

This how-to will walk through the installation and configuration of array-based replication features for VMware Site Recovery Manager 6.1.

Before configuring array-based replication for use with VMware SRM, there are some prerequisites. First, you will need to visit the VMware Compatibility Guide, which will help you determine whether your specific array vendor is supported for use with SRM. Second, there are steps to configure array-based replication on the storage side; that portion is out of scope for this blog, as I did not have access to do so.

There are several ways to search the compatibility guide, but to be specific, you can select entries from the areas highlighted above. The bottom highlighted section will show your results once you click “Update and View Results.” The reason I point this step out is that if you assume your array vendor is supported and don’t verify first, you could end up wasting your time planning and designing.

For this example, we are using SRM 6.1 with the Fibre Channel protocol on IBM SVC-fronted DS8Ks in both sites. I wanted to point that out because when I first set out to find the SRAs for our solution, I attempted to use the “IBM DS8000 Storage Replication Adapter,” only to find out later it wasn’t the correct one. The correct SRA for my environment is the “IBM Storwize Family Storage Replication Adapter,” so there may be a little bit of trial and error here; however, if you work it out up front during testing, you’ll save yourself time later when deploying to production.

That all said, once you’ve verified your storage is supported, and what version of the SRA to download, you can get it by visiting the VMware downloads (you will need to login). Be sure to also verify that the version of the SRA you are downloading is compatible with the version of array manager code you’re running.

Installing the SRA

Before you begin: prior to installing the SRA on the SRM server in each site (protected and recovery), you should have already paired the sites successfully. Also, SRM itself must already be installed; otherwise, the SRA installer will fail once it discovers that SRM is not present.

Installing the SRA should be straightforward and painless, as there are not many options to configure during installation. Once the installation is completed on both the protected and recovery SRM servers, proceed.

Verify That SRM Has Registered the SRAs

Once you’ve installed the SRA on each site’s SRM server, log into the vSphere Web Client, and go to Site Recovery > Sites and select a site.
From this view, you can see what SRA has been installed, its status, and compatibility information.

Click the rescan button to ensure the connection is valid and there are no errors.

Configure Array Managers

After pairing the protected and recovery sites, you will need to configure the respective array managers so SRM can discover replicated devices, compute datastore groups, and initiate storage operations. You typically only need to do this once, however, if array access credentials change, or you want to use a different set of arrays, you can edit the connections to update accordingly.

Pre-Requisites

Sites have been paired and are connected

SRAs have been installed at both sites and verified

Procedure

In the vSphere Web Client, go to Site Recovery > Array Based Replication.

On the Objects tab in the right window pane, click the icon to add an array manager.

Select from one of two options for adding array managers (pair or single), then click Next.

Select a pair of sites for the array manager(s), and click Next.

Enter a name for the array in the Display Name field, and click Next.

Provide the required information for the type of SRA you selected, and click Next.

If you chose to add a pair of array managers, enter the paired array manager information, then click Next.

Enable the checkbox beside the array pair you just configured, and click Next.

Review your configuration, then click Finish when ready.

Rescan Arrays to Detect Configuration Changes

SRM performs an automatic rescan every 24 hours by default to detect any changes made to the array configurations. It is recommended to perform a manual rescan following any changes to either site (reconfiguration, or adding/removing devices) so the datastore groups are recomputed. If you need to change the default interval at which SRM performs a rescan, you can do so in the advanced settings for each site by editing the storage.minDsGroupComputationInterval setting.

To perform a manual rescan after making any configuration changes:

Go to Site Recovery > Array Based Replication

Select an array for either site

On the Manage tab of the selected array, click the Array Pairs sub tab

Click the rescan button to perform a manual rescan.

Once you’ve got all of the above configured, you can begin setting up your protection groups and recovery plans.

Introduction

Obviously, based on my previous blog posts, it’s apparent that I’ve been spending some time in the past few months testing VMware Site Recovery Manager and Zerto Virtual Replication to see which product best meets our business continuity and disaster recovery requirements. My task was to compare the two products, feature for feature based on our use cases, which are primarily protection, recovery, re-protection, and workload migration.

Get comfortable, this could take a while…

Blue vs. Red

As of today, SRM and Zerto have been tested in a sandbox environment consisting of 2 sites (Seattle and Denver), 2 vCenters, 2 physical hosts in a cluster in each site, and 1 test workload: a Windows Server VM with auto-generated files of different sizes. The two sites, being geographically separated, are joined by a dual 20 Gb/s connection, and there are no bandwidth-throttling mechanisms in place outside of what’s available in the software, which is only used to throttle down during business hours. The physical networking at the host level in both sites is 10GbE.

VMware’s Site Recovery Manager is the only one of the two products that has the array-based replication feature, so to make this more of an “apples-to-apples” comparison, that feature isn’t heavily reported on here, but has been tested, and it works well, so I’m happy.

Both hypervisor-based product tests that were performed have been completed in each direction, in terms of recovery testing, failover, re-protection, and migration. The results of both solutions are similar, however, based on results, we are leaning more toward one product in terms of simplicity, flexibility, scalability, monitoring capabilities, and user experience.

Below are images of what the topology for both test environments looks like, with SRM on the left, and Zerto on the right.

If you are interested in seeing these diagrams up close, you can download the PDFs for each here:

^^ Not pictured in the Zerto Diagram: External PSCs for vCenter, vCenter SQL Servers, and all port communication native to vCenter components.

Product Comparison

While VMware Site Recovery Manager forms a complete solution together with vSphere Replication (which can also be used without SRM), Zerto likewise protects workloads using hypervisor-based replication. To compare the two, we must first compare the capabilities of vSphere Replication (without SRM) to Zerto Virtual Replication. Note that without SRM, vSphere Replication can be rather limited in several areas. The table below lays out the use cases and features of each product.

| Feature | vSphere Replication / SRM | Zerto Virtual Replication |
| --- | --- | --- |
| Provides planning, testing, and execution of disaster recovery for vSphere | Yes | Yes |
| Designed for | SRM was designed for disaster recovery orchestration only | Designed for hypervisor-based replication AND disaster recovery orchestration |
| Licensed | Per-VM | Per-VM |
| Replication granularity | Per-VM or multi-select, but virtual protection grouping is not available | Per-VM and/or per Virtual Protection Group |
| Consistency groups (virtual protection groups) | No | Yes |
| Replication recovery points | Yes, up to 24 snapshots | Yes, up to 14 days with standard recovery; up to 1 year with extended recovery using the Offsite Backup feature |
| Compatibility | vSphere Replication works with ESX 5.x and above; SRM requires the same version of vCenter and SRM at both sites | Works with ESXi 4.0 U1 and above; can replicate between different versions of vCenter; can also protect and recover from vSphere to Hyper-V, Hyper-V to vSphere, and from either platform to the cloud (AWS; Azure as of Zerto v5.0) |
| Managed with | vSphere Client plugin | vSphere Client plugin and standalone browser UI |
| Replication is performed with | vSphere Replication | Zerto hypervisor-based replication, through VRAs deployed to each host with protected VMs |

Steps from Installation to Protection

The following table compares the high-level installation tasks/steps for VMware Site Recovery Manager and Zerto Virtual Replication. These steps assume that necessary prerequisites, such as vCenter installation and firewall rules, are already in place.

Please note that SRM appears to have many more steps because it supports array-based replication in addition to vSphere Replication. If you don’t use one or the other, the step count decreases dramatically. In my test environment, both features were tested, and because of that, SRM has more steps.

VMware Site Recovery Manager

1. Build Windows VMs to host SRM in each site
2. Build a SQL Server (or leverage an existing one), or use the embedded vPostgres database
3. Install SRM in the protected and recovery sites and license it
4. Connect the SRM instances in the protected and recovery sites (Note: this requires a functional, error-free vCenter/PSC infrastructure; PSCs should be in sync with no errors)
5. Pair the SRM instances
6. Install and configure the Storage Replication Adapters (SRAs)
7. Pair the array managers
8. Configure inventory mappings
9. Create protection groups and recovery plans
10. Test, validate, protect, test recovery, monitor, and alert
11. If using vSphere Replication: install, configure, and pair vSphere Replication appliances in each site

Zerto Virtual Replication

1. Build Windows VMs to host Zerto in each site
2. Install Zerto on each ZVM and apply the license on login
3. Optional: build a SQL Server (or leverage an existing one), or use the embedded database (see the database requirements in the table above for guidance on sizing the DB and when to use an external SQL Server)
4. Pair the Zerto instances
5. Edit site settings, schedule throttling if using a shared WAN connection, and configure alerts, thresholds, etc.
6. Deploy VRAs (Virtual Replication Appliances; one per host that will be protecting VMs)
7. Build Virtual Protection Groups (the VPG configuration also includes recovery options such as re-IP and pre/post scripts)
8. Test, validate, protect, test recovery, monitor, and alert

Protection Workflow

The following workflows have been created to illustrate the process involved in protecting virtual workloads using VMware Site Recovery Manager with vSphere Replication, and Zerto Virtual Replication.
Individual files for each protection workflow in full-size view are here:

In the above images (SRM on the left, Zerto on the right), you can see that SRM clearly has many more steps, performed in multiple places, compared to Zerto. The majority of the additional steps in the SRM protection workflow deal with the multiple layers where protection is configured via the vSphere Web Client for a single VM using vSphere Replication. On the Zerto side, most of the steps (if not all) for protecting virtual workloads take place at the top layer, the Zerto Virtual Manager UI.

In SRM, protecting a single VM with vSphere Replication involves selecting the VM and enabling vSphere Replication, going into Site Recovery, building and configuring a protection group, and then creating and configuring a recovery plan. The recovery plan is where customization such as boot priority and IP address changes is completed.

In Zerto, protecting a single VM is as easy as logging into the ZVM UI, creating a VPG, and providing protection and recovery settings all within one wizard.

Recovery Workflow

The following workflows have been created to illustrate the process involved in recovering from a site failure using VMware Site Recovery Manager with vSphere Replication, and Zerto Virtual Replication.

Individual files for each recovery workflow in full-size view are here:

In the above images (SRM on the left, Zerto on the right), you can see that the steps to recovery are fairly similar, with the exception that recovery in SRM is performed via the vSphere Web Client, while recovery in Zerto is performed from the ZVM UI (in both scenarios, recovery is performed at the recovery site). The most complex part of recovering in any scenario is organizing the admins, engineers, and business stakeholders to recover, reconfigure, and validate. Of course, if routine recovery testing has been taking place, an actual failure should basically mimic a recovery test, although at that point it is a commitment rather than an exercise.

In SRM, there really is one place to take care of a recovery: Site Recovery > Recovery Plans. Locate the recovery plan for the application(s) you want to recover, and click the red button; it’s a no-brainer!

In the Zerto UI home screen, toggle the failover type from test to live, and click the recover button. You will be presented with a three-step wizard, where you select the VPG(s) to recover; select the checkpoint to recover from, set the commit policy, and choose re-protection; and click the Start Failover button. Recovery and re-protection, all in one place. The re-protection process in either product is straightforward; however, if there isn’t already a site built to re-protect to, there will be some work to do (in either case).

Implementation Time and Complexity

Planning, designing, and implementing either of these two products shouldn’t be difficult for anyone, except that there are several prerequisites that take time: change management processes and schedules to follow, or firewall rules to create and verify. With SRM, I’ve found that since the product ties so closely into vSphere and version matching is a requirement, this could delay anyone who doesn’t have a version-aligned environment or experience with vSphere or SRM. The biggest requirement for SRM? vSphere. You will need a fully functional vSphere deployment, at an exact minimum version, in both sites in order to deploy SRM successfully. Zerto doesn’t care if the vCenter/ESXi versions in the two sites match, as long as the minimum supported version is in use.

Granular version requirements can create administrative overhead and require total team collaboration for upgrades, maintenance, recovery, etc., because SRM relies heavily on version compatibility (as do other VMware products). In cases like this, there are specific orders of operations required for upgrades or power-on operations. Those requirements are out of scope here, but it pays to understand that they exist; be sure to do some research and, if you can, test before performing in production.

When installing Zerto, what took the most time was building the Windows VMs (a few hours x 2) to house the ZVM in each site, along with firewall rules (about 2 weeks in my case, following approval, change management, and implementation). Once the VMs were built and the firewall rules were in place, the actual installation of Zerto took about 10-15 minutes per ZVM, plus approximately 10 minutes to deploy each VRA, which can also be bulk scripted. Zerto works as long as the hypervisor and vCenter are at a minimum version supported by Zerto, and it can protect across versions, or even across hypervisors (VMware vSphere and Microsoft Hyper-V)! VPG creation time can vary, depending on how many VMs per VPG you want to protect and how much customization you do, with one of the more time-consuming items being recovery and test IP settings. That’s it. Once you have a VPG created, initial synchronization starts, and as soon as the sites are in sync, you’ll be ready to test, recover, or migrate and re-protect.

Monitoring and Reporting

Monitoring and Reporting with VMware Site Recovery Manager

VMware Site Recovery Manager provides monitoring and reporting; however, what you see is limited depending on where you are in the object hierarchy (but the data is there!):

number of replicated VMs per host

amount of data transferred

number of RPO violations

replication count

number of sites successfully connected

These reports can also be expanded to show more detail, and the date range can be modified. In my experience during testing, monitoring replication status and information isn’t as intuitive and centrally located as you would expect; there are several different places to monitor protection status and get additional information.

Some of this is at the VM level, where you will see replication status, last sync point, target site, quiescing (enabled/disabled), network compression (enabled/disabled), RPO, point-in-time recovery (enabled/disabled), and disk status.

Monitoring at the VM Object

At the VM (protected VM) level, you can monitor replication performance; however, it is limited to two counters:

Replication Data Receive Rate (Average in KBps)

Replication Data Transmit Rate (Average in KBps)

Monitoring at the Site Recovery > Sites Level

At the site level, you can monitor things like issues, recovery plan history, and also get basic protection group and recovery plan information for Array Based Replication, Protection Groups, and Recovery Plans:

Monitoring at the Protection Group Level

At the protection group level, the summary tab will give you information such as status, number of VMs that are in the protection group, configuration status of those VMs, and any replication warnings (not clickable for more detail):

Selecting a protection group gives you a list of recovery plans, and VMs, and general protection information, but no logging or reporting.

Monitoring at the Recovery Plan Level

At the recovery plan level, selecting a recovery plan shows you the plan status, VM status, and recent history if the recovery plan has been run for a test or a failover:

Digging deeper into a recovery plan, you have the ability to see recovery plan steps, history, protection group general protection information, and virtual machine general protection information:

Monitoring vSphere Replication at the vCenter Level

One more place I was able to find monitoring and reporting is at the vSphere Replication level. Go to vSphere Replication in the vSphere Web Client, click on a vCenter, then go to the Monitor tab and click vSphere Replication to reach the screen in the image below, where you can monitor Outgoing Replications, Incoming Replications, View Reports, and Cloud Recovery Settings. The reports section contains the most information; however, there isn’t a way in the UI to export reports if a customer requests a history of their replication jobs.

Monitoring Outgoing Replications (per vCenter)

This section displays any point-in-time snapshots that can be recovered to, if configured, as well as replication information (although very general) such as:

Status

VM

Target Site

vR Server used

Configured Disks

Last Instance Sync Point

Last Sync Duration

Last Sync Size

RPO

Quiescing (enabled/disabled)

Network Compression (enabled/disabled)

Monitoring Incoming Replications (per vCenter)

This section displays Point in Time Snapshots, Recovery history, and Replication information (again all general) such as:

Status

VM (when a VM is selected above)

Target Site

vR Server

Configured Disks

What manages the incoming replications (in this case, it’s SRM)

Last instance sync point

Last sync duration

Last sync size

RPO

Quiescing (enabled/disabled)

Network Compression (enabled/disabled)

Reporting for vSphere Replication (per vCenter)

This section contains statistical information that can be filtered by date range. This section is a little more detailed (my favorite view), and actually contains numbers on graphs. It contains information such as:

Count of replicated vs non-replicated VMs

Replicated VMs by host

Transferred bytes

RPO violations

Replications Count

Site connectivity status

vR Server Connectivity (not pictured)

While this is great information, there is no way from the interface to export the reports if needed.

Cloud Recovery Testing

This section contains general information on any replications to the cloud. Since we are not replicating to the public cloud, this section is empty, but I have shown it to display what detail it contains.

Based on the findings above for monitoring vSphere Replication and SRM, there are multiple places to look for information, statistics, and reports. The problem is that monitoring ongoing replication jobs, recoveries, and performance is a multi-tiered effort, and there is no centralized, exportable view of the information. There are simply too many places to look, and it would be too tedious to effectively monitor protection jobs, recoveries, and performance out of the box.

Monitoring and Reporting in Zerto Virtual Replication

Monitoring protection status in Zerto has been intuitive, detailed, and centralized. Zerto separates the two functions into “tabs” within the UI: one tab for monitoring (including tasks and alerts), and one tab for reporting. The ability to set Zerto up to alert via e-mail and send reports at a regular (scheduled!) interval is natively built into the product. It doesn’t stop at one destination e-mail address, either; it allows multiple recipients via comma or semicolon separators in the site settings. In the resource reports, you can set the sampling rate and the sampling time interval. For a BC/DR solution, it is far better to receive more information than necessary than to wait for a problem to surface. Nothing is more embarrassing (or resume-generating) than finding out at the point of a failure that your replication product hasn’t been replicating much or hasn’t been able to meet your RPO/RTO.

In the Zerto UI, monitoring alerts, events, and tasks is as simple as clicking on the “monitoring” tab. You can search for specific events or alerts (or both), and also modify the timeframe that you are targeting. In the reporting tab, you can get reports for the following items, and you can select any of them per VPG, or for all VPGs (and customize the reporting dates).

Protection Over Time by Site (Journal Usage in GB, VMs protected by count)

Recovery Reports by VPG, type, and/or status

Resource Report – shows resources used by protected VMs, which is required by Zerto to ensure recovery capability. (Exports to Excel)

Usage – exports to CSV, PDF, or ZIP

Conclusion

In conclusion, both products work as advertised, and deciding which product to go with may come down to trust, flexibility, simplicity, scalability, monitoring & reporting, re-protection capabilities, and of course, cost. When considering the cost of either solution, be sure to also include the cost of human hours required to successfully deploy and support either one. Both products have their benefits and quirks, but the bottom line is that THEY BOTH WORK GREAT!

Since I went through the entire process, from design to implementation to protection, testing, and recovery, it took a considerable amount of time for VMware Site Recovery Manager to become usable due to some external problems we were having, and that left a bad taste in my mouth (it was frustrating, but it was specific to my environment). Because Zerto wasn’t prevented from working by those existing problems, it felt much simpler; but don’t get me wrong, you still have to plan your deployment. The time it took to have each product deployed and functioning varied considerably, with Zerto coming in as the winner in time-to-protection versus Site Recovery Manager in my experience (again, related to the underlying problems in my environment).

Array-based replication is an optional feature of SRM, and once we figured out what was needed on the SAN side for it to work properly, it actually runs nicely. This method has historically been an expensive route due to the requirement of having the same storage (at least the same vendor) in each site (protected and recovery). It also introduces another layer of complexity in configuration, administration, maintenance, and support alignment, which will involve the SAN administrators. vSphere Replication, on the other hand, is easy to set up, and you can be replicating VMs with it in a short period of time.

Scalability of the products is another area I researched and determined that both products can protect up to 5000 VMs per vCenter instance (refer to comparison tables).

vSphere Replication (without Site Recovery Manager) has a limitation of 1 vSphere Replication appliance per vCenter instance. By leveraging the additional limit of 9 more vSphere Replication servers per vSphere Replication appliance, you can protect up to 2000 VMs; see here for details. When pairing vSphere Replication with Site Recovery Manager and array-based replication, you can protect up to 5000 VMs per vCenter instance. (SRM Operation Limits)

Zerto can scale out to take advantage of cluster resources by deploying a VRA (virtual replication appliance) to each host in a cluster where you are protecting VMs. The VRAs come at no additional cost (both products are licensed per VM being protected) and can be sized as needed for best performance. When deploying Zerto VRAs, you will need IP addresses, so that’s one downside to having one per host, especially in large environments. On the plus side, you can deploy all those VRAs from one screen and their deployments can be automated, so that saves time.
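Since each VRA needs its own IP address, it helps to plan a contiguous block before a bulk deployment across a cluster. A small hedged sketch using Python’s ipaddress module (the subnet, starting address, and host count below are made-up examples, not from my environment):

```python
import ipaddress

def plan_vra_ips(first_ip: str, host_count: int) -> list[str]:
    """Allocate one sequential IPv4 address per host for VRA deployment."""
    start = ipaddress.IPv4Address(first_ip)
    # IPv4Address supports integer offsets, so start + i is the i-th address
    return [str(start + i) for i in range(host_count)]

# Example: 4 hosts in a hypothetical recovery cluster
# plan_vra_ips("10.1.20.50", 4)
# -> ["10.1.20.50", "10.1.20.51", "10.1.20.52", "10.1.20.53"]
```

In practice you would feed a list like this into whatever bulk-deployment script you use, and verify none of the addresses collide with existing DHCP scopes or reservations.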

Compatibility requirements for each product vary as well, with SRM having more requirements in both sites (protected and recovery). Since Zerto is deployed on top of the virtualization infrastructure, it is not tightly integrated into the base vSphere product, nor does it carry the same version requirements as SRM. Zerto is very flexible in versioning for both the protected and recovery sites, and it can also protect and recover to/from vSphere, Microsoft Hyper-V, or cloud providers.

Lastly, while I’m not a seasoned programmer or script guru, at a high level both products can be managed programmatically, and both support PowerShell (with SRM requiring the PowerCLI add-on from VMware). Both products can also leverage vRealize Orchestrator, allowing workflow automation for protection tasks. Both support multiple scripting/programming languages and have documented APIs; however, in the case of SRM, the creation of recovery plans and forced failovers cannot be automated (per the API documentation). Zerto can be managed through a feature-rich RESTful API that covers pretty much every aspect of the product and its capabilities, and the documentation is clear and full of example scripts in each of the supported languages for everyday tasks.
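As a small illustration of the REST automation mentioned above, here is a hedged Python sketch of how a Zerto API session is typically established: an authenticated POST to the session endpoint returns an x-zerto-session token, which is then sent as a header on subsequent calls such as listing VPGs. The host name is hypothetical, and the port and endpoint paths reflect my understanding of the Zerto v1 API; verify them against the API documentation for your version.

```python
import base64

ZVM_HOST = "zvm.example.local"   # hypothetical ZVM address
ZVM_PORT = 9669                  # Zerto API port in my understanding; verify for your version

def basic_auth_header(user: str, password: str) -> dict:
    """Build the HTTP Basic Authorization header used to open a session."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

def session_url(host: str = ZVM_HOST, port: int = ZVM_PORT) -> str:
    # POSTing here with the basic-auth header should return an
    # x-zerto-session token for use on later calls (e.g. GET /v1/vpgs).
    return f"https://{host}:{port}/v1/session/add"

# With the `requests` library (not executed here, since it needs a live ZVM):
# r = requests.post(session_url(), headers=basic_auth_header("admin", "pw"), verify=False)
# session = r.headers["x-zerto-session"]
# vpgs = requests.get(f"https://{ZVM_HOST}:{ZVM_PORT}/v1/vpgs",
#                     headers={"x-zerto-session": session}, verify=False).json()
```

From there, the same session header pattern applies to the rest of the API, which is how the everyday tasks mentioned above (VPG creation, failover tests, reporting) can be scripted.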

I hope this information has been helpful for those who are trying to decide which product to go with, and as always, comments or questions are welcome! And if you find this to be useful information, please share it!

The other day, while I was looking into what programming language would be fun to learn, I stumbled across this post outlining which languages are good to learn for 2016 and why. I’ve seen other articles, and even watched some YouTube videos, to see what’s out there and what is and isn’t popular in terms of:

Easy to learn as a first language

Popularity

History

Growth

Support and Application

I kept seeing a pattern where Python constantly appeared on the lists I was looking at. It’s a very common language taught at universities such as MIT; it is apparently easy to learn, fast to program in, has a clear syntax, solid documentation, and is cross-platform. I’m sure there are many other reasons to learn it, but personally, I really like the fact that it’s cross-platform and is reported to have a 99% success rate. I guess only time and effort will tell, right?

No matter the language, there will always be a use for it, big or small, each with its own community to draw knowledge from and share with. Mainly, I’m tired of pointing and clicking for so many of the things I do, and I’ve had this need in the back of my mind to create things from nothing, supplement existing tools, or improve on others. There’s no better feeling than accomplishment, especially when you can combine logic and creativity and see the results.

If you’re a veteran or even novice Python programmer, I’d love to hear about your experience and tips!