Datrium have had a scalable protection tier and focus on performance since their inception.

[image courtesy of Datrium]

The “mobility tier”, in the form of Cloud DVX, has been around for a little while now. It’s simple to consume (via SaaS), yields decent deduplication results, and the Datrium team tells me it also delivers fast RTO. There’s also solid support for moving data between DCs with the DVX platform. This all sounds like the foundation for something happening in the hybrid space, right?

And Into The Future

Datrium pointed out that disaster recovery has traditionally been a good way of finding out where a lot of the problems exist in your data centre. There’s nothing like failing a failover to understand where the integration points in your on-premises infrastructure are lacking. Disaster recovery needs to be a seamless, integrated process, but data centres are still built on various silos of technology. People are still using clouds for a variety of reasons, and some clouds do some things better than others. It’s easy to pick and choose what you need to get things done. This has been one of the big advantages of public cloud and a large reason for its success. As a result of this, however, the silos are moving to the cloud, even as they’re fixed in the DC.

With this in mind, Datrium are looking to develop a solution that delivers on the following theme: “Run. Protect. Any Cloud”. The idea is simple: an orchestrated DR offering that makes failover and failback a painless undertaking. Datrium tell me they’ve been a big supporter of VMware’s SRM product, but have observed that there can be problems with VMware offering an orchestration-only layer: adapters have issues from time to time, and managing the solution can be complicated. With CloudShift, Datrium are taking a vertical stack approach, positioning CloudShift as an orchestrator for DR as a SaaS offering. Note that it only works with Datrium.

[image courtesy of Datrium]

The idea behind CloudShift is pretty neat. With Cloud DVX you can already back up VMs to AWS using S3 and EC2. With CloudShift, you can leverage data already in AWS to fire up VMs on AWS (using on-demand instances of VMware Cloud on AWS) to provide temporary disaster recovery capability. The good thing about this is that converting your VMware VMs to someone else’s cloud is no longer a problem you need to resolve. You’ll need to have a relationship with AWS in the first place – it won’t be as simple as entering your credit card details and firing up an instance. But it certainly seems a lot simpler than having an existing infrastructure in place, and dealing with the conversion problems inherent in going from vSphere to KVM and other virtualisation platforms.

[image courtesy of Datrium]

Failover and failback are fairly straightforward as well, with the following steps required for workloads:

Backup to Cloud DVX / S3 – This is ongoing and happens in the background;

It’s being pitched as a very simple way to run DR, something that has been notorious for being a stressful activity in the past.

Thoughts and Further Reading

CloudShift is targeted for release in the first half of 2019. The economic power of DRaaS in the cloud is very strong. People love the idea that they can access the facility on-demand, rather than having passive infrastructure doing nothing on the off chance that it will be required. There’s obviously some additional cost when you need to use on demand versus reserved resources, but this is still potentially cheaper than standing up and maintaining your own secondary DC presence.

Datrium are focused on keeping inherently complex activities like DR simple. I’ll be curious to see whether they’re successful with this approach. The great thing about a generic orchestration framework like VMware SRM is that you can use a number of different vendors in the data centre and not have a huge problem with interoperability. The downside to this approach is that this broader ecosystem can leave you exposed to problems with individual components in the solution. Datrium are taking a punt that their customers are going to see the advantages of having an integrated approach to leveraging on demand services. I’m constantly astonished that people don’t get more excited about DRaaS offerings. It’s really cool that you can get this level of protection without having to invest a tonne in running your own passive infrastructure. If you’d like to read more about CloudShift, there’s a blog post that sheds some more light on the solution on Datrium’s site, and you can grab a white paper here too.

I’ve been doing a bunch of research into Pure Storage’s ActiveCluster product recently. I was all set to do an article that explains how to set it up and what a vSphere Metro Cluster looks like with it in place, but Cody Hosterman has beaten me to the punch. Given that it’s more his job than mine to write this stuff, and that he works for Pure Storage, I’m okay with that. In any case, I thought it would be worthwhile to jot down some thoughts and notes and share some links to Cody’s work, if for no other reason than it gives me an aggregation point for my thoughts.

Introduction

I was lucky enough to be at Pure//Accelerate in 2017 when ActiveCluster was announced and covered it at a high level here. If you’re unfamiliar with ActiveCluster, it’s “a fully symmetric active/active bidirectional replication solution that provides synchronous replication for RPO zero and automatic transparent failover for RTO zero. ActiveCluster spans multiple sites enabling clustered arrays and clustered ESXi hosts to be used to deploy flexible active/active datacenter configurations.” (https://kb.vmware.com/s/article/51656).

[image courtesy of Pure Storage]

Components

There are a few bits that are needed to make ActiveCluster work (besides Purity 5.0 on your FlashArray):

Replication Network;

Pods; and

Pure1 Cloud Mediator.

Replication Network

The replication network is used for the initial asynchronous transfer of data to stretch a pod, to synchronously transfer data and configuration information between arrays, and to resynchronise a pod. For this network to work, you should note the following criteria apply:

The maximum tolerable RTT is 5ms between clustered FlashArrays;

4x 10GbE replication ports per array (two per controller). Two replication ports per controller are required to ensure redundant access from the primary controller to the other array;

4x dedicated replication IP addresses per array;

A redundant, switched replication network. Direct connection of FlashArrays for replication is not supported; and

Adequate bandwidth between arrays to support bi-directional synchronous writes and bandwidth for resynchronizing. This depends on the write rate of the hosts at both sites.

So, you need to know (and understand) your workload, and you need some reasonable bandwidth between the arrays. This shouldn’t be unexpected, but it’s clearly well suited to a metro deployment.
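The criteria above lend themselves to a simple pre-flight check. What follows is a minimal sketch in Python, not anything Pure Storage ships: the `ReplicationLink` class, its field names, and the `check_replication_link` function are all hypothetical, and the bandwidth test assumes (as a rough rule of thumb of my own) that the link should at least cover the combined peak write rate of both sites. Resynchronisation headroom is not modelled.

```python
from dataclasses import dataclass
from typing import List

MAX_RTT_MS = 5.0            # maximum tolerable RTT between clustered arrays
PORTS_PER_CONTROLLER = 2    # replication ports required on each controller
CONTROLLERS_PER_ARRAY = 2   # implied by "4x ports per array (two per controller)"
REPLICATION_IPS = 4         # dedicated replication IP addresses per array


@dataclass
class ReplicationLink:
    """Hypothetical description of one array's replication setup."""
    rtt_ms: float
    ports_per_controller: List[int]   # e.g. [2, 2] for a two-controller array
    replication_ips: int
    switched_network: bool            # False means direct array-to-array cabling
    link_bandwidth_mbps: float
    peak_write_mbps_site_a: float
    peak_write_mbps_site_b: float


def check_replication_link(link: ReplicationLink) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if link.rtt_ms > MAX_RTT_MS:
        problems.append(f"RTT {link.rtt_ms}ms exceeds {MAX_RTT_MS}ms")
    if (len(link.ports_per_controller) != CONTROLLERS_PER_ARRAY
            or any(p < PORTS_PER_CONTROLLER for p in link.ports_per_controller)):
        problems.append("need two replication ports on each of two controllers")
    if link.replication_ips < REPLICATION_IPS:
        problems.append(f"need {REPLICATION_IPS} dedicated replication IPs")
    if not link.switched_network:
        problems.append("direct connection of arrays is not supported")
    # Assumption: bi-directional synchronous writes need at least the sum of
    # both sites' peak write rates.
    required = link.peak_write_mbps_site_a + link.peak_write_mbps_site_b
    if link.link_bandwidth_mbps < required:
        problems.append(f"link bandwidth below combined write rate ({required} Mbps)")
    return problems
```

Feed it numbers from your own monitoring rather than guesses; the write rates in particular are the bit people tend to underestimate.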

Pods

A Pod is a replication namespace. Once a pod is created, the pod (and the volumes inside it) can be controlled from either FlashArray. If you create a snapshot, that snapshot is created on both sides. If snapshots exist on the volume before it’s added to the pod, those snapshots will be copied over when you add it in. The pod itself acts as a consistency group.
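To make the pod behaviour a little more concrete, here is a toy model in Python. It is purely illustrative and is not the Purity API: the `Pod` class, the array names, and the snapshot naming scheme are all made up for the example. It only captures the two snapshot behaviours described above: existing snapshots are copied across when a volume joins the pod, and a pod snapshot lands on every volume on both sides.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Pod:
    """Toy model of a stretched pod; names and structure are hypothetical."""
    name: str
    arrays: tuple = ("array-a", "array-b")
    # snapshots[array][volume] -> list of snapshot names held on that array
    snapshots: Dict[str, Dict[str, List[str]]] = field(default_factory=dict)

    def __post_init__(self):
        for array in self.arrays:
            self.snapshots.setdefault(array, {})

    def add_volume(self, volume: str, existing_snapshots=()):
        # Snapshots that existed before the volume joined the pod are
        # copied over, so both arrays end up with the same history.
        for array in self.arrays:
            self.snapshots[array][volume] = list(existing_snapshots)

    def snapshot_pod(self, suffix: str):
        # The pod acts as a consistency group: every volume in the pod
        # gets the snapshot, and it is created on both arrays.
        for array in self.arrays:
            for volume, snaps in self.snapshots[array].items():
                snaps.append(f"{volume}.{suffix}")
```

After `add_volume` and `snapshot_pod`, the two arrays in the model hold identical snapshot lists, which is the point of the exercise.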

Pure1 Cloud Mediator

The Pure1 Cloud Mediator is used to arbitrate split-brain scenarios. It sits in the cloud and keeps an eye on stuff. Think of it as the Vanilla Ice of the ActiveCluster (before he went off and did moto-x and renovation shows). For “dark” sites, an on-premises mediator (VM) can also be deployed.

A Few Other Notes

A few other things to note about the behaviour of ActiveCluster:

Data reduction is performed independently between arrays. This is cool because you might have a mix of workloads at each data centre;

If the arrays lose connection to the mediator they will continue to serve data and synchronously replicate as long as array to array communication is active; and

If both arrays lose communication with each other and with the mediator, this is a dual failure and both the mirrored volumes become unavailable until communication with the other array or the mediator can be re-established. Non-mirrored volumes would not be affected in this instance and would still be accessible.
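The failure behaviour in the last two points boils down to a small truth table. Here’s a rough sketch of it in Python. The function and enum names are mine, and I’ve simplified things by treating “mediator reachable” as a single flag rather than tracking each array’s view separately. The middle case (arrays can’t see each other but the mediator is reachable) is modelled only as “the mediator arbitrates”, since that’s all the notes above tell us.

```python
from enum import Enum


class MirroredState(Enum):
    SERVING_BOTH = "both arrays serve data and replicate synchronously"
    ARBITRATED = "mediator arbitrates which array keeps serving"
    UNAVAILABLE = "mirrored volumes unavailable until a link returns"


def mirrored_volume_state(array_link_up: bool, mediator_reachable: bool) -> MirroredState:
    """Hypothetical summary of mirrored-volume availability.

    Non-mirrored volumes are not covered here; they stay accessible
    regardless of the state of these two links.
    """
    if array_link_up:
        # Losing only the mediator does not interrupt service.
        return MirroredState.SERVING_BOTH
    if mediator_reachable:
        # Split-brain risk: the mediator decides the outcome.
        return MirroredState.ARBITRATED
    # Dual failure: no array-to-array link and no mediator.
    return MirroredState.UNAVAILABLE
```

The thing to take away is that the mediator on its own is not a single point of failure; it’s the combination of losing it and the replication link that hurts.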

Disaster Avoidance Or Recovery?

Before deploying ActiveCluster, you should think about what kind of goal you’re trying to achieve. Disaster Avoidance assumes that some element of the primary site (Site A) is unavailable due to a disaster. DA uses synchronous replication only and requires a stretched cluster technology (such as VMware vSphere Metro Cluster) to provide active / active workload availability across both sites. Disaster Recovery, on the other hand, assumes that workloads are deployed in an active / passive configuration across sites. There are advantages to each approach, depending on what your recovery point objective (RPO) is, and what your recovery time objective (RTO) is. If you have a very low RPO and RTO requirement, the added expense of deploying a synchronous replication solution (not the Pure bit, but the supporting infrastructure) is worth it. If you have a greater tolerance for a higher RPO and / or RTO, an asynchronous solution (and the less stringent replication network requirements) may be a better fit for you.

You should also think about whether the topology you’re deploying is Uniform or Non-Uniform. A Uniform configuration provides hosts with access across Sites. This requires a bit more investment in terms of stretched FC fabrics (assuming you’re using FC and not iSCSI). This is generally the topology deployed for metro clusters.

You might decide, however, to deploy a Non-Uniform configuration for simpler disaster recovery. In that case, there’s no requirement to have cross-site FC links in place, but your time to recover will be impacted. You’ll also want to look at something like VMware Site Recovery Manager to orchestrate the recovery of workloads at the secondary site.

Conclusion

Whilst I think ActiveCluster is a very neat piece of technology, you should be doing a whole lot of thinking about other (possibly very boring) stuff before you take the plunge and decide to deploy vMSC sitting on an ActiveCluster environment. Disaster Avoidance (and Recovery) require a lot of planning and understanding of what’s important to your business before you deploy a solution. In the next little while I hope to be able to report back with some results from testing, and talk a bit about other protection scenarios, including metro clusters with asynchronous protection off to the side.

So what exactly is Cloud Unity? If you’ve been keeping an eye on the IT market in the last few years, you’ll notice that everything has cloud of some type in its product name. In this case, Cloud Unity is a mechanism by which you can run Scale Computing’s HC3 hypervisor nested in Google Cloud Platform (GCP). The point of the solution, ostensibly, is to provide a business with disaster recovery capability on a public cloud platform. You’re basically running an HC3 cluster on GCP, with the added benefit that you can create an encrypted VXLAN connection between your on-premises HC3 cluster and the GCP cluster. The neat thing here is that everything runs as a small instance to handle replication from on-premises and only scales up when you’re actually needing to run the VMs in anger. The service is bought through Scale Computing, and starts from as little as $1000US per month (for 5TB). There are other options available as well and the solution is expected to be Generally Available in Q4 this year.

Conclusion and Further Reading

This isn’t the first time nested virtualisation has been released as a product, with AWS, Azure and Ravello all doing similar things. The cool thing here is that it’s aimed at Scale Computing’s traditional customers, namely small to medium businesses. These are the people who’ve bought into the simplicity of the Scale Computing model and don’t necessarily have time to re-write their line of business applications to work as cloud native applications (as nice as it would be if this were the case). Whilst application lift and shift isn’t the ideal outcome, the other benefit of this approach is that companies who may not have previously invested in DR capability can now leverage this product to solve the technical part of the puzzle fairly simply.

DR should be a simple thing to have in place. Everyone has horror stories of data centres going off line because of natural disasters or, more commonly, human error. The price of good DR, however, has traditionally been quite high. And it’s been pretty hard to achieve. The beauty of this solution is that it provides businesses with solid technical capabilities for a moderate price, and allows them to focus on people and processes, which are arguably the key parts of DR that are commonly overlooked. Disasters are bad, which is why they’re called disasters. If you run a small to medium business and want to protect yourself from bad things happening, this is the kind of solution that should be of interest to you.

A few years ago, Scale Computing sent me a refurbished HC1000 cluster to play with, and I’ve had first-hand exposure to the excellent support staff and experience that Scale Computing tell people about. The stories are true – these people are very good at what they do and this goes a long way in providing consumers with confidence in the solution. This confidence is fairly critical to the success of technical DR solutions – you want to leverage something that’s robust in times of duress. You don’t want to be worrying about whether it will work or not when your on-premises DC is slowly becoming submerged in water because building maintenance made a boo boo. You want to be able to focus on your processes to ensure that applications and data are available when and where they’re required to keep doing what you do.

I’m not terribly good at predicting the future, particularly when it comes to technology trends. I generally prefer to leave that kind of punditry to journalists who don’t mind putting it out there and are happy to be proven wrong on the internet time and again. So why do a post referencing a great Hot Water Music album? Well, one of the PR companies I deal with regularly sent me a few quotes through from companies that I’m generally interested in talking about. And let’s face it, I haven’t had a lot to say in the last little while due to day job commitments and the general malaise I seem to suffer from during the onset of summer in Brisbane (no, I really don’t understand the concept of Christmas sweaters in the same way my friends in the Northern Hemisphere do).

Long intro for a short post? Yes. So I’ll get to the point. Here’s one of the quotes I was sent. “As concerns of downtime grow more acute in companies around the globe – and the funds for secondary data centers shrink – companies will be turning to DRaaS. While it’s been readily available for years, the true apex of adoption will hit in 2017-2018, as prices continue to drop and organizations become more risk-averse. There are exceptional technologies out there that can solve the business continuity problem for very little money in a very short time.” This was from Justin Giardina, CTO of iland. I was fortunate enough to meet Justin at the Nimble Storage Predictive Flash launch event in February this year. Justin is a switched on guy and while I don’t want to give his company too much air time (they compete in places with my employer), I think he’s bang on the money with his assessment of the state of play with DR and market appetite for DR as a Service.

I think there are a few things at play here, and it’s not all about technology (because it rarely is). The CxO’s fascination with cloud has been (rightly or wrongly) fiscally focused, with a lot of my customers thinking that public cloud could really help reduce their operating costs. I don’t want to go too much into the accuracy of that idea, but I know that cost has been front and centre for a number of customers for some time now. Five years ago I was working in a conservative environment where we had two production DCs and a third site dedicated to data protection infrastructure. They’ve since reduced that to one production site and are leveraging outsourced providers for both DR and data protection capabilities. The workload hasn’t changed significantly, nor has the requirement to have the data protected and recoverable.

Rightly or wrongly the argument for appropriate disaster recovery infrastructure seems to be a difficult one to make in organisations, even those that have been exposed to disaster and have (through sheer dumb luck) survived the ordeal. I don’t know why it is so difficult for people to understand that good DR and data protection is worth it. I suppose it is the same as me taking a calculated risk on my insurance every year and paying a lower annual rate and gambling on the fact that I won’t have to make a claim and be exposed to higher premiums.

It’s not just about cost though. I’ve spoken to plenty of people who just don’t know what they’re doing when it comes to DR and data protection. And some of these people have been put in the tough position of having lost some data, or had a heck of a time recovering after a significant equipment failure. In the same way that I have someone come and look at my pool pump when water is coming out of the wrong bit, these companies are keen to get people in who know what they’re doing. If you think about it, it’s a smart move. While it can be hard to admit, sometimes knowing your limitations is actually a good thing.

It’s not that we don’t have the technology, or the facilities (even in BrisVegas) to do DR and data protection pretty well nowadays. In most cases it’s easier and more reliable than it ever was. But, like on-premises email services, it seems to be a service that people are happy to make someone else’s problem. I don’t have an issue with that as a concept, as long as you understand that you’re only outsourcing some technology and processes; you’re not magically doing away with the risk and result when something goes pear-shaped. If you’re a small business without a dedicated team of people to look after your stuff, it makes a lot of sense. Even the bigger players can benefit from making it someone else’s thing to worry about. Just make sure you know what you’re getting into.

Getting back to the original premise of this post, I agree with Justin that we’re at a tipping point regarding DRaaS adoption, and I think 2017 is going to be really interesting in terms of how companies make use of this technology to protect their assets and keep costs under control.
