Service Resilience and why the private cloud can't be beaten

Most public cloud services are not highly available by nature. They are designed with “cattle” in mind - that is to say, disposable units that are quickly and easily replaceable. Public cloud providers are clear on this: if you need high availability, design it into your applications.

The problem with Drupal websites, especially in the context of (seemingly) cheaper cloud-based PaaS Drupal hosting products, is that the containers running Drupal are not “cattle” - they are “pets”. That is to say, they are all unique. Perhaps not in server configuration (many of our VMs are the same in that regard), but they are absolutely unique when it comes to Drupal.

Long explanation follows, but for the pressed-for-time, the TL;DR here is this: our cheap “Medium” VM is highly available at the hardware level, as standard, and no PaaS provider can say the same.

Each Drupal website has a requirement for persistent data (files, databases, configuration files) that cannot be lost, not even for a moment, or the site will not function. To understand the implications of this, you need to understand what happens when a host fails in the public cloud:

A host is a big server with lots of VMs (and in turn, with some providers, lots of containers) residing on it. If that host breaks, let’s say it has a power supply failure, the VMs need to be migrated. The problem is that all of these VMs have many gigabytes of data associated with them, and that needs migrating too. Also, there are no set rules when it comes to prioritising migrations; it’s a simple case of “join the line”.

Whether migrations occur by hostname, some arbitrary randomness, or some other system doesn’t really matter. As the owner of the VM, all you can do is sit and wait until it gets migrated to another host, which can take up to an hour, depending on how unlucky you are (i.e. how far down the migration list your VM is). We used to be platformed on the public cloud, so we’ve been on the sharp end of this!
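To put rough numbers on “join the line”, here is a minimal sketch of a first-in, first-out migration queue. Everything in it is an assumption for illustration: the VM names, the 50 GB of data per VM, the 10 GB per minute transfer rate and the idea that migrations run strictly one at a time. It is not how any particular provider actually schedules migrations.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    data_gb: float  # persistent data that has to move with the VM


def migration_finish_times(queue, throughput_gb_per_min):
    """Return {VM name: minutes until that VM is back up}, migrating one VM at a time."""
    elapsed = 0.0
    finish = {}
    for vm in queue:  # strict first-in, first-out: no prioritisation at all
        elapsed += vm.data_gb / throughput_gb_per_min
        finish[vm.name] = elapsed
    return finish


# Hypothetical failed host: 12 VMs with 50 GB each, moved at 10 GB per minute.
queue = [VM(f"vm-{i:02d}", 50.0) for i in range(1, 13)]
times = migration_finish_times(queue, throughput_gb_per_min=10.0)
print(times["vm-01"], times["vm-12"])  # 5.0 60.0
```

With those made-up figures, the first VM in the line is back in five minutes and the last one waits the full hour.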

Now, most PaaS providers offer a highly available option, but it isn’t on the cheaper plans. You need to be talking to them about their “Enterprise tier” before you even get a sniff of that goodness in any real sense.

VMware have a thing called Distributed Resource Scheduler - DRS for short - which is basically magic. DRS continuously monitors your hypervisors and their load, and constantly moves your guests around between hypervisors. That's right, unless you turn DRS off (and why would you, because it works beautifully!) you never know where your guests are going to be from one moment to the next. What you do know is which guests they will not be sharing a host with, because happily DRS lets us set anti-affinity rules to ensure, for example, a customer's two load balancers never end up on the same hypervisor, no matter what the load patterns are doing. But for everything else, DRS is in charge.

If three or four guest VMs on the same hypervisor start getting really busy, while another hypervisor has guests that are all snoozing, DRS will push a couple of guests from the busy host to the quiet host. Seamlessly. No one will even know it happened. And if a host fails entirely, vSphere HA restarts its guests on the hypervisors in your cluster that are still up. You can literally shut down a host machine in a VMware cluster, with no warning, and all its guests will come back up on another one. And DRS will choreograph the placement, so no single hypervisor takes a hammering. Like I said, magic.
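For the curious, the idea of load balancing with anti-affinity rules can be sketched in a few lines. This is a toy model only, not VMware's actual algorithm: the function name, the load units, the threshold and the host and guest names are all hypothetical. It simply moves guests from the busiest host to the quietest one until the loads are close enough, and refuses any move that would put two guests from the same anti-affinity group on one host.

```python
def rebalance(hosts, load, anti_affinity, threshold=20):
    """Toy rebalancer: returns a list of (guest, from_host, to_host) moves.

    hosts: dict of host name -> set of guest names (modified in place)
    load: dict of guest name -> positive load units
    anti_affinity: list of sets of guests that must never share a host
    """
    def host_load(host):
        return sum(load[g] for g in hosts[host])

    def allowed(guest, target):
        # A move is blocked if the target already holds a guest from the same group.
        return not any(
            guest in group and (group - {guest}) & hosts[target]
            for group in anti_affinity
        )

    moves = []
    while True:
        busiest = max(hosts, key=host_load)
        quietest = min(hosts, key=host_load)
        gap = host_load(busiest) - host_load(quietest)
        if gap <= threshold:
            break
        # Only consider moves that narrow the gap and respect the rules.
        candidates = [g for g in sorted(hosts[busiest])
                      if load[g] < gap and allowed(g, quietest)]
        if not candidates:
            break
        # Pick the guest that leaves the two hosts most evenly loaded.
        guest = min(candidates, key=lambda g: abs(gap - 2 * load[g]))
        hosts[busiest].remove(guest)
        hosts[quietest].add(guest)
        moves.append((guest, busiest, quietest))
    return moves


# Hypothetical cluster: one busy hypervisor, two quiet ones.
hosts = {"hv1": {"lb1", "web1", "web2", "db1"}, "hv2": {"lb2"}, "hv3": {"cache1"}}
load = {"lb1": 10, "lb2": 15, "web1": 40, "web2": 20, "db1": 30, "cache1": 5}
moves = rebalance(hosts, load, anti_affinity=[{"lb1", "lb2"}])
print(moves)  # [('web1', 'hv1', 'hv3'), ('web2', 'hv1', 'hv2')]
```

In this made-up cluster the two web servers get moved off the busy hypervisor, while lb1 stays put because the only quiet host it could go to at that point already runs lb2.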

In case you got lost there a moment, “guests” are our VMs. As you can now see, this is also how we mitigate the risk of vCPU being unavailable when customers need it, as mentioned above. DRS makes sure VMs are moved to where resource is available, something public cloud platforms do not do.

Is this true high availability? No, because software failure remains an issue (you only have one web server, for example, so if it crashes you need to intervene). However, it’s a lot better than other platforms, because we really do have highly available hardware, which - in our experience - is more than half the battle when it comes to good resilience of service. We still sell 24/7/365 monitoring services, where we fix problems proactively and reactively, every hour of every day of the year, and we also sell highly available server layouts to ensure there are no single points of software failure - our own “Enterprise tier”, if you will. But the bottom line is our hardware is highly available and extremely resilient - more so than any PaaS provider can offer.

Besides the resilience of our servers, an equally important point is that our backup procedures are transparent, clear and tested. We can evidence the backup of every server under our control, we store the backups (encrypted) in AWS S3 buckets “off site” so we can restore servers on another site should anything go badly wrong with our principal data centre, and we fully test all backups quarterly. This is all documented, and customers can inspect our records for their servers if they so wish.

With the exception of Acquia, I am yet to find any PaaS provider that actually documents how they carry out and test their backups, where they store them, what their contingency plan is if their principal provider fails, etc. They just say they do.

As for Acquia, they only state that they back up databases to S3 hourly. In fact, they make a point of saying data never leaves AWS, so all eggs are in one basket. (And we found out the hard way with #AberdeenCloud what happens when a company like that stops paying their AWS bills!)

What about scaling?

A quick word on scaling is probably appropriate here, because one of the points a PaaS provider would doubtless make is that their Enterprise products allow rapid scaling of the application, which is only made possible by the public cloud. And this is true.

But what problem are you solving here? One that most organisations never have, frankly. People like to talk about rapidly scaling, and they like to imagine a scenario where they will suddenly have a website as popular as the BBC, but in all my time as a web developer and, subsequently, as a hosting manager, I have never seen a website that needs to rapidly scale the application layer in an unplanned manner. I have seen planned scaling (to match TV show schedules or specific events) but never the so-called Slashdot effect.

Yes, sometimes customers will get a huge spike in traffic because of an event, or because they were featured on another popular website, but this traffic is anonymous. With good caching policies in place and with a low-cost (or even free) cloud-based CDN service like Sucuri’s WAF or CloudFlare’s (free!) plan (blog about these options here) you can withstand a huge anonymous traffic spike without ever needing to worry about your application. It need not cost you a penny.
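As a rough illustration of what a “good caching policy” means for anonymous traffic, here is a minimal sketch. It assumes the visitor's session can be detected from cookie names starting with SESS or SSESS (the prefixes Drupal uses for its session cookies); the function name and the five-minute lifetime are made up for the example, and a real site would set these headers through Drupal and its CDN configuration rather than hand-rolled code.

```python
def cache_headers(cookie_names, max_age=300):
    """Choose caching headers based on whether the visitor has a session cookie."""
    # Assumption: a cookie named SESS... or SSESS... means a logged-in session.
    has_session = any(name.startswith(("SESS", "SSESS")) for name in cookie_names)
    if has_session:
        # Logged-in users see personalised pages: keep them out of shared caches.
        return {"Cache-Control": "private, no-cache"}
    # Anonymous visitors all see the same page, so a CDN can serve it from cache.
    return {"Cache-Control": f"public, max-age={max_age}"}


print(cache_headers(["has_js"]))       # {'Cache-Control': 'public, max-age=300'}
print(cache_headers(["SSESSabc123"]))  # {'Cache-Control': 'private, no-cache'}
```

The point is that every anonymous request can be answered by the CDN for the lifetime of the cached page, so a spike of anonymous visitors barely touches the application.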

And yes, sometimes you really need to scale the application - a great example of this is e-commerce, the Christmas rush, so-called Black Friday sales, etc. - but this is almost never unplanned. Online shops know when their key busy periods are and planning to scale is easy for us. We might not be able to flex your application in minutes, but give us a bit of notice or a sales schedule and we certainly can.

So it is our firm opinion that the “but scaling!” argument is a straw man, highlighting a "problem" that doesn’t really exist.

The next blog in the series is about security of service, why it matters, and how we make sure you have a strong security posture with Code Enigma.

We’re Code Enigma

We’re one of the most experienced Drupal teams in Europe, best known for our work on large, technically challenging projects for all kinds of clients.

Our team is passionate about Drupal and open source software. Our whole company spends at least four weeks per year working on Drupal modules or other open source projects. We’re also strongly committed to putting design first, taking a mobile-first, content-out approach to creating websites. This ensures that the sites we build combine the power of Drupal with best practice design and development.