Availability is calculated as the percentage of time that an application and its services are available within a given time interval. A service is usually considered to offer High Availability (HA) when its downtime is no more than about 5.26 minutes per year, that is, an availability of at least 99.999%, also known as “five nines”. The cloud with the best uptime in 2015 was Amazon Web Services, with a downtime of 150 minutes, which is far from five nines. To reach five nines or better, it may be necessary to deploy HA services across two availability zones (multi-zone HA), which are isolated locations within a cloud infrastructure, or even across two different cloud providers (multi-cloud HA).
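The arithmetic behind these figures can be sketched in a few lines of Python. This is a back-of-the-envelope illustration, not a measurement: the function names are our own, a year is taken as 365 days, and the combined figure assumes that the two sites fail independently and that failover is instantaneous, which real deployments only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_minutes(availability: float) -> float:
    """Yearly downtime budget, in minutes, for an availability in [0, 1]."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_from_downtime(minutes: float) -> float:
    """Availability in [0, 1] implied by a yearly downtime in minutes."""
    return 1.0 - minutes / MINUTES_PER_YEAR


def combined_availability(*availabilities: float) -> float:
    """Availability of redundant sites, ASSUMING independent failures.

    The service is down only when every site is down at the same time.
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


five_nines_budget = downtime_minutes(0.99999)    # roughly 5.26 minutes
single_site = availability_from_downtime(150)    # roughly 99.97%
two_sites = combined_availability(single_site, single_site)

print(f"five nines budget: {five_nines_budget:.2f} min/year")
print(f"150 min/year downtime: {single_site:.5%}")
print(f"two independent sites: {two_sites:.7%}")
```

Under these idealized assumptions, two sites with 150 minutes of yearly downtime each would exceed five nines when combined. Correlated failures and failover delays reduce this figure in practice.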

HA of a system is achieved by incorporating specific features to reduce service downtime, typically redundancy for failover and replication for load balancing. These techniques can be incorporated into services, such as clusters or multi-tier applications. In this post, we explore and compare different deployments of HA clusters using single- and multi-cloud setups. First we discuss a sample multi-tier service and give some details about the deployment configuration. Then, we analyze the advantages of the multi-cloud approach regarding the availability of the cluster.

Single-cloud HA

To illustrate these deployments, we have chosen a representative multi-tier application that showcases the main characteristics and requirements of an HA deployment. In our case, we use a classical web application consisting of the following components:

The load balancer tier, which distributes the traffic over the different application servers. The load balancers must be deployed in an HA configuration, which requires a floating IP that is associated with the fully qualified domain name of the application and shared across the load balancer cluster. Usually, this layer is implemented with a combination of TCP/HTTP load balancing (e.g. with HAProxy) and VRRP failover (e.g. with Keepalived).

The web or application tier, consisting of several web servers that expose the application's HTTP interface. Each web server spawns one or more worker processes to handle the actual requests from the clients.

The cache tier, consisting of in-memory cache nodes that serve read-only data to speed up database access. This tier is usually included to scale out the application and is sometimes co-located with the application servers. A common setup is a distributed hash table (e.g. memcached) that requires the clients (the workers in the web tier) to implement a consistent hashing algorithm, so that cache nodes can be added or removed.

The data tier, consisting of one or more database servers that provide data access and persistence mechanisms. To provide HA in this tier, the database is replicated to one or more database servers. The database servers adopt a master-master replication mode, so write updates can be directed to the backup database servers in case of failure.
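The consistent hashing required from the cache clients can be sketched as follows. This is a minimal illustration rather than the algorithm of any particular memcached client: the `HashRing` class, the use of SHA-256, the number of virtual nodes, and the node names are all assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (point_on_ring, node_name)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def add(self, node: str) -> None:
        # Each node owns several points on the ring to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        """Return the node owning the first ring point at or after the key."""
        if not self._ring:
            raise LookupError("no cache nodes available")
        idx = bisect.bisect_left(self._ring, (self._hash(key), ""))
        if idx == len(self._ring):  # wrap around the ring
            idx = 0
        return self._ring[idx][1]


# Hypothetical cache nodes and keys.
ring = HashRing(["cache-a", "cache-b", "cache-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}

ring.remove("cache-b")  # simulate the loss of one cache node
after = {k: ring.lookup(k) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys changed node")
```

Only the keys that were stored on the removed node are remapped; all other keys keep their node, which is the property that makes adding or removing cache nodes cheap.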

The following figure shows a single-cloud deployment of this web application. To improve availability, the application components are replicated in two different zones of the same cloud. These zones can be seen as two separate physical clusters within the same cloud infrastructure, so if one cluster fails, the service is not interrupted. Note that component replication provides two benefits, namely:

Scale out: The application and cache nodes can scale horizontally to increase the overall capacity of the web service. These nodes also provide those layers with the required HA functionality.

HA: The load balancer and database tiers include active-passive nodes to provide pure HA functionality. Note that the main workload is processed in the other two tiers.

This kind of HA service should be deployed taking into account that failures occur at different levels: VM instance, physical server, and availability zone. Therefore, the different service components must be deployed with specific placement constraints. Affinity rules are traditionally considered an effective mechanism to implement HA strategies. However, traditional affinity rules cannot deal with complex multi-tier services and different availability zones. So, one of the main challenges in the deployment of multi-zone HA services is the orchestration of the service considering both zone-based and role-based placement constraints (i.e. for groups of related VMs or roles). To address this challenge, we are adding new affinity mechanisms and placement heuristics in OpenNebula for multi-zone scenarios.
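To make the idea of zone-based and role-based constraints more concrete, here is a toy placement heuristic in Python. It is not OpenNebula's scheduler or API: the `place` and `zones_used` functions, the role names, and the host names are hypothetical, host names are assumed to be unique across zones, and host capacity is ignored.

```python
from collections import defaultdict
from itertools import cycle


def place(roles, zones):
    """Greedy anti-affinity placement (illustrative only).

    roles: {role_name: replica_count}
    zones: {zone_name: [host_name, ...]}
    Returns {(role_name, replica_index): (zone_name, host_name)}.
    """
    placement = {}
    for role, replicas in roles.items():
        per_host = defaultdict(int)      # VMs of this role on each host
        zone_cycle = cycle(sorted(zones))
        for index in range(replicas):
            # Zone-based constraint: alternate replicas between zones.
            zone = next(zone_cycle)
            # Role-based constraint: prefer the host in that zone that
            # runs the fewest VMs of the same role.
            host = min(zones[zone], key=lambda h: per_host[h])
            per_host[host] += 1
            placement[(role, index)] = (zone, host)
    return placement


def zones_used(placement, role):
    """Set of zones that hold at least one replica of the given role."""
    return {zone for (r, _), (zone, _) in placement.items() if r == role}


# Hypothetical service description and infrastructure.
roles = {"lb": 2, "web": 4, "cache": 2, "db": 2}
zones = {
    "zone-1": ["host-1a", "host-1b"],
    "zone-2": ["host-2a", "host-2b"],
}
placement = place(roles, zones)
for (role, index), (zone, host) in sorted(placement.items()):
    print(f"{role}[{index}] -> {zone}/{host}")
```

With this layout, every role keeps one replica running if a VM, a host, or a whole zone fails. A real scheduler would also have to weigh capacity, cost, and existing load.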

Multi-cloud HA

The following figure shows a multi-cloud deployment of the same web application. In this case, the application components are replicated in two different clouds, so that service continuity is guaranteed in case of a cloud outage. As in the single-cloud deployment, we deploy the application in two different availability zones to increase the fault tolerance of the service.

The different service components are distributed or replicated across both clouds. Each cloud scheduler receives the description of the service components that must be deployed locally, along with their location constraints (affinity rules), and makes its own independent placement decisions according to these constraints. Therefore, as far as orchestration is concerned, the multi-cloud scenario poses no new challenges beyond those of the multi-zone case.

However, the multi-cloud scenario raises additional challenges. The first is access to the service, which takes place over the Internet and requires a global load balancing and failover mechanism. For example, the Domain Name System (DNS) can include multiple address records for the service to distribute the client calls across load balancers. When a load balancer fails, web clients will retry using the next address returned by the DNS servers. Since clients usually pick the first address provided, the order of the addresses is permuted in each response to provide round robin, or the addresses can be sorted following some distance metric. Health checks can also be used to remove failing services. This is the simplest and probably most effective solution. More advanced techniques rely on anycast networks or global networks of reverse proxies.
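The DNS-based mechanism described above can be simulated with a short Python sketch. It does not query a real DNS server: the `RoundRobinRecords` class and the `connect_with_failover` function are hypothetical names, and the addresses come from the ranges reserved for documentation.

```python
from collections import deque


class RoundRobinRecords:
    """Toy model of a DNS name with several address records."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)
        self._healthy = set(addresses)

    def mark(self, address, healthy):
        """Record the result of a health check for one address."""
        if healthy:
            self._healthy.add(address)
        else:
            self._healthy.discard(address)

    def resolve(self):
        """Return the healthy addresses, then rotate for the next query."""
        answer = [a for a in self._addresses if a in self._healthy]
        self._addresses.rotate(-1)
        return answer


def connect_with_failover(addresses, try_connect):
    """Client side: try each address in order, return the first that works."""
    for address in addresses:
        if try_connect(address):
            return address
    return None


# Hypothetical load balancer addresses, one per cloud.
dns = RoundRobinRecords(["192.0.2.10", "198.51.100.10"])
first_answer = dns.resolve()
second_answer = dns.resolve()   # same records, permuted order

# The first cloud goes down before the health check notices it:
down = {"192.0.2.10"}
reached = connect_with_failover(first_answer, lambda a: a not in down)

# Once the health check fails, the record is no longer served.
dns.mark("192.0.2.10", healthy=False)
third_answer = dns.resolve()
print(first_answer, second_answer, reached, third_answer)
```

The client-side retry covers the window before the failing record is withdrawn. In a real deployment, that window also depends on the TTL of the cached DNS responses.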

Second, to interconnect the different elements within a tier, it is necessary to configure various private networks. In the single-cloud deployment, all the networks are internal, so they can be configured as private VLANs within the cloud. However, in the multi-cloud deployment, the data tier would require the configuration of a cross-site private network for multi-master database replication. This may require multicast UDP monitoring traffic to promote a slave in case of master failure, so it could be necessary to provide L2 connectivity at the virtual network level. For this, the BEACON framework for federated cloud networking is used.

Summing up, there are three main challenges for multi-cloud HA services, namely: multi-zone service orchestration with placement constraints, global load balancing and failover, and cross-cloud private networking.