Disaster Recovery

May 19, 2014

One thing that I like about working with an open source community like OpenStack is that you get direct access not just to the source code itself, but to its users and ecosystem as well. This allows you to be less exposed to marketing spins, and actually analyze our actions based on real numbers.

The OpenStack Foundation conducts a very insightful survey twice a year which helps to measure how users are using the technology. This is an important and useful feedback loop that goes to the PTLs of the various projects, and it can help product companies analyze their go-to-market plans. You can find the source data in the full survey results.

I found the original version a bit too verbose, so I took the liberty of putting together an executive summary of the survey's main points, targeted mostly at decision makers. I hope that you'll find it useful; any feedback would be highly appreciated.

Main Reasons for Choosing OpenStack

The main reason cited for choosing OpenStack is Open Technology and Avoiding Lock-in.

Interestingly enough, Cost Saving comes in only 4th!

Size of Deployments

The large majority of the current deployments are under 100 nodes.

Around 30% of the deployments are relatively large scale, at between 500 and 1,000 nodes.

QA/DEV

Production

Production seems to follow a similar trend to QA, but with lower numbers, which is reasonable: I believe that some of the QA deployments represent a transition stage to production.

User Type

The majority of the users seem to be service providers and cloud operators.

There is also a good number of ISVs and ecosystem partners, which is a strong indication that OpenStack is indeed an ecosystem.

There's a very strong presence of OpenStack in the Telco industry. Finance seems to be lagging behind the adoption curve, with only three financial organizations declaring their use of OpenStack.

The majority of the deployments run on private or hosted private clouds. About 20% run on public or hybrid clouds. OpenStack also seems to attract mostly small to mid-size companies.

Private/Public Clouds

Size of Organizations

Deployment Tools

The survey includes two categories of deployment tools: one for tools used to deploy the OpenStack infrastructure itself, and one for tools used for application configuration (covered in the application survey).

According to the two surveys, users tend to use different sets of tools for infrastructure deployment and for application configuration management. Puppet seems to be the most popular choice for OpenStack users in both categories.

One of the interesting results is that Bare Metal and Docker are starting to appear in more deployments. This is fairly impressive given that neither is yet officially supported by OpenStack. On the virtualization side, VMware sits at around 10%-13%, similar to Xen, with KVM remaining the dominant choice.

OpenStack Deployment Tools

Applications on top of OpenStack

As noted above, Puppet is popular for application configuration as well as infrastructure deployment. Heat and Docker take 2nd and 3rd place respectively, putting Chef behind in 4th place.

Given the popularity of Heat during the last summit, I wouldn't be surprised if Heat and Docker surpass Puppet in this category.

Popular Frameworks

Interestingly enough, JClouds seems to be gaining popularity amongst OpenStack users. This may indicate that roughly a quarter of users are interested in frameworks that are not tied only to OpenStack. Not surprisingly, Python is far more popular than any other language; however, Java also has a strong presence relative to the other languages. One possible explanation is that the users attracted to OpenStack are more the traditional enterprise organizations, and less the new web-facing companies.

Operating Systems

Ubuntu and CentOS are the most popular choices. Windows adoption amongst OpenStack users is between 10% and 20%.


Application Survey

Production Survey

Compatibility / Portability

EC2 compatibility is a relatively popular choice, which indicates an increasing desire among the OpenStack community to maintain portability with EC2. This also correlates with the popularity of portability frameworks such as JClouds.

February 19, 2014

I first heard the term NFV a year ago as part of the writeup that I was doing with Alcatel-Lucent on CloudBand, titled Carrier Grade PaaS, covering our collaboration. Since then, the term has risen in popularity, with every network or infrastructure provider now launching a new NFV initiative.

As with any new hype, the term NFV quickly became overloaded and confusing, so I thought it would be beneficial to try and clarify it a bit.

What is NFV (technical definition)?

NFV stands for Network Function Virtualization. It basically means that the routers, load balancers and firewalls that are currently shipped in boxes become virtual entities rather than physical boxes. As with compute or storage virtualization, the process of virtualizing network functions is much the same: we take each function (box) and turn it into a software function that can run on any given box. The decision of which box serves each function can then change at runtime based on the desired SLA, without any change to the software itself.
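To make the runtime-placement idea concrete, here is a toy sketch of how an NFV scheduler might pick a box for a function based on an SLA. All names and numbers are invented for illustration; real NFV platforms expose far richer placement policies.

```python
# Toy model of NFV placement: a network function is just software that can be
# scheduled onto any host, and rescheduled when SLA requirements change.
# Host names, SLA fields, and numbers are illustrative only.

def place_function(function, hosts, sla):
    """Pick a host that satisfies the function's SLA."""
    candidates = [
        h for h in hosts
        if h["free_cpus"] >= sla["min_cpus"] and h["latency_ms"] <= sla["max_latency_ms"]
    ]
    if not candidates:
        raise RuntimeError(f"no host can satisfy the SLA for {function}")
    # Prefer the lowest-latency host among those that qualify.
    return min(candidates, key=lambda h: h["latency_ms"])

hosts = [
    {"name": "edge-1", "free_cpus": 4, "latency_ms": 2},
    {"name": "core-1", "free_cpus": 16, "latency_ms": 10},
]

# A virtual firewall that needs 8 CPUs lands on core-1 even though edge-1
# has lower latency, because edge-1 lacks the capacity.
host = place_function("vFirewall", hosts, {"min_cpus": 8, "max_latency_ms": 20})
print(host["name"])  # core-1
```

The same call made later, with different SLA numbers or host capacities, can return a different box; that is the "decision can change at runtime" part in miniature.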

What does that really mean?

For most people, network functions do not mean a lot. In addition, the definitions of NFV that I found on the internet mostly come from people who tend to speak in a language that only a small community can really understand.

To understand what NFV really means we need to take a step back and look at the overall change that the IT industry as a whole is undergoing.

The IT industry is going through a big industrialization change, very similar to the one the car industry underwent when Ford came out with the Model-T. The change with the Model-T wasn't in making a faster engine; quite the contrary, the Model-T was inferior to other cars produced in those days. The revolution was the ability to produce cars through mass production. It changed the entire automotive industry and propagated to other industries as well, from consumer electronics to the food industry, which adopted the same mass-production manufacturing principles.

The IT industry is going through a fairly similar change: until now we built our IT manually, and we're now shifting into mass production of IT as well. In this analogy, cloud is the manufacturing floor, and DevOps is the process for optimizing and automating the production pipeline.

How is IT industrialization related to NFV?

The carrier industry today relies on highly customized infrastructure; no carrier's infrastructure runs the same way as another's. Once upon a time carriers saw this as an advantage, and they ended up with layers upon layers of fairly proprietary and costly setups for running their business.

As competition in this industry tightened, this operational model became not just less economical but a huge threat, as it limits the speed at which carriers can adapt to economic and market changes. It has already resulted in carriers rapidly losing ground to internet shops such as Google, Amazon, Microsoft, and others.

Carriers realized that in order to survive in this new world they needed to reduce their operational costs and increase the speed in which they can introduce new services, as well as scale their business.

This is where NFV comes into play. NFV is basically a better operational model for running the carrier backbone: instead of a highly customized and costly backbone, we use commodity-based infrastructure and open source frameworks, which lets us innovate faster and scale, all at a much lower cost.

What is NFV comprised of?

NFV isn’t a standard (yet) or a product. Even though there are various bodies such as ETSI that try to define standards for NFV, it is very unlikely that we’ll see any real standard emerging from this sort of initiative. This is simply due to the fact that standards often emerge when an industry reaches a certain level of maturity and we’re just not there yet.

At this point NFV is a set of "de-facto" standards bred from best practices, mostly from the cloud providers that have proven able to deliver a more efficient and agile operational model for running their infrastructure.

Quite often that includes the following core components:

Cloud-based infrastructure - OpenStack is currently the most popular choice for this purpose.

Software-defined network functions - This is a mix of the existing network functions provided as software packages, as well as new purely open source players that are making a new entrance into this world.

Orchestration engine - Responsible for provisioning network functions on a cloud-based infrastructure. TOSCA, defined by the OASIS organization, provides a standard orchestration modeling language for this. Projects such as OpenStack Heat for infrastructure orchestration, in combination with Cloudify as an application orchestration tool, are a good reference.

Analytics engine - The analytics engine is basically the feedback loop. It is an essential piece for measuring whether our services meet their desired SLAs, and also a means to analyze and optimize workloads. In the context of NFV, where many of the insights and decisions need to happen in real time, the analytics engine will be heavily real-time based. A real-time analytics engine specific to the operational monitoring domain is Riemann.
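The real-time flavor of such an analytics engine can be illustrated with a minimal sliding-window SLA check. The class and threshold here are invented; tools like Riemann operate on event streams in a similar spirit but with far more capability.

```python
from collections import deque

# Minimal sketch of a real-time SLA check: keep a sliding window of latency
# samples and flag a violation when the windowed average exceeds the agreed
# threshold. Purely illustrative; not any real monitoring API.

class SlaMonitor:
    def __init__(self, threshold_ms, window=5):
        self.threshold_ms = threshold_ms
        self.samples = deque(maxlen=window)  # old samples fall off automatically

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def violated(self):
        if not self.samples:
            return False
        return sum(self.samples) / len(self.samples) > self.threshold_ms

monitor = SlaMonitor(threshold_ms=50)
for sample in (20, 30, 40):
    monitor.record(sample)
print(monitor.violated())  # False: the windowed average is 30 ms

monitor.record(200)  # a spike pushes the windowed average over 50 ms
print(monitor.violated())  # True
```

In a real NFV setting the `violated()` signal would feed straight back into the orchestrator, closing the loop described above.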

What makes NFV different than any other cloud-based infrastructure?

For the most part, network functions are not different from a database or any other piece of software that isn't defined as a network function. Having said that, there are a number of characteristics more specific to network functions, such as:

Deterministic latency and performance - Network functions are more sensitive to latency and non-deterministic behavior that can be quite common in virtual data centers.

Support for third-party virtual appliances - Many network functions are packaged as VMs. These third-party VMs are mostly treated as black boxes and can be accessed only through custom interfaces specific to each network function. As a result, managing them is fairly different from managing other software services. The main difference is that most management systems install an agent to control each managed service; with network functions that assumption no longer holds, so virtual appliances can be managed only remotely, not through a local agent.

Support for legacy network services - Many of the existing network functions were written in a pre-cloud world, and as such were built on assumptions about high availability, scalability, and configuration models that are very much human-driven. Changing all that in one day isn't realistic, so there needs to be a more gradual path to transition those services into the new world, or even to replace them with more modern infrastructure.

High degree of security - Carrier networks assume a high degree of isolation at the network level. This often maps to a fairly sophisticated network setup of VLANs and a separate network hierarchy that is not yet supported by most of the existing cloud infrastructures.
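The agentless constraint from the third-party appliance point above can be sketched as a purely remote health probe. The status endpoint and JSON fields here are hypothetical; an actual appliance would expose its own custom interface.

```python
import json
import urllib.request

# Since a third-party virtual appliance is a black box that accepts no local
# agent, management has to go through its remote interface. This sketch polls
# a hypothetical HTTP "/status" endpoint; URL and field names are invented.

def appliance_healthy(base_url, timeout=5):
    """Check a virtual appliance remotely, without an installed agent."""
    try:
        with urllib.request.urlopen(f"{base_url}/status", timeout=timeout) as resp:
            body = json.load(resp)
        return body.get("state") == "running"
    except OSError:
        return False  # unreachable counts as unhealthy
```

Because the appliance is a black box, everything the management layer learns about it has to come through calls like this one rather than through in-guest instrumentation.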

The Role of Orchestration in NFV

In a pre-NFV world, provisioning the carrier infrastructure was fairly human-driven. In an NFV world those human operations need to be automated through software. In that context the orchestrator is a software-defined operator: it is the piece that deploys each network function on the right location and hardware, and it is also responsible for orchestrating the network firewalls, routers, and so on to fit a specific service's deployment and security constraints. Like the human operator, it continuously monitors the deployed services and ensures that they meet their desired SLA.
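A software-defined operator of this kind is, at its core, a reconciliation loop: compare the desired deployment with what is actually running, and repair the difference. A minimal sketch, with hypothetical function names:

```python
# Sketch of an orchestrator's core loop: bring observed state in line with
# the desired deployment. The deploy/destroy callables stand in for real
# provisioning actions; nothing here is a real product API.

def reconcile(desired, observed, deploy, destroy):
    """Compute and apply the actions that repair the state difference."""
    for service in desired - observed:
        deploy(service)       # e.g. boot a vRouter that should exist but doesn't
    for service in observed - desired:
        destroy(service)      # e.g. tear down an instance that is no longer wanted

actions = []
reconcile(
    desired={"vRouter", "vFirewall"},
    observed={"vFirewall", "vLoadBalancer"},
    deploy=lambda s: actions.append(("deploy", s)),
    destroy=lambda s: actions.append(("destroy", s)),
)
print(sorted(actions))  # [('deploy', 'vRouter'), ('destroy', 'vLoadBalancer')]
```

Run continuously against live monitoring data, this loop is what replaces the human operator's "notice and fix" workflow.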

Final words

NFV is part of a broader shift toward industrialization of IT. In this particular case NFV is simply a way to make the operational model of carriers more efficient by adopting similar best practices and tools that have already proven themselves at scale by the major cloud providers such as Amazon, Google, Microsoft and others.

On the business side it has disrupted the entire networking industry that used to sell proprietary boxes. In the new virtualized world, boxes have become commoditized and the software on top becomes the main play. This opens up the door to new players that will have a pure software solution that will be specifically designed for a virtualized setup.

As with any disruptive force, we will see big players that find it difficult to adapt to this change phased out to make room for new players who will take over, in a similar fashion to the way the internet wave brought Google, Amazon, PayPal and other players who were complete newcomers at the time and became the new market leaders.

On the technical side, virtualizing the network function is only the first step. To realize the full potential of a virtualized carrier, the entire deployment and management must be completely automated. What is missing in this picture is the software equivalent of the human operator, also known as the orchestrator. The orchestrator will become the major piece in putting all of this together.

In the next post I’ll describe in more depth the role of an orchestrator in the context of NFV.

March 07, 2013

As the technology for building and running cloud infrastructure matures, it is starting to spread out into more industries and revolutionizing how even the most conservative organizations are running their entire operations.

One of the areas undergoing a transformation is the carrier backbone services. For those who are not familiar, carrier backbone services include services like cell and network services (DHCP, DNS, ...), content serving (SMS, MMS, ...), activation services, CRM, call centers, etc. In their existing environment, these critical carrier services tend to be labor intensive and proprietary. Choosing to move them into a more open and virtualized environment such as the cloud could yield significant cost savings. An open cloud environment also enables carrier organizations to reduce their time to market for delivering new services.

At a time when the Telco/Carrier market is increasingly competitive, moving to a cloud-based carrier backbone can be more than a cost-saving initiative: it can be a differentiator from the competition, and it is critical for the survivability and success of the business.

That said, running carrier grade services requires special care to meet the required SLAs in terms of latency, deterministic behavior, performance, location awareness, etc. These challenges are unique enough not to fit into your mom-and-pop cloud.

The purpose of the Carrier Grade Cloud and Carrier Grade PaaS is to address these gaps and challenges.

In this post I'll try to provide a more detailed overview of what Carrier Grade Cloud and PaaS actually mean. I will use examples based on GigaSpaces' joint work with Alcatel-Lucent, which recently launched a new product in this space named CloudBand that uses Cloudify for its Carrier PaaS layer.

What does Carrier Grade mean? Learning from the Weather Channel experience during the Sandy super-storm

The Weather Channel’s experience during Sandy is an excellent example of the need for carrier grade services. Below is a list of some of the key statistics during Hurricane Sandy:

1000% - The Weather Company’s traffic increase during Hurricane Sandy

110 GB - The amount of data served every second during Sandy

170,000 - Peak number of simultaneous streams of video served during Sandy

1 - The number of data centers that went down during the storm

To address this demand during the storm, the Weather Channel ran from 13 data centers managed by Verizon across North America, with load balancing between them. During the storm Verizon increased their bandwidth capacity to meet the peak demand.

This sort of increased traffic behavior wasn't unique to the Weather Channel.

So what can be learned from this? What makes a service Carrier Grade?

Learning from the Weather Channel experience we can define a Carrier Grade service as a service with the following attributes:

Critical to the business function

Designed for massive scale

Designed to deal with major usage spikes

Location sensitive

Designed to provide deterministic response during extreme conditions

This is obviously a fairly simplistic definition, but for the sake of this discussion I think it will suffice.

What Makes a Cloud/PaaS Carrier Grade?

There are various attributes that make a Cloud/PaaS carrier grade, as I noted earlier. The two most important attributes, in my opinion, are the network and multi-site deployments. Let me explain why:

The Network

One of the main elements that is extremely important in a Carrier Grade environment is the ability to assert control over the network.

That includes control over:

Isolation

Bandwidth

Latency

Cross-Cloud/Data Center Deployments:

Another critical element of a successful carrier environment is multi-site deployment. As seen with the Weather Channel's use of 13 sites, multi-site deployment is important for continuous availability and scaling. Serving content closer to the end user's location also optimizes latency and helps with the challenges of data delivery.

So how are things done today?

The current carrier backbone runs on physical appliances, which basically maps to lots of iron. In this environment, scaling capacity means buying more appliances. While this model works, it has two main drawbacks:

cost (infrastructure/operation)

lack of agility (i.e., it takes months and sometimes years to launch a new service in this environment).

Alcatel CloudBand -- Carrier Grade IaaS/PaaS

Alcatel CloudBand is a new platform that lets Telco apps easily leverage carrier cloud services.

It is comprised of a few main elements.

Multi node/site IaaS -- a multi-site cloud infrastructure. The CloudBand infrastructure is essentially policy-based management over large numbers of cloud nodes. Each node can run either an OpenStack- or CloudStack-based infrastructure, and the nodes can live in many disparate data centers. Alcatel CloudBand glues all of the disparate nodes together into a single big cloud that is accessible through an OpenStack API.

CPaaS -- stands for Carrier Grade PaaS, essentially the framework enabling the on-boarding of carrier services onto the CloudBand infrastructure via a simple click-and-run user interface. Cloudify is an integral part of the CloudBand offering here.

CloudBand's Unique Approach: Putting Network and Application Together

One of the unique aspects of the CloudBand architecture is its holistic approach to Network and Application. Standard cloud infrastructures tend to look at the two pieces as separate black boxes that run one on top of the other.

What does this new approach to Network and Applications really mean?

Two example scenarios that I often use to describe the value of putting network and application together are disaster recovery and cloud bursting. In today's cloud, DR involves lots of wiring: I need to explicitly point one segment of the application at a particular cloud zone and another segment at a different zone. Beyond the complexity of setting these zones up, it also means that a good degree of manual intervention is required to handle a recovery or scaling process in this environment.

Taking an automated SLA-driven approach to IaaS

Instead of explicitly identifying zones in our cloud, with automated SLAs we can simply ask the cloud to figure out the right zone for the job based on our application SLA. For example, a user could simply say something like "deploy the RingTone service with continuous availability=true, redundancy=3, and distance between sites=100km". Most of that information is known to the CloudBand management at deployment time, so it can allocate machine instances based not solely on image ID and zone ID, but also on those SLA requirements.
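The RingTone example can be reduced to a toy constraint solver: given a redundancy count and a minimum distance between sites, pick a set of zones that satisfies both. The zone names and distances are invented; CloudBand's actual placement logic is of course far richer.

```python
from itertools import combinations

# Toy version of an SLA-driven placement request: instead of naming zones,
# the user asks for "redundancy=3, distance between sites >= 100 km" and the
# platform picks the zones. Positions are simplified to points on a line.

def pick_zones(zones, redundancy, min_km):
    """Return the first set of `redundancy` zones that are pairwise far apart."""
    for combo in combinations(zones, redundancy):
        if all(abs(a["km"] - b["km"]) >= min_km for a, b in combinations(combo, 2)):
            return [z["name"] for z in combo]
    return None  # the SLA cannot be satisfied with the available zones

zones = [
    {"name": "dc-paris", "km": 0},
    {"name": "dc-lyon", "km": 390},
    {"name": "dc-lille", "km": 220},
    {"name": "dc-nice", "km": 690},
]

print(pick_zones(zones, redundancy=3, min_km=100))
# ['dc-paris', 'dc-lyon', 'dc-lille']
```

The caller never names a zone; the constraints do the work, which is exactly the shift from explicit wiring to SLA-driven allocation described above.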

Integrating the PaaS with the network

Many of the current PaaS solutions were designed to work with a simple cloud infrastructure.

If we design our PaaS solutions to work on top of a more intelligent infrastructure like CloudBand, one that can accept SLA-driven calls to coordinate infrastructure management, a revolution can happen. We can start offloading to the infrastructure some of the responsibility for allocating the right machine instances to a particular application tier. The infrastructure could be made aware that we're deploying a data service, and would therefore ensure that the nodes of that database don't reside on the same physical machine, or even in the same data center. Another area where responsibility could be delegated to the infrastructure is network isolation: instead of dealing with security groups, the system can attach a particular network to a given application, or to a tier within that application, and the infrastructure will make sure that any machine allocated for that service is attached to that network.
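The anti-affinity idea for data services can be sketched as a simple placement check that the infrastructure could run before allocating instances. This is an illustration of the concept, not any real API:

```python
# Sketch of an anti-affinity check: reject a placement that puts two replicas
# of the same data service on the same physical machine or in the same data
# center. Replica/host/DC names are invented for illustration.

def placement_valid(placements):
    """placements: list of (replica, machine, datacenter) tuples."""
    machines = [m for _, m, _ in placements]
    datacenters = [d for _, _, d in placements]
    # Valid only if no machine and no data center is used twice.
    return len(set(machines)) == len(machines) and len(set(datacenters)) == len(datacenters)

good = [("db-0", "host-a", "dc-1"), ("db-1", "host-b", "dc-2")]
bad = [("db-0", "host-a", "dc-1"), ("db-1", "host-a", "dc-1")]
print(placement_valid(good), placement_valid(bad))  # True False
```

The point is where this check lives: in the infrastructure rather than in each application's deployment scripts.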

Final Words

For years there has been discussion of the missing piece in the cloud puzzle - the network. Today we're at a point where this gap is starting to be filled by projects like Quantum in OpenStack. In addition to OpenStack, the Telco industry is also launching a new initiative titled NFV, which stands for Network Function Virtualization. NFV was born in October 2012, when AT&T, BT, China Mobile, Deutsche Telekom and many other Telcos introduced the NFV Call to Action document. It basically aims to combine new network APIs with virtualization and thus provide a standard model for a virtualized Carrier Cloud.

While the industry seems to be moving in the right direction toward virtualization of the backbone systems, most of the effort seems focused on standardization at the lower levels of the stack. Very little has been done to draw the real end game, i.e., what an end-to-end carrier backbone would look like with that new virtualized infrastructure in place. More importantly, we haven't yet begun to think about the implications of that infrastructure change for the applications and services on top of it.

This is what excites me about the CloudBand project. CloudBand doesn't just end up as yet another fancy infrastructure piece that we don't necessarily know what to do with. It takes the holistic approach and maps those fancy features into a real end-to-end solution, which at the most basic level means that setting up data and network clusters, disaster recovery, or cloud-bursting scenarios can now be fully automated in a much simpler fashion than in most current cloud infrastructure environments.

At a more strategic level, this means that carriers can now rely on the cloud as an infrastructure to manage their backbone services, and thus leverage cloud economics to meet their cost and business challenges.

February 14, 2013

Learning from the experience of others has always been a great source for many of the posts in this blog.

I happened to have had a meeting with Ron Zavner, Applications Architect at GigaSpaces when he reviewed his planned presentation for the AWS Meetup this week in London.

I thought that the information that Ron gathered for this purpose could be extremely beneficial for everyone who is either running, or plans to run their application in the cloud.

In this post I tried to capture in words, the content of Ron’s presentation on the lessons from recent cloud outages.

Recap - History of Cloud Outages

21 April 2011 - Some parts of Amazon Web Services suffered a major outage. A portion of volumes utilizing the Elastic Block Store (EBS) service became "stuck" and were unable to fulfill read/write requests. It took at least two days for service to be fully restored. Reddit, one of the better-known sites to go down due to the error, said it had 700 EBS volumes with Amazon. Sites like Quora and Reddit were able to come back online in "read-only" mode, but users couldn't post new content for many hours.

29 June 2012 - Several websites that rely on Amazon Web Services were taken offline due to a severe storm of historic proportions in the Northern Virginia area where Amazon's largest data center is located.

22 October 2012 - A major outage occurred, affecting many sites, again such as Reddit, Foursquare, Pinterest, and others. The cause was a latent bug in an operational data collection agent.

Christmas Eve 2012 - Amazon AWS again suffered an outage, causing websites such as Netflix instant video to be unavailable for some customers, particularly in the Northeastern US. Amazon later issued a statement detailing the issues with the Elastic Load Balancing service that led up to the outage.

Cloud outages are not the sole property of Amazon – they’re everywhere

While most of the more notable failure events happened to be related to Amazon AWS, failures tend to be in direct proportion to usage of the infrastructure, and right now Amazon is probably running the biggest workloads on its cloud and growing fast. Given that, it's very likely that Amazon's failures are more notable than others' simply because they have a wider impact.

As we have seen, failure has paid a visit to other cloud infrastructure providers as well.

Microsoft Azure outage

28 December 2012 - some owners of Microsoft's XBox 360 gaming console were unable to access some of their cloud-based storage files.

26 July 2012 - Service for Microsoft’s Windows Azure Europe region went down for more than two hours

29 February 2012 - The ultimate result was service impacts of 8-10 hours for users of Azure data centers in Dublin, Ireland; Chicago; and San Antonio.

Main takeaway

Looking at all of these failures, it becomes apparent that they don't quite follow a common pattern.

Failure tends to happen when and where you least expect it.

Rather than relying on the infrastructure to prevent such failures from happening, we need to learn how to cope with them as a way of life.

What does 99% availability mean anyway?

Quite often when we talk about availability, we’re referring to % of uptime.

In this context, for most people 99% uptime sounds good enough. Is it?

Let's examine what that means in days:

99% - 3.65 days downtime

99.9% - 8.76 hours downtime

99.99% - 53 minutes downtime

99.999% - 5.26 minutes downtime

99% uptime means that we need to be ready to tolerate almost four days of downtime, and no one can guarantee how those four days will be spread across the year.
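The figures above follow from simple arithmetic: yearly downtime is (1 - uptime) times the hours in a year. A quick check:

```python
# Downtime per year implied by an uptime percentage, using a 365-day year.

def yearly_downtime_hours(uptime_pct):
    return (1 - uptime_pct / 100) * 365 * 24

for pct in (99, 99.9, 99.99, 99.999):
    print(f"{pct}% uptime -> {yearly_downtime_hours(pct):.2f} hours of downtime")
# 99%     -> 87.60 hours  (3.65 days)
# 99.9%   -> 8.76 hours
# 99.99%  -> 0.88 hours   (~53 minutes)
# 99.999% -> 0.09 hours   (~5.3 minutes)
```

Each extra nine cuts the downtime budget by a factor of ten, which is why "five nines" is such a different engineering problem from "two nines".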

The impact of cloud outages

Although AWS went offline for only a few hours, the downtime did have an impact on customers' businesses. There is no reliable data on the number of people affected by a cloud computing service outage, but it is estimated that the travel service provider Amadeus loses $89,000 per hour during a cloud computing outage, while PayPal loses around $225,000 per hour.

How to survive cloud outages - (Lessons from RightScale & Netflix)

Good sources for learning how to survive cloud outages are Netflix and RightScale, who have a good track record of surviving many of the previously mentioned cloud outages.

Below is a summary of the main takeaways.

Make sure to have a dedicated expert to manage your disaster recovery (DR) architecture, processes and testing.

Define what your target recovery time and recovery point is.

Be pessimistic and design for failures – (assume everything will fail and design a solution that is capable of handling it).

Use monitoring and alerts for failover processes and for every change in state.

Document your DR operational processes and automations.

Try to "break" different parts of your application, from unplugging the network to turning off machines... then try it again.

Netflix didn't just provide its share of advice; it has started to open source many of the tools it uses internally. The first of these is "Chaos Monkey," a tool designed to purposely cause failures in order to increase the resiliency of an application on Amazon Web Services (AWS).

Netflix provides an excellent toolset for surviving outages on Amazon on the operational level.

In this section, I want to zoom in more on the design implications for our application.

The core principle for surviving failure is actually fairly simple and, in fact, applies to any system, not just the cloud, whether airplanes, missiles, or cars. At the end of the day, it's all about redundancy. The degree of tolerance is often determined by how many alternate systems or parts we have in our design and how separate they are from one another. It is also determined by how fast we can detect the broken part in our system and make the switch.

In software terms, the common parts that comprise our system fall into two main groups: the business logic and the data.

Making a redundant software application that can survive failure is often a matter of setting up clones of both of those parts of our system.
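The detect-and-switch part of that redundancy principle can be sketched in a few lines: probe each clone in priority order and route traffic to the first healthy one. The endpoint names and health states here are invented:

```python
# Minimal failover sketch: given a priority-ordered list of redundant
# endpoints, pick the first one that passes its health check. The health
# check is a stand-in for a real probe (TCP connect, HTTP ping, etc.).

def pick_endpoint(endpoints, healthy):
    """Return the first endpoint that passes its health check."""
    for ep in endpoints:
        if healthy(ep):
            return ep
    raise RuntimeError("no healthy endpoint: total outage")

status = {"us-east": False, "us-west": True, "eu-west": True}
active = pick_endpoint(["us-east", "us-west", "eu-west"], lambda ep: status[ep])
print(active)  # us-west: the primary is down, so we fail over
```

How quickly `status` reflects reality is the "how fast we can detect the broken part" half of the tolerance equation.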

Cloning your data

There are two models for creating redundant data systems.

1. Use database replication

2. Use generic replication as in the case of CDN

For database replication - each database tends to have a different replication scheme. Amazon RDS is based on a read replica that can take over when the master node fails. More modern databases, such as Cassandra and MongoDB, tend to give more flexibility and control in setting up replication. So the first choice you need to make for data redundancy is choosing the right database.

Quite often, database replication is good enough within a certain geography but can be too fragile for geo-redundancy. A model that has proven itself for replicating data across a WAN is the CDN.

What’s more, CDN is not tied to a specific data source, and therefore can use a generic service for replicating data from multiple sources that doesn't necessarily reside within a certain database.

Having said that, a CDN fits mostly read data and doesn't ensure the consistency of the data. For that purpose, we need a generic replication service that can plug into any data source and replicate it to other locations in a way that lets us control the replication route, latency, and consistency.

Cloning your application

To clone our application's business logic, we need to ensure that all parts of our system run the exact same version of all our software components. That includes not just the binaries, but also the configuration and the scripts that run our application; more importantly, all our post-deployment procedures, such as failover, scaling, and monitoring, must also be kept consistent.
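One way to enforce that consistency is a drift check that compares the deployment manifests of two clones, covering binaries, configuration, and post-deployment scripts alike. The manifest contents here are invented for illustration:

```python
# Sketch of a clone-consistency check: compare the manifests of two clones
# and report any component whose version or configuration drifted.

def find_drift(manifest_a, manifest_b):
    """Return the keys on which two deployment manifests disagree."""
    keys = set(manifest_a) | set(manifest_b)
    return sorted(k for k in keys if manifest_a.get(k) != manifest_b.get(k))

primary = {"app": "2.3.1", "nginx.conf": "sha1:ab12", "failover_script": "v7"}
clone = {"app": "2.3.1", "nginx.conf": "sha1:ff09", "failover_script": "v7"}
print(find_drift(primary, clone))  # ['nginx.conf']
```

Run as part of deployment, a check like this catches the clone that quietly picked up a different config before a failover exposes the difference.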

Quite often, what makes cloning our business logic complex is that the information on how to run our application is scattered across many different sources, such as scripts, as well as the minds of the people who run these apps.

To make the job of cloning our application much simpler, and thus more consistent, we need to capture all the information for running our apps in one place.

Configuration management tools such as Chef and Puppet, and, in the case of Amazon, CloudFormation, can help in this regard.

Making it simpler through Cloudify

To make the work of setting all this up simpler, we tried to bake all those patterns into ready-made, out-of-the-box recipes.

Cloudify recipes include:

Database cluster recipes with support for MySQL, MongoDB, Cassandra, PostgreSQL...

Integration with Chef and Puppet

Automation of failover, scaling and continuous maintenance of our application.

Application recipes that allow you to capture all the aspects of running your application, including post-deployment aspects such as failover, scaling, and monitoring.