Although cloud and utility computing models continue to grow in popularity, they are still very much a work in progress. Even the definition of what constitutes the “cloud” is still being debated in some corners. While it may be hard to define, I think the NIST definition of cloud provides a valuable baseline for discussing some of the broader requirements and obstacles associated with it. To that end, I want to turn our attention to what NIST calls on-demand self-service and why it’s so critical to the success of cloud computing.

NIST defines ‘On-demand self-service’ as follows: “A consumer [or any user, for our purposes here] can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with each service provider.” Without on-demand self-service, user adoption of utility computing models can be severely hampered. An application developer or an IT administrator should be able to order, at the click of a mouse, the exact environment they need for a specific application.

On-Demand Self-Service Today

To some degree, virtualization has made on-demand self-service a reality in the cloud. Most cloud-based solutions today allow users to configure their compute resources (number of CPUs, memory capacity per VM) and their storage resources (capacity). Well-designed orchestration management tools like OpenStack, CloudStack, and vCloud Director are making these capabilities even more promising and accessible to all cloud providers, both public and private. But while self-service provisioning of CPUs, memory, and storage is very useful for small, contained applications, it is still not sufficient for more complex (enterprise-class) applications that typically require some type of custom networking configuration.

What We Really Need

If users could define their own detailed networking requirements for each application that uses the cloud infrastructure, the real challenge would shift to the operations folks within the cloud infrastructure, who would need the ability to place workloads based on their specific network requirements. To do this, we first need to treat the network as a pooled resource, just like compute and storage. Then we need to make all user-relevant network resources, such as capacity, proximity, path diversity, and data segmentation, configurable by the user on demand. Each user should be able to tell the network what resources they need for any given application they are running in the cloud. For example:
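One way to picture such a request is as a small structured record per application. The Python sketch below is only illustrative: the class name, field names, and units are hypothetical and are not taken from NIST or from any orchestration tool's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NetworkRequirements:
    """Hypothetical per-application network request (all fields are illustrative)."""
    app: str
    capacity_mbps: int            # capacity: bandwidth the application needs
    max_hops: int                 # proximity: how far apart its endpoints may sit
    min_disjoint_paths: int = 1   # path diversity: independent paths required
    isolated: bool = False        # data segmentation: keep traffic off shared segments

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the request is well-formed."""
        problems = []
        if self.capacity_mbps <= 0:
            problems.append("capacity_mbps must be positive")
        if self.max_hops < 1:
            problems.append("max_hops must be at least 1")
        if self.min_disjoint_paths < 1:
            problems.append("min_disjoint_paths must be at least 1")
        return problems


# Example request for a made-up application.
request = NetworkRequirements(app="billing-db", capacity_mbps=2000,
                              max_hops=2, min_disjoint_paths=2, isolated=True)
print(request.validate())
```

A record like this is something a self-service portal could collect alongside the CPU, memory, and storage choices users already make.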

Once the user has defined these parameters, the cloud infrastructure must then have the ability to take the sum of all the individual application network requirements and dynamically make network workload placement decisions along with CPU and storage placement decisions that are already being made by orchestration management tools.
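To make the idea of joint placement concrete, here is a deliberately simple sketch in Python. It assumes each host advertises its spare CPUs and spare uplink bandwidth, and it uses a greedy first-fit rule; the `Host`, `Workload`, and `place` names are hypothetical, and real orchestration tools make these decisions with far richer models.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    free_cpus: int
    free_mbps: int   # spare capacity on the host's network uplink (assumed known)


@dataclass
class Workload:
    name: str
    cpus: int
    mbps: int


def place(workloads, hosts):
    """Greedy first-fit placement that checks CPU and network capacity together.

    Returns {workload name: host name, or None if nothing fits}. Workloads with
    the largest network demand are placed first.
    """
    placement = {}
    for w in sorted(workloads, key=lambda w: w.mbps, reverse=True):
        target = next((h for h in hosts
                       if h.free_cpus >= w.cpus and h.free_mbps >= w.mbps), None)
        if target is not None:
            # Reserve both kinds of resource on the chosen host.
            target.free_cpus -= w.cpus
            target.free_mbps -= w.mbps
        placement[w.name] = target.name if target else None
    return placement


hosts = [Host("h1", free_cpus=8, free_mbps=1000),
         Host("h2", free_cpus=4, free_mbps=10000)]
workloads = [Workload("web", cpus=2, mbps=500),
             Workload("db", cpus=4, mbps=4000)]
print(place(workloads, hosts))
```

In this example a CPU-only scheduler could put "db" on "h1", which has the most free CPUs, even though that host's uplink cannot carry the workload's traffic. Checking both constraints together sends it to "h2" instead.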

Like rapid elasticity, on-demand self-service is an essential requirement for cloud computing. It is already working with compute and storage in many cases, but to realize the full value of the cloud (i.e., to extend the cloud to all types of workloads) we also need to bring the same type of self-service access and configuration to key network resources. This will be an important step in helping cloud computing reach its true potential.

For a few years (maybe 10?), we seemed to have reached some sort of happy equilibrium in networking. Ethernet and IP won the protocol wars and came to dominate almost all voice and data traffic. Things moved along very nicely and stayed that way for some time. We were able to build functional networks that got data from point A to point B; we just added some capacity every few years, and they kept on ticking.

As with anything that settles into some sort of equilibrium, eventually a new stimulus comes along to upset it. This time it’s not the protocol wars. Rather, the stimulus seems to be the increased use of server virtualization, especially in large data center networks. This stimulus is causing quite an imbalance in networking. It creates very basic issues like connectivity and preservation of policy for servers that are moving around a data center. It also presents more complicated issues, such as how to go from a relatively static network engineering mindset to one that must contemplate dynamic workload changes and movements.

So far we’ve seen two evolutionary responses from the network, each solving some of the problems while creating new ones. As with any evolutionary approach, multiple reactions may appear in response to a new stimulus, but it will likely take some time before the winner appears. In the case of networking technology, the winning adaptation may not be one of the two discussed below, but for now we’ll take these as leading indicators of what to expect. Ultimately, the outcome may be shaped more by market forces than technology.

1. First Evolutionary Response: “The Fabric”

Here we see the network responding to the stimuli of virtual machine movement and so-called “east-west” traffic patterns. Existing routed and hierarchical network approaches tend to be sub-optimal for highly virtualized data centers. Many have written about this, so I won’t waste space re-hashing the issues, but suffice it to say that virtual machines want to be able to move anywhere in the data center, which is generally hampered when the network is organized as many routed subnets. And creating highly interconnected mesh networks of servers with hierarchical networks can create much inefficiency when one server just wants to talk to another server.

The so-called network fabric attempts to ease primarily those two issues (and in some cases also tries to present the entire network as a single entity for simpler management). The aim of most fabric designs is to create a single large layer 2 domain so that virtual machines can be moved without concern for IP address changes. Other approaches are more layer 3 centric, but use some tracking logic to ensure that the network can keep up with moves at the virtual server layer. In either case, the network fabric should allow for VM mobility without re-addressing the VM. Fabrics also aim to better utilize all available network links. By doing this, the fabric eases some of the constraints in typical hierarchical networks and makes the network appear “flatter” by opening more available capacity for server-to-server needs.

Problems It Solves:

A fully-connected, any-to-any meshed network makes the network a non-issue when it comes to VM mobility.

Newer technologies allow full use of all the capacity without intentionally blocking links to prevent loops in the topology, making the network appear flatter for servers.

Problems It Creates:

To be fully useful for virtualized data centers, the mesh needs to reduce or eliminate oversubscription, making for a very costly network. In essence, the entire network needs to be engineered for the most demanding workloads as workload location cannot be pre-supposed.

All of that over-engineered capacity (i.e. for workloads below the most demanding) becomes unusable.

The same new technologies that unlock all the potential capacity also create new challenges and complications; e.g., technologies like TRILL and SPB introduce complex link state protocols at layer 2, taking a once relatively simple connectivity layer and making it subject to the vagaries of non-deterministic protocols.

These network control protocols also add more intelligence to the network versus the end points (virtual switches in servers), which can create conflicts with SDN architectures (more on that below).

In some architectures, “flat” means a single large broadcast domain, which can be very difficult to manage and troubleshoot.

Inserting network services becomes problematic, especially if the fabric is proprietary or if the need for L3-L7 services negates any gains in “flatness.”
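The cost argument in the first item above comes down to simple arithmetic on oversubscription, that is, the ratio of server-facing bandwidth to uplink bandwidth on a switch. The sketch below works through it in Python; the port counts and speeds are made-up illustrative numbers, not figures from any particular product.

```python
import math


def oversubscription_ratio(server_ports, server_gbps, uplink_ports, uplink_gbps):
    """Ratio of server-facing bandwidth to uplink bandwidth on one switch.

    1.0 means non-blocking; 3.0 means the servers can offer three times more
    traffic than the uplinks can carry.
    """
    return (server_ports * server_gbps) / (uplink_ports * uplink_gbps)


def uplinks_needed(server_ports, server_gbps, uplink_gbps, target_ratio=1.0):
    """Smallest number of uplinks that keeps the ratio at or below target_ratio."""
    demand = server_ports * server_gbps
    return math.ceil(demand / (uplink_gbps * target_ratio))


# Hypothetical edge switch: 48 x 10G server ports, 4 x 40G uplinks.
print(oversubscription_ratio(48, 10, 4, 40))        # 3.0, i.e. 3:1 oversubscribed
print(uplinks_needed(48, 10, 40))                   # uplinks for a non-blocking design
print(uplinks_needed(48, 10, 40, target_ratio=3.0)) # uplinks if 3:1 is acceptable
```

With these numbers, going from 3:1 to non-blocking triples the uplink count on every edge switch (4 to 12), plus the upstream ports to terminate them. That is the expense the list above is pointing at.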

2. Second Evolutionary Response: Software Defined Networks

The goal of Software Defined Networks (SDN) is to move the control point of the network into software. The premise is that in order to truly respond to new stimuli, both the currently known and the future unknown, we need to separate the intelligence that controls the network from the data paths. Once the “control plane” lives in software, it can evolve and adapt at the pace of software rather than the pace of hardware.

In many cases, SDN is conflated with OpenFlow or with network virtualization. In reality, OpenFlow is a component of SDN: it aims to standardize the interface by which software-based network control planes speak to the various network elements. Network virtualization generally refers to the ability to define software paths through a network. These software-defined paths are generally instantiated at the very edge of the network (i.e., from the virtual switch on the server) and are also controlled by a software control plane. In this view, network virtualization could be seen as an implementation of SDN. Of course, all of this is relatively new, so many may disagree with my dissection and resulting taxonomies.
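The separation of control plane and data paths described above can be shown with a toy model. In the Python sketch below, switches only look up forwarding entries, while a software controller decides what those entries should be. This is not the OpenFlow protocol or any real controller's API; the `Switch` and `Controller` classes and their methods are hypothetical stand-ins.

```python
class Switch:
    """Data plane only: forwards by table lookup and makes no decisions itself."""

    def __init__(self, name):
        self.name = name
        self.flow_table = {}   # destination address -> output port

    def forward(self, dst):
        # A table miss returns None here; in this toy model nothing else happens.
        return self.flow_table.get(dst)


class Controller:
    """Control plane in software: decides forwarding state and installs it."""

    def __init__(self):
        self.switches = {}

    def register(self, switch):
        self.switches[switch.name] = switch

    def install_path(self, dst, hops):
        """hops is a list of (switch name, output port) pairs along the path."""
        for switch_name, port in hops:
            self.switches[switch_name].flow_table[dst] = port


controller = Controller()
edge, core = Switch("edge1"), Switch("core1")
controller.register(edge)
controller.register(core)
controller.install_path("10.0.0.5", [("edge1", 3), ("core1", 7)])
print(edge.forward("10.0.0.5"), core.forward("10.0.0.5"), edge.forward("10.0.0.9"))
```

All of the policy lives in `Controller`, so changing how paths are chosen means changing software in one place rather than touching each switch.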

Problems It Solves:

A commoditized and dumb HW layer should (in theory) bring down costs.

The ability to control and manipulate the network from software should (in theory) enable faster innovation cycles for new networking capabilities.

Users can manage intelligence from the host (e.g., the virtual switch), allowing for the creation of more coherent policies that are more closely related to business and application needs.

Compartmentalization and similar techniques restore some order to the network.

There is a direct interface to orchestration and management infrastructure for the VM layer.

Problems It Creates:

This approach requires cooperation and synchronization of intelligence/policy at both the virtual network and the physical network, or requires completely overlaying the physical network with software-defined tunnels. In some cases, that adds a new layer of complexity to manage (e.g., the tunnels themselves).

It works best with fully-meshed fabrics that offer uniform performance characteristics. These fabrics are becoming more intelligent/complex (not less so, see above) – meaning users could end up paying for intelligence twice.

It also brings up the obvious question – when using an SDN approach on top of an intelligent fabric, who is in control and is the result deterministic and predictable?

The Winning Adaptation?

I’m fairly certain we haven’t seen the end of this evolutionary phase. The new stimulus being applied to the network is only just the beginning, and these first two adaptations may themselves morph a few times before we reach the next equilibrium. If we project from our current course, we’re heading to a place where the network will be fully aligned to the dynamic and evolving needs of business applications, and will also present a more software-minded construct in its own evolutionary process such that networks will be designed and deployed as rapidly as the applications that ride on top of them. In order to get there, the industry needs to continue to build on the work done to date to at least address the following issues (and probably a few more):

Simplify the point of control (intelligence) in the software layer and reach the point where we have fully deterministic and predictable network behavior.

Allow for the creation of specific software-based network topologies for each application, and allow all of the resulting network topologies to coexist in the physical network.

Build service interposition into the workflow and respect/enforce the requirements for L3-L7 services directly at the physical layer.

SDN and network fabrics present an exciting vision of what’s possible and there is tremendous energy and passion behind both. In my view, these are both steps in the right direction. As the network adapts, it solves some problems, and creates some new ones. We’ll have reached an equilibrium when the new network solves more problems than it creates, and presents a clear evolutionary advantage relative to the current stimuli. To get there, we’ll certainly need more than SDN and/or Fabrics. It’s hard to predict what the end result will look like, and the answer will be shaped not only by technology, but also by market and economic forces. In the meantime, we’ll all continue solving as many problems as we can, introducing as few new problems as possible, and seeing where this evolutionary process takes us next.

Last week I was re-reading this blog post (dev2ops.org/blog/2010/2/22/what-is-devop...) about an emerging trend called “DevOps.” The article was written about a year ago, and the concept has really matured and gained interest since then. At the risk of over-simplification, DevOps aims to smooth the disconnect that frequently occurs in companies between application developers (Dev), the people tasked with technology innovation and change, and operations (Ops), the people responsible for putting that innovation into production in the enterprise data center and the business.

The post got me thinking about the state of application development versus networking. Application development has moved into a new era of agility, led by process evolution (Waterfall to Agile), a rapid transformation of foundational tools and libraries, and new ways to leverage open source code and communities. As application developers have become more capable and agile, they have put pressure on the deployment infrastructure (servers in data centers) to be equally agile and responsive. And now, in turn, the deployment infrastructure is pushing on the network to deliver the new capabilities quickly and efficiently to the business.

In the broader context of information technology, it is clear that many businesses are becoming more and more reliant on the speed of their IT (or R&D) capabilities. The ability to quickly develop new applications, either to directly support the business’s revenue (e.g., Google, Facebook) or to improve internal business operations, is no longer a luxury. Agile software development methodologies have allowed these businesses to rapidly develop new IT or product capabilities and refine them quickly to get the final product. As the article highlights, this speed creates a tension, or a “Wall of Confusion,” between agility (development) and stability (operations), and the DevOps movement has really been focused on giving the operations folks the tools and processes to smooth those tensions. Technologies like server virtualization, orchestration tools, and even “Infrastructure as Code” concepts are facilitating this transition, making the computing infrastructure much more responsive to the needs of the applications.

Editor’s note: we added the third frame of the graphic for purposes of this blog.

This “Wall of Confusion” could now be drawn between Operations and Networking. Today, the network is squarely Waterfall. It takes a long time to design and engineer a network to support a specific set of assumptions based on the gross application needs. In a world where application needs are changing rapidly, we no longer have that luxury. As Ops teams have become more fluid with server virtualization and tools for orchestration, they are now looking to the network to get “agilified.”

Technologies like OpenFlow and the general trend toward Software-Defined Networking (SDN) are emerging as enablers to help “virtualize” the network resources and bring programmatic aspects to the network. These are necessary steps in the evolution of the enterprise network – where it ultimately becomes just another cog in the rapid application delivery machinery that enables us to transition smoothly from business concept to deployment to delivery in synchronous, agile steps.

The definition and requirements of cloud computing are still evolving, but NIST (National Institute of Standards and Technology) is always a good objective reference. NIST defines five essential characteristics of cloud computing: On-demand self-service, broad network access, resource pooling, rapid elasticity and measured service. In this post, I focus on rapid elasticity and how it relates to the networking aspect of the cloud. In future posts I will comment on some of the other characteristics of cloud computing.

According to NIST, ‘Rapid elasticity’ is defined as: “Capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out and rapidly released to quickly scale in. To the user, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.”
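As a rough illustration of “scale out” and “scale in,” the Python sketch below computes how many instances a service would need to bring its average utilization back to a target. The function name, the 60% target, and the instance limits are assumptions made for this example; NIST does not prescribe any particular scaling rule.

```python
def desired_instances(current, avg_utilization_pct, target_pct=60,
                      min_instances=1, max_instances=100):
    """Instance count that would bring average utilization back to target_pct.

    Utilization is given as a whole-number percentage so the arithmetic stays
    in integers. The result is clamped to [min_instances, max_instances].
    """
    total = current * avg_utilization_pct
    wanted = -(-total // target_pct)   # ceiling division on integers
    return max(min_instances, min(max_instances, wanted))


print(desired_instances(4, 90))    # busy: scale out
print(desired_instances(10, 15))   # mostly idle: scale in
```

Here four instances running at 90% grow to six, while ten instances idling at 15% shrink to three.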

With the growth of new cloud computing models, the enterprise data center has become home to a lot of dynamic workloads that require this type of rapid elasticity. These workloads appear suddenly, migrate rapidly and change intensity unexpectedly. They offer significant benefits in the way of performance, flexibility and operational cost effectiveness, but they also create some interesting data center challenges. In order to achieve the real benefits of these dynamic workloads, the underlying infrastructure also has to be elastic. Most data center managers are focused on achieving elasticity for compute and storage aspects of the infrastructure, but there is also a third aspect that is often overlooked – the network.

In the data center, workloads are placed primarily based on their performance requirements in terms of the underlying compute horsepower, storage capacity needs or idle time energy savings requirements. Most will be moved more than once when servers need to be rotated out, power consumption needs to be re-balanced, workloads fail over or servers get consolidated. The problem is that when workloads are moved, there is little attention paid to their physical location on the network and the resulting effects it may have on performance (in terms of capacity/utilization of the links, latency due to hop counts, etc.), security (when and how data is co-mingled on the network) and resiliency (how many redundant paths exist for a given inter- or intra-workload communication path).
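Two of the network effects mentioned above, hop count and the number of redundant paths, are easy to compute once the topology is written down as a list of links. The Python sketch below does this for a small, made-up two-tier topology. The node names are hypothetical, and the path count uses a basic unit-capacity max-flow search rather than anything a production tool would necessarily use.

```python
from collections import deque


def hop_count(links, src, dst):
    """Fewest hops between two nodes, or None if unreachable (breadth-first search)."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == dst:
            return hops
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


def disjoint_paths(links, src, dst):
    """Number of link-disjoint paths between src and dst (unit-capacity max flow)."""
    if src == dst:
        return 0
    cap = {}
    for a, b in links:
        # Each undirected link becomes two directed arcs of capacity 1.
        cap[(a, b)] = cap.get((a, b), 0) + 1
        cap[(b, a)] = cap.get((b, a), 0) + 1
    nodes = {n for link in links for n in link}
    count = 0
    while True:
        # Breadth-first search for a path with spare capacity.
        parent = {src: None}
        queue = deque([src])
        while queue and dst not in parent:
            node = queue.popleft()
            for nxt in nodes:
                if nxt not in parent and cap.get((node, nxt), 0) > 0:
                    parent[nxt] = node
                    queue.append(nxt)
        if dst not in parent:
            return count
        # Use up one unit of capacity along the path that was found.
        node = dst
        while parent[node] is not None:
            prev = parent[node]
            cap[(prev, node)] -= 1
            cap[(node, prev)] = cap.get((node, prev), 0) + 1
            node = prev
        count += 1


links = [("s1", "leaf1"), ("s2", "leaf1"), ("s3", "leaf2"),
         ("leaf1", "spine1"), ("leaf1", "spine2"),
         ("leaf2", "spine1"), ("leaf2", "spine2")]
print(hop_count(links, "s1", "s2"), hop_count(links, "s1", "s3"))
print(disjoint_paths(links, "leaf1", "leaf2"), disjoint_paths(links, "s1", "s3"))
```

In this topology, moving a workload's peer from "s2" to "s3" doubles the hop count from two to four. The two switches "leaf1" and "leaf2" have two independent paths between them, but the single server links leave "s1" and "s3" with only one.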

Trying to optimize workload placement across all three dimensions – compute, storage and network – quickly becomes a never-ending game of whack-a-mole. Instead, we need to make all three of these different sets of resources more fluid such that workloads can be placed based on the most important needs of that workload, and the other aspects of the infrastructure follow along. For example, if a workload requires very high compute performance, a data center operator should be able to move (or expand) the workload to the most capable virtual machines anywhere in the infrastructure, regardless of the network or storage implications. Those resources should have enough elasticity to meet the workload needs by adjusting themselves appropriately. Conversely, if a component of a workload needs high-capacity storage, the compute and network dimensions should accommodate without any adverse effects. This type of elasticity enables a number of advantages including shorter provisioning cycles, easier maintenance of existing applications, more rapid adoption of new web-based business models, better customer experience, reduced complexity and higher levels of security and data integrity.

With the help of virtualization technologies, compute and storage resources have already come a long way, achieving greater degrees of elasticity than ever before. But in order to realize the full benefits and capabilities of cloud computing, all three dimensions must continue to evolve, most notably the network. Networking is still in the early stages of this transformation, one that will allow for rapid elasticity of workloads without compromise. Once that happens, the resources provided by the network, such as connectivity, capacity, latency performance, path diversity and data segregation, will be fully accessible to workloads regardless of physical location. Only then will we see what’s truly possible in the cloud.

There have been a number of very high-profile service outages recently – Amazon AWS in April [1], United Airlines in June [2], and just this week, RIM. I doubt they will be the last. While it’s interesting to wonder how these companies could have let such critical services go down at all, there is a more fundamental problem here. It’s the network.

As RIM noted on its website, "The messaging and browsing issues many of you are still experiencing were caused by a core switch failure within RIMs Infrastructure."

In my view, RIM’s outage is just one more example of how broken and mis-applied “modern day” networking technology is for our mission critical services. Traditional networking gear held together by overly complex protocols and schemes simply isn’t working.

I won't join the bandwagon speculating on whether this is the nail in RIM’s coffin. As a former Crackberry addict, I'm still rooting for them! I do find it interesting, though, that a company that has operated a rock-solid enterprise network for so many years (the best in the enterprise business by most measures) is now the latest victim of a “networktastrophe.” From what I've read in the press (most of it unsubstantiated), it sounds like a large switch had a module failure and the supposed redundancy scheme (likely something like VRRP) did not operate as planned. A more likely explanation, though, is that someone fat-fingered a command in a 1980s-era command-line interface, and all hell broke loose. We can certainly blame RIM for not testing their redundancy architecture or procedures well enough, but let’s not overlook the fundamental problem – the network and its apparent fragility. Everything from the complexity (and corresponding fragility) of the protocols to the downright user-unfriendly management needs to be rethought for today’s critical data and service needs. We built these systems to solve a different problem (and under a different set of constraints) than we’re trying to address with them today. And the band-aid layers we've applied over the many years have made the network an overly complex and inflexible house of cards at risk of coming down at the slightest mis-key.