I recently came across Benjamin Black’s blog on complexity in the context of AWS. He says:

i now see complexity moving up the stack as merely an effect of complexity budgets. like anything worth knowing, complexity budgets are simple: complexity has a cost, like any other resource, and we can’t expect an infinite budget.

spending our complexity budget wisely means investing it in the areas where it brings the most benefit (the most leverage, if you must), sometimes immediately, sometimes only once a system grows, and not spending it on things unessential to our goals.

What drives design complexity in the Cloud computing infrastructure space?

Finding the right mix of functional “differentiation” versus “integration” at all levels or tiers of the design (whether hardware or software), along with technology & business constraints, drives the “complexity budget”.

“Differentiation” at the functional level is pretty well understood, but evolving: routers, switches, compute and storage nodes, although the very basis of how these functions are realized is changing (e.g. Cisco is supposed to be gearing up to sell servers).

“Integration” of the various infrastructure functions in the data center is always a non-trivial system (integration) expense.

Some examples of technology constraints:

From the chip level right up to the system level, energy efficiency improvements are much slower than hardware density improvements. Consequently, not being able to consume power in proportion to utilization levels results in sub-optimal cost/pricing structures.

The context/environment for virtualization, for Cloud providers, is really about defining what it means to have a “virtualized data center”, Cisco’s Unified Computing being one example.

The need to produce manageable systems, and the interdependencies from the chip level to the board, system, software and ultimately data center level, require us to resort to holistic, iterative thinking & design: we have to consider function and interaction at all levels, in the context of the containing environment, i.e. what it means to have a “fully virtualized” data center.

Optimal cloud computing operations, even for non-virtualized environments (let’s face it, there are tons of scenarios that don’t necessarily need virtualization; by the same token, there are virtualization solutions that enable better utilization but don’t require hypervisors).

complete automation of managed IT environments

This is all about moving the complexity away from IT customers (adopting the cloud computing model) and into the data centers/cloud providers.

Have you done this fire drill? You have a high-traffic/high-volume web application, and it is sluggish or unstable. Your team is called in to figure out whether the problem is in the application, the middleware stack, the operating system or somewhere else in the deployment configuration.

This is where Observability comes in: being able to dynamically probe resource usage at a granular level (across all levels of the infrastructure/application stack), control probing overhead, and associate actions or triggers with those probes at all levels. At the system level, DTrace is a good example, on Solaris and Mac OS X. There’s even a Java API for DTrace. Of course, there are also profiling libraries that offer low-overhead, extremely fine-grained probes and aggregation capabilities (e.g. JETM).
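To make the three ingredients above concrete (dynamic probing, overhead control, and triggers), here is a minimal sketch in Python. It is not how DTrace or JETM work internally; the `Probe` class, its `sample_rate` parameter and the `on_slow` hook are hypothetical names invented for illustration.

```python
import random
import time


class Probe:
    """Hypothetical probe: a named timing point that can be toggled at runtime."""

    def __init__(self, name, sample_rate=1.0, rng=None):
        self.name = name
        self.enabled = False            # dynamic: switched on only when needed
        self.sample_rate = sample_rate  # overhead control: fraction of calls timed
        self.count = 0                  # aggregated number of measured calls
        self.total_seconds = 0.0        # aggregated elapsed time
        self.triggers = []              # (threshold_seconds, action) pairs
        self._rng = rng or random.Random()

    def on_slow(self, threshold_seconds, action):
        """Register an action to fire when a measurement exceeds the threshold."""
        self.triggers.append((threshold_seconds, action))

    def record(self, elapsed_seconds):
        """Aggregate one measurement and fire any matching triggers."""
        self.count += 1
        self.total_seconds += elapsed_seconds
        for threshold, action in self.triggers:
            if elapsed_seconds > threshold:
                action(self.name, elapsed_seconds)

    def wrap(self, func):
        """Return func instrumented with this probe."""
        def wrapper(*args, **kwargs):
            # Skip timing entirely when disabled or when this call is not sampled.
            if not self.enabled or self._rng.random() >= self.sample_rate:
                return func(*args, **kwargs)
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                self.record(time.perf_counter() - start)
        return wrapper


# Usage: instrument a function, then enable the probe only while investigating.
query_probe = Probe("db.query", sample_rate=0.1)
slow_calls = []
query_probe.on_slow(0.5, lambda name, elapsed: slow_calls.append((name, elapsed)))
run_query = query_probe.wrap(lambda sql: len(sql))
query_probe.enabled = True
run_query("SELECT 1")
```

The point of the design is that a disabled probe costs one attribute check per call, and a sampled probe only pays the timing cost on a fraction of calls, which is the kind of overhead control the paragraph above asks for.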

Manageability is another important consideration: being able to control and manage applications via standard management systems. This involves (hopefully) consistent instrumentation mechanisms and standard isolation mechanisms between the IT resources being managed and external management systems. JMX is an example of one such standard. Options such as a JMX-to-SNMP bridge or MIBs compiled to MXBeans are some methods of integrating managed resources into higher-level (management) frameworks.

Flexible binding of computing infrastructure to workloads is a key value proposition of the Cloud Computing model. Cloud computing providers like Amazon serve the needs of typical web workloads by providing access to their dynamic infrastructure.

So, why are Observability and Manageability more important in the Cloud?

Because paying for a fixed unit of consumption like CPU hours is not optimal when you are on an “elastic”/dynamic infrastructure. You’d want to pay for exact utilization levels.

Because you want monitoring arrangements that work seamlessly, as they do on a single system, but on top of “elastic”/dynamic infrastructure.

There are many other reasons, of course:

Being able to manage the lifecycle of applications in an automated manner, in a distributed computing environment, is perhaps the biggest use case for Manageability in the Cloud.

Metering, billing, performance monitoring, sizing and capacity planning are some examples of activities in the Cloud computing model that leverage the same underlying Observability principles (instrumentation, dynamic probes, support for aggregating metrics, etc.).
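As a rough illustration of how aggregated metrics feed metering and billing, the sketch below sums metered CPU-seconds per customer and compares a charge for whole reserved CPU hours with a charge proportional to actual utilization. The function names, customer names and the hourly rate are all made up for the example; no real provider's pricing is implied.

```python
from collections import defaultdict


def aggregate_cpu_seconds(samples):
    """Sum metered CPU-seconds per customer from (customer, cpu_seconds) samples."""
    totals = defaultdict(float)
    for customer, cpu_seconds in samples:
        totals[customer] += cpu_seconds
    return dict(totals)


def fixed_hour_charge(reserved_hours, rate_per_cpu_hour):
    """Charge for whole reserved CPU hours, whether or not they were used."""
    return reserved_hours * rate_per_cpu_hour


def metered_charge(cpu_seconds, rate_per_cpu_hour):
    """Charge in proportion to the CPU time actually consumed."""
    return cpu_seconds / 3600.0 * rate_per_cpu_hour


# Hypothetical samples collected by probes over one billing window.
samples = [("acme", 1800), ("acme", 900), ("globex", 3600)]
usage = aggregate_cpu_seconds(samples)

rate = 0.10  # assumed price per CPU hour, for illustration only
for customer, cpu_seconds in sorted(usage.items()):
    fixed = fixed_hour_charge(1, rate)
    metered = metered_charge(cpu_seconds, rate)
    print(f"{customer}: fixed={fixed:.3f} metered={metered:.3f}")
```

In this made-up window, a customer that used 45 minutes of CPU would pay for three quarters of an hour under the metered model instead of the full reserved hour.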

Let’s look at a couple of examples of what Observability and Manageability capabilities can enable in the Cloud computing context:

Customers can pay at a more granular level, by throughput (requests or transactions processed per unit of time) at a given latency (and other SLA items), so your chargeback model makes sense in the context of your business activities.

Enable better business opportunities in a cost-effective manner. E.g. Mashery focuses on metering/instrumentation at the API level, but the proposition in this context is: open up your APIs and meter them to get an automated “business development filtering” mechanism. That is attractive if you’re in the right business, even on the “long tail”, i.e. when customer traffic/volume is not high but you at least have an ecosystem (hundreds of developers or partners) to support the long tail without burning through your bank account. See my earlier post on RESTful business for more context here; these considerations are more valid in the Cloud computing paradigm.
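The first example above, charging by throughput at a given latency, might be sketched as follows. This is only one possible chargeback scheme under stated assumptions: the 95th-percentile latency target, the price per 1000 requests and the `sla_credit` discount are hypothetical parameters, not anything a particular provider offers.

```python
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer in 1..100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


def chargeback(latencies_ms, window_seconds, sla_latency_ms,
               price_per_1000_requests, sla_credit=0.5):
    """Price a billing window by requests served, discounted if the latency SLA was missed.

    latencies_ms holds one observed latency per request served in the window.
    """
    requests = len(latencies_ms)
    p95 = percentile(latencies_ms, 95)
    sla_met = p95 <= sla_latency_ms
    charge = requests / 1000.0 * price_per_1000_requests
    if not sla_met:
        charge *= (1.0 - sla_credit)  # assumed credit back to the customer
    return {
        "throughput_rps": requests / window_seconds,
        "p95_ms": p95,
        "sla_met": sla_met,
        "charge": charge,
    }


# Hypothetical window: 100 requests in 10 seconds, 5 of them slow.
window = [100] * 95 + [900] * 5
bill = chargeback(window, window_seconds=10, sla_latency_ms=200,
                  price_per_1000_requests=2.0)
print(bill)
```

Tying the charge to both request volume and observed latency is what lets the bill track business activity rather than reserved capacity, and it depends directly on the probes and aggregation discussed earlier.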

Observability and Manageability are key “infrastructure” capabilities in the Cloud computing model that enable features/value propositions such as the ones discussed above. These are not new ideas, but the adaptation of time-tested ideas to a new computing paradigm (i.e. predominantly distributed computing over adaptive/dynamic infrastructure, though other variations will crop up).