Category Archives: Automation

When I lamented, in a previous post, that I couldn’t tell you about recent submissions to the DMTF Cloud incubator, one of those I had in mind was a submission from HP. I can now write this, because the author of the specification, Nigel Cook, has recently blogged about it. Unfortunately he is isn’t publishing the specification itself, just an announcement that it was submitted. Hopefully he is currently going through the long approval process to make the submitted document public (been there, done that, I know it takes time).

In the blog, Nigel makes a good argument for the need to go beyond a hypervisor-centric view of Cloud computing. Even at the IaaS layer there are cases of automated-but-not-virtualized deployment that have all the characteristics of Cloud computing and need to be supported by Cloud management APIs. Not to mention OS-level isolation like Solaris Containers.

Nigel also offers a spirited defense of SOAP-based protocols. I don’t necessarily agree with all his points (“one could easily map the web service definition I described to REST if that was important” suggests a “it’s just SOAP without the wrapper” view of REST), but I am glad he is launching this debate. We need to discuss this rather than assume that REST is the obvious answer. Remember, a few years ago SOAP was just as obvious an answer to any protocol question. It may well be that indeed REST comes out ahead of this discussion, but the process will force us to be explicit about what benefits of REST we are trying to achieve and will allow us to be practical in the way we approach it.

Events/alerts/notifications have been a central concept in IT management at least since the first SNMP trap was emitted, and probably even long before that. And yet they are curiously absent from all the Cloud management APIs/protocols. If you think that’s because “THE CLOUD CHANGES EVERYTHING” then you may have to think again. Over the last few days, two of the most experienced practitioners of Cloud computing pointed out that this omission is a real pain in the neck. RightScale’s Thorsten von Eicken was first to request “an event based interface instead of a request-reply based interface”, pointing out that “we run a good number of machines that do nothing but chew up 100% cpu polling EC2 to detect changes”. George Reese seconded and started to sketch a solution. And while these blog posts gave the issue increased visibility recently, it has been a recurring topic on the AWS Forum and other similar discussion boards for quite some time. For example, in this thread going back to 2006, an Amazon employee wrote that “this is a feature we’ve discussed recently and we’re looking at options” (incidentally, I see a post by Thorsten in that old thread). We’re still waiting.

Let’s look at what it would take to define such a feature.

I have some experience with events for IT management, having been involved in the WS-Notification family of specifications and having co-chaired the OASIS technical committee that standardized them. This post is not about foisting WS-Notification on Cloud APIs, but just about surfacing some of the questions that come up when you try to standardize such a mechanism. While the main use cases for WS-Notification came from IT (and Grid) management, it was supposed to be a generic mechanism. A Cloud-centric eventing protocol can be made simpler by focusing on fewer use cases (Cloud scenarios only). In addition, WS-Notification was marred by the complexity-is-a-sign-of-greatness spirit of the time . On this too, a Cloud eventing protocol could improve things by keeping IBM at bay simplicity in mind.

Types of event

When you pull the state of a resource to see if anything changed, you don’t have to tell the provider what kind of change you are interested in. If, on the other hand, you want the provider to notify you, then they need to know what you care about. You may not want to be notified on every single change in the resource state. How do you describe the changes you care about? Is there an agreed-upon set of states for the resource and you are only notified on state transitions? Can you indicate the minimum severity level for an event to be emitted? Who determines the severity of an event? Or do you get to specify what fields in the resource state you want to watch? What about numeric values for which you may not want to be notified of every change but only when a threshold is crossed? Do you get to specify a query and get notified whenever the query result changes? In WS-Notification some of this is handled by WS-Topics which I still like conceptually (I co-edited it) but is too complex for the task at hand.

Event formats

What format are the events serialized in? How is the even metadata captured (e.g. time stamp of observation, which may not be the same as the time at which the notification message was sent)? If the event payload is a representation of the new state of the resource, does it indicate what field changes (and what the old value was)? How do you keep event payloads consistent with the resource representation in the request/response interactions? If many events occur near the same time, can you group them in one notification message for better scalability?

Subscription creation

Presumably you need a subscription mechanism. Is the subscription set in stone when the resource is created? Or can you come later and subscribe? If subscription is an operation on the resource itself, how do you subscribe for events on something that doesn’t exist yet (e.g. “create a VM and notify me once it’s started”)? Do you get to set subscriptions on a per-resource-basis? Or is this a global setting for all the resources that you own? Can you have two different subscriptions on the same resource (e.g. a “critical events only” subscription that exist throughout the life of the resource, plus a “lots of events please” subscription that you keep for a few hours while troubleshooting)?

Subscription management

Do you get to come back and update/pause/delete a subscription? Do you get to change what filter the subscription carries? Or is it set in stone until the subscription expires? Can you change the delivery endpoint? What if events fail to be delivered? Does the provider cancel your subscription? After how many failures? Does it just pause it for a few hours? Keep trying?

Subscription expiration

Who sets the expiration period? The subscriber? Can the provider set a max duration? Do you get a warning message before the subscription expires? Can you renew a subscription or do you have to create a new one? Do you get a message telling you that it has expired? Where are these subscription-lifecycle messages sent? To the same endpoint as the regular messages? What if your subscription is being killed because your deliver endpoint is down, clearly it makes no sense to send the warning message to that same endpoint. Do you provide a separate “subscription management” endpoint (different from the event delivery endpoint) when you subscribe? Alternatively, does an email message get sent to the registered user who set the subscription?

Delivery reliability

How reliable do you want the notifications to be? Should the emitter retry until they’ve received a confirmation? How long do they keep messages that can’t be delivered? Some may have a very short shelf life while others are still useful weeks later. If you don’t have a reliable mechanism but you really “need to know about a lost server within a minute of it disappearing” (the example Georges gives) then in reality you may still have to poll just to make sure that an event wasn’t lost. If you haven’t received an event in a while, how can you test if the subscription is still working? Should subscriptions send a heartbeat message once a while?

Delivery mechanism

How do you deliver notifications? Do you keep HTTP connections open through tricks similar to how self-updating web pages work (e.g. COMET, long polling and soon WebSockets)? Or do you just provide a listener endpoint to which the notifier tries to connect (which, in the case of public cloud deployments, means you need to have a publicly-addressable listener, but hopefully not on the same Cloud infrastructure). Do you use XMPP? AMQP? Email? Can I have you hold my events and let me come pull them?

Security

Do you need to verify the origin of the events you receive? Or do you assume they may be forged and always initiate a connection to the provider to double-check? And on the other side, what are the security requirements for event delivery? If a user looses some of their privileges, do you have to go and cancel the still-active subscriptions that they created?

Throttling

Is there a maximum event rate? Do you get charged for the events the Cloud provider sends you? How do you make sure that someone doesn’t create a subscription pointing to the wrong endpoint (either erroneously or maliciously, e.g. DoS). Do you send a test message at registration asking the delivery endpoint to acknowledge that they indeed want to receive these notifications?

Conclusion

My goal is not to argue that we cannot have a simple yet good enough notification system or to scare anyone from attempting to define it. It’s just to show that it’s not as simple as it may seem at first blush. But there probably is a sweetspot and people like Thorsten and George are very well qualified to find it.

[UPDATED 2010/4/7: Amazon releases AWS Simple notification Service. Not just as an eventing feature for the Cloud API, as a generic notification service. Which can, of course, also carry Cloud management events. Though at this point you’re on your own to publish them from your instances, it doesn’t look like the AWS infrastructure can do it for you. Which means, for example, that you’re not going to be able to publish an event for a sudden crash.]

There have been someinterestingdiscussionsrecently about the relationship between Cloud management and SOA management/governance (run-time and design-time). My only regret is that they are a bit too focused on determining winners and loosers rather than defining what victory looks like (a bit like arguing whether the smartphone is the triumph of the phone over the computer or of the computer over the phone instead of discussing what makes a good smartphone).

To define victory, we need to answer this seemingly simple question: in what ways is the relationship between a VM and its hypervisor different from the relationship between two communicating applications?

More generally, there are three broad categories of relationships between the “active” elements of an IT system (by “active” I am excluding configuration, organization, management and security artifacts, like patch, department, ticket and user, respectively, to concentrate instead on the elements that are on the invocation path at runtime). We need to understand if/how/why these categories differ in how we manage them:

Deployment relationships: a machine (or VM) in a physical host (or hypervisor), a JEE application in an application server, a business process in a process engine, etc…

Infrastructure dependency relationships (other than containment): from an application to the DB that persists its data, from an application tier to web server that fronts it, from a batch job to the scheduler that launches it, etc…

Application dependency relationships: from an application to a web service it invokes, from a mash-up to an Atom feed it pulls, from a portal to a remote portlet, etc…

In the old days, the lines between these categories seemed pretty clear and we rarely even thought of them in the same terms. They were created and managed in different ways, by different people, at different times. Some were established as part of a process, others in a more ad-hoc way. Some took place by walking around with a CD, others via a console, others via a centralized repository. Some of these relationships were inventoried in spreadsheets, others on white boards, some in CMDBs, others just in code and in someone’s head. Some involved senior IT staff, others were up to developers and others were left to whoever was manning the controls when stuff broke.

It was a bit like the relationships you have with the taxi that takes you to the airport, the TSA agent who scans you and the pilot who flies you to your destination. You know they are all involved in your travel, but they are very distinct in how you experience and approach them.

It all changes with the Cloud (used as a short hand for virtualization, management automation, on-demand provisioning, 3rd-party hosting, metered usage, etc…). The advent of the hypervisor is the most obvious source of change: relationships that were mostly static become dynamic; also, where you used to manage just the parts (the host and the OS, often even mixed as one), you now manage not just the parts but the relationship between them (the deployment of a VM in a hypervisor). But it’s not just hypervisors. It’s frameworks, APIs, models, protocols, tools. Put them all together and you realize that:

the IT resources involved in all three categories of relationships can all be thought of as services being consumed (an “X86+ethernet emulation” service exposed by the hypervisor, a “JEE-compatible platform” service exposed by the application server, an “RDB service” expose by the database, a Web services exposed via SOAP or XML/JSON over HTTP, etc…),

they can also be set up as services, by simply sending a request to the API of the service provider,

not only can they be set up as services, they are also invoked as such, via well-documented (and often standard) interfaces,

they can also all be managed in a similar service-centric way, via performance metrics, SLAs, policies, etc,

your orchestration code may have to deal with all three categories, (e.g. an application slowdown might be addressed either by modifying its application dependencies, reconfiguring its infrastructure or initiating a new deployment),

the relationships in all these categories now have the potential to cross organization boundaries and involve external providers, possibly with usage-based billing,

as a result of all this, your IT automation system really needs a simple, consistent, standard way to handle all these relationships. Automation works best when you’ve simplified and standardize the environment to which it is applied.

If you’re a SOA person, your mental model for this is SOA++ and you pull out your SOA management and governance (config and runtime) tools. If you are in the WS-* obedience of SOA, you go back to WS-Management, try to see what it would take to slap a WSDL on a hypervisor and start dreaming of OVF over MTOM/XOP. If you’re into middleware modeling you might start to have visions of SCA models that extend all the way down to the hardware, or at least of getting SCA and OSGi to ally and conquer the world. If you’re a CMDB person, you may tell yourself that now is the time for the CMDB to do what you’ve been pretending it was doing all along and actually extend all the way into the application. Then you may have that “single source of truth” on which the automation code can reliably work. Or if you see the world through the “Cloud API” goggles, then this “consistent and standard” way to manage relationships at all three layers looks like what your Cloud API of choice will eventually do, as it grows from IaaS to PaaS and SaaS.

Your background may shape your reference model for this unified service-centric approach to IT management, but the bottom line is that we’d all like a nice, clear conceptual model to bridge and unify Cloud (provisioning and containment), application configuration and SOA relationships. A model in which we have services/containers with well-defined operational contracts (and on-demand provisioning interfaces). Consumers/components with well-defined requirements. APIs to connect the two, with predictable results (both in functional and non-functional terms). Policies and SLAs to fine-tune the quality of service. A management framework that monitors these policies and SLAs. A common security infrastructure that gets out of the way. A metering/billing framework that spans all these interactions. All this while keeping out of sight all the resource-specific work needed behind the scene, so that the automation code can look as Zen as a Japanese garden.

It doesn’t mean that there won’t be separations, roles, processes. We may still want to partition the IT management tasks, but we should first have a chance to rejigger what’s in each category. It might, for example, make sense to handle provider relationships in a consistent way whether they are “deployment relationships” (e.g. EC2 or your private IaaS Cloud) or “application dependency relationships” (e.g. SOA, internal or external). On the other hand, some of the relationships currently lumped in the “infrastructure dependency relationships” category because they are “config files stuff” may find different homes depending on whether they remain low-level and resource-specific or they are absorbed in a higher-level platform contract. Any fracture in the management of this overall IT infrastructure should be voluntary, based on legal, financial or human requirements. And not based on protocol, model, security and tool disconnect, on legacy approaches, on myopic metering, that we later rationalize as “the way we’d want things to be anyway because that’s what we are used to”.

In the application configuration management universe, there is a planetary collision scheduled between the hypervisor-centric view of the world (where virtual disk formats wrap themselves in OVF, then something like OVA to address, at least at launch time, application and infrastructure dependency relationships) and the application-model view of the world (SOA, SCA, Microsoft Oslo at least as it was initially defined, various application frameworks…). Microsoft Azure will have an answer, VMWare/Springsouce will have one, Oracle will too (though I can’t talk about it), Amazon might (especially as it keeps adding to its PaaS portfolio) or it might let its ecosystem sort it out, IBM probably has Rational, WebSphere and Tivoli distinguished engineers locked into a room, discussing and over-engineering it at this very minute, etc.

There is a lot at stake, and it would be nice if this was driven (industry-wide or at least within each of the contenders) by a clear understanding of what we are aiming for rather than a race to cobble together partial solutions based on existing control points and products (e.g. the hypervisor-centric party).

[UPDATED 2010/1/25: For an illustration of my statement that “if you’re a SOA person, your mental model for this is SOA++”, see Joe McKendrick’s “SOA’s Seven Greatest Mysteries Unveiled” (bullet #6: “When you get right down to it, cloud is the acquisition or provisioning of reusable services that cross enterprise walls. (…) They are service oriented architecture, and they rely on SOA-based principles to function.”)]

One of the heavily discussed Cloud topics in early 2009 was a Cloud Computing taxonomy. Now that this theme has died down (with limited results), and to start 2010 in a similar form, here is a proposal for a taxonomy of the benefits of Cloud Computing.

Just like the original Cloud Computing taxonomy only had three layers (IaaS/PaaS/SaaS), so does this taxonomy of Cloud benefits. The point of this post is to promote the third layer. I describe layers 1 and 2 mainly to better call out what’s specific about layer 3.

Layer 1 (infrastructure: “let someone else do it”)

This is the bare-bottom, inherent benefit of Cloud Computing: you don’t have to deal with the hardware. In practice, it means:

no need to worry about power/cooling,

on-demand provisioning/deprovisioning (machines appear/disappear in a way physical machines do not),

not responsible for physical security (though responsible for ensuring that the provider has an acceptable security level),

More specifically, more automated IT management. This does not require Cloud Computing (you can have a highly automated IT management environment on premise), but the move to Cloud Computing is the trigger that is making it really happen. While this capability is not an inherent benefit of Cloud Computing, the Cloud makes it:

Needed: You don’t get to put color tags on machines, you don’t get to bring a DVD to install a new application, you don’t get to open a machine to insert more memory, you don’t get to go retrieve a backup tape, label it and put it in a safe, etc. Of course loosing these “privileges” doesn’t sound bad considering that they are mostly chores, but it means that you have to design alternative (and mostly programmatic) ways to perform the functions that these tasks addressed.

Easier: Cloud environments are highly API-driven. Many IT tools from the previous generation were console-centric (people would go out and buy “a network/event/system management console“) with APIs/protocols as a secondary thought. In Cloud environments, tools are a lot more API-centric with the console as an adjunct (anyone has stats about the ratio of EC2 instances provisioned via the AWS console versus the APIs?). This is also why even though a lot of people wanted standard management protocols (of the WSDM/WS-Management generation), there wasn’t as much of a realization of their importance in the old environment (and not as much pressure to create them and eagerness to adopt them). The stakes and visibility are a lot higher in the Cloud environments and that’s why this second wave of protocols will have to succeed where the previous one came short.

More beneficial: Once you have automated IT management in a traditional data center, what you get is fewer employees needed and somewhat better utilization. But you are still gated by the time/process to purchase/install new machines and the cost of unused machines (at least with automation you don’t have to pay their power/cooling). You don’t get the “just what I need” level of infrastructure usage that the same automation work allows in a Cloud setting.

Layer 3 (applications: “do it right”)

In short, use the move to the Cloud as an opportunity to fix some of the key issues of today’s applications. Think of the Cloud switch as a second Y2K, 10 years later: like in 2000, not only are there things that the transition requires you to fix, there are also many things that aren’t exactly required to fix but still make sense to fix as part of the larger modernization effort. Of course the Cloud move is missing that ever-so-valuable project management motivator of a firm deadline, but hopefully competitive pressure can play a similar role.

What are these issues? Here is a partial list:

Security: at least authentication and authorization. We have SSO/Federation systems, both enterprise-type and Web-centric and they often suck in practice. Whether it’s because of the protocols, the implementations, the tools or the mindset. Plus, there are too many of them. As applications gained mouths and ears and started to communicate with one another, the problem became obvious. If, in the Cloud, you also want them to grow legs and be able to move around (wholly or in parts) then it really really has to get fixed. Not to mention the “all or nothing” delegation model that I am surprised hasn’t yet created a major disaster (let’s see what 2010 has in store). I suggested a band-aid fix earlier, but this needs a real solution (the Cloud Security Alliance provides some guidance in this document, see “domain 12” for IAM).

Get remote application interfaces right. It’s been discussed, manifesto’ed, buried and lampooned many times before (this was my humble take on it). Whether it’s because of WS-* or, more likely, java2wsdl we have been delayed in this but it simply has to happen. Call it SOAP, zenSOAP, REST, practical REST or whatever you want. Just make sure that all important functions and data are accessible via clear, documented, consistent, easy-to-use, on-the-wire interfaces. Once we have these interfaces, and only then, we can worry about reliably composing/orchestrating applications that cross organizational boundaries.

Related to the previous point, clean up the incestuous relationship between an application and its data. Actually, it’s not “its” data. It’s the data it works on.

Deliver application-centric IT management. Quit loosing and (badly) re-creating information: for example, an application deployment followed by a black-box discovery (“what did I just do”?). Or after-the-fact re-establishing correlations between events on different servers (“what was this about”?). Application management too often looks like a day in the life of a senile person.

Fault-tolerance and disaster recovery. It is too often lacking (or untested, which is the same) for applications that are just below the perceived threshold of requiring it to be done right. That threshold needs to be lowered and the move to the Cloud can be used to make this possible.

As I mentioned above, these are mostly not Cloud specific (though it is possible to create a Cloud connection for each). They are things that we have known about and tried to fix for a while. But the pace has been pretty slow and there is an opportunity for the Cloud transition to do more than just hand out the keys of the datacenter.

Lots of communities think of Cloud Computing as the realization of a vision that they have been pusuing for a while (“sure we didn’t call it Cloud back then but…”). Just ask the Grid folks, the dynamic data center folks (DCML, IBM’s “Autonomic Computing”, HP’s “Adaptive Enterprise”, Microsoft’s DSI), the ASP community, and those of us who toiled on what was going to be the SOAP-based management stack for all IT (e.g. my HP colleagues and I can selectively quote mentions of “adaptation mechanisms like resource reservation, allocation/de-allocation” and “management as a service” in this WSMF white paper from 2003 to portray WSMF as a precursor to all the Cloud APIs of today).

I thought of another such community today, as I ran into older OMG specifications: the Model-Driven Architecture (MDA) community. I have no idea what people in this community actually think of Cloud Computing, but it seems to me that PaaS is a chance to come close to part of their vision. For two reasons: PaaS makes it easier and more rewarding, all at the same time, to practice model-driven design. More bang for less buck.

Easier

My understanding of the MDA value proposition is that it would allow you to create a high-level design (at the level of something like an augmented version of UML) and have it automatically turn into executable code (e.g. that can run in a JEE or .NET container). I am probably making it sound more naive than it really is, but not by much. That’s a might wide gap to bridge, for QVT and friends, from UMLish to byte-code and it’s no surprise that the practical benefits of MDA are still to be seen (to put it kindly).

In a PaaS/SaaS world, on the other hand, you are mapping to something that is higher level than byte code. Depending on what types of PaaS containers you envision, some of the abstractions provided by these containers (e.g. business process execution, event processing) are a lot closer to the concepts manipulated in your PIM (Platform-independent model, the UMLish mentioned above). Thus a smaller gap to bridge and a better chance of it being automagical. Especially if you add a few SaaS building blocks to the mix.

More rewarding

Not only should it be easier to map a PIM to a PaaS deployment environments, the benefits you get once you are done are incommensurably greater. Rather than getting a dump of opaque auto-generated byte-code running in a regular JVM/CLR, you get an environments in which the design concepts (actors/services, process, rules, events) and the business model elements are first class citizens of the platform management infrastructure. So that you can monitor and set policies on the same things that you manipulate in you PIM. As opposed to falling down to the lowest common denominator of CPU/memory metrics. Or, god forbid, trying to diagnose/optimize machine-generated code.

We shall see

I wasn’t thinking of Microsoft SQL Server Modeling (previously known as Oslo) when I wrote this, but Doug Purdy’s tweet made the connection for me. And indeed, one can see in SQLSM+Azure the leading candidate today to realizing the MDA vision… minus the OMG MDA specifications.

[Preface: a few months ago I shared some thoughts about how REST was (or could) be applied to IT and Cloud management. Part 1 was a comparison of the RESTful aspects of four well-known IaaS Cloud APIs and part 2 was an analysis of how REST applies to configuration management. Both of these entries received well-informed reader comments BTW, so if you read the posts but didn’t come back for the comments you really owe it to yourself to do so now. At the time, I jotted down thoughts for subsequent entries in this series, but I never got around to posting them. Since the topic seems to be getting a lot of attention these days (especially in DMTF) I decided to go back to these notes and see if I could extract a few practical recommendations in the form of a wrap-up.]

The findings listed below should be relevant whether your protocol is trying to be truly RESTful, just HTTP-centric or even zen-SOAPy. Many of the issues that arise when creating a protocol that maps well to IT management use cases should transcend these variations and that’s what I try to cover.

The clear conclusion of both part 1 and part 2 was that the most relevant part of REST for IT and Cloud management is the use of hypermedia. IT management enjoys a head start on this compared to other domains, because its models are already rich in explicit relationships (e.g. CIM associations), as opposed to other business domains in which relationships are more implicit (to the end user at least). But REST teaches us that just having relationships in your model is not enough. They need to be exposed in a way that maps directly to the protocol, so that following a relationship is an infrastructure-level task, not an application-level task: passing an ID as a parameter for some domain-specific function is not it.

This doesn’t violate the rule to not mix the protocol and the model because the alignment should take place in the metamodel. XML is famously weak in that respect, but that’s where Atom steps in, handling relationships in a generic way. Similarly, support for references is, in addition to its accolade to Schematron, one of the main benefits of SML (extra kudos for apparently dropping the “EPR” reference scheme between submission and standardization, in favor of just the “URI” scheme). Not to mention RDFa and friends. Or HTTP Link headers (explained) for link-challenged types.

Finding #2: Put IDs on steroids

There is little to argue about the value of clearly identifying things of interest and we didn’t wait for the Web to realize this. But it is also one of the most vexing and complex problems in many areas of computing (including IT management). Some of the long-standing questions include:

Use an opaque ID (some random-looking string a characters) or an ID grounded in “unique” properties of the resource (if you can find any)?

At what point does a thing stop being the same (typical example: if I replace each hardware component of a server one after the other, at which point is it not the same server anymore? Does it make sense for the IT guys to slap an “asset id” sticker on the plastic box around it?)

How do you deal with reconciling two resources (with their own IDs) when you realize they represent the same thing?

REST guidelines don’t help with these questions. There often is an assumption, which is true for many web apps, that the application “owns” the resource. My “inbox” only exists as a resource within the mail server application (e.g. Gmail or an Exchange server). Whatever URI GMail assigns for it is the URI for my inbox, period. Things are not as simple when the resources exist outside of any specific application: take a server, for example: the board management controller (or the hypervisor in the case of a VM), the OS management layer and the management agent installed on the machine all have claims to report on the machine (and therefore a need to identify it).

To some extent, Cloud computing simplifies many of these issues by providing controllers that “own” infrastructure resources and can authoritatively identify them. But it really is only pushing the problem to the next level of the stack.

Making the ID a URI doesn’t magically answer these questions. Though it helps in that it lets you leverage reconciliation mechanisms developed around URIs (such as <atom:link rel=”alternate”> or owl:sameAs). What REST does is add another constraint to this ID mechanism: Make the IDs dereferenceable URLs rather than just URIs.

I buy into this. A simple GET on a resource URI doesn’t solve everything but it has so many advantages that it should be attempted in all cases. And make this HTTP GET please (see finding #6).

In this adoption of GET, we just have to deal with small details such as:

What URL do I use for resources that have more than one agent/controller?

How close to the resource do I point this URL? If it’s too close to it then it may change as the resource evolves (e.g. network changes) or be affected by the resource performance (e.g. a crashed machine or application that does not respond to its management API). If it’s removed from the resource, then I introduce a scope (e.g. one controller) within which the resource has to remain, which may cause scalability concerns (how many VMs can/should one controller handle, what if I want to migrate a VM across the ocean…).

These are somewhat corner cases (and the more automation and virtualization you get, the fewer possible controllers you have per resource). While they need to be addressed, they don’t come close to negating the value of dereferenceable IDs. In addition, there are plenty of mechanisms to help with the issues above, from links in the representations (obviously) to RDDL-style lightweight directory to a last resort “give Saint Peter a call” mechanism (the original WSRF proposal had a sub-specification called WS-RenewableReferences that would let you ask for a new version of an expired EPR but it was never published — WS-Naming in then-GGF also touched on that with its reference resolvers — showing once again that the base challenges don’t change as fast as technology flavors).

Implicit in this is the fact that URIs are vastly superior to EPRs. The latter were only just a band-aid on a broken system (which may have started back when WSDL 1.1 decided to define “ports” as message aggregators that can have only one URL) and it’s been more debilitating to SOAP than any other interoperability issue. Web services containers internalized this assumption to the point of providing a stunted dispatch mechanism that made it very hard to assign distinct URLs to resources.

Finding #3: If REST told you to jump off a bridge, would you do it?

Adherence to REST is not required to get the benefits I describe in this series. There is a lot to be inspired by in REST, but it shouldn’t be a religion. Sure, if you squint hard enough (and poke it here and there) you can call your interface RESTful, but why bother with the contortions if some parts are not so. As long as they don’t detract from the value of REST in the other parts. As in all conversions, the most fervent adepts of RPC will likely be tempted to become its most violent denunciators once they’re born again. This is a tired scenario that we don’t need to repeat. Don’t think of it as a conversion but as a new perspective.

Look at the “RESTful with many parameters?” comment thread on Stefan Tilkov’s excellent InfoQ introduction to REST. It starts with some shared distaste for parameter-laden URIs and a search for a more RESTful approach. This gets suggested:

You could do a post on some URI like ./query/product_dep which would create a query resource. Now you “add” products to the query either by sending a product uri list with the initial post or by calling post on ./query/product_dep/{id}. With every post to the query resource the get on the query resource would change.

Yeah, you could. But how about an RPC-like query operation rather than having yet another resource lifecycle to manage just for the sake of being REST-compliant? And BTW, how do you think any sane consumer of your API is going to handle this? You guessed it, by packaging the POST/POST/GET/DELETE in one convenient client-side library function called “query”. As much as I criticize RPC-centric toolkits (see finding #5 below), it would be justified in this case.

Either you understand why/how REST principles benefit you or you don’t. If you do, then use this understanding to interpret the REST principles to best fit your needs. If you don’t, then no amount of CONTENT-TYPE-pixie-dust-spreading, GET-PUT-POST-DELETE-golden-rule-following and HATEOAS-magical-incantation-reciting will help you. That’s the whole point, for me at least, of this tree-part investigation. Stefan says essential the same, but in a converse way, in his article: “there are often reasons why one would violate a REST constraint, simply because every constraint induces some trade-off that might not be acceptable in a particular situation. But often, REST constraints are violated due to a simple lack of understanding of their benefits.” He says “understand why you violate” and I say “understand why you obey”. It is essentially the same (if you’re into stereotypes you can attribute the difference to his Germanic heritage and my Gallic blood).

Even worse than bending your interface to appear RESTful, don’t cherry-pick your use cases to only keep those that you feel you can properly address via REST, leaving the others aside. Conversely, don’t add requirements just because REST makes them easy to support (interesting how quickly “why do you force me to manage the lifecycle of yet another resource just to run a query” turns into “isn’t this great, you can share queries among users and you can handle long-running queries, I am sure we need this”).

This is not to say that you should not create a fully RESTful system. Just that you don’t necessarily have to and you can still get many benefits as long as you open your eyes to the cost/benefits trade-off involved.

Finding #4: Learn humility from REST

Beyond the technology, there is a vibe behind REST design. You can copy the technology and still miss it. I described it in 2005 as Humble Architecture, and applied to SOA at the time. But it describes REST just as well:

More practically, this means that the key things to keep in mind when creating a service, is that you are not at the center of the universe, that you don’t know who is going to consume your service, that you don’t know what they are going to do with it, that you are not necessarily the one who can make the best use of the information you have access to and that you should be willing to share it with others openly…

In IT management terms, it means that you can RESTify your CMDB and your event console and your asset management software and your automation engine all you want, if you see your code as the ultimate consumer and the one that knows best, as the UI that users have to go through, the “ultimate source of truth” and the “manager of managers” then it doesn’t matter how well you use HTTP.

Finding #5: Beware of tools bearing gifts

To a large extent, the great thing about REST is how few tools there are to take it away from you. So you’re pretty much forced to understand what is going on in your contract as opposed to being kept ignorant by a wsdl2java type of toolkit. Sure, Java (and .NET) have improved in that regard, but really the cultural damage is done and the expectations have been set. Contrast this to “the ‘router’ is just a big case statement over URI-matching regexps”, from Tim Bray’s post on the Sun Cloud API, one of my main inspirations for this investigation.

REST is not inherently immune to the tool-controlling-the-hand syndrome. It’s just a matter of time until such tools try to make REST “accessible” to the “normal” developer (who can supposedly prevent thread deadlocks but not parse XML). Joe Gregorio warns about this in the context of WADL (to summarize: WADL brings XSD which leads to code generation). Keep this in mind next time someone states that REST is more “loosely coupled” than SOAP. It’s how you use it that matters.

Finding #6: Use screws, not glue, so we can peer inside and then close the lid again

The “view source” option is how I and many others learned HTML. It unfortunately created a generation of HTML monsters who never went past version 3.2 (the marbled background makes me feel young again). But it also fueled the explosion of the Web. On-the-wire inspection through soapUI is what allowed me to perform this investigation and report on it (WMI has allowed this for years, but WS-Management is what made it accessible and usable for anyone on any platform). This was, of course, in the context of SOAP which is also inspectable. Still, in that respect nothing beats plain HTTP which is why I recommend HTTP GET in finding #2 (make IDs dereferenceable) even though I don’t expect that the one-page-per-resource view is going to be the only way to access it in the finished product.

These (HTML source, on-the-wire XML and resource-description pages) rarely hit the human eye and yet their presence enables the development of the more commonly used views. Making it as easy as possible to see what is going on under the covers helps with learning, with debugging, with extending and with innovating. In the same way that 99% of web users don’t look at the HTML source (and 99.99% of them don’t see the HTTP requests) but the Web would not be what it is to them if this inspectability wasn’t been there to fuel its development.

Along the same line, make as few assumptions as possible about the consumers in your interfaces. Which, in practice, often means document what goes on the wire. WSDL/WADL can be used as a format, but they are at most one small component. Human-readable semantics are much more important.

Finding #7: Nothing is free

Part of what was so attractive about SOAP is everything you were going to get “for free” by using it. Message-level security (for all these use cases where your messages starts over HTTP, then hops onto a train, then get delivered by a carrier pigeon). Reliable messaging. Transactionality. Intermediaries (they were going to be a big deal in SOAP, as you can see in vestigial form today in the Nodes/Roles left in the spec – also, do you remember WS-Routing? I do.)

And it’s true that by now there is a body of specifications that support this as composable SOAP headers. But the lack of usage of these features contrasts with how often they were bandied in the early days of SOAP.

Well, I am detecting some of the same in the REST camp. How often have you heard about how REST enables caching? Or about how content types allows an ISP to compress images on the fly to speed up delivery over dial-up? Like in the SOAP case, these are real features and sometimes useful. It doesn’t mean that they are valuable to you. And if they are not, then don’t let them be used as justifications. Especially since they are not free. If caching doesn’t help me (because of low volume, because security considerations prevent a shared cache, etc) then its presence actually adds a cost to me, since I now have to worry whether something is cached or not and deal with ETags. Or I have to consistently remember to request the cache to be bypassed.

Finding #8: Starting by sweeping you front door.

Before you agonize about how RESTful your back-end management protocol is, how about you make sure that your management application (the user front-end) is a decent Web application? One with cool URIs , where the back button works, where bookmarks work, where the data is not hidden in some over-encompassing Flash/Silverlight thingy. Just saying.

***

Now for some questions still unanswered.

Question #1: Is this a flee market?

I am highly dubious of content negotiation and yet I can see many advantages to it. Mostly along the lines of finding #6: make it easy for people to look under the hood and get hold of the data. If you let them specify how they want to see the data, it’s obviously easier.

But there is no free lunch. Even if your infrastructure takes care of generating these different views for you (“no coding, just check the box”), you are expanding the surface of your contract. This means more documentation, more testing, more interoperability problems and more friction when time comes to modify the interface.

I don’t have enough experience with format negotiation to define the sweetspot of this practice. Is it one XML representation and one HTML, period (everything else get produced by the client by transforming the XML)? But is the XML Atom-wrapped or not? What about RDF? What about JSON? Not to forget that SOAP wrapper, how hard can it be to add. But soon enough we are in legacy hell.

Question #2: Mime-types?

The second part of Joe Gregorio’s WADL entry is all about Mime types and I have a harder time following him there. For one thing, I am a bit puzzled by the different directions in which Mime types go at the same time. For example, we have image formats (e.g. “image/png”), packaging/compression formats (e.g. “application/zip”) and application formats (e.g. “application/vnd.oasis.opendocument.text” or “application/msword”). But what if I have a zip full of PNG images? And aren’t modern word processing formats basically a zip of XML files? If I don’t have the appropriate viewer, maybe I’d like them to be at least recognized as ZIP files. I don’t see support for such composition and taxonomy in these types.

And even within one type, things seem a bit messy in practice. Looking at the registered applications in the “options” menu of my Firefox browser, I see plenty of duplication:

application/zip vs. application/x-zip-compressed

application/ms-powerpoint vs. application/vnd.ms-powerpoint

application/sdp vs. application/x-sdp

audio/mpeg vs. audio/x-mpeg

video/x-ms-asf vs. video/x-ms-asf-plugin

I also wonder at what level of depth I want to take my Mime types. Sure I can use Atom as a package but if the items I am passing around happen to be CIM classes (serialized to XML), doesn’t it make sense to advertise this? And within these classes, can I let you know which domain (e.g. which namespace) my resources are in (virtual machines versus support tickets)?

These questions may simply be a reflection of my lack of maturity in the fine art of using Mime types as part of protocol design. My experience with them is more of the “find the type that works through trial and error and then leave it alone” kind.

[Side note: the first time I had to pay attention to Mime types was back in 1995/1996, playing with non-parsed headers and the multipart/x-mixed-replace type to bring some dynamism to web pages (that was before JavaScript or even animated GIFs). The site is still up, but the admins have messed up the Apache config so that the CGIs aren’t executed anymore but return the Python code. So, here are some early Python experiments from yours truly: this script was a “pushed” countdown and this one was a “pushed” image animation. Cool stuff at the time, though not in a “get a date” kind of way.]

On the other hand, I very much agree with Joe’s point that “less is more”, i.e. that by not dictating how the semantics of a Mime type are defined the system forces you to think about the proper way to define them (e.g. an English-language RFC). As opposed to WSDL/XSD which gives the impression that once your XML validator turns green you’re done describing your interface. These syntactic validations are a complement at best, and usually not a very useful one (see “fat-bottomed specs”).

In comments on previous posts, Stu Charlton also emphasizes the value that Mime types bring. “Hypermedia advocates exposing a variety of links for such state-transitions, along with potentially unique media types to describe interfaces to those transitions.” I get the hypermedia concept, the HATEOAS approach and its very practical benefits. But I am still dubious about the role of Mime types in achieving them and I am not the only one with such qualms. I have too much respect for Joe and Stu to dismiss it entirely, but until I get an example that makes it “click” in practice for me I won’t sweat about Mime types too much.

Question #3: Riding the Zeitgeist?

That’s a practical question rather than a technical one, but as a protocol creator/promoter you are going to have to decide whether you market it as “RESTful”. If I have learned one thing in my past involvement with standards it is that marketing/positioning/impressions matter for standards as much as for products. To a large extent, for Clouds, Linked Data is a more appropriate label. But that provides little marketing/credibility humph with CIOs compared to REST (and less buzzword-compliance for the tech press). So maybe you want to write your spec based on Linked Data and then market it with a REST ribbon (the two are very compatible anyway). Just keep in mind that REST is the obvious choice for protocols in 2009 in the same way that SOAP was a few years ago.

Of course this is not an issue if you specification is truly RESTful. But none of the current Cloud “RESTful” APIs is, and I don’t expect this to change. At least if you go by Roy Fielding’s definition (or Paul’s handy summary):

A REST API must not define fixed resource names or hierarchies (an obvious coupling of client and server). Servers must have the freedom to control their own namespace. Instead, allow servers to instruct clients on how to construct appropriate URIs, such as is done in HTML forms and URI templates, by defining those instructions within media types and link relations. [Failure here implies that clients are assuming a resource structure due to out-of band information, such as a domain-specific standard, which is the data-oriented equivalent to RPC’s functional coupling].

I’ve reviewed lots of “REST APIs”, many of them privately for clients, and a common theme I’ve noticed is that most folks coming from a CORBA/DCE/DCOM/WS-* background, despite all the REST knowledge I’ve implanted into their heads, still cannot get away from the need to “specify the interface”. Sometimes this manifests itself through predefined relationships between resources, specifying URI structure, or listing the possible response codes received from different resources in response to the standard 4 methods (usually a combination of all those). I expect it’s just habit. But a second round of harping on the uniform interface – that every service has the same interface and so any service-specific interface specification only serves to increase coupling – sets them straight.

So the question of whether you want to market yourself as RESTful (rather than just as “inspired by the proper use of HTTP illustrated by REST”) is relevant, if only because you may find the father of REST throwing (POSTing?) tomatoes at you. There is always a risk in wearing clothes that look good but don’t quite fit you. The worst time for your pants to fall off is when you suddenly have to start running.

For more on this, refer to Ted Neward’s excellent Roy decoder ring where he not only explains what Roy means but more importantly clarifies that “if you’re not doing REST, it doesn’t mean that your API sucks” (to which I’d add that it is actually more likely to suck if you try to ape REST than if you allow yourself to be loosely inspired by it).

***

Wrapping up the wrap-up

There is one key topic that I had originally included in this wrap-up but decided to remove: extensibility. Mark Hapner brings it up in a comment on a previous post:

It is interesting to note that HTML does not provide namespaces but this hasn’t limited its capabilities. The reason is that links are a very effective mechanism for composing resources. Rather than composition via complicated ‘embedding’ mechanisms such as namespaces, the web composes resources via links. If HTML hadn’t provided open-ended, embeddable links there would be no web.

I am the kind of guy who would have namespace-qualified his children when naming them (had my wife not stepped in) so I don’t necessarily see “extension via links” as a negation of the need for namespaces (best example: RDF). The whole topic of embedding versus linking is a great one but this post doesn’t need another thousand words and the “REST in practice” umbrella is not necessarily the best one for this discussion. So I hereby conclude my “REST in practice for IT and Cloud management” series, with the intent to eventually start a “Linked Data in practice for IT and Cloud management” series in which extensibility will be properly handled. And we can also talk about querying (conspicuously absent from Cloud APIs, unless CMDBf is now a Cloud API) and versioning. As a teaser for the application of Linked Data to IT/Cloud, I will leave you with what Vint Cerf has to say.

[UPDATED 2010/1/27: I still haven’t written the promised “Linked Data in practice for IT and Cloud management” post, but this explanation of the usage of Linked Data for data.gov.uk pretty much says it all. I may still write a post describing how what Jeni says about government data applies to Cloud management APIs, but it’s almost too obvious to bother. Actually, there may be reasons why Cloud management benefits even more from Linked Data than UK government data, so it may still be worth a post. At some point. When I convince myself that it may influence things rather than be background noise.]

When I left HP for Oracle, in the summer of 2007, a friend made the argument that Cloud Computing (I think we were still saying “Utility Computing” at the time) would be the death of proprietary software.

There was a lot to support this view. EC2 was one year old and its usage was overwhelmingly based on open source software (OSS). For proprietary software, there was no clear understanding of how licensing terms applied to Cloud deployments. Not only would most sales reps not know the answer, back then they probably wouldn’t have comprehended the question.

Two and a half years later… well it doesn’t look all that different at first blush. EC2 seems to still be a largely OSS-dominated playground. Especially since the most obvious (though not necessarily the most accurate) measure is to peruse the description/content of public AMIs. They are (predictably since you can’t generally redistribute proprietary software) almost entirely OSS-based, with the exception of the public AMIs provided by software vendors themselves (Oracle, IBM…).

But these can be called reactive adaptations. At best (depending on the variability of your load and whether you use an Unlimited License Agreement), these moves have brought the “proprietary vs. OSS” debate in the Cloud back to the same parameters as for on-premise deployments. They have simply corrected the initial challenge of not even knowing how to license proprietary software for Cloud deployments.

But it’s not stopping here. What we are seeing is some of the parameters for the “proprietary vs. OSS” debate become more friendly towards proprietary software in the Cloud than on-premise. I am referring to the emergence of EC2 instances for which the software licenses are included in the per-hour rate for server instances. This pretty much removes license management as a concern altogether. The main example is Windows EC2 instances. As a user, you don’t have to deal with any Windows license consideration. You just request a machine and use it. Which of course doesn’t mean you are not paying for Windows. These Windows instances are more expensive than comparable Linux instances (e.g. $0.34 versus $0.48 per hour in the case of a “standard/large” instance) and Amazon takes care of paying Microsoft.

The removal of license management as a concern may not make a big difference to large corporations that have an Unlimited License Agreement, but for smaller companies it may take a chunk out of the reasons to use open source software. Not only do you not have to track license usage (and renewal), you never have to spend time with a sales rep. You don’t have to ask yourself at what point in your beta program you’ve moved from a legitimate use of the (often free) development license to a situation in which you need a production license. Including the software license directly in the cost of the base Cloud resource (e.g. the virtual machine instance) makes planning (and auto-scaling) easier: you use the same algorithm as in the “free license” situation, just with a different per hour cost. The trade-off becomes quantitative rather than qualitative. You can trade a given software stack against a faster CPU or more memory or more storage, depending on which combination serves your needs better. It doesn’t matter if the value you get from the instance comes from the software in the image or the hardware. This moves IaaS closer to PaaS (force.com doesn’t itemize your bill between hardware cost and software cost). Anything that makes IaaS more like PaaS is good for IaaS.

As with all things computer-related, the issue is going to get blurrier and then irrelevant . The great thing about software is that there is no solid line. In this case, we will eventually get more customized appliances (via appliance builders or model-driven appliance generation) blurring the line between installed software and appliance-based software.

I was referring to a blurring line in terms of how software is managed, but it’s also true in terms of how it is licensed.

There are of course many other reasons (than license management) why people use open source software rather than proprietary. The most obvious being the cost of the license (which, as we have seen, doesn’t go away but just gets incorporated in the base Cloud instance rate). Or they may simply prefer a given open source product independently of any licensing aspect. Some need access to the underlying code, to customize/improve it for their purpose. Or they may be leery of depending on one entity for the long-term viability of their platform. There may even be some who suspect any software that they don’t examine/compile themselves to contain backdoors (though these people are presumably not candidates for Cloud deployments in he first place). Etc. These reasons remain pretty much unchanged in the Cloud. But, anecdotally at least, removing license management concerns from both manual and automated tasks is a big improvement.

If Cloud providers get it right (which would require being smarter than wireless service providers, a low bar) and software vendors play ball, the “proprietary vs. OSS” debate may become more favorable to proprietary software in the Cloud than it is on-premise. For the benefit of customers, software vendors and Cloud providers. Hopefully Amazon will succeed where telcos mostly failed, in providing a convenient application metering/billing service for 3rd-party software offered on top of their infrastructural services (without otherwise getting in the way). Anybody remembers the Minitel? Today we’d call that a “Terminal as a Service” offering and if it did one thing well (beyond displaying green characters) it was billing. Which reminds me, I probably still have one in my parent’s basement.

[Note: every page of this blog mentions, at the bottom, that “the statements and opinions expressed here are my own and do not necessarily represent those of Oracle Corporation” but in this case I will repeat it here, in the body of the post.]

[UPDATED 2009/12/29: The Register seems to agree. In fact, they come close to paraphrasing this blog entry:

“It’s proprietary applications offered by enterprise mainstays such as Oracle, IBM, and other big vendors that may turn out to be the big winners. The big vendors simply manipulated and corrected their licensing strategies to offer their applications in an on-demand or subscription manner.

Amazonian middlemen

AWS, for example, now offers EC2 instances for which the software licenses are included in the per-hour rate for server instances. This means that users who want to run Windows applications don’t have to deal with dreaded Windows licensing – instead, they simply request a machine and use it while Amazon deals with paying Microsoft.”]

Chuck Shotton recently made a compelling case (“Twitter with a Brain“) for Twitter tools to allow the user to change the protocol endpoint. That is, instead of always going to twitter.com, you can tell your Twitter client to send all requests to myTweetInterceptor.me.com. Why would you do this? You should read his blog entry, but in short his point is that the intermediary can add all kinds of new features that neither the Twitter client nor Twitter itself support. As always in computer science, a new level of decoupling adds opportunities for extensions (and breakage too, of course).

I fully agree with what he writes and I would very much like to see his call to action answered. In fact, I want more than what he is asking for. So here is my call to action:

1) It’s not just Twitter

Why just Twitter? This should be true for any client using any protocol. Why not also the APIs for the various Google and Yahoo services? The APIs for the other social networks beyond Twitter? For shopping sites like Amazon and EBay? Etc. And of course to all the various Cloud providers out there. Just because I am using the Amazon EC2 API it doesn’t mean I necessarily want the requests to go straight to Amazon. Client tools should always make the endpoint configurable, period.

2) It’s not just the clients, it’s also (and especially) the third party sites

Chuck’s examples are about features that the Twitter clients could provide but don’t, so an intermediary would be an easy way to hack support for them (others presumably include modifying the client – if open source -, writing a plug-in for it – it there is such mechanism -, or running a network interceptor on the local client – unless the protocol is encrypted-).

That’s nice and I’d love to see this, but the big deal for me is less with clients and more with third party sites. You know, all these sites that ask for your Twitter login/password. Or those that ask for your GMail/Yahoo account info to retrieve a list of your contacts. I never grant these requests, but I would consider it if they allowed me to tell them what endpoint URL to use. For example, rather than using my Twitter login to go straight to twitter.com, they would use a login/password that I create and talk to twitterIntermediary.vambenepe.com. The requests would be in the exact same shape as what they send today to Twitter, just directed to another URL. There, I could have a proxy that only allows some requests (e.g. “update twitter background image” but not “send update”) and forwards them using my real Twitter credentials. Or, for email accounts, I could have a proxy that allows requests that read my address book but not those that read my mails. The goal here is not to add features, it is to delegate trust in a fine-grained (and audited) manner. This, to me, is the burning need, rather than a 3rd place to implement Twitter lists.

I would probably write these proxies using a PaaS platform like the Google App Engine. Or maybe even Yahoo Pipes. I have long struggled to think of use cases for which Yahoo Pipes hits the sweetspot, and this may well be it. Especially if people write modules to handle specific APIs (e.g. a “Twitter API” module that shows all operations and lets you enable/disable them one by one in a pipe). The one thing missing would be a way for a pipe to keep a log of its invocations, for auditing.

You want access to my email and social network accounts? Give me the ability to filter you requests and you’ll get access. If it’s blind trust you want, I am afraid I have a very limited supply.

[Note: I wanted to add this as a comment on Chuck’s blog, but he doesn’t seem to allow them: “go start your own blog and/or shut up and eat your vegetables” is his recommendation. Since I already have my own blog, I guess I don’t have to eat my vegetables if I don’t want to. I just hope my kids don’t learn about this rule or they’ll be blogging in no time.]

[UPDATED 2009/11/30: WRT to Chuck’s request, it looks like it’s being done already. But no luck with the third party sites so far, which is what I most want to see.]

The Fujitsu submission is called an “API design”. What this means is that it doesn’t tell you anything about what things look like on the wire. It could materialize as another “XML over HTTP” protocol (with or without SOAP wrapper), but it could just as well be implemented as a binary RPC protocol. It’s really more of an esquisse of a resource model than a remote API. The only invocation-related aspect of the document is that it defines explicit operations on various resources (though not their input and outputs). This suggest that the most obvious mapping would be to some XML/HTTP RPC protocol (SOAPy or not). In that sense, it stands out a bit from the more recent Cloud API proposals that take a “RESTful” rather than RPC approach. But in these days of enthusiastic REST-washing I am pretty sure a determined designer could produce a RESTful-looking (but contorted) set of resources that would channel the operations in the specification as HTTP-like verbs on these resources.

Since there are few protocol aspects to this “API design”, if we are to compare it to other “Cloud APIs”, it’s really the resource model that’s worth evaluating. The obvious comparison is to the EC2 model as it provides a pretty similar set of infrastructure resources (it’s entirely focused on the IaaS layer). It lacks EC2 capabilities around availability, security and monitoring. But it adds to the EC2 resource model the notions of VDC (“virtual data center”, a container of IaaS resources), VSYS (see below) and a lightly-defined EFM (Extended Function Module) concept which intends to encompass all kinds of network/security appliances (and presumably makes up for the lack of security groups).

The heart of the specification is the VSYS and its accompanying VSYS Descriptor. We are encouraged to think of the VSYS Descriptor as an extension of OVF that lets you specify this kind of environment:

Example content for a VSYS Descriptor

By forcing the initial VSYS instance to be based on a VSYS Descriptor, but then allowing the VSYS to drift away from the descriptor via direct management actions, the specification takes a middle-of-the-road approach to the “model-based versus procedural” debate. Disciples of the procedural approach will presumably start from a very generic and unconstrained VSYS Descriptor and, from there, script their way to happiness. Model geeks will look for a way to keep the system configuration in sync with a VSYS Descriptor.

How this will work is completely undefined. There is supposed to be a getVSYSConfiguration() operation which “returns the configuration information on the VSYS” but there is no format/content proposed for the response payload. Is this supposed to return every single config file, every setting (OS, MW, application) on all the servers in the VSYS? Surely not. But what then is it supposed to return? The specification defines five VSYS attributes (VSYSID, creator, createTime, description and baseDescriptor) so I know what getSYSAttributes() returns. But leaving getVSYSConfiguration() undefined is like handing someone an airplane maintenance manual that simply reads “put the right part in the right place”. A similar feature is also left as an exercise to the reader in section that sketches an “external configuration service”. We are provided with a URL convention to address the service, but zero information about the format and content of the configuration instructions provided to the VServer.

EC2 has a keypair access mechanism for Linux instances and a clumsy password-retrieval system for Windows instances. The Fujitsu proposal adopts the lowest common denominator (actually the greatest common divisor, but that’s a lost rhetorical cause): random password generation/retrieval for everyone.

I also noticed the statement that a VServer must be “implemented as a virtual machine” which is an unnecessary constraint/assumption. The opposite statement is later made for EFMs, which “can be implemented in various ways (e.g. run on virtual machines or not)”, so I don’t want to read too much into the “hypervisor-required” VServer statement which probably just needs an editorial clean-up.

From a political perspective this specification looks more like a case of “can I play with you? I brought some marbles” than a more aggressive “listen everybody, we’re playing soccer now and I am the captain”. In other words, this may not be as much an attempt to shape the outcome of the incubator as much as to contribute to its work and position Fujitsu as a respected member whose participation needs to be acknowledged.

While this is an alternative submission to the vCloud API, I don’t think VMWare will feel very challenged by it. The specification’s core (VSYS Descriptor) intends to build on OVF, which should be music to VMWare’s ears (it’s the model, not the protocol, which is strategic). And it is light enough on technical details that it will be pretty easy for vCloud to claim that it, indeed, aligns with the intent of this “design”.

All in all, it is good to see companies take the time to write down what they expect out of the DMTF work. And it’s refreshing to see genuine single-company contributions rather than pre-negotiated documents by a clique. Whether they look more like implementable specifications of position paper, they all provide good input to the DMTF Cloud incubator.

Andi Mann recently wrote an interesting post about virtual appliances . He uses the domain name pleasediscuss.com for his blog so I figured I’d do just that. More specifically, I have three comments on his article.

Opaque or transparent appliance

Andi’s concerns about the security and management problems posed by virtual appliances are real, but he seems to assume that the content of the appliance is necessarily opaque to the customer and under the responsibility of the appliance provider. Why can’t a virtual appliance be transparent in the sense that the customer is able to efficiently manage at least some aspects of the software installed on it? “You can’t put agents on most virtual appliances, they don’t come with WMI, and most have only a GUI for management” says Andi. Why can’t an appliance come with an agent (especially in these days of consolidation where many vendors provide many layers of the stack – hypervisor / OS / application container / application / management tools – including their agent)? Why can’t it implement a standard management API (most servers nowadays implement WBEM, WS-Management and/or IPMI pre-boot – on the motherboard – which is a lot more challenging to do than supporting a similar protocol in a virtual appliance). Andi is really criticizing the current offering more than the virtual appliance model per se and in this I can join him.

Let me put it differently, since this is probably just a question of definition: what would Andi call a virtual appliance that does expose management APIs for its infrastructure (e.g. WS-Management for the OS, JMX for the java stack) or that comes with an agent (HP, IBM, BMC, Oracle…) installed on it?

Such an appliance (let’s call it a “transparent virtual appliance” for now) doesn’t provide all the commonly claimed benefits of an appliance (zero config/admin) but as Andi points out these benefits come with major intrinsic drawbacks. A transparent virtual appliance still drastically simplifies installation (especially useful for test/dev/demo/POC). It doesn’t entirely free you of monitoring and configuration but at least it provides you with a very consistent and controlled starting point, manageable from the start (no need to subsequently install an agent). In addition, it can be made “just enough” (just enough OS, just enough app server…) to require a lot less maintenance than an application stack that you assemble yourself out of generic parts. We’ll always have trade offs between how optimized/customized it is versus how uniform your overall environment can be, but I don’t see the use of an appliance as a delivery mechanism as necessarily cornering you into a completely opaque situation, from a management perspective.

Those who attended Oracle Open World a few weeks ago were treated to an example of such an appliance, if they attended any of the sessions that covered Oracle’s Appliance Builder (the main one was, I believe, Virtualizing Oracle Fusion Middleware in the Modern Data Center, in case you have access to the Open World On Demand replay and slides). I believe it’s probably the same content that @jayfry3 was shown when he tweeted about “Oracle is demoing their private cloud self-service app”. These appliances are not at all opaque from a management perspective. To the contrary, they are highly manageable, coming with an Enterprise Manager agent installed that can manage everything in the appliance (and when that “everything” doesn’t include the OS, it’s because there isn’t one thanks to JRockit Virtual Edition, making things slimmer, faster, safer and more manageable). And of course the OVM-based environment in which you deploy these appliances is also managed by Enterprise Manager. OK, my point here wasn’t to go into marketing mode, but this is cool stuff and an example of what virtual appliances should be. BTW, this was also demonstrated during Hasan Rizvi’s keynote at OpenWorld, including the management of these systems through Enterprise Manager.

In the long run it’s irrelevant

As with all things computer-related, the issue is going to get blurrier and then irrelevant . The great thing about software is that there is no solid line. In this case, we will eventually get more customized appliances (via appliance builders or model-driven appliance generation) blurring the line between installed software and appliance-based software.

Waiting for PaaS

Towards the end of his post, Andi paints an optimistic vision of the future: “I also think that virtual appliances have a bright future – but in some ways I continue to see them as a beta version of what could (or should) come next. By adding in capabilities for responsible and accountable management, they could form the basis of more fully-functional virtual service management containers. These in turn could form the basis of elastic, mobile, network-deployed, responsible cloud appliances that deliver complete end-to-end service management without regard to physical location or domain of control.”

I mostly agree with this vision, though when I describe it it is in the guise of a PaaS platform. Where your appliance (which today goes from the OS all the way to the app) has shrunk to an application template that you deploy in the PaaS environment (rather than in a hypervisor). If/when the underlying PaaS environment has reached the right level of management automation you get all the benefits of an appliance while maintaining the consistency of your environment and its adherence to your management policies (because the environment is the PaaS platform and its management is driven from your policies).

[As is often the case, this started as a comment (on Andi’s blog) and quickly outgrew that environment, leading to this new post. Plus, Andi’s blog is brand new and seems to be well worth spreading the word about (Andi himself is under-marketingit).]

Remember 2006? Things were starting to fall into place for IT management integration and automation:

SDD was already on its way to cleanly describe/package/manage the lifecycle of simple and composite applications alike,

the first version of SML came out to capture all the relevant constraints of complex and composite systems and open the door to “desired-state management”,

the CMDBf effort was started to seamlessly integrate all sources of configuration and provide a bird-eye view of your entire IT infrastructure, and

the WSDM/WS-Management convergence/reconciliation was announced and promised to free management consoles from supporting many resource discovery, collection and control mechanisms and from having platform/library dependencies between the manager and its targets.

It looked like we were a year or two from standardization on all these and another year or two from shipping implementations. Things were looking good.

Good news: the schedule was respected. SDD, SML and CMDBf are now all standards (at OASIS, W3C and DMTF respectively). And today the Eclipse COSMOS project announced the release of COSMOS 1.1 which implements them all. The WSDM/WS-Management convergence is the only one that didn’t quite go according to the plan but it is about to come out as a standard too (in a pared-down form).

Bad news: nobody cares. We’ve moved on to “private clouds”.

Having been involved with these specifications in various degrees (a little bit on SDD, a fair amount on SML and a lot on CMDBf and WSDM/WS-Management) I am not as detached as my sarcastic tone may suggest. But as they say in action movies, “don’t let sentiments get in the way of the mission”.

There is still a chance to reuse parts of this stack (e.g. the CMDBf query language) and there are lessons to learn from our errors. The over-promising, the technical misjudgments, the political bickering, the lack of concrete customer validation, etc. To some extent this work was also victim of collateral damages from the excesses of WS-* (I am looking at you WS-Addressing). We also failed to notice the rise of the hypervisor in our peripheral vision.

I tried to capture some important lessons in this post-mortem. For the edification of the cloud generation. I also see a pendulum in action. Where we over-engineered I now see some under-engineering (overly granular interaction models, overemphasis on the virtual machine as the unit of everything, simplistic constraint models, underestimation of config/patching issues…). Things will come around and may eventually look familiar (suggested exercise: compare PubSubHubBub with WS-Notification).

Cloud development toolkits like Libcloud (for Python) and jcloud (for Java) have been around for some time, but over the last two months they have been joined by several other open source contenders. They all claim to abstract the on-the-wire Cloud management protocols sufficiently to let you access different Clouds via the same code; while at the same time providing objects in your programming language of choice and saving you the trouble of dealing with on-the-wire messages. By focusing on interoperability, they slot themselves below the larger role of a “Cloud broker” (which also deals with tasks like transfer and choice). Here is the list, starting with the more recent contenders:

DeltaCloud shares the same goal of translating between different Cloud management protocols but they present their own interface as yet another Cloud REST API/protocol rather than a language-specific toolkit. More along the lines of what UCI is trying to do (not sure what’s up with that project, I recorded my skepticism earlier and am still waiting to be pleasantly surprised).

Of course there are also programming toolkits that are specific to one Cloud provider. They are language-specific wrappers around one Cloud management protocol. AWS protocols (EC2, S3, etc…) represent the most common case, for example amazon-ec2 (a Ruby Gem), Power-EC2Dream (in C# which gives it the tantalizing advantage of being invokable via PowerShell) and typica (for Java). For Clouds beyond AWS, check out the various RightScale Ruby Gems.

The main point of this entry was to list the cross-Cloud development toolkits in the bullet list above. But if you’re in the mood for some pontification you can keep reading.

For some reason, what used to be called “protocols” is often called “APIs” in Cloud settings. Witness the Sun Cloud “API” or the vCloud “API” which only define XML formats for on-the-wire messages. I have never heard of CIM/XML over HTTP, WSDM or WS-Management being referred as APIs though they occupy a very similar place. They are usually considered “protocols”.

It’s a just question of definition whether an on-the-wire protocol (rather than a language-specific set of objects/methods) qualifies as an “Application Programming Interface”. It’s not an “interface” in the Java sense of the term. But I can “program” against it so it could go either way. On this blog I have gone along with the “API” term because that seemed widely used, though in verbal conversations I have tended to stick to “protocol”. One problem with “API” is that it pushes you towards mixing the “what” and the “how” and not respecting the protocol/model dichotomy.

Where is becomes relevant is when you start to see language-specific APIs for Cloud control pop-up as listed above. You now have two classes of things called “API” and it gets a bit confusing. Is it time to bring back the “protocol” term for on-the-wire definitions?

As a developer, whether you’re better off eating your Cloud noodles using chopsticks (on-the-wire protocol definitions) or a fork (language-specific APIs) is an important decision that will stay with you and may come back to bit you (e.g. when the interfaces are versioned). There is a place for both of course, but if we are to learn anything from WS-* it’s that we went way too far in the “give me a java stub” direction. Which doesn’t mean there is no room for them, but be careful how far from the wire semantics you get. It become even trickier when your stub tries not jsut to bridge between XML and Java but also to smooth out the differences between several on-the-wire protocols, as the toolkits above do. The hope, of course, is that there will eventually be enough standardization of on-the-wire protocols to make this a moot point.

What happened to the separation between the model and the protocol in management APIs? For all the arguments we had in the design of WSDM and WS-Management, this was one fundamental concept that took little discussion before everyone agreed: that the protocol (the interaction model and the on-the-wire shape of the messages used) should be defined in a way that is agnostic to the type of resource being managed (computers, elevators or toasters — the perennial silly example). To this end, WSDM took pains to release MUWS (Management Using Web Services) and MOWS (Management Of Web Services) as two different specifications.

Contrast that to the different Cloud APIs (there is a new one released every other day). If they have one thing in common it is that they happily ignore this principle and tackle protocol concerns alongside the resource model. Here are my guesses as to why that is:

1) It’s a land grad

The goal is not to produce the best long-term API, it’s to be out early, to stake your claim and to gain leverage, so that you can steer the final standard close to your implementation. Editorial niceties like properly factoring the specification are not major concerns, there will be plenty of time for this during the standardization process. In fact, leaving such improvements for the standardization phase is a nice way to make it look like the group is not just rubberstamping, while not changing much that actually impacts your implementation. The good old “give them something insignificant to argue about” trick. It works BTW.

As an example of how rushed some of these submissions can be, did you notice that what VMWare submitted to DMTF this week is the vCloud API Specification v0.8 (a 7-page document that is simply a list of operations), not the accompanying vCloud API programming guide v0.8 which is ten times longer and is the real specification, the place where the operation semantics, payload formats and protocol considerations are actually described and without which the previous document cannot possibly be implemented. Presumably the VMWare team was pressed to release on time for a VMWorld announcement and they came up with this to be able to submit without finishing all the needed editorial work. I assume this will follow soon and in the meantime the DMTF members will retrieve the programming guide from the VMWare site in order to make sense of what was submitted to them.

This kind of rush is not rare in the history of specification submission, even those that have been in the work for a long time . For example, the initial CBE submission by IBM had “IBM Confidential” all over the specification and a mention that one should retrieve the most up to date version from the “Autonomic Computing Problem Determination Offering Team Notes Database” (presumably non-IBMers were supposed to break into the server).

If lack of time is the main reason why all these APIs do not factor out the protocol aspects then I have no problem, there is plenty of time to address it. But I suspect that there may be other reasons, that some may see it as a feature rather than a bug. For example:

2) Anything but WS-*

SOAP-based interfaces (WS-* or WS-DeathStar) have a bad rap and doing anything in the opposite way is a crowd pleaser (well, in the blogosphere at least). Modularity and composition of specifications is a major driving force behind the WS-* work, therefore it is bad and we should make all specifications of the new REST order stand-alone.

3) Keep it simple

A more benevolent way to put it is the concern to keep things simple. If you factor specifications out you put on the developer the burden of assembling the complete documentation, plus you introduce versioning issues between the parts. One API document that fully describes the contract is simpler.

4) We don’t need no stinking’ protocol, we have HTTP

Isn’t this the protocol? Through the magic of REST, all that’s needed is a resource model, right? But if you look in the specifications you see sections about authentication, fault handling, long-lived operations, enumeration of long result sets, etc… Things that have nothing to do with the resource model.

So what?

Why is this confluence of model and protocol in one specification bad? If nothing else, the “keep it simple” argument (#3) above has plenty of merits, doesn’t it? Aren’t WSDM and WS-Management just over-engineered?

They may be, but not because they offer this separation. Consider the following practical benefits of separating the protocol from the model:

1) We can at least agree on one part

Thanks to the “REST is the new black” attitude in Cloud circles, there are lots of commonalities between these various Cloud APIs. Especially the more recent ones, those that I think of as “second generation” APIs: vCloud, Sun API, GoGrid and OCCI (Amazon EC2 is the main “1st generation” Cloud API, back when people weren’t too self-conscious about not just using HTTP but really “doing REST”). As an example of convergence between second generation specifications, see for example, how vCloud and the Sun API both use“202 Accepted” and a dedicated “status” resource to handle long-lived operations. More comparisons here.

Where they differ on such protocol matters, it wouldn’t be hard to modify one’s implementation to use an alternative approach. Things become a lot more sensitive when you touch the resource model, which reflects the actual capabilities of the Cloud management infrastructure. How much flexibility in the network setup? What kind of application provisioning? What affinity/anti-affinity control level? Can I get block-level storage? Etc. Having to implement the other guy’s interface in these matters is not just a matter of glue code, it’s a major product feature. As a result, the resource model is a much more strategic control point than the protocol. Would you rather dictate the terms of a contract or the color of the ink in which it is printed?

That being the case, I suspect that there could be relatively quick and painless agreement on that first layer of the Cloud API: a set of protocol considerations, based on HTTP and REST, that provide a resource control framework with support for security, events, long-running operations, faults, many-as-one semantics, enumeration, etc. Or rather, that if there is to be a “quick and painless” agreement on anything related to Cloud computing standards it can only be on something that is limited to protocol concerns. It doesn’t have to be long and complex. It doesn’t have to be factored in 8 different specifications like WS-* did. It can be just one specification. Keep it simple, ignore all use cases that aren’t related to Cloud Computing. In the end, please call it MUR (Management Using REST)… ;-)

2) Many Clouds, one protocol to rule them all

Whichever Cloud taxonomy strikes your fancy (I am so disappointed that SADIST-PIMP hasn’t caught on), it’s pretty clear that there will not be one kind of Cloud. There will be at least some IaaS, some PaaS and plenty of SaaS. There will not be one API that provides control of them all, but they can share a base protocol that will make life a lot easier for developers. These Clouds won’t be isolated, developers will use them as a continuum.

3) Not just one access model

As much as it makes sense to start from simple and mostly synchronous operations, there will be many different interaction models for Cloud Computing. In addition to the base operations, we may get more of a desired-state/blueprint interaction pattern, based on the same resource model. Or, somewhere in-between, some kind of stored execution flow where modules are passed around rather than individual operations. Also, as the level of automation increases you may want a base framework that is more event-friendly for rapid close-loop management. And there are other considerations involved (like resource monitoring, policies…) not currently covered by these specifications but that can surely reuse the protocol aspects. By factoring out the resource model, you make it possible for these other interaction patterns to emerge in a compatible way.

The current Cloud APIs are not far away from this clean factoring. It would be an easy task to extract protocol considerations as a separate document, in large part due to the fact that REST prevents you from burying the resource model inside convoluted operation semantics. To some extent it’s just a partitioning issue, but the same can be said of many intractable and bloody armed conflicts around the world… Good fences make good neighbors in the world of IT specs too.

[UPDATE: Soon after this entry went to “press” (meaning soon after I pressed the publish button), I noticed this report of a “REST-*” proposal by Mark Little of RedHat/JBOSS. I will reserve judgment until Mark has blogged about it or I have seen some other authoritative description. We may be talking about the same thing here. Or maybe not. The REST-* name surprises me a bit as I would expect opponents of such a proposal to name it just this way. We’ll see.]

[UPDATE 2009/9/6: Apparently I am something like the 26th person to think of the “one protocol/API to rule them all” sentence. We geeks have such a shallow set of shared cultural references it’s scary at times.]

[UPDATED 2009/11/12: Lori MacVittie has a very nice follow-up on this, with examples and interesting analogies. Check it out.]

VMWare published its vCloud API yesterday (it was previously only available to a few partners) and submitted it to the DMTF, as had been previously announced. So much for my speculations involving IBM.

It may be time to update the Cloud API comparison. After a very quick first pass, vCloud looks quite similar to the Sun Cloud API (that’s a compliment). For example, they both handle long-lived operations via a “202 Accepted” complemented by a resource that represents the progress (“status” for Sun, “task” for vCloud). A very visible (but not critical) difference is the use of JSON (Sun) versus XML (vCloud).

As expected, OVF/OVA is central to vCloud. More once I have read the whole specification.

In any case, things are going to get interesting in the DMTF Cloud incubator. I there a path to adoption?Assuming that Amazon keeps sitting it out, what will the other Cloud vendors with an API (Rackspace, GoGrid, Sun…) do? I doubt they ever had plans/aspirations to own or even drive the standard, but how much are they willing to let VMWare do it? How much does Citrix/Xen want to steer standards versus simply implement them in the context of the Xen Cloud project? What about OGF/OCCI with which the DMTF is supposedly collaborating?How much support is VMWare going to receive from its service provider partners? How much traction does VMWare have with Cisco, HP (server division) and IBM on this? What are the plans at Oracle and Microsoft? Speaking of Microsoft, maybe it will at some point want its standard strategy playbook back. At least when VMWare is done using it.

IBM, Fujitsu and CA have recently proposed a charter for a new OASIS technical committee, called the Symptoms Autonomic Framework (SAF) TC. Including a specification candidate and other submitted documents, listed here.

For context, you need to remember the Common Base Event (CBE) specification that IBM has shopped around for a long time, initially hand in hand with Cisco. As always, the Cover Pages offer the best references on this saga. CBE was submitted to WSDM and came out (in a much-emaciated form) as the WSDM Event Format (WEF) in WSDM 1.1 part 2.

Because so many parts of CBE were left on the floor of the WSDM editing room and because WSDM itself saw little adoption, I have always been expecting IBM to bring CBE back in some form. When I heard of SAF, my instinct was that this was it.

Not so. SAF is meant to sit on top of an event system like CBE. It turns selected events/situations and other data points into symptoms and tells you what to do next. Its focus is on roles, process and knowledge bases. Not on the event format. The operations and payloads defined are not for exchanging events, they are for exchanging “symptoms”, “syndromes”, “prescriptions” and “protocols”.

As the terms show, the specification espouses the medical dialect (even “protocol” is meant to be understand in the medical sense, not as in “HTTP” or “FTP”). While I have been guilty of a similar analogy myself, I also think that if there is one area from which we don’t want to learn in terms of automation, system integration and proper use of IT in general, it’s the medical field. So let’s be careful not to push the analogy too far (section 8.1 of the SAF specification is a fun read, but not necessarily very compelling).

BTW, since when do we use terms strongly associated with one company in the name of standards group (“autonomic”)?

More fundamentally, the main question is what the chances of success of this effort are. Its a huge endeavor (“enabling interoperable diagnosis and treatment of complex systems”) and it tries to structure activities that have been going on for a long time and in many different ways. No-one will adopt this structure for its own sake, so the question is what practical benefits can be derived from this level of standardization. For example, how reliably can incoming events be mapped in practice to symptoms, how efficiently can symptoms be matched to protocols (in typical IBM fashion there seems to be a big “XPath is my hammer” assumption lurking), etc…

The discussion on the charter is currently open in OASIS if you want to weigh in.

What benefits does REST provide for configuration management (in traditional data centers and in Clouds)?

Part 1 of the “REST in practice for IT and Cloud management” investigation looked at Cloud APIs from leading IaaS providers. It examined how RESTful they are and what concrete benefits derive from their RESTfulness. In part 2 we will now look at the configuration management domain. Even though it’s less trendy, it is just as useful, if not more, in understanding the practical value of REST for IT management. Plus, as long as Cloud deployments are mainly of the IaaS kind, you are still left with the problem of managing the configuration of everything that runs of top the virtual machines (OS, middleware, DB, applications…). Or, if you are a glass-half-full person, here is another way to look at it: the great thing about IaaS (and host virtualization in general) is that you can choose to keep your existing infrastructure, applications and management tools (including configuration management) largely unchanged.

At first blush, REST is ideally suited to configuration management.

The RESTful Cloud APIs have no problem retrieving resource descriptions, but they seem somewhat hesitant in the way they deal with resource-specific actions. Tim Bray described one of the challenges in his well-considered Slow REST post. And indeed, applying REST to these “do something that may take some time and not result exactly in what was requested” scenarios is a lot less straightforward than when you’re just doing document/data retrieval. In contrast you’d think that applying REST to the task of retrieving configuration data from a CMDB or other configuration store would be a no-brainer. Especially in the IT management world, where we already have explicit resource models and a rich set of relationships defined. Let’s give each resource a URI that responds to HTTP GET requests, let’s turn the associations into hyperlinks in the resource presentation, let’s mint a MIME type to represent this format and we are out of the office in time for a 4:00PM tennis game when all the courts are available (hopefully our tennis partners are as bright as us and can get out early too). This “work smarter not harder” approach would allow us to present this list of benefits in our weekly progress report:

-1- A URI-based scheme makes the protocol independent of the resource topology, unlike today’s data stores that usually struggle to represent relationships between stores.

-2- It is simpler to code against than CIM-over-HTTP or WS-Management. It is cross-platform, unlike WMI or JMX.

-3- It makes it trivial to browse the configuration data from a Web browser (the resources themselves could provide an HTML representation based on content-type negotiation, or a simple transformation could generate it for the Web browser).

-4- You get REST-induced caching and scalability.

In the shower after the tennis game, it becomes apparent that benefit #4 is largely irrelevant for IT management use cases. That the browser in #3 would not be all that useful beyond simple use cases. That #2 is good for karma but developers will demand a library that hides this benefit anyway. And that the boss is going to say that he doesn’t care about #1either because his product is “the single source of truth” so it needs to import from the other configuration store, not reference them.

Even if we ignore the boss (once again) it only leaves #1 as a practical benefit. Surprise, that’s also the aspect that came out on top of the analysis in part 1 (see “the API doesn’t constrain the design of the URI space” highlight, reinforced by Mark’s excellent comment on the role of hypertext). Clearly, there is something useful for IT management in this “hypermedia” thing. This will largely be the topic of part 3.

There are also quite a few things that this RESTification of the configuration management store doesn’t solve:

-1- The ability to query: “show me all the WebLogic instances that run on a Windows host and don’t have patch xyz applied”. You don’t have much of a CMDB if you can’t answer this. For an analogy, remember (or imagine) a pre-1995 Web with no search engine, where you can only navigate by starting from your browser home page and clicking through static links step by step, or through bookmarks.

-2- The ability to retrieve the configuration change history and to compare configurations across resources (or to a reference configuration).

This is not to say that these two features cannot be built on top of a RESTful IT resource model. Just that they are the real meat of configuration management (rather than a simple resource-by-resource configuration browser) and that your brilliant re-architecture hasn’t really helped in addressing them. Does a RESTful foundation make these features harder to build? Not necessarily, but there are some tricky aspects to take care of:

-1- In hypermedia systems, the links are usually part of the resource representation, not resources of their own. In IT management, relationships/associations can have their own lifecycle and configuration properties.

-2- Be careful that you can really maintain the address of a resource. It’s one thing to make sure that a UUID gets maintained as a resource configuration changes, it’s another to ensure that a dereferenceable URI remains unchanged. For example, the admin server of a cluster may move over time from one node to another.

More fundamentally, the ability to deal with multiple resources at the same time and/or to use the model at different levels of granularity is often a challenge. Either you make your protocol more complex to account for this or your pollute your resource model (with a bunch of arbitrary “groups”, implicit or explicit).

We saw this in the Cloud APIs too. It typically goes something like this: you can address an individual server (called “foo”) by sending requests to http://Cloudprovider.com/server/foo. Drop the “foo” part of the URL and now you can address all the servers, for example to retrieve their configuration or possibly to reboot them. This gives me a way of dealing with multiple resources at time, but only along the lines pre-defined by the API. What if I want to deal only with the servers that host nodes of a given cluster. Sorry, not possible. What if the servers have different hosts in their URIs (remember, “the API doesn’t constrain the design of the URI space”)? Oops.

WS-Management, in the SOAP world, takes this one step further with Selectors, through which you can embed some kind of query, the result of which is what you are addressing in your message. Or, if all you want to do is GET, you can model you entire datacenter as one giant virtual XML doc (a document which is never assembled in practice) and use WSRF/WSDM’s “QueryExpression” or WS-Management’s “FragmentTransfer” to the same effect. BTW, I have issues with the details of how these mechanisms work (and I have described an alternative under the motto “if you are going to suffer with WS-Addressing, at least get some value out of it”).

These are all non-RESTful atrocities to a RESTafarian, but in my mind the Cloud REST API reviewed in part 1 have open Pandora’s box by allowing less-qualified URIs to address all instances of a class. I expect you’ll soon see more precise query parameters in these URIs and they’ll look a lot like WS-Management Selectors (e.g. http://Cloudprovider.com/server?OS=Linux&CPUType=X86). Want to take bets about when a Cloud API URI format with an embedded regex first arrives?

When you need this, my gut feeling is that you are better off not worrying too much about trying to look RESTful. There is no shame to using an RPC pattern in the right circumstances. Don’t be the stupid skier who ends up crashing in a tree because he is just too cool for the using snowplow position.

One of the most common reasons to deal with multiple resources together is to run queries such as the “show me all the WebLogic instances that run on a Windows host and don’t have patch xyz applied” example above. Such a query mechanism recently became a DMTF standard, it’s called CMDBf. It is SOAP-based and doesn’t attempt to have anything to do with REST. Not that it didn’t cross the mind of a bunch of people, lead by Michael Coté when CMDBf first emerged (read the comments too). But as James Governor rightly predicted in the first comment, Coté heard “dick” from us on this (I represented HP in CMDBf and ended up being an editor of the specification, focusing on the “query” part). I don’t remember reading the entry back then but I must have since I have been a long time Coté fan. I must have dismissed the idea so quickly that it didn’t even register with my memory. Well, it’s 2009 now, CMDBf v1 is a DMTF standard and guess what? I, and many other SOAP-the-world-till-it-shines alumni, are looking a lot more seriously into what’s in this REST thing (thus this series of posts for me). BTW in this piece Coté also correctly predicted that CMDBf would be “more about CMDB interoperation than federation” but that didn’t take as much foresight (it was pretty obvious to me from the start).

Frankly I am still not sure that there is much benefit from REST in what CMDBf does, which is mostly a query interface. Yes the CMDBf query and its response go over SOAP. Yes in this case SOAP is mostly a useless wrapper since none of the implementations will likely support any WS-* SOAP header (other than paying the WS-Addressing tax). Sure we could remove it and send plain XML over HTTP. Or replace the SOAP wrapper with an Atom wrapper. Would it be anymore RESTful? Not one bit.

And I don’t see how to make it more RESTful. There are plenty of things in the periphery the query operation that can be made RESTful, along the lines of what I described above. REST could make the discovery/reconciliation tasks of the CMDB more efficient. The CMDBf query result format could be improved so that from the returned elements I can navigate my way among resources by following hyperlinks. But the query operation itself looks fundamentally RPCish to me, just like my interaction with the Google search page is really an RPC call that happens to return a Web page full of hyperlinks. In a way, this query (whether Google or CMDBf) can at best be the transition point from RPC to REST. It can return results that open a world of RESTful requests to you, but the query invocation itself is not RESTful. And that’s OK.

In part 3 (now available), I will try to synthesize the lessons from the Cloud APIs (part 1) and configuration management (this post) and extract specific guidance to get the best of what REST has to offer in future IT management protocols. Just so you can plan ahead, in part 4 I will reform the US health care system and in part 5 I will provide a practical roadmap for global nuclear disarmament. Suggestions for part 6 are accepted.

Like librarians, we IT wonks tend to like things cataloged. To date, the last instance of this has been SOA governance and its various registries and repositories. With UDDI limping along as some kind of organizing standard for the effort. One issue I have with UDDI is that its technical awkwardness is preventing us from learning from its failure to realize its ambitious goals (“e-business heaven”). It would be too easy to attribute the UDDI disappointment to UDDI. Rather, it should be laid at the feet of unreasonable initial expectations.

The SOA governance saga is still ongoing, now away from the spotlight and mostly from an implementation perspective rather than a standard perspective (by the way, what’s up with GIF?). Instead, the spotlight has turned to Cloud computing and that’s what we are supposedly going to control through cataloging next.

Earlier this year, I commented on the release of an ITSM catalog product for Cloud computing (though I was addressing the convergence of ITSM and Cloud computing more than catalogs per se).

More recently, Lori MacVittie related SOA governance to the need for Cloud catalogs. She makes some good points, but I also see some familiar-looking “irrational exuberance”. The idea of dynamically discovering and invoking a Cloud service reminds me too much of the initial “yellow pages” scenarios for UDDI (which quickly got dropped in favor of a more modest internal governance focus).

I am not convinced by the reason Lori gives for why things are different this time around (“one of the interesting things virtualization brings to the table that SOA did not is the ability to abstract management of services”). She argues that SOA governance only gave you access to the operational WSDL of a Web service, while Cloud catalogs will give you access to their management API. But if your service is an IT service, then your so-called management API (launch/configure/control VMs) is really its operational interface. The real management interface is the one Amazon uses under the cover and they are not going to expose it to you anymore than your bank is going to expose its application server administration console to you (if they do, move your money somewhere else before someone does it for you).

After all, isn’t SOA governance pretty close to a SaaS catalog which is itself a small part of the overall Cloud (IaaS+PaaS+SaaS) catalog question? If we still haven’t succeeded in the smaller scope, what are the odds of striking gold quickly in the larger effort?

Someanalysts take a more pragmatic view, involving active brokers rather than simply a new DNS record type. I am doubtful about these brokers (0.2 probability, as Gartner would put it) but at least this moves the question onto business terms (leverage, control) rather than technical terms. Which is where the battle will be fought.

When it comes to Cloud catalogs, I think they are needed (if only for the categorization of Cloud services that they require) but will only play a supporting role, if any, in any move towards dynamic Cloud provisioning. As with SOA governance it’s as an internal tool, supported by strong processes, that they will be most useful.

Throughout human history, catalogs have been substitutes for control more often than instruments of control. Think of astronomy, zoology and… nephology for example. What kind will IT Cloud catalogs be?

Did you know that you can now SSH to a Windows machine over WS-Management and its is a documented protocol that can be implemented from any platform and programming language? This is big news to me and I am surprised that, as management protocol geek, I hadn’t heard about it until I started to search MSDN for a related but much smaller feature (file transfer over WS-Management).

OK, so it’s not exactly SSH but it is a remote shell. In fact it comes in two flavors, which I think of as “dumb SSH” and “super SSH”.

Dumb SSH

Dumb SSH is the ability to remotely run a DOS-like command shell over WS-Management. Anyone who has had to use the Windows command shell as a scripting language ersatz understands why I call it “dumb”. I expect that even in Microsoft most would agree (otherwise why would they have created PowerShell?).

Still, you can do quite a few basic things using the Windows command shell and being able to do them remotely is not something to sneer at if you’re building a management product. If you’re interested, you need to read MS-WSMV, the WS-Management Protocol Extensions for Windows Vista specification (available here as a PDF). By the name of the specification, I expected a laundry list of tweaks that the WS-Management and WS-CIM implementation in Vista makes on top of the standards (e.g. proprietary extensions, default values, unsupported features, etc). And there is plenty of that, in sections 3.1, 3.2 and 3.3. The kind of “this is my way” decisions that you’d come to expect from Microsoft on implementing standards. A bit frustrating when you know that they pretty much wrote the standard but at least it’s well documented. Plus, being one of those that forced a few changes in WS-Management between the Microsoft submission and the DMTF standard (under laments from Microsoft that “it’s too late to change Longhorn”) I am not really in position to complain that “Longhorn” (now Vista) indeed deviates from the standard.

But then we get to section 3.4 and we enter a new realm. These are not tweaks to WS-Management anymore. It’s a stateful tunneling protocol going over WS-Management, complete with base-64-encoded streams (stdin, stdout, stderr) and signals. It gives you all you need to run a remote command shell over WS-Management. In addition to the base Windows command shell, it also supports “custom remote shells”, which lets you leverage the tunneling mechanism for another protocol than the one made of Windows shell commands. For example, you could build an HTTP emulation over this on top of which you could run WS-Management on top of which… you know where this is going, don’t you?

A more serious example of such a “custom remote shell” is PowerShell, which takes us to…

Super SSH

Imagine SSH with the guarantee that the shell that you log into on the other side was a Python interpreter, complete with full access to the server’s management API. I think that would qualify as “super SSH”, at least for IT management purposes (no so exciting if all you want to do is check your email with mutt). This is equivalent to what you get when the remote shell invoked over WS-Management (or rather WS-Management plus Vista extensions described above) is PowerShell instead of the the Windows command shell. I have always liked PowerShell but it hasn’t really be all that relevant to me (other than as a design study) because of its ties to the Windows platform. Now, thanks to MS-PSRP, the PowerShell Remoting Protocol specification (PDF here) we are only a good Java (or Python, or Ruby) library away from being able to invoke PowerShell commands from any language, anywhere.

I have criticized over-reliance on libraries to shield developers from XML for task that really would be much better handled by simply learning to use XML. But in this case we really need a library because there is quite a bit of work involved in this protocol, most of which has nothing to do with XML. We have to fragment/defragment packets, compress/decompress messages, not to mention the security aspects. At this point you may question what the value of doing all this on top of WS-Management is, for which I respectfully redirect you to your local Microsoft technology evangelist, MVP or, in last resort, sales representative.

Even if PowerShell is not your scripting language of choice, you can at least use it to create a bootstrap mechanism that will install whatever execution engine you want (e.g. Ruby) and download scripts from your management server. At which point you can sign out of PowerShell. For some reason, I get the feeling that we just got one step closer to Puppet managing Windows machines.

A few closing comments

First, while the MS-WSMV part that lets you run a basic command shell seems already available (Vista SP1, Win2K3R2, Win2K8, etc), the PowerShell part is a lot greener. The MS-PSRP specification is marked “preliminary” and the supported platform list only contains Windows 7 and Win2K8R2. Nevertheless, the word from Microsoft is that they have the intention to make this available on XP and above shortly after Windows 7 comes out. Let’s hope this is the case, otherwise this technology will remain largely irrelevant for years to come.

The other caveat comes from the standard angle. In this post, I only concern myself with the technical aspects. If you want to implement these specifications you have to also take into account that they are proprietary specifications with no IP grant (“Microsoft has patents that may cover your implementations of the technologies described in the Open Specifications. Neither this notice nor Microsoft’s delivery of the documentation grants any licenses under those or any other Microsoft patents”) and fully controlled by Microsoft (who could radically change or kill them tomorrow). As to whether Microsoft plans to eventually standardize them, I would again refer you to your friendly local Microsoft representative. I can just predict, based on the content of the specification, that it would make for some interesting debates in the DMTF (or wherever they may go).

This is a big step towards the citizenship of Windows machines in an automated datacenter (and, incidentally, an endorsement for the “these scripts have to grow up” approach to automation). As Windows comes to parity with Unix in remote scripting abilities, the only question remaining (well, in addition to the pesky license) will be “why another mechanism”. Which could be solved either via standardization of MS-PSRP, de-facto adoption (PowerShell on Suse Linux is only one Microsoft-to-Novell check away) or simply using PowerShell as just a bootstrapping mechanism for Puppet or others, as mentioned above.

[UPDATE: On a related topic, these twoposts describe ways to transfer files over WS-Management.]

That’s not going to improve the relationship between the Insight Control group (part of the server hardware group, of Compaq heritage) and the BTO group (part of HP Software, of HP heritage plus many acquisitions) in HP. The Microsoft relationship was already a point of tension when they were still called SIM and OpenView, respectively.

You can call it a “Cloud operating system”, an “adaptive infrastructure framework” or simply “IT management middleware” (my vote) as you prefer. It’s the software that underpins the automation engine of your Cloud. You can’t have a Cloud without an automation engine, unless you live in a country where IT admins run really fast, never push the wrong button, never plug a cable in the wrong port, can interpret blinking lights at a rate of 9,600 bauds and are very cheap. The automation engine is what technically makes a Cloud. That engine is an application whose business is to know what needs to be done to maintain the IT environment you use in a state that is acceptable to you at any point in time (where you definition of “acceptable” can evolve). Like any application, you want to keep its business logic neatly isolated from the mundane tasks that it relies on. These mundane tasks include things like:

collecting events and delivered them to the right place

collecting metrics of the different IT elements

discovering available resources and accessing them (with or without agents)

performing coordinated actions on IT elements

maintaining an audit of management actions

securing the management interactions

managing long-running tasks and processes

etc

That’s what management middleware does. It doesn’t automate anything by itself, but it provides an environment in which it is feasible to implement automation. This middleware is useful even if you don’t automate anything, but it often doesn’t get called out in that scenario. On the other hand, automation means capturing more business logic in software which makes it imperative to clearly layer concerns, at which point the IT management middleware can be more clearly identified within the overall IT management infrastructure.

This is happening in many different ways. I can count seven roads to IT management middleware, seven ways in which it is emerging as an identifiable actor in data centers. Each road represents a different history and comes with different assumptions and mindsets. And yet, they go after the same base problem of enabling IT management automation. Here is a quick overview of these seven roads.

Road #1: “these scripts have to grow up”

This road starts from all the scripts common in IT operations and matures them. It’s based on the realization that they are crucial business assets, just like the applications that they support. And that they implement reusable patterns. Alex Honor described it well here. Puppet and Powershell are in this category.

Road #2: “it’s just another integration job”

We’ve been doing computer integration almost since the second piece of software was written. There are plenty of mechanisms available to do so. IT management is just another integration problem, so let’s present it in a way that allows us to use our favorite integration tools on it. That’s the driver behind the use of Web Services for management integration (e.g. WSDM/WS-Management): create interfaces to manageable resources so that existing middleware (mostly J2EE application servers, along with their WS stack) can be used to solve the “enterprise IT management” integration problems in a robust and reliable way. The same logic is behind the current wave of REST-based IT management efforts (see this presentation). REST is a good integration approach, so let’s turn IT resources into RESTful resources so we can apply this generic integration mechanism to enterprise IT management. Different tool, but same logical approach. Which is why they can be easily compared.

Road #3: “top-down”

This is the “high road”, the one most intellectually satisfying and most promising in the long term. But also the one with this highest hurdle off the gate. In this approach, you create a top-down model of your system and you try to mediate management actions through this model. But for this to be practical, you need to hit the sweetspot in many dimensions. You need composable sub-models at a level of granularity that makes them maintainable. You need to force enough uniformity but not so much as to loose all optimizations. You need to decide which of the myriads of configuration variables you include in your model. Because you can’t take the traditional approach of “I’ll model it and display it to my user who can decide what to do with it”. Because the user now is a piece of software and it can’t make a judgment of whether it is ok if parameter foo differs from the desires state or not. This has been worked on for a long time (remember HP’s UDC?) with steady but slow progress. Elastra has some of the most interesting technology there, and a healthy dose of realism and opportunism to make it work.

Think of it as SCA component/composites but not just for software artifacts. Rather, it’s SCA for all IT elements, with wires and policies that are just rich enough to allow meaningful optimization but not too rich to be unmanageable. If you can pull off such model-driven IT management middleware, then the automation code almost writes itself on top of it.

That was the road followed by the Big Four. Buy enough of their products (CMDBs, network management console, operations console, service desk, etc…) and you’ll get APIs that allow you to leverage their discovery, collection, eventing and process management features. So you can write your automation on top. At least on paper. In reality, these APIs are too inconsistent and import/export-oriented to really support SOA-style (or REST-style) integration, even though they usually have a SOAP and/or plain HTTP option available. It’s a challenge just to get point to point integration between these products, even more a true set of management services that can be orchestrated. These vendors know it and rather than turning their product suites into a real SOI (Service Oriented Infrastructure) they have decided to build/buy automation engines on the side that can be hard-wired with the existing IT management products. That’s your IT management middleware but it comes bundled with the automation engine rather than as an independent layer.

Road #5: “management integration is another feature of our hypervisor”

If the virtual machine (in the x86 virtualization sense of the term, a.k.a. a fake machine) is the basic building block of your IT infrastructure then hypervisor interfaces to manipulate these VMs are pretty much all you need in terms of middleware to build data center automation on top, right? Are we done then? Not really, since there is a lot more to an application than the VMs on which it runs. Still, hypervisors bring the potential of automation to what used to be a hardware domain and as such play a big part in the composition of the IT management middleware of modern data centers.

Road #6: “make it all the same”

From what I understand about how Yahoo, Google (see section 2.1 “System Health Infrastructure”), Microsoft (see the “device manager” and “collection service” parts of Autopilot) and others run their Web applications, they have put a lot of work in that management middleware and have made simplicity a key design goal for it. To that end, they are willing to accept drastic limitations at both ends of the IT infrastructure chain: at the bottom, they actively limit the heterogeneity of resources in the data center. At the top, they limit the capabilities exposed to the business applications. In an extreme scenario, all servers are the same and all the business applications are written to a few execution/persistence/communication environments (think GAE SDK as an example). Even if you only approximate this ideal, it’s a dramatic simplification that makes your IT management middleware much simpler and thinner.

Road #7: “it’s the Grid”

The Grid computing and HPC (High Performance Computing) communities have long been active in this area. There is a lot of relevant expertise in all the Grid work, but we also need to understand the difference between IT management middleware and the Grid infrastructure as defined by OGSI. OGSI defines a virtualization layer on which to build applications. It doesn’t define how to deploy, manage and configure the physical datacenter infrastructure that allows OGSI interfaces to be exposed to consumers. With regards to HPC, we also need to keep in mind that the profile of the applications is very different from your typical enterprise application (especially the user-driven apps as opposed to batch jobs). In HPC environments, CPUs can run at full capacity for days and new requests just go in a queue. The Web applications of your typical enterprise don’t have this luxury and usually need spare capacity.

All these approaches can complement each other and I am not trying to pin each product/vendor to just one approach. In this post (motivated by this podcast), Stu Charlton discusses the overlap and differences between some of these approaches. Rather than a taxonomy of products, this list of seven roads to IT management middleware is simply a cultural history, a reading guide to understand the background, vocabulary and assumptions of the different solutions. This list cuts across the declarative versus procedural debate (#1 is clearly procedural, #3 is clearly declarative, the others could go either way).

[UPDATED 2009/6/23: Stu has a somewhat related (similary structured but much more entertainingly writen) list of Cloud Computing approaches. I feel good that I have one more item in my list than him.]