The prospects got even bleaker today, at least according to the usually-well-informed Mary Jo Foley, who writes: “Multiple contacts of mine are telling me that Microsoft has decided to shelve Quadrant and ‘refocus’ M.” Is “M” the end of the SDM/SML/M model-driven management approach at Microsoft? Or is the “refocus” a hint that M is returning “home” to address IT management use cases? Time (or Doug) will tell…

While we’re talking about Microsoft and IT automation, I have one piece of free advice for the Microsofties: people *really* want to SSH into Windows servers. Here’s how I know. This blog rarely talks about Microsoft but over the course of two successive weekends over a year ago I toyed with ways to remotely manage Windows machines using publicly documented protocols. In effect, showing what to send on the wire (from Linux or any platform) to leverage the SOAP-based management capabilities in recent versions of Windows. To my surprise, these posts (1, 2, 3) still draw a disproportionate amount of traffic. And whenever I look at my httpd logs, I can count on seeing search engine queries related to “windows native ssh” or similar keywords.

If heterogeneous Cloud is something Microsoft cares about they need to better leverage the potential of the PowerShell Remoting Protocol. They can release open-source Python, Java and Ruby client-side libraries. Alternatively, they can drastically simplify the protocol, rather than its current “binary over SOAP” (you read this right) incarnation. Because the poor Kridek who is looking for the “WSDL for WinRM / Remote Powershell” is in for a nasty surprise if he finds it and thinks he’ll get a ready-to-use stub out of it.

That being said, a brave developer willing to suck it up and create such a Python/Ruby/Java library would probably make some people very grateful.

Oracle recently published a Cloud management API on OTN and also submitted a subset of the API to the new DMTF Cloud Management working group. The OTN specification, titled “Oracle Cloud Resource Model API”, is available here. In typical DMTF fashion, the DMTF-submitted specification is not publicly available (if you have a DMTF account and are a member of the right group you can find it here). It is titled the “Oracle Cloud Elemental Resource Model” and is essentially the same as the OTN version, minus sections 9.2, 9.4, 9.6, 9.8, 9.9 and 9.10 (I’ll explain below why these sections have been removed from the DMTF submission). Here is also a slideset that was recently used to present the submitted specification at a DMTF meeting.

So why two documents? Because they serve different purposes. The Elemental Resource Model, submitted to DMTF, represents the technical foundation for the IaaS layer. It’s not all of IaaS, just its core. You can think of its scope as that of the base EC2 service (boot a VM from an image, attach a volume, connect to a network). It’s the part that appears in all the various IaaS APIs out there, and that looks very similar, in its model, across all of them. It’s the part that’s ripe for a simple standard, hopefully free of much of the drama of a more open-ended and speculative effort. A standard that can come out quickly and provide interoperability right out of the gate (for the simple use cases it supports), not after years of plugfests and profiles. This is the narrow scope I described in an earlier rant about Cloud standards:

I understand the pain of customers today who just want to have a bit more flexibility and portability within the limited scope of the VM/Volume/IP offering. If we really want to do a standard today, fine. Let’s do a very small and pragmatic standard that addresses this. Just a subset of the EC2 API. Don’t attempt to standardize the virtual disk format. Don’t worry about application-level features inside the VM. Don’t sweat the REST or SOA purity aspects of the interface too much either. Don’t stress about scalability of the management API and batching of actions. Just make it simple and provide a reference implementation. A few HTTP messages to provision, attach, update and delete VMs, volumes and IPs. That would be fine. Anything else (and more is indeed needed) would be vendor extensions for now.

Of course IaaS goes beyond the scope of the Elemental Resource Model. We’ll need load balancing. We’ll need tunneling to the private datacenter. We’ll need low-latency sub-networks. We’ll need the ability to map multi-tier applications to different security zones. Etc. Some Cloud platforms support some of these (e.g. Amazon has an answer to all but the last one), but there is a lot more divergence (both in the “what” and the “how”) between the various Cloud APIs on this. That part of IaaS is not ready for standardization.

Then there are the extensions that attempt to make the IaaS APIs more application-aware. These too exist in some Cloud APIs (e.g. vCloud vApp) but not others. They haven’t naturally converged between implementations. They haven’t seen nearly as much usage in the industry as the base IaaS features. It would be a mistake to overreach in the initial phase of IaaS standardization and try to tackle these questions. It would not just delay the availability of a standard for the base IaaS use cases, it would put its emergence and adoption in jeopardy.

This is why Oracle withheld these application-aware aspects from the DMTF submission, though we are sharing them in the specification published on OTN. We want to expose them and get feedback. We’re open to collaborating on them, maybe even in the scope of a standard group if that’s the best way to ensure an open IP framework for the work. But it shouldn’t make the upcoming DMTF IaaS specification more complex and speculative than it needs to be, so we are keeping them as separate extensions. Not to mention that DMTF as an organization has a lot more infrastructure expertise than middleware and application expertise.

Again, the “Elemental Resource Model” specification submitted to DMTF is the same as the “Oracle Cloud Resource Model API” on OTN except that it has a different license (a license grant to DMTF instead of the usual OTN license) and is missing some resources in the list of resource types (section 9).

Both specifications share the exact same protocol aspects. It’s pretty cleanly RESTful and uses a JSON serialization. The credit for the nice RESTful protocol goes to the folks who created the original Sun Cloud API as this is pretty much what the Oracle Cloud API adopted in its entirety. Tim Bray described the genesis and design philosophy of the Sun Cloud API last year. He also described his role and explained that “most of the heavy lifting was done by Craig McClanahan with guidance from Lew Tucker“. It’s a shame that the Oracle specification fails to credit the Sun team and I kick myself for not noticing this in my reviews. This heritage was noted from the get go in the slides and is, in my mind, a selling point for the specification. When I reviewed the main Cloud APIs available last summer (the first part in a “REST in practice for IT and Cloud management” series), I liked Sun’s protocol design the best.

The resource model, while still based on the Sun Cloud API, has seen many more changes. That’s where our tireless editor, Jack Yu, with help from Mark Carlson, has spent most of the countless hours he devoted to the specification. I won’t do a point to point comparison of the Sun model and the Oracle model, but in general most of the changes and additions are motivated by use cases that are more heavily tilted towards private clouds and compatibility with existing application infrastructure. For example, the semantics of a Zone have been relaxed to allow a private Cloud administrator to choose how to partition the Cloud (by location is an obvious option, but it could also by security zone or by organizational ownership, as heretic as this may sound to Cloud purists).

The most important differences between the DMTF and OTN versions relate to the support for assemblies, which are groups of VMs that jointly participate in the delivery of a composite application. This goes hand-in-hand with the recently-released Oracle Virtual Assembly Builder, a framework for creating, packing, deploying and configuring multi-tier applications. To support this approach, the Cloud Resource Model (but not the Elemental Model, as explained above) adds resource types such as AssemblyTemplate, AssemblyInstance and ScalabilityGroup.

So what now? The DMTF working group has received a large number of IaaS APIs as submissions (though not the one that matters most or the one that may well soon matter a lot too). If all goes well it will succeed in delivering a simple and useful standard for the base IaaS use cases, and we’ll be down to a somewhat manageable triplet (EC2, RackSpace/OpenStack and DMTF) of IaaS specifications. If not (either because the DMTF group tries to bite too much or because it succumbs to infighting) then DMTF will be out of the game entirely and it will be between EC2, OpenStack and a bunch of private specifications. It will be the reign of toolkits/library/brokers and hell on earth for all those who think that such a bridging approach is as good as a standard. And for this reason it will have to coalesce at some point.

As far as the more application-centric approach to hypervisor-based Cloud, well, the interesting things are really just starting. Let’s experiment. And let’s talk.

Bernd Harzog recently wrote a blog entry to examine whether “the CMDB [is] irrelevant in a Virtual and Cloud based world“. If I can paraphrase, his conclusion is that there will be something that looks like a CMDB but the current CMDB products are ill-equipped to fulfill that function. Here are the main reasons he gives for this prognostic:

A whole new class of data gets created by the virtualization platform – specifically how the virtualization platform itself is configured in support of the guests and the applications that run on the guest.

A whole new set of relationships between the elements in this data get created – specifically new relationships between hosts, hypervisors, guests, virtual networks and virtual storage get created that existing CMDB’s were not built to handle.

New information gets created at a very rapid rate. Hundreds of new guests can get provisioned in time periods much too short to allow for the traditional Extract, Transform and Load processes that feed CMDB’s to be able to keep up.

The environment can change at a rate that existing CMDB’s cannot keep up with. Something as simple as vMotion events can create thousands of configuration changes in a few minutes, something that the entire CMDB architecture is simply not designed to keep up with.

Having portions of IT assets running in a public cloud introduces significant data collection challenges. Leading edge APM vendors like New Relic and AppDynamics have produced APM products that allow these products to collect the data that they need in a cloud friendly way. However, we are still a long way away from having a generic ability to collect the configuration data underlying a cloud based IT infrastructure – notwithstanding the fact that many current cloud vendors would not make this data available to their customers in the first place.

The scope of the CMDB needs to expand beyond just asset and configuration data and incorporate Infrastructure Performance, Applications Performance and Service assurance information in order to be relevant in the virtualization and cloud based worlds.

I wanted to expand on some of these points.

New model elements for Cloud (bullets #1 and #2)

These first bullets are not the killers. Sure, the current CMDBs were designed before the rise of virtualized environment, but they are usually built on a solid modeling foundation that can easily be extend with new resources classes. I don’t think that extending the model to describe VM, VNets, Volumes, hypervisors and their relationships to the physical infrastructure is the real challenge.

New approach to “discovery” (bullets #3 and #4)

This, on the other hand is much more of a “dinosaurs meet meteorite” kind of historical event. A large part of the value provided by current CMDBs is their ability to automate resource discovery. This is often achieved via polling/scanning (at the hardware level) and heuristics/templates (directory names, port numbers, packet inspection, bird entrails…) for application discovery. It’s imprecise but often good enough in static environments (and when it fails, the CMDB complements the automatic discovery with a reconciliation process to let the admin clean things up). And it used to be all you could get anyway so there wasn’t much point complaining about the limitations. The crown jewel of many of today’s big CMDBs can often be traced back to smart start-ups specialized in application discovery/mapping, like Appilog (now HP, by way of Mercury) and nLayers (now EMC). And more recently the purchase of Tideway by BMC (ironically – but unsurprisingly – oftencast in Cloud terms).

But this is not going to cut it in “the Cloud” (by which I really mean in a highly automated IT environment). As Bernd Harzog explains, the rate of change can completely overwhelm such discovery heuristics (plus, some of the network scans they sometimes use will get you in trouble in public clouds). And more importantly, there now is a better way. Why discover when you can ask? If resources are created via API calls, there are also API calls to find out which resources exist and how they are configured. This goes beyond the resources accessible via IaaS APIs, like what VMWare, EC2 and OVM let you retrieve. This “don’t guess, ask” approach to discovery needs to also apply at the application level. Rather than guessing what software is installed via packet inspection or filesystem spelunking, we need application-aware discovery that retrieves the application and configuration and dependencies from the application itself (or its underlying framework). And builds a model in which the connections between application entities are expressed in terms of the configuration settings that drive them rather than the side effects by which they can be noticed.

“All solutions built in the pre-cloud era are modeled on jvms (or their equivalent), hosts and ports, rather than the logical application running in a more fluid environment. If the solution identifies a web application by host/port or some other infrastructural id, then you cannot effectively manage it in a cloud environment, since the app will move and grow, and your management system (that is, everything offered by the Big 4, as well as all infrastructure management companies that pay lip service to the application) will provide nearly-useless visibility and extraordinarily high TCO.”

I don’t agree with everything in Lew Cirne’s post, but this diagnostic is correct and well worded. He later adds:

“So application management becomes the strategic center or gravity for the client of a public or private cloud, and infrastructure-centric tools (even ones that claim to be cloudy) take on a lesser role.”

Which is also very true even if counter-intuitive for those who think that

Embracing such a VM-centric view naturally raises the profile of infrastructure management compared to application management, which is a fallacy in Cloud computing.

Drawing the line between Cloud infrastructure management and application management (bullet #5)

This is another key change that traditional CMDBs are going to have a hard time with. In a Big-4 CMDB, you’re after the mythical “single source of truth”. Even in a federated CMDB (which doesn’t really exist anyway), you’re trying to have a unified logical (if not physical) repository of information. There is an assumption that you want to manage everything from one place, so you can see all the inter-dependencies, across all layers of the stack (even if individual users may have a scope that is limited by permissions). Not so with public Clouds and even, I would argue, any private Cloud that is more than just a “cloud” sticker slapped on an old infrastructure. The fact that there is a clean line between the infrastructure model and the application model is not a limitation. It is empowering. Even if your Cloud provider was willing to expose a detailed view of the underlying infrastructure you should resist the temptation to accept. Despite the fact that it might be handy in the short term and provide an interesting perspective, it is self-defeating in the long term from the perspective of realizing the productivity improvements promised by the Cloud. These improvements require that the infrastructure administrator be freed from application-specific issues and focus on meeting the contract of the platform. And that the application administrator be freed from infrastructure-level concerns (while at the same time being empowered to diagnose application-level concerns). This doesn’t mean that the application and infrastructure models should be disconnected. There is a contract and both models (infrastructure and consumption) should represent it in the same way. It draws a line, albeit one with some width.

Blurring the line between configuration and monitoring (bullet #6)

This is another shortcoming of current CMDBs, but one that I think is more easily addressed. The “contract” between the Cloud infrastructure and the consuming application materializes itself in a mix of configuration settings, administrative capabilities and monitoring data. This contract is not just represented by the configuration-centric Cloud API that immediately comes to mind. It also includes the management capabilities and monitoring points of the resulting instances/runtimes.

Wither CMDB?

Whether all these considerations mean that traditional CMDBs are doomed in the Cloud as Bernd Harzog posits, I don’t know. In this post, BMC’s Kia Behnia acknowledges the importance of application management, though it’s not clear that he agrees with their primacy. I am also waiting to see whether the application management portfolio he has assembled can really maps to the new methods of application discovery and management.

But these are resourceful organization, with plenty of smart people (as I can testify: in the end of my HP tenure I worked with the very sharp CMDB team that came from the Mercury acquisition). And let’s keep in mind that customers also value the continuity of support of their environment. Most of them will be dealing with a mix of old-style and Cloud applications and they’ll be looking for a unified management approach. This helps CMDB incumbents. If you doubt the power to continuity, take a minute to realize that the entire value proposition of hypervisor-style virtualization is centered around it. It’s the value of backward-compatibility versus forward-compatibility. in addition, CMDBs are evolving into CMS and are a lot more than configuration repositories. They are an important supporting tool for IT management processes. Whether, and how, these processes apply to “the Cloud” is a topic for another post. In the meantime, read what the IT Skeptic and Rodrigo Flores have to say.

I wouldn’t be so quick to count the Big-4 out, even though I work every day towards that goal, building Oracle’s application and middleware management capabilities in conjunction with my colleagues focused on infrastructure management.

If the topic of application-centric management in the age of Cloud is of interest to you (and it must be if you’ve read this long entry all the way to the end), You might also find this previous entry relevant: “Generalizing the Cloud vs. SOA Governance debate“.

Most APIs are like hospital gowns. They seem to provide good coverage, until you turn around.

I am talking about the dreadful state of fault reporting in remote APIs, from Twitter to Cloud interfaces. They are badly described in the interface documentation and the implementations often don’t even conform to what little is documented.

If, when reading a specification, you get the impression that the “normal” part of the specification is the result of hours of whiteboard debate but that the section that describes the faults is a stream-of-consciousness late-night dump that no-one reviewed, well… you’re most likely right. And this is not only the case for standard-by-committee kind of specifications. Even when the specification is written to match the behavior of an existing implementation, error handling is often incorrectly and incompletely described. In part because developers may not even know what their application returns in all error conditions.

After learning the lessons of SOAP-RPC, programmers are now more willing to acknowledge and understand the on-the-wire messages received and produced. But when it comes to faults, there is still a tendency to throw their hands in the air, write to the application log and then let the stack do whatever it does when an unhandled exception occurs, on-the-wire compliance be damned. If that means sending an HTML error message in response to a request for a JSON payload, so be it. After all, it’s just a fault.

But even if fault messages may only represent 0.001% of the messages your application sends, they still represent 85% of those that the client-side developers will look at.

Client developers can’t even reverse-engineer the fault behavior by hitting a reference implementation (whether official or de-facto) the way they do with regular messages. That’s because while you can generate response messages for any successful request, you don’t know what error conditions to simulate. You can’t tell your Cloud provider “please bring down your user account database for five minutes so I can see what faults you really send me when that happens”. Also, when testing against a live application you may get a different fault behavior depending on the time of day. A late-night coder (or a daytime coder in another time zone) might never see the various faults emitted when the application (like Twitter) is over capacity. And yet these will be quite common at peak time (when the coder is busy with his day job… or sleeping).

All these reasons make it even more important to carefully (and accurately) document fault behavior.

The move to REST makes matters even worse, in part because it removes SOAP faults. There’s nothing magical about SOAP faults, but at least they force you to think about providing an information payload inside your fault message. Many REST APIs replace that with HTTP error codes, often accompanied by a one-line description with a sometimes unclear relationship with the semantics of the application. Either it’s a standard error code, which by definition is very generic or it’s an application-defined code at which point it most likely overlaps with one or more standard codes and you don’t know when you should expect one or the other. Either way, there is too much faith put in the HTTP code versus the payload of the error. Let’s be realistic. There are very few things most applications can do automatically in response to a fault. Mainly:

So make sure that your HTTP errors support this simple decision tree. Beyond that point, listing a panoply of application-specific error codes looks like an attempt to look “RESTful” by overdoing it. In most cases, application-specific error codes are too detailed for most automated processing and not detailed enough to help the developer understand and correct the issue. I am not against using them but what matters most is the payload data that comes along.

On that aspect, implementations generally fail in one of two extremes. Some of them tell you nothing. For example the payload is a string that just repeats what the documentation says about the error code. Others dump the kitchen sink on you and you get a full stack trace of where the error occurred in the server implementation. The former is justified as a security precaution. The latter as a way to help you debug. More likely, they both just reflect laziness.

In the ideal world, you’d get a detailed error payload telling you exactly which of the input parameters the application choked on and why. Not just vague words like “invalid”. Is parameter “foo” invalid for syntactical reasons? Is it invalid because inconsistent with another parameter value in the request? Is it invalid because it doesn’t match the state on the server side? Realistically, implementations often can’t spend too many CPU cycles analyzing errors and generating such detailed reports. That’s fine, but then they can include a link to a wiki a knowledge base where more details are available about the error, its common causes and the workarounds.

Your API should document all messages accurately and comprehensively. Faults are messages too.

The VMforce announcement is a great step for SalesForce, in large part because it lets them address a recurring concern about the force.com PaaS offering: the lack of portability of Apex applications. Now they can be written using Java and Spring instead. A great illustration of how painful this issue was for SalesForce is to see the contortions that Peter Coffee goes through just to acknowledge it: “On the downside, a project might be delayed by debates—some in good faith, others driven by vendor FUD—over the perception of platform lock-in. Political barriers, far more than technical barriers, have likely delayed many organizations’ return on the advantages of the cloud”. The issue is not lock-in it’s the potential delays that may come from the perception of lock-in. Poetic.

Similarly, portability between clouds is also a big theme in Steve Herrod’s blog covering VMforce as illustrated by the figure below. The message is that “write once run anywhere” is coming to the Cloud.

Because this is such a big part of the VMforce value proposition, both from the SalesForce and the VMWare/SpringSource side (as well as for PaaS in general), it’s worth looking at the portability aspect in more details. At least to the extent that we can do so based on this pre-announcement (VMforce is not open for developers yet). And while I am taking VMforce as an example, all the considerations below apply to any enterprise PaaS offering. VMforce just happens to be one of the brave pioneers, willing to take a first step into the jungle.

Beyond the use of Java as a programming language and Spring as a framework, the portability also comes from the supporting tools. This is something I did not cover in my initial analysis of VMforce but that Michael Cote covers well on his blog and Carl Brooks in his comment. Unlike the more general considerations in my previous post, these matters of tooling are hard to discuss until the tools are actually out. We can describe what they “could”, “should” and “would” do all day long, but in the end we need to look at the application in practice and see what, if anything, needs to change when I redirect my deployment target from one cloud to the other. As SalesForce’s Umit Yalcinalp commented, “the details are going to be forthcoming in the coming months and it is too early to speculate”.

So rather than speculating on what VMforce tooling will do, I’ll describe what portability questions any PaaS platform would have to address (or explicitly decline to address).

Code portability

That’s the easiest to address. Thanks to Java, the runtime portability problem for the core language is pretty much solved. Still, moving applications around require changes to way the application communicates with its infrastructure. Can your libraries and frameworks for data access and identity, for example, successfully encapsulate and hide the different kinds of data/identity stores behind them? Even when the stores are functionally equivalent (e.g. SQL, LDAP), they may have operational differences that matter for an enterprise application. Especially if the database is delivered (and paid for) as a service. I may well design my application differently depending on whether I am charged by the amount of data in the DB, by the number of requests to the DB, by the quantity of app-to-DB traffic or by the total processing time of my requests in the DB. Apparently force.com considers the number of “database objects” in its pricing plans and going over 200 pushes you from the “Enterprise” version to the more expensive “Unlimited” version. If I run against my local relational database I don’t think twice about having 201 “database objects”. But if I run in force.com and I otherwise can live within the limits of the “Enterprise” version I’d probably be tempted to slightly alter my data model to fit under 200 objects. The example is borderline silly, but the underlying truth is that not all differences in application infrastructure can be automatically encapsulated by libraries.

While code portability is a solvable problem for a reasonably large set of use cases, things get hairier for the more demanding applications. A large part of the PaaS value proposition is contingent on the willingness to give up some low-level optimizations. This, and harder portability in some cases, may just have to be part of the cost of running demanding applications in a PaaS environment. Or just keep these off PaaS for now. This is part of the backward-compatible versus forward compatible Cloud dilemma.

Data portability

I have covered data portability in the previous entry, in response to Steve Herrod’s comment that “you should be able to extract the code from the cloud it currently runs in and move it, along with its data, to another cloud choice”. Your data in the force.com database can already be moved somewhere else… as long as you’re willing to write code to get it and perform any needed transformation. In theory, any data that you can read is data that you can move (thus fulfilling Steve’s promise). The question is at what cost. Presumably Steve is referring to data migration tools that VMWare will build (or acquire) and make part of its cloud enablement platform. Another way in which VMWare is trying to assemble a more complete middleware portfolio (see Oracle ODI for an example of a complete data integration offering, which goes far beyond ETL).

There is a subtle difference between the intrinsic portability of Java (which will run in any JRE, modulo JDK version) and the extrinsic portability of data which can in theory be moved anywhere but each place you move it to may require a different process. A car and an oak armoire are both “portable”, but one is designed for moving while the other will only move if you bring a truck and two strong guys.

Application service portability

I covered this in my previous entry and Bob Warfield summarized it as “take advantage of all those juicy services and it will be hard to back out of that platform, Java or no Java”. He is referring to all the platform services (search, reporting, mobile, integration, BPM, IdM, administration) that make a large part of the force.com value proposition. They won’t be waiting for you in your private cloud (though some may be remotely invocable, depending on how SalesForce wants to play its cards). Applications that depend on them will have to be changed, at least until we have standards interfaces for all these services (don’t hold your breath).

Management portability

Even if you can seamlessly migrate your application and your data from your internal servers to force.com, what do you think is going to happen to your management console, especially if it uses operating system agents? These agents are not coming along for the ride, that’s for sure. Are you going to tell your administrators that rather than having a centralized configuration/monitoring/event console they are going to have to look at cute “monitoring” web pages for each application? And all the transaction tracing, event correlation, configuration policy and end-user monitoring features they were relying on are unfortunate victims of the relentless march of progress? Good luck with that sale.

VMWare’s answer will probably be that they will eventually provide you with all the management capabilities that you need. And it’s a fair one, along the lines of the “Application-to-Disk Management” message at the recent launch of Oracle Enterprise Manager 11G. With the difference that EM is not the only way to manage a top-to-bottom Oracle stack, just the one that we think is the best. BMC and HP aren’t locked out.

VMWare and SpringSource (+Hyperic) could indeed theoretically assemble a full-fledged management solution. But this doesn’t happen overnight, even with acquisitions as I know from experience both at HP Software and currently at Oracle. Integration (of management domains across the stack, of acquired application management products, of support data/services from oracle.com) is one of the main advances in Enterprise Manager 11G and it took work.

And even then, this leads to the next logical question. If you can move from cloud to cloud but you are forced to use VMWare development, deployment and management tools, haven’t you traded one lock-in problem for another?

Not to mention that your portability between clouds, if it depends on VMWare tools, is limited to VMWare-powered clouds (private or public). In effect, there are now three levels of portability:

not portable (only runs on VMforce)

portable to any cloud (public or private) built using VMWare infrastructure

portable to any Java/Spring Cloud platform

Is your application portable the way cash is portable, or the way a gift card is portable (across stores of a retail chain)?

If this reminds you of the java portability debates of the early days of Enterprise Java that’s no surprise. Remember, we’re replaying the tape.

Let’s start with the disclosures: by most interpretations I work for a competitor to what Salesforce.com and VMWare are trying to do with VMforce. And all I know about VMforce is what I read in a few authoritative blogs by VMWare’s Steve Herrod, VMWare/SpringSource’s Rod Johnson and Salesforce’s Anshu Sharma. So no hard feeling if you jump off right now.

Overall, I like what I see. Let me put it this way. I am now a lot more likely to write an application on force.com than I was last week. How could this not be a good thing for SalesForce, me and others like me?

On the other hand, this is also not the major announcement that the “VMforce is coming” drum-roll had tried to make us expect. If you fell for it, then I guess you can be disappointed. I didn’t and I’m not (Phil Wainewright fell for it and yet isn’t disappointed, asserting that “VMforce.com redefines the PaaS landscape” for reasons not entirely clear to me even after reading his article).

The new thing is that force.com now supports an additional runtime, in addition to Apex. That new runtime uses the Java language, with the constraint that it is used via the Spring framework. Which is familiar territory to many developers. That’s it. That’s the VMforce announcement for all practical purposes from a user’s perspective. It’s a great step forward for force.com which was hampered by the non-standard nature of Apex, but it’s just a new runtime. All the other benefits that Anshu Sharma lists in his blog (search, reporting, mobile, integration, BPM, IdM, administration) are not new. They are the platform services that force.com offers to application writers, whether they use Apex or the new Java/Spring runtime.

It’s important to realize that there are two main parts to a full PaaS platform like force.com or Google App Engine. First there are application runtimes (Apex and now Java for force.com, Python and Java for GAE). They are language-dependent and you can have several of them to support different programming languages. Second are the platform services (reports, mobile, BPM, IdM etc for force.com as we saw above, mostly IdM for Google at this point) which are mostly language agnostic (beyond a library used to access them). I think of data storage (e.g. mySQL, force.com database, Google DataStore) as part of the runtime, but it’s on the edge of the grey zone. A third category is made of actual application services (e.g. the CRM web services out of SalesForce.com or the application services out of Google Apps) which I tend not to consider part of PaaS but again there are gray zones between application support services and application services. E.g. how domain-specific does your rule engine have to be before it moves from one category to the other?

As Umit Yalcinalp (who works for SalesForce) told me on Twitter“regardless of the runtime the devs using the Force.com db will get the same platform benefits, chatter, workflow, analytics”. What I called the platform services above. Which, really, is where most of the PaaS value lies anyway. A language runtime is just a starting point.

So where are VMWare and SpringSource in this picture? Well, from the point of view of the user nowhere, really. SalesForce could have built this platform themselves, using the Spring framework on top of Tomcat, WebLogic, JBoss… Itself running on any OS they want. With or without a hypervisor. These are all implementation details and are SalesForce’s problem, not ours as application developers.

It so happens that they have chosen to run this as a partnership with VMWare/SpringSource which makes a lot of sense from a portfolio/expertise perspective, of course. But this choice is not visible to the application developer making use of this platform. And it shouldn’t be. That’s the whole point of PaaS after all, that we don’t have to care.

But VMWare and SpringSource really want us to know that they are there, so Rod Johnson leads by lifting the curtain and explaining that:

“VMforce uses the Force.com physical infrastructure to run vSphere with a special customized vCloud layer that allows for seamless scaling and management. Above this layer VMforce runs SpringSource tc Server instances that provide the execution environment for the enterprise applications that run on VMforce.”

[Side note: notice what’s missing? The operating system. It’s there of course, most likely some Linux distribution but Rod glances over it, maybe because it’s a missing link in VMWare’s “we have all the pieces” story; unlike Oracle who can provide one or, even better, do without. Just saying…]

VMWare wants us to know they are under the covers because of course they have much larger aspirations than to be a provider to SalesForce. They want to use this as a proof point to sell their SpringSource+VMWare stack in other settings, such as private clouds and other public cloud providers (modulo whatever exclusivity period may be in their contract with SalesForce). And VMforce, if it works well when it launches, is a great validation for this strategy. It’s natural that they want people to know that they are behind the curtain and can be called on to replicate this elsewhere.

But let’s be clear about what part they can replicate. It’s the Java/Spring language runtime and its underlying infrastructure. Not the platform services that are part of the SalesForce platform. Not an IdM solution, not a rules engine, not a business process engine, etc. We can expect that they are hard at work trying to fill these gaps, as the RabbitMQ acquisition illustrates, but for now all this comes from force.com and isn’t directly replicable. Which means that applications that use them aren’t quite so portable.

In his post, Steve Herrod quickly moves past the VMforce announcement to focus on the SpringSource+VMWare infrastructure part, the one he hopes to see multiplied everywhere. The key promise, from the developers’ perspective, is application portability. And while the use of Java+Spring definitely helps a lot in terms of code portability I see some promises in terms of data portability that will warrant scrutiny when VMforce actually rolls out: “you should be able to extract the code from the cloud it currently runs in and move it, along with its data, to another cloud choice”.

It sounds very nice, but the underlying issues are:

Does the code change depending on whether I am talking to a local relational DB in my private cloud or whether I am on VMforce and using the force.com database?

If it doesn’t then the application is portable, but an extra service i still needed to actually move the data from one cloud to the other (can this be done in-flight? what downtime is needed?)

What about the other VMforce.com services (chatter, workflow, analytics…)? If I use them in my code can I keep using them once I migrate out of VMforce to a private cloud? Are they remotely invocable? Does the code change? And if I want to completely sever my links with SalesForce, can I find alternative implementations of these application platform services in my private cloud? Or from another public cloud provider? The answer to these is probably no, which means that you are only portable out of VMforce if you restrain yourself from using much of the value of the platform. It’s not even clear whether you can completely restrain yourself from using it, e.g. can you run on force.com without using their IdM system?

All these are hard questions. I am not blaming anyone for not answering them today since no-one does. But we shouldn’t sweep them under the rug. I am sure VMWare is working on finding workable compromises but I doubt it will be as simple, clean and portable as Steve Herrod implies. It’s funny how Steve and Anshu’s posts seem to reinforce and congratulate one another, until you realize that they are in large part talking about very different things. Anshu’s is almost entirely about the force.com application platform services (sprinkled with some weird Facebook envy), Steve’s is entirely about the application runtime and its infrastructure.

One thing that I am surprised not to see mentioned is the management aspect of the platform, especially considering the investment that SpringSource made in Hyperic. I can only assume that work is under way on this and that we’ll hear about it soon. One aspect of the management story that concerns me a bit is the lack of acknowledgment of the challenges of configuration management in a PaaS setting. Especially when I read Steve Herrod asserting that the VMWare/SpringSource PaaS platform is going to free us from the burden of “handling code modifications that may be required as the middleware versions change”. There seems to be a misconception that because the application administrators are not the ones doing the infrastructure updates they don’t need to worry about the impact of these updates on their application. Is Steve implying that the first release of the VMWare/SpringSource PaaS stack is going to be so perfect that the hypervisor, guest OS and app server will never have to be patched and versioned? If that’s not the case, then why are those patches suddenly less likely impact the application code? In fact the situation is even worse as the application administrator does not know which hypervisor/OS/middleware patches are being applied and when. They can’t test against the new version ahead of time for validation and they can’t make sure the change is scheduled during a non-critical period for their business. I wrote an entire blog post on this issue six months ago and it’s a little bit disheartening to see the issue flatly denied and ignored. Management is not just monitoring.

Here is another intriguing comment in Steve’s entry: “one of the key differentiators with EC2 based PaaS will be the efficiencies for the many-app model. Customers are frustrated with the need to buy a whole VM as the minimum service unit for their applications. Our PaaS will provide fine-grained resource separation”. I had to read it twice when I realized that the VMWare CTO was telling us that splitting a physical machine into VMs is not a good enough way to share its resources and that you really need middleware-level multi-tenancy. But who can disagree that a GAE-like architecture can support more low-traffic applications on the same server than anything based on VM-based sharing? Which (along with deep pockets) puts Google in position to offer free hosting for low-traffic applications, a great way to build adoption.

These are very early days in the history of PaaS. VMWare, like the rest of us, will need to tackle all these issues one by one. In the meantime, this is an interesting announcement and a noticeable milestone. Let’s just keep our eyes open on the incremental nature of progress and the long list of remaining issues.

The battle of the Cloud Frameworks has started, and it will look a lot like the battle of the Application Servers which played out over the last decade and a half. Cloud Frameworks (which manage IT automation and runtime outsourcing) are to the Programmable Datacenter what Application Servers are to the individual IT server. In the longer term, these battlefronts may merge, but for now we’ve been transported back in time, to the early days of Web programming. The underlying dynamic is the same. It starts with a disruptive IT event (part new technology, part new mindset). 15 years ago the disruptive event was the Web. Today it’s Cloud Computing.

Stage 1

It always starts with very simple use cases. For the Web, in the mid-nineties, the basic use case was “how do I return HTML that is generated by a script as opposed to a static file”. For Cloud Computing today, it is “how do I programmatically create, launch and stop servers as opposed to having to physically install them”.

In that sense, the IaaS APIs of today are the equivalent of the Common Gateway Interface (CGI) circa 1993/1994. Like the EC2 API and its brethren, CGI was not optimized, not polished, but it met the basic use cases and allowed many developers to write their first Web apps (which we just called “CGI scripts” at the time).

Stage 2

But the limitations became soon apparent. In the CGI case, it had to do with performance (the cost of the “one process per request” approach). Plus, the business potential was becoming clearer and attracted a different breed of contenders than just academic and research institutions. So we got NSAPI, ISAPI, FastCGI, Apache Modules, JServ, ZDAC…

We haven’t reached that stage for Cloud yet. That will be when the IaaS APIs start to support events, enumerations, queries, federated identity etc…

Stage 3

Stage 2 looked like the real deal, when we were in it, but little did we know that we were still just nibbling on the hors d’oeuvres. And it was short-lived. People quickly decided that they wanted more than a way to handle HTTP requests. If the Web was going to be central to most programs, then all aspects of programming had to fit well in the context of the Web. We didn’t want Web servers anymore, we wanted application servers (re-purposing a term that had been used for client-server). It needed more features, covering data access, encapsulation, UI frameworks, integration, sessions. It also needed to meet non-functional requirements: availability, scalability (hello clustering), management, identity…

That turned into the battle between the various Java application servers as well as between Java and Microsoft (with .Net coming along), along with other technology stacks. That’s where things got really interesting too, because we explored different ways to attack the problem. People could still program at the HTTP request level. They could use MVC framework, ColdFusion/ASP/JSP/PHP-style markup-driven applications, or portals and other higher-level modular authoring frameworks. They got access to adapters, message buses, process flows and other asynchronous mechanisms. It became clear that there was not just one way to write Web applications. And the discovery is still going on, as illustrated by the later emergence of Ruby on Rails and similar frameworks.

Stage 4

Stage 3 is not over for Web applications, but stage 4 is already there, as illustrated by the fact that some of the gurus of stage 3 have jumped to stage 4. It’s when the Web is everywhere. Clients are everywhere and so are servers for that matter. The distinction blurs. We’re just starting to figure out the applications that will define this stage, and the frameworks that will best support them. The game is far from over.

So what does it mean for Cloud Frameworks?

If, like me, you think that the development of Cloud Frameworks will follow a path similar to that of Application Servers, then the quick retrospective above can be used as a (imperfect) crystal ball. I don’t pretend to be a Middleware historian or that these four stages are the most accurate representation, but I think they are a reasonable perspective. And they hold some lessons for Cloud Frameworks.

It’s early

We are at stage 1. I’ll admit that my decision to separate stages 1 and 2 is debatable and mainly serves to illustrate how early in the process we are with Cloud frameworks. Current IaaS APIs (and the toolkits that support them) are the equivalent of CGI (and the early httpd), something that’s still around (Google App Engine emulates CGI in its Python incarnation) but almost no-one programs to directly anymore. It’s raw, it’s clunky, it’s primitive. But it was a needed starting point that launched the whole field of Web development. Just like IaaS APIs like EC2 have launched the field Cloud Computing.

Cloud Frameworks will need to go through the equivalent of all the other stages. First, the IaaS APIs will get more optimized and capable (stage 2). Then, at stage 3, we will focus on higher-level, more productive abstraction layers (generally referred to as PaaS) at which point we should expect a thousand different approaches to bloom, and several of them to survive. I will not hazard a guess as to what stage 4 will look like (here is my guess for stage 3, in twoparts).

No need to rush standards

One benefit of this retrospective is to highlight the tragedy of Cloud standards compared to Web development standards. Wouldn’t we be better off today if the development leads of AWS and a couple of other Cloud providers had been openly cooperating in a Cloud equivalent of the www-talk mailing list of yore? Out of it came a rough agreement on HTML and CGI that allowed developers to write basic Web applications in a reasonably portable way. If the same informal collaboration had taken place for IaaS APIs, we’d have a simple de-facto consensus that would support the low-hanging fruits of basic IaaS. It would allow Cloud developers to support the simplest use cases, and relieve the self-defeating pressure to standardize too early. Standards played a huge role in the development of Application Servers (especially of the Java kind), but that really took place as part of stage 3. In the absence of an equivalent to CGI in the Cloud world, we are at risk of rushing the standardization without the benefit of the experimentation and lessons that come in stage 3.

I am not trying to sugar-coat the history of Web standards. The HTML saga is nothing to be inspired by. But there was an original effort to build consensus that wasn’t even attempted with Cloud APIs. I like the staged process of a rough consensus that covers the basic use cases, followed by experimentation and proprietary specifications and later a more formal standardization effort. If we skip the rough consensus stage, as we did with Cloud, we end up rushing to do final step (to the tune of “customers demand Cloud standards”) even though all we need for now is an interoperable way to meet the basic use cases.

Winners and losers

Whoever you think of as the current leaders of the Application Server battle (hint: I work for one), they were not the obvious leaders of stages 1 and 2. So don’t be in too much of a hurry to crown the Cloud Framework kings. Those you think of today may turn out to be the Netscapes of that battle.

New roles

It’s not just new technology. The development of Cloud Frameworks will shape the roles of the people involved. We used to have designers who thought their job was done when they produced a picture or at best a FrameMaker or QuarkXPress document, which is what they were used to. We had “webmasters” who thought they were set for life with their new Apache skills, then quickly had to evolve or make way, a lesson for IaaS gurus of today. Under terms like “DevOps” new roles are created and existing roles are transformed. Nobody yet knows what will stick. But if I was an EC2 guru today I’d make sure to not get stuck providing just that. Don’t wait for other domains of Cloud expertise to be in higher demand than your current IaaS area, as by then you’ll be too late.

It’s the stack

There aren’t many companies out there making a living selling a stand-alone Web server. Even Zeus, who has a nice one, seems to be downplaying it on its site compared to its application delivery products. The combined pressure of commoditization (hello Apache) and of the demand for a full stack has made it pretty hard to sell just a Web server.

Similarly, it’s going to be hard to stay in business selling just portions of a Cloud Framework. For example, just provisioning, just monitoring, just IaaS-level features, etc. That’s well-understood and it’s fueling a lot of the acquisitions (e.g. VMWare’s purchase of SpringSource which in turn recently purchased RabbitMQ) and partnerships (e.g. recently between Eucalyptus and GroundWork though rarely do such “partnerships” rise to the level of integration of a real framework).

It’s not even clear what the right scope for a Cloud Framework is. What makes a full stack and what is beyond it? Is it just software to manage a private Cloud environment and/or deployments into public Clouds? Does the framework also include the actual public Cloud service? Does it include hardware in some sort of “private Cloud in a box”, of the kind that this recent Dell/Ubuntu announcement seems to be inching towards?

Integration

If indeed we can go by the history of Application Server to predict the future of Cloud Frameworks, then we’ll have a few stacks (with different levels of completeness, standardized or proprietary). This is what happened for Web development (the JEE stack, the .Net stack, a more loosely-defined alternative stack which is mostly open-source, niche stacks like the backend offered by Adobe for Flash apps, etc) and at some point the effort moved from focusing on standardizing the different application environment technology alternatives (e.g. J2EE) towards standardizing how the different platforms can interoperate (e.g. WS-*). I expect the same thing for Cloud Frameworks, especially as they grow out of stages 1 and 2 and embrace what we call today PaaS. At which point the two battlefields (Application Servers and Cloud Frameworks) will merge. And when this happens, I just can’t picture how one stack/framework will suffice for all. So we’ll have to define meaningful integration between them and make them work.

If you’re a spectator, grab plenty of popcorn. If you’re a soldier in this battle, get ready for a long campaign.

Oracle WebLogic Suite Virtualization Option (not the most Twitter-friendly name, so if you see me tweet about “WebLogic Virtual” or “WLV” that’s what I mean), an optimized version of WebLogic Server which runs on JRockit Virtual Edition, itself on top of OVM. Notice what’s missing? The OS. If you think you’ll miss it, you may be suffering from learned helplessness. Seek help.

Later this week, Oracle will announce Oracle Enterprise Manager 11g. I am not going to steal the thunder a couple of days before the announcement, but I can safely say that a large chunk of the new features relate to application management.

For the short term (until we sell one) there are three cars in my household. A manual transmission, an automatic and a CVT (continuous variable transmission). This makes me uniquely qualified to write about Cloud Computing.

That’s because Cloud Computing is yet another area in which the manual/automatic transmission analogy can be put to good use. We can even stretch it to a 4-layer analogy (now that’s elasticity):

Manual transmission

That’s traditional IT. Scaling up or down is done manually, by a skilled operator. It’s usually not rocket science but it takes practice to do it well. At least if you want it to be reliable, smooth and mostly unnoticed by the passengers.

Manumatic transmission (a.k.a. Tiptronic)

The driver still decides when to shift up or down, but only gives the command. The actual process of shifting is automated. This is how many Cloud-hosted applications work. The scale-up/down action is automated but, still contingent on being triggered by an administrator. Which is what most IaaS-deployed apps should probably aspire to at this point in time despite the glossy brochures about everything being entirely automated.

Automatic transmission

That’s when the scale up/down process is not just automated in its execution but also triggered automatically, based on some metrics (e.g. load, response time) and some policies. The scenario described in the aforementioned glossy brochures.

Continuous variable transmission

That’s when the notion of discrete gears goes away. You don’t think in terms of what gear you’re in but how much torque you want. On the IT side, you’re in PaaS territory. You don’t measure the number of servers, but rather a continuously variable application capacity metric. At least in theory (most PaaS implementations often betray the underlying work, e.g. via a spike in application response time when the app is not-so-transparently deployed to a new node).

More?

OK, that’s the analogy. There are many more of the same kind. Would you like to hear how hybrid Cloud deployments (private+public) are like hybrid cars (gas+electric)? How virtualization is like carpooling (including how you can also be inconvenienced by the BO of a co-hosted VM)? Do you want to know why painting flames on the side of your servers doesn’t make them go faster?

Driving and IT management have a lot in common, including bringing out the foul-mouth in us when things go wrong.

I got invited to give a short keynote presentation during the Cloud Connect conference this week at the Santa Clara Convention Center (thanks Shlomo and Alistair). Here are the slides (as PPT and PDF). They are visual support for my bad jokes rather than a medium for the actual message. So here is an annotated version.

I used this first slide (a compilation of representations of the 3-layer Cloud stack) to poke some fun at this ubiquitous model of the Cloud architecture. Like all models, it’s neither true nor false. It’s just more or less useful to tackle a given task. While this 3-layer stack can be relevant in the context of discussing economic aspects of Cloud Computing (e.g. Opex vs. Capex in an on-demand world), it is useless and even misleading in the context of some more actionable topics for SaaS: chiefly, how you deliver such services, how you consume them and how you manage them.

In those contexts, you shouldn’t let yourself get too distracted by the “aaS” aspect of SaaS and focus on what it really is.

Which is… a web application (by which I include both HTML access for humans and programmatic access via APIs.). To illustrate this point, I summarized the content of this blog entry. No need to repeat it here. The bottom line is that any distinction between SaaS and POWA (Plain Old Web Applications) is at worst arbitrary and at best concerned with the business relationship between the provider and the consumer rather than technical aspects of the application.

Which means that for most technical aspect of how SaaS is delivered, consumed and managed, what you should care about is that you are dealing with a Web application, not a Cloud service. To illustrate this, I put up the…

… guillotine slide. Which is probably the only thing people will remember from the presentation, based on the ample feedback I got about it. It probably didn’t hurt that I also made fun of my country of origin (you can never go wrong making fun of France), saying that the guillotine was our preferred way of solving any problem and also the last reliable piece of technology invented in France (no customer has ever come back to complain). Plus, enough people in the audience seemed to share my lassitude with the 3-layer Cloud stack to find its beheading cathartic.

Come to think about it, there are more similarities. The guillotine is to the axe what Cloud Computing is to traditional IT. So I may use it again in Cloud presentations.

Of course this beheading is a bit excessive. There are some aspects for which the IaaS/PaaS/SaaS continuum makes sense, e.g. around security and compliance. In areas related to multi-tenancy and the delegation of control to a third party, etc. To the extent that these concerns can be addressed consistently across the Cloud stack they should be.

But focusing on these “Cloud” aspects of SaaS is missing the forest for the tree.

A large part of the Cloud value proposition is increased flexibility. At the infrastructure level, being able to provision a server in minutes rather than days or weeks, being able to turn one off and instantly stop paying for it, are huge gains in flexibility. It doesn’t work quite that way at the application level. You rarely have 500 new employees joining overnight who need to have their email and CRM accounts provisioned. This is not to minimize the difficulties of deploying and scaling individual applications (any improvement is welcome on this). But those difficulties are not what is crippling the ability of IT to respond to business needs.

Rather, at the application level, the true measure of flexibility is the control you maintain on your business processes and their orchestration across applications. How empowered (or scared) you are to change them (either because you want to, e.g. entering a new business, or because you have to, e.g. a new law). How well your enterprise architecture has been defined and implemented. How much visibility you have into the transactions going through your business applications.

It’s about dealing with composite applications, whether or not its components are on-premise or “in the Cloud”. Even applications like Salesforce.com see a large number of invocations from their APIs rather than their HTML front-end. Which means that there are some business applications calling them (either other SaaS, custom applications or packaged applications with an integration to Salesforce). Which means that the actual business transactions go across a composite system and have to be managed as such, independently of the deployment model of each participating application.

[Side note: One joke that fell completely flat was that it was unlikely that the invocations of Salesforce through the Web services APIs be the works of sales people telneting to port 80 and typing HTTP and SOAP headers. Maybe I spoke too quickly (as I often do), or the audience was less nerdy than I expected (though I recognized some high-ranking members of the nerd aristocracy). Or maybe they all got it but didn’t laugh because I forgot to take encryption into account?]

At this point I launched into a very short summary of the benefits of SOA governance/management, real user experience monitoring, BTM and application-centric IT management in general. Which is very succinctly summarized on the slide by the “SOA” label on the receiving bucket. I would have needed another 10 minutes to do this subject justice. Maybe at Cloud Connect 2011? Alistair?

This picture of me giving the presentation at Cloud Connect is the work of Alex Dunne.

The guillotine picture is the work of Rusty Boxcars who didn’t just take the photo but apparently built the model (yes it’s a small-size model, look closely). Here it is in its original, unedited, glory. My edited version is available under the same CC license to anyone who wants to grab it.

Events/alerts/notifications have been a central concept in IT management at least since the first SNMP trap was emitted, and probably even long before that. And yet they are curiously absent from all the Cloud management APIs/protocols. If you think that’s because “THE CLOUD CHANGES EVERYTHING” then you may have to think again. Over the last few days, two of the most experienced practitioners of Cloud computing pointed out that this omission is a real pain in the neck. RightScale’s Thorsten von Eicken was first to request “an event based interface instead of a request-reply based interface”, pointing out that “we run a good number of machines that do nothing but chew up 100% cpu polling EC2 to detect changes”. George Reese seconded and started to sketch a solution. And while these blog posts gave the issue increased visibility recently, it has been a recurring topic on the AWS Forum and other similar discussion boards for quite some time. For example, in this thread going back to 2006, an Amazon employee wrote that “this is a feature we’ve discussed recently and we’re looking at options” (incidentally, I see a post by Thorsten in that old thread). We’re still waiting.

Let’s look at what it would take to define such a feature.

I have some experience with events for IT management, having been involved in the WS-Notification family of specifications and having co-chaired the OASIS technical committee that standardized them. This post is not about foisting WS-Notification on Cloud APIs, but just about surfacing some of the questions that come up when you try to standardize such a mechanism. While the main use cases for WS-Notification came from IT (and Grid) management, it was supposed to be a generic mechanism. A Cloud-centric eventing protocol can be made simpler by focusing on fewer use cases (Cloud scenarios only). In addition, WS-Notification was marred by the complexity-is-a-sign-of-greatness spirit of the time . On this too, a Cloud eventing protocol could improve things by keeping IBM at bay simplicity in mind.

Types of event

When you pull the state of a resource to see if anything changed, you don’t have to tell the provider what kind of change you are interested in. If, on the other hand, you want the provider to notify you, then they need to know what you care about. You may not want to be notified on every single change in the resource state. How do you describe the changes you care about? Is there an agreed-upon set of states for the resource and you are only notified on state transitions? Can you indicate the minimum severity level for an event to be emitted? Who determines the severity of an event? Or do you get to specify what fields in the resource state you want to watch? What about numeric values for which you may not want to be notified of every change but only when a threshold is crossed? Do you get to specify a query and get notified whenever the query result changes? In WS-Notification some of this is handled by WS-Topics which I still like conceptually (I co-edited it) but is too complex for the task at hand.

Event formats

What format are the events serialized in? How is the even metadata captured (e.g. time stamp of observation, which may not be the same as the time at which the notification message was sent)? If the event payload is a representation of the new state of the resource, does it indicate what field changes (and what the old value was)? How do you keep event payloads consistent with the resource representation in the request/response interactions? If many events occur near the same time, can you group them in one notification message for better scalability?

Subscription creation

Presumably you need a subscription mechanism. Is the subscription set in stone when the resource is created? Or can you come later and subscribe? If subscription is an operation on the resource itself, how do you subscribe for events on something that doesn’t exist yet (e.g. “create a VM and notify me once it’s started”)? Do you get to set subscriptions on a per-resource-basis? Or is this a global setting for all the resources that you own? Can you have two different subscriptions on the same resource (e.g. a “critical events only” subscription that exist throughout the life of the resource, plus a “lots of events please” subscription that you keep for a few hours while troubleshooting)?

Subscription management

Do you get to come back and update/pause/delete a subscription? Do you get to change what filter the subscription carries? Or is it set in stone until the subscription expires? Can you change the delivery endpoint? What if events fail to be delivered? Does the provider cancel your subscription? After how many failures? Does it just pause it for a few hours? Keep trying?

Subscription expiration

Who sets the expiration period? The subscriber? Can the provider set a max duration? Do you get a warning message before the subscription expires? Can you renew a subscription or do you have to create a new one? Do you get a message telling you that it has expired? Where are these subscription-lifecycle messages sent? To the same endpoint as the regular messages? What if your subscription is being killed because your deliver endpoint is down, clearly it makes no sense to send the warning message to that same endpoint. Do you provide a separate “subscription management” endpoint (different from the event delivery endpoint) when you subscribe? Alternatively, does an email message get sent to the registered user who set the subscription?

Delivery reliability

How reliable do you want the notifications to be? Should the emitter retry until they’ve received a confirmation? How long do they keep messages that can’t be delivered? Some may have a very short shelf life while others are still useful weeks later. If you don’t have a reliable mechanism but you really “need to know about a lost server within a minute of it disappearing” (the example Georges gives) then in reality you may still have to poll just to make sure that an event wasn’t lost. If you haven’t received an event in a while, how can you test if the subscription is still working? Should subscriptions send a heartbeat message once a while?

Delivery mechanism

How do you deliver notifications? Do you keep HTTP connections open through tricks similar to how self-updating web pages work (e.g. COMET, long polling and soon WebSockets)? Or do you just provide a listener endpoint to which the notifier tries to connect (which, in the case of public cloud deployments, means you need to have a publicly-addressable listener, but hopefully not on the same Cloud infrastructure). Do you use XMPP? AMQP? Email? Can I have you hold my events and let me come pull them?

Security

Do you need to verify the origin of the events you receive? Or do you assume they may be forged and always initiate a connection to the provider to double-check? And on the other side, what are the security requirements for event delivery? If a user looses some of their privileges, do you have to go and cancel the still-active subscriptions that they created?

Throttling

Is there a maximum event rate? Do you get charged for the events the Cloud provider sends you? How do you make sure that someone doesn’t create a subscription pointing to the wrong endpoint (either erroneously or maliciously, e.g. DoS). Do you send a test message at registration asking the delivery endpoint to acknowledge that they indeed want to receive these notifications?

Conclusion

My goal is not to argue that we cannot have a simple yet good enough notification system or to scare anyone from attempting to define it. It’s just to show that it’s not as simple as it may seem at first blush. But there probably is a sweetspot and people like Thorsten and George are very well qualified to find it.

[UPDATED 2010/4/7: Amazon releases AWS Simple notification Service. Not just as an eventing feature for the Cloud API, as a generic notification service. Which can, of course, also carry Cloud management events. Though at this point you’re on your own to publish them from your instances, it doesn’t look like the AWS infrastructure can do it for you. Which means, for example, that you’re not going to be able to publish an event for a sudden crash.]

Have you noticed the slow build-up of business process engines available “as a service”? Force.com recently introduced a “Visual Process Manager”. Amazon is looking for product managers to help customers “securely compos[e] processes using capabilities from all parts of their organization as well as those outside their organization, including existing legacy applications, long-running activities, human interactions, cloud services, or even complex processes provided by business partners”. I’ve read somewhere (can’t find a link right now) that WSO2 was planning to make its Business Process Server available as a Cloud service. I haven’t tracked Azure very closely, but I expect AppFabric to soon support a BizTalk-like process engine. And I wouldn’t be surprised if VMWare decided to make an acquisition in the area of business process execution.

Attacking PaaS from the business process angle is counter-intuitive. Rather, isn’t the obvious low-hanging fruit for PaaS a simple synchronous HTTP request handler (e.g. a servlet or its Python, Ruby, etc equivalent)? Which is what Google App Engine (GAE) and Heroku mainly provide. GAE almost defined PaaS as a category in the same way that Amazon EC2 defined IaaS. The expectation that a CGI or servlet-like container naturally precedes a business process engine is also reinforced by the history of middleware stacks. Simple HTTP request-response is the first thing that gets defined (the first version of the servlet package was java.servlet.* since it even predates javax), the first thing that gets standardized (JSR 53: servlet 2.3 and JSP 1.2) and the first thing that gets widely commoditized (e.g. Apache Tomcat). Rather than a core part of the middleware stack, business process engines (BPEL and the like) are typically thought of as a more “advanced” or “enterprise” capability, one that come later, as part of the extended middleware stack.

But nothing says it has to be that way. If you think about it a bit longer, there are some reasons why business process execution might actually be a more logical beach head for PaaS than simple HTTP request handlers.

1) Small contract

Architecturally, the contract between a business process engine and the deployed entities (process definitions) is much smaller than the contract of a GAE-style HTTP handler. Those GAE contracts include an entire programming language and lots of libraries. A BPEL container, on the other hand, has a simple contract. It’s documented in one specification (plus a few dependencies) and offers basic activities like routing logic, message correlation, simple data manipulation, compensation handlers and service invocation. You may not think of BPEL as “simple” but would you rather implement a BPEL engine or a complete Python interpreter along with most of the core libraries? I thought so. That’s what I mean by a simpler (narrower) contract. And BPEL is just one example, I suspect some PaaS platforms will take a more bare-bone approach (e.g. no “scopes”).

Just like “good fences make good neighbors”, small contracts make good Cloud services. When your container only interprets a business process definition (typically an XML document), you don’t need to worry about intercepting/preventing all the nasty low-level APIs (e.g. unfettered network access, filesystem reads, OS calls…) that are not acceptable in a PaaS situation. But that is what Google had to do in the process of pairing down a general-purpose programming language to fit into the constraints of a PaaS container. There is no intrinsic reason why a synchronous HTTP request handler has to have access to image-manipulation libraries and a business process handler doesn’t. But the use cases tend to push you in that direction and the expectations have been set. As a result, a business process engine is architecturally a better candidate for being delivered as a Cloud service.

2) Major differentiator over IaaS-based solutions

Practically speaking, it is pretty easy today to get a (synchronous) Web app framework up and running “in the Cloud”. Provisioning a Django, PHP, RoR or Tomcat (plus the Java framework of your choice) stack on EC2 is a well-traveled path. Even auto-scaling these things is pretty well understood. I am the first one to scream that “here is an AMI of our server stack” is *not* the same as PaaS, but truth be told many people are happy enough with it. As a result, the benefit of going from a “web app on IaaS” situation to GAE-like situation is not perceived as very compelling. I suspect the realization may hit later, but for now people are happy to trade the simplified administration and extra scalability of PaaS for the ability to keep their current framework (MySQL and all) unchanged.

There is no fundamental reason why you can’t run a business process engine on top of an IaaS-provisioned infrastructure. It’s just that you are mostly on your own at this point. Even if you find an existing public AMI that meets your needs, I doubt you’ll find a well-tested way to manage, backup and auto-scale this system (marrying IaaS-level invocations with container-level and DB-level tasks). Or if you do it will probably cost you. In that “new frontier” context, a true PaaS alternative to the “build it on top of IaaS” approach is a lot more compelling than if all you need is yet another RoR-on-EC2 system.

When deciding whether to walk back to your hotel after dinner or take a cab, you don’t just consider the distance. How familiar you are with the neighborhood and how safe it appears are also important parameters.

3) There is an existing market

This may not be obvious to people who come to PaaS from a Web application framework perspective, but there is a large market for business process engines in enterprise integration scenarios. Whether it’s Oracle Fusion Middleware, Microsoft BizTalk, webMethods (now Software AG) or others, this is a very common and useful tool in the enterprise computing toolbox. If this is the market you are after (rather than creating Facebook apps or the next Twitter), then you have to address this need. Not to mention that business processes engines are often used for partner integration scenarios (which makes hosting in a public Cloud a natural choice).

Conclusion

In the end, both synchronous and asynchronous execution engines are useful, as are other core services like storage (here is my proposed list of PaaS container types). I just wanted to bring some attention to business process execution because I think PaaS is the context in which its profile will rise to higher prominence. I also anticipate that this rise will lead to some very interesting progress and innovation in the way these processes are defined, deployed and managed. We haven’t yet seen, in this area, the relentless evolutionary pressure that has shaped today’s synchronous Web application frameworks. Fun times ahead.

Oracle just announced that it has purchased Amberpoint. If you have ever been interested in Web services management, then you surely know about Amberpoint. The company has long led the pack of best-of-breed vendors for Web services and SOA Management. My history with them goes back to the old days of the OASIS WSDM technical committee, where their engineers brought to the group a unique level of experience and practical-mindedness.

The official page has more details. In short, Amberpoint is going to reinforce Oracle Enterprise Manager, especially in these areas:

Business Transaction Management

SOA Management

Application Performance Management

SOA Governance (BTW, Oracle Enterprise Repository 11g was released just over a week ago)

I am looking forward to working with my new colleagues from Amberpoint.

What makes one Web applications “Software as a Service” (SaaS) and another a “plain old Web application” (POWA)? Or is there no such distinction?

Wouldn’t it be convenient if we had an answer that has some functional relevance? Here are the different axis on which I (unsuccessfully) tried to project Web applications to sort them between SaaS and POWA:

1) Amount of data

Hypothesis: If the Web application stores a lot of your data, it’s SaaS, otherwise it’s a POWA. For example, Webmails like GMail and Yahoo Mail have a lot. So do hosted CRM systems. They are all SaaS. Zillow and Expedia do not, so they are POWA.

2) Criticality of data

Hypothesis: Flickr and YouTube may have many Gigabytes of your data, but it’s not critical data (and you most likely have a local copy), so they are POWA. An on-line password manager, or a federated identification service, stores little information about you but it’s important information so they are SaaS.

3) Customization

Hypothesis: If you’ve invested time to customize the Web application (like Salesforce.com) then it’s SaaS. If, on the other hand, users all get pretty much the same interface, give or take some minor branding (e.g. YouTube or GMail), then it’s a POWA.

4) Complexity

Hypothesis: If it takes many man-years to build it it’s SaaS, otherwise it’s a POWA. So Google Search is SaaS, but twitpics.com is a POWA.

5) Organization mandate

Hypothesis: If all your employees use the same Web application for a given task, that’s SaaS. For example, some companies mandate that all travel reservations go through American Express or Carlson Wagonlit so they are SaaS. If your company lets you use Expedia or any travel site of your choice and you just expense the bill, then it’s a POWA.

6) Administration

Hypothesis: If your company has its own administrator(s) to manage the use of the Web application by its employees, then it’s SaaS. Otherwise it’s a POWA. For example, Google Apps is SaaS, but Facebook is a POWA.

7) Payment

Hypothesis: If you pay for using the Web application, its’ SaaS. In that respect, LotusLive (and porn sites) are SaaS. Facebook and Twitter are POWA.

8) SLA

Hypothesis: If you have guarantees of service, and they are backed by some compensation, then it’s SaaS. For example Aplicor and AWS S3 are SaaS, but (last I heard) SalesForce.com is not.

9) Compliance

Hypothesis: If the Web application conforms to your compliance needs, then it’s SaaS. By this metric, pretty much nothing is SaaS if you have any serious compliance requirement, it’s all POWA.

10) On-premise alternative

Hypothesis: If the Web application replaces something that you could do on-premise, then it’s SaaS. If it intrinsically involves a third-party provider, then it’s a POWA. Zoho (which replaces Office+Sharepoint) is SaaS, while Amazon (the store) and EBay are POWA. You couldn’t run your own version of them, at least not as a replacement for the public versions (you could run an internal auction site, but that’s a different use case from what you use EBay for).

11) On demand

Hypothesis: If you can initiate service via just a Web request it’s SaaS. If it requires you to interact with a human it’s a POWA. Or is it the other way around? Most silly little Web apps out there would fit the SaaS bill by this definition, while LotusLive-type deployments (or HP Cloud Assure) would not. So are we now saying that Cloud requires *more* friction?

12) Scalable

Hypothesis: If you can quickly ramp up your usage (in terms of load per user and/or number of users) without having to warn the provider and without noticing performance degradations then it’s SaaS, otherwise it’s POWA. The Amazon storefront and GMail are SaaS, Twitter is a POWA.

13) Pay per use

Hypothesis: If your usage of the Web application is metered and your bill is based on it then it’s SaaS. Otherwise it’s a POWA. iTunes is SaaS. GMail is a POWA.

13) Primarily computational

Hypothesis: If the service provided is primarily computational (storage/processing of data) it’s SaaS, otherwise it’s a POWA. GMail and Salesforce are SaaS. A Web application for pizza delivery or travel reservation is a POWA. Banking services are SaaS (let’s chew on that one for a second).

14) Programmatic usage

Hypothesis: If the web application is primarily used programmatically (e.g. SOA) it’s SaaS, if it’s primarily used by humans through the Web then it’s a POWA. There aren’t that many SaaS applications by that measure. Maybe some financial/trading/information services? Or AWS S3 (which has the amazing property of belonging to IaaS, PaaS and SaaS depending on who you ask). Or iTunes?

15) Standard interface

Hypothesis: If there is a standard interface to the Web application, then it’s SaaS, otherwise it’s just a POWA. Webmails (at least those that support IMAP) are SaaS. Most other Web applications are POWA, by virtue of there being very few standard application-level interfaces (remember RosettaNet?).

16) Gets the CIO on magazine covers

Hypothesis: If adoption of a Web application gets the CIO on the cover of trade magazines, then it’s SaaS. Otherwise it’s a POWA. The problem with this definition is that it’s a moving line. The first CIO to use external email or CRM probably landed on a few cover pages. Soon nothing short of putting your pacemaker firmware in the Cloud will.

17) Open Source

Hypothesis: If the Web application is built using Open Source software it’s SaaS. Otherwise it’s a POWA. WordPress.com and the various Zimbra hosted providers are therefore SaaS but SalesForce.com is not. Yes, that’s a silly criteria to even consider. But someone was going to bring it up, so let’s kill it now.

18) Enterprise usage

Hypothesis: Web applications used for business are SaaS. Those used by consumers are POWA. For example GMail (as part of Google Apps) is SaaS while the regular GMail is… a POWA.

Bottom Line

Which of these sorting criteria work for you? I don’t find any of them convincing. Many feel very artificial, as the example for the last one illustrates. Several (like the first three) seem to test how tied to the Web application you are (with the hypothesis that commitment means true SaaS while a one-night-stand is just casual SaaS, aka POWA). But this goes against the common notion that “Cloud” means more flexibility, not less.

I could list more question, but this list should suffice to illustrate the difficulty of establishing a Litmus test that separates SaaS from POWA. Even if we give up on a clear definition, I don’t even feel like I can say that “I know SaaS when I see it”. At least not in a way that I can expect most others to agree with.

Are there actually useful characteristics that am I forgetting to test against? Is it a carefully weighted combination of those above? In the end, is there a reasonable way to tell SaaS apart from POWA? If we can’t find an answer that people eventually agree on and that has functional relevance (by which I mean that there is a meaningful difference between how you consume/manage SaaS vs. POWA) then I’ll have to conclude that the term SaaS only exists for some of the following (bad) reasons:

A two level-stack (IaaS and Paas) looks silly,

adding SaaS to IaaS and PaaS instantly inflates the size, adoption and relevance of the “Cloud” umbrella , by co-opting pre-existing and already-thriving businesses,

conversely, riding the coattail of the hot “Cloud” buzzword helps providers of such applications. SaaS sounds much better than the worn-out “ASP” appellation, especially with ASP.Net cheapening it.

After throwing stones, here is my proposal.

There is a definition I would like to use to qualify which Web applications qualify for the SaaS appellation: it’s any Web application which satisfies all (or at least several) of the layer 3 benefits in this “Taxonomy of Cloud Computing Benefits”. The first test is to get authentication and authorization right. The second is to offer proper programmatic remote interfaces. Etc (there are 5, remember to scroll down to “layer 3” to find them).

But if we take that list as a definition, rather than just as an aspiration, then let’s face the truth: right now there isn’t a single SaaS offering in the market.

There have been someinterestingdiscussionsrecently about the relationship between Cloud management and SOA management/governance (run-time and design-time). My only regret is that they are a bit too focused on determining winners and loosers rather than defining what victory looks like (a bit like arguing whether the smartphone is the triumph of the phone over the computer or of the computer over the phone instead of discussing what makes a good smartphone).

To define victory, we need to answer this seemingly simple question: in what ways is the relationship between a VM and its hypervisor different from the relationship between two communicating applications?

More generally, there are three broad categories of relationships between the “active” elements of an IT system (by “active” I am excluding configuration, organization, management and security artifacts, like patch, department, ticket and user, respectively, to concentrate instead on the elements that are on the invocation path at runtime). We need to understand if/how/why these categories differ in how we manage them:

Deployment relationships: a machine (or VM) in a physical host (or hypervisor), a JEE application in an application server, a business process in a process engine, etc…

Infrastructure dependency relationships (other than containment): from an application to the DB that persists its data, from an application tier to web server that fronts it, from a batch job to the scheduler that launches it, etc…

Application dependency relationships: from an application to a web service it invokes, from a mash-up to an Atom feed it pulls, from a portal to a remote portlet, etc…

In the old days, the lines between these categories seemed pretty clear and we rarely even thought of them in the same terms. They were created and managed in different ways, by different people, at different times. Some were established as part of a process, others in a more ad-hoc way. Some took place by walking around with a CD, others via a console, others via a centralized repository. Some of these relationships were inventoried in spreadsheets, others on white boards, some in CMDBs, others just in code and in someone’s head. Some involved senior IT staff, others were up to developers and others were left to whoever was manning the controls when stuff broke.

It was a bit like the relationships you have with the taxi that takes you to the airport, the TSA agent who scans you and the pilot who flies you to your destination. You know they are all involved in your travel, but they are very distinct in how you experience and approach them.

It all changes with the Cloud (used as a short hand for virtualization, management automation, on-demand provisioning, 3rd-party hosting, metered usage, etc…). The advent of the hypervisor is the most obvious source of change: relationships that were mostly static become dynamic; also, where you used to manage just the parts (the host and the OS, often even mixed as one), you now manage not just the parts but the relationship between them (the deployment of a VM in a hypervisor). But it’s not just hypervisors. It’s frameworks, APIs, models, protocols, tools. Put them all together and you realize that:

the IT resources involved in all three categories of relationships can all be thought of as services being consumed (an “X86+ethernet emulation” service exposed by the hypervisor, a “JEE-compatible platform” service exposed by the application server, an “RDB service” expose by the database, a Web services exposed via SOAP or XML/JSON over HTTP, etc…),

they can also be set up as services, by simply sending a request to the API of the service provider,

not only can they be set up as services, they are also invoked as such, via well-documented (and often standard) interfaces,

they can also all be managed in a similar service-centric way, via performance metrics, SLAs, policies, etc,

your orchestration code may have to deal with all three categories, (e.g. an application slowdown might be addressed either by modifying its application dependencies, reconfiguring its infrastructure or initiating a new deployment),

the relationships in all these categories now have the potential to cross organization boundaries and involve external providers, possibly with usage-based billing,

as a result of all this, your IT automation system really needs a simple, consistent, standard way to handle all these relationships. Automation works best when you’ve simplified and standardize the environment to which it is applied.

If you’re a SOA person, your mental model for this is SOA++ and you pull out your SOA management and governance (config and runtime) tools. If you are in the WS-* obedience of SOA, you go back to WS-Management, try to see what it would take to slap a WSDL on a hypervisor and start dreaming of OVF over MTOM/XOP. If you’re into middleware modeling you might start to have visions of SCA models that extend all the way down to the hardware, or at least of getting SCA and OSGi to ally and conquer the world. If you’re a CMDB person, you may tell yourself that now is the time for the CMDB to do what you’ve been pretending it was doing all along and actually extend all the way into the application. Then you may have that “single source of truth” on which the automation code can reliably work. Or if you see the world through the “Cloud API” goggles, then this “consistent and standard” way to manage relationships at all three layers looks like what your Cloud API of choice will eventually do, as it grows from IaaS to PaaS and SaaS.

Your background may shape your reference model for this unified service-centric approach to IT management, but the bottom line is that we’d all like a nice, clear conceptual model to bridge and unify Cloud (provisioning and containment), application configuration and SOA relationships. A model in which we have services/containers with well-defined operational contracts (and on-demand provisioning interfaces). Consumers/components with well-defined requirements. APIs to connect the two, with predictable results (both in functional and non-functional terms). Policies and SLAs to fine-tune the quality of service. A management framework that monitors these policies and SLAs. A common security infrastructure that gets out of the way. A metering/billing framework that spans all these interactions. All this while keeping out of sight all the resource-specific work needed behind the scene, so that the automation code can look as Zen as a Japanese garden.

It doesn’t mean that there won’t be separations, roles, processes. We may still want to partition the IT management tasks, but we should first have a chance to rejigger what’s in each category. It might, for example, make sense to handle provider relationships in a consistent way whether they are “deployment relationships” (e.g. EC2 or your private IaaS Cloud) or “application dependency relationships” (e.g. SOA, internal or external). On the other hand, some of the relationships currently lumped in the “infrastructure dependency relationships” category because they are “config files stuff” may find different homes depending on whether they remain low-level and resource-specific or they are absorbed in a higher-level platform contract. Any fracture in the management of this overall IT infrastructure should be voluntary, based on legal, financial or human requirements. And not based on protocol, model, security and tool disconnect, on legacy approaches, on myopic metering, that we later rationalize as “the way we’d want things to be anyway because that’s what we are used to”.

In the application configuration management universe, there is a planetary collision scheduled between the hypervisor-centric view of the world (where virtual disk formats wrap themselves in OVF, then something like OVA to address, at least at launch time, application and infrastructure dependency relationships) and the application-model view of the world (SOA, SCA, Microsoft Oslo at least as it was initially defined, various application frameworks…). Microsoft Azure will have an answer, VMWare/Springsouce will have one, Oracle will too (though I can’t talk about it), Amazon might (especially as it keeps adding to its PaaS portfolio) or it might let its ecosystem sort it out, IBM probably has Rational, WebSphere and Tivoli distinguished engineers locked into a room, discussing and over-engineering it at this very minute, etc.

There is a lot at stake, and it would be nice if this was driven (industry-wide or at least within each of the contenders) by a clear understanding of what we are aiming for rather than a race to cobble together partial solutions based on existing control points and products (e.g. the hypervisor-centric party).

[UPDATED 2010/1/25: For an illustration of my statement that “if you’re a SOA person, your mental model for this is SOA++”, see Joe McKendrick’s “SOA’s Seven Greatest Mysteries Unveiled” (bullet #6: “When you get right down to it, cloud is the acquisition or provisioning of reusable services that cross enterprise walls. (…) They are service oriented architecture, and they rely on SOA-based principles to function.”)]

There is the Cloud that provides value by requiring as few changes as possible. And there is the Cloud that provides value by raising the abstraction and operation level. The backward-compatible Cloud versus the forward-compatible Cloud.

The main selling point of the backward-compatible Cloud is that you can take your existing applications, tools, configurations, customizations, processes etc and transition them more or less as they are. It’s what allowed hypervisors to spread so quickly in the enterprise.

The main selling point of the forward-compatible Cloud is that you are more productive and focused. Fewer configuration items to worry about, fewer stack components to install/monitor/update, you can focus on your application and your business goals. You develop and manage at the level of application concepts, not systems. Bottom line, you write and deploy applications more quickly, cheaply and reliably.

To a large extent this maps to the distinction between IaaS and PaaS, but it’s not that simple. For example, a PaaS that endeavors to be a complete JEE environment is mainly aiming for the backward-compatible value proposition. On the other hand, EC2 spot instances, while part of the IaaS layer, are of the forward-compatible kind: not meant to run your current applications unchanged, but rather to give you ways to create applications that better align with your business goals.

Part of the confusion is that it’s sometimes unclear whether a given environment is aiming for forward-compatibility (and voluntary simplification) or whether its goal is backward-compatibility but it hasn’t yet achieved it. Take EC2 for example. At first it didn’t look much like a traditional datacenter, beyond the ability to create hosts. Then we got fixed IP, EBS, boot from EBS, etc and it got more and more realistic to run applications unchanged. But not quite, as this recent complaint by Hoff illustrates. He wants a lot more control on the network setup so he can deploy existing n-tier applications that have specific network topology/config requirements without re-engineering them. It’s a perfectly reasonable request, in the context of the backward-compatible Cloud value proposition. But one that will never be granted by a Cloud that aims for forward-compatibility.

Similarly, the forward-compatible Cloud doesn’t always successfully abstract away lower-level concerns. It’s one thing to say you don’t have to worry about backup and security but it means that you now have to make sure that your Cloud provider handles them at an acceptable level. And even on technical grounds, abstractions still leak. Take Google App Engine, for example. In theory you only deal with requests and not even think about the servers that process them (you have no idea how many servers are used). That’s nice, but once a while your Java application gets a DeadlineExceededException. That’s because the GAE platform had to start using a new JVM to serve this request (for example, your traffic is growing or the JVM previously used went down) and it took too long for the application to load in the new JVM, resulting in this loading request being killed. So you, as the developer, have to take special steps to mitigate a problem that originates at a lower level of the stack than you’re supposed to be concerned about.

All in all, the distinction between backward-compatible and forward-compatible Clouds is not a classification (most Cloud environments are a mix). Rather, it’s another mental axis on which to project your Cloud plans. It’s another way to think about the benefits that you expect from your use of the Cloud. Both providers and consumers should understand what they are aiming for on that axis. Hopefully this can help prevent shout matches of the “it’s a bug, no it’s a feature” variety.

[UPDATED 2010/3/4: Apparently, Steve Ballmer thinks along the same lines. Though the way he sees it, Azure is forward-compatible, while Amazon is backward-compatible: “I think Amazon has done a nice job of helping you take the server-based programming model – the programming model of yesterday, that is not scale-agnostic – and then bringing it into the cloud. On the other hand, what we’re trying to do with Azure is let you write a different kind of application.“]

[UPDATED 2010/3/5: I now have the quasi-proof that indeed Steve Ballmer stole the idea from my blog. Look at this entry in my HTTP log. This visitor came the evening before Steve’s “Cloud” talk at the University of Washington. I guess I am not the only one to procrastinate until the 11th hour when I have a deadline. Every piece of information in this log entry points at Steve Ballmer. How can it be anyone other than him?

One of the heavily discussed Cloud topics in early 2009 was a Cloud Computing taxonomy. Now that this theme has died down (with limited results), and to start 2010 in a similar form, here is a proposal for a taxonomy of the benefits of Cloud Computing.

Just like the original Cloud Computing taxonomy only had three layers (IaaS/PaaS/SaaS), so does this taxonomy of Cloud benefits. The point of this post is to promote the third layer. I describe layers 1 and 2 mainly to better call out what’s specific about layer 3.

Layer 1 (infrastructure: “let someone else do it”)

This is the bare-bottom, inherent benefit of Cloud Computing: you don’t have to deal with the hardware. In practice, it means:

no need to worry about power/cooling,

on-demand provisioning/deprovisioning (machines appear/disappear in a way physical machines do not),

not responsible for physical security (though responsible for ensuring that the provider has an acceptable security level),

More specifically, more automated IT management. This does not require Cloud Computing (you can have a highly automated IT management environment on premise), but the move to Cloud Computing is the trigger that is making it really happen. While this capability is not an inherent benefit of Cloud Computing, the Cloud makes it:

Needed: You don’t get to put color tags on machines, you don’t get to bring a DVD to install a new application, you don’t get to open a machine to insert more memory, you don’t get to go retrieve a backup tape, label it and put it in a safe, etc. Of course loosing these “privileges” doesn’t sound bad considering that they are mostly chores, but it means that you have to design alternative (and mostly programmatic) ways to perform the functions that these tasks addressed.

Easier: Cloud environments are highly API-driven. Many IT tools from the previous generation were console-centric (people would go out and buy “a network/event/system management console“) with APIs/protocols as a secondary thought. In Cloud environments, tools are a lot more API-centric with the console as an adjunct (anyone has stats about the ratio of EC2 instances provisioned via the AWS console versus the APIs?). This is also why even though a lot of people wanted standard management protocols (of the WSDM/WS-Management generation), there wasn’t as much of a realization of their importance in the old environment (and not as much pressure to create them and eagerness to adopt them). The stakes and visibility are a lot higher in the Cloud environments and that’s why this second wave of protocols will have to succeed where the previous one came short.

More beneficial: Once you have automated IT management in a traditional data center, what you get is fewer employees needed and somewhat better utilization. But you are still gated by the time/process to purchase/install new machines and the cost of unused machines (at least with automation you don’t have to pay their power/cooling). You don’t get the “just what I need” level of infrastructure usage that the same automation work allows in a Cloud setting.

Layer 3 (applications: “do it right”)

In short, use the move to the Cloud as an opportunity to fix some of the key issues of today’s applications. Think of the Cloud switch as a second Y2K, 10 years later: like in 2000, not only are there things that the transition requires you to fix, there are also many things that aren’t exactly required to fix but still make sense to fix as part of the larger modernization effort. Of course the Cloud move is missing that ever-so-valuable project management motivator of a firm deadline, but hopefully competitive pressure can play a similar role.

What are these issues? Here is a partial list:

Security: at least authentication and authorization. We have SSO/Federation systems, both enterprise-type and Web-centric and they often suck in practice. Whether it’s because of the protocols, the implementations, the tools or the mindset. Plus, there are too many of them. As applications gained mouths and ears and started to communicate with one another, the problem became obvious. If, in the Cloud, you also want them to grow legs and be able to move around (wholly or in parts) then it really really has to get fixed. Not to mention the “all or nothing” delegation model that I am surprised hasn’t yet created a major disaster (let’s see what 2010 has in store). I suggested a band-aid fix earlier, but this needs a real solution (the Cloud Security Alliance provides some guidance in this document, see “domain 12” for IAM).

Get remote application interfaces right. It’s been discussed, manifesto’ed, buried and lampooned many times before (this was my humble take on it). Whether it’s because of WS-* or, more likely, java2wsdl we have been delayed in this but it simply has to happen. Call it SOAP, zenSOAP, REST, practical REST or whatever you want. Just make sure that all important functions and data are accessible via clear, documented, consistent, easy-to-use, on-the-wire interfaces. Once we have these interfaces, and only then, we can worry about reliably composing/orchestrating applications that cross organizational boundaries.

Related to the previous point, clean up the incestuous relationship between an application and its data. Actually, it’s not “its” data. It’s the data it works on.

Deliver application-centric IT management. Quit loosing and (badly) re-creating information: for example, an application deployment followed by a black-box discovery (“what did I just do”?). Or after-the-fact re-establishing correlations between events on different servers (“what was this about”?). Application management too often looks like a day in the life of a senile person.

Fault-tolerance and disaster recovery. It is too often lacking (or untested, which is the same) for applications that are just below the perceived threshold of requiring it to be done right. That threshold needs to be lowered and the move to the Cloud can be used to make this possible.

As I mentioned above, these are mostly not Cloud specific (though it is possible to create a Cloud connection for each). They are things that we have known about and tried to fix for a while. But the pace has been pretty slow and there is an opportunity for the Cloud transition to do more than just hand out the keys of the datacenter.

My colleagues (and Enterprise Manager experts) Debu Panda and Arvind Maheshwari have a very handy book out, titled Middleware Management with Oracle Enterprise Manager Grid Control 10gR5 (that’s the latest release of Enterprise Manager). The publisher sent me a copy of the book. It illustrates well that Enterprise Manager does a lot more than just database management; it also provides coverage of most of the Oracle middleware stack (and some non-Oracle middleware components).

I am happy to provide an outline of the book, because it shows both how complete the book is and how wide the coverage of Enterprise Manager is for the Oracle middleware stack.

Chapter 1 provides an overview of the base Enterprise Manager product and its various packs.

Chapter 2 describes the installation process.

Chapter 3 describes the key concepts of the different subsystems of Enterprise Manager.

Chapter 11 describes the capability to manage non-Oracle middleware for these youthful errors you committed before seeing the (red) light.

Chapter 12 introduces some of the cool new application management features: Composite Application Modeler and Monitor (CAMM) to manage a distributed application across all its components, and Application Diagnostic for Java (AD4J) to drill down into a specific JVM.

Chapter 13 invites you to roll-up your sleeves and write your own plug-in so that Enterprise Manager can manage new types of targets.

Chapter 14 ends the book by sharing some best practices from customer experience.

All in all, this is the most user-friendly and accessible way to learn and become familiar with the scope of what Enterprise Manager has to offer for middleware management. The gory details (e.g. the complete list of target types, metrics and their definitions) are not in the book but available from the on-line documentation.

To end on a ludic note, you can use this table of content to test your knowledge of some Oracle acquisitions. Can you associate the following acquired companies with the corresponding chapter? Auptyma, Oblix, BEA, ClearApp, Collaxa, Tangosol.

Lots of communities think of Cloud Computing as the realization of a vision that they have been pusuing for a while (“sure we didn’t call it Cloud back then but…”). Just ask the Grid folks, the dynamic data center folks (DCML, IBM’s “Autonomic Computing”, HP’s “Adaptive Enterprise”, Microsoft’s DSI), the ASP community, and those of us who toiled on what was going to be the SOAP-based management stack for all IT (e.g. my HP colleagues and I can selectively quote mentions of “adaptation mechanisms like resource reservation, allocation/de-allocation” and “management as a service” in this WSMF white paper from 2003 to portray WSMF as a precursor to all the Cloud APIs of today).

I thought of another such community today, as I ran into older OMG specifications: the Model-Driven Architecture (MDA) community. I have no idea what people in this community actually think of Cloud Computing, but it seems to me that PaaS is a chance to come close to part of their vision. For two reasons: PaaS makes it easier and more rewarding, all at the same time, to practice model-driven design. More bang for less buck.

Easier

My understanding of the MDA value proposition is that it would allow you to create a high-level design (at the level of something like an augmented version of UML) and have it automatically turn into executable code (e.g. that can run in a JEE or .NET container). I am probably making it sound more naive than it really is, but not by much. That’s a might wide gap to bridge, for QVT and friends, from UMLish to byte-code and it’s no surprise that the practical benefits of MDA are still to be seen (to put it kindly).

In a PaaS/SaaS world, on the other hand, you are mapping to something that is higher level than byte code. Depending on what types of PaaS containers you envision, some of the abstractions provided by these containers (e.g. business process execution, event processing) are a lot closer to the concepts manipulated in your PIM (Platform-independent model, the UMLish mentioned above). Thus a smaller gap to bridge and a better chance of it being automagical. Especially if you add a few SaaS building blocks to the mix.

More rewarding

Not only should it be easier to map a PIM to a PaaS deployment environments, the benefits you get once you are done are incommensurably greater. Rather than getting a dump of opaque auto-generated byte-code running in a regular JVM/CLR, you get an environments in which the design concepts (actors/services, process, rules, events) and the business model elements are first class citizens of the platform management infrastructure. So that you can monitor and set policies on the same things that you manipulate in you PIM. As opposed to falling down to the lowest common denominator of CPU/memory metrics. Or, god forbid, trying to diagnose/optimize machine-generated code.

We shall see

I wasn’t thinking of Microsoft SQL Server Modeling (previously known as Oslo) when I wrote this, but Doug Purdy’s tweet made the connection for me. And indeed, one can see in SQLSM+Azure the leading candidate today to realizing the MDA vision… minus the OMG MDA specifications.