Many of us have been thinking (a bit) and talking (a lot) about the relationship between Clouds and good old IT management. John understands both sides and produced a few good posts (like this one).

Maybe it’s just a coincidence that both Hyperic and CA recently made such announcements. In any case, it gives the impression that time has come for some actual product capabilities in the area of managing Cloud-based systems.

I haven’t investigated either, so keep your slideware shields up, but this is what I read:

From Javier Soltero’s “Announcing HQ 4.0”: “It also provides the first cloud-friendly management agent which allows users to manage cloud based virtual machines securely and reliably from either inside the cloud, or from HQ 4.0 installations inside your datacenter”. John approves.

And at CA World, according to InformationWeek, “CA will announce a partnership with Amazon to provide management capabilities around Amazon’s EC2 utility computing platform, potentially including discovery of software running on EC2 instances, performance monitoring, configuration management, software deployment capabilities and provisioning”.

When someone looks into these two products (and others, soon to follow or alrady out and that I have missed), it will be interesting to see how these Cloud-friendly capabilities relate to the good old capabilities of management products: “software discovery”, “perf monitoring”, “config management”, “software deployment”, “provisioning”. That all sounds pretty familiar. Is it just a matter of pointing the old tools to an EC2 IP address? Is it all new capabilities, done in a new way? Or, more realistically, where does it land between these extrems? Where do you want them to land? It’s not so obvious.

Utility computing comes with an expectation of additional flexibility (now that is obvious). When tweaking IT management tools to address the domain, does one leave “in datacenter” capabilities the same and branch off to do cool things in the new land? Or do you raise the level of flexibility accross the board?

In other words, rather than snickering at them, maybe we should praise IT management vendors for whom the “look, I do Clouds” marketing spiel is just a repackaging of normal IT management features. Because it may mean that they’ve raised the bar on “in datacenter” automation capabilities. These Opsware and BladeLogic acquisitions have to come in somewhere, don’t they?

BTW, both of the announcements above also perpetuate the confusion between providing utility services (CA’s extended SaaS offering, Hyperic’s release of a pre-packaged Hyperic AMI) and the ability to manage Cloud-based systems. It’s all crammed in the same announcement/article because, hey, it’s all Cloud stuff.

Speaking of CA World, if I was there I would go to this session. At least for old time sake, and maybe to get some interesting ideas. Hopefully Don will blog about it after he is done presenting later today.

Pricing is now available for Windows instances on Amazon EC2. More than the technical availability of Windows AMIs, the fact that you get pay your Windows license fee based on usage is a major change. This is where Microsoft’s announcement goes beyond Oracle’s EC2 announcement at Oracle Open World.

But why stop at EC2 instances? If I can do it there, why can’t I do it at home? Considering how rarely my home desktop is booted to Windows, I would love to pay my Windows license in a metered way. It would basically be limited to time spent editing video and participating in family Skype videconference (at least until I manage to get Skype full screen video to work on Ubuntu).

After all, why only Amazon and not other Cloud providers. And when this happens, I think I may become a cloud provider myself. It would be a small-scale operation. One physical CPU (my desktop). And one user (me). I would meter my usage and dutifully pay Microsoft every month based on the number of hours during which I was running Windows.

How much would that be? Well, a Linux Small Standard Image EC2 instance (the closest thing to my aging desktop) costs $0.10 per hour. The Windows version costs $0.125 per hour, so the Windows license on this machine costs 2.5 cents per hour. On a given month, I don’t use it for more than 10 hours (edit/render one DVD plus a few hours on Skype). That’s 25 cents. Does Microsoft take Paypal? Is the Microsoft tax about to get more progressive?

It will be interesting to see how Microsoft manages to be flexible on server OS licensing (where it has plenty of competition) and while keeping its highly profitable (and unfairly front-loaded and restrictive) desktop OS licensing intact.

The announcement finally came out. Users can now run supported versions of Oracle Enterprise Linux, 11G Database, Fusion Middleware and Enterprise Manager on Amazon EC2 instances. You can create your own AMI or use any of the pre-packaged AMIs with the above-mentioned products. And you don’t have to purchase new licenses, you can transfer existing ones to run on Amazon’s infrastructure.

A separate but related announcement is the possibility to simply and securely backup your databases on Amazon S3 instead of (or in addition to) on tape. I hope BNY Mellon will take notice.

This comes in addition to the existing SaaS offering (“On Demand”) from Oracle and the SaaS platform (for others to provide SaaS on top of Oracle’s software). It is a major milestone for utility computing.

Grid computing is moulting and, to no surprise, the new skin has “cloud” written all over it.

That’s one way to interpret the announcement today that HP, Intel and Yahoo are going to launch a compute cloud. Seeing Intel and HP work together on this is no surprise. Back at HP I had some involvement with the collaboration between HP Labs and Intel on PlanetLab.

I have only read the Gigaom article and Steve’s, so this post is not an analysis of the announcement. Just a few questions that come to mind. They can be most concisely expressed by trying to understand the difference with Amazon’s EC2. The quotes below all come from the Gigaom article.

“six physical locations” -> Amazon has availability zones, including the choice of three geographies.

“between 1,000 and 4,000 mostly Intel cores” -> According to this well-publicized story, Amazon can deliver 5,000 servers (each linked to at least one physical core) to one customer without breaking a sweat.

“We want, unlike other partnerships including Google and IBM’s where the lower-level stacks are not provided in a open manner to the world, open access to all levels of the hardware” -> The quote seems to conveniently avoid comparison with EC2 which provides a much lower abstraction level: virtual machines with mountable raw block storage devices. How much lower can you go without handing out access cards to physically walk into the datacenter? Access to the BMC on the motherboard? Access to some internal bus? Remote-controlled little robots that will slide cards in and out of a chassis?

“researchers will be able to access the cloud through a proposal process later this year” -> Ec2 offers pay-as-you go, which tends to be a good driver for people to use the infrastructure efficiently. And of course someone can always give researchers a grant in the form of EC2 rent money.

Just to be clear, I am not belittling the announcement because for one thing I haven’t read much about it and for another I probably know many of the HP Labs people involved and they are part of the “mucho sapiens” branch of “homo sapiens”. I know they wouldn’t bother putting this out if it was nothing more than giving researchers some free EC2 time.

But these are the questions I’ll be trying to answer for myself as I read more about this project.

But I do have something against their story being used to set the benchmark for infrastructure flexibility. For those who haven’t heard it five times already, the summary of “their story” is ramping up from 50 to 5000 machines in a week (according to the podcast). Or from 50 to 3500 (according to the this AWS blog entry). Whatever. If I auto-generate my load (which is mostly what they did when they decided to auto-create a custom video for each new user) I too can create the need for a thousands of machines.

This was probably a good business decision for Animoto. They got plenty of visibility at a low cost. Plus the extra publicity from being an EC2 success story (I for one would never have heard of them through their other channels). Good for them. Good for Amazon who made it possible. And who got a poster child out of it. Good for the facebookers who got to waste another 30 seconds of their time straining their eyes. Everyone is happy, no animal got hurt in the process, hurray.

That’s all good but it doesn’t mean that from now on any utility computing solution needs to support ramping up by a factor of 100 in a week. What if Animoto had been STD’ed (slashdoted, technoratied and dugg) at the same time as the Facebook burst, resulting in the need for 50,000 servers? Would 1,000 X be the new benchmark? What if a few of the sites that target the “lonely guy” demographic decided to use Animoto for… ok let’s not got there.

There are three types of user requirements. The Animoto use case is clearly not in the first category but I am not convinced it’s in the third one either.

The “pulled out of thin air” requirements that someone makes up on the fly to justify a feature that they’ve already decided needs to be there. Most frequently encountered in standards working groups.

The “it happened” requirements that assumes that because something happened sometimes somewhere it needs to be supported all the time everywhere.

The “it makes business sense” requirements that include a cost-value analysis. The kind that comes not from asking “would you like this” to a customer but rather “how much more would you pay for this” or “what other feature would you trade for this”.

When cloud computing succeeds (i.e. when you stop hearing about it all the time and, hopefully, we go back to calling it “utility computing”), it will be because the third category of requirements will have been identified and met. Best exemplified by the attitude of Tarus (from OpenNMS) in the latest Redmonk podcast (paraphrased): sure we’ll customize OpenNMS for cloud environments; as soon as someone pays us to do it.

This Forbes article (via John) channels 3Tera’s Bert Armijo’s call for standardization of utility computing. He calls it “Open Cloud” and it would “allow a company’s IT systems to be shared between different cloud computing services and moved freely between them“. Bert talks a bit more about it on his blog and, while he doesn’t reference the Forbes interview (too modest?), he points to Cloudscape as the vision.

A few early thoughts on all this:

No offense to Forbes but I wouldn’t read too much into the article. Being Forbes, they get quotes from a list of well-known people/companies (Google and Amazon spokespeople, Forrester analyst, Nick Carr). But these quotes all address the generic idea of utility computing standards, not the specifics of Bert’s project.

Saying that “several small cloud-computing firms including Elastra and Rightscale are already on board with 3Tera’s standards group” is ambiguous. Are they on-board with specific goals and a candidate specification? Or are they on board with the general idea that it might be time to talk about some kind of standard in the general area of utility computing?

IEEE and W3C are listed as possible hosts for the effort, but they don’t seem like a very good match for this area. I would have thought of DMTF, OASIS or even OGF first. On the face of it, DMTF might be the best place but I fear that companies like 3Tera, Rightscale and Elastra would be eaten alive by the board member companies there. It would be almost impossible for them to drive their vision to completion, unlike what they can do in an OASIS working group.

A new consortium might be an option, but a risky and expensive one. I have sometimes wondered (after seeing sad episodes of well-meaning and capable start-ups being ripped apart by entrenched large vendors in standards groups) why VCs don’t play a more active role in standards. Standards sound like the kind of thing VCs should be helping their companies with. VC firms are pretty used to working together, jointly investing in companies. Creating a new standard consortium might be too hard for 3Tera, but if the VCs behind 3Tera, Elastra and Rightscale got together and looked at the utility computing companies in their portfolios, it might make sense to join forces on some well-scoped standardization effort that may not otherwise be given a chance in existing groups.

I hope Bert will look into the history of DCML, a similar effort (it was about data center automation, which utility computing is not that far from once you peel away the glossy pictures) spearheaded by a few best-of-bread companies but ignored by the big boys. It didn’t really take off. If it had, utility computing standards might now be built as an update/extension of that specification. Of course DCML started as a new consortium and ended as an OASIS “member section” (a glorified working group), so this puts a grain of salt on my “create a new consortium and/or OASIS group” suggestion above.

The effort can’t afford to be disconnected from other standards in the virtualization and IT management domains. How does the effort relate to OVF? To WS-Management? To existing modeling frameworks? That’s the main draw towards DMTF as a host.

There seems to be some solid technical raw material to start from. 3Tera’s ADL, combined with Elastra’s ECML/EDML, presumably captures a fair amount of field expertise already. But when you think of them as a starting point to standardization, the mindset needs to switch from “what does my product need to work” to “what will the market adopt that also helps my product to work”.

One big question (at least from my perspective) is that of the line between infrastructure and applications. Call me biased, but I think this effort should focus on the infrastructure layer. And provide hooks to allow application-level automation to drive it.

The other question is with regards to the management aspect of the resulting system and the role management plays in whatever standard specification comes out of Bert’s effort.

Bottom line: I applaud Bert’s efforts but I couldn’t sleep well tonight if I didn’t also warn him that “there be dragons”.

And for those who haven’t seen it yet, here is a very good document on the topic (but it is focused on big vendors, not on how smaller companies can play the standards game).

[UPDATED 2008/6/30: A couple hours after posting this, I see that Coté has just published a blog post that elaborates on his view of cloud standards. As an addition to the podcast I mentioned earlier.]

[UPDATED 2008/7/2: If you read this in your feed viewer (rather than directly on vambenepe.com) and you don’t see the comments, you should go have a look. There are many clarifications and some additional insight from the best authorities on the topic. Thanks a lot to all the commenters.]

“If you have a stove, a saucepan and a bottle of cold water, how can you make boiling water?”

If you ask this question to a mathematician, they’ll think about it a while, and finally tell you to pour the water in the saucepan, light up the stove and put the saucepan on it until the water boils. Makes sense. Then ask them a slightly different question: “if you have a stove and a saucepan filled with cold water, how can you make boiling water?”. They’ll look at you and ask “can I also have a bottle”? If you agree to that request they’ll triumphantly announce: “pour the water from the saucepan into the bottle and we are back to the previous problem, which is already solved.”

In addition to making fun of mathematicians, this is a good illustration of the “fake machine” approach to utility computing embodied by Amazon’s EC2. There is plenty of practical value in emulating physical machines (either in your data center, using VMWare/Xen/OVM or at a utility provider’s site, e.g. EC2). They are all rooted in the fact that there is a huge amount of code written with the assumption that it is running on an identified physical machine (or set of machines), and you want to keep using that code. This will remain true for many many years to come, but is it the future of utility computing?

Google’s App Engine is a clear break from this set of assumptions. From this perspective, the App Engine is more interesting for what it doesn’t provide than for what it provides. As the description of the Sandbox explains:

“An App Engine application runs on many web servers simultaneously. Any web request can go to any web server, and multiple requests from the same user may be handled by different web servers. Distribution across multiple web servers is how App Engine ensures your application stays available while serving many simultaneous users [not to mention that this is also how they keep their costs low — William]. To allow App Engine to distribute your application in this way, the application runs in a restricted ‘sandbox’ environment.”

The page then goes on to succinctly list the limitations of the sandbox (no filesystem, limited networking, no threads, no long-lived requests, no low-level OS functions). The limitations are better described and commented upon here but even that article misses one major limitation, mentioned here: the lack of scheduler/cron.

Rather than a feature-by-feature comparison between the App Engine and EC2 (which Amazon would won handily at this point), what is interesting is to compare the underlying philosophies. Even with Amazon EC2, you don’t get every single feature your local hardware can deliver. For example, in its initial release EC2 didn’t offer a filesystem, only a storage-as-a-service interface (S3 and then SimpleDB). But Amazon worked hard to fix this as quickly as possible in order to be appear as similar to a physical infrastructure as possible. In this entry, announcing persistent storage for EC2, Amazon’s CTO takes pain to highlight this achievement:

“Persistent storage for Amazon EC2 will be offered in the form of storage volumes which you can mount into your EC2 instance as a raw block storage device. It basically looks like an unformatted hard disk. Once you have the volume mounted for the first time you can format it with any file system you want or if you have advanced applications such as high-end database engines, you could use it directly.”

and

“And the great thing is it that it is all done with using standard technologies such that you can use this with any kind of application, middleware or any infrastructure software, whether it is legacy or brand new.”

Amazon works hard to hide (from the application code) the fact that the infrastructure is a huge, shared, distributed system. The beauty (and business value) of their offering is that while the legacy code thinks it is running in a good old data center, the paying customer derives benefits from the fact that this is not the case (e.g. fast/easy/cheap provisioning and reduced management responsibilities).

Google, on the other hand, embraces the change in underlying infrastructure and requires your code to use new abstractions that are optimized for that infrastructure.

To use an automotive analogy, Amazon is offering car drivers to switch to a gas/electric hybrid that refuels in today’s gas stations while Google is pushing for a direct jump to hydrogen fuel cells.

History is rarely kind to promoters of radical departures. The software industry is especially fond of layering the new on top of the old (a practice that has been enabled by the constant increase in underlying computing capacity). If you are wondering why your command prompt, shell terminal or text editor opens with a default width of 80 characters, take a trip back to 1928, when IBM defined its 80-columns punch card format. Will Google beat the odds or be forced to be more accommodating of existing code?

It’s not the idea of moving to a more abstracted development framework that worries me about Google’s offering (JEE, Spring and Ruby on Rails show that developers want this move anyway, for productivity reasons, even if there is no change in the underlying infrastructure to further motivate it). It’s the fact that by defining their offering at the level of this framework (as opposed to one level below, like Amazon), Google puts itself in the position of having to select the right framework. Sure, they can support more than one. But the speed of evolution in that area of the software industry shows that it’s not mature enough (yet?) for any party to guess where application frameworks are going. Community experimentation has been driving application frameworks, and Google App Engine can’t support this. It can only select and freeze a few framework.

Time will tell which approach works best, whether they should exist side by side or whether they slowly merge into a “best of both worlds” offering (Amazon already offers many features, like snapshots, that aim for this “best of both worlds”). Unmanaged code (e.g. C/C++ compiled programs) and managed code (JVM or CLR) have been coexisting for a while now. Traditional applications and utility-enabled applications may do so in the future. For all I know, Google may decide that it makes business sense for them too to offer a Xen-based solution like EC2 and Amazon may decide to offer a more abstracted utility computing environment along the lines of the App Engine. But at this point, I am glad that the leaders in utility computing have taken different paths as this will allow the whole industry to experiment and progress more quickly.

The comparison is somewhat blurred by the fact that the Google offering has not reached the same maturity level as Amazon’s. It has restrictions that are not directly related to the requirements of the underlying infrastructure. For example, I don’t see how the distributed infrastructure prevents the existence of a scheduling service for background jobs. I expect this to be fixed soon. Also, Amazon has a full commercial offering, with a price list and an ecosystem of tools, why Google only offers a very limited beta environment for which you can’t buy extra capacity (but this too is changing).