Waiting for events (in Cloud APIs)

Events/alerts/notifications have been a central concept in IT management at least since the first SNMP trap was emitted, and probably even long before that. And yet they are curiously absent from all the Cloud management APIs/protocols. If you think that’s because “THE CLOUD CHANGES EVERYTHING” then you may have to think again. Over the last few days, two of the most experienced practitioners of Cloud computing pointed out that this omission is a real pain in the neck. RightScale’s Thorsten von Eicken was first to request “an event based interface instead of a request-reply based interface”, pointing out that “we run a good number of machines that do nothing but chew up 100% cpu polling EC2 to detect changes”. George Reese seconded and started to sketch a solution. And while these blog posts gave the issue increased visibility recently, it has been a recurring topic on the AWS Forum and other similar discussion boards for quite some time. For example, in this thread going back to 2006, an Amazon employee wrote that “this is a feature we’ve discussed recently and we’re looking at options” (incidentally, I see a post by Thorsten in that old thread). We’re still waiting.

Let’s look at what it would take to define such a feature.

I have some experience with events for IT management, having been involved in the WS-Notification family of specifications and having co-chaired the OASIS technical committee that standardized them. This post is not about foisting WS-Notification on Cloud APIs, but just about surfacing some of the questions that come up when you try to standardize such a mechanism. While the main use cases for WS-Notification came from IT (and Grid) management, it was supposed to be a generic mechanism. A Cloud-centric eventing protocol can be made simpler by focusing on fewer use cases (Cloud scenarios only). In addition, WS-Notification was marred by the complexity-is-a-sign-of-greatness spirit of the time. On this too, a Cloud eventing protocol could improve things by keeping simplicity in mind.

Types of event

When you pull the state of a resource to see if anything changed, you don’t have to tell the provider what kind of change you are interested in. If, on the other hand, you want the provider to notify you, then they need to know what you care about. You may not want to be notified on every single change in the resource state. How do you describe the changes you care about? Is there an agreed-upon set of states for the resource and you are only notified on state transitions? Can you indicate the minimum severity level for an event to be emitted? Who determines the severity of an event? Or do you get to specify what fields in the resource state you want to watch? What about numeric values for which you may not want to be notified of every change but only when a threshold is crossed? Do you get to specify a query and get notified whenever the query result changes? In WS-Notification some of this is handled by WS-Topics, which I still like conceptually (I co-edited it) but which is too complex for the task at hand.
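To make the threshold question concrete, here is a minimal sketch of one possible filter. The `ThresholdFilter` class, the `cpu_percent` field name and the returned labels are all hypothetical, not part of any existing Cloud API. The filter fires only when a watched numeric field crosses a limit, not on every change.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdFilter:
    """Hypothetical subscription filter: report only threshold crossings."""
    field: str    # name of the numeric field to watch (assumed to exist in the state)
    limit: float  # threshold the subscriber cares about

    def crossed(self, old_state: dict, new_state: dict) -> Optional[str]:
        """Return a label if the value crossed the limit, else None."""
        old, new = old_state.get(self.field), new_state.get(self.field)
        if old is None or new is None:
            return None  # field missing: nothing to compare
        if old <= self.limit < new:
            return "rose_above"
        if new <= self.limit < old:
            return "fell_below"
        return None  # value changed (or not) without crossing the limit


cpu_filter = ThresholdFilter(field="cpu_percent", limit=90.0)
```

With this filter, a change from 85 to 95 would produce a notification, while a change from 95 to 97 would not. Whether the provider or the subscriber evaluates such a filter is exactly the kind of design decision the questions above point at.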

Event formats

What format are the events serialized in? How is the event metadata captured (e.g. time stamp of observation, which may not be the same as the time at which the notification message was sent)? If the event payload is a representation of the new state of the resource, does it indicate which fields changed (and what the old values were)? How do you keep event payloads consistent with the resource representation in the request/response interactions? If many events occur near the same time, can you group them in one notification message for better scalability?
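As an illustration only, the sketch below shows one way an event payload could answer these questions. The JSON layout and the names `make_event`, `make_notification`, `observed_at` and `sent_at` are assumptions made for this example. Each event carries its own observation time, the changed fields with their old values and the full new state; a notification message groups several events and carries a separate send time.

```python
import json
from datetime import datetime, timezone


def diff_fields(old: dict, new: dict) -> dict:
    """Return {field: {"old": ..., "new": ...}} for every field that differs."""
    return {
        key: {"old": old.get(key), "new": new.get(key)}
        for key in sorted(set(old) | set(new))
        if old.get(key) != new.get(key)
    }


def make_event(resource_id: str, old: dict, new: dict, observed_at: datetime) -> dict:
    """Build one event: when the change was observed, what changed, and the new state."""
    return {
        "resource": resource_id,
        "observed_at": observed_at.isoformat(),
        "changes": diff_fields(old, new),
        "state": new,  # same representation as a regular GET would return
    }


def make_notification(events: list, sent_at: datetime) -> str:
    """Group several events into one message; sent_at may differ from observed_at."""
    return json.dumps({"sent_at": sent_at.isoformat(), "events": events})


event = make_event(
    "vm-1",
    {"status": "pending", "cpus": 2},
    {"status": "running", "cpus": 2},
    observed_at=datetime(2010, 1, 1, 0, 0, 0, tzinfo=timezone.utc),
)
message = make_notification([event], sent_at=datetime(2010, 1, 1, 0, 0, 5, tzinfo=timezone.utc))
```

Reusing the regular resource representation in the `state` field is one way to keep event payloads consistent with the request/response interactions.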

Subscription creation

Presumably you need a subscription mechanism. Is the subscription set in stone when the resource is created? Or can you come later and subscribe? If subscription is an operation on the resource itself, how do you subscribe for events on something that doesn’t exist yet (e.g. “create a VM and notify me once it’s started”)? Do you get to set subscriptions on a per-resource basis? Or is this a global setting for all the resources that you own? Can you have two different subscriptions on the same resource (e.g. a “critical events only” subscription that exists throughout the life of the resource, plus a “lots of events please” subscription that you keep for a few hours while troubleshooting)?

Subscription management

Do you get to come back and update/pause/delete a subscription? Do you get to change what filter the subscription carries? Or is it set in stone until the subscription expires? Can you change the delivery endpoint? What if events fail to be delivered? Does the provider cancel your subscription? After how many failures? Does it just pause it for a few hours? Keep trying?

Subscription expiration

Who sets the expiration period? The subscriber? Can the provider set a max duration? Do you get a warning message before the subscription expires? Can you renew a subscription or do you have to create a new one? Do you get a message telling you that it has expired? Where are these subscription-lifecycle messages sent? To the same endpoint as the regular messages? What if your subscription is being killed because your delivery endpoint is down? Clearly it makes no sense to send the warning message to that same endpoint. Do you provide a separate “subscription management” endpoint (different from the event delivery endpoint) when you subscribe? Alternatively, does an email message get sent to the registered user who set the subscription?
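One possible set of answers can be sketched in a few lines. Everything here is hypothetical: the `Subscription` class, the 24-hour `MAX_DURATION_S` cap and the rule that an expired subscription cannot be renewed are assumptions picked for illustration. The subscriber requests a duration, the provider caps it, and lifecycle messages go to a management endpoint that is distinct from the delivery endpoint.

```python
from dataclasses import dataclass

MAX_DURATION_S = 24 * 3600  # assumed provider-imposed cap on any single lease


@dataclass
class Subscription:
    delivery_url: str    # where events are sent
    management_url: str  # where expiry warnings would be sent (assumed separate)
    expires_at: float    # absolute time, in seconds

    def is_active(self, now: float) -> bool:
        return now < self.expires_at

    def renew(self, now: float, requested_s: float) -> float:
        """Extend the lease; return the duration actually granted by the provider."""
        if not self.is_active(now):
            # Design choice for this sketch: expired subscriptions must be recreated.
            raise ValueError("subscription already expired; create a new one")
        granted = min(requested_s, MAX_DURATION_S)
        self.expires_at = now + granted
        return granted


sub = Subscription(
    delivery_url="https://listener.example.invalid/events",
    management_url="https://listener.example.invalid/lifecycle",
    expires_at=1000.0,
)
```

Returning the granted duration lets the subscriber find out that the provider shortened the request without a separate lookup.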

Delivery reliability

How reliable do you want the notifications to be? Should the emitter retry until they’ve received a confirmation? How long do they keep messages that can’t be delivered? Some may have a very short shelf life while others are still useful weeks later. If you don’t have a reliable mechanism but you really “need to know about a lost server within a minute of it disappearing” (the example George gives) then in reality you may still have to poll just to make sure that an event wasn’t lost. If you haven’t received an event in a while, how can you test if the subscription is still working? Should subscriptions send a heartbeat message once in a while?
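If subscriptions did send heartbeats, the subscriber side could be as small as the sketch below. The `SubscriptionWatchdog` class and its parameters are hypothetical. The idea is to treat any message (event or heartbeat) as proof of life and to fall back to polling only when too many heartbeat intervals have passed in silence.

```python
from typing import Optional


class SubscriptionWatchdog:
    """Hypothetical subscriber-side check: is the subscription still alive?"""

    def __init__(self, heartbeat_interval_s: float, missed_allowed: int = 2):
        # Tolerate a few lost heartbeats before declaring the subscription suspect.
        self.deadline_s = heartbeat_interval_s * (missed_allowed + 1)
        self.last_seen: Optional[float] = None

    def record_message(self, now: float) -> None:
        """Call for every event or heartbeat received."""
        self.last_seen = now

    def should_poll(self, now: float) -> bool:
        """True when the silence has lasted long enough that polling is safer."""
        if self.last_seen is None:
            return True  # nothing received yet, so nothing can be assumed
        return now - self.last_seen > self.deadline_s


watchdog = SubscriptionWatchdog(heartbeat_interval_s=30)
```

This does not remove polling entirely, but it turns constant polling into an occasional fallback.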

Delivery mechanism

How do you deliver notifications? Do you keep HTTP connections open through tricks similar to how self-updating web pages work (e.g. COMET, long polling and soon WebSockets)? Or do you just provide a listener endpoint to which the notifier tries to connect (which, in the case of public cloud deployments, means you need to have a publicly-addressable listener, but hopefully not on the same Cloud infrastructure)? Do you use XMPP? AMQP? Email? Can I have you hold my events and let me come pull them?
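The last option, where the provider holds events until the consumer comes to get them, can be sketched as a mailbox with a cursor. The `EventMailbox` class and its `pull` signature are made up for this example and held in memory only; a real service would also need retention limits and authentication.

```python
class EventMailbox:
    """Hypothetical provider-side buffer from which a consumer pulls events."""

    def __init__(self):
        self._events = []

    def publish(self, event: dict) -> None:
        self._events.append(event)

    def pull(self, cursor: int = 0, limit: int = 100):
        """Return (events after the cursor, new cursor to use for the next pull)."""
        batch = self._events[cursor:cursor + limit]
        return batch, cursor + len(batch)


mailbox = EventMailbox()
mailbox.publish({"resource": "vm-1", "status": "running"})
mailbox.publish({"resource": "vm-1", "status": "stopped"})
first_batch, cursor = mailbox.pull(cursor=0, limit=1)
second_batch, cursor = mailbox.pull(cursor=cursor)
```

Because the consumer owns the cursor, it needs no publicly-addressable listener and can resume after downtime without losing events that are still retained.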

Security

Do you need to verify the origin of the events you receive? Or do you assume they may be forged and always initiate a connection to the provider to double-check? And on the other side, what are the security requirements for event delivery? If a user loses some of their privileges, do you have to go and cancel the still-active subscriptions that they created?
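One common way to verify origin without calling back to the provider is a shared-secret signature over the message body. The sketch below assumes that a secret is exchanged when the subscription is created and that the signature travels alongside the body (for instance in a header); both are assumptions, not features of any particular Cloud API. It uses HMAC-SHA256 from the Python standard library.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Provider side: compute a hex HMAC-SHA256 signature of the message body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_authentic(secret: bytes, body: bytes, signature: str) -> bool:
    """Subscriber side: recompute the signature and compare in constant time."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode(), signature.encode())


shared_secret = b"exchanged-at-subscription-time"  # hypothetical value
body = b'{"resource": "vm-1", "status": "stopped"}'
signature = sign(shared_secret, body)
```

A signature of this kind shows that the sender knows the secret and that the body was not altered. It does not by itself prevent replay of an old message, which would need a time stamp or sequence number inside the signed body.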

Throttling

Is there a maximum event rate? Do you get charged for the events the Cloud provider sends you? How do you make sure that someone doesn’t create a subscription pointing to the wrong endpoint (either erroneously or maliciously, e.g. DoS)? Do you send a test message at registration asking the delivery endpoint to acknowledge that they indeed want to receive these notifications?
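The test-message idea can be sketched as a confirmation handshake. The `SubscriptionRegistry` class below is hypothetical. The provider generates a random token, delivers it to the endpoint itself (not to whoever asked for the subscription), and only activates the subscription once that endpoint echoes the token back.

```python
import secrets


class SubscriptionRegistry:
    """Hypothetical provider-side registry that requires endpoint confirmation."""

    def __init__(self):
        self._pending = {}   # token -> endpoint awaiting confirmation
        self.active = set()  # endpoints that agreed to receive events

    def request(self, endpoint: str) -> str:
        """Start a subscription; the token would be sent to the endpoint itself."""
        token = secrets.token_urlsafe(16)
        self._pending[token] = endpoint
        return token

    def confirm(self, endpoint: str, token: str) -> bool:
        """Activate only if this endpoint presents the token it was sent."""
        if self._pending.get(token) != endpoint:
            return False
        del self._pending[token]  # tokens are single-use
        self.active.add(endpoint)
        return True


registry = SubscriptionRegistry()
endpoint = "https://listener.example.invalid/events"
token = registry.request(endpoint)
```

An endpoint that never asked for events simply ignores the test message, so a mistyped or malicious subscription never becomes active.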

Conclusion

My goal is not to argue that we cannot have a simple yet good enough notification system or to scare anyone from attempting to define it. It’s just to show that it’s not as simple as it may seem at first blush. But there probably is a sweetspot and people like Thorsten and George are very well qualified to find it.

[UPDATED 2010/4/7: Amazon releases AWS Simple Notification Service. Not just as an eventing feature for the Cloud API, but as a generic notification service. Which can, of course, also carry Cloud management events. Though at this point you’re on your own to publish them from your instances; it doesn’t look like the AWS infrastructure can do it for you. Which means, for example, that you’re not going to be able to publish an event for a sudden crash.]

11 Responses to Waiting for events (in Cloud APIs)

Yes, once all is said and done, operating clouds will require events and notifications etc. via subscriptions. Thus, such standards, technology and processes will need to be in the clouds, be it IaaS, PaaS or SaaS. These will ensure processes such as incident, event, problem management etc. are not ignored.

In general, I find that service management processes (as in the ITIL process framework and others), which include service operation and others, are somewhat of an afterthought in the cloud. It seems the focus is on service transition or build (me a VM or something).

Whether such standards such as the WS-splat dealing with wisemen and wisdom can ever be reused is another more politically charged consideration.

Ultimately, business processes could span multiple clouds (be it public or private in-house) and non-cloud legacy things. Tying all the service management processes together could be a nightmare without such technology enablement.

Life was somewhat less automated before even SNMP and yet we still had events and management of them that happened, in that case through the evil proprietary bigco systems.

Ultimately, the success of the cloud could very well depend on its service manageability in addition to security, across the stack (IaaS, PaaS and SaaS), without which it will never be leveraged in an Enterprise Architecture.

I think that as is/was the case with the WS-Splat standards, we have a conflict between “what we graybeards know will be necessary” and “making something simple enough to be adopted.” As I (non-exclusively) remarked then, one of the interesting issues w.r.t. WS-* was that, unlike most places, the standards were developed long before there were actual implementations that needed them. That is, we were looking at management, events, etc., before there were really any active web services in dire need.

From the point of view of [platform] developers, this was good — we wouldn’t have to go redo everything. However, from the point of view of web service developers, it was scary — too many things with which to comply before you even got off the ground (and few tools to assist).

Much of this, I think, came about because “web services” were “plain old services” (CORBA, whatever) with a syntax — so we all knew what would be necessary, and just built it in from the beginning (along with all the other things that had been missing — see ‘second system effect’).

Although not terribly familiar with the Cloud world yet, I would hope that we learn from previous experience and allow the standards to have a relatively low cost of entry. Just as was attempted in WS-Notification & friends (I don’t know how well it worked), it would be good to have “simple” events and then standard extensions that might allow for appropriate scaling of the performance as well as the feature set as time goes on (& need develops). While I don’t mean to imply that there are lots of “small clouds”, it would be nice to define such a standard in an approachable means — a relatively small, uncluttered, core set of features whose more feature-complete-yet-complex extensions can be implemented as time allows.

Put another way, while the issues in the blog-post are undeniably required “in the fullness of time,” is there some way to define a layered standard so that one can start without providing everything?

You’re exactly right, Fred. I didn’t mean to imply that all the possible features I suggest above need to be implemented. In fact, I don’t think they do. The minimal functional set that you have in mind is the “sweetspot” that I allude to in the last paragraph.

Not sure what stage of an Events API Amazon AWS is at right now, or if they have really started working on those feature requests. It would definitely be good to have these features from other cloud operators too, and not only AWS.

IMHO, this gets easier if you assume the event system is about state synchronization. If that is the primary use case then you spend most of your time solving the issue of ensuring that a cloud consumer’s subset of the overall cloud state is correct or, at least, knowing when it is not in sync and responding in kind. Then the consumer can operate against that local state replica relatively safely. It’s very similar to the tricks you have to play in synchronizing the local state in a browser DOM to the backend it is talking to.

I could explain further but it would be a long blog post. We don’t need a complex solution. We only need a simple distributed state sync protocol that uses asynchronous events and knows when it is disconnected or out of sync. Just like a browser UI.

Although it’s not a spec and perhaps doesn’t satisfy all your requirements, I strongly believe the answer is based on the idea of webhooks. I’m certainly not a fan of the complexity and lack of adoption power of WS-*, and I believe an organic standard/convention can emerge from the existing pattern people are using today to do this sort of stuff. Although it’s not general enough, there is PubSubHubbub. There are also some very well designed APIs for webhook events (most recently, FreshBooks). There are also best practices to address many of the concerns you have.