
Top Cloud API Sins

Here is a short list of the top poor design decisions that I've seen in cloud APIs. Let me rephrase that: Here's a short list of the top API features that got in our way or simply didn't work for us. There may well be other use cases where these features make sense.

Listing of resources without the details, e.g., a list-servers call that doesn't return all the details for each server. This makes it very expensive to poll for server state changes because the listing doesn't have enough information and so one has to do a show-server for each individual server. Imagine polling for an account that has several thousand servers - ouch. It's fine to have a "with details" flag in the request so one can get the bare list, but we'd always set that flag.
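To make the polling cost concrete, here is a minimal sketch that counts API calls for both styles of listing. `FakeCloud`, `list_servers`, `show_server` and the `with_details` parameter are hypothetical stand-ins for illustration, not any real cloud's API.

```python
class FakeCloud:
    """Hypothetical in-memory cloud API that counts the calls made to it."""

    def __init__(self, servers):
        self._servers = servers  # maps server id -> dict of details
        self.calls = 0

    def list_servers(self, with_details=False):
        self.calls += 1
        if with_details:
            return [dict(s) for s in self._servers.values()]
        return [{"id": s["id"]} for s in self._servers.values()]

    def show_server(self, server_id):
        self.calls += 1
        return dict(self._servers[server_id])


def poll_states_bare(cloud):
    """Bare listing: one list call plus one show call per server."""
    return {e["id"]: cloud.show_server(e["id"])["state"]
            for e in cloud.list_servers()}


def poll_states_detailed(cloud):
    """Detailed listing: a single call returns every server's state."""
    return {s["id"]: s["state"] for s in cloud.list_servers(with_details=True)}


servers = {f"srv-{i}": {"id": f"srv-{i}", "state": "running"} for i in range(3000)}
bare = FakeCloud(servers)
detailed = FakeCloud(servers)
assert poll_states_bare(bare) == poll_states_detailed(detailed)
print(bare.calls, detailed.calls)  # 3001 1
```

Both functions return the same state map; the only difference is that the bare listing costs one extra round trip per server on every polling cycle.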

Not returning a resource id on creation. Some APIs don't give you a server ID when you request a server to be launched; the response is just "ok, we'll launch a server". This means you end up guessing "is that new server that just appeared in the list the one I just launched?"

Providing a task queue. Several APIs I've seen have a task queue that is supposed to provide updates on tasks that are in progress. For instance, you launch a server and you get a handle onto a task descriptor. For us that's just overhead. Just include a state field in the resource itself and we'll just keep track of the state changes on the resource. So if mounting a volume takes a while, create the volume resource and set its state to "attaching" (or whatever is appropriate). Having a separate resource to say "that volume you created is attaching" is just overhead and means that the state of a resource is now in several places.
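A rough sketch of what "just keep track of the state changes on the resource" can look like on the client side. The `Volume` type and the "creating" and "attached" state names are assumptions for illustration; only "attaching" comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    """Hypothetical volume resource that carries its own state field."""
    id: str
    state: str  # e.g. "creating", "attaching", "attached" (assumed names)


def track_transitions(observations):
    """Collapse successive polls of one resource into its state transitions."""
    transitions = []
    last = None
    for vol in observations:
        if vol.state != last:
            transitions.append(vol.state)
            last = vol.state
    return transitions


# Five successive polls of the same volume, as a client would see them.
polls = [Volume("vol-1", s) for s in
         ["creating", "creating", "attaching", "attaching", "attached"]]
print(track_transitions(polls))  # ['creating', 'attaching', 'attached']
```

The client only ever looks at the volume itself, so there is no second resource whose state could disagree with it.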

Lacking publishable images or the equivalent of EC2's user_data (small amount of data that is passed to the launching server via the launch API call). I touched on these in my previous blog post.

Not returning deleted resources in a "list resource" call. In particular, terminated servers must be returned in a list servers call for a certain duration, probably at least for an hour. The reason is that otherwise the client has to infer that the server self-terminated or failed when it no longer finds it in the result of list servers calls. Well, we have seen multiple completely different clouds fail to list running servers. In the case of EC2, which lists terminated instances for a good amount of time, this resulted in error emails alerting us of the situation. In another cloud this resulted in servers being marked as terminated, which is an irreversible operation and often triggers alerts and automation. And then the servers "resurrected". Ouch! Now combine this with the next sin:

Pagination that goes page-wise instead of using a marker - for instance where you get page 1 or the first 100 resources and then issue a query for "page 2" or "from 100 on." Explain to me how a client can get a consistent resource listing when resources can be added and removed concurrently. This is particularly fun if the client has to infer deletion from the absence of a resource in the listing: was it deleted or did it fall through the cracks between pages due to a different resource being deleted concurrently with the listing? The proper way to do pagination is using markers the way Amazon does it, but for a cloud API I actually don't see the value in pagination. We always retrieve the whole list.
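The "fall through the cracks" case can be reproduced with a toy simulation. This is a sketch under assumptions: resource IDs are single letters, listings are sorted by ID, and all function names are made up for illustration. One resource ("a") is deleted between the first and second page request.

```python
def page_by_offset(ids, page, size):
    """Page-wise: return the page-th slice of the current sorted listing."""
    ordered = sorted(ids)
    return ordered[page * size:(page + 1) * size]


def page_by_marker(ids, marker, size):
    """Marker-based: return up to `size` IDs sorted strictly after `marker`."""
    ordered = sorted(i for i in ids if marker is None or i > marker)
    return ordered[:size]


def list_all_by_offset(ids, size, between_pages):
    seen, page = [], 0
    while True:
        chunk = page_by_offset(ids, page, size)
        if not chunk:
            return seen
        seen.extend(chunk)
        between_pages(page, ids)  # concurrent activity in the cloud
        page += 1


def list_all_by_marker(ids, size, between_pages):
    seen, marker, page = [], None, 0
    while True:
        chunk = page_by_marker(ids, marker, size)
        if not chunk:
            return seen
        seen.extend(chunk)
        marker = chunk[-1]  # resume after the last ID actually returned
        between_pages(page, ids)
        page += 1


def delete_a_after_first_page(page, ids):
    """Simulate another client deleting "a" right after page 0 was served."""
    if page == 0:
        ids.discard("a")


offset_result = list_all_by_offset(set("abcde"), 2, delete_a_after_first_page)
marker_result = list_all_by_marker(set("abcde"), 2, delete_a_after_first_page)
print(offset_result)  # ['a', 'b', 'd', 'e']  ("c" was never listed)
print(marker_result)  # ['a', 'b', 'c', 'd', 'e']
```

With page-wise listing, "c" exists the whole time yet never shows up, so a client that infers deletion from absence would wrongly conclude it is gone. The marker version resumes from the last ID it saw, so deleting an earlier resource cannot shift later ones out of view.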

If you're working on a cloud API, please think twice if you're doing one of the above. Again, I don't know all the use cases, just ours.

Now here's what I'd really like to see - an event based interface instead of a request-reply based interface. Request-reply is fine if you have a system that sends commands to the cloud. It's a problem when you build a system that reacts to changes in the cloud because you have to keep polling all these resources. We run a good number of machines that do nothing but chew up 100% CPU polling EC2 to detect changes. Fortunately CPU cycles are cheap :-).
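Until such an interface exists, a client has to synthesize events by diffing successive polls. Here is a minimal sketch of that, assuming each poll yields a map from server ID to state; the event names and `diff_snapshots` are made up for illustration. Note that a server absent from the new listing is reported as "missing" rather than "terminated", since absence alone does not prove termination.

```python
def diff_snapshots(old, new):
    """Turn two polled snapshots (id -> state) into a list of events."""
    events = []
    for sid in sorted(new.keys() - old.keys()):
        events.append(("appeared", sid, new[sid]))
    for sid in sorted(old.keys() & new.keys()):
        if old[sid] != new[sid]:
            events.append(("state_changed", sid, new[sid]))
    for sid in sorted(old.keys() - new.keys()):
        # Not listed any more: could be terminated, could be a listing glitch.
        events.append(("missing", sid, old[sid]))
    return events


before = {"srv-1": "running", "srv-2": "pending", "srv-3": "running"}
after = {"srv-2": "running", "srv-3": "terminated", "srv-4": "pending"}
events = diff_snapshots(before, after)
for event in events:
    print(event)
```

Here srv-3 produces an unambiguous state change because the cloud still lists it as terminated, while srv-1 can only be flagged for follow-up. Every client that reacts to changes ends up running some version of this loop against the full listing, which is where the polling CPU goes.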

Good point Mitch and hard to argue with. I'm focusing on the EC2 API. SQS, RDS fall in the same category. S3 and SDB really are different in that they are used to manage huge datasets. I actually argued less for omitting pagination than for getting it right, i.e. such that a client can get a reasonably consistent view of the dataset. Mhh, that's a vague notion... More precisely, a client should be able to enumerate all resources that exist at the start and at the end of the enumeration. Thus resources created after the start of the enumeration or deleted before the end of the enumeration may or may not be included in the result set.

Can Cloud standards be saved?...
Then: Web services standards One of the most frustrating aspects of how Web services standards shot themselves in the foot via unchecked complexity is that plenty of people were pointing out the problem as it happened. Mark Baker (to whom I noticed Don...

Randy, thanks for the comment. Going back and locating the offenders and figuring out what's under NDA and what isn't is too much work ;-). Good addition on the last one, we violate that one ourselves (oops), and you're correct (oops again).

Ok so your blog software filters out angle brackets and so on - here's an unadulterated version: http://samj.pastebin.com/f759f837
This is from OCCI Core (http://occi.googlecode.com/hg/docs/occi-core.html)

Sam, thanks for the comment & sorry for the angle brackets, I use wordpress.com (hosted), not sure what they do or why...
Can you clarify the link stuff? Are you saying that the client should use the "next" link to get the next page? That sounds equivalent to the marker used in AWS APIs (like S3).
One thing that worries me with OCCI is that everything seems "an implementation issue". Doesn't that lead to a situation where "OCCI compatible" means very little because it all depends on the specific implementation options?

[...] out that this omission is a real pain in the neck. RightScale’s Thorsten von Eicken was first to request “an event based interface instead of a request-reply based interface”, [...]

You aren't calling out offenders so I will. VMware's vCloud API requires N+1 calls to enumerate VM status because it violates your first rule.
And GoGrid has continually messed up pagination.
I would add one more item here. Your UI MUST use your API so you understand how to be a consumer of your own API. Having separate entry points that hit the same backend logic just leads to a nightmare.