Slow REST

We’re working on a fairly substantial revision of the
Sun Cloud API, motivated
by this problem: In a RESTful context, how do you handle state-changing
operations (POST, PUT, DELETE) which have substantial and unpredictable
latency?

What we’ve learned, from work with our own back-end based on the Q-layer
technology and with some other back-ends, is that Cloud operations are by and
large not very fast; and that the latencies show up in weird places. Here’s
an example: in our own implementation, creating a Virtual Machine from a
template or by copying another VM instance is very snappy. But weirdly,
connecting a network (public or private) to a VM can sometimes be
extremely slow. Go check out other implementations like EC2
and you see a similar unpredictable-latency narrative.

The idiom we’d been using so far was along these lines:

As with both AtomPub and Rails, when you want to create something new
you POST it to a collection of some sort and the server comes back with
“201 Created” and the URI of the new object.

When you POST to some controller (for example “boot a machine”) or do a
DELETE, the server comes back with “204 No content” to signal
success.

This is all very well and good; but what happens when some of these
operations take a handful of milliseconds while others (e.g. “boot all the VMs
in this cluster”) could easily go away for several minutes?

To summarize: For any and all PUT/POST/DELETE operations, we return
“202 In progress” and a new “Status” resource, which contains a 0-to-100
progress indicator, a target_uri for whatever’s
being operated on, an op to identify the operation, and, when
progress reaches 100, status and
message fields to tell how the operation came out. The idea is
that this is designed to give a hook that implementors can make cheap to
poll.
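The shape of that Status resource, and the polling loop a client would run against it, can be sketched roughly as follows. This is a sketch under assumptions: the field names (progress, target_uri, op, status, message) come from the post, but the example values and URIs are hypothetical, and the fake back-end stands in for real HTTP GETs against the Status URI.

```python
import time

def poll_until_done(fetch_status, interval=0.0):
    """Poll a Status resource until its progress reaches 100.

    fetch_status is any callable returning the Status representation as a
    dict with the fields the post describes: progress (0-100), target_uri,
    op, and, once progress hits 100, status and message.
    """
    while True:
        status = fetch_status()
        if status["progress"] >= 100:
            return status  # status/message now say how the operation came out
        time.sleep(interval)

# Fake back-end standing in for successive GETs on the Status URI.
_responses = iter([
    {"progress": 30, "target_uri": "/vms/42", "op": "create"},
    {"progress": 70, "target_uri": "/vms/42", "op": "create"},
    {"progress": 100, "target_uri": "/vms/42", "op": "create",
     "status": "200", "message": "VM created"},
])

result = poll_until_done(lambda: next(_responses))
```

Once progress reaches 100, target_uri plays the role the Location header would have played in a synchronous world.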

We also thought about a Comet style implementation where we keep the HTTP
channel open, and that can be made clean but support for it in popular
libraries is less than ubiquitous. My personal favorite idea was to use “Web
hooks”, i.e. the client sends a URI along with the request and the server
POSTs back to it when the operation is complete. But every time I started
talking about it I ran into a brick wall because it probably doesn’t work for
a client behind a firewall, which is where most of them will be. Sigh.

There are a few points that are still troubling me, listed here in no
particular order:

When an operation is finished and you want to provide a Status code,
we’re re-using HTTP status codes. Which on the one hand seems a bit outside
their design space, but on the other hand maybe it’s a wheel we don’t have to
re-invent.

Instead of having the “op” field, we could have a different media-type
for each imaginable kind of Status resource. That might be a bit more RESTful
but seems less convenient to use for client implementors.

This whole notion of the target_uri makes me wonder if
we’re missing a generalization. The most obvious role is
when the Status is that of a create operation, for example Create New VM; then
the target_uri is the new resource’s URI, what would come back in
the Location HTTP header in a synchronous world.

And in a few cases you might want more than one target, for example when
you’re attaching an IP address to a VM.

Hmmm.

Speaking of generalization, I wonder if this whole “Slow REST” thingie
is a pattern that’s going to pop up again often enough in the future that we
should be thinking of a standardized recipe for approaching it; the kind of
thing that has arisen for CRUD operations in the context of AtomPub and Rails.

Contributions

When I've had to do this we usually end up with an addressable state/resource to represent what's going on in transit, i.e.

- On

- Turning Off

- Off

You're right that it does come up a fair bit, and I can count at least three times already I've seen it modeled in very different apps.

Unsure about the generalization of it though, as you seem to want to improve the whole '202 in progress' notification part as well - and that starts meaning specific client capabilities/environments, i.e. why HTTP isn't XMPP.

Anything that gets REST further from a narrow CRUD usage view is all good with me, and if Slow REST helps that then more power to it.

I'm a big +1 for webhook notifications in general, but I agree that they don't really fit for sending "I'm done with this long operation" for a regular API. They'd be a useful additional service to provide for an account though - "call this webhook whenever I successfully launch a new instance" for example.

The progress indicator mechanism you describe sounds perfectly reasonable to me - a comet option would be nice as well, but again would need to be opt-in due to the lack of support in many clients.

I don't know if it makes sense in this situation, but the whole area of outbound notifications has never been standardized very well. I'm primarily thinking of things such as the Google Checkout and PayPal payment notifications or eBay's notifications.

First, there is the discussion of whether polling makes better sense.

Then, if firing out notifications is the way to go, how best to do that? POSTing querystrings? XML? Retry on error? How many times, and how frequently? Why not use SMTP and have consumers retrieve via POP/IMAP?

Pure bikeshedding, but rather than a 0-100 progress field I think it would be better to report progress as M-out-of-N, because

1. some clients may prefer to report it to the user that way;

2. it allows for finer-grained reporting for really slow operations;

3. it reduces the risk of bugs where some careless implementor does (e.g.) progress = round(100*fraction_done) without checking that it isn't returning 100 before the operation is finished.

Of these, the only really important one is #3; I know that the proposal says quite clearly (though I'd reword that sentence so that even an idiot couldn't think the "MUST" applies only "typically"...) that the server must return 100 only when the job is complete; but why make it easier to introduce bugs?
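The rounding bug the commenter warns about in #3 is easy to demonstrate. This is an illustrative sketch, not taken from any real implementation: the numbers are hypothetical work-item counts.

```python
# 9,995 of 10,000 work items finished -- the job is NOT done yet.
done, total = 9995, 10000
fraction_done = done / total

# The careless 0-100 scaling reports completion before it's true:
progress = round(100 * fraction_done)  # rounds 99.95 up to 100

# Reporting M-out-of-N cannot lie this way: done == total only
# when the job is genuinely finished.
still_running = done != total  # True
```

A client trusting the 0-100 field here would declare victory while five work items remain; the M-out-of-N pair stays unambiguous all the way to the end.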

I've seen video ingest APIs that have to deal with this problem (wanting to know when you're done downloading a long piece of content); the "Web hooks" approach is one version I've seen there. But those APIs were used in a context where everybody's behind the same firewall.

I'm just getting my feet wet with REST this week, so I offer these suggestions in the hope that you or your readers will disabuse me of any errors in my current understanding.

Should you not create a resource (a Job) via POST instead of RPC-style invocations? If it finished synchronously, the returned Job representation would reflect that. For Jobs that don't finish synchronously, the client can perform conditional GETs on the returned URI, retrieving any updated representations of the Job with the requisite status updates.

For PUT and DELETE operations that don't complete synchronously, the 202 Accepted is appropriate. Then, however, the client need simply GET the new URI, in the case of PUT. If the entity has been created, it will be returned. Otherwise, return 303 See Other with the status representation (likely a Job as above) returned. The GET on the old URI, in the case of DELETE, could return 410 Gone when it was successfully removed and 303 See Other with the Job status until then.
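The status-code choreography this comment proposes can be sketched as follows. This is a sketch under assumptions: the function names and the fake transport are mine, and the URIs are hypothetical; only the 200/303/410 logic comes from the comment.

```python
def check_put_result(get):
    """After a 202 on PUT, GET the entity URI. Per the proposed scheme,
    the server answers 200 with the entity once it exists, or
    303 See Other pointing at the in-flight Job until then."""
    code, body = get("/vms/42")
    if code == 200:
        return ("created", body)
    if code == 303:
        return ("in-progress", body)  # body is the Job URI to follow
    raise ValueError(f"unexpected status {code}")

def check_delete_result(get):
    """After a 202 on DELETE, GET the old URI. 410 Gone means the delete
    finished; 303 points at the Job still tracking it."""
    code, body = get("/vms/42")
    if code == 410:
        return ("deleted", None)
    if code == 303:
        return ("in-progress", body)
    raise ValueError(f"unexpected status {code}")

# Fake transport standing in for an HTTP client:
put_outcome = check_put_result(lambda uri: (303, "/jobs/7"))
delete_outcome = check_delete_result(lambda uri: (410, None))
```

The nice property is that the client never needs to remember whether an operation was slow: GETting the entity URI always tells it where things stand.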

Call it a Task or a Job, but it is part of the application state and should be made explicit in the API. I've been thinking along the lines of modeling the Job resource with the Command Pattern and utilizing something like webmachine to expose the RESTful API over HTTP.

Having made this clear, I can't help thinking of WSRF when I read this. Especially this part towards the end:

"Speaking of generalization, I wonder if this whole “Slow REST” thingie is a pattern that’s going to pop up again often enough in the future that we should be thinking of a standardized recipe for approaching it".

Sounds like a resource manipulation framework to me.

WSRF doesn't quite cover this use case (slow create), but between WS-ResourceLifetime and WS-Notification you see somewhat similar use cases at work (which BTW you may run into next). Add to this WS-MakeConnection (part of WS-RX) and your idea of "web hooks" becomes a lot more practical.

I remember discussions about scheduled resource creation (schedule the creation of a resource at some point in the future) in which similar questions were debated (in a SOAPier way, but not fundamentally different). Though this was in WS-RT scope, not WSRF, if I remember correctly. And we ended up not supporting the feature to limit complexity, though we had close to a solution (WS-flavored of course). Since it was never implemented, I don't know if it was a good one or not.

I am obviously not saying you should use WSRF or WSRT. There are plenty of aspects in which we $&^%ed it up, starting with EPRs and pretty much anything WS-Addressing-related. I for one made the decision to walk away from WSRF many years ago (before REST was considered an alternative for it) after having been part of its creation, so I am well aware of its flaws. But also of the relevance of many of its use cases. WSRF can be summed up as "mostly bad answers to mostly good questions" and it looks like the good questions are back with a vengeance.

Your posts about this Cloud API are very interesting to me, because I always had the intuition that "REST" and "WS-(Death)Star" would come a lot closer to one another when applied to the kind of applications I am used to (IT Systems Mgmt, which this "Cloud" stuff is part of in the grand scheme of things).

I am not in position to participate more concretely in this effort right now, but I can't help chiming in, thus these philosophical generalities... :-)

We were thinking about similar topics in OCCI (Open Cloud Computing Interface) recently, particularly with regard to criticisms of the actuators in the Sun API. A discussion starts around http://www.ogf.org/pipermail/occi-wg/2009-May/000618.html — I think the emerging results will be quite similar.

Does a client know when it's requesting an operation that may take a long time? Presumably you would want to optimize and return a 201 or 204 where possible, and then return a 202 (with a Location header pointing to the status resource) in the long-operation situation.
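The fast-path/slow-path split this comment suggests might look roughly like this on the server side. This is a sketch under assumptions: the function name, the duration threshold, and the /status/123 URI are all illustrative, not part of the published API.

```python
def respond(estimated_seconds, created_uri=None, slow_threshold=1.0):
    """Pick the response for a state-changing request: finish fast
    operations synchronously (201 Created / 204 No Content), and answer
    slow ones with 202 plus a Location header pointing at the Status
    resource. Threshold and URIs are illustrative assumptions."""
    if estimated_seconds > slow_threshold:
        return 202, {"Location": "/status/123"}  # hypothetical Status URI
    if created_uri is not None:
        return 201, {"Location": created_uri}
    return 204, {}

fast_create = respond(0.05, created_uri="/vms/42")  # synchronous create
slow_boot = respond(300)                            # long-running operation
```

The awkward part, of course, is that the server often cannot estimate the duration up front — which is an argument for returning 202 uniformly, as the proposal does.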

The callback solution seems unreliable, both in principle and by the language of the proposal, which recommends falling back to polling to be sure. I thus wonder whether it can achieve what it seems meant to do: reduce polling effort.

Could HTTP/1.1-style partial (1xx) responses be used to provide continuous feedback?

As an aside: This kind of problem has been seen in other domains, for example robotics: A robot arm may take a while to move and you usually want both a "I have started moving" as well as a "I am there now" notification (and likely intermediates).

In those fields, notification mechanisms have been an ongoing topic. In addition to polling and callbacks (as suggested in the proposal), many frameworks implement some kind of event-based notification which are issued by the server, but without the requirement to keep client-state on the server (whereas the current proposal requires the server to keep the webhook address across requests). Technically, most of these are either limited forms of broadcast or employ an intermediary (message broker), sometimes coupled with content-filtering. These solutions may not scale to the Web, but I am bringing them up to illustrate that not burdening the task-server itself with keeping notification state has been considered a feature before.

I prefer the callback / web-hook option too. I wonder if there is some way to mitigate the firewall issue by making it more expensive to support the callback, e.g. requiring SSL? Would an organisation be more prepared to relax its outbound rules on this basis? I don't know, but it might be a worthwhile trade-off versus repeated requests.

There are some ideas in the FriendFeed API that could be useful, e.g. long-polling with a timeout request parameter, and poll-interval in the response. The API also uses "tokens" which might be a way to group related operations together.

>> We also thought about a Comet style implementation where we keep the HTTP channel open, and that can be made clean but support for it in popular libraries is less than ubiquitous.

To allow sophisticated clients to use a Comet style approach so that they can be notified immediately when the operation is done, add to your design an optional client header that says "I'm willing to wait up to N seconds for a response." This would allow sophisticated clients to implement a simple form of long polling, while unsophisticated clients would simply poll as in your current design.
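A client-side version of that "I'm willing to wait N seconds" idea could look like this. This is a sketch under assumptions: the X-Wait header name is invented for illustration, and the fake transport stands in for a server that holds the second request open until the job finishes.

```python
def long_poll(fetch, max_wait=30):
    """Long-poll a Status resource. A hypothetical 'X-Wait' header tells
    the server how many seconds we will hold the connection; a supporting
    server answers early when the job completes, while a plain server
    answers immediately and we simply loop, degrading to ordinary polling.
    """
    while True:
        status = fetch({"X-Wait": str(max_wait)})
        if status["progress"] >= 100:
            return status

# Fake transport: the second "request" blocks server-side until done.
_responses = iter([
    {"progress": 40},
    {"progress": 100, "status": "200"},
])
final = long_poll(lambda headers: next(_responses))
```

The appeal is exactly what the comment says: one mechanism serves both sophisticated and naive clients, with no separate notification channel to negotiate.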

This is the direction I've been pushing for OCCI too... the actuators seem elegant at first, but once you start thinking about things like abandoning requests in process, monitoring progress, and asynchronous events in general, a "request" resource makes a lot more sense. The example I use is a backup which may not start until midnight and may take 12 hours from then to complete. Anyway, this is an extract from my post "Is HTTP the HTTP of cloud computing?" (http://samj.net/2009/05/is-http-http-of-cloud-computing.html) back in May. I haven't fully codified it yet, but it will likely look something like what you guys are doing (perhaps dropping stuff like the target_uri field in favour of a single Location: header with content negotiation):

RESTful State Machines

Something else which had not sat well with me until I spent the weekend ingesting the RESTful Web Services book (by Leonard Richardson and Sam Ruby) was the "actuator" concept we picked up from the Sun Cloud APIs. This breaks away from RESTful principles by exposing an RPC-style API for triggering state changes (e.g. start, stop, restart). Granted, it's an improvement on the alternative (GETting a resource and PUTting it back with an updated state), as Tim Bray explains in RESTful Casuistry (to which Roy Fielding and Bill de hÓra also responded), but it still "feels funky". Sure, it doesn't make any sense to try to "force" a monitored status to some other value (for example setting a "state" attribute to "running"), especially when we can't be sure that's the state we'll get to (maybe there will be an error, or the transition will depend on some outcome over which we have no control). Similarly, it doesn't make much sense to treat states as nouns, for example adding a "running" state to a collection of states (even if a resource can be "running" and "backing up" concurrently). But is using URLs as "buttons" representing verbs/transitions the best answer?

What makes more sense [to me] is to request a transition and check back for updates (e.g. by polling or HTTP server push). If it's RESTful to POST comments to an article (which in addition to its own contents acts as a collection of zero or more comments), then POSTing a request to change state to a [sub]resource also makes sense. As a bonus these can be parametrised (for example a "resize" request can be accompanied by a "size" parameter, and a "stop" request sent with clarification as to whether an "ACPI Off" or "Pull Cord" is required). Transitions that take a while, like "format" on a storage resource, can simply return HTTP 202 Accepted, so we've got support for asynchronous actions as well — indeed some requests (e.g. "backup") may not even be started immediately. We may also want to consider using something like Post Once Exactly (POE) to ensure that requests like "restart" aren't executed repeatedly, and that we can cancel requests that the system hasn't had a chance to deal with yet.

Exactly how this should look in terms of URL layout I'm not sure (perhaps http://example.com/<resource>/requests) but being able to enumerate the possible actions as well as acceptable parameters (e.g. an enum for variations on "stop" or a range for "resize") would be particularly useful for clients.
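That "requests subresource" idea can be sketched as a small request builder. This is a sketch under assumptions: the /requests URL layout is the comment's own tentative suggestion, and the field names ("action", the parameter keys) are mine, invented for illustration.

```python
import json

def make_transition_request(resource_uri, action, **params):
    """Build the POST that asks for a state transition: POST to the
    resource's hypothetical /requests subcollection, with the action
    name and its parameters in the body. Layout and field names are
    illustrative, not a published API."""
    return {
        "method": "POST",
        "uri": f"{resource_uri}/requests",
        "body": json.dumps({"action": action, **params}),
    }

# A parametrised resize, and a stop with its variant spelled out:
resize_req = make_transition_request("/vms/42", "resize", size="large")
stop_req = make_transition_request("/vms/42", "stop", mode="acpi-off")
```

Because each transition becomes a resource in its own right, the server can hand back its URI for progress polling, and a DELETE on it could express "cancel this request if you haven't started it yet".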

I don't understand why it is so popular to use integers 0-100 in interfaces intended for machines. (I've seen this other places, too.) Is there something wrong with the range 0 to 1? Are we still living in the 1970's when some HTTP clients ran on machines without floating point support?

I'm voting for web hooks to handle this. Although I think polling a "job"-like resource is an okay solution, the best thing to do would be to have the server push its progress to the client.

With Reverse HTTP and Opera Unite, this is quite possible today. With lots of the new scripting stuff going into HTML5, it will be even easier in the future. A fallback to a polling resource is probably a good idea anyhow, and then the interesting question to answer is: how do you give the client the option to choose polling or web hooks, and how do you request the URI to push progress to when the client has web hook support?

Why use a separate resource for the status of the operation? Why not have the status of the resource as a field on the resource itself?

So a VM's state would progress from "off" to "starting" to "running" (perhaps with arbitrary intermediate states such as "connecting to network" / "mounting persistent drives" / etc). You can dress this up as a structure (step M of N, step label, etc.) or just use text strings if you're lazy.

I like the proposals mentioned here, but I can report success with good old etags when adding a REST interface to an asynchronous (message queue based) real-time system. They too make for cheap polling.
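The ETag trick this comment reports can be sketched like so. This is a sketch under assumptions: the function and response shapes are mine, and the fake transport simulates a server answering 304 Not Modified when nothing changed; in real code you would sleep between polls instead of retrying immediately.

```python
def poll_with_etag(get):
    """Cheap polling with ETags: send If-None-Match with the last ETag
    seen; a 304 means nothing changed and costs the server almost
    nothing. get(headers) stands in for an HTTP GET on the job resource
    and returns (status_code, etag, body)."""
    etag = None
    seen = []
    while True:
        headers = {"If-None-Match": etag} if etag else {}
        code, new_etag, body = get(headers)
        if code == 304:
            continue  # unchanged; a real client would sleep here first
        etag = new_etag
        seen.append(body)
        if body.get("done"):
            return seen

# Fake server: one update, one "nothing changed", then completion.
_responses = iter([
    (200, '"v1"', {"done": False}),
    (304, None, None),
    (200, '"v2"', {"done": True}),
])
history = poll_with_etag(lambda headers: next(_responses))
```

The 304 responses need no representation at all, which is exactly what makes this kind of polling cheap for the server.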