Lessons learnt from shipping APIs for Microsoft's cloud platform

07/21/11

I was talking to someone at work about REST APIs and thought it’d be good to write up some of my experiences from designing and running the team that shipped the Windows Azure Service Management APIs. There are some other great ‘lessons learnt’ posts out there, like this one from Foursquare and this one from Daniel Jacobson of Netflix.

Disclaimer: These are my views alone. I don’t even think the rest of the team agrees and I know for a fact that some of these opinions are quite controversial.

Some background

The Windows Azure Service Management API (or SMAPI) was built in 2008 to essentially let developers control all aspects of their Windows Azure account - from deploying new code, changing configuration, scaling up and down services, managing storage accounts, etc. Here are the docs. I can’t talk about the numbers but it is safe to say that this API is hit an incredibly high number of times each day, from all sorts of clients - IDEs, developer tools, automated monitoring systems, evil bots, you name it. I was lucky to work with an incredible team on this API (hi Zhe! hi Dipa! hi Frank!) and I’m proud of what we managed to build. Onto the lessons learnt.

REST conventions don’t matter…that much

I’m going to incur the wrath of many of my friends for saying this. When we started off designing the API, we spent weeks perfecting the cleanest URL design possible. All of us had read Fielding’s thesis and it wasn’t uncommon for it to be waved around in heated design meetings. But in the end, I came to an important realization - we were wasting our time.

The URI+HTTP verb+wire format doesn’t matter.

Here’s why. Almost none of your developers will actually see the raw HTTP request+response. If your API is going to be the least bit successful, it has to have language-specific bindings, built either by you or by the community. Sure, it comes in handy to be able to compose API requests in curl in your sleep, and a clean design will definitely make your bindings developers’ lives easier. And if you blatantly ignore some basic REST idioms (like making GETs non-idempotent) you will break your API in all sorts of nasty ways. But if you’re spending every other meeting arguing over whether something should be a PUT or a POST for days or weeks on end, you’re wasting your time and the discussion is academic.

Case in point - look at the AWS APIs (not S3 but the rest). All of them are the ugliest RPC-style, verbose XML beasts you can find. But you can’t argue with how insanely successful they’ve been, can you?

Start with how the API is going to be used, work backwards

One of the best things we did while designing the API was to start with what the code using the API would look like and then work backwards to the REST definitions. Since the API didn’t exist yet, we just wrote pseudo-code in different languages for various important scenarios. This gave us a great feel for the API and pointed out some obvious flaws which weren’t visible from just reading the API specification. For example, we found out that we were forcing developers to construct too many intermediate objects in between successive calls to our API. We also found that developers couldn’t send back the same data structures they had retrieved from the API without modifying them. Without the pseudo-code, we probably wouldn’t have found these flaws until deep into our implementation, where any change would have cost a lot more. Or worse, we could have shipped the API like this and been stuck with this behavior for a long time.

We made it a rule that no API function could be reviewed unless it had some pseudo-code next to it. This also had the side effect of greatly improving API design meetings, since people could see concrete usage patterns instead of a list of objects and functions with parameters.

A terrible way to design an API is to start with your internal systems first, figure out a REST syntax that maps reasonably to them and then throw that over the wall to your users. There are many of these APIs floating around, from Microsoft and other companies, and they almost invariably cause developers a lot of pain.

Reduce the cognitive load

Be minimalistic in the number of ‘things’ your API exposes. Each concept your API exposes carries a significant cognitive load and dramatically increases the learning curve for your users. We were brutal in trimming the number of concepts we exposed, even when we risked merging slightly different types of ‘things’ into one ‘thing’. We were also brutal in trimming the operations we supported on each ‘thing’. All of this went a long way in making the API simpler than when it started out.

Make sure key actions are simple and fast, prefer chunky over chatty

One of the issues we hit very late in our API design was what is commonly called the ‘N+1’ problem. This is where someone queries a parent object to find a list of children and then issues a separate HTTP request to access each child. In our case, the single most common operation was accessing the list of services in Windows Azure and querying each of them to see what their status is. As we were very close to shipping, we didn’t have time to go rework our design to work around this so we put in what I thought at the time was a giant kludge - a ‘recurse’ parameter on the parent which expands all the children.

This surprisingly turned out to be very efficient and wound up making both devs and our servers a lot happier. The other feature we looked at was how to do partial responses, something GData now has support for. In each of these cases, the actual implementation wasn’t the cleanest (I tacked on a query parameter) but identifying a key usage scenario and optimizing it proved to be invaluable. I wouldn’t be surprised if this is saving millions of requests each day to our API services. More importantly, it makes clients faster and easier to implement.
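To make the difference concrete, here is a minimal sketch of the chatty and chunky access patterns. It is not the SMAPI implementation: `FakeApi`, its method names and the service data are all hypothetical, and the "API" is an in-memory object that only counts how many requests a client would have made.

```python
# Hypothetical in-memory stand-in for an HTTP API; it only counts requests.
class FakeApi:
    def __init__(self, services):
        self._services = services  # maps service name -> status
        self.request_count = 0

    def list_services(self, recurse=False):
        """One request. With recurse=True the children are expanded inline."""
        self.request_count += 1
        if recurse:
            return [{"name": n, "status": s} for n, s in self._services.items()]
        return [{"name": n} for n in self._services]

    def get_service(self, name):
        """One request per child: this is where the N+1 pattern comes from."""
        self.request_count += 1
        return {"name": name, "status": self._services[name]}


def statuses_chatty(api):
    # 1 request for the list, then 1 more for every child.
    return {s["name"]: api.get_service(s["name"])["status"]
            for s in api.list_services()}


def statuses_chunky(api):
    # A single request returns the parent with all children expanded.
    return {s["name"]: s["status"] for s in api.list_services(recurse=True)}


services = {"web": "Running", "worker": "Suspended", "cache": "Running"}
chatty_api, chunky_api = FakeApi(services), FakeApi(services)

# Both clients end up with the same answer...
assert statuses_chatty(chatty_api) == statuses_chunky(chunky_api)
# ...but at very different request counts.
print(chatty_api.request_count, chunky_api.request_count)  # prints: 4 1
```

With three services the chatty client makes four requests and the chunky client makes one. The gap grows linearly with the number of children.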

One good forcing function to make this happen is to build prototypes of API clients as you build the API. For example, we were maintaining clients that pushed builds, monitoring clients and prototypes that simulated other common scenarios. It is amazing how obviously bad an API becomes as soon as you write some code to consume it. As a designer, you’re better off making sure you’re the one discovering it and not your users.

Measure and log everything

Something we were very good about was measuring things. We instrumented our API to the wazoo and tracked those numbers keenly. Every day, we knew what users were hitting, what the popular calls were, what the common errors were and so on. This gave us great insight into how users were actually using the API (we could tell how much API activity we had saved using the N+1 hack, for example). When we rolled out new functionality, we could tell how users were adopting it and take corrective measures if we were not seeing what we expected. And probably most important of all, we could see what errors users were hitting and dig into those. Based on this data, we often changed or added error messages to make it clearer to users why they were seeing something. All this went a long way in increasing the usability of the API.
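As an illustration of the kind of per-operation numbers described above, here is a small sketch of a decorator that records call counts, error counts and latencies. The names (`instrumented`, `call_counts`, `get_service`) are made up for this example and the counters live in process memory; a real service would ship them to a metrics or logging system instead.

```python
import time
from collections import Counter, defaultdict
from functools import wraps

# In-memory stores, purely illustrative.
call_counts = Counter()          # operation -> number of calls
error_counts = Counter()         # (operation, error type) -> number of failures
latencies = defaultdict(list)    # operation -> list of durations in seconds


def instrumented(operation):
    """Wrap an API handler so every call is counted and timed."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            call_counts[operation] += 1
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Record which error the caller hit, then re-raise it unchanged.
                error_counts[(operation, type(exc).__name__)] += 1
                raise
            finally:
                latencies[operation].append(time.perf_counter() - start)
        return wrapper
    return decorator


@instrumented("get_service")
def get_service(name):
    # Hypothetical handler: only one service exists.
    if name != "web":
        raise KeyError(name)
    return {"name": name, "status": "Running"}


get_service("web")
try:
    get_service("missing")
except KeyError:
    pass

print(call_counts["get_service"])                 # prints: 2
print(error_counts[("get_service", "KeyError")])  # prints: 1
```

Counting errors by type per operation is what makes it possible to spot a confusing failure mode and then improve its error message.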

Versioning and extensibility

APIs change in unanticipated ways. Some of these changes are small - you want to add a new property to some data structure or add a new item to a list of items. Some are bigger - you want to change the authentication mechanism or drop support for existing APIs. For either of these, you’ll be thanking yourself if you future-proof your API from the beginning.

There are many ways to version your API to protect yourself. You could do what we did - the client sends a version header and the server responds with the behavior that version of the API promised. Or you could just add a version number to your path. In any case, just baking this into the first version of your API will save you a lot of heartburn down the road.
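A header-based scheme can be sketched roughly as follows. This is an illustration rather than the actual SMAPI code: the header name follows the `x-ms-version` convention used by Azure APIs, while the version strings, response shapes and function names are invented for the example.

```python
# Each supported version maps to the response shape that version promised.
# Version strings and fields are hypothetical.
HANDLERS = {
    "2008-01-01": lambda svc: {"Name": svc["name"]},
    "2011-01-01": lambda svc: {"Name": svc["name"], "Status": svc["status"]},
}


def handle_get_service(headers, service):
    """Return (status_code, body) shaped by the version the client asked for."""
    version = headers.get("x-ms-version")
    if version is None:
        return 400, {"error": "missing x-ms-version header"}
    handler = HANDLERS.get(version)
    if handler is None:
        return 400, {"error": f"unsupported version {version}"}
    return 200, handler(service)


service = {"name": "web", "status": "Running"}

# An old client keeps getting the old shape; a new client sees the new field.
print(handle_get_service({"x-ms-version": "2008-01-01"}, service))
print(handle_get_service({"x-ms-version": "2011-01-01"}, service))
```

Rejecting requests that carry no version at all is a deliberate choice in this sketch: it forces every client to pin a version from day one, so later changes never silently alter what an existing client receives.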

The other common change you want to design for is minor additions. Here, I think I could have done better with the Azure SMAPI. Something I like about the Foursquare API is the way they use generic structures and indirection (example - { type: "special", item: { special: { ...specialdata... } } }). This lets them add new ‘types’ without breaking old clients which don’t anticipate those types. I wish we had done something like this with the Windows Azure APIs. It would have made life much easier for us when we wanted to just add an item here or a list there without having to bump the API’s version number and break several clients.
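Here is a rough sketch of why that indirection helps, seen from the client side. The envelope shape follows the example above; the type names other than "special", the payload fields and the function names are all invented for illustration and are not taken from the Foursquare API.

```python
import json

# Renderers for the item types this (older) client knows about.
RENDERERS = {
    "special": lambda body: f"special: {body['title']}",
    "tip": lambda body: f"tip: {body['text']}",
}


def render_items(payload):
    """Render known item types and silently skip types added later."""
    rendered = []
    for entry in json.loads(payload):
        kind = entry["type"]
        renderer = RENDERERS.get(kind)
        if renderer is None:
            continue  # unknown type: an old client ignores it instead of breaking
        # The actual data sits under a key named after the type.
        rendered.append(renderer(entry["item"][kind]))
    return rendered


# The server has since started sending a "badge" type this client predates.
payload = json.dumps([
    {"type": "special", "item": {"special": {"title": "Free coffee"}}},
    {"type": "badge", "item": {"badge": {"name": "Explorer"}}},
    {"type": "tip", "item": {"tip": {"text": "Try the pie"}}},
])

print(render_items(payload))  # prints: ['special: Free coffee', 'tip: Try the pie']
```

Because the client dispatches on the `type` field and never assumes a fixed set of keys, the server can introduce new types without a version bump.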