When publishing a data API, there seems to be a conflict between requiring users to sign up for an API key and the principle of open access. What is best practice for publishing an open data API while avoiding overuse and abuse?

Details

When you publish a data API and are concerned about users over-using it, it seems normal to require an API key. This allows you to track and control each user's usage and to ban abusers. However, if obtaining an API key is easy or automatic, a banned user can simply register a new key. One might conclude that the publisher should provide a web form where you enter your e-mail address to receive the key.
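The weakness described here, and one way to close it, can be sketched in a few lines. This is a hypothetical in-memory key service (a real one would use a database); the trick is to ban the registered identity, not just the key, so a banned user can't simply recycle keys under the same address:

```python
import secrets

# Hypothetical in-memory stores; a real service would persist these.
issued_keys = {}       # api_key -> registered e-mail address
banned_emails = set()  # addresses whose keys were revoked for abuse

def issue_key(email):
    """Issue an API key instantly, unless the address was previously banned."""
    if email in banned_emails:
        return None  # banning the identity blocks simple key recycling
    key = secrets.token_hex(16)
    issued_keys[key] = email
    return key

def ban(email):
    """Revoke all keys for an abusive user and block re-registration."""
    banned_emails.add(email)
    for k, e in list(issued_keys.items()):
        if e == email:
            del issued_keys[k]
```

Of course, nothing stops a determined abuser from registering a fresh e-mail address, which is exactly the limitation discussed below.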

Compared to providing a simple download URL, requiring registration and an API key is not in the spirit of Open Data:

even this minor inconvenience can significantly reduce data reuse, as much data work is opportunistic

by providing contact details, the user tends to identify him/herself or their organisation, allowing for potential discrimination

But for many live datasets (e.g. traffic hotspots, weather measurements) an API is essential for practical use, and for massive datasets (e.g. crime maps) it is often more convenient to use an API than to download the whole thing.

Although the Open Definition has no strict rules against registration or API keys, what is the best practice for open publishers in this tricky area?

7 Answers

The other answers so far are all terrific. I'll reiterate one point, and make a new one:

The openness of an API is always important, but when complete, high-quality bulk data is available, some of these access issues become a lot more tolerable. An API is not a substitute for bulk data. The federal government has become very API-focused, and many of its APIs have throttling, API key registration, and even attribution requirements. The need for accompanying, complete bulk data is a point I hope the community will continue to press.

There's an important political aspect to API key registration, which is demonstrating (especially internally) that the API is a success, and worth continued investment. I work at the Sunlight Foundation, and this is one of the reasons (in addition to abuse, contacting devs, etc.) we require registration of an API key. This is both quantitative (measuring hits), and qualitative (it's nice to have logos of big organizations on the sidebar of our API homepage).

Of course, government agencies are in a fundamentally different situation than non-governmental organizations. They're funded directly by taxpayers and are the original producers of information that literally belongs to the public.

When agencies consider the benefits of using API keys versus providing open access, the scale should be heavily tilted towards open access to the people's information. Providing free bulk data access in addition to any APIs (like the Census does), and providing key-free API access (like the Federal Register does) are models I strongly encourage.

Update: See FederalRegister.gov's API case study for their rationale for not using API keys:

In our view, API keys can create an unnecessary barrier to rapid experimentation with our public data. We are able to track our API usage via logging mechanisms on our servers and already have infrastructure in place to mitigate any sort of excessive requests. The benefits of using a simple REST-ful API format are that any user can easily try it in their browser (no SOAP that requires complicated XML to be POSTed around, no special headers, etc). The response to our no keys policy from the development community has been extremely positive (http://news.ycombinator.com/item?id=2839137).
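The keyless approach FR.gov describes (tracking usage via server logs and mitigating excessive requests) can be sketched as a sliding-window throttle keyed on client IP instead of an API key. The window size and limit here are illustrative assumptions, not FederalRegister.gov's actual policy:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # assumed window
MAX_REQUESTS = 100    # assumed per-window limit

_hits = defaultdict(deque)  # client IP -> timestamps of recent requests

def allow_request(ip, now=None):
    """Keyless throttling: identify clients by IP (as server logs do)
    and reject bursts over the per-window limit (HTTP 429 in practice)."""
    now = time.time() if now is None else now
    window = _hits[ip]
    # Drop timestamps that have aged out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return False
    window.append(now)
    return True
```

The point is that a REST API usable straight from the browser loses nothing by doing its abuse control server-side like this, rather than at registration time.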

Of course, putting value on knowing the name of a user is a reason for a publisher to favour registration over providing a normal bulk download URL (which is relatively anonymous). So let's not push that reason too hard... Rather than talking about the side-benefits of API keys and registration, can you provide more info on how those key-free API access schemes manage it?
– D Read, Jun 17 '13 at 10:01

Sure, I added a link and an excerpt from FR.gov's keyless API rationale. I'd love to see FR.gov elaborate further on how they manage their API keylessly while getting most or all of the benefits keys bring. I suppose you could always tweet at them: twitter.com/FedRegister
– Eric Mill, Jun 18 '13 at 15:03

Thanks, I think this addition hits the nail on the head - with careful design you CAN avoid registration/API keys and still cover off the key objections that people raise.
– D Read, Jun 18 '13 at 19:01

To build on some of these answers, the important distinction to remember is that an API is a service, not the data itself. This will be a custom application that you will build, that will have methods for getting at the data in ways that you find useful (at least initially), and that will use the resources (bandwidth, computing power, etc..) that you are providing. The data is just what the API serves up.

With that in mind, as a developer I often prefer to download the raw data files, as APIs will often lack the method I am looking for or won't let me easily do bulk analysis. So having an option to download the raw data in a standard format is always a good idea.

If you offer the data as a raw download, then I would absolutely encourage you to require registration for API use. By doing so, you can obtain some proof that your users agree to your terms of service (how many accesses they may make per day, what constitutes abuse, availability of the service, etc.) and some means of contact if they start to abuse those terms (if you want to warn them before cutting them off). I agree with others that registration should be an automated and near-immediate process, but if that is not possible, you should at least have the API respond with fake data so the user can begin developing and testing their app while they wait. One recommendation is to also collect the IP address of the user at registration. While IPs can change, many website or corporate connections are static, and you can ban the IP address as well as the user should they violate your terms of service.
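The two ideas in that paragraph (record the registering IP, and serve fake data while a key awaits activation) can be sketched together. All names and the sample rows are illustrative:

```python
import secrets

keys = {}  # api_key -> {"email": ..., "ip": ..., "active": bool}

# Obviously fake sample data, served while a key is pending activation.
SAMPLE_ROWS = [{"station": "TEST-1", "reading": 0.0}]

def fetch_real_data():
    """Stand-in for the real backend query."""
    return [{"station": "A-7", "reading": 12.4}]

def register(email, ip):
    """Automated registration: issue a key immediately, and record the
    requesting IP so the publisher can also ban by IP on abuse."""
    key = secrets.token_hex(16)
    keys[key] = {"email": email, "ip": ip, "active": False}
    return key

def query(key):
    """Until a key is activated, answer with fake sample data so the
    developer can start building and testing while they wait."""
    record = keys.get(key)
    if record is None:
        return None
    if not record["active"]:
        return SAMPLE_ROWS
    return fetch_real_data()
```

The fake-data path means a slow human approval step never blocks a developer from starting work.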

If people start to use your API in their website or other public-facing application, any downtime of your app can cause frustration. While most people out there are reasonable and understanding, some can be irrational. Having terms that a user accepts can at least help mitigate that irrationality (since it is all spelled out). And if you are providing the raw data as well, you can always say: if you don't like our terms or the way our service functions, just take the data and do it yourself. That, to me, is what truly makes data open.

As a data user, I feel it is essentially OK to make people register to obtain API keys. Those who use an API for data access are by definition application developers (even if they don't think of themselves that way), and their apps may saturate network bandwidth and server resources if coded without due consideration for others. Angry letters to ISPs have the potential to shut down innocent legitimate users as well as offenders. Traceability leads to a sense of responsibility.

Yet registration should be automatic. Users shouldn't have to wait until somebody on the East Coast approves their application. A non-bouncing e-mail address, possibly a spam check, a human entering a CAPTCHA in reply - all that should be enough.

This said, an API is not a substitute for a bulk download of the whole dataset (compressed!). For statistical and similar purposes it is often necessary to deal with the full set, and it is very frustrating to face the overhead of API calls (and the time spent on throttled accesses) just to get one's hands on the data. Anonymous FTP still seems the best solution for that (FTP servers can enforce bandwidth throttling without much hassle).

To recap:

Any extra hoop a user is forced to jump through means the data aren't really open.

A human decision-maker in the loop shouldn't be necessary.

API keys should be usable for experimentation and debugging right away.

There should be no need to come at the data producer with a FOIA request and a sledgehammer to get a supposedly open dataset.

I generally subscribe to the "Technocratic Utopianism" side of information politics, so I've been put off by needing to register to get data. My first thought on the matter was to allow use without registering, but rate-limit with a warning message if one group starts to get abusive ... probably based on individual IPs or IP ranges.

... and then I remembered this little problem I'm having.

What if someone uses the API in client software, rather than caching results on the front end? Then you've got tens of thousands of requests coming from all over the place, and no way to figure out who the proper contact is.

In my case, people really like http://umbra.nascom.nasa.gov/images/latest_eit_304.gif . They have for years. We get people direct-linking it from their websites all the time. When it got really bad in the past (we were behind a 200 Mbps pipe until ~3 years ago, and only 100 Mbps ~4 years ago), we'd find contact info for the website and ask them to cache the image every couple of hours and serve it themselves. And it generally worked.

But for years now, I've been trying to track down what software acts as a web browser and sets the user agent to 'CompanionLink', with no referrer. I've contacted companies with related names and products, and no one admits responsibility.

The other day's stats for the server :

272463 total requests
87417 requests for that image
82335 requests for that image by 'CompanionLink'
82338 requests by 'CompanionLink' total

(Those other 3 'CompanionLink' hits are for another 'latest'-type image ... which is new. The last time I looked into it, we had no other requests with that signature in the 30 days we're allowed to keep our logs.)

I'm guessing it was written into some default for ... something ... but what, I have no idea. At its peak, it was responsible for 3/4 of the hits to our webserver, with one IP address requesting it multiple times a minute. (That telescope only takes one image per filter every 6 hours these days ... and the spacecraft only gets one downlink a day.)

I've been sending an HTTP 406 status code for it for years now, and it still keeps coming.
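A blunt mitigation like that 406-by-User-Agent rule can be sketched as WSGI middleware (the wrapper pattern here is standard WSGI; the agent string comes from the story above, everything else is illustrative):

```python
# Sketch: refuse a misbehaving client by its User-Agent header,
# returning HTTP 406 Not Acceptable as described above.
BLOCKED_AGENTS = {"CompanionLink"}

def block_by_user_agent(app):
    """Wrap a WSGI app so blocked agents get a 406 before touching it."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "")
        if any(blocked in agent for blocked in BLOCKED_AGENTS):
            start_response("406 Not Acceptable",
                           [("Content-Type", "text/plain")])
            return [b"This client is blocked; contact the site operator.\n"]
        return app(environ, start_response)
    return middleware
```

As the story shows, though, a client that ignores status codes will keep hammering you anyway; the block only saves the bandwidth of the response body.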

...

I'm torn -- I think that for it to be 'open' data, people should be able to get it without registering. But then there's also this aspect, which has been frustrating me for years.

So, my thought on a middle ground:

Have some API key process that's instantaneous (no waiting for a human to review it)

Don't ask for contact info, but have them acknowledge that if they start hogging resources or become otherwise abusive, you might have to rate limit them or cut them off.

If you find someone mass-registering API keys or similar ... then you can block them or put some throttling on it.

During the API signup, offer them some way to check for status changes (either leave an e-mail address to be contacted at, or a page they can check to see if they've been blocked)
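That middle ground can be sketched as a small key registry: instant issuance with no human review and no contact details, plus a status users can poll to see whether they've been limited. All names here are illustrative:

```python
import secrets

# api_key -> status: "ok", "throttled", or "blocked"
registry = {}

def signup():
    """Instant key issuance: no human review, no contact info required."""
    key = secrets.token_hex(16)
    registry[key] = "ok"
    return key

def status(key):
    """What a status page would show: lets users discover a block
    without the publisher needing their e-mail address."""
    return registry.get(key, "unknown")

def throttle(key):
    registry[key] = "throttled"

def block(key):
    registry[key] = "blocked"
```

A status endpoint like this is the anonymous counterpart of the e-mail contact: the user comes to you for news instead of you needing a way to reach them.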

...

In the past, when we've blocked someone, we could generally figure out where they were from, or they'd contact us to ask what was up ... and we could work out if there was a better way to get what they wanted (e.g., add new functionality so they didn't have to send thousands of requests, physically ship drives with the data to them, or get them to hit a mirror of the data on an alternate network) ... but when you start getting DDoSed through abuse, it hurts all of the other people who might be using your API.

I've been asking this question a lot both internally and externally, because I see how requiring an API key could be perceived as not-entirely-open open data.

For a public dataset, if API keys are required, they should be made available instantly. This is definitely a best practice.

I don't find the pro-key argument that keys help prevent abuse very convincing, because you're right: an abusive user could just keep re-registering for keys. I am convinced by the argument that keys give the data holders a means of communicating with API users when there's a change to the API or underlying data.

There's also the idea of providing meaningful usage statistics via a key. I know there are other ways of doing that, but many open data APIs are basic (perhaps built with a CSV-to-API tool), would be using a key-as-a-service, and would have no way of independently generating statistics.
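The kind of usage statistics a key makes possible is simple to picture: count requests per key and report totals, distinct consumers, and top users (the sort of quantitative evidence of success mentioned earlier). A minimal sketch, with hypothetical key names:

```python
from collections import Counter

usage = Counter()  # api_key -> request count

def record_hit(api_key):
    """Called once per API request, e.g. from the request handler."""
    usage[api_key] += 1

def usage_report():
    """The numbers a publisher can show internally to justify the API."""
    return {
        "total_requests": sum(usage.values()),
        "distinct_keys": len(usage),
        "top_keys": usage.most_common(3),
    }
```

Without keys, the nearest equivalent is grouping server logs by IP or User-Agent, which conflates users behind shared addresses.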

Maybe I'm just lucky, but I've never had intentional abuse of one of my APIs. It's always been a case of somebody who just didn't realize what they were doing. That, I think, is the most likely form of abuse.
– Waldo Jaquith, Jun 16 '13 at 18:32

APIs are an extra service layer on top of the data itself. One could argue that this "closes" the dataset, but for most developers it's a huge convenience that saves them the overhead of administering their own database and access layers. (Can you imagine if you had to download a CSV of tweets and store them in MySQL just to display a feed on your website?)

If you want the data to be completely open (assuming the dataset updates less than once a week or so), offer a bulk download: text files are easy to host and bandwidth is cheaper than processing power.

If you additionally want to offer users an API as a service, you shouldn't have any qualms about making users jump through whatever hoops you need in order to prevent folks from a) pounding your database quasi-DDoS style, or b) costing you a small fortune in hosting fees.

That being said, it would be awfully nice to make the authorization process as painless and instantaneous as humanly (computationally?) possible. It's also nice to have some contact info for folks so you can do outreach to find out how they're using your service and what you can do to improve their experience. Or, if necessary, ask them politely to refactor their app so they don't kill your server.

This largely solves the problem of accidental DoS (static resources are trivial to serve up), and substantially reduces the possibility of API changes (which are rare in something as fundamental as the basic URL structure, but more frequent in the URL parameters for fiddling about with the response data). I can then also strongly encourage registration for a very, very low-volume mailing list ("it's wholly possible you'll never hear from us at all"), so that if something did change, I'd have a way of informing these no-registration users.