Researcher FAQ

Q: Can automated tools use the CrossRef REST API?

A: Yes. The API is designed to be used by automated processes, or ‘bots’ as they are often called.

Q: Can I use the CrossRef REST API with my own tools?

A: Yes. We encourage researchers to adapt their own tools to use the CrossRef REST API.

Q: What full text formats are supported by the CrossRef REST API?

A: It is up to the publisher to decide what formats they will deliver full text in. Some may only be able to deliver PDF. Others may deliver only XML. Some may deliver plain text. Some publishers may vary what they can deliver depending on the age of the content or other variables. Ultimately, researchers can use the CrossRef REST API and/or content negotiation to discover what representations are supported for any given CrossRef DOI.
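As a minimal sketch of that discovery step: the CrossRef REST API's works route returns metadata that can include a list of full-text links with their MIME types. The response below is illustrative sample data, not real publisher output, and the exact response shape should be checked against the current API documentation.

```python
# Sketch: pick out full-text URLs for a desired format from a
# CrossRef works-style response. The `sample` dict below is made up
# to illustrate the shape of the "link" field.

def full_text_urls(works_message, content_type):
    """Return the full-text link URLs matching the given MIME type."""
    return [
        link["URL"]
        for link in works_message.get("link", [])
        if link.get("content-type") == content_type
    ]

sample = {
    "DOI": "10.5555/example",
    "link": [
        {"URL": "https://publisher.example/ft.pdf", "content-type": "application/pdf"},
        {"URL": "https://publisher.example/ft.xml", "content-type": "application/xml"},
    ],
}

print(full_text_urls(sample, "application/pdf"))
```

A tool can call this once per DOI to learn which representations a publisher offers before deciding what to download.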

Q: Does the CrossRef REST API restrict the kind of information I can extract from the full text? For example, will it preclude me from working with images?

A: No, the CrossRef REST API puts no restrictions on the content types provided by publishers. Researchers can use the API to download publisher content to their local machines, at which point they are unrestricted in what they can extract from that content.

Q: Do I have to support rate-limiting in my text and data mining tools?

A: No, but it is a good idea anyway. Rate limiting is an option for publishers who need to carefully control traffic to their websites, and not all publishers will implement it. However, the safest approach for text and data mining tools that are likely to query many different publishers is to check for rate-limiting headers and to alter the tool's behaviour if they are found.
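One way to sketch that check: read the rate-limiting headers if they are present and compute a polite delay between requests. The header names used below (`X-Rate-Limit-Limit` and `X-Rate-Limit-Interval`) are assumptions modelled on those the CrossRef API has used; each publisher may emit different names, or none at all.

```python
# Sketch: compute a delay between requests from assumed rate-limit
# headers. Header names are an assumption; confirm what each
# publisher actually sends.

def polite_delay(headers):
    """Seconds to wait between requests, or 0.0 if no limit is advertised."""
    limit = headers.get("X-Rate-Limit-Limit")
    if limit is None:
        return 0.0
    # Interval is assumed to look like "1s" (a number of seconds).
    interval = headers.get("X-Rate-Limit-Interval", "1s")
    seconds = float(interval.rstrip("s"))
    return seconds / float(limit)

print(polite_delay({"X-Rate-Limit-Limit": "50", "X-Rate-Limit-Interval": "1s"}))
```

A mining tool would call this after each response and sleep for the returned number of seconds before its next request; when no headers are found, it proceeds without delay.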

Q: Is CrossRef providing the access control to subscription-based content?

A: No, access control to subscription-based content is managed on the publisher site using their existing systems. If you are encountering access control problems with content that you think should be available via your subscription and Client API Token, then you should contact the publisher (or the librarian at your institution) to address the issue.
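For subscription content, the client token is typically supplied as an HTTP request header when fetching full text from the publisher's site. The header name below is an assumption based on the CrossRef click-through service and should be confirmed against current documentation.

```python
# Sketch: build request headers carrying a client API token.
# The header name CR-Clickthrough-Client-Token is an assumption;
# check the current CrossRef/publisher documentation.

def tdm_headers(token):
    """Headers to send with a full-text request to a publisher site."""
    return {"CR-Clickthrough-Client-Token": token}

print(tdm_headers("my-example-token"))
```

The publisher's own access-control systems then decide, based on the token and your institution's subscriptions, whether to serve the content.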

Q: Why would I want to use an API when I can just screen-scrape a publisher's site?

A: Screen-scraping is inefficient, error-prone, and fragile. In addition to putting unnecessary load on publisher sites (downloading HTML, CSS, JavaScript, and other superfluous web assets), screen-scraping will often break if publishers redesign their websites even slightly. Finally, screen-scraping is typically tied to a specific publisher's page layout, so scrapers would need to be adapted individually on a publisher-by-publisher basis. When you consider that CrossRef has 4,000+ publishers, you can see that this could quickly become unmanageable.

Q: If a publisher is open access, doesn’t that obviate the need for a common API?

A: No. At last count the DOAJ lists 9,000+ OA journals, many from single-journal publishers. It would be extremely inefficient to support a different API for each OA publisher, or to screen-scrape them all.

Q: If a copyright exemption for text and data mining is enacted through law, wouldn’t that obviate the need for a common API?

A: No. Researchers still benefit from being able to download content from different publishers without having to resort to different publisher-specific APIs or screen-scraping publisher sites.

Q: Why tie the CrossRef text and data mining API to DOIs? Isn’t this an unnecessary layer of indirection?

A: Using the DOI as the basis for a common text and data mining API provides several benefits. For example, the DOI provides:

An easy way to de-duplicate documents that may be found on several sites. Processing the same document on multiple sites could easily skew text and data mining results, and traditional techniques for eliminating duplicates (e.g. hashes) will not work reliably if the document in question exists in several representations (e.g. PDF, HTML, ePub) and/or versions (e.g. accepted manuscript, version of record).

Persistent provenance information. Using the DOI as a key allows researchers to retrieve and verify the provenance of the items in a text and data mining corpus many years into the future, when traditional HTTP URLs will have long since broken.

An easy way to document, share, and compare corpora without having to exchange the actual documents.

A mechanism to ensure the reproducibility of text and data mining results using the source documents.

A mechanism to track the impact of updates, corrections, retractions, and withdrawals on corpora.
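The de-duplication benefit above can be sketched in a few lines: key each harvested record on its DOI, normalised to lowercase (DOIs are case-insensitive), and keep the first record seen. The record fields below are hypothetical, for illustration only.

```python
# Sketch: de-duplicate documents harvested from several sites by
# keying on their DOIs, lowercased because DOIs are case-insensitive.
# The record structure here is made up for illustration.

def dedupe_by_doi(records):
    """Keep the first record seen for each DOI."""
    seen = {}
    for rec in records:
        key = rec["doi"].lower()
        if key not in seen:
            seen[key] = rec
    return list(seen.values())

harvested = [
    {"doi": "10.5555/Example", "source": "site-a", "format": "pdf"},
    {"doi": "10.5555/example", "source": "site-b", "format": "html"},
    {"doi": "10.5555/other", "source": "site-a", "format": "xml"},
]
print(len(dedupe_by_doi(harvested)))
```

Note that a hash-based approach would treat the PDF and HTML copies of the first document as distinct; keying on the DOI collapses them regardless of representation or hosting site.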