
Elsevier stopped me doing my research

I am a statistician interested in detecting potentially problematic research, such as data fabrication, which results in unreliable findings and can harm policy-making, confound funding decisions, and hamper research progress.

To this end, I am content mining results reported in the psychology literature. Content mining the literature is a valuable way to investigate research questions with innovative methods. For example, our research group wrote an automated program that mines research papers for errors in the reported results, and found that 1 in 8 papers (of 30,000 examined) contains at least one result that could directly influence the substantive conclusion [1].

In new research, I am trying to extract test results, figures, tables, and other information reported across the majority of the psychology literature. For this I need access to the published psychology papers from which I can mine these data. To that end, I started ‘bulk’ downloading research papers from, for instance, ScienceDirect. I was doing this for scholarly purposes, and took potential server load into account by limiting the number of papers I downloaded to 9 per minute. I had no intention of redistributing the downloaded materials, had legal access to them because my university pays for a subscription, and only wanted to extract facts from these papers.

Full disclosure: I downloaded approximately 30GB of data from ScienceDirect in approximately 10 days. This boils down to an average server load of 35KB/s, or 0.0021GB/min, 0.125GB/h, 3GB/day.
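For illustration, the kind of client-side throttling described above can be sketched in a few lines of Python. This is a minimal sketch, not the actual script used; the `RateLimiter` class and the commented-out download loop are hypothetical:

```python
import time

class RateLimiter:
    """Ensure wait() returns no more than `per_minute` times per minute."""

    def __init__(self, per_minute):
        self.interval = 60.0 / per_minute  # seconds between requests
        self.last = None                   # monotonic time of last request

    def wait(self):
        """Sleep just long enough to respect the configured rate."""
        now = time.monotonic()
        if self.last is not None:
            remaining = self.interval - (now - self.last)
            if remaining > 0:
                time.sleep(remaining)
        self.last = time.monotonic()

# At 9 papers per minute, each download waits ~6.7 seconds after the last one.
limiter = RateLimiter(per_minute=9)
# for url in paper_urls:        # hypothetical list of subscribed-article URLs
#     limiter.wait()
#     download(url)             # hypothetical download helper
```

At 9 requests per minute this works out to the roughly 35KB/s average reported above, which is a negligible load for a large publisher platform.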

Approximately two weeks after I started downloading psychology research papers, Elsevier notified my university that this was a violation of the access contract, that this could be considered stealing of content, and that they wanted it to stop. My librarian explicitly instructed me to stop downloading (which I did immediately), otherwise Elsevier would cut all access to Sciencedirect for my university.

I am now not able to mine a substantial part of the literature, and because of this Elsevier is directly hampering me in my research.

[MINOR EDITS: the link to the article was broken, should be fixed now. Also, I made the mistake of using "0.0021GB/s" which is now changed into "0.0021GB/min"; I also added "35KB/s" for completeness. One last thing: I am aware of Elsevier's TDM License agreement, and I nonetheless thank those who directed me towards it.]


96 thoughts on “Elsevier stopped me doing my research”

We are happy for you to text mine content that we publish via the ScienceDirect API, but not via screen scraping. You can get access to an API key via our developer’s portal (http://dev.elsevier.com/myapikey.html). If you have any questions or problems, do please let me know. If helpful, I am also happy to engage with the librarian who is helping you.

In my case, I can’t accept the Elsevier TDM license since its provisions are unenforceable under the UK copyright exception.

Quoting the UK government’s guidance on the TDM copyright exception:

Publishers may wish to apply technological measures on networks for a number of purposes such as to ensure security or stability. These measures may be for reasons unrelated to text and data mining or may, for example, be intended to ensure that all users can access the benefits that text and data mining offers researchers. Examples of possible measures could be to impose a reasonable limit on download speeds or to control the number of times a user can access a network in a given period. These measures should not stop or unreasonably restrict any researcher’s ability to benefit from the exception.

Elsevier’s API is unworkable in my experience, often failing outright, and certainly counts as an ‘unreasonable’ restriction. In many cases the API returns only metadata in the XML, compared to the full-text PDF I can access on the website. Simply downloading the paper via the normal web interface for readers is easy, and much easier than using the API.

Beyond that, you need to consider that the content served by the API is not exactly the same as that served by the web server. Under UK law I have the right to perform non-commercial TDM on anything I can read – and I can read the website.

In addition, the license agreement requires a restrictive statement about reuse of the products of TDM to be attached to any output, but the statement restricts behaviours which are permissible under UK law.

The reason that we require miners to use the API is so that we can meet their needs AND ALSO the needs of our human users who can continue to read, search and download articles and not have their service interrupted in any way. Under UK legislation, publishers can use “reasonable measures to maintain the stability and security” of their networks, and so the requirement to use this API is fully compatible with the copyright exception.

Other text miners regularly use the APIs, and I don’t believe we have received reports of the APIs only returning metadata before. How frustrating this must have been for you. I would be very happy to connect you with technical support colleagues who can provide you with assistance or answer any questions you may have.

My interpretation as a software engineer with 15 years’ experience running web services, and that of legal scholars we have consulted, is that Elsevier’s API-use requirement does not satisfy the condition of being a “reasonable measure to maintain the stability and security” of their networks. There are simpler, less obstructive alternatives than what Elsevier has in place: rate limiting can be applied easily. Moreover, using the web interface with reasonable rate limits could not possibly impact the user experience of a site with the traffic that Elsevier’s network enjoys. If Elsevier believes that scraping with rate limits applied impacts the experience of their other users, I challenge you to prove that it does.

Whilst it was frustrating to receive metadata-only XML, I do not consider it my responsibility to pursue improvement of your system. I have hundreds of content providers to interface with in my work, and the only commonality is that they all have a web presence that can be accessed in a browser.

By far the easiest way to address this is to use cross-publisher APIs (like crossref, pubmed, and EUPMC) in the first instance. If any of those fails (as in the case of Elsevier), or if a content provider does not provide material via any of those APIs, I fall back to the web interface download alternative. If publishers would like to encourage use of APIs, they should make their content available through the existing systems with as few limitations as technically possible, and without requiring extra publisher-specific steps to be taken.
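As a sketch of that API-first strategy: Crossref’s REST API (`https://api.crossref.org/works/{DOI}`) returns a JSON `message` that can carry publisher-deposited full-text links tagged with `intended-application: text-mining`. A minimal, hypothetical helper to pick those out before falling back to web scraping might look like this (the sample response fragment is hand-made, not real Crossref data):

```python
def fulltext_links(message):
    """Return (url, content_type) pairs that Crossref marks for text mining.

    `message` is the parsed JSON body of a Crossref /works/{doi} response.
    An empty result means falling back to the normal web interface.
    """
    return [
        (link.get("URL"), link.get("content-type"))
        for link in message.get("link", [])
        if link.get("intended-application") == "text-mining"
    ]

# Hand-made example fragment of a Crossref "message":
sample = {"link": [
    {"URL": "https://example.com/article.xml",
     "content-type": "text/xml",
     "intended-application": "text-mining"},
    {"URL": "https://example.com/similarity",
     "content-type": "text/html",
     "intended-application": "similarity-checking"},
]}
print(fulltext_links(sample))  # only the text-mining link is kept
```

When the list comes back empty, as it often does for publishers that deposit no full-text links, the miner falls through to the browser-accessible version of the article.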

It is a simple reality that if your API makes it harder for researchers to do their work, they will make use of their legal right to mine via responsible web scraping.

My reading of the UK law is that it says nothing about reuse of the products of TDM. This makes it weak but it also means that requiring a statement about reuse (however restrictive) cannot restrict behaviours that the law permits.

If the XML provided by the API falls short of the content in a PDF then that is a shame and I would urge TDM researchers to feed this back and urge Elsevier to fix it. Analysing PDFs scraped from web sites strikes me as a poor use of time and energy that would be better invested in advancing research. Just because you can read a PDF (or a web page) doesn’t mean it is the best foundation for TDM if better alternatives such as XML are available.

Under UK law, copyright and all other intellectual property rights do not apply to facts. Collections of facts might enjoy protection under sui generis database rights, but that rarely applies to the output of mining scientific papers.

You are absolutely right that adding a statement about reuse cannot legally restrict behaviours that the law permits, but in practice it does exactly that. Most potential users of scientific data are not intellectual property law experts, and on sight of such a statement will simply avoid the data. To add such a statement to my own work would be against the public interest, and unethical.

You are quite right that XML falling short of the PDF content is a shame. However, especially in the case of older material, PDFs are often the only archive of content available. We have an array of technological approaches to extracting and cleaning data from PDFs, and if they are the only choice, we can work with them quite well. XML is preferable, but not if it means taking a lot of time out to debug APIs with each individual content provider.

Does this mean that, if you go through the API, you’re allowed to mine the full text of all Elsevier articles that you also have access to via ScienceDirect? Unlimited text mining, in other words, as long as you go through the API.

If so, then what’s the logic behind not allowing text mining through ScienceDirect? What difference does it make to Elsevier if a researcher chooses to be inefficient in the way he/she mines text? (Assuming that the API is more efficient, which I imagine it is.)

The reason that we require miners to use the API is so that we can meet their needs AND ALSO the needs of our human users who can continue to read, search and download articles and not have their service interrupted in any way. Science Direct holds 11 million pieces of content, shares infrastructure with Scopus, ClinicalKey, and other Elsevier products, and serves millions of researchers. I am told we are not alone in providing an API for this sort of high-volume access and that APIs also are used by others including Wikipedia and Twitter. We appreciate that users might wish to text mine across publisher platforms, and this is why we also participate in the multi-publisher cross-platform text and data mining service offered by CrossRef http://tdmsupport.crossref.org/

In response to Sebastiaan, I think there are extremely good reasons not to use the Elsevier API, not least those mentioned by Richard Smith-Unna. For instance they have rate-limits and restrictive terms & conditions on usage. It is not in any way “unlimited”.

This is far too restrictive to be useful. I support Chris in his decision not to use Elsevier’s API. I have also done mining work at the Natural History Museum, London on ScienceDirect content and I did not use the Elsevier API. Researchers should be free to choose which tools and methods they use to do research.

Thank you for your comment. At the moment, Elsevier’s API policy is terribly unclear. You state “there is no hard limit on the number of articles that can be mined per week” – thank you for being so specific. However I am intrigued by your next sentence which is not so specific: “We do have some rate limits…”

If these unspecified limits are not on the number of articles, perhaps they are on bandwidth (or some other property)? It would be extremely helpful if Elsevier were clearer about what its rate limits actually are. Publish this information, clearly! Both on the Elsevier site you linked to and in your comments here, the information given appears to be purposefully vague and unhelpful. I cannot honestly use a service whose limits I still don’t understand.

So if it’s only 9 a minute, what’s stopping 20 of my colleagues downloading an article from ScienceDirect every two minutes for our shared reading group? On the other hand, there could be hundreds of people at my university alone simultaneously accessing ScienceDirect, thousands across the country, and tens or hundreds of thousands globally. I hope the SD servers can stand up to that. I’m getting worried, given the statements above…

(I cannot seem to re-reply directly to your comment, so I’ll post it like this.)

First, thanks for taking the time to reply, and giving Elsevier’s point of view. However, I would like to press you a bit on my main question, which you didn’t answer:

Does this mean that, if you go through the API, you’re allowed to mine the full text of all Elsevier articles that you also have access to via ScienceDirect? Unlimited text mining, in other words, as long as you go through the API.

If no, then I feel that your reply is disingenuous, suggesting that all researchers need to do is use the API while this is in fact restricted. On the other hand, if yes, then you have a point. So…? It’s a simple yes/no question.

“I am told we are not alone in providing an API for this sort of high-volume access and that APIs also are used by others including Wikipedia and Twitter. ”

While Wikipedia supports access through an API, they don’t use it as a way to limit access, as Elsevier apparently does. First of all, the Wikimedia API doesn’t have hard limits on access; the documentation simply says “There is no hard and fast limit on read requests, but we ask that you be considerate and try not to take a site down.” (See https://www.mediawiki.org/wiki/API:Etiquette. Some Wikimedia instances can add rate limits, but they’re not built into the API, and I’m not aware of Wikipedia imposing a hard limit.)

Second, Wikipedia regularly makes their full content set available for analysis as well, via direct FTP download or BitTorrent. I use this myself: every month I download a dump file with all the articles in English Wikipedia, in order to run programs over them that derive data for my Forward to Libraries service. That’s over 5 million articles I get every month, or over 100 times as many articles per month as Elsevier lets researchers download, if Ross Mounce’s figures above are correct.

In other words, a nonprofit with an annual budget of under $70 million supports full data downloads and still allows its users to “continue to read, search and download articles and not have their service interrupted in any way.” If a company with over $3 billion in annual revenue won’t do the same, it’s not for service-continuity or other technical reasons.

I hate to be the devil’s advocate here, but it seems like Alicia is correct: The API indeed allows full access to subscribed content in a way that doesn’t seem much more restrictive than usual. (Although ‘usual’ is very restrictive, of course.) You can see the registration form here:

There are many reasons why the API is problematic. The main ones at present are:
* I have to agree to Elsevier’s terms and conditions (even to look at it)
* I have to disclose personal details about myself and my research to Elsevier.

So, the purpose of this blog post is to paint Chris H.J. Hartgerink as the victim of Elsevier and therefore an open-access hero. Nicely done, Chris. In reality, it’s just a solipsistic essay that reveals the author’s ignorance about data mining. Fail.

Yes, it’s a real shame that content-mining specialist Chris Hartgerink is so ignorant about data mining compared with anti-OA trolling specialist Jeffrey Beall. If only Chris could have had Jeffrey’s skills and experience, all this would have been so much better. Elsevier would never have cut off Jeffrey’s access! Silly Chris.

In 2016 Elsevier’s not-for-profit Elsevier Foundation committed $ a year, for 3 years, to programmes encouraging diversity in science, technology and medicine and promoting science research in developing countries.

