April 7, 2008

Non-OA Full-text for text mining

Interesting discussion on Peter Murray-Rust’s blog about whether PubMed Central articles can be crawled and used for text mining. The answer is no, not now, not unless they are open access (as opposed to traditionally closed-access articles that happen to be deposited in PMC). Really unfortunate. Incremental progress, we’ll get there.

Anticipating my thesis work, I’ve been wondering about similar text mining questions. I think my needs are a bit different from PMR’s: I’m interested in papers that meet a targeted search, rather than all articles, or all articles in a relevant journal (which is what I gather he’s interested in?). I’m willing to limit myself to the articles that I have access to through my University’s subscriptions. I don’t need figures. I think once I have the papers I’m allowed to text mine them as fair use, since I have them under permission. So the question is what can I automatically download?

I learned I can’t spider PMC, but what about regular PubMed? Try as I might, I couldn’t find verbiage on the PubMed website allowing or disallowing spidering through to full-text links on publisher websites (the links that are populated and visible when I’m logged in through the University’s connection). Is this allowed? It still seems like it might not be. And then you end up at the publisher sites anyway, with all of their differing rules. Unfortunately, the publishers’ rules are often hard to find, confusing, and vague (as often noted by PMR and others). Aaaaah.

So last month I asked our librarians….

As you know, PMC has OA and non-OA full-text. They make their OA text available via FTP etc, and they stipulate that those mechanisms are the only way that people are allowed to access the full text “because of copyright restrictions” [http://www.pubmedcentral.nih.gov/about/copyright.html]. I’d also like to access non-OA text for which Pitt has subscriptions, but it sounds like I can’t do this by “crawling” PMC based on their rules [explicitly stated in the link above]. I guess I’m wondering if I can do it by “crawling” the normal, full PubMed. Basically write a script to find the “HSLS” links on the article citation pages, follow them (usually into the publisher’s websites), and automatically save the html or pdf articles that are returned from a PubMed query.

There is no difference in the end result from me manually clicking through and saving the papers… but there is sure a difference in the manual time requirement! I wouldn’t have thought this sort of automated downloading would be a problem… but the Restrictions on Systematic Downloading of articles in the PMC copyright notice referenced above makes me want to double-check. I can’t find any reference to “crawling” or “systematic downloads” for PubMed itself.

I do understand there are user requirements when using the Entrez programming utilities (run automated queries during off hours, 3 seconds between queries, etc.), and I would be sure to honor those with both the elements of my scripts that use the E-utilities and those that crawl the web pages directly.

Does that make sense? Are you aware of any restrictions on crawling PubMed to automatically access and save content to which I do indeed have access through Pitt? I guess since I’m going into the publishers’ websites, they might also have restrictions? Is there another way to consolidate a large set of electronic full-text articles (ideally a few thousand)?

Thanks very much for any pointers you may have.
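For what it’s worth, the rate-limit side of this is easy to keep polite in code. Here’s a minimal sketch (Python) of building an ESearch query URL and enforcing a minimum spacing between requests; the endpoint and parameters are from the NCBI E-utilities documentation, and the 3-second interval is the guideline mentioned above:

```python
import time
from urllib.parse import urlencode

# ESearch endpoint from the NCBI E-utilities documentation.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
MIN_INTERVAL = 3.0  # seconds between requests, per the usage guidelines

def esearch_url(query, retmax=100, retstart=0):
    """Build (but do not fetch) an ESearch URL for a PubMed query."""
    params = {"db": "pubmed", "term": query,
              "retmax": retmax, "retstart": retstart}
    return EUTILS_ESEARCH + "?" + urlencode(params)

class Throttle:
    """Enforce a minimum delay between successive requests."""
    def __init__(self, interval=MIN_INTERVAL):
        self.interval = interval
        self._last = 0.0

    def wait(self):
        """Sleep just long enough that calls are at least `interval` apart."""
        elapsed = time.monotonic() - self._last
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self._last = time.monotonic()
```

Every fetch in a script like this, whether an E-utilities call or a publisher page, would go through `Throttle.wait()` first.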

The librarian responded that automatically following PubMed links should be fine, and that there shouldn’t be problems from publisher sites because we have subscriptions and my text mining falls under fair use. I’ll add that I think it helps that I’m not aiming to download entire journal issues, because I do know that some publisher websites disallow that.
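To make the comparison with manual clicking concrete, the scraping half of the script might look something like this rough sketch. It assumes the institutional link-out can be spotted by its anchor text; the “HSLS” label is Pitt’s, but the markup pattern here is a guess, and any real script would need adjusting to the actual page HTML:

```python
import os
import re

# Hypothetical pattern: an anchor whose link text mentions "HSLS".
# Real citation-page markup would need to be inspected to get this right.
LINK_RE = re.compile(
    r'<a[^>]+href="([^"]+)"[^>]*>[^<]*HSLS[^<]*</a>', re.IGNORECASE
)

def find_linkout_urls(html):
    """Return institutional link-out URLs found in a citation page's HTML."""
    return LINK_RE.findall(html)

def save_fulltext(url, content, outdir="papers"):
    """Save fetched bytes under a filename derived from the URL."""
    os.makedirs(outdir, exist_ok=True)
    # Flatten the URL (minus scheme) into a safe filename.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", url.split("//", 1)[-1])
    path = os.path.join(outdir, name)
    with open(path, "wb") as f:
        f.write(content)
    return path
```

The actual fetching (and following redirects through the proxy into the publisher sites) is the part that varies per publisher, which is exactly the mess discussed above.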

Maybe I shouldn’t be bringing it up again here, since it feels like I’ve been given an institutional “All Clear.” But no sense burying my head in the sand in case there really are issues: I want to know. Web downloading policies and full-text reuse policies are so complicated. I’ve spent time looking into them, but it sure seems like unless it is your full-time job it is impossible to understand and keep on top of how it all works. I don’t think our librarians deal with these issues every day. Who else would I go to for clarification?

Does anyone have differing interpretations, warnings, reassurances, alternatives, and general paths through this crazy mess? How do other people do this???


7 Comments

Many thanks Heather. I can only reiterate that it’s a mess. Some publishers have not thought about this – and I give them the benefit of the doubt. Others assiduously kill any site violating their ideas of fair use.

Unfortunately it isn’t just copyright – there can also be contractual issues between publishers and libraries, which are anything but public. These contracts can, and I suspect do, forbid mass downloading.

Heather, thanks for the info; unfortunately it doesn’t clear up the issue about OA. I was wondering because I have asked this question elsewhere, and our librarians recently informed me that the NLM has made some budget cuts, primarily at NCBI. They said services such as the NCBI Field Guides were removed, along with much of the Service Desk staff. I wonder if the removal of some support staff has made this even messier?