How To Text Mine Open Access Documents

Note: I've placed all the code and files associated with this post in a repository on GitHub. If you'd like to fork the project and add a script that processes documents through your favourite entity extractor (storing the results in a new directory), I'd be happy to receive pull requests.

Fetching the documents

First of all, find a set of open access documents in a standard XML format. Articles deposited in PubMed Central (PMC) are ideal, as they are converted from publisher-specific DTDs to one of the standard NLM Journal Article DTDs during deposition. PMC also has an OAI interface, which makes it straightforward to find and retrieve articles.
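A single article can be fetched from PMC's OAI-PMH interface with a GetRecord request; the `pmc` metadataPrefix asks for the full-text NLM XML rather than just Dublin Core metadata. A minimal sketch (the identifier below is a placeholder, not a real article):

```python
# Build and issue an OAI-PMH GetRecord request against the PMC endpoint.
from urllib.parse import urlencode

PMC_OAI = "https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi"

def get_record_url(identifier, metadata_prefix="pmc"):
    """Build an OAI-PMH GetRecord request URL for one article."""
    params = {
        "verb": "GetRecord",
        "identifier": identifier,        # e.g. "oai:pubmedcentral.nih.gov:<id>"
        "metadataPrefix": metadata_prefix,
    }
    return PMC_OAI + "?" + urlencode(params)

if __name__ == "__main__":
    from urllib.request import urlopen
    url = get_record_url("oai:pubmedcentral.nih.gov:1234567")  # placeholder ID
    # The response is NLM Journal Article XML wrapped in an OAI-PMH envelope.
    xml = urlopen(url).read()
```

The same pattern works for the other OAI-PMH verbs (ListRecords, ListIdentifiers) by changing the parameters.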

To find the name of a set of articles, use the OAI-PMH "ListSets" verb to fetch all the available sets into a local CSV file. Look through that file for the set you're interested in - in this case I'm using "elsevierwt": Elsevier's "Sponsored Documents", for which a fee has been paid on publication to make the articles open access; the licence allows text mining for non-commercial purposes*.
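A sketch of the ListSets step: fetch the XML response and flatten each set's spec and name into a CSV you can search. The parsing helper is separate so it also works on a saved response (a full client would additionally follow any resumptionToken, which is omitted here for brevity):

```python
# Flatten an OAI-PMH ListSets response into a two-column CSV.
import csv
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def parse_sets(xml_text):
    """Yield (setSpec, setName) pairs from a ListSets response document."""
    root = ET.fromstring(xml_text)
    for s in root.iter(OAI_NS + "set"):
        yield s.findtext(OAI_NS + "setSpec"), s.findtext(OAI_NS + "setName")

def write_sets_csv(xml_text, path):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["setSpec", "setName"])
        writer.writerows(parse_sets(xml_text))

if __name__ == "__main__":
    from urllib.request import urlopen
    url = "https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi?verb=ListSets"
    write_sets_csv(urlopen(url).read(), "sets.csv")
```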

Text mining

Now the articles are ready for text mining. Choose an entity extraction tool or web service and run each article through it. I'm using the EBI's Whatizit here, which provides a SOAP web service that accepts plain text and returns XML. If you're lucky, you'll have a simple HTTP POST web service that understands HTML and returns JSON.
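The luckier HTTP POST case can be sketched roughly as below. The endpoint URL and the response shape (`{"entities": [{"text": ..., "type": ...}]}`) are assumptions for illustration only; Whatizit's actual SOAP interface works differently, so treat this as the general pattern rather than a client for any particular service:

```python
# Sketch: POST document text to a hypothetical entity extraction endpoint
# and group the returned entities by type.
import json
from urllib.request import Request, urlopen

EXTRACTOR_URL = "https://example.org/extract"  # placeholder endpoint

def extract_entities(response):
    """Group entity strings by type from an (assumed) JSON response."""
    grouped = {}
    for entity in response.get("entities", []):
        grouped.setdefault(entity["type"], []).append(entity["text"])
    return grouped

def annotate(text):
    """POST plain text to the extractor and return its entities by type."""
    req = Request(EXTRACTOR_URL, data=text.encode("utf-8"),
                  headers={"Content-Type": "text/plain"})
    with urlopen(req) as resp:
        return extract_entities(json.load(resp))
```

Storing each grouped result as a file in a new directory, keyed by article identifier, keeps the extracted entities alongside the source XML for later analysis.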

* This particular licence is quite vague, full of restrictions, and doesn't mention what you can do with derivative works - such as the results of text mining. You might want to choose a set of articles from PLoS or BioMed Central instead, which are clearly licensed under Creative Commons CC-BY licences.

** Each element retains its attributes, and a "class" attribute is added for styling if you ever want to display this HTML.