The scientific literature is growing at a rate that makes it impossible for a human to keep up, with approximately 1.8 million papers published per year, or about 205 every hour. Distributing literature reviews across many people is one solution, but we want to avoid researchers spending valuable time on routine, error-prone operations by automating the processing of digital information.

A common scenario is filtering the thousands of papers required for systematic reviews, the gold standard of evidence-based medicine, which can take days to weeks:

“there were 10,000 abstracts [which] we split this between 6 researchers. It took about 2-3 days of work ...to get through ~1,600 papers each. So, at a minimum this equates to 12 days of full-time work”

Epidemiology Researcher at the University of Cambridge investigating obesity.

90% of these papers were ultimately rejected. Our preliminary discussions suggest that machines can reduce this workload by 50-80%. This frees researchers up to spend their time in more valuable ways, cutting project costs and increasing productivity.

Automatic mining requires tools which can discover, scrape/download, normalize and search the literature on an automatic or minimal-effort basis. It also requires legal freedom to deploy the tools and make copies for the purposes of mining, both under copyright law and within publisher contract clauses. We believe that ‘the right to read is the right to mine’: if a researcher has lawful access to the literature, it should be lawful to analyse it using modern methods, rather than being tied to manual transcription or barred from using the most up-to-date and efficient techniques.

In the UK a copyright exception for non-commercial mining became law in 2014, but we have not seen a large upswing in activity, despite finding many cases where significant resources are being used for work that machines can help with. We believe that there will be a lag in exercising the new rights to content mine, for several reasons. Researchers we talked to often didn’t know how to move away from manual approaches, or couldn’t make the investment to do so, and accessible tools and training options are lacking in the areas that might benefit most, for example the biological and biomedical sciences.

Researchers are often unaware of their legal rights, particularly given the uncertainties inherent in a fuzzily defined ‘noncommercial’ exception. This is not helped by several mainstream publishers, who did not welcome copyright reform and have taken steps to make unlicensed legal mining harder or to oppose copyright changes in other jurisdictions, including:

Widespread promotion of licences as the best method available for allowing researchers to mine.

Promoting publisher interfaces (APIs) as the only method of obtaining content. These usually allow access to only a subset of information (typically not the PDF) and require signing a licence which can place restrictions on downstream use of the data. Elsevier’s API requires that results be published under a noncommercial licence rather than as open data.

Technical obstacles such as very conservative download limits and time-consuming barriers. Wiley have a limit of 100 downloads per day (so a systematic review of 10,000 papers could take more than 3 months) and also present a CAPTCHA every 25 downloads.
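The cost of such limits can be worked through with simple arithmetic. The sketch below uses only the figures quoted above (10,000 papers, 100 downloads per day, one CAPTCHA per 25 downloads); the function names are our own and purely illustrative.

```python
import math


def days_to_download(n_papers: int, daily_limit: int) -> int:
    """Whole days needed to fetch n_papers under a per-day download cap."""
    return math.ceil(n_papers / daily_limit)


def captchas_required(n_papers: int, downloads_per_captcha: int) -> int:
    """CAPTCHAs encountered if one appears after every full batch of downloads."""
    return n_papers // downloads_per_captcha


days = days_to_download(10_000, 100)
captchas = captchas_required(10_000, 25)
print(days, captchas)  # prints: 100 400
```

On these figures a single review needs 100 days of downloading, a little over three months, and around 400 CAPTCHAs solved by hand.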

These pressures make it hard for any except the most knowledgeable to undertake mining with confidence and be clear on their legal rights for undertaking and publishing analyses. This also poses an objective problem in collecting evidence for the benefits of the exception, as many people are continuing with practices that were a legal grey area before 2014, such as interrogating thousands of papers on their hard disk, but were and remain unaware of the legal status.

Low-level mining is widespread, but inefficient and unpublicised. We have compiled a selection of short case studies to indicate different types of mining ContentMine is involved with. In almost all cases, development and full-scale deployment of these tools is only made possible by our privileged position in the UK in terms of benefitting from a noncommercial copyright exception. However, much further benefit could be derived from an unambiguous full exception including commercially-linked research.

Saving time for medical researchers reviewing clinical trials

Systematic reviews are considered the gold standard for evidence-based medicine, representing the best of our current knowledge about how effective treatments and interventions really are. Cochrane Collaboration have developed a set of around 15 CONSORT criteria to ensure only high quality trials are fed into the analysis, examining facts like how many patients were involved and how the trial was randomised.

Content mining with CONSORT-based filters will significantly lower the number of papers needing to be scanned by humans. This frees medical researchers up to spend their time on analysis and interpretation of the data, meaning that our best evidence is published and clinically implemented sooner. ContentMine are running a workshop for researchers at the Cochrane UK Symposium in March 2016.
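As a rough illustration of what a criteria-based filter might look like, the sketch below flags abstracts that mention randomisation and a participant count. The two patterns are hypothetical, simplified stand-ins chosen for this example; they are not the actual criteria, nor the filters ContentMine uses, and a real filter would need validating against human screening decisions.

```python
import re

# Hypothetical, simplified stand-ins for trial-reporting criteria (illustrative only).
CRITERIA = {
    "randomisation": re.compile(r"\brandomi[sz](?:ed|ation)\b", re.IGNORECASE),
    "sample_size": re.compile(r"\b\d+\s+(?:patients|participants)\b", re.IGNORECASE),
}


def criteria_met(abstract: str) -> set[str]:
    """Return the names of the criteria whose pattern appears in the abstract."""
    return {name for name, pattern in CRITERIA.items() if pattern.search(abstract)}


def needs_human_review(abstract: str, minimum: int = 2) -> bool:
    """Pass an abstract on to a human only if it meets at least `minimum` criteria."""
    return len(criteria_met(abstract)) >= minimum


abstracts = [
    "We randomised 120 patients to treatment or placebo.",
    "A narrative overview of obesity policy.",
]
shortlist = [a for a in abstracts if needs_human_review(a)]
print(len(shortlist), "of", len(abstracts), "abstracts kept for human screening")
```

The design choice here is to let the machine discard only the clear non-matches, leaving every borderline paper for a researcher to judge.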

Better assessments of the policy impact of research

Researchers, funders, universities and other publicly-funded organisations wish to know what impact they are having in terms of both academic output and mentions in policy documents. A simple method is to scan the recent literature and releases of electronic documents for mentions of the organisation, which is relatively straightforward for a machine.

Cochrane UK have a staff member who spends around one hour per day manually scanning health policy and other guideline documents for such mentions, to measure the institutional impact of Cochrane systematic reviews. We believe that this is a very common requirement and that, using the copyright exception, it could be automated, so ContentMine are now working with Cochrane to prototype the necessary software. This enables people to focus on interpreting the context of the mention and saves taxpayer money, while potentially delivering more comprehensive results that help organisations to assess and maximise their impact on public policy.
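A minimal sketch of such a scan is shown below, assuming the documents have already been converted to plain text. The function name, the context window size and the sample sentence are our own assumptions for illustration; this is not the software being prototyped with Cochrane.

```python
import re


def find_mentions(text: str, names: list[str], context: int = 40) -> list[tuple[str, str]]:
    """Return (matched name, surrounding snippet) for each mention of any name."""
    if not names:
        return []
    # Try longer names first so "Cochrane Collaboration" wins over "Cochrane".
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for match in pattern.finditer(text):
        start = max(match.start() - context, 0)
        end = min(match.end() + context, len(text))
        hits.append((match.group(0), text[start:end]))
    return hits


document = "The guideline draws on a Cochrane review of statin use in adults."
for name, snippet in find_mentions(document, ["Cochrane"]):
    print(name, "->", snippet)
```

Returning the surrounding snippet alongside each hit keeps the human in the loop for the part that matters, judging the context of the mention.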

Speeding up systematic reviews of animal testing

The National Centre for the Replacement, Refinement and Reduction of Animals in Research (NC3Rs) conducts systematic reviews profiling existing use of animals in biomedical research and assesses where experimental design could be adjusted to get better quality information using fewer animals.

Professor Malcolm Macleod and colleagues at the University of Edinburgh have developed the ARRIVE guidelines to assess which animal experiments should be included in their meta-analyses. This is time-consuming and laborious work due to the sheer volume of publications. Dr Gillian Currie is a senior postdoctoral scientist tasked with evaluating 30,000 papers in a year, which equates to approximately one paper every three minutes of the working day. As many as 90% of papers are unhelpful for the analysis and are rejected.

Content mining can act as an initial filter for enormous literature searches such as this, and preliminary work by ContentMine in collaboration with the NC3Rs researchers suggests that Gillian's workload could be reduced by 80%, allowing her to focus on the critical task of analysing and interpreting the data. Malcolm Macleod was a key organiser of a Wellcome Trust workshop on data mining in neuroscience where six leading UK neuroscience groups expressed the need for better tools and support for content mining, indicating that this is not an isolated need.

Accessing timely, dynamic information on endangered species

The IUCN Red List updates information about endangered species on a five-year cycle, incorporating new data from the peer-reviewed and grey literature. Automated content mining would allow information to be extracted on an ongoing basis, including capturing papers about different species on the Red List every day.

We have developed a prototype daily feed for IUCN Red List species that could be tapped into by conservationists, citizens and people like Wikipedia editors who can share new information openly with the world, including those in circumstances with limited access to the formal literature. We are also organising a conference centred on literature reviews to share best practice between diverse communities with similar needs and provide them with simple workflows for automation.

Liberating research data locked in figures

Dr Emily Sena also works with the NC3Rs and extracts end-points from graphical data to produce combined datasets for the meta-analysis. Publication of open research data with journal articles is increasingly promoted, but we are left with a legacy of many datasets that are locked in figures in publications. The raw data is often unobtainable from labs due to staff turnover, unwillingness to share or poor data management.

Emily currently measures graphs with a ruler or via manual markup in image editing software. This takes many hours and can fail to distinguish overlapping lines and error bars. In many cases ContentMine software can extract this data in seconds and distinguish lines in vector images that are invisible to the human eye. Again, content mining saves both time and money, which could both be better spent performing further research. Time to publication is also reduced, so recommendations can potentially be implemented sooner, helping to achieve the NC3Rs goals.
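The core step in recovering numbers from a plot, whether done with a ruler or by software, is mapping positions in the image back to data values using the axis ticks. The sketch below shows that mapping for linear axes only; the class and function names and the pixel positions are hypothetical, and it does not represent how ContentMine software is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisCalibration:
    """Two reference ticks on a linear axis: pixel position and labelled value."""
    pixel_a: float
    value_a: float
    pixel_b: float
    value_b: float

    def to_value(self, pixel: float) -> float:
        """Linearly interpolate (or extrapolate) a pixel position to a data value."""
        span_pixels = self.pixel_b - self.pixel_a
        span_values = self.value_b - self.value_a
        return self.value_a + (pixel - self.pixel_a) * span_values / span_pixels


def pixel_to_data(point: tuple[float, float],
                  x_axis: AxisCalibration,
                  y_axis: AxisCalibration) -> tuple[float, float]:
    """Convert an (x, y) pixel coordinate into data coordinates."""
    px, py = point
    return x_axis.to_value(px), y_axis.to_value(py)


# Assumed example: x ticks 0 and 10 sit at pixels 50 and 450;
# y ticks 0 and 60 sit at pixels 400 and 100 (image y runs downwards).
x_axis = AxisCalibration(pixel_a=50, value_a=0, pixel_b=450, value_b=10)
y_axis = AxisCalibration(pixel_a=400, value_a=0, pixel_b=100, value_b=60)
print(pixel_to_data((250, 250), x_axis, y_axis))
```

With this calibration the pixel (250, 250) corresponds to roughly x = 5, y = 30. Logarithmic axes would need the same interpolation applied to the logarithm of the tick values.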

Some of the image analysis capability was developed through a BBSRC grant to build tools for analysing evolutionary trees with the University of Bath. Dr Ross Mounce extracted bacterial phylogenetic trees from 4,500 papers as images (.png format), which cannot be reanalysed or merged with other datasets despite the computationally expensive calculations that generated the images in the first place. These images were transformed into machine-readable data and used to synthesise a giant “supertree”. Some of this work took advantage of the UK copyright exception for mining and the results were published online as open notebook science.

This issue is not confined to biology. At the University of Durham, the High Energy Physics Data Centre faces a similar problem and collaborates with CERN to extract data from past papers. A skilled researcher is again extracting data manually and mining promises to speed up the process considerably.

Aiding knowledge discovery in agriculture

Plant sciences and agronomy are vital disciplines that address global issues such as future food security. Significant investment has therefore been made to increase the amount of plant data available to scientists, including through increased genomic sequencing and high-throughput imaging techniques both in the lab and the field. However, both disciplines are highly diverse and multidisciplinary with many different outlets for research outputs, making it virtually impossible for one person to retain a full knowledge of the current science.

The Collaborative Open Plant Omics (COPO) project is trying to tie together some of this data by semantically linking databases and promoting best practices in data management and publication. Participants at a COPO meeting held by The Genome Analysis Centre (TGAC) in Norwich expressed the value they felt would be derived from a highly customisable alerting service using full-text articles and based on a range of relevant disciplines.

CGIAR, which supports 15 global institutes carrying out agricultural research and policy development, was particularly interested in having such an alert function based around their collection of plant traits. If the collection were semantified and used for indexing the full-text literature, this would enable far more powerful searching and potentially allow previously unlinked information to be discovered, illustrating one of the key benefits of large-scale automated mining compared to manual eyeballing of data and papers. The ContentMine is currently discussing implementing this work with the CGIAR Consortium Office.
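To give a feel for how indexing the literature against a trait collection can surface links, the sketch below builds a simple inverted index from trait terms to articles and then asks which articles mention two traits together. Plain substring matching stands in for proper semantic annotation, and the trait names, article identifiers and texts are invented for illustration.

```python
from collections import defaultdict


def build_trait_index(documents: dict[str, str], traits: list[str]) -> dict[str, set[str]]:
    """Map each trait term to the set of document ids whose text mentions it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        lowered = text.lower()
        for trait in traits:
            # Naive case-insensitive substring match; an ontology-aware
            # matcher would also handle synonyms and word boundaries.
            if trait.lower() in lowered:
                index[trait].add(doc_id)
    return dict(index)


def documents_linking(index: dict[str, set[str]], trait_a: str, trait_b: str) -> set[str]:
    """Documents that mention both traits, a crude signal of a possible link."""
    return index.get(trait_a, set()) & index.get(trait_b, set())


# Invented example articles and trait terms.
articles = {
    "paper-1": "Drought tolerance in wheat is associated with root depth.",
    "paper-2": "Grain yield under irrigation was unaffected by root depth.",
    "paper-3": "Leaf area measurements in maize seedlings.",
}
traits = ["drought tolerance", "root depth", "grain yield"]

index = build_trait_index(articles, traits)
print(sorted(documents_linking(index, "drought tolerance", "root depth")))
```

An alerting service could run the same lookup over each day's new articles and notify users whose chosen traits appear.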

To the extent possible under law, the authors Peter Murray-Rust and Jennifer Molloy have waived all copyright and related or neighboring rights in this text.

Images are licensed under CC-BY 4.0 and available from the Noun Project