Linking database submissions to primary citations with PubMed Central
Heather A. Piwowar and Wendy W. Chapman
Department of Biomedical Informatics, University of Pittsburgh
Background: Dataset submissions are growing exponentially. Links between dataset submissions and the primary literature that describes the data collection are useful for many reasons: rich documentation, proper attribution, improved information retrieval, and enhanced text/data integration for analysis. Unfortunately, many database submissions do not include primary citation links, as database submissions are often made prior to publication. We suggest that automated tools can be developed to help identify links between dataset submissions and the primary literature. These tools require full text to differentiate cases of data sharing from data reuse and other contexts. In this study, we explore the possibility that deep analysis of full text may not be necessary, thereby enabling the querying of all reports in PubMed Central.
Methods: We trained machine learning tree and rule-based classifiers on full-text open-access article unigram vectors, with the existence of a primary citation link from NCBI's Gene Expression Omnibus (GEO) database submission records as the binary output class. We manually combined and simplified the classifier trees and rules to create a query compatible with the interface for PubMed Central.
Results: The query identified 40% of non-OA articles with dataset submission links from GEO (recall), and 65% of the returned articles without dataset submission links were manually judged to include statements of dataset deposit despite having no link from the database (applicable precision).
Conclusion: We hope this work inspires future enhancements and highlights the opportunities for simple full-text queries in PubMed Central, given the mandated influx of NIH-funded research reports.
Introduction
The expected deluge of full-text biomedical research articles into PubMed Central (PMC), as mandated by recent NIH policy [1], creates many opportunities for improving research tools and processes. Most biomedical text mining and natural language processing (NLP) has been limited to titles and abstracts: these are available in abundance in PubMed. Analysis of machine-readable full text would permit a much deeper and wider scope of study, but assembling a corpus has been hindered by the complex, disparate, and decentralized access processes and licenses of publisher websites. While PMC does not permit automated downloading of non-Open Access full text (as per publisher licenses), full text can be queried from the PMC interface. Integrating the ability to query the full text of all future NIH-funded research reports in combination with MeSH terms and other NCBI Entrez database links offers exciting possibilities.

In this report, we explore the potential of one such application: linking articles that describe data collection to their database submission entries. Databases that store research datasets often include citation links to the articles that describe the initial generation and use of the datasets. As we discuss below, these links are valuable, often missing, and time-consuming to manually derive. We previously developed several NLP systems to identify declarations of database submission within research articles [2]; however, these systems required access to complete full text for feature extraction. To take advantage of the PMC resource, here we develop a system restricted to rules that can be expressed within the PMC query interface.

We apply our system to gene expression microarray studies deposited in NCBI's Gene Expression Omnibus (GEO) database [3]. Gene expression data are expensive to collect, often but not always shared, and valuable for reuse. The GEO database is the largest repository for gene expression datasets, is well integrated with PMC query results, and contains links from submitted datasets to primary citation reports.
Methods
Our goal was to develop a PMC query for retrieving articles that mention depositing a dataset into GEO. We developed the query using a selection of Open Access (OA) articles, and evaluated it on non-OA articles.

We used a gold standard based on our previous work [2]. Positive cases came from two sources: all OA articles that were linked from the GEO DataSet primary submission field, plus articles without a primary citation link from the GEO database that were nonetheless judged to have deposited data into GEO. Manual judgment for the selected OA articles was based on reviewing the full-text reports. Negative cases were those articles that were not linked from GEO DataSets and were manually classified as lacking any indication within their full text that they had deposited a dataset into GEO.

We used NCBI's Entrez E-Utilities, PubMed Central, Python, TagHelper Tools [4], and Weka [5] to remove rare words (<40 occurrences) and stopwords, create unigram bag-of-words vectors, automatically select features, and build tree (J48) and rule (PART) machine learning classifiers for a variety of parameter values. We manually derived a PMC-compatible query based on the most robust feature selection and classifier results.

Recall was calculated as the percentage of non-OA PMC articles with links to Gene Expression Datasets that were found by the query (non-OA because the OA articles were used in training). We evaluated applicable precision by manually reviewing the non-OA query hits for articles that are not currently linked to GEO datasets and determining whether they indeed included statements of dataset submission to GEO. Finally, we compared the current count of NIH-funded, GEO-linked articles in PubMed to those currently within PMC to project the possible impact our query might have once all NIH-funded articles are deposited in PMC.
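The preprocessing described above (dropping stopwords and rare words, then building unigram vectors) can be illustrated with a minimal Python sketch. This is not the authors' pipeline, which relied on TagHelper Tools and Weka; the stopword list, the function names, the choice of binary rather than count vectors, the toy sentences, and the placeholder accession GSE0000 are all assumptions made for illustration.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stopword list; the paper does not give the one it used.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "was", "were", "we"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric unigrams."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocabulary(documents, min_count=40):
    """Keep unigrams that are not stopwords and occur at least min_count times overall."""
    totals = Counter()
    for doc in documents:
        totals.update(tokenize(doc))
    return sorted(w for w, n in totals.items()
                  if n >= min_count and w not in STOPWORDS)

def to_binary_vector(text, vocabulary):
    """Binary bag-of-words vector: 1 if the unigram occurs in the text, else 0."""
    present = set(tokenize(text))
    return [1 if w in present else 0 for w in vocabulary]

# Toy corpus (invented sentences); min_count is lowered from 40 so the example is non-empty.
docs = [
    "Microarray data were deposited in GEO under accession GSE0000.",
    "We downloaded published microarray data from public databases.",
    "Data have been deposited in GEO.",
]
vocab = build_vocabulary(docs, min_count=2)
vectors = [to_binary_vector(d, vocab) for d in docs]
print(vocab)    # ['data', 'deposited', 'geo', 'microarray']
print(vectors)  # [[1, 1, 1, 1], [1, 0, 0, 1], [1, 1, 1, 0]]
```

Vectors of this kind, paired with the binary class label (linked from GEO or not), are what a tree or rule learner would be trained on.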
Results
The training set was composed of open-access articles, including 550 positive examples (articles that had links from the GEO primary citation fields or were manually determined to have shared data in GEO) and 165 negatives (articles without links from GEO). We combined the rules and tree branches that occurred most frequently across the trained machine learning classifiers to compose the following PubMed Central query:

(geo OR omnibus)
AND microarray
AND "gene expression"
AND accession
NOT (databases
OR user OR users
OR (public AND accessed)
OR (downloaded AND published))
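For readers who want to run the query programmatically rather than through the PMC web page, the sketch below builds a request URL for the NCBI E-utilities `esearch` endpoint with `db=pmc`. It only constructs the URL and makes no network call. The helper name `esearch_url` is hypothetical, and it is an assumption that the E-utilities interface interprets the query the same way as the PMC web interface used in the study.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Public E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The query from the text, written on one line.
PMC_QUERY = (
    '(geo OR omnibus) AND microarray AND "gene expression" AND accession '
    "NOT (databases OR user OR users OR (public AND accessed) "
    "OR (downloaded AND published))"
)

def esearch_url(term, db="pmc", retmax=20):
    """Return an esearch URL for the given term (hypothetical helper)."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)

# Full query, and the same query restricted to non-open-access articles.
url = esearch_url(PMC_QUERY)
non_oa_url = esearch_url(f'({PMC_QUERY}) NOT "open access"[filter]')

# Round-trip check: the encoded term decodes back to the original query.
decoded = parse_qs(urlparse(url).query)
print(decoded["db"][0])                  # pmc
print(decoded["term"][0] == PMC_QUERY)   # True
```

Fetching either URL returns matching PMC identifiers; counts obtained today will differ from those reported here because PMC has grown.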
The query retrieved 772 articles, of which 455 were not open access. The results included 385 of the 966 PubMed Central non-OA articles with links to the GEO Datasets ("pmc gds"[filter] NOT "open access"[filter]), for a recall of 40%.

Next, we limited the query to non-OA articles without a PMC link to the GEO Datasets. We manually determined that 44 of the 68 results included a statement of dataset submission to GEO within their full-text report. This indicates an overall query precision of 94% ((385+44)/455) for retrieving articles that have deposited datasets into GEO and an applicable precision of 65% (44/68) for retrieving articles that don't have PMC links but should. Our error analysis of the 24 false positives found that 13 of the articles referenced GEO datasets in the context of dataset reuse rather than submission (including 2 reusing their own work), 4 referenced GEO in the context of platform descriptions rather than datasets, and 5 didn't reference the GEO database at all but rather mentioned the word "geo" for another purpose, usually the beta-geo gene.

The PubMed database contains 4291 articles with links from GEO DataSets (in PubMed: "pubmed gds"[filter]). Thus, the addition of an estimated 115 (177*65%) novel true positive links would increase the current number of dataset-submission-to-primary-citation links by about 2.6%.

We also estimated how the query impact might increase once new NIH-funded articles are deposited in PMC. PMC contains 202 articles published in 2007, funded by the NIH, and linked from GEO DataSets. In comparison, the PubMed database contains 596 such articles, almost three times as many. Our query returned 39 hits for NIH-funded articles published in 2007 that were not linked from GEO DataSets. If all NIH-funded articles were in PMC, and if similar patterns exist for microarray papers that share their data but do not currently have links from the GEO database, we estimate our query could return roughly 117 (39*3) new articles per year identifying data sharing not included in primary citations, of which 76 (117*65%) might be true positives. This would increase the annual count of primary citation links by about 5.5% (1310 NIH and non-NIH PubMed articles with GEO Database links in 2007 + 76 projected additions).

The trivial query, "gene expression omnibus" AND (submitted OR deposited), resulted in a 34% recall and 90% overall precision.
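The arithmetic behind the figures in this section can be re-derived with a short Python sketch. All counts are taken from the text as reported. Variable names are illustrative, and it is an assumption that both percentage increases are computed against the enlarged total (existing links plus additions), a reading that reproduces the reported 2.6% and 5.5%.

```python
# Counts as reported in the Results.
recall = 385 / 966                     # linked non-OA articles retrieved / all linked non-OA articles
overall_precision = (385 + 44) / 455   # linked hits plus confirmed unlinked hits / all non-OA hits
applicable_precision = 44 / 68         # confirmed deposits among unlinked non-OA hits

# Current impact: 177 hits (figure taken as given) at 65% applicable precision.
new_links = round(177 * 0.65)                      # 115
current_increase = new_links / (4291 + new_links)  # assumed denominator: enlarged total

# Projected impact for NIH-funded 2007 articles.
pubmed_to_pmc = 596 / 202                          # about 2.95, rounded to 3 in the text
projected_hits = 39 * 3                            # 117
projected_true = round(projected_hits * 0.65)      # 76
annual_increase = projected_true / (1310 + projected_true)

print(f"recall {recall:.0%}, overall precision {overall_precision:.0%}, "
      f"applicable precision {applicable_precision:.0%}")
print(f"current increase {current_increase:.1%}, annual increase {annual_increase:.1%}")
```

Run as written, this prints 40%, 94%, and 65% on the first line, and 2.6% and 5.5% on the second.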
Discussion
Database submissions often include a link to the research article that describes the original data collection conditions and interpretations. Our results suggest a simple query on full text can automatically identify database submission primary citation links with a precision of 94% and recall of 40%. A trivial full-text query identified articles with 90% precision and 34% recall. Precision for the subset of articles without existing links from the GEO database was 65%. The methods we describe can be used to develop queries for identifying primary citations across a wide variety of datatypes and databases.

The approach outlined in this study is much more practical than a complex regular-expression classifier running on article full text. Processing full-text articles requires not only access licenses and reuse permissions (or a limitation to open access content) but also the maintenance of a text repository and classification system. Querying full text through PubMed Central, in contrast, is publicly available, requires no infrastructure beyond an internet connection, and covers all OA and non-OA articles within PMC.

We imagine this query could be used in two ways. It could be used by dataset-seekers, by appending it onto PubMed or PMC queries to find articles with shared datasets. Alternatively, it could be used by biocurators as a tool for identifying primary citations that may be missing from their database submission fields. This latter use has broad implications, which we discuss further below.

Links between shared datasets and primary citations have many purposes. First, the citation serves as rich documentation for the dataset, whether as free text or as meta-data mark-up as illustrated by the BioLit PDB Clone (http://biolit.ucsd.edu/pdb/). Second, the citation provides a crucial mechanism for attributing recognition to the originators of the dataset upon reuse [6]. Third, the citation provides a link for enhanced information retrieval or text/data integration pathways [7,8].

Unfortunately, links to primary citations are often missing from database submission entries because datasets are usually submitted before publication details are known [9]. Evidence suggests that a significant number of links from database submissions to the primary literature may be missing. For example, the PDB data uniformity project of 2000 found that 33% of submission entries lacked a citation. Half of these were recovered automatically using the list of submitter names, 40% through manual searches of PubMed and the Thomson ISI databases, and 10% (3% of total) were presumed to represent work that was never published [10]. More recently, another large-scale PDB remediation project looked at improving the quality of many fields, including primary citations. As of May 2005, 8508 (27%) of the 31663 database submissions required remediation due to inconsistent or missing PubMed IDs and citation information. A report near the end of the remediation process [11] estimated that manual searches found PubMed IDs for 1226 entries, citation information without PubMed IDs for 387 entries, and about 700 (2% of 31663) were presumed unpublished. These examples suggest that a sizeable number of entries may be missing citation fields, and that most of them are recoverable. Unfortunately, these efforts are time-intensive and thus difficult to incorporate into the workflow of busy biocurators [12]. NLP is already being used to aid database curation in a variety of tasks [13], and we believe it can also help biocurators identify missing links to primary citations.

Procedurally, our query results could be manually confirmed and then used to update database records. GEO asks for omitted citations (http://www.ncbi.nlm.nih.gov/geo/info/ucitations.html); we have sent them our findings and they have updated their database to include the missing links identified in this study. Other databases, however, consider the submission record the property of the submitter [14] and are thus unlikely to add citations without permission. Perhaps in these situations an automated system could be developed to email submitters requesting they add or permit the addition of the citation.

The performance of our query could undoubtedly be improved through systematic refinement [15]. Future work could involve deriving additional cues through bootstrapping and semi-supervised learning, including stemmed words with wildcards, and refining the query based on error analysis (for example, excluding hits on beta-geo). Additional improvements could be achieved if the PMC query capabilities were enhanced. For this application, it would be particularly useful to remove negation and modal verbs from the stop word list (http://www.ncbi.nlm.nih.gov/books/bv.fcgi?highlight=stopwords&rid=helppubmed.table.pubmedhelp.T43) so they could be included within query n-grams.

Linking shared datasets to primary citations increases their value: the datasets become easier to find, easier to understand, easier to responsibly acknowledge, and easier to integrate with other information. Dataset deposits are growing exponentially. As NIH-funded research makes its way to PMC, the opportunity for creating links between datasets and full-text articles increases enormously. We hope this study provides a useful preliminary tool and inspires further research in this area. Our manual annotation results are available at http://www.dbmi.pitt.edu/piwowar.
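As an illustration of the error-analysis refinement suggested in the Discussion (excluding hits on beta-geo), the following Python sketch approximates the Boolean query locally on a single article's text and then adds a post-filter. It is a hypothetical stand-in and not how PMC evaluates queries: the tokenization, the function names, and both example sentences (including the placeholder accession GSE0000) are invented for this example.

```python
import re

def tokens(text):
    """Lowercased alphanumeric tokens; 'beta-geo' splits into 'beta' and 'geo'."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches_query(text):
    """Local approximation of the Boolean query from the Results."""
    t = tokens(text)
    positive = (
        ("geo" in t or "omnibus" in t)
        and "microarray" in t
        and "gene expression" in text.lower()
        and "accession" in t
    )
    negative = (
        "databases" in t or "user" in t or "users" in t
        or ("public" in t and "accessed" in t)
        or ("downloaded" in t and "published" in t)
    )
    return positive and not negative

def matches_refined(text):
    """Hypothetical refinement: ignore 'geo' that occurs only inside 'beta-geo'."""
    without_beta_geo = text.lower().replace("beta-geo", " ")
    t = tokens(without_beta_geo)
    return matches_query(text) and ("geo" in t or "omnibus" in t)

deposit = ("Gene expression microarray data were deposited in GEO "
           "under accession GSE0000.")
beta_geo = ("The beta-geo reporter was analyzed by gene expression microarray; "
            "GenBank accession numbers are listed.")

print(matches_query(deposit), matches_refined(deposit))    # True True
print(matches_query(beta_geo), matches_refined(beta_geo))  # True False
```

The second sentence shows the kind of false positive described in the error analysis: it satisfies the original query only because "geo" appears inside "beta-geo", and the post-filter removes it.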
Funding
National Library of Medicine (5T15-LM007059-19 to HAP, 1R01-LM009427-01 to WWC)
References
1. NOT-OD-08-033 Revised Policy on Enhancing Public Access to Archived Publications Resulting from NIH-Funded Research.
2. Piwowar, H.A. & Chapman, W.W. Identifying Data Sharing in Biomedical Literature. Available from Nature Precedings <http://hdl.handle.net/10101/npre.2008.1721.1> (2008).
3. Barrett, T., et al. NCBI GEO: mining tens of millions of expression profiles--database and tools update. Nucleic Acids Res 35 (2007).
4. Rose, C.P., et al. Analyzing Collaborative Learning Processes Automatically: Exploiting the Advances of Computational Linguistics in Computer-Supported Collaborative Learning, International Journal of Computer Supported Collaborative Learning (In Press).
5. Witten, I.H. & Frank, E. Data Mining: Practical machine learning tools and techniques, 2nd Edition, Morgan Kaufmann, San Francisco (2005).
6. Compete, collaborate, compel. Nat Genet 39 (2007).
7. Butte, A.J. & Chen, R. Finding disease-related genomic experiments within an international repository: first steps in translational bioinformatics. AMIA Annu Symp Proc, 106-110 (2006).
8. Muller, H.M., Kenny, E.E. & Sternberg, P.W. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol 2 (2004).
9. Piwowar, H.A. & Chapman, W.W. A review of journal policies for sharing research data. Available from Nature Precedings <http://hdl.handle.net/10101/npre.2008.1700.1> (2008).
10. Bhat, T.N., et al. The PDB data uniformity project. Nucleic Acids Res 29, 214-218 (2001).
11. PDBj News Letter. in Volume 7, March 2006 <http://www.pdbj.org/NewsLetter/newsletter_vol7_e.pdf> (2006).
12. Burkhardt, K., Schneider, B. & Ory, J. A biocurator perspective: annotation at the Research Collaboratory for Structural Bioinformatics Protein Data Bank. PLoS Computational Biology 2 (2006).
13. Karamanis, N., et al. Natural Language Processing in aid of FlyBase curators. BMC Bioinformatics 9 (2008).
14. Pennisi, E. DNA DATA: Proposal to 'Wikify' GenBank Meets Stiff Resistance. Science 319, 1598-1599 (2008).
15. Zhang, L., Ajiferuke, I. & Sampson, M. Optimizing search strategies to identify randomized controlled trials in MEDLINE. BMC Medical Research Methodology 6, 23 (2006).
