Research Plan

Overview

Query GEO for all GSE and GDS accession numbers for datasets submitted in 2007

Query PubMed Central for these accession numbers in the full text of PMC papers published between 1900 and 2009

Enumerate the PMC papers that reused GEO data

Estimate what percent of these papers depended on the GEO data for their scientific contribution

Query details

accession number formats:

look at both GSE and GDS accession numbers

use both the raw ID number like 200007572 and the stripped version without the 200... prefix. For example, search for both 200007572 and 7572

search for the accession number both right beside the prefix and with one space in between, so "GSE7572" and "GSE 7572"
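A minimal sketch of how the search variants above could be generated for one accession number. The function name accession_variants is hypothetical, and the rule that the raw ID equals 200000000 plus the stripped number is an assumption inferred only from the 200007572 / 7572 example.

```python
def accession_variants(prefix, number):
    """Return the search strings to try for one accession number.

    prefix: "GSE" or "GDS"; number: the stripped integer, e.g. 7572.
    """
    stripped = str(number)
    # Assumption: raw ID = 200000000 + number, as in the 200007572 example.
    raw = str(200000000 + number)
    return [
        prefix + stripped,        # prefix right beside the number
        prefix + " " + stripped,  # prefix, one space, number
        raw,                      # raw ID number
        stripped,                 # stripped version without the 200... prefix
    ]


print(accession_variants("GSE", 7572))
# prints ['GSE7572', 'GSE 7572', '200007572', '7572']
```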

Exclude data creation studies

spot-check to make sure the accession number is in the context of reuse... looks like there may be a few mentions in the context of deposit in which the article is not tagged with pmc_gds[filter] (example: PMCID 2396644)

do this for all the PMC article hits? looks like there are a few missing the filter, and it matters because it would erroneously inflate our reuse estimate

could use query from my BioLink paper:

(geo OR omnibus)
AND microarray
AND "gene expression"
AND accession
NOT (databases
OR user OR users
OR (public AND accessed)
OR (downloaded AND published))

or the simpler:

"gene expression omnibus" AND (submitted OR deposited)

to do this transparently, query PMC results for each of these words:

submitted

deposited

user*

public

accessed

downloaded

published
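A rough sketch of the per-word approach above: it only builds one query string per word and does not contact PMC. The helper name per_word_queries and the way each word is ANDed onto the accession query are assumptions for illustration.

```python
FILTER_WORDS = [
    "submitted", "deposited", "user*", "public",
    "accessed", "downloaded", "published",
]


def per_word_queries(accession_query, words=FILTER_WORDS):
    """Pair the accession query with each filter word, one query per word."""
    return {word: "(" + accession_query + ") AND " + word for word in words}


queries = per_word_queries('"GSE7572" OR "GSE 7572"')
for word, query in queries.items():
    print(word, "->", query)
```

Keeping the hit counts for each word separately would show how much each term contributes to the exclusion, rather than hiding that inside one combined query.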

Estimate time lag for reuse

To estimate time lag:

extract year
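One possible way to turn extracted publication years into a lag distribution, assuming every dataset was submitted in 2007 as stated in the overview. The function name reuse_lags and the example years are made up for illustration.

```python
from collections import Counter

SUBMISSION_YEAR = 2007  # datasets in this plan were submitted in 2007


def reuse_lags(publication_years, submission_year=SUBMISSION_YEAR):
    """Count reuse papers by lag, in whole years, after submission."""
    return Counter(year - submission_year for year in publication_years)


# Hypothetical publication years of four reuse papers.
lags = reuse_lags([2007, 2008, 2008, 2009])
for lag in sorted(lags):
    print(lag, "year(s) after submission:", lags[lag], "paper(s)")
```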

Estimate what percentage of reusers weren't the original authors

see if AND pubmed_gds and NOT pmc_gds have any author overlaps? (note AND should be pubmed!)

other idea: institution comparison using medline info

better than submitter, because submitter not the whole story

better than institution, because institution not precise in submission

Estimate what percent of reuse created "new science"

classify if methods or informatics:

journal name has informatics

mesh term for methods?

look at mesh overlap?

look for metaanalysis mesh term?
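A tentative sketch of the first two ideas above, assuming each paper comes with a journal name and a list of MeSH headings as strings. The function name and the simple substring tests are assumptions, not a validated classifier.

```python
def looks_like_methods_paper(journal_name, mesh_terms):
    """Flag a paper as methods/informatics using two crude heuristics."""
    # Heuristic 1: journal name contains "informatics" (also matches "Bioinformatics").
    in_informatics_journal = "informatics" in journal_name.lower()
    # Heuristic 2: any MeSH heading mentions "methods", e.g. a "/methods" subheading.
    has_methods_mesh = any("methods" in term.lower() for term in mesh_terms)
    return in_informatics_journal or has_methods_mesh


print(looks_like_methods_paper("BMC Bioinformatics", []))
print(looks_like_methods_paper("Cancer Research", ["Gene Expression Profiling/methods"]))
print(looks_like_methods_paper("Cancer Research", ["Neoplasms/genetics"]))
```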

Estimate what percent of these papers depended on the GEO data for their scientific contribution

Any good ideas on how to do this efficiently?

find those which are/are not in informatics journals

that use "methods" MeSH terms

??

Estimate the fraction of all papers that are in PMC

use "gene expression profiling"[mesh] query in PMC vs PubMed over time period in question to get relevant estimate

restrict from 2007 to 2009

result:

number of articles in PMC: 6311,
number of articles in PubMed: 21569,
so PMC contains 29.26% of related papers

so we should multiply our number of scientific papers by about 3.4 (21569/6311) to get an estimate for all of scientific publishing
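The arithmetic behind this scaling, written out as a short check. The two counts are the ones reported above; the variable names and the example count of 100 reuse papers are illustrative only.

```python
pmc_articles = 6311      # "gene expression profiling"[mesh] hits in PMC, 2007 to 2009
pubmed_articles = 21569  # same query in PubMed, 2007 to 2009

fraction_in_pmc = pmc_articles / pubmed_articles
scale_factor = pubmed_articles / pmc_articles

print(round(100 * fraction_in_pmc, 2))  # prints 29.26
print(round(scale_factor, 2))           # prints 3.42

# Hypothetical example: 100 reuse papers found in PMC.
reuse_in_pmc = 100
print(round(reuse_in_pmc * scale_factor))  # prints 342
```

This extrapolation assumes that reuse papers are as likely to be in PMC as other gene expression profiling papers.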

Limitations

Important for argument

This is a conservative estimate because:

our estimates do not consider reuses after our study timeframe

many datasets we are considering will continue to be used in the future... these reuses are obviously not counted in our estimate

Early results

Data collection

NOTE: I'm still getting my git together, so the code at the above links may not be fully standalone or easily run by others. I'm working on it... in the meantime, feel free to email me if you want details!

Extracted this raw data, one row for every (GEO accession number:PMCID of paper that includes the accession number) pair:

is the PMC paper actually about data sharing into GEO rather than data reuse?

is the PMC paper by the same investigators as those who originally created the data?

if reuse, is it in the context of developing a method or tool?

Annotation

Is the PMC paper by the same investigators as those who originally created the data?

first pass: automatically extracted a column that contained the last names at the intersection of the PMC reuse paper and those in the original data-creation paper and those in the GEO submission list

if there was a lot of author overlap, coded it as a "CREATOR REUSE" paper

also automatically extracted the institution of the PMC reuse paper and the original data-creation paper. If there was overlap and some evidence of author overlap, coded it a "CREATOR REUSE" paper

if there was no overlap in author or institution, coded it as NOT a "CREATOR REUSE" paper

for ambiguous cases where there was an author in common between the two papers but it was a common name or the corresponding author addresses were different, I manually examined the PMC reuse paper and the data-creation paper to determine whether the common authors had the same initials and institutions. If yes, I coded it as a "CREATOR REUSE" paper, otherwise I coded it as NOT a "CREATOR REUSE" paper
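The coding rules above could be sketched roughly as follows. This is not the code actually used: the function name, the threshold of two shared last names for "a lot of author overlap", and the exact-match institution comparison are all assumptions, and anything the rules do not clearly settle is routed to manual review.

```python
def code_creator_reuse(reuse_authors, creator_authors,
                       reuse_institution, creator_institution,
                       min_overlap=2):
    """Return a first-pass code for one (reuse paper, data-creation paper) pair."""
    # Last names shared by the two author lists, ignoring case.
    shared = ({name.lower() for name in reuse_authors}
              & {name.lower() for name in creator_authors})
    # Assumption: institutions match only if the cleaned strings are identical.
    same_institution = (reuse_institution.strip().lower()
                        == creator_institution.strip().lower())

    if len(shared) >= min_overlap:   # assumed meaning of "a lot of author overlap"
        return "CREATOR REUSE"
    if shared and same_institution:  # institution overlap plus some author overlap
        return "CREATOR REUSE"
    if not shared and not same_institution:
        return "NOT CREATOR REUSE"
    return "MANUAL REVIEW"           # ambiguous: check initials and institutions by hand


print(code_creator_reuse(["Smith"], ["Smith"], "Univ A", "Univ B"))
# prints MANUAL REVIEW
```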