Mining 50 years of astronomy and astrophysics publications data

A goal of Task 2 of the Gender Gap project is the understanding of publication patterns in diverse research fields and across countries and regions. To realize this objective and to be able to extract discipline-specific conclusions from bibliographic data, it is crucial to have access to high-quality, curated, comprehensive bibliographic collections on the fields of interest.

Without any doubt, the Astrophysics Data System (ADS) is the reference database in astronomy and astrophysics worldwide. Established in the early 1990s, it was funded by NASA and is developed and managed by the Harvard-Smithsonian Center for Astrophysics. What started as a mere listing of papers containing astronomical references is now a powerful online database that indexes peer-reviewed and non-peer-reviewed publications on astronomy and astrophysics, including planetary sciences and solar physics; physics and geophysics; as well as preprints published on the arXiv. Of those three collections, the astronomy one is by far the most advanced and its use accounts for about 85% of the total ADS usage.

The astronomy and astrophysics ADS database contains essentially all relevant publications in those fields, with complete coverage back to Volume 1 of most journals. An overview of the indexed and scanned sources is given here.

We have queried the ADS database and retrieved about 900,000 ADS records from peer-reviewed sources from the astronomy and astrophysics database dating back to 1970. This data set, together with its ongoing updates, will be the basis for our data-backed analysis of publication trends in the field of astronomy and astrophysics, one of the research areas of interest for the Gender Gap in Science project. Here we offer a sneak peek of the data for the interested reader.
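For readers who want to try something similar, here is a minimal sketch of how a query of this kind could be expressed against the public ADS search API. It is an illustration only, not necessarily the procedure we followed. The helper name `build_search_url`, the chosen return fields, the page size and the end year in the usage line are assumptions made for the example. The sketch only builds the request URL; sending it additionally requires a personal API token.

```python
from urllib.parse import urlencode

# Public ADS search endpoint (requests need an "Authorization: Bearer <token>" header).
ADS_SEARCH_URL = "https://api.adsabs.harvard.edu/v1/search/query"


def build_search_url(first_year, last_year, start=0, rows=200):
    """Build a search URL for refereed astronomy records in a year range.

    Hypothetical helper: the field list and page size are illustrative choices.
    """
    params = {
        # Restrict to the astronomy collection, refereed sources, and the year range.
        "q": f"database:astronomy AND property:refereed AND year:{first_year}-{last_year}",
        # Fields to return for each record: identifier, year, authors, affiliations.
        "fl": "bibcode,year,author,aff",
        # Paging: offset of the first record and number of records per request.
        "start": start,
        "rows": rows,
    }
    return f"{ADS_SEARCH_URL}?{urlencode(params)}"


# The end year here is a placeholder chosen for the example.
url = build_search_url(1970, 2018)
print(url)
```

A full retrieval would page through the results by increasing `start` until all matching records have been fetched.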

As expected, the number of records from refereed sources in astronomy and astrophysics indexed by ADS has been growing since the 1970s, although it seems to have stabilized over the last decade. It currently amounts to roughly 25,000 publications per year.

Another easily identifiable trend is the increasing number of authors per paper, as shown below. While in the 1970s about half of all papers were single-authored and over 90% had been written by at most three people, publications with one author currently represent less than 15% of the total, and about half of all papers have four authors or more. Astronomy and astrophysics have followed the same route as other “big science” fields such as high-energy physics, and nowadays articles from collaborations with more than a thousand authors are not unheard of in ADS. This trend poses new challenges for the analysis of scientific output and the meaning of academic credit and contributions.
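The shares quoted above come down to simple counting. The following sketch shows one way to compute, per decade, the fraction of papers with one, two to three, or four and more authors. The records are made-up toy values, not ADS data, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Toy records for illustration only: (publication year, number of authors).
records = [
    (1972, 1), (1975, 1), (1978, 3), (1979, 2),
    (2011, 1), (2013, 4), (2015, 6), (2016, 2), (2018, 1200),
]


def band(n_authors):
    """Map an author count to one of the bands discussed in the text."""
    if n_authors == 1:
        return "1"
    if n_authors <= 3:
        return "2-3"
    return "4+"


def shares_by_decade(records):
    """Return {decade: {band: fraction of that decade's papers}}."""
    counts = defaultdict(Counter)
    for year, n_authors in records:
        decade = year // 10 * 10
        counts[decade][band(n_authors)] += 1
    return {
        decade: {b: c / sum(bands.values()) for b, c in bands.items()}
        for decade, bands in counts.items()
    }


shares = shares_by_decade(records)
for decade in sorted(shares):
    print(decade, shares[decade])
```

With real data, the same grouping could be done per year instead of per decade to follow the trend more closely.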

A fundamental task that we intend to tackle on this data set is the disambiguation of authors based on their names. This is not an easy task, as explained here. The availability of the authors’ affiliations is thus a powerful feature that can be used to increase the confidence in a correct author identification. Additionally, affiliations will help us realize our goal of analyzing geographical regions and countries. Fortunately, the ADS data contains author names plus their affiliations for almost 80% of all authorship instances.
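To see why names alone are not enough, consider the following toy sketch. It reduces each name to a crude "surname, first initial" key and collects the affiliations seen under each key. This is only an illustration of the problem, not our disambiguation method. The names and affiliations are invented, the function names are hypothetical, and the sketch assumes names written as "Surname, Given names".

```python
import unicodedata
from collections import defaultdict


def name_key(name):
    """Reduce 'Surname, Given names' to a lowercase ASCII 'surname, initial' key."""
    # Strip accents by decomposing characters and dropping the combining marks.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    surname, _, given = ascii_name.partition(",")
    initial = given.strip()[:1]
    return f"{surname.strip().lower()}, {initial.lower()}"


def group_authorships(authorships):
    """Group (name, affiliation) pairs by name key, collecting the affiliations."""
    groups = defaultdict(set)
    for name, affiliation in authorships:
        groups[name_key(name)].add(affiliation)
    return dict(groups)


# Invented authorship instances for illustration only.
authorships = [
    ("García, María", "Example Observatory"),
    ("Garcia, M.", "Example Observatory"),
    ("Garcia, Miguel", "Sample University"),
]
groups = group_authorships(authorships)
print(groups)
```

All three instances collapse onto the same key, although they likely belong to two different people. The affiliation is the extra signal that hints at the split: two instances share one institution and the third does not.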

Over the coming months we will delve into the ADS data to extract the most meaningful insights from almost 50 years of astronomy and astrophysics publications. Stay tuned!