Sebastian Baltes

Sampling Papers Using DBLP

In our ESEM 2016 paper, we reported on methodological and ethical issues when sampling software developers.
This time, I want to approach the topic of sampling from a different perspective.
For a research project, we wanted to draw a random sample from all full papers published in selected software engineering journals and conference proceedings within a given time frame.
To be able to draw such a sample of papers, we needed a corresponding sampling frame.
Instead of accessing each conference proceeding and journal issue manually, we decided to use the curated data that DBLP provides.

We first implemented an approach using DBLP’s API, but due to a bug that affected the retrieval for certain conference identifiers (e.g. ICSE 2015), we switched to a solution that uses DBLP’s HTML pages.
We implemented a web scraper that takes a list of venue identifiers as input and then retrieves all papers published in those venues, including the following metadata:

- paper title
- authors
- heading (corresponds to journal issue or conference session name)
- page range
- paper length
- link to electronic edition of paper

The CSV file that the tool creates has the following structure:

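| venue | year | identifier | heading | title | authors | pages | length | electronic_edition |
|-------|------|------------|---------|-------|---------|-------|--------|--------------------|
| ICSE | 2014 | conf/icse/icse2014 | Perspectives on Software Engineering | Cowboys, ankle sprains, and keepers of quality: how is video game development different from software development? | | | | |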
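To make the file structure concrete, here is a minimal sketch of writing and reading such a file with Python's standard `csv` module. The column names come from the structure above; the helper names `write_rows` and `read_rows` are hypothetical and not part of the tool. The columns whose values are not shown in the example row are left empty. Note that the title contains commas, so it has to be quoted, which the `csv` module handles automatically.

```python
import csv
import io

# Column order as shown in the structure above.
COLUMNS = ["venue", "year", "identifier", "heading", "title",
           "authors", "pages", "length", "electronic_edition"]


def write_rows(rows):
    """Serialize a list of dicts to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def read_rows(csv_text):
    """Parse CSV text back into a list of dicts (hypothetical helper)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


example = {
    "venue": "ICSE",
    "year": "2014",
    "identifier": "conf/icse/icse2014",
    "heading": "Perspectives on Software Engineering",
    "title": ("Cowboys, ankle sprains, and keepers of quality: how is video "
              "game development different from software development?"),
    # Values for the remaining columns are not shown above, so they stay empty.
    "authors": "",
    "pages": "",
    "length": "",
    "electronic_edition": "",
}

csv_text = write_rows([example])
rows = read_rows(csv_text)
print(rows[0]["venue"], rows[0]["year"], rows[0]["heading"])
```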

The property heading is particularly relevant, because it allows us to later exclude papers published, for example, in NIER or tool demo tracks.
Moreover, the property paper length, which is derived from the page range, enables us to remove, for example, keynote descriptions or extended abstracts of journal-first papers.
Automatically filtering papers based on their length is difficult, because there exist relatively short journal papers (example), but also relatively long keynote papers (example).
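The two properties can be combined into a pre-filter that only flags candidates for manual inspection instead of removing them. The following is a minimal sketch, not the tool's actual code: the function names, the heading keywords, and the length threshold of eight pages are all assumptions made for illustration. It also assumes that page ranges look like `1-11`, `5`, or `12:1-12:30` (article number followed by pages).

```python
import re

# Optional "article:" prefix, first page, optional (prefixed) last page.
# Both a plain hyphen and an en dash are accepted as range separators.
_PAGE_RANGE = re.compile(r"^(?:(\d+):)?(\d+)(?:[-\u2013](?:(\d+):)?(\d+))?$")


def paper_length(pages):
    """Return the number of pages for a page range, or None if unparseable."""
    match = _PAGE_RANGE.match(pages.strip())
    if match is None:
        return None
    prefix_first, first, prefix_last, last = match.groups()
    if prefix_first and prefix_last and prefix_first != prefix_last:
        return None  # range spans two articles; leave it to a human
    first = int(first)
    last = int(last) if last is not None else first
    if last < first:
        return None
    return last - first + 1


def needs_review(row, heading_keywords=("NIER", "Demo", "Keynote"), min_length=8):
    """Flag a paper for manual inspection (keywords and threshold are assumptions)."""
    heading = row["heading"].lower()
    if any(keyword.lower() in heading for keyword in heading_keywords):
        return True
    length = paper_length(row["pages"])
    return length is None or length < min_length


print(paper_length("1-11"))        # 11
print(paper_length("12:1-12:30"))  # 30
```

Because short journal papers and long keynote papers both exist, the flag is only a hint for the manual pass and should not be used to drop papers automatically.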

After manually removing non-full papers from the list, we have a sampling frame that we can use to draw a stratified sample, for example by randomly selecting five papers for each venue and year.
We tested our tool with ICSE, FSE, TSE, and TOSEM for the time frame 2014 to 2018.
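Drawing the stratified sample from the cleaned list takes only a few lines. The sketch below groups the rows by venue and year and randomly selects up to five papers per group. The function name `stratified_sample` and the synthetic sampling frame are made up for illustration; the `seed` parameter is an addition that makes the draw reproducible.

```python
import random
from collections import defaultdict


def stratified_sample(rows, per_stratum=5, seed=None):
    """Randomly select up to `per_stratum` rows for each (venue, year) pair."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[(row["venue"], row["year"])].append(row)

    sample = []
    for key in sorted(strata):  # fixed order keeps seeded runs reproducible
        papers = strata[key]
        k = min(per_stratum, len(papers))  # small strata are taken completely
        sample.extend(rng.sample(papers, k))
    return sample


# Synthetic sampling frame: 7 placeholder papers per venue and year.
frame = [
    {"venue": venue, "year": str(year), "title": f"{venue} {year} paper {i}"}
    for venue in ("ICSE", "FSE")
    for year in (2014, 2015)
    for i in range(7)
]

sample = stratified_sample(frame, per_stratum=5, seed=42)
print(len(sample))  # 4 strata with 5 papers each
```

Sampling within each stratum, rather than from the whole list at once, ensures that every venue and year is represented equally even if some venues publish far more papers than others.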