A Framework for Collection and Quality Assessment of Social Media Data

The Health Media Collaboratory (HMC) developed this research through a grant funded by the National Cancer Institute (NCI) of the National Institutes of Health (Grant No. U01 CA154254) and the NCI's State and Community Tobacco Control Network. The aim of the Tobacco Control in a Rapidly Changing Media Environment (TCRCM) grant was to report and analyze the amount and variety of tobacco-related information that both smokers and nonsmokers encounter across legacy and social media platforms such as Twitter.

HMC created a framework for social media data collection and quality assessment and proposed a reporting standard that researchers and reviewers can use to evaluate the quality of social media data across studies. The framework consists of three major steps in collecting social media data: development, application, and validation of search filters. This work was published in the Journal of Medical Internet Research (J Med Internet Res. 2016 Feb 26;18(2):e41. doi: 10.2196/jmir.4738) and featured in the CDC Health Communication Science Digest in March 2016.

Social data collection is defined by the keywords and search filters used to retrieve data from social media platforms. As such, search filters are the lens through which we can observe what and how people communicate. This lens should be appropriately focused, so we can identify the content of interest without the noise of unrelated conversation. If a search is too narrow, it may miss important data, and conclusions may be biased. Conversely, if it is too broad, there is a risk of collecting irrelevant and potentially misleading material.

A search filter is a set of keywords integrated with search rules that specify search strategies. The validation of search filters is based on two criteria: retrieval precision and retrieval recall. Retrieval precision measures how much of the retrieved data is relevant, and retrieval recall measures how much of the relevant data on the platform overall is retrieved by the search filters.
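The two criteria can be illustrated with a minimal Python sketch. The keyword pattern, the sample posts, and their relevance labels below are hypothetical and chosen only for illustration; they are not HMC's actual search filter or data. The sketch also assumes that the relevance of every post is known, which is rarely true in practice.

```python
import re

# Hypothetical search filter: a keyword pattern standing in for
# "keywords integrated with search rules". Not HMC's actual filter.
FILTER = re.compile(r"\b(e-?cig(arette)?s?|vap(e|es|ing))\b", re.IGNORECASE)

# Hypothetical posts paired with a "truly relevant" label.
POSTS = [
    ("Trying to quit smoking with an e-cig", True),
    ("New vape flavors are out", True),
    ("Vaping is banned at my office", True),
    ("Switched to electronic cigarettes last year", True),  # relevant, missed
    ("Are ENDS devices safer than cigarettes?", True),      # relevant, missed
    ("My cat Vape knocked over a lamp", False),             # retrieved, irrelevant
    ("Great coffee this morning", False),
]


def precision_recall(posts, search_filter):
    """Return (precision, recall) of search_filter on fully labeled posts."""
    retrieved = [relevant for text, relevant in posts if search_filter.search(text)]
    n_relevant_retrieved = sum(retrieved)
    n_relevant_total = sum(relevant for _, relevant in posts)
    precision = n_relevant_retrieved / len(retrieved)
    recall = n_relevant_retrieved / n_relevant_total
    return precision, recall


precision, recall = precision_recall(POSTS, FILTER)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy data the filter retrieves four posts, three of them relevant, and misses two relevant posts that use other wording, which shows how a filter that looks reasonably precise can still have low recall.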

HMC used about 4 million e-cigarette-related tweets as a working example to demonstrate how to apply the framework for search filter development and how to estimate retrieval precision and recall. In particular, HMC calculated these statistics under two challenging conditions often faced by social media researchers: 1) human labeling is subject to errors, and 2) unretrieved data are not available. The retrieval recall estimate is biased under the former condition and impossible to compute directly under the latter.

Using Bayesian models and a Gibbs sampler, HMC successfully estimated the posterior means of retrieval precision and recall under these challenging conditions.
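As a rough illustration of the general idea, the sketch below runs a small Gibbs sampler for retrieval precision when human labels are error-prone. It is not HMC's model: it is a deliberately simplified version that assumes the labelers' sensitivity and specificity are known, places a uniform Beta(1, 1) prior on precision, and ignores recall entirely. The function name, parameters, and counts are hypothetical.

```python
import random


def gibbs_precision(n_labeled, n_labeled_relevant, sensitivity, specificity,
                    n_iter=1500, burn_in=500, seed=0):
    """Posterior mean of precision under label error (simplified sketch).

    Assumes known labeler sensitivity and specificity and a Beta(1, 1) prior.
    """
    rng = random.Random(seed)
    n, k = n_labeled, n_labeled_relevant
    p = 0.5  # initial value for precision
    draws = []
    for iteration in range(n_iter):
        # Probability a tweet is truly relevant given its (noisy) label.
        q_pos = p * sensitivity / (p * sensitivity + (1 - p) * (1 - specificity))
        q_neg = p * (1 - sensitivity) / (p * (1 - sensitivity) + (1 - p) * specificity)
        # Step 1: sample the latent count of truly relevant tweets.
        true_relevant = sum(rng.random() < q_pos for _ in range(k))
        true_relevant += sum(rng.random() < q_neg for _ in range(n - k))
        # Step 2: sample precision from its Beta full conditional.
        p = rng.betavariate(1 + true_relevant, 1 + n - true_relevant)
        if iteration >= burn_in:
            draws.append(p)
    return sum(draws) / len(draws)


# Hypothetical example: 500 retrieved tweets, 400 labeled relevant.
posterior_mean = gibbs_precision(500, 400, sensitivity=0.90, specificity=0.95)
print(f"posterior mean precision: {posterior_mean:.3f}")
```

With these hypothetical numbers the naive precision from the labels is 0.80, while the correction implied by the assumed error rates is (0.80 - 0.05) / 0.85, about 0.88, so the posterior mean should land near that value. Estimating recall when unretrieved data are unavailable requires additional modeling that this sketch does not attempt.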