Article Structure

Abstract

While user attribute extraction on social media has received considerable attention, existing approaches, mostly supervised, encounter great difficulty in obtaining gold standard data and are therefore limited to predicting unary predicates (e.g., gender).

Introduction

The overwhelming popularity of online social media creates an opportunity to display given aspects of oneself.

Related Work

While user profile inference from social media has received considerable attention (Al Zamal et al., 2012; Rao and Yarowsky, 2010; Rao et al., 2010; Rao et al., 2011), most previous work has treated this as a classification task where the goal is to predict unary predicates describing attributes of the user.

Dataset Creation

We now describe the generation of our distantly supervised training dataset in detail.

Model

We now describe our approach to predicting user profile attributes.

Experiments

In this section, we present our experimental results in detail.

Conclusion and Future Work

In this paper, we propose a framework for user attribute inference on Twitter.

Acknowledgments

A special thanks is owed to Dr. Julian McAuley and Prof. Jure Leskovec from Stanford University for the Google+ circle/network crawler, without which the network analysis would not have been conducted.

Topics

social media

Appears in 12 sentences as: social media (12)

In Weakly Supervised User Profile Extraction from Twitter

While user attribute extraction on social media has received considerable attention, existing approaches, mostly supervised, encounter great difficulty in obtaining gold standard data and are therefore limited to predicting unary predicates (e.g., gender).

Page 1, “Abstract”

Users’ profiles from social media websites such as Facebook or Google Plus are used as a distant source of supervision for extraction of their attributes from user-generated text.

Page 1, “Abstract”

In addition to traditional linguistic features used in distant supervision for information extraction, our approach also takes into account network information, a unique opportunity offered by social media.

Page 1, “Abstract”

The overwhelming popularity of online social media creates an opportunity to display given aspects of oneself.

Page 1, “Introduction”

We are optimistic that our approach can easily be applied to further user attributes such as HOBBIES and INTERESTS (MOVIES, BOOKS, SPORTS or STARS), RELIGION, HOMETOWN, LIVING LOCATION, FAMILY MEMBERS and so on, where training data can be obtained by matching ground truth retrieved from multiple types of online social media such as Facebook, Google Plus, or LinkedIn.

Page 2, “Introduction”

We present a large-scale dataset for this task gathered from various structured and unstructured social media sources.

Page 2, “Introduction”

While user profile inference from social media has received considerable attention (Al Zamal et al., 2012; Rao and Yarowsky, 2010; Rao et al., 2010; Rao et al., 2011), most previous work has treated this as a classification task where the goal is to predict unary predicates describing attributes of the user.

(2001) discovered that people sharing more attributes such as background or hobby have a higher chance of becoming friends in social media.

Page 3, “Related Work”

This property, known as HOMOPHILY (summarized by the proverb “birds of a feather flock together”) (Al Zamal et al., 2012), has been widely applied to community detection (Yang and Leskovec, 2013) and friend recommendation (Guy et al., 2010) on social media.

Page 3, “Related Work”

Spouse. Facebook is the only type of social media where spouse information is commonly displayed.

distant supervision

In addition to traditional linguistic features used in distant supervision for information extraction, our approach also takes into account network information, a unique opportunity offered by social media.

Page 1, “Abstract”

Inspired by the concept of distant supervision, we collect training tweets by matching attribute ground truth from an outside “knowledge base” such as Facebook or Google Plus.

Page 2, “Introduction”

Distant Supervision. Distant supervision, also known as weak supervision, is a method for learning to extract relations from text using ground truth from an existing database as a source of supervision.

Page 2, “Related Work”

Rather than relying on mention-level annotations, which are expensive and time consuming to generate, distant supervision leverages readily available structured data sources as a weak source of supervision for relation extraction from related text corpora (Craven et al., 1999).
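
The labeling idea described above can be illustrated with a minimal sketch. It assumes a simple case-insensitive substring match between candidate entities and tweet text; the function name `distant_label`, the `LabeledTweet` record, and the matching rule are hypothetical choices for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class LabeledTweet:
    user: str
    text: str
    entity: str
    label: int  # 1 = assumed positive, 0 = assumed negative


def distant_label(tweets, profiles, candidates):
    """Label (tweet, entity) pairs using profile values as weak supervision.

    tweets:     list of (user, text) pairs
    profiles:   dict mapping user -> attribute value from the knowledge base
    candidates: iterable of entity strings to look for in the text
    All three inputs are hypothetical structures chosen for this sketch.
    """
    examples = []
    for user, text in tweets:
        truth = profiles.get(user)
        if truth is None:
            continue  # no ground truth for this user, so no supervision
        lowered = text.lower()
        for entity in candidates:
            # Assumption: an entity is "mentioned" if it occurs as a substring.
            if entity.lower() in lowered:
                label = int(entity.lower() == truth.lower())
                examples.append(LabeledTweet(user, text, entity, label))
    return examples
```

A tweet mentioning the value listed in the user's profile becomes a positive example and a tweet mentioning any other candidate becomes a negative one. No mention-level annotation is needed, but the resulting labels are noisy, since mentioning an entity does not guarantee that the tweet expresses the attribute.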

ground truth

Inspired by the concept of distant supervision, we collect training tweets by matching attribute ground truth from an outside “knowledge base” such as Facebook or Google Plus.

Page 2, “Introduction”

We are optimistic that our approach can easily be applied to further user attributes such as HOBBIES and INTERESTS (MOVIES, BOOKS, SPORTS or STARS), RELIGION, HOMETOWN, LIVING LOCATION, FAMILY MEMBERS and so on, where training data can be obtained by matching ground truth retrieved from multiple types of online social media such as Facebook, Google Plus, or LinkedIn.

Page 2, “Introduction”

Distant Supervision. Distant supervision, also known as weak supervision, is a method for learning to extract relations from text using ground truth from an existing database as a source of supervision.

Page 2, “Related Work”

To obtain ground truth for the spouse relation at large scale, we turned to Freebase, a large, open-domain database, and gathered instances of the /PEOPLE/PERSON/SPOUSE relation.

Rather than relying on mention-level annotations, which are expensive and time consuming to generate, distant supervision leverages readily available structured data sources as a weak source of supervision for relation extraction from related text corpora (Craven et al., 1999).

feature space

Appears in 3 sentences as: feature space (3)

In Weakly Supervised User Profile Extraction from Twitter

We evaluate the settings described in Section 4.2, i.e., the GLOBAL setting, where the user-level attribute is predicted directly from the joint feature space, and the LOCAL setting, where the user-level prediction is made based on tweet-level predictions along with the different inference approaches described in Section 4.4, i.e.

Page 7, “Experiments”

This can be explained by the fact that LOCAL(U) sets $z_e = 1$ once one posting $x \in L_e$ is identified as attribute related, while GLOBAL tends to be more meticulous by considering the conjunctive feature space from all postings.
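
The contrast between the two settings can be sketched as follows. This is a hedged illustration only: `local_union` and `global_features` are hypothetical names, and summing per-tweet feature counts is an assumed way of forming a user-level feature vector, not necessarily the construction used in the paper.

```python
def local_union(tweet_predictions):
    """LOCAL(U)-style aggregation: positive if any single tweet is positive."""
    return int(any(tweet_predictions))


def global_features(tweet_feature_dicts):
    """GLOBAL-style input: merge per-tweet features into one user-level vector.

    Assumption: features are merged by summing their counts across tweets.
    """
    combined = {}
    for feats in tweet_feature_dicts:
        for name, value in feats.items():
            combined[name] = combined.get(name, 0) + value
    return combined
```

Under the union rule a single false-positive tweet is enough to flip the user-level decision, whereas a classifier over the merged vector weighs evidence from all postings at once.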

feature vector

Appears in 3 sentences as: feature vector (2) feature vectors (1)

In Weakly Supervised User Profile Extraction from Twitter

In the user attribute extraction literature, researchers have considered neighborhood context to boost inference accuracy (Pennacchiotti and Popescu, 2011; Al Zamal et al., 2012), where information about the degree of their connectivity to their pre-labeled users is included in the feature vectors.

Page 3, “Related Work”

encode a tweet-level feature vector rather than an aggregate one.

Page 5, “Model”

The feature vector $\psi_{\text{text}}(z_{i,e}, x_i)$ encodes the following standard general features:

Then for each user we iteratively reestimate their profile given both their text features and network features (computed based on the current predictions made for their friends) which provide additional evidence.
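
A minimal sketch of this kind of iterative re-estimation is given below. It is an assumption-laden illustration: the function name `iterative_inference`, the linear blend of a text score with the fraction of positively labeled friends, and the `weight`, `threshold`, and `max_iters` parameters are all hypothetical and stand in for whatever classifier and network features the actual model uses.

```python
def iterative_inference(text_scores, friends, weight=0.5, threshold=0.5, max_iters=10):
    """Re-estimate binary user labels from text scores and friends' current labels.

    text_scores: dict user -> score in [0, 1] from a text-only model (assumed given)
    friends:     dict user -> list of friend user ids
    """
    # Start from predictions based on text evidence alone.
    preds = {u: int(s >= threshold) for u, s in text_scores.items()}
    for _ in range(max_iters):
        updated = {}
        for u, s in text_scores.items():
            nbrs = [f for f in friends.get(u, []) if f in preds]
            if nbrs:
                # Network feature: share of friends currently predicted positive.
                network = sum(preds[f] for f in nbrs) / len(nbrs)
                score = (1 - weight) * s + weight * network
            else:
                score = s  # no network evidence available for this user
            updated[u] = int(score >= threshold)
        if updated == preds:
            break  # predictions are stable, so stop iterating
        preds = updated
    return preds
```

For example, a user with a weak text score whose friends are all predicted positive can be flipped to positive after the first pass, which is the sense in which friends' predictions supply additional evidence.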